Quarto notebooks#78
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds/updates a suite of Quarto notebook templates to support the new “patient” workflow, expanding the reporting pipeline with QC, overlap/dynamics, sharing/publicity, and phenotype-focused analyses (bulk + single-cell).
Changes:
- Introduces new/updated Quarto templates for QC, overlap/clonal dynamics, sharing/publicity, GLIPH, and phenotype analyses.
- Adds “Details” report entry points split into
template_details_part1.qmdandtemplate_details_part2.qmdthat include the new sections. - Updates
.gitignoreto allowlist additional notebook templates undernotebooks/.
Reviewed changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| notebooks/template_sharing.qmd | New sharing/publicity analysis plots (patient-level sharing + Pgen correlation). |
| notebooks/template_qc.qmd | New/expanded QC report with many interactive plots (diversity, overlap, rarefaction, PCA, etc.). |
| notebooks/template_pheno_sc_details.qmd | Single-cell phenotype similarity (Morisita) “details” view. |
| notebooks/template_pheno_sc.qmd | Single-cell phenotype composition + phenotype plasticity (UpSet) analyses. |
| notebooks/template_pheno_bulk.qmd | Bulk phenotype inference via TCRpheno and weighted phenotype composition plots. |
| notebooks/template_overlap.qmd | Longitudinal clonal dynamics: Fisher tests, heatmaps, trajectories, and summary tables. |
| notebooks/template_gliph.qmd | GLIPH2 visualization templates (network, motifs, logos). |
| notebooks/template_details_part1.qmd | “Details” report entry point (part 1) including setup + sample notebook include. |
| notebooks/template_details_part2.qmd | “Details” report entry point (part 2) including overlap + sharing includes. |
| .gitignore | Adds allowlist entries so new templates in notebooks/ can be tracked despite notebooks/* ignore. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| For this section to be rightfully executed, you must have had provided cells and their associated phenotypes. | ||
| ::: | ||
|
|
||
| Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment. |
There was a problem hiding this comment.
Typo: “caracterization” should be “characterization”.
| Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment. | |
| Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression characterization, which allows for cell phenotype assignment. |
| Heatmaps display the pairwise similarity between TCR repertoires using the Jaccard (overlap) index. Hierarchical clustering groups samples based on their shared clonal content, facilitating the identification of biological replicates or potential cross-contamination events. The color intensity reflects the degree of similarity, where darker red indicates a higher proportion of shared clonotypes between sample pairs. | ||
|
|
||
| ::: {.callout-tip title="Note"} | ||
| When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA). |
There was a problem hiding this comment.
Typo: “whas” should be “was”.
| When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA). | |
| When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq was done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA). |
|
|
||
| - **Full Timeline:** This table displays the clone's frequency across all sequenced timepoints. | ||
| - **Interpretation:** This view allows you to categorize the clone's overall behavior. For example, you can distinguish between a sustained expansion (frequency rises and stays high) versus a transient expansion (frequency rises and then drops back to zero). | ||
| - Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence. |
There was a problem hiding this comment.
There’s a duplicated fragment/typo in this sentence (“absence.w ... absence.”) which will render oddly. Remove the stray “w” and the duplicated clause so the note reads cleanly.
| - Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence. | |
| - Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence. |
| for pat in unique_patients: | ||
| print(f"## {pat}\n") | ||
|
|
||
| # Filter data for the current patient | ||
| pat_df = plot_df[plot_df[subject_col] == pat] | ||
|
|
||
| # Group by clone (cdr3) and aggregate the unique phenotypes it appears in | ||
| # Note: adjust 'cdr3' if your sequence column is named differently | ||
| # (e.g., 'junction_aa', 'clone_id') | ||
| if 'cdr3' not in pat_df.columns: | ||
| print("Error: 'cdr3' column not found in data. Please check your clonotype column name.") | ||
| continue | ||
|
|
||
| clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x))) |
There was a problem hiding this comment.
This UpSet plot helper checks for a cdr3 column, but earlier in this notebook the clonotype column used is CDR3b. As written, this will print an error for every patient and skip plot generation. Use the correct column name (or make it configurable/detect CDR3b/junction_aa etc.) so the plot actually renders.
| for pat in unique_patients: | |
| print(f"## {pat}\n") | |
| # Filter data for the current patient | |
| pat_df = plot_df[plot_df[subject_col] == pat] | |
| # Group by clone (cdr3) and aggregate the unique phenotypes it appears in | |
| # Note: adjust 'cdr3' if your sequence column is named differently | |
| # (e.g., 'junction_aa', 'clone_id') | |
| if 'cdr3' not in pat_df.columns: | |
| print("Error: 'cdr3' column not found in data. Please check your clonotype column name.") | |
| continue | |
| clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x))) | |
| clonotype_candidates = ['CDR3b', 'cdr3', 'junction_aa', 'clone_id'] | |
| clonotype_col = next((col for col in clonotype_candidates if col in plot_df.columns), None) | |
| if clonotype_col is None: | |
| print( | |
| "Error: no clonotype column found in data. Expected one of: " | |
| f"{', '.join(clonotype_candidates)}." | |
| ) | |
| print(":::\n") | |
| return | |
| for pat in unique_patients: | |
| print(f"## {pat}\n") | |
| # Filter data for the current patient | |
| pat_df = plot_df[plot_df[subject_col] == pat] | |
| # Group by clone and aggregate the unique phenotypes it appears in | |
| if clonotype_col not in pat_df.columns: | |
| print( | |
| f"Error: '{clonotype_col}' column not found in data for patient {pat}. " | |
| "Please check your clonotype column name." | |
| ) | |
| continue | |
| clone_phenotypes = pat_df.groupby(clonotype_col)['phenotype'].apply(lambda x: list(set(x))) |
| Simulates rarefaction by expanding counts and shuffling. | ||
| """ | ||
| # Flat array where each element is a clone ID, repeated by its count | ||
| # Use integer IDs for memory efficiency instead of strings | ||
| counts = sample_df['counts'].values | ||
| n_reads = counts.sum() | ||
|
|
||
| if n_reads == 0: | ||
| return pd.DataFrame() | ||
|
|
||
| # Generate clone IDs (0 to N-1) | ||
| ids = np.arange(len(counts)) | ||
|
|
||
| # e.g., if clone 0 has 2 counts -> [0, 0] | ||
| # WARNING: High memory usage for very deep sequencing | ||
| flat_repertoire = np.repeat(ids, counts) | ||
|
|
||
| # Shuffle to simulate random sampling | ||
| np.random.shuffle(flat_repertoire) | ||
|
|
||
| # Define depths to check (from 0 to total reads) | ||
| depths = np.linspace(0, n_reads, num=steps, dtype=int) | ||
| depths = np.unique(depths)[1:] # Remove 0 and duplicates | ||
|
|
||
| results = [] | ||
| for d in depths: | ||
| subsample = flat_repertoire[:d] | ||
| n_unique = len(np.unique(subsample)) | ||
| results.append({'sample': sample, 'depth': d, 'richness': n_unique}) | ||
|
|
There was a problem hiding this comment.
This rarefaction implementation materializes the full read-level repertoire with np.repeat(ids, counts), which can allocate O(total_reads) memory per sample (often millions+) and can easily OOM / stall notebooks on real cohorts. Consider an algorithm that samples counts without expanding to read-level (e.g., multinomial/hypergeometric-based subsampling per depth, or iterative reservoir-style sampling) and/or cap max depth / number of samples.
| Simulates rarefaction by expanding counts and shuffling. | |
| """ | |
| # Flat array where each element is a clone ID, repeated by its count | |
| # Use integer IDs for memory efficiency instead of strings | |
| counts = sample_df['counts'].values | |
| n_reads = counts.sum() | |
| if n_reads == 0: | |
| return pd.DataFrame() | |
| # Generate clone IDs (0 to N-1) | |
| ids = np.arange(len(counts)) | |
| # e.g., if clone 0 has 2 counts -> [0, 0] | |
| # WARNING: High memory usage for very deep sequencing | |
| flat_repertoire = np.repeat(ids, counts) | |
| # Shuffle to simulate random sampling | |
| np.random.shuffle(flat_repertoire) | |
| # Define depths to check (from 0 to total reads) | |
| depths = np.linspace(0, n_reads, num=steps, dtype=int) | |
| depths = np.unique(depths)[1:] # Remove 0 and duplicates | |
| results = [] | |
| for d in depths: | |
| subsample = flat_repertoire[:d] | |
| n_unique = len(np.unique(subsample)) | |
| results.append({'sample': sample, 'depth': d, 'richness': n_unique}) | |
| Simulates rarefaction by sampling directly from clone counts without | |
| materializing a read-level array. | |
| """ | |
| counts = sample_df['counts'].to_numpy(dtype=np.int64, copy=False) | |
| n_reads = int(counts.sum()) | |
| if n_reads == 0: | |
| return pd.DataFrame() | |
| # Define depths to check (from 0 to total reads) | |
| depths = np.linspace(0, n_reads, num=steps, dtype=int) | |
| depths = np.unique(depths)[1:] # Remove 0 and duplicates | |
| # Track remaining reads per clone and whether each clone has been seen | |
| remaining = counts.copy() | |
| seen = np.zeros(len(counts), dtype=bool) | |
| richness = 0 | |
| prev_depth = 0 | |
| results = [] | |
| for d in depths: | |
| draw_size = int(d - prev_depth) | |
| if draw_size > 0: | |
| total_remaining = int(remaining.sum()) | |
| draws_left = draw_size | |
| for i in range(len(remaining)): | |
| if draws_left == 0 or total_remaining == 0: | |
| break | |
| clone_remaining = int(remaining[i]) | |
| if clone_remaining == 0: | |
| continue | |
| if i == len(remaining) - 1: | |
| drawn = draws_left | |
| else: | |
| drawn = int( | |
| np.random.hypergeometric( | |
| clone_remaining, | |
| total_remaining - clone_remaining, | |
| draws_left | |
| ) | |
| ) | |
| if drawn > 0: | |
| remaining[i] -= drawn | |
| if not seen[i]: | |
| seen[i] = True | |
| richness += 1 | |
| draws_left -= drawn | |
| total_remaining -= clone_remaining | |
| results.append({'sample': sample, 'depth': int(d), 'richness': richness}) | |
| prev_depth = int(d) |
| **Figure 3: Longitudinal tracking of top contracted TCR clonotypes.** The temporal dynamics of the 15 most significantly downregulated clonotypes, ranked by lowest FDR-adjusted p-value ($q$-value). Blue lines trace the $log_{10}$ frequency of individual clones that exhibited significant contraction ($FC < 0.5, q < 0.05$) across the sampled timepoints. | ||
|
|
||
|
|
||
| ### Clonal Expansion/Contraction Paiwise Comparisons |
There was a problem hiding this comment.
Typo in header: “Paiwise” should be “Pairwise”.
| ### Clonal Expansion/Contraction Paiwise Comparisons | |
| ### Clonal Expansion/Contraction Pairwise Comparisons |
| if 'timepoint_order' in locals(): | ||
| cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order) | ||
| else: | ||
| ordered_timepoints = sorted(cdr3_df[timepoint_col].unique()) | ||
| timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)} | ||
| cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback) | ||
|
|
There was a problem hiding this comment.
Timepoint ordering here relies on a timepoint_order mapping that is never defined in this notebook, so it always falls back to sorted(timepoints). That can mis-order timepoints like T1, T2, T10 and make longitudinal plots misleading. Build the ordering map from timepoint_order_col (e.g., from meta or clonotypes_df) and use it consistently.
| if 'timepoint_order' in locals(): | |
| cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order) | |
| else: | |
| ordered_timepoints = sorted(cdr3_df[timepoint_col].unique()) | |
| timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)} | |
| cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback) | |
| timepoint_order = None | |
| timepoint_order_col_name = locals().get('timepoint_order_col') | |
| if timepoint_order_col_name is not None: | |
| order_source = None | |
| if 'meta' in locals() and timepoint_col in meta.columns and timepoint_order_col_name in meta.columns: | |
| order_source = meta[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col]) | |
| elif timepoint_col in clonotypes_df.columns and timepoint_order_col_name in clonotypes_df.columns: | |
| order_source = clonotypes_df[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col]) | |
| if order_source is not None and not order_source.empty: | |
| order_source = order_source.sort_values(timepoint_order_col_name) | |
| ordered_timepoints = order_source[timepoint_col].tolist() | |
| timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)} | |
| if timepoint_order is None: | |
| ordered_timepoints = sorted(cdr3_df[timepoint_col].dropna().unique()) | |
| timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)} | |
| cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order) |
| import json | ||
| from IPython.display import HTML, display | ||
| import warnings | ||
| import matplotlib.pyplot as plt |
There was a problem hiding this comment.
matplotlib.pyplot is imported twice in this block. Drop the duplicate import to reduce noise and keep imports consistent.
| import matplotlib.pyplot as plt |
|
|
||
| This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?** | ||
|
|
||
| This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category. |
There was a problem hiding this comment.
Typo in narrative text: “axplanation” should be “explanation”.
| This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category. | |
| This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category. |
|
|
||
| **<u>Downstream biological analyses</u>** | ||
| - **Single-cell**: | ||
| It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity. |
There was a problem hiding this comment.
Typo: “possile” should be “possible”.
| It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity. | |
| It is possible to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity. |
Adapt notebooks to new "patient" workflow.
Phenotypic plasticity still to be fixed when input data is single-cell