Quarto notebooks by KevinMLanderos · Pull Request #78 · KarchinLab/TCRtoolkit

KevinMLanderos · 2026-04-22T14:59:52Z

Adapt notebooks to new "patient" workflow.
Phenotypic plasticity still to be fixed when input data is single-cell

Copilot

Pull request overview

This PR adds/updates a suite of Quarto notebook templates to support the new “patient” workflow, expanding the reporting pipeline with QC, overlap/dynamics, sharing/publicity, and phenotype-focused analyses (bulk + single-cell).

Changes:

Introduces new/updated Quarto templates for QC, overlap/clonal dynamics, sharing/publicity, GLIPH, and phenotype analyses.
Adds “Details” report entry points split into template_details_part1.qmd and template_details_part2.qmd that include the new sections.
Updates .gitignore to allowlist additional notebook templates under notebooks/.

Reviewed changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
notebooks/template_sharing.qmd	New sharing/publicity analysis plots (patient-level sharing + Pgen correlation).
notebooks/template_qc.qmd	New/expanded QC report with many interactive plots (diversity, overlap, rarefaction, PCA, etc.).
notebooks/template_pheno_sc_details.qmd	Single-cell phenotype similarity (Morisita) “details” view.
notebooks/template_pheno_sc.qmd	Single-cell phenotype composition + phenotype plasticity (UpSet) analyses.
notebooks/template_pheno_bulk.qmd	Bulk phenotype inference via TCRpheno and weighted phenotype composition plots.
notebooks/template_overlap.qmd	Longitudinal clonal dynamics: Fisher tests, heatmaps, trajectories, and summary tables.
notebooks/template_gliph.qmd	GLIPH2 visualization templates (network, motifs, logos).
notebooks/template_details_part1.qmd	“Details” report entry point (part 1) including setup + sample notebook include.
notebooks/template_details_part2.qmd	“Details” report entry point (part 2) including overlap + sharing includes.
.gitignore	Adds allowlist entries so new templates in `notebooks/` can be tracked despite `notebooks/*` ignore.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-22T18:25:25Z

+For this section to be rightfully executed, you must have had provided cells and their associated phenotypes.
+:::
+
+Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment.


Typo: “caracterization” should be “characterization”.

Suggested change

Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment.

Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression characterization, which allows for cell phenotype assignment.

Copilot · 2026-04-22T18:25:26Z

+Heatmaps display the pairwise similarity between TCR repertoires using the Jaccard (overlap) index. Hierarchical clustering groups samples based on their shared clonal content, facilitating the identification of biological replicates or potential cross-contamination events. The color intensity reflects the degree of similarity, where darker red indicates a higher proportion of shared clonotypes between sample pairs.
+
+::: {.callout-tip title="Note"}
+When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**.  Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).


Typo: “whas” should be “was”.

Suggested change

When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).

When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq was done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).

Copilot · 2026-04-22T18:25:26Z

+
+- **Full Timeline:** This table displays the clone's frequency across all sequenced timepoints.
+- **Interpretation:** This view allows you to categorize the clone's overall behavior. For example, you can distinguish between a sustained expansion (frequency rises and stays high) versus a transient expansion (frequency rises and then drops back to zero).
+    - Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence.


There’s a duplicated fragment/typo in this sentence (“absence.w ... absence.”) which will render oddly. Remove the stray “w” and the duplicated clause so the note reads cleanly.

Suggested change

- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence.

- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.

Copilot · 2026-04-22T18:25:26Z

+    for pat in unique_patients:
+        print(f"## {pat}\n")
+
+        # Filter data for the current patient
+        pat_df = plot_df[plot_df[subject_col] == pat]
+
+        # Group by clone (cdr3) and aggregate the unique phenotypes it appears in
+        # Note: adjust 'cdr3' if your sequence column is named differently 
+        # (e.g., 'junction_aa', 'clone_id')
+        if 'cdr3' not in pat_df.columns:
+            print("Error: 'cdr3' column not found in data. Please check your clonotype column name.")
+            continue
+
+        clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x)))


This UpSet plot helper checks for a cdr3 column, but earlier in this notebook the clonotype column used is CDR3b. As written, this will print an error for every patient and skip plot generation. Use the correct column name (or make it configurable/detect CDR3b/junction_aa etc.) so the plot actually renders.

Suggested change

for pat in unique_patients:

print(f"## {pat}\n")

# Filter data for the current patient

pat_df = plot_df[plot_df[subject_col] == pat]

# Group by clone (cdr3) and aggregate the unique phenotypes it appears in

# Note: adjust 'cdr3' if your sequence column is named differently

# (e.g., 'junction_aa', 'clone_id')

if 'cdr3' not in pat_df.columns:

print("Error: 'cdr3' column not found in data. Please check your clonotype column name.")

continue

clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x)))

clonotype_candidates = ['CDR3b', 'cdr3', 'junction_aa', 'clone_id']

clonotype_col = next((col for col in clonotype_candidates if col in plot_df.columns), None)

if clonotype_col is None:

print(

"Error: no clonotype column found in data. Expected one of: "

f"{', '.join(clonotype_candidates)}."

)

print(":::\n")

return

for pat in unique_patients:

print(f"## {pat}\n")

# Filter data for the current patient

pat_df = plot_df[plot_df[subject_col] == pat]

# Group by clone and aggregate the unique phenotypes it appears in

if clonotype_col not in pat_df.columns:

print(

f"Error: '{clonotype_col}' column not found in data for patient {pat}. "

"Please check your clonotype column name."

)

continue

clone_phenotypes = pat_df.groupby(clonotype_col)['phenotype'].apply(lambda x: list(set(x)))

Copilot · 2026-04-22T18:25:26Z

+    Simulates rarefaction by expanding counts and shuffling.
+    """
+    # Flat array where each element is a clone ID, repeated by its count
+    # Use integer IDs for memory efficiency instead of strings
+    counts = sample_df['counts'].values
+    n_reads = counts.sum()
+
+    if n_reads == 0:
+        return pd.DataFrame()
+
+    # Generate clone IDs (0 to N-1)
+    ids = np.arange(len(counts))
+
+    # e.g., if clone 0 has 2 counts -> [0, 0]
+    # WARNING: High memory usage for very deep sequencing
+    flat_repertoire = np.repeat(ids, counts)
+
+    # Shuffle to simulate random sampling
+    np.random.shuffle(flat_repertoire)
+
+    # Define depths to check (from 0 to total reads)
+    depths = np.linspace(0, n_reads, num=steps, dtype=int)
+    depths = np.unique(depths)[1:] # Remove 0 and duplicates
+
+    results = []
+    for d in depths:
+        subsample = flat_repertoire[:d]
+        n_unique = len(np.unique(subsample))
+        results.append({'sample': sample, 'depth': d, 'richness': n_unique})
+


This rarefaction implementation materializes the full read-level repertoire with np.repeat(ids, counts), which can allocate O(total_reads) memory per sample (often millions+) and can easily OOM / stall notebooks on real cohorts. Consider an algorithm that samples counts without expanding to read-level (e.g., multinomial/hypergeometric-based subsampling per depth, or iterative reservoir-style sampling) and/or cap max depth / number of samples.

Suggested change

Simulates rarefaction by expanding counts and shuffling.

"""

# Flat array where each element is a clone ID, repeated by its count

# Use integer IDs for memory efficiency instead of strings

counts = sample_df['counts'].values

n_reads = counts.sum()

if n_reads == 0:

return pd.DataFrame()

# Generate clone IDs (0 to N-1)

ids = np.arange(len(counts))

# e.g., if clone 0 has 2 counts -> [0, 0]

# WARNING: High memory usage for very deep sequencing

flat_repertoire = np.repeat(ids, counts)

# Shuffle to simulate random sampling

np.random.shuffle(flat_repertoire)

# Define depths to check (from 0 to total reads)

depths = np.linspace(0, n_reads, num=steps, dtype=int)

depths = np.unique(depths)[1:] # Remove 0 and duplicates

results = []

for d in depths:

subsample = flat_repertoire[:d]

n_unique = len(np.unique(subsample))

results.append({'sample': sample, 'depth': d, 'richness': n_unique})

Simulates rarefaction by sampling directly from clone counts without

materializing a read-level array.

"""

counts = sample_df['counts'].to_numpy(dtype=np.int64, copy=False)

n_reads = int(counts.sum())

if n_reads == 0:

return pd.DataFrame()

# Define depths to check (from 0 to total reads)

depths = np.linspace(0, n_reads, num=steps, dtype=int)

depths = np.unique(depths)[1:] # Remove 0 and duplicates

# Track remaining reads per clone and whether each clone has been seen

remaining = counts.copy()

seen = np.zeros(len(counts), dtype=bool)

richness = 0

prev_depth = 0

results = []

for d in depths:

draw_size = int(d - prev_depth)

if draw_size > 0:

total_remaining = int(remaining.sum())

draws_left = draw_size

for i in range(len(remaining)):

if draws_left == 0 or total_remaining == 0:

break

clone_remaining = int(remaining[i])

if clone_remaining == 0:

continue

if i == len(remaining) - 1:

drawn = draws_left

else:

drawn = int(

np.random.hypergeometric(

clone_remaining,

total_remaining - clone_remaining,

draws_left

)

)

if drawn > 0:

remaining[i] -= drawn

if not seen[i]:

seen[i] = True

richness += 1

draws_left -= drawn

total_remaining -= clone_remaining

results.append({'sample': sample, 'depth': int(d), 'richness': richness})

prev_depth = int(d)

Copilot · 2026-04-22T18:25:28Z

+**Figure 3: Longitudinal tracking of top contracted TCR clonotypes.** The temporal dynamics of the 15 most significantly downregulated clonotypes, ranked by lowest FDR-adjusted p-value ($q$-value). Blue lines trace the $log_{10}$ frequency of individual clones that exhibited significant contraction ($FC < 0.5, q < 0.05$) across the sampled timepoints.
+
+
+### Clonal Expansion/Contraction Paiwise Comparisons    


Typo in header: “Paiwise” should be “Pairwise”.

Suggested change

### Clonal Expansion/Contraction Paiwise Comparisons

### Clonal Expansion/Contraction Pairwise Comparisons

Copilot · 2026-04-22T18:25:28Z

+if 'timepoint_order' in locals():
+    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
+else:
+    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
+    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
+    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
+


Timepoint ordering here relies on a timepoint_order mapping that is never defined in this notebook, so it always falls back to sorted(timepoints). That can mis-order timepoints like T1, T2, T10 and make longitudinal plots misleading. Build the ordering map from timepoint_order_col (e.g., from meta or clonotypes_df) and use it consistently.

Suggested change

if 'timepoint_order' in locals():

cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)

else:

ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())

timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}

cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)

timepoint_order = None

timepoint_order_col_name = locals().get('timepoint_order_col')

if timepoint_order_col_name is not None:

order_source = None

if 'meta' in locals() and timepoint_col in meta.columns and timepoint_order_col_name in meta.columns:

order_source = meta[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])

elif timepoint_col in clonotypes_df.columns and timepoint_order_col_name in clonotypes_df.columns:

order_source = clonotypes_df[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])

if order_source is not None and not order_source.empty:

order_source = order_source.sort_values(timepoint_order_col_name)

ordered_timepoints = order_source[timepoint_col].tolist()

timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}

if timepoint_order is None:

ordered_timepoints = sorted(cdr3_df[timepoint_col].dropna().unique())

timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}

cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)

Copilot · 2026-04-22T18:25:29Z

+import json
+from IPython.display import HTML, display
+import warnings
+import matplotlib.pyplot as plt


matplotlib.pyplot is imported twice in this block. Drop the duplicate import to reduce noise and keep imports consistent.

Suggested change

import matplotlib.pyplot as plt

Copilot · 2026-04-22T18:25:29Z

+
+This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?**
+
+This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.


Typo in narrative text: “axplanation” should be “explanation”.

Suggested change

This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

Copilot · 2026-04-22T18:25:29Z

+
+**<u>Downstream biological analyses</u>**  
+- **Single-cell**:  
+  It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.  


Typo: “possile” should be “possible”.

Suggested change

It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.

It is possible to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.

KevinMLanderos added 4 commits March 23, 2026 10:02

Add notebook visualizations

3db9250

Notebook updates

95bd761

Changes were done in the main branch to Add the PATIENT workflow.

01e0243

Adapt notebooks to patient workflow

8ba95bd

KevinMLanderos merged commit e694ce2 into main Apr 22, 2026
1 check passed

dimalvovs requested a review from Copilot April 22, 2026 18:19

Copilot started reviewing on behalf of dimalvovs April 22, 2026 18:19 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

This was referenced Apr 26, 2026

Revert "Quarto notebooks" #79

Closed

Revert "Quarto notebooks" #80

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quarto notebooks#78

Quarto notebooks#78
KevinMLanderos merged 4 commits into
mainfrom
quarto-notebooks

KevinMLanderos commented Apr 22, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment.
	Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression characterization, which allows for cell phenotype assignment.

	- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence.
	- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.

		Figure 3: Longitudinal tracking of top contracted TCR clonotypes. The temporal dynamics of the 15 most significantly downregulated clonotypes, ranked by lowest FDR-adjusted p-value ($q$-value). Blue lines trace the $log_{10}$ frequency of individual clones that exhibited significant contraction ($FC < 0.5, q < 0.05$) across the sampled timepoints.


		### Clonal Expansion/Contraction Paiwise Comparisons

	### Clonal Expansion/Contraction Paiwise Comparisons
	### Clonal Expansion/Contraction Pairwise Comparisons

-if 'timepoint_order' in locals():
-    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
-else:
-    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
-    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
-    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
+timepoint_order = None
+timepoint_order_col_name = locals().get('timepoint_order_col')
+if timepoint_order_col_name is not None:
+    order_source = None
+    if 'meta' in locals() and timepoint_col in meta.columns and timepoint_order_col_name in meta.columns:
+        order_source = meta[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])
+    elif timepoint_col in clonotypes_df.columns and timepoint_order_col_name in clonotypes_df.columns:
+        order_source = clonotypes_df[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])
+    if order_source is not None and not order_source.empty:
+        order_source = order_source.sort_values(timepoint_order_col_name)
+        ordered_timepoints = order_source[timepoint_col].tolist()
+        timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}
+if timepoint_order is None:
+    ordered_timepoints = sorted(cdr3_df[timepoint_col].dropna().unique())
+    timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}
+cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)


		This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?

		This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation). The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

	It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.
	It is possible to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.

Conversation

KevinMLanderos commented Apr 22, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants