A reproducible, Papermill-parameterised pipeline for tumor-infiltrating lymphocyte (TIL) analysis with a primary focus on cell type annotation: QC → reference projection → clonotype × exhaustion integration.
This repository now also includes a lightweight Model Context Protocol (MCP) layer that wraps the existing projection workflow as callable tools for agent clients such as Claude Code. The notebooks and pipeline scripts remain the source of truth for computation; the MCP layer only orchestrates execution, discovers compatible inputs, and summarizes outputs.
Accurate identification of exhausted CD8+ T cells from TIL single-cell data is a critical bottleneck in TCR-based cell therapy development; this pipeline provides a reproducible, benchmarked framework for reference-based annotation with direct application to clinical program prioritization.
NB04 and NB05 are currently under development as placeholder downstream prediction modules (TCR reactivity selection and PPV validation) and are not the main focus of this repository. Labelled reactivity training data from wet-lab validation experiments is required to complete these modules.
Two projection methods run in parallel and can be compared head-to-head:
| Method | Language | Script |
|---|---|---|
| CCA — Seurat Canonical Correlation Analysis | R | run_cca_pipeline.sh |
| scVI/scANVI — variational autoencoder | Python | run_scvi_pipeline.sh |
This demo is designed around leave-one-out (LOO) cross-validation, where one patient is held out as query and the remaining patients form the reference for direct method benchmarking.
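The fold structure is straightforward to sketch. A minimal illustration in Python, using a small illustrative subset of patient IDs (the real cohort comes from the GSE123813 metadata):

```python
def loo_folds(patients):
    """Yield (query_patient, reference_patients) pairs, one fold per patient."""
    for held_out in patients:
        yield held_out, [p for p in patients if p != held_out]

# Illustrative subset of the 11 Yost et al. patients.
patients = ["su001", "su002", "su003"]
folds = list(loo_folds(patients))
```

Each patient appears exactly once as the held-out query, so per-patient metrics can be aggregated into the mean ± SD figures reported in the benchmark table below.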
The MCP wrapper is intended for cases where an agent is given a new query dataset and needs to:
- detect that this repository can project query cells into the TIL reference space
- run either the scVI/scANVI or CCA projection path with repo-managed reference artifacts
- summarize projected labels and optional clonotype outputs in a structured response
The MCP layer does not replace the existing notebooks. Instead, it calls them through subprocess and papermill so the computational workflow stays aligned with the original analysis code.
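As a rough illustration of that orchestration pattern, the wrapper can assemble a papermill CLI invocation (`papermill IN OUT -p key value ...`) and hand it to subprocess. This is a sketch only; the notebook path, output path, and parameter name below are illustrative, not the server's actual arguments:

```python
import shlex

def papermill_command(notebook, output, params):
    """Assemble a papermill CLI call as an argument list for subprocess.run."""
    cmd = ["papermill", notebook, output]
    for key, value in params.items():
        cmd += ["-p", key, str(value)]
    return cmd

cmd = papermill_command(
    "notebooks/scvi/02_project_query.ipynb",   # illustrative notebook path
    "outputs/02_project_query.su001.ipynb",    # illustrative output path
    {"query_patient": "su001"},                # illustrative parameter name
)
print(shlex.join(cmd))
```

Because the notebook itself stays untouched, the parameterised run is reproducible and inspectable after the fact, exactly like a manual papermill execution.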
Key MCP files:
- mcp_server/server.py — MCP server entrypoint
- mcp_server/tools/project_query.py — projection wrapper for scVI and CCA
- mcp_server/tools/get_summary.py — output summarization helper
- mcp_server/tools/compare_methods.py — dual-method execution and recommendation helper
- mcp_server/requirements.txt — MCP-layer Python dependencies
- README_MCP.md — detailed setup and client wiring instructions
- CLAUDE.md — contributor notes for MCP usage in this repo
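Conceptually, each file under mcp_server/tools/ exposes one named tool to the agent. A minimal sketch of that registration-and-dispatch pattern, with a stub body (the real tool shells out to the pipeline scripts; the return fields here are illustrative):

```python
TOOLS = {}

def tool(fn):
    """Register a callable under its function name, mimicking how an
    MCP server exposes named tools to a client."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def project_query(method: str, query_path: str) -> dict:
    # Stub: the real tool runs run_scvi_pipeline.sh or run_cca_pipeline.sh.
    if method not in {"scvi", "cca"}:
        raise ValueError(f"unknown method: {method}")
    return {"method": method, "query": query_path, "status": "accepted"}

result = TOOLS["project_query"]("scvi", "data/query.h5ad")
```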
This is a full analysis pipeline, not a quick toy demo. A complete run (CCA + scVI + LOO benchmarking) typically takes multiple hours (around 4-5 hours on Apple Silicon for all patients). The core demonstration objective is LOO benchmarking of CCA vs scVI/scANVI with one patient left out per fold.
Start from a clean machine with the following sequence:
# 1) Clone
git clone https://github.com/zhuy16/scRNA-seq_reference-projection.git
cd scRNA-seq_reference-projection
# 2) Create Python/R runtime environment
conda env create -f environment.yml
conda activate scrnaseq
# 3) Install pinned R packages (first run only)
Rscript setup_r_env.R
# 4) Run pipelines
conda run -n scrnaseq bash run_cca_pipeline.sh
conda run -n scrnaseq bash run_scvi_pipeline.sh
# 5) Optional: run leave-one-out benchmark (~4-5 h)
conda run -n scrnaseq bash run_loo_benchmark.sh
For fastest inspection without running everything, start with the precomputed examples:
- notebooks/executed_example_notebooks/benchmark_celltype.ipynb
- notebooks/executed_example_notebooks/loo_su001_scvi/02_project_query.ipynb
For detailed notebook-by-notebook execution and parameter guidance, see docs/pipeline-reference.md.
Yost et al. 2019 BCC (GSE123813, GEO open access) — paired scRNA-seq + TCR-seq from 11 basal cell carcinoma patients (pre/post anti-PD1). Used as a fully reproducible stand-in for clinical TIL data; replace GEO paths in config/params.yaml to run on in-house data.
Yost KE et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat Med. 2019. https://doi.org/10.1038/s41591-019-0522-3
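Pointing the pipeline at in-house data amounts to overriding the path keys in config/params.yaml. A minimal sketch of that shallow-merge idea, assuming the YAML has been loaded into a dict (the keys shown are illustrative; the real schema lives in config/params.yaml):

```python
# Illustrative demo defaults mirroring the GEO-based configuration.
DEMO_PARAMS = {
    "raw_dir": "data/raw_downloads",
    "accession": "GSE123813",
}

def with_overrides(defaults, overrides):
    """Shallow-merge in-house overrides over the demo defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

params = with_overrides(DEMO_PARAMS, {"raw_dir": "/lab/data/tils"})
```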
In TCR-T and related adoptive cell therapy programs, selecting the right T cell clones is limited by how accurately we can characterize exhausted and antigen-experienced CD8+ states in tumor samples. This repository focuses on that upstream bottleneck: robust, reproducible annotation and benchmarked projection across patients, so downstream wet-lab prioritization starts from higher-confidence cell-state calls.
flowchart TD
A[Raw GEO data] --> CCA0[CCA NB00 data acquisition]
A --> SCVI0[scVI NB00 convert]
CCA0 --> CCA1[CCA NB01 preprocessing]
CCA1 --> CCA2[CCA NB02 reference projection]
CCA2 --> CCA3[CCA NB03 clonotype x exhaustion]
SCVI0 --> SCVI1[scVI NB01 train SCANVI]
SCVI1 --> SCVI2[scVI NB02 project query]
SCVI2 --> SCVI3[scVI NB03 clonotype x exhaustion]
CCA3 --> NB04[NB04 TCR selection placeholder]
SCVI3 --> NB04
NB04 --> NB05[NB05 PPV validation placeholder]
NB03 has parallel R (Seurat AddModuleScore) and Python (scanpy sc.tl.score_genes) implementations — both produce identical output schemas. NB04–05 are placeholder implementations; completion requires labelled reactivity data from wet-lab validation experiments. See docs/pipeline-reference.md for per-step details.
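The shared idea behind both scoring implementations is a panel mean minus a control mean. A toy sketch (both AddModuleScore and sc.tl.score_genes additionally sample expression-matched control bins; a single flat control set is used here for brevity, and the expression values are made up):

```python
from statistics import mean

def module_score(expr, panel_genes, control_genes):
    """Mean panel expression minus mean control expression."""
    return mean(expr[g] for g in panel_genes) - mean(expr[g] for g in control_genes)

# Illustrative normalised expression for one cell; PDCD1/LAG3/HAVCR2 are
# canonical exhaustion markers, ACTB/GAPDH stand in for control genes.
cell = {"PDCD1": 2.1, "LAG3": 1.8, "HAVCR2": 1.5, "ACTB": 1.0, "GAPDH": 1.2}
score = module_score(cell, ["PDCD1", "LAG3", "HAVCR2"], ["ACTB", "GAPDH"])
```

Because both implementations reduce to this per-cell scalar, their outputs can share one schema and be compared directly downstream.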
├── run_cca_pipeline.sh / run_scvi_pipeline.sh / run_loo_benchmark.sh
├── Makefile # make setup / run-cca / run-scvi / run-loo / …
├── Dockerfile # self-contained alternative to conda (→ docs/docker.md)
├── mcp_server/ # MCP wrapper layer for agent-discoverable tools
│ ├── server.py
│ ├── requirements.txt
│ └── tools/
│ ├── project_query.py
│ ├── get_summary.py
│ └── compare_methods.py
├── README_MCP.md # MCP-specific setup and usage guide
├── CLAUDE.md # MCP contributor notes and repo conventions
├── config/params.yaml # single source of truth for all paths + QC params
│
├── notebooks/
│ ├── cca/ # NB00–03 in R/Seurat
│ ├── scvi/ # NB00–03 in Python/scvi-tools
│ ├── benchmarking/ # benchmark notebooks + LOO summary outputs
│ ├── 04_tcr_reactivity_selection.ipynb
│ ├── 05_ppv_validation.ipynb
│ └── executed_example_notebooks/ # selected executed notebooks with outputs (see below)
│
├── data/exhaustion_gene_panel.txt # 18-gene panel (→ docs/exhaustion-gene-panel.md)
├── environment.yml / renv.lock # exact Python + R environments
└── docs/ # detailed references (see below)
Leave-one-out benchmark across all 11 patients: each patient is held out in turn as the query, with the remaining 10 forming the reference. Full results: notebooks/benchmarking/loo_summary.csv; plots: notebooks/benchmarking/loo_*.png.
| Metric | CCA | scANVI |
|---|---|---|
| Overall accuracy | 86.2 ± 4.8% | 85.9 ± 5.0% |
| Macro F1 | 73.7 ± 7.5% | 73.7 ± 4.3% |
| CD8_ex recall | 56.3 ± 42.9% | 88.3 ± 13.5% |
| CD8_ex F1 | 51.2 ± 42.5% | 68.3 ± 20.9% |
Overall accuracy is comparable between the two methods. The key difference is exhausted CD8 recovery: scANVI achieves higher recall (88% vs 56%) and F1 (68% vs 51%) with much lower patient-to-patient variance on CD8_ex, the critical subtype for TIL selection. CCA remains a fast, GPU-free baseline suited to development and interpretability.
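For readers cross-checking the table, recall and F1 reduce to simple ratios over per-class confusion counts. A sketch with illustrative counts for a single patient's CD8_ex class (not taken from loo_summary.csv):

```python
def recall(tp, fn):
    """Fraction of true-positive cells recovered."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    r = recall(tp, fn)
    return 2 * precision * r / (precision + r)

# Illustrative counts: 100 true CD8_ex cells, 88 recovered, 40 false calls.
cd8ex_recall = recall(88, 12)
cd8ex_f1 = f1(88, 40, 12)
```

Macro F1 in the table is the unweighted mean of this per-class F1 across all cell types, which is why it penalises poor minority-class recovery more than overall accuracy does.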
LOO metric distribution across patients for CCA vs scANVI.
Selected executed notebooks with full outputs are committed to illustrate code style, logic flow, and visualisations without running the full pipeline:
| Notebook | What to look for |
|---|---|
| explore_yost2019_bcc | Dataset structure, cell-type composition, TCR overlap overview |
| benchmark_celltype | CCA vs scANVI accuracy, per-patient LOO results, confusion matrices |
| loo_su001_cca/02_reference_projection | LOO reference projection behavior with held-out patient |
| loo_su001_scvi/02_project_query | LOO scANVI inference on held-out patient |
# 1. Create environment
conda env create -f environment.yml && conda activate scrnaseq
Rscript setup_r_env.R # first run only — installs R packages via renv
# 2. Download GEO data (GSE123813) into data/raw_downloads/
# 3. Run CCA pipeline (R)
conda run -n scrnaseq bash run_cca_pipeline.sh
# 4. Run scVI pipeline (Python) — requires CCA NB00 outputs
conda run -n scrnaseq bash run_scvi_pipeline.sh
# 5. LOO benchmark across all 11 patients (~4-5 h)
conda run -n scrnaseq bash run_loo_benchmark.sh
All steps have make shortcuts (make setup, make run-cca, make run-scvi, make run-loo). For Docker, see docs/docker.md.
To run the MCP server instead of the full benchmark workflow, see README_MCP.md.
| File | Contents |
|---|---|
| docs/pipeline-reference.md | Data flow diagram, per-step logic, technical stack, adding new methods |
| docs/configuration.md | params.yaml reference, demo vs production comparison |
| docs/exhaustion-gene-panel.md | 18-gene panel, per-gene rationale, source references |
| docs/docker.md | Docker build, JupyterLab, GPU notes |
- Yost KE et al. (2019) Clonal replacement of tumor-specific T cells following PD-1 blockade — Nature Medicine
- Zheng et al. (2021) Pan-cancer single-cell landscape of tumor-infiltrating T cells — Science
- Sade-Feldman et al. (2018) Defining T cell states associated with response to checkpoint immunotherapy — Cell
- Oliveira et al. (2021) Phenotype, specificity and avidity of antitumour CD8+ T cells — Nature
- Caushi et al. (2021) Transcriptional programs of neoantigen-specific TIL in lung cancer — Nature
