scTCR-Guide

scTCR-Guide predicts CD8 T-cell clonal expansion state from single-cell RNA-seq expression profiles. During model development, paired scRNA-seq/scTCR-seq data are used to define clonal expansion labels. During inference, the model uses scRNA-seq expression only.

The repository contains the model code, inference workflow, training and evaluation scripts, benchmark utilities, and the released CD8 model assets. Raw data, processed training bundles, prediction tables, figures, and study result tables are not included.

What the model predicts

For each input CD8 T cell, scTCR-Guide returns:

p_high_expansion_raw: raw model probability for high clonal expansion
p_high_expansion_calibrated: calibrated probability
predicted_clone_expansion: High or Low
sample-level summaries when sample_id is available

The locked decision rule for the released CD8 model is:

High if p_high_expansion_raw >= 0.44, otherwise Low

The model does not reconstruct TCR sequences and does not estimate exact clone sizes.
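
As a minimal sketch of the decision rule (not the repository's own code), the threshold can be applied to raw probabilities as shown here. The names classify_expansion and fraction_high_by_sample are hypothetical, and the per-sample fraction is only one plausible sample-level summary; this README does not say which statistics the inference script actually reports.

```python
from collections import defaultdict

THRESHOLD = 0.44  # locked decision threshold for the released CD8 model


def classify_expansion(p_raw, threshold=THRESHOLD):
    """Return 'High' if the raw probability meets the threshold, else 'Low'."""
    return "High" if p_raw >= threshold else "Low"


def fraction_high_by_sample(rows, threshold=THRESHOLD):
    """Hypothetical sample-level summary: fraction of cells called High.

    `rows` is an iterable of (sample_id, p_high_expansion_raw) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # sample_id -> [n_high, n_cells]
    for sample_id, p_raw in rows:
        counts[sample_id][1] += 1
        if classify_expansion(p_raw, threshold) == "High":
            counts[sample_id][0] += 1
    return {s: n_high / n for s, (n_high, n) in counts.items()}


# Made-up probabilities for two samples, for illustration only.
rows = [("S1", 0.44), ("S1", 0.10), ("S2", 0.90), ("S2", 0.43)]
labels = [classify_expansion(p) for _, p in rows]
summary = fraction_high_by_sample(rows)
```

Note that the rule uses the raw probability, not the calibrated one, and that a value exactly equal to 0.44 is called High.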

Repository layout

configs/                     Final model and training configuration
docs/                        Input format and reproducibility notes
examples/                    Minimal command-line examples
models/scTCR-Guide-CD8/      Released CD8 model checkpoint and preprocessing assets
scripts/preprocess/          Bundle construction, labeling, and feature selection
scripts/train/               Training entry points
scripts/evaluate/            Evaluation and benchmark utilities
scripts/inference/           Inference entry point for new CD8 scRNA-seq bundles
src/sctcr_guide/             Python package
tests/                       Lightweight import and forward-pass tests

Installation

Create a Python environment and install the package in editable mode:

git clone https://github.com/tangaode/scTCR_Guide.git
cd scTCR_Guide
pip install -e .

For GPU inference or training, install a PyTorch build that matches your CUDA version before installing the package.
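
To confirm that the installed PyTorch build can see a GPU, a generic check along these lines may help. It is not part of the repository, and cuda_status is a hypothetical helper name.

```python
def cuda_status():
    """Return True/False for CUDA availability, or None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


status = cuda_status()
if status is None:
    print("PyTorch is not installed")
else:
    print("CUDA available:", status)
```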

Input format

The inference script expects a prepared bundle:

bundle/
  metadata.json
  rna.npy
  train.parquet

rna.npy stores raw count values with cells as rows and genes as columns. metadata.json must contain gene_names, which lists the genes in the column order of rna.npy. The split table (for example, train.parquet) must contain rna_index and may also contain cell_id, barcode, sample_id, donor_id, study_id, and source.
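
Before running inference, it can be useful to check that the pieces of a bundle agree with each other. The sketch below uses only the standard library and takes the matrix shape and the rna_index values as arguments, so loading rna.npy and the parquet table is left to the caller. The helper name check_bundle_metadata is hypothetical, and the code assumes rna_index is a zero-based row index into rna.npy, which this README does not state explicitly; docs/input_format.md is the authority.

```python
import json
import tempfile
from pathlib import Path


def check_bundle_metadata(bundle_dir, n_cells, n_genes, rna_index):
    """Return a list of problems found; an empty list means the checks passed.

    n_cells and n_genes are the shape of rna.npy (cells x genes);
    rna_index holds the rna_index column of the split table
    (assumed here to be zero-based row positions in rna.npy).
    """
    problems = []
    metadata = json.loads((Path(bundle_dir) / "metadata.json").read_text())
    gene_names = metadata.get("gene_names")
    if gene_names is None:
        problems.append("metadata.json is missing gene_names")
    elif len(gene_names) != n_genes:
        problems.append("gene_names length does not match rna.npy columns")
    if any(i < 0 or i >= n_cells for i in rna_index):
        problems.append("rna_index contains rows outside rna.npy")
    return problems


# Toy bundle with 3 genes and 2 cells, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "metadata.json").write_text(
        json.dumps({"gene_names": ["CD8A", "GZMB", "PDCD1"]})
    )
    ok = check_bundle_metadata(tmp, n_cells=2, n_genes=3, rna_index=[0, 1])
    bad = check_bundle_metadata(tmp, n_cells=2, n_genes=4, rna_index=[0, 2])
```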

The released model expects high-confidence CD8 T cells as input. Whole-tissue matrices should be quality controlled and annotated first, then subset to CD8 T cells before inference.

More details are in docs/input_format.md.

Run inference

python scripts/inference/predict_cd8_expansion.py \
  --bundle-dir /path/to/cd8_bundle \
  --split train \
  --deployment-manifest models/scTCR-Guide-CD8/deployment_manifest.json \
  --output-parquet predictions.parquet \
  --output-csv predictions.csv \
  --sample-summary-csv sample_summary.csv

The model maps input genes to the fixed 1024-gene panel, fills missing genes with zero before transformation, applies the same asinh transform and training-set standardization used during model development, and writes one prediction row per cell.
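
The preprocessing steps described above can be sketched for a single cell as follows. This is an illustration under stated assumptions, not the repository's implementation: a toy 3-gene panel stands in for the 1024-gene panel, the mean and std values are made up, and a plain asinh with no scaling factor is assumed. The actual parameters come from the released preprocessing assets.

```python
import math


def preprocess_cell(counts, input_genes, panel_genes, mean, std):
    """Map one cell's raw counts onto a fixed panel and standardize.

    Genes absent from the input are filled with zero before the transform.
    """
    lookup = dict(zip(input_genes, counts))
    # Reorder to the panel's gene order, zero-filling missing genes.
    aligned = [lookup.get(g, 0.0) for g in panel_genes]
    # Plain asinh is an assumption; the real transform may scale its input.
    transformed = [math.asinh(x) for x in aligned]
    # Standardize with per-gene training-set statistics.
    return [(x - m) / s for x, m, s in zip(transformed, mean, std)]


# Toy panel and statistics; all values are hypothetical.
panel_genes = ["CD8A", "GZMB", "PDCD1"]
mean = [1.0, 0.5, 0.0]
std = [1.0, 2.0, 1.0]

features = preprocess_cell(
    counts=[4, 0],
    input_genes=["GZMB", "CD8A"],  # PDCD1 is missing from the input
    panel_genes=panel_genes,
    mean=mean,
    std=std,
)
```

Because the output always follows the panel's gene order, the input matrix may list genes in any order and may include genes outside the panel, which are simply ignored in this sketch.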

Training and evaluation

Train a model from a prepared labeled bundle:

python scripts/train/train_clone_state.py \
  --data-dir /path/to/labeled_bundle \
  --model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
  --train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
  --output-dir runs/example_model

Tune the probability threshold on validation data and evaluate on an internal test split:

python scripts/evaluate/evaluate_single_binary_threshold_tuned.py \
  --data-dir /path/to/labeled_bundle \
  --checkpoint runs/example_model/model_best.pt \
  --model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
  --train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
  --output-json runs/example_model/threshold_tuned_eval.json
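
Conceptually, threshold tuning scans candidate thresholds on validation data and keeps the one that scores best. The sketch below uses balanced accuracy as the criterion, which is an assumption; the evaluation script may optimize a different metric. The function names and the toy data are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (1 = High, 0 = Low).

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def tune_threshold(y_true, probs, grid):
    """Return the grid threshold with the best validation balanced accuracy."""
    best_t, best_score = None, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = balanced_accuracy(y_true, preds)
        if score > best_score:  # on ties, the lowest threshold is kept
            best_t, best_score = t, score
    return best_t, best_score


# Made-up validation labels and probabilities, for illustration only.
y_val = [0, 0, 0, 1, 1, 1]
p_val = [0.12, 0.31, 0.47, 0.41, 0.62, 0.93]
grid = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
best_t, best_score = tune_threshold(y_val, p_val, grid)
```

The threshold is chosen on validation data only, so the test split stays untouched until the final evaluation.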

External benchmark scripts are provided under scripts/evaluate/. They operate on local bundles and write result files to user-specified output directories. No benchmark result files are shipped in this repository.

Clonal expansion labels

Training labels are derived from paired scTCR-seq data. For each sample, clone size is counted within the sample. The high-expansion threshold is 1.5 times the median clone size among non-singleton clonotypes. A cell is labeled High if its clone size is greater than this sample-specific threshold; otherwise it is labeled Low.

This sample-relative rule is intended to reduce bias from differences in sample size, biopsy site, and TCR capture depth.
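
The labeling rule can be sketched for one sample as follows. The function name label_sample is hypothetical, and the handling of a sample with no non-singleton clonotypes (all cells labeled Low) is an assumption, since the rule as written does not cover that case.

```python
from collections import Counter
from statistics import median


def label_sample(clonotypes, factor=1.5):
    """Label each cell of ONE sample as 'High' or 'Low'.

    `clonotypes` holds one clonotype ID per cell; clone size is counted
    within this sample only.
    """
    sizes = Counter(clonotypes)
    expanded = [n for n in sizes.values() if n > 1]  # non-singleton clonotypes
    if not expanded:
        # Assumption: with no expanded clonotypes, every cell is labeled Low.
        return ["Low"] * len(clonotypes)
    threshold = factor * median(expanded)
    return ["High" if sizes[c] > threshold else "Low" for c in clonotypes]


# Toy sample: clone A has 10 cells, B and C have 2 each, D is a singleton.
cells = ["A"] * 10 + ["B"] * 2 + ["C"] * 2 + ["D"]
labels = label_sample(cells)
```

In this toy sample the non-singleton clone sizes are 10, 2, and 2, so the median is 2 and the threshold is 3. Only the cells of clone A exceed it and are labeled High.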

Released model assets

The released CD8 model is stored in models/scTCR-Guide-CD8/:

model_best.pt: model weights
model_config.yaml: model architecture
train_config.yaml: training-time preprocessing and batch settings
preprocessing.json: 1024-gene order and training-set standardization parameters
gene_panel.json: selected CD8 gene panel
calibration.json: probability calibration and decision threshold
deployment_manifest.json: relative-path manifest used by the inference script

Data availability

This repository does not contain raw or processed single-cell datasets. Users should obtain datasets from the original studies and prepare bundles in the documented format.

Citation

If you use scTCR-Guide, please cite the associated publication when available.
