scTCR-Guide predicts CD8 T-cell clonal expansion state from single-cell RNA-seq expression profiles. During model development, paired scRNA-seq/scTCR-seq data are used to define clonal expansion labels. During inference, the model uses scRNA-seq expression only.
The repository contains the model code, inference workflow, training and evaluation scripts, benchmark utilities, and the released CD8 model assets. Raw data, processed training bundles, prediction tables, figures, and study result tables are not included.
For each input CD8 T cell, scTCR-Guide returns:
- p_high_expansion_raw: raw model probability for high clonal expansion
- p_high_expansion_calibrated: calibrated probability
- predicted_clone_expansion: High or Low
- sample-level summaries when sample_id is available
The locked decision rule for the released CD8 model is:
High if p_high_expansion_raw >= 0.44
The model does not reconstruct TCR sequences and does not estimate exact clone sizes.
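As an illustration of the locked rule, the sketch below applies the 0.44 cutoff to the raw (not the calibrated) probability. It also shows one possible sample-level summary, the fraction of High calls per sample. The function names and the summary definition are hypothetical; the columns actually written by the inference script are not specified here.

```python
from collections import defaultdict

HIGH_THRESHOLD = 0.44  # locked cutoff on p_high_expansion_raw for the released CD8 model


def call_expansion(p_high_expansion_raw):
    """Return 'High' if the raw probability meets the cutoff, else 'Low'."""
    return "High" if p_high_expansion_raw >= HIGH_THRESHOLD else "Low"


def fraction_high_by_sample(rows):
    """rows: iterable of (sample_id, p_high_expansion_raw) pairs.

    Hypothetical summary: fraction of cells called High in each sample.
    """
    counts = defaultdict(lambda: [0, 0])  # sample_id -> [n_high, n_cells]
    for sample_id, p_raw in rows:
        counts[sample_id][0] += call_expansion(p_raw) == "High"
        counts[sample_id][1] += 1
    return {s: n_high / n_cells for s, (n_high, n_cells) in counts.items()}
```

Because the comparison is `>=`, a cell with a raw probability of exactly 0.44 is called High.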
configs/                  Final model and training configuration
docs/                     Input format and reproducibility notes
examples/                 Minimal command-line examples
models/scTCR-Guide-CD8/   Released CD8 model checkpoint and preprocessing assets
scripts/preprocess/       Bundle construction, labeling, and feature selection
scripts/train/            Training entry points
scripts/evaluate/         Evaluation and benchmark utilities
scripts/inference/        Inference entry point for new CD8 scRNA-seq bundles
src/sctcr_guide/          Python package
tests/                    Lightweight import and forward-pass tests
Create a Python environment and install the package in editable mode:
git clone https://github.com/tangaode/scTCR_Guide.git
cd scTCR_Guide
pip install -e .
For GPU inference or training, install a PyTorch build that matches your CUDA version before installing the package.
The inference script expects a prepared bundle:
bundle/
  metadata.json
  rna.npy
  train.parquet
rna.npy stores raw count values with cells as rows and genes as columns. metadata.json must contain gene_names, which lists the genes in the column order of rna.npy. The split table (for example, train.parquet) must contain rna_index and may also contain cell_id, barcode, sample_id, donor_id, study_id, and source.
The released model expects high-confidence CD8 T cells as input. Whole-tissue matrices should be quality controlled and annotated first, then subset to CD8 T cells before inference.
More details are in docs/input_format.md.
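Before running inference, it can help to confirm that the three bundle files agree with each other. The sketch below is a hypothetical helper, not part of the package. It assumes that rna_index holds zero-based row positions into rna.npy, which the description above suggests but does not state.

```python
def check_bundle(rna_shape, metadata, rna_index):
    """Return a list of problems found; an empty list means the pieces agree.

    rna_shape: (n_cells, n_genes) of the array stored in rna.npy
    metadata:  dict parsed from metadata.json
    rna_index: values of the rna_index column in the split table
    """
    problems = []
    n_cells, n_genes = rna_shape
    gene_names = metadata.get("gene_names")
    if gene_names is None:
        problems.append("metadata.json is missing gene_names")
    elif len(gene_names) != n_genes:
        problems.append(
            f"gene_names has {len(gene_names)} entries but rna.npy has {n_genes} columns"
        )
    # Assumption: rna_index is a zero-based row position into rna.npy.
    out_of_range = [i for i in rna_index if not 0 <= i < n_cells]
    if out_of_range:
        problems.append(
            f"{len(out_of_range)} rna_index values fall outside 0..{n_cells - 1}"
        )
    return problems
```

The array shape can be read without loading the full matrix, for example with `numpy.load("rna.npy", mmap_mode="r").shape`.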
python scripts/inference/predict_cd8_expansion.py \
--bundle-dir /path/to/cd8_bundle \
--split train \
--deployment-manifest models/scTCR-Guide-CD8/deployment_manifest.json \
--output-parquet predictions.parquet \
--output-csv predictions.csv \
--sample-summary-csv sample_summary.csv
The model maps input genes to the fixed 1024-gene panel, fills missing genes with zero before transformation, applies the same asinh transform and training-set standardization used during model development, and writes one prediction row per cell.
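The preprocessing order described above (align to the panel, zero-fill, asinh, standardize) can be sketched as follows. This is a minimal illustration in plain Python lists, not the package's implementation. It assumes a plain asinh with no cofactor or library-size scaling, and per-gene mean and standard deviation computed on the asinh scale. The function and argument names are hypothetical.

```python
import math


def prepare_features(counts, input_genes, panel_genes, panel_mean, panel_std):
    """Align raw counts to a fixed gene panel, then asinh-transform and standardize.

    counts:      per-cell rows of raw counts, columns ordered as input_genes
    panel_genes: fixed gene order expected by the model
    panel_mean, panel_std: per-gene training-set statistics
                           (assumed to be on the asinh scale)
    """
    column_of = {gene: j for j, gene in enumerate(input_genes)}
    features = []
    for row in counts:
        # Genes absent from the input are filled with zero before the transform.
        aligned = [row[column_of[g]] if g in column_of else 0.0 for g in panel_genes]
        features.append([
            (math.asinh(x) - m) / s
            for x, m, s in zip(aligned, panel_mean, panel_std)
        ])
    return features
```

Since asinh(0) = 0, a missing gene ends up at -mean/std after standardization rather than at zero. That is worth keeping in mind when many panel genes are absent from the input.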
Training a model from a prepared labeled bundle:
python scripts/train/train_clone_state.py \
--data-dir /path/to/labeled_bundle \
--model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
--train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
--output-dir runs/example_model
Tune the probability threshold on validation data and evaluate on an internal test split:
python scripts/evaluate/evaluate_single_binary_threshold_tuned.py \
--data-dir /path/to/labeled_bundle \
--checkpoint runs/example_model/model_best.pt \
--model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
--train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
--output-json runs/example_model/threshold_tuned_eval.json
External benchmark scripts are provided under scripts/evaluate/. They operate on local bundles and write result files to user-specified output directories. No benchmark result files are shipped in this repository.
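To show what tuning a probability threshold on validation data can look like, the sketch below sweeps candidate cutoffs and keeps the one with the best validation score. The metric (balanced accuracy), the candidate grid, and the function names are assumptions chosen for illustration. The objective used by the evaluation script is not stated here.

```python
def balanced_accuracy(labels, probs, threshold):
    """Mean of sensitivity and specificity for calls made with probs >= threshold.

    labels: 1 for High, 0 for Low; both classes must be present.
    """
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def tune_threshold(val_labels, val_probs, candidates=None):
    """Pick the candidate threshold with the best validation balanced accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    # max() keeps the first best candidate, so ties resolve to the lowest threshold.
    return max(candidates, key=lambda t: balanced_accuracy(val_labels, val_probs, t))
```

The tuned value is then held fixed and applied once to the test split, for example `balanced_accuracy(test_labels, test_probs, tuned)`, so that the test data play no part in choosing the cutoff.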
Training labels are derived from paired scTCR-seq data. Clone size is counted separately within each sample. The high-expansion threshold for a sample is 1.5 times the median clone size among that sample's non-singleton clonotypes. A cell is labeled High if its clone size is strictly greater than this sample-specific threshold; otherwise it is labeled Low.
This sample-relative rule is intended to reduce bias from differences in sample size, biopsy site, and TCR capture depth.
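The labeling rule can be sketched for a single sample as follows. The function name is hypothetical, and the handling of a sample with no non-singleton clonotypes (all cells labeled Low) is an assumption, since that case is not described above.

```python
from collections import Counter
from statistics import median


def label_sample(clonotype_ids):
    """Label each cell of ONE sample as 'High' or 'Low' from its clonotype ID.

    clonotype_ids: one clonotype identifier per cell, all from the same sample.
    """
    clone_size = Counter(clonotype_ids)
    non_singleton_sizes = [n for n in clone_size.values() if n > 1]
    if not non_singleton_sizes:
        # Assumption: with no expanded clonotypes the threshold is undefined,
        # so every cell is labeled Low.
        return ["Low"] * len(clonotype_ids)
    threshold = 1.5 * median(non_singleton_sizes)
    return ["High" if clone_size[c] > threshold else "Low" for c in clonotype_ids]
```

For example, a sample with clonotype sizes 6, 2, 2, and 1 has a non-singleton median of 2 and a threshold of 3, so only the six cells of the largest clonotype are labeled High.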
The released CD8 model is stored in models/scTCR-Guide-CD8/:
- model_best.pt: model weights
- model_config.yaml: model architecture
- train_config.yaml: training-time preprocessing and batch settings
- preprocessing.json: 1024-gene order and training-set standardization parameters
- gene_panel.json: selected CD8 gene panel
- calibration.json: probability calibration and decision threshold
- deployment_manifest.json: relative-path manifest used by the inference script
This repository does not contain raw or processed single-cell datasets. Users should obtain datasets from the original studies and prepare bundles in the documented format.
If you use scTCR-Guide, please cite the associated publication when available.