scTCR-Guide

scTCR-Guide predicts CD8 T-cell clonal expansion state from single-cell RNA-seq expression profiles. During model development, paired scRNA-seq/scTCR-seq data are used to define clonal expansion labels. During inference, the model uses scRNA-seq expression only.

The repository contains the model code, inference workflow, training and evaluation scripts, benchmark utilities, and the released CD8 model assets. Raw data, processed training bundles, prediction tables, figures, and study result tables are not included.

What the model predicts

For each input CD8 T cell, scTCR-Guide returns:

  • p_high_expansion_raw: raw model probability for high clonal expansion
  • p_high_expansion_calibrated: calibrated probability
  • predicted_clone_expansion: High or Low
  • sample-level summaries when sample_id is available

The locked decision rule for the released CD8 model is:

High if p_high_expansion_raw >= 0.44

The model does not reconstruct TCR sequences and does not estimate exact clone sizes.
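
The locked decision rule above can be sketched in a few lines of Python. This is an illustration, not the repository's code: the function name `classify_expansion` and the constant name are hypothetical, and only the 0.44 cutoff on the raw probability comes from this README.

```python
# Cutoff taken from the locked decision rule stated in this README.
HIGH_EXPANSION_THRESHOLD = 0.44


def classify_expansion(p_high_expansion_raw: float) -> str:
    """Return "High" or "Low" from the raw (uncalibrated) probability."""
    if not 0.0 <= p_high_expansion_raw <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # The rule is inclusive: a probability equal to the cutoff is called High.
    return "High" if p_high_expansion_raw >= HIGH_EXPANSION_THRESHOLD else "Low"


labels = [classify_expansion(p) for p in (0.10, 0.44, 0.91)]
print(labels)  # ['Low', 'High', 'High']
```

Note that the rule is applied to the raw probability, so the calibrated probability does not change the High/Low call.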

Repository layout

configs/                     Final model and training configuration
docs/                        Input format and reproducibility notes
examples/                    Minimal command-line examples
models/scTCR-Guide-CD8/      Released CD8 model checkpoint and preprocessing assets
scripts/preprocess/          Bundle construction, labeling, and feature selection
scripts/train/               Training entry points
scripts/evaluate/            Evaluation and benchmark utilities
scripts/inference/           Inference entry point for new CD8 scRNA-seq bundles
src/sctcr_guide/             Python package
tests/                       Lightweight import and forward-pass tests

Installation

Create a Python environment and install the package in editable mode:

git clone https://github.com/tangaode/scTCR_Guide.git
cd scTCR_Guide
pip install -e .

For GPU inference or training, install a PyTorch build that matches your CUDA version before installing the package.

Input format

The inference script expects a prepared bundle:

bundle/
  metadata.json
  rna.npy
  train.parquet

rna.npy stores raw counts with cells as rows and genes as columns. metadata.json must contain gene_names, which lists the genes in the column order of rna.npy. The split table (for example, train.parquet) must contain rna_index and may also contain cell_id, barcode, sample_id, donor_id, study_id, and source.
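
Before running inference, it can help to confirm that a bundle is internally consistent. The sketch below checks only what this section states: rna.npy is a 2D cells-by-genes matrix of raw counts, and gene_names in metadata.json matches its column order. The helper name `check_bundle` and the toy gene list are hypothetical, and the demo skips the split table because the README does not describe it beyond its column names.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def check_bundle(bundle_dir: Path) -> tuple[int, int]:
    """Validate rna.npy against metadata.json and return (n_cells, n_genes)."""
    metadata = json.loads((bundle_dir / "metadata.json").read_text())
    gene_names = metadata["gene_names"]
    rna = np.load(bundle_dir / "rna.npy")
    if rna.ndim != 2:
        raise ValueError("rna.npy must be a 2D cells-by-genes matrix")
    if rna.shape[1] != len(gene_names):
        raise ValueError("gene_names length must equal the number of columns")
    if len(set(gene_names)) != len(gene_names):
        raise ValueError("gene_names contains duplicates")
    if (rna < 0).any():
        raise ValueError("raw counts cannot be negative")
    return int(rna.shape[0]), int(rna.shape[1])


# Toy bundle with 2 cells and 3 genes, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "metadata.json").write_text(
        json.dumps({"gene_names": ["CD8A", "GZMB", "IL7R"]})
    )
    np.save(bundle / "rna.npy", np.array([[5, 0, 2], [1, 7, 0]]))
    shape = check_bundle(bundle)

print(shape)  # (2, 3)
```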

The released model expects high-confidence CD8 T cells as input. Whole-tissue matrices should be quality controlled and annotated first, then subset to CD8 T cells before inference.

More details are in docs/input_format.md.

Run inference

python scripts/inference/predict_cd8_expansion.py \
  --bundle-dir /path/to/cd8_bundle \
  --split train \
  --deployment-manifest models/scTCR-Guide-CD8/deployment_manifest.json \
  --output-parquet predictions.parquet \
  --output-csv predictions.csv \
  --sample-summary-csv sample_summary.csv

The inference workflow maps input genes to the fixed 1024-gene panel, fills missing genes with zero before transformation, applies the same asinh transform and training-set standardization used during model development, and writes one prediction row per cell.
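
The preprocessing steps described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `prepare_features` and its arguments are hypothetical, the asinh transform is applied without any scaling factor (the README does not specify one), and the mean and standard deviation are assumed to be per-gene vectors. The real panel and standardization parameters live in preprocessing.json, whose layout is not described here.

```python
import numpy as np


def prepare_features(counts, gene_names, panel_genes, mean, std):
    """Align counts to the panel, zero-fill missing genes, asinh, standardize."""
    column_of = {gene: j for j, gene in enumerate(gene_names)}
    # Start from zeros so genes absent from the input stay at zero.
    aligned = np.zeros((counts.shape[0], len(panel_genes)), dtype=np.float64)
    for k, gene in enumerate(panel_genes):
        j = column_of.get(gene)
        if j is not None:
            aligned[:, k] = counts[:, j]
    # Assumption: plain asinh with no cofactor or library-size scaling.
    transformed = np.arcsinh(aligned)
    # Assumption: per-gene training-set mean and standard deviation.
    return (transformed - mean) / std


# Toy example: 2 cells, 3 input genes, and a 3-gene panel missing one input gene.
counts = np.array([[0, 3, 1], [2, 0, 5]])
gene_names = ["GZMB", "CD8A", "ACTB"]
panel_genes = ["CD8A", "NKG7", "GZMB"]  # NKG7 is absent from the input
features = prepare_features(
    counts, gene_names, panel_genes, mean=np.zeros(3), std=np.ones(3)
)
print(features.shape)  # (2, 3)
```

Input genes outside the panel (ACTB in the toy example) are dropped, and panel genes missing from the input (NKG7) enter the transform as zero counts.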

Training and evaluation

Train a model from a prepared labeled bundle:

python scripts/train/train_clone_state.py \
  --data-dir /path/to/labeled_bundle \
  --model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
  --train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
  --output-dir runs/example_model

Tune the probability threshold on validation data and evaluate on an internal test split:

python scripts/evaluate/evaluate_single_binary_threshold_tuned.py \
  --data-dir /path/to/labeled_bundle \
  --checkpoint runs/example_model/model_best.pt \
  --model-config configs/model/clone_state_cross_gated_mlp_v1.yaml \
  --train-config configs/train/clone_state_cross_gated_mlp_v1.yaml \
  --output-json runs/example_model/threshold_tuned_eval.json

External benchmark scripts are provided under scripts/evaluate/. They operate on local bundles and write result files to user-specified output directories. No benchmark result files are shipped in this repository.

Clonal expansion labels

Training labels are derived from paired scTCR-seq data. Clone size is counted within each sample. The high-expansion threshold is 1.5 times the median clone size among that sample's non-singleton clonotypes. A cell is labeled High if its clone size is greater than this sample-specific threshold; otherwise it is labeled Low.

This sample-relative rule is intended to reduce bias from differences in sample size, biopsy site, and TCR capture depth.
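
A worked example of the labeling rule for a single sample is shown below. It is a sketch, not the code under scripts/preprocess/: the function name `label_sample` is hypothetical, the median is taken over clonotypes (each non-singleton clonotype counted once), and samples with no non-singleton clonotype are labeled Low throughout, which is an assumption the README does not state.

```python
from collections import Counter
from statistics import median


def label_sample(clonotype_ids: list[str]) -> list[str]:
    """Label each cell of ONE sample as "High" or "Low"."""
    clone_size = Counter(clonotype_ids)  # cells per clonotype within the sample
    non_singleton_sizes = [n for n in clone_size.values() if n > 1]
    if not non_singleton_sizes:
        # Assumption: without any expanded clonotype, every cell is Low.
        return ["Low"] * len(clonotype_ids)
    threshold = 1.5 * median(non_singleton_sizes)
    # Strictly greater than the threshold means High.
    return ["High" if clone_size[c] > threshold else "Low" for c in clonotype_ids]


# Clone sizes: A=6, B=2, C=2, D=1. Non-singleton sizes are [6, 2, 2],
# so the median is 2 and the threshold is 3.0. Only clonotype A exceeds it.
cells = ["A"] * 6 + ["B"] * 2 + ["C"] * 2 + ["D"]
labels = label_sample(cells)
print(labels.count("High"), labels.count("Low"))  # 6 5
```

To label a full dataset, the same function would be applied to each sample separately, which is what makes the threshold sample-specific.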

Released model assets

The released CD8 model is stored in models/scTCR-Guide-CD8/:

  • model_best.pt: model weights
  • model_config.yaml: model architecture
  • train_config.yaml: training-time preprocessing and batch settings
  • preprocessing.json: 1024-gene order and training-set standardization parameters
  • gene_panel.json: selected CD8 gene panel
  • calibration.json: probability calibration and decision threshold
  • deployment_manifest.json: relative-path manifest used by the inference script

Data availability

This repository does not contain raw or processed single-cell datasets. Users should obtain datasets from the original studies and prepare bundles in the documented format.

Citation

If you use scTCR-Guide, please cite the associated publication when available.
