An LLM-powered pipeline to find and summarize relevant papers to interpret the results of omics analyses.
- Multi-Omics Support: Transcriptomics, proteomics, metabolomics, genomics, metagenomics, epigenomics, lipidomics
- Smart Auto-Detection: Automatically detects omics type from your data columns
- Advanced Literature Mining: PMC integration with multiple scoring algorithms
- MeSH Enhancement: Claude Haiku generates Medical Subject Headings for precise searches
- Gene-Aware Scoring: Specialized scoring methods optimized for omics data
- AI-Powered Synthesis: Uses Claude Sonnet to generate coherent, referenced discussions
- Disease-Specific Context: Prompts tailor interpretations to your experimental conditions
- Professional Reports: Generates structured markdown reports with citations
- Web Interface: User-friendly Streamlit interface for easy analysis
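As a rough illustration of the auto-detection feature listed above, the sketch below guesses an omics type from column names. The hint table, the function name `guess_omics_type`, and the `"unknown"` fallback are all assumptions made for this example; the detector shipped in the package may use different rules.

```python
# Hypothetical column-name hints (not taken from the package source).
OMICS_HINTS = {
    "transcriptomics": ("gene_id", "ensembl_id", "gene_symbol"),
    "proteomics": ("protein_id", "uniprot_id"),
    "metabolomics": ("metabolite", "hmdb_id"),
}


def guess_omics_type(columns, default="unknown"):
    """Return the omics type whose hint columns best match the header."""
    lowered = {c.strip().lower() for c in columns}
    best_type, best_hits = default, 0
    for omics_type, hints in OMICS_HINTS.items():
        hits = sum(1 for hint in hints if hint in lowered)
        if hits > best_hits:
            best_type, best_hits = omics_type, hits
    return best_type
```

Calling `guess_omics_type(["UniProt_ID", "log2FC", "padj"])` matches the proteomics hints; a header with no recognized identifier column falls back to the default.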
# Clone the repository
git clone https://github.com/AartikSarma/de_interpreter.git
cd de_interpreter
# Install dependencies (includes all features: scoring, MeSH enhancement, web interface)
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys:
# ANTHROPIC_API_KEY=your_claude_api_key

# Basic analysis (auto-detects omics type)
python -m de_interpreter \
--de-file results.csv \
--metadata metadata.json \
--output my_analysis \
--max-features 25
# Advanced analysis with all features
python -m de_interpreter \
--de-file results.csv \
--metadata metadata.json \
--output my_analysis \
--max-features 25 \
--use-scoring \
--scorer-type gene_query_similarity \
--use-mesh \
--mesh-terms-count 3
# Launch web interface
python -m de_interpreter --web

# Start the Streamlit web interface
streamlit run streamlit_app.py
# OR
python -m de_interpreter --web

import asyncio
from de_interpreter.main import SimplifiedPipeline, AnalysisConfig

async def analyze():
    # Configure analysis
    config = AnalysisConfig(
        max_features=25,
        use_scoring=True,
        scorer_type="gene_query_similarity",
        use_mesh_enhancement=True,
        anthropic_api_key="your-api-key"
    )
    # Create pipeline
    pipeline = SimplifiedPipeline(config)
    # Run analysis
    report_path = await pipeline.run_analysis(
        de_file="results.csv",
        metadata_file="metadata.json",
        output_name="my_analysis"
    )
    print(f"Report generated: {report_path}")

asyncio.run(analyze())

Required columns:
- Gene identifier (gene_id, ensembl_id, or similar)
- log2FoldChange (or log2FC, logFC)
- pvalue (or p_value)
- padj (or p_adj, FDR)
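To check a file against the column list above before running the pipeline, a small pre-flight helper can help. This is a minimal sketch, not the package's own parser: the alias table contains only the names listed above, the helper name `map_columns` is hypothetical, and case-insensitive matching is an assumption.

```python
# Alias table built from the column names listed above.
# Case-insensitive matching is an assumption of this sketch.
COLUMN_ALIASES = {
    "gene_id": ("gene_id", "ensembl_id"),
    "log2FoldChange": ("log2foldchange", "log2fc", "logfc"),
    "pvalue": ("pvalue", "p_value"),
    "padj": ("padj", "p_adj", "fdr"),
}


def map_columns(header):
    """Map each canonical column name to the matching name in `header`.

    Raises ValueError listing every required column that has no match.
    """
    by_lower = {name.strip().lower(): name for name in header}
    mapping, missing = {}, []
    for canonical, aliases in COLUMN_ALIASES.items():
        found = next((by_lower[a] for a in aliases if a in by_lower), None)
        if found is None:
            missing.append(canonical)
        else:
            mapping[canonical] = found
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    return mapping
```

The header can be read with the standard `csv` module, for example `map_columns(next(csv.reader(open("results.csv"))))`.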
{
"disease": "Alzheimer's disease",
"tissue": "hippocampus",
"cell_type": "neurons",
"treatment": "amyloid-beta oligomers",
"control": "vehicle",
"time_point": "24 hours",
"organism": "human",
"sample_size": {
"treatment": 6,
"control": 6
}
}

The tool generates:
- Markdown Report (output/my_analysis.md)
  - Executive summary
  - Gene-by-gene discussions with citations
  - Cluster analyses
  - Methods description
- Metadata JSON (output/my_analysis_metadata.json)
  - Analysis parameters
  - Summary statistics
The repository includes example COVID-19 ARDS transcriptomics data from:
Sarma, A., Christenson, S.A., Byrne, A. et al. Tracheal aspirate RNA sequencing identifies distinct immunological features of COVID-19 ARDS. Nat Commun 12, 5152 (2021). https://doi.org/10.1038/s41467-021-25040-5
This dataset contains differential gene expression results from tracheal aspirate RNA sequencing comparing COVID-19 ARDS patients to controls.
# Using the web interface (recommended)
streamlit run streamlit_app.py
# Then click the "COVID-19 ARDS" example button
# Or via command line
python -m de_interpreter.main \
--de-file covid_data/covid_deg_fixed.csv \
--metadata covid_data/covid_metadata.json \
--output covid_analysis \
--max-features 25

Choose from multiple scoring algorithms optimized for different use cases:
- TF-IDF: Fast baseline similarity scoring
- BM25: Balanced relevance ranking
- BioBERT: Semantic similarity using biomedical language models
- Gene-Query Similarity: Enhanced scoring specifically designed for omics data
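To give a sense of what a lexical scorer such as BM25 computes, here is a minimal, self-contained sketch of Okapi BM25 over whitespace-tokenized abstracts. It is an illustration of the general technique only; the function name, the tokenization, and the `k1`/`b` defaults are assumptions, and the project's own scorer may differ.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    query_terms = set(query.lower().split())

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score exactly zero; semantic scorers such as BioBERT exist to catch the relevant papers that a purely lexical method like this one misses.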
Automatically generate Medical Subject Headings using Claude Haiku to improve literature search precision:
python -m de_interpreter \
--de-file data.csv \
--use-mesh \
--mesh-terms-count 4 \
--use-scoring \
--scorer-type biobert

# Benchmark different scoring methods
python -m de_interpreter.utils.benchmarking
# Run usage examples
python -m de_interpreter.utils.examples scoring

# Run tests
pytest tests/
# Run specific test categories
pytest tests/unit/ # Unit tests
pytest tests/integration/ # Integration tests
# Format code
black de_interpreter/
ruff check de_interpreter/ --fix
# Type checking
mypy de_interpreter/

Input Processing → Gene Prioritization → Literature Mining → AI Synthesis → Report Generation
       ↓                   ↓                    ↓                 ↓                ↓
   DE Parser          Statistical &       PMC/Literature      Claude API        Markdown
 Metadata Parser    Biological Scoring        Scoring                            Reports
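The Gene Prioritization stage in the diagram is not specified in detail here. One common approach, shown as a sketch under that assumption, is to keep features below an adjusted p-value cutoff and rank them by absolute log2 fold change, truncating to `max_features`. The function name `prioritize`, the row format, and the 0.05 cutoff are hypothetical.

```python
def prioritize(rows, max_features=25, padj_cutoff=0.05):
    """Keep significant rows and rank them by absolute log2 fold change.

    `rows` is a list of dicts with "padj" and "log2FoldChange" keys;
    rows with a missing padj (None) are dropped.
    """
    significant = [
        r for r in rows
        if r["padj"] is not None and r["padj"] < padj_cutoff
    ]
    ranked = sorted(
        significant,
        key=lambda r: abs(r["log2FoldChange"]),
        reverse=True,
    )
    return ranked[:max_features]
```

Ranking on the absolute value treats strongly down-regulated features the same as strongly up-regulated ones, which is usually what a literature search needs.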
The project includes several standalone utility scripts for gene-query literature analysis:
# Basic gene scoring against a literature query
python scripts/gene_query_similarity_scorer.py \
--genes TP53 BRCA1 MYC \
--query "cancer progression" \
--top-papers 20
# With output file
python scripts/gene_query_similarity_scorer.py \
--genes FAM71A P2RY14 CAB39L \
--query "COVID-19 inflammatory response" \
--output results.json

# Use Claude Haiku to generate MeSH terms for enhanced searches
python scripts/claude_enhanced_gene_scorer.py \
--genes IFNG TNF IL6 \
--query "immune response" \
--mesh-terms 5

# Example showing how to integrate with de_interpreter components
python scripts/example_gene_query_scoring.py

See scripts/README.md for detailed documentation and docs/ for comprehensive guides.
# Benchmark different scoring methods
python scripts/benchmark_scoring.py
# Test individual paper scoring
python scripts/score_query_vs_pmid.py "cancer gene expression" 12345678

MIT License
Contributions are welcome! Please submit pull requests or open issues for bugs and feature requests.
If you use DE Interpreter in your research, please cite:
Omics Interpreter: AI-Powered Omics Analysis
https://github.com/AartikSarma/de_interpreter