Python library and CLI tool for biological database ID conversion and annotation using TogoID.
- ID Conversion: Convert IDs between biological databases
- ID Conversion with Annotations: Add annotation columns during conversion
- ID Conversion with Filtering: Filter conversion results by annotation values
- Ortholog Retrieval: Get orthologs through round-trip conversion and taxonomy filtering
- Label to ID: Convert biological labels (gene names, etc.) to database IDs with dataset-based API selection
- Annotations: Get labels and annotations for database IDs
- Multiple Formats: Support for JSON, CSV, TSV, dict, table, and pandas DataFrame
- Dual Interface: Use as Python library or command-line tool
- Comprehensive: Search databases, find routes, get configurations
uv is a fast Python package installer and resolver (its developers report it as 10-100x faster than pip).
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/togoid/togoid-lib-python.git
cd togoid-lib-python
# Create virtual environment and install
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package
uv pip install -e .
# With pandas support
uv pip install -e ".[pandas]"
# With development tools
uv pip install -e ".[dev]"

💡 Tip: See QUICKSTART_UV.md for a detailed uv quick start guide.
# From source
git clone https://github.com/togoid/togoid-lib-python.git
cd togoid-lib-python
pip install -e .
# With pandas support
pip install -e ".[pandas]"

from togoid import TogoIDConverter, AnnotationsConverter, LabelConverter
# ID Conversion
converter = TogoIDConverter()
# JSON format (default)
result = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"])
# Dict format
result_dict = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="dict")
# Output: {'ids': ['1', '9'], 'route': ['ncbigene', 'ensembl_gene'], 'results': {'1': ['ENSG00000121410'], '9': ['ENSG00000171428']}}
# Table format
result_table = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="table")
# Output: [["1", "ENSG00000121410"], ["9", "ENSG00000171428"]]
# DataFrame format (requires pandas)
result_df = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="dataframe")
# ID Conversion with Annotations
result_with_annotations = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")]  # Add ncbigene label as annotation column
)
# ID Conversion with Filtering
result_filtered = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")],
    filter=[("ensembl_transcript", "transcript_flag", ["MANE Select"])]  # Only MANE Select transcripts
)
# Get Orthologs
orthologs = converter.get_ortholog(
    ids=["1", "9"],
    route=["ncbigene", "homologene"],
    target_taxids=["10090", "10116"]  # Mouse and Rat
)
# Label to ID Conversion
label_converter = LabelConverter()
# Convert labels with dataset specification
results = label_converter.convert(
    labels=["BRCA1", "TP53"],
    dataset="ncbigene",
    taxonomy="9606"  # Human
)
# Convert labels for other datasets
results = label_converter.convert(
    labels=["caffeine"],
    dataset="chebi",
    label_types=["togoid_chebi_label"]
)
# Get Annotations
annotator = AnnotationsConverter()
annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672", "7157"],
    fields=["label", "gene_synonym"],
    filters={}
)

# Basic ID Conversion
togoid convert --ids 1,9 --route ncbigene,ensembl_gene
# Convert with different output formats
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format table
# ID Conversion with Annotations
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --annotate ncbigene full_name
# ID Conversion with Filtering
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --filter ensembl_transcript transcript_flag "MANE Select"
# Get Orthologs
togoid get-ortholog --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090,10116 \
  --format table
# Label to ID Conversion
togoid label2id --labels "BRCA1,TP53,EGFR" --dataset ncbigene --taxonomy 9606
# Get Annotations
togoid annotate --dataset ncbigene --ids 672,7157 \
  --field gene_synonym \
  --field full_name
# List available annotation fields
togoid annotate --dataset ncbigene --list-fields
# Configuration
togoid config dataset ncbigene
togoid config descriptions
togoid count ncbigene ensembl_gene --ids 1,9

1. label_types parameter now requires list format
The label_types parameter in LabelConverter.convert() now takes a list of strings instead of a comma-separated string.
# ❌ Old (will not work)
label_converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types="symbol,synonym"  # String format
)
# ✅ New (correct)
label_converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types=["symbol", "synonym"]  # List format
)

2. format="dict" no longer supported for routes with 3+ datasets
When using routes with 3 or more datasets, format="dict" is no longer supported. Use format="table" or format="dataframe" instead.
# ❌ Old (will raise error)
converter.convert(
    ids=["1"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="dict"
)
# ✅ New (correct)
converter.convert(
    ids=["1"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table"  # or "dataframe"
)

3. annotator.execute_query filters parameter is now optional
The filters parameter in AnnotationsConverter.execute_query() is now optional and defaults to an empty dictionary.
# Both work now
annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672"],
    fields=["label"],
    filters={}  # Can be omitted
)
annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672"],
    fields=["label"]  # No filters parameter needed
)

from togoid import TogoIDConverter
converter = TogoIDConverter()
# JSON (default) - raw API response
json_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"]
)
# Dict - Includes ids, route, and results mapping {source_id: [target_ids]}
dict_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="dict"
)
# Table - [[source_id, target_id], ...] 2D array
table_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="table"
)
# DataFrame - pandas DataFrame with source_id and target_id columns
df_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="dataframe"
)

# Add annotation columns to conversion results
result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[
        ("ncbigene", "label"),      # Add gene label from ncbigene
        ("ncbigene", "full_name"),  # Add full gene name from ncbigene
        ("ensembl_gene", "label")   # Add gene label from ensembl_gene
    ]
)
# Result includes original conversion + 3 annotation columns

# Filter conversion results by annotation values
result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")],
    filter=[
        ("ensembl_transcript", "transcript_flag", ["MANE Select"])
    ]
)
# Only returns transcripts with "MANE Select" flag
# 15 transcripts → 2 transcripts (filtered)

# Get orthologs through round-trip conversion and taxonomy filtering
# Process: ncbigene -> homologene -> ncbigene -> taxonomy -> filter by taxid
result = converter.get_ortholog(
    ids=["1", "9"],                    # Human genes
    route=["ncbigene", "homologene"],  # Via homologene
    target_taxids=["10090", "10116"]   # Mouse and Rat
)
# Returns: [
#   ['1', '11167', '117586', '10090'],  # source_id, homologene_id, target_gene_id, taxid
#   ['1', '11167', '140656', '10116'],  # same source via same homologene group
#   ['9', '37329', '116632', '10116'],
#   ['9', '37329', '17961', '10090']
# ]
# Each row is ordered as: [source_id, homologene_id, target_gene_id, taxonomy_id]

# Search databases by name
databases = converter.search_databases("uniprot")
# Find routes between databases
routes = converter.route(src="ncbigene", dst="uniprot", max_hops=3)
# Lookup which tables contain an ID
tables = converter.lookup_id("672")

from togoid import LabelConverter
converter = LabelConverter(verbose=True)
# Convert gene symbols (uses SPARQList API based on dataset config)
results = converter.convert(
    labels=["BRCA1", "TP53", "EGFR"],
    dataset="ncbigene",
    taxonomy="9606"  # Human
)
# Returns: [{"input": "BRCA1", "match_type": "symbol", "symbol": "BRCA1", "identifier": "672"}, ...]
# Convert chemical names (uses PubDictionaries API based on dataset config)
results = converter.convert(
    labels=["caffeine"],
    dataset="chebi",
    label_types=["togoid_chebi_label"]  # Optional: override dataset config (list format)
)
# Convert disease names
results = converter.convert(
    labels=["breast cancer"],
    dataset="mondo",
    threshold=0.5  # PubDictionaries matching threshold
)
# Label types are auto-configured from dataset, or can be manually specified (as list)
results = converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types=["symbol"],  # Override: only search by symbol (list format)
    taxonomy="9606"
)

from togoid import AnnotationsConverter
annotator = AnnotationsConverter()
# List available fields for a dataset
fields = annotator.list_fields("ncbigene")
for field_name, field_meta in fields:
    print(f"{field_name}: {field_meta['label']}")
# Get annotations for IDs
result = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672", "7157"],
    fields=["label", "gene_synonym", "type_of_gene"],
    filters={"type_of_gene": ["protein-coding"]}
)
for entry_id, annotations in result.items():
    print(f"{entry_id}: {annotations}")

# Basic conversion
togoid convert --ids 1,9 --route ncbigene,ensembl_gene
# With output format
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
# Save to file
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format csv --output results.csv
# With additional parameters
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report pair --limit 100

# Basic conversion
togoid label2id --dataset ncbigene --labels "BRCA1,TP53,EGFR" --taxon 9606
togoid label2id --dataset chebi --labels 'caffeine' --label_types 'togoid_chebi_label'
# From file
echo -e "BRCA1\nTP53\nEGFR" > genes.txt
togoid label2id --dataset ncbigene --label-file genes.txt --taxon 9606
# CSV output
togoid label2id --dataset ncbigene --labels "BRCA1,TP53" --taxon 9606 --format csv --output results.csv
# With PubDictionaries (for non-gene labels)
togoid label2id --dataset mondo --labels "breast cancer" --label_types "togoid_mondo_label"
# Verbose mode
togoid label2id --dataset ncbigene --labels "BRCA1,TP53" --taxon 9606 --verbose

# Get annotations
togoid annotate --dataset ncbigene --ids 672,7157 --field gene_synonym --field full_name
# List available fields
togoid annotate --dataset ncbigene --list-fields
# With filters
togoid annotate --dataset ncbigene --ids 672,7157 \
  --field type_of_gene --field gene_synonym \
  --filter type_of_gene=protein-coding
# CSV output
togoid annotate --dataset ncbigene --ids 672,7157 \
  --field gene_synonym --format csv --output genes.csv
# From file
togoid annotate --dataset ncbigene --ids-file gene_ids.txt --field gene_synonym

# Search databases
togoid search databases uniprot
togoid search id NM_001110
# Lookup ID
togoid lookup id 672
# Find routes
togoid route ncbigene uniprot --max-hops 3
# Count mappings
togoid count ncbigene ensembl_gene --ids 1,9
# Get configuration
togoid config dataset ncbigene
togoid config relation ncbigene-ensembl_gene
togoid config descriptions
togoid config statistics
togoid config taxonomy

TogoIDConverter is the main class for ID conversion operations.
Methods:
- convert(route, ids, format='json', **kwargs) - Convert IDs between databases
- count(src, dst, ids, link=None) - Count mappings
- search_databases(name) - Search databases by name
- search_id(id_string) - Search databases by ID pattern
- lookup_id(id_string) - Lookup which tables contain an ID
- route(src, dst, max_hops=3) - Find routes between databases
- config_dataset(name=None) - Get dataset configuration
- config_relation(src=None, dst=None) - Get relation configuration
- config_descriptions() - Get database descriptions
- config_statistics() - Get database statistics
- config_taxonomy() - Get taxonomy list

LabelConverter is the main class for converting biological labels to database IDs with automatic API detection.
Methods:
- convert(labels, dataset, label_types=None, tags=None, threshold=0.5, preferred_dictionary=None, taxonomy=None, format='json') - Convert labels to IDs (auto-selects API based on dataset config)
- convert_pubdictionaries(labels, dictionaries, tags=None, threshold=0.5, preferred_dictionary=None) - Convert using PubDictionaries API
- convert_sparqlist(labels, sparqlist, label_types, taxonomy=None) - Convert using SPARQList API
Auto-detection Logic:
- If labels are gene symbols (non-numeric) → Uses SPARQList API for ncbigene
- If labels are numeric IDs or other formats → Uses PubDictionaries API
- ncbigene regex pattern is fetched from TogoID API dynamically
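
The rules above can be pictured with a small sketch for the ncbigene case. This is not the library's implementation: `choose_api` is a hypothetical helper, and the regex `^\d+$` is only a stand-in for the ncbigene ID pattern, which is fetched from the TogoID API at run time.

```python
import re

# Assumption: a stand-in for the ncbigene ID pattern. The real pattern is
# fetched from the TogoID API rather than hard-coded.
NCBIGENE_ID_PATTERN = re.compile(r"^\d+$")


def choose_api(label, id_pattern=NCBIGENE_ID_PATTERN):
    """Hypothetical helper: pick an API name for a single label."""
    text = label.strip()
    if text and not id_pattern.match(text):
        # Non-numeric text such as "BRCA1" is treated as a gene symbol.
        return "sparqlist"
    # Numeric IDs and anything else fall through to PubDictionaries.
    return "pubdictionaries"


for label in ["BRCA1", "672", "TP53"]:
    print(label, "->", choose_api(label))
```

In practice the `dataset` argument and its configuration decide which API is called, so treat this only as an illustration of the numeric versus non-numeric split.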

AnnotationsConverter is the main class for getting annotations and labels for IDs.
Methods:
- list_fields(dataset_name) - List available annotation fields
- execute_query(dataset_name, ids, fields, filters) - Execute GraphQL query to get annotations
- build_rows(dataset_label, fields, field_meta, records, filters, compact) - Build table rows from query results

# Convert IDs between databases
togoid convert --ids 1,9 --route ncbigene,ensembl_gene
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
# Label to ID conversion
togoid label2id --labels "BRCA1,TP53" --dataset ncbigene --taxonomy 9606
# Get annotations
togoid annotate --dataset ncbigene --ids 672,7157 --field label --field gene_synonym
togoid annotate --dataset ncbigene --list-fields
# Utilities
togoid count ncbigene ensembl_gene --ids 1,9
# Configuration
togoid config dataset ncbigene
togoid config descriptions

Add annotation columns to your conversion results:
# Add single annotation
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label
# Add multiple annotations
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --annotate ncbigene full_name \
  --annotate ensembl_transcript transcript_flag

Filter results by annotation values:
# Filter by single value
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --filter ensembl_transcript transcript_flag "MANE Select"
# Combine annotations and filtering
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --filter ensembl_transcript transcript_flag "MANE Select"

Get orthologs using round-trip conversion:
# Get mouse and rat orthologs for human genes
togoid get-ortholog \
  --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090,10116 \
  --format table
# Output as JSON
togoid get-ortholog \
  --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090 \
  --format json

Table output columns are [source_id, homologene_id, target_id, taxonomy_id].
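
Because every row has the same four columns, the table is easy to regroup after the fact. The sketch below groups target IDs by source ID and taxonomy ID. `group_orthologs` is a hypothetical helper, not part of the library, and the sample rows are copied from the Python ortholog example in this README.

```python
from collections import defaultdict

# Sample rows in the documented column order:
# [source_id, homologene_id, target_id, taxonomy_id]
rows = [
    ["1", "11167", "117586", "10090"],
    ["1", "11167", "140656", "10116"],
    ["9", "37329", "116632", "10116"],
    ["9", "37329", "17961", "10090"],
]


def group_orthologs(rows):
    """Map each source ID to {taxonomy_id: [target_id, ...]}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for source_id, _homologene_id, target_id, taxid in rows:
        grouped[source_id][taxid].append(target_id)
    # Convert nested defaultdicts to plain dicts for predictable output.
    return {src: dict(by_tax) for src, by_tax in grouped.items()}


orthologs = group_orthologs(rows)
print(orthologs["1"]["10090"])  # ['117586']
```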
# Read IDs from file
echo -e "1\n9\n672" > ids.txt
togoid convert --ids-file ids.txt --route ncbigene,ensembl_gene
# Save output to file
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --output result.json
# Different output formats
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format json
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format table
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format csv

Control what information is returned:
# Only target IDs (default)
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report target
# Source-target pairs
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report pair
# Full path including intermediate IDs
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript --report full

Note: When using routes with 3+ datasets or annotations, the library automatically uses report=full to include all intermediate IDs.
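
The rule in the note can be written down as a short sketch. `effective_report` is a hypothetical function used only to illustrate the behavior described here. It is not taken from the library's source, and the `"target"` default follows the "default" label in the examples above.

```python
def effective_report(route, annotate=None, requested="target"):
    """Hypothetical helper: return the report mode that would be used."""
    if len(route) >= 3 or annotate:
        # Multi-hop routes and annotated conversions need intermediate IDs.
        return "full"
    return requested


print(effective_report(["ncbigene", "ensembl_gene"]))
print(effective_report(["ncbigene", "ensembl_gene", "ensembl_transcript"], requested="pair"))
print(effective_report(["ncbigene", "ensembl_gene"], annotate=[("ncbigene", "label")]))
```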
Get a list of datasets that are reachable from a source dataset in one hop:
# CLI
togoid config list-targets ncbigene
# Python
converter = TogoIDConverter()
targets = converter.config_list_targets("ncbigene")
print(targets)  # ['ensembl_gene', 'hgnc', 'mgi', ...]

When datasets are not directly connected, the library automatically suggests alternative routes:
# If ncbigene → chembl_compound isn't directly connected
converter.convert(ids=["1"], route=["ncbigene", "chembl_compound"])
# Error message will suggest alternatives:
# RuntimeError: No direct connection between 'ncbigene' and 'chembl_compound'.
#
# Suggested routes (2 hops):
# - ncbigene → ensembl_gene → chembl_compound
# - ncbigene → uniprot → chembl_compound
#
# Suggested routes (3 hops):
# - ncbigene → ensembl_gene → pdb → chembl_compound

- TOGOID_API_ENDPOINT - TogoID API base URL (default: https://api.togoid.dbcls.jp)
- TOGOID_GRASP_ENDPOINT - GRASP GraphQL endpoint (default: https://dx.dbcls.jp/grasp-dev-togoid)
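
As a rough illustration of how such a setting is typically resolved, the sketch below checks an explicit argument first, then the environment variable, then the documented default. The precedence order and the `resolve_api_endpoint` helper are assumptions made for this example, not the library's confirmed behavior.

```python
import os

# Default taken from the list above.
DEFAULT_API_ENDPOINT = "https://api.togoid.dbcls.jp"


def resolve_api_endpoint(explicit=None, environ=os.environ):
    """Hypothetical helper: explicit argument, then environment, then default."""
    return explicit or environ.get("TOGOID_API_ENDPOINT") or DEFAULT_API_ENDPOINT


# Passing a plain dict as `environ` keeps the example independent of the shell.
print(resolve_api_endpoint(environ={}))
print(resolve_api_endpoint(environ={"TOGOID_API_ENDPOINT": "http://localhost:5000"}))
print(resolve_api_endpoint("http://localhost:8080", environ={}))
```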
# Python
converter = TogoIDConverter(api_base_url="http://localhost:5000")
# CLI
togoid --api-url http://localhost:5000 convert --ids 1,9 --route ncbigene,ensembl_gene

- Python 3.7+
- requests >= 2.20.0
- pandas >= 1.0.0 (optional, for DataFrame format)
This package includes comprehensive test scripts to verify all functionality:
# Test Python library examples
python3 test_readme_examples.py
# Test CLI examples
bash test_cli_examples.sh

MIT License
Developed by DBCLS (Database Center for Life Science)