TogoID Python Library

Python library and CLI tool for biological database ID conversion and annotation using TogoID.

Features

  • ID Conversion: Convert IDs between biological databases
  • ID Conversion with Annotations: Add annotation columns during conversion
  • ID Conversion with Filtering: Filter conversion results by annotation values
  • Ortholog Retrieval: Get orthologs through round-trip conversion and taxonomy filtering
  • Label to ID: Convert biological labels (gene names, etc.) to database IDs with dataset-based API selection
  • Annotations: Get labels and annotations for database IDs
  • Multiple Formats: Support for JSON, CSV, TSV, dict, table, and pandas DataFrame
  • Dual Interface: Use as Python library or command-line tool
  • Comprehensive: Search databases, find routes, get configurations

Installation

Using uv (recommended - faster)

uv is a fast Python package installer and resolver (its developers report 10-100x speedups over pip).

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/togoid/togoid-lib-python.git
cd togoid-lib-python

# Create virtual environment and install
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
uv pip install -e .

# With pandas support
uv pip install -e ".[pandas]"

# With development tools
uv pip install -e ".[dev]"

💡 Tip: See QUICKSTART_UV.md for a detailed uv quick start guide.

Using pip (traditional)

# From source
git clone https://github.com/togoid/togoid-lib-python.git
cd togoid-lib-python
pip install -e .

# With pandas support
pip install -e ".[pandas]"

Quick Start

As a Python Library

from togoid import TogoIDConverter, AnnotationsConverter, LabelConverter

# ID Conversion
converter = TogoIDConverter()

# JSON format (default)
result = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"])

# Dict format
result_dict = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="dict")
# Output: {'ids': ['1', '9'], 'route': ['ncbigene', 'ensembl_gene'], 'results': {'1': ['ENSG00000121410'], '9': ['ENSG00000171428']}}

# Table format
result_table = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="table")
# Output: [["1", "ENSG00000121410"], ["9", "ENSG00000171428"]]

# DataFrame format (requires pandas)
result_df = converter.convert(ids=["1", "9"], route=["ncbigene", "ensembl_gene"], format="dataframe")

# ID Conversion with Annotations
result_with_annotations = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")]  # Add ncbigene label as annotation column
)

# ID Conversion with Filtering
result_filtered = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")],
    filter=[("ensembl_transcript", "transcript_flag", ["MANE Select"])]  # Only MANE Select transcripts
)

# Get Orthologs
orthologs = converter.get_ortholog(
    ids=["1", "9"],
    route=["ncbigene", "homologene"],
    target_taxids=["10090", "10116"]  # Mouse and Rat
)

# Label to ID Conversion
label_converter = LabelConverter()

# Convert labels with dataset specification
results = label_converter.convert(
    labels=["BRCA1", "TP53"],
    dataset="ncbigene",
    taxonomy="9606"  # Human
)

# Convert labels for other datasets
results = label_converter.convert(
    labels=["caffeine"],
    dataset="chebi",
    label_types=["togoid_chebi_label"]
)

# Get Annotations
annotator = AnnotationsConverter()
annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672", "7157"],
    fields=["label", "gene_synonym"],
    filters={}
)
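The dict format shown above maps each source ID to a list of target IDs. As a minimal sketch, assuming the result has exactly the shape shown in the "Dict format" comment (keys `ids`, `route`, `results`), the hypothetical helper `dict_to_rows` flattens it into `[source_id, target_id]` rows. The helper is illustrative only and is not part of the library.

```python
def dict_to_rows(result):
    """Flatten a dict-format result into [source_id, target_id] rows.

    Assumes the shape shown in the Quick Start comment; this helper is
    hypothetical and not part of togoid.
    """
    rows = []
    for source_id in result["ids"]:
        # A source ID without any mapping contributes no rows.
        for target_id in result["results"].get(source_id, []):
            rows.append([source_id, target_id])
    return rows


# Sample copied from the dict-format comment in the Quick Start.
sample = {
    "ids": ["1", "9"],
    "route": ["ncbigene", "ensembl_gene"],
    "results": {"1": ["ENSG00000121410"], "9": ["ENSG00000171428"]},
}
rows = dict_to_rows(sample)
print(rows)
```

Iterating over `ids` rather than over `results` keeps the rows in the order the IDs were submitted.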

As a Command-Line Tool

# Basic ID Conversion
togoid convert --ids 1,9 --route ncbigene,ensembl_gene

# Convert with different output formats
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format table

# ID Conversion with Annotations
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --annotate ncbigene full_name

# ID Conversion with Filtering
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --filter ensembl_transcript transcript_flag "MANE Select"

# Get Orthologs
togoid get-ortholog --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090,10116 \
  --format table

# Label to ID Conversion
togoid label2id --labels "BRCA1,TP53,EGFR" --dataset ncbigene --taxonomy 9606

# Get Annotations
togoid annotate --dataset ncbigene --ids 672,7157 \
  --field gene_synonym \
  --field full_name

# List available annotation fields
togoid annotate --dataset ncbigene --list-fields

# Configuration
togoid config dataset ncbigene
togoid config descriptions

# Count mappings
togoid count ncbigene ensembl_gene --ids 1,9

Breaking Changes

Version 0.2.0+

1. label_types parameter now requires list format

The label_types parameter in LabelConverter.convert() has been changed from string to list type.

# ❌ Old (will not work)
label_converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types="symbol,synonym"  # String format
)

# ✅ New (correct)
label_converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types=["symbol", "synonym"]  # List format
)
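Callers that still hold comma-separated strings can convert them before calling `LabelConverter.convert()`. The sketch below uses a hypothetical helper, `normalize_label_types`, which is not provided by the library; it only shows one way to migrate old-style values to the list format.

```python
def normalize_label_types(label_types):
    """Return label types as a list, accepting the old string form.

    Hypothetical migration helper, not part of togoid.
    """
    if label_types is None:
        # None means "use the dataset's configured label types".
        return None
    if isinstance(label_types, str):
        # Old format: "symbol,synonym" -> ["symbol", "synonym"]
        return [part.strip() for part in label_types.split(",") if part.strip()]
    # Already a list, tuple, or other iterable of strings.
    return list(label_types)


migrated = normalize_label_types("symbol, synonym")
print(migrated)
```

The returned list can then be passed as `label_types=migrated`.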

2. format="dict" no longer supported for routes with 3+ datasets

When using routes with 3 or more datasets, format="dict" is no longer supported. Use format="table" or format="dataframe" instead.

# ❌ Old (will raise error)
converter.convert(
    ids=["1"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="dict"
)

# ✅ New (correct)
converter.convert(
    ids=["1"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table"  # or "dataframe"
)
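Code that builds routes dynamically may not know the route length in advance. As a rough sketch, a hypothetical helper such as `choose_format` (not part of the library) could fall back to the table format whenever a route has three or more datasets:

```python
def choose_format(route, preferred="dict"):
    """Pick an output format that is valid for the given route length.

    Hypothetical helper: falls back to "table" when "dict" is requested
    for a route with 3 or more datasets, as described above.
    """
    if preferred == "dict" and len(route) >= 3:
        return "table"
    return preferred


two_hop = ["ncbigene", "ensembl_gene", "ensembl_transcript"]
print(choose_format(["ncbigene", "ensembl_gene"]))  # two datasets: dict is fine
print(choose_format(two_hop))                       # three datasets: falls back
```

The result can be passed straight through, for example `converter.convert(ids=ids, route=route, format=choose_format(route))`.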

3. annotator.execute_query filters parameter is now optional

The filters parameter in AnnotationsConverter.execute_query() is now optional and defaults to an empty dictionary.

# Both work now
annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672"],
    fields=["label"],
    filters={}  # Can be omitted
)

annotations = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672"],
    fields=["label"]  # No filters parameter needed
)

Usage Examples

ID Conversion

Different Output Formats

from togoid import TogoIDConverter

converter = TogoIDConverter()

# JSON (default) - raw API response
json_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"]
)

# Dict - Includes ids, route, and results mapping {source_id: [target_ids]}
dict_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="dict"
)

# Table - [[source_id, target_id], ...] 2D array
table_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="table"
)

# DataFrame - pandas DataFrame with source_id and target_id columns
df_result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene"],
    format="dataframe"
)

ID Conversion with Annotations

# Add annotation columns to conversion results
result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[
        ("ncbigene", "label"),           # Add gene label from ncbigene
        ("ncbigene", "full_name"),       # Add full gene name from ncbigene
        ("ensembl_gene", "label")        # Add gene label from ensembl_gene
    ]
)
# Result includes original conversion + 3 annotation columns

ID Conversion with Filtering

# Filter conversion results by annotation values
result = converter.convert(
    ids=["1", "9"],
    route=["ncbigene", "ensembl_gene", "ensembl_transcript"],
    format="table",
    annotate=[("ncbigene", "label")],
    filter=[
        ("ensembl_transcript", "transcript_flag", ["MANE Select"])
    ]
)
# Only returns transcripts with "MANE Select" flag
# 15 transcripts → 2 transcripts (filtered)

Get Orthologs

# Get orthologs through round-trip conversion and taxonomy filtering
# Process: ncbigene -> homologene -> ncbigene -> taxonomy -> filter by taxid
result = converter.get_ortholog(
    ids=["1", "9"],                      # Human genes
    route=["ncbigene", "homologene"],    # Via homologene
    target_taxids=["10090", "10116"]     # Mouse and Rat
)
# Returns: [
#   ['1', '11167', '117586', '10090'],   # source_id, homologene_id, mouse_gene_id, taxid
#   ['1', '11167', '140656', '10116'],   # same source via same homologene group
#   ['9', '37329', '116632', '10116'],
#   ['9', '37329', '17961', '10090']
# ]
# Rows are ordered as: [source_id, homologene_id, target_gene_id, taxonomy_id]
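The process described above (source gene, then homology group, then back to genes, then taxonomy filter) can be illustrated without any network access. The sketch below is not the library's implementation: `find_orthologs` and the three in-memory mappings are hypothetical stand-ins for the API lookups, populated with the IDs from the example output plus an assumed human taxid (9606) for the two source genes.

```python
def find_orthologs(source_ids, gene_to_group, group_to_genes, gene_to_taxid, target_taxids):
    """Round-trip gene -> group -> genes, keeping only the target taxa.

    Hypothetical sketch of the described process; each mapping stands in
    for one conversion step that the library performs via the TogoID API.
    """
    wanted = set(target_taxids)
    rows = []
    for source_id in source_ids:
        group_id = gene_to_group.get(source_id)
        if group_id is None:
            continue  # no homology group for this gene
        for gene_id in group_to_genes.get(group_id, []):
            taxid = gene_to_taxid.get(gene_id)
            if taxid in wanted:  # drops the source species and other taxa
                rows.append([source_id, group_id, gene_id, taxid])
    return rows


# Illustrative data only, mirroring the example output above.
gene_to_group = {"1": "11167", "9": "37329"}
group_to_genes = {
    "11167": ["1", "117586", "140656"],
    "37329": ["9", "116632", "17961"],
}
gene_to_taxid = {
    "1": "9606", "9": "9606",
    "117586": "10090", "140656": "10116",
    "116632": "10116", "17961": "10090",
}
ortholog_rows = find_orthologs(
    ["1", "9"], gene_to_group, group_to_genes, gene_to_taxid, ["10090", "10116"]
)
for row in ortholog_rows:
    print(row)
```

Because the source genes themselves belong to the homology group, the taxonomy filter is what removes them from the result.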

Search and Route

# Search databases by name
databases = converter.search_databases("uniprot")

# Find routes between databases
routes = converter.route(src="ncbigene", dst="uniprot", max_hops=3)

# Lookup which tables contain an ID
tables = converter.lookup_id("672")

Label to ID Conversion

from togoid import LabelConverter

converter = LabelConverter(verbose=True)

# Convert gene symbols (uses SPARQList API based on dataset config)
results = converter.convert(
    labels=["BRCA1", "TP53", "EGFR"],
    dataset="ncbigene",
    taxonomy="9606"  # Human
)
# Returns: [{"input": "BRCA1", "match_type": "symbol", "symbol": "BRCA1", "identifier": "672"}, ...]

# Convert chemical names (uses PubDictionaries API based on dataset config)
results = converter.convert(
    labels=["caffeine"],
    dataset="chebi",
    label_types=["togoid_chebi_label"]  # Optional: override dataset config (list format)
)

# Convert disease names
results = converter.convert(
    labels=["breast cancer"],
    dataset="mondo",
    threshold=0.5  # PubDictionaries matching threshold
)

# Label types are auto-configured from dataset, or can be manually specified (as list)
results = converter.convert(
    labels=["BRCA1"],
    dataset="ncbigene",
    label_types=["symbol"],  # Override: only search by symbol (list format)
    taxonomy="9606"
)

Annotations

from togoid import AnnotationsConverter

annotator = AnnotationsConverter()

# List available fields for a dataset
fields = annotator.list_fields("ncbigene")
for field_name, field_meta in fields:
    print(f"{field_name}: {field_meta['label']}")

# Get annotations for IDs
result = annotator.execute_query(
    dataset_name="ncbigene",
    ids=["672", "7157"],
    fields=["label", "gene_synonym", "type_of_gene"],
    filters={"type_of_gene": ["protein-coding"]}
)

for gene_id, annotations in result.items():  # avoid shadowing the built-in id()
    print(f"{gene_id}: {annotations}")
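To make the meaning of a `filters` dictionary concrete, the sketch below applies one to plain Python data. It rests on assumptions: the record shape (ID mapped to a dict of field values) and the rule "keep a record only if every filtered field has at least one allowed value" are inferred from the example above, not confirmed library behavior. `apply_filters` is a hypothetical helper, and the sample records are invented.

```python
def apply_filters(records, filters):
    """Keep records whose fields match every filter.

    Hypothetical illustration of filter semantics; the library applies
    filters itself, and its exact rules may differ.
    """
    kept = {}
    for record_id, annotations in records.items():
        matches_all = True
        for field, allowed in filters.items():
            value = annotations.get(field)
            # Treat scalar and list-valued fields uniformly.
            values = value if isinstance(value, list) else [value]
            if not any(v in allowed for v in values):
                matches_all = False
                break
        if matches_all:
            kept[record_id] = annotations
    return kept


# Invented records, for illustration only.
records = {
    "g1": {"label": "GENE_A", "type_of_gene": "protein-coding"},
    "g2": {"label": "GENE_B", "type_of_gene": "pseudo"},
}
filtered = apply_filters(records, {"type_of_gene": ["protein-coding"]})
print(filtered)
```

An empty `filters` dictionary keeps every record, which matches the idea of the parameter being optional.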

Command-Line Interface

Convert Command

# Basic conversion
togoid convert --ids 1,9 --route ncbigene,ensembl_gene

# With output format
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict

# Save to file
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format csv --output results.csv

# With additional parameters
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report pair --limit 100

Label2ID Command

# Basic conversion
togoid label2id --dataset ncbigene --labels "BRCA1,TP53,EGFR" --taxon 9606
togoid label2id --dataset chebi --labels 'caffeine' --label_types 'togoid_chebi_label'

# From file
echo -e "BRCA1\nTP53\nEGFR" > genes.txt
togoid label2id --dataset ncbigene --label-file genes.txt --taxon 9606

# CSV output
togoid label2id --dataset ncbigene --labels "BRCA1,TP53" --taxon 9606 --format csv --output results.csv

# With PubDictionaries (for non-gene labels)
togoid label2id --dataset mondo --labels "breast cancer" --label_types "togoid_mondo_label"

# Verbose mode
togoid label2id --dataset ncbigene --labels "BRCA1,TP53" --taxon 9606 --verbose

Annotate Command

# Get annotations
togoid annotate --dataset ncbigene --ids 672,7157 --field gene_synonym --field full_name

# List available fields
togoid annotate --dataset ncbigene --list-fields

# With filters
togoid annotate --dataset ncbigene --ids 672,7157 \
    --field type_of_gene --field gene_synonym \
    --filter type_of_gene=protein-coding

# CSV output
togoid annotate --dataset ncbigene --ids 672,7157 \
    --field gene_synonym --format csv --output genes.csv

# From file
togoid annotate --dataset ncbigene --ids-file gene_ids.txt --field gene_synonym

Other Commands

# Search databases
togoid search databases uniprot
togoid search id NM_001110

# Lookup ID
togoid lookup id 672

# Find routes
togoid route ncbigene uniprot --max-hops 3

# Count mappings
togoid count ncbigene ensembl_gene --ids 1,9

# Get configuration
togoid config dataset ncbigene
togoid config relation ncbigene-ensembl_gene
togoid config descriptions
togoid config statistics
togoid config taxonomy

API Documentation

TogoIDConverter

Main class for ID conversion operations.

Methods:

  • convert(route, ids, format='json', **kwargs) - Convert IDs between databases
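  • get_ortholog(ids, route, target_taxids) - Get orthologs through round-trip conversion and taxonomy filtering
  • count(src, dst, ids, link=None) - Count mappings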
  • search_databases(name) - Search databases by name
  • search_id(id_string) - Search databases by ID pattern
  • lookup_id(id_string) - Lookup which tables contain an ID
  • route(src, dst, max_hops=3) - Find routes between databases
  • config_dataset(name=None) - Get dataset configuration
  • config_relation(src=None, dst=None) - Get relation configuration
  • config_descriptions() - Get database descriptions
  • config_statistics() - Get database statistics
  • config_taxonomy() - Get taxonomy list

LabelConverter

Main class for converting biological labels to database IDs with automatic API detection.

Methods:

  • convert(labels, dataset, label_types=None, tags=None, threshold=0.5, preferred_dictionary=None, taxonomy=None, format='json') - Convert labels to IDs (auto-selects API based on dataset config)
  • convert_pubdictionaries(labels, dictionaries, tags=None, threshold=0.5, preferred_dictionary=None) - Convert using PubDictionaries API
  • convert_sparqlist(labels, sparqlist, label_types, taxonomy=None) - Convert using SPARQList API

Auto-detection Logic:

  • If labels are gene symbols (non-numeric) → Uses SPARQList API for ncbigene
  • If labels are numeric IDs or other formats → Uses PubDictionaries API
  • ncbigene regex pattern is fetched from TogoID API dynamically

AnnotationsConverter

Main class for getting annotations and labels for IDs.

Methods:

  • list_fields(dataset_name) - List available annotation fields
  • execute_query(dataset_name, ids, fields, filters) - Execute GraphQL query to get annotations (filters is optional as of 0.2.0 and defaults to an empty dictionary)
  • build_rows(dataset_label, fields, field_meta, records, filters, compact) - Build table rows from query results

CLI Command Reference

Basic Commands

# Convert IDs between databases
togoid convert --ids 1,9 --route ncbigene,ensembl_gene
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict

# Label to ID conversion
togoid label2id --labels "BRCA1,TP53" --dataset ncbigene --taxonomy 9606

# Get annotations
togoid annotate --dataset ncbigene --ids 672,7157 --field label --field gene_synonym
togoid annotate --dataset ncbigene --list-fields

# Utilities
togoid count ncbigene ensembl_gene --ids 1,9

# Configuration
togoid config dataset ncbigene
togoid config descriptions

Advanced Features

ID Conversion with Annotations

Add annotation columns to your conversion results:

# Add single annotation
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label

# Add multiple annotations
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --annotate ncbigene full_name \
  --annotate ensembl_transcript transcript_flag

ID Conversion with Filtering

Filter results by annotation values:

# Filter by single value
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --filter ensembl_transcript transcript_flag "MANE Select"

# Combine annotations and filtering
togoid convert --ids 1,9 \
  --route ncbigene,ensembl_gene,ensembl_transcript \
  --format table \
  --annotate ncbigene label \
  --filter ensembl_transcript transcript_flag "MANE Select"

Ortholog Retrieval

Get orthologs using round-trip conversion:

# Get mouse and rat orthologs for human genes
togoid get-ortholog \
  --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090,10116 \
  --format table

# Output as JSON
togoid get-ortholog \
  --ids 672,7157 \
  --route ncbigene,homologene \
  --target-taxids 10090 \
  --format json

Table output columns are [source_id, homologene_id, target_id, taxonomy_id].

Input/Output Options

# Read IDs from file
echo -e "1\n9\n672" > ids.txt
togoid convert --ids-file ids.txt --route ncbigene,ensembl_gene

# Save output to file
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --output result.json

# Different output formats
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format json
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format dict
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format table
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --format csv

Report Options

Control what information is returned:

# Only target IDs (default)
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report target

# Source-target pairs
togoid convert --ids 1,9 --route ncbigene,ensembl_gene --report pair

# Full path including intermediate IDs
togoid convert --ids 1,9 --route ncbigene,ensembl_gene,ensembl_transcript --report full

Note: When using routes with 3+ datasets or annotations, the library automatically uses report=full to include all intermediate IDs.
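The three report modes differ only in how much of each conversion path they keep. The sketch below derives pair-style and target-style output from full-path rows. The helpers `to_pairs` and `to_targets` are hypothetical, the IDs are placeholders, and the de-duplication shown is an assumption; the actual API produces these reports itself.

```python
def to_pairs(full_rows):
    """Reduce full paths to unique [source_id, target_id] pairs."""
    seen = set()
    pairs = []
    for row in full_rows:
        pair = (row[0], row[-1])  # first and last element of the path
        if pair not in seen:
            seen.add(pair)
            pairs.append(list(pair))
    return pairs


def to_targets(full_rows):
    """Reduce full paths to unique target IDs, preserving order."""
    return list(dict.fromkeys(row[-1] for row in full_rows))


# Placeholder full-path rows: [source, intermediate, target].
full_rows = [
    ["1", "GENE_A", "TX_A1"],
    ["1", "GENE_A", "TX_A2"],
    ["1", "GENE_C", "TX_A1"],  # same source and target via another path
    ["9", "GENE_B", "TX_B1"],
]
pairs = to_pairs(full_rows)
targets = to_targets(full_rows)
print(pairs)
print(targets)
```

Going the other way is not possible: once intermediate IDs are dropped they cannot be recovered, so `full` is the report to request when annotations on intermediate datasets are needed.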

Finding Reachable Datasets

Get a list of datasets that are reachable from a source dataset in one hop:

# CLI
togoid config list-targets ncbigene

# Python
converter = TogoIDConverter()
targets = converter.config_list_targets("ncbigene")
print(targets)  # ['ensembl_gene', 'hgnc', 'mgi', ...]

Route Suggestions

When datasets are not directly connected, the library automatically suggests alternative routes:

# If ncbigene → chembl_compound isn't directly connected
converter.convert(ids=["1"], route=["ncbigene", "chembl_compound"])

# Error message will suggest alternatives:
# RuntimeError: No direct connection between 'ncbigene' and 'chembl_compound'.
#
# Suggested routes (2 hops):
# - ncbigene → ensembl_gene → chembl_compound
# - ncbigene → uniprot → chembl_compound
#
# Suggested routes (3 hops):
# - ncbigene → ensembl_gene → pdb → chembl_compound
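Route suggestion of this kind amounts to a bounded path search over the graph of dataset relations. The following is a minimal breadth-first sketch, not the library's algorithm: `suggest_routes` is a hypothetical function, and the toy graph only mirrors the connections implied by the example message above rather than real TogoID connectivity.

```python
from collections import deque


def suggest_routes(graph, src, dst, max_hops=3):
    """Find all cycle-free routes from src to dst with at most max_hops hops.

    Breadth-first search, so shorter routes are returned first.
    Hypothetical sketch; not the library's implementation.
    """
    routes = []
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # extending this path would exceed the hop limit
        for neighbor in graph.get(path[-1], []):
            if neighbor in path:
                continue  # avoid revisiting a dataset
            new_path = path + [neighbor]
            if neighbor == dst:
                routes.append(new_path)
            else:
                queue.append(new_path)
    return routes


# Toy graph for illustration only.
graph = {
    "ncbigene": ["ensembl_gene", "uniprot"],
    "ensembl_gene": ["chembl_compound", "pdb"],
    "uniprot": ["chembl_compound"],
    "pdb": ["chembl_compound"],
}
routes = suggest_routes(graph, "ncbigene", "chembl_compound", max_hops=3)
for route in routes:
    print(" → ".join(route))
```

Lowering `max_hops` to 2 drops the route through `pdb`, which is how a hop limit keeps suggestions short.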

Configuration

Environment Variables

Custom API Endpoints

# Python
converter = TogoIDConverter(api_base_url="http://localhost:5000")

# CLI
togoid --api-url http://localhost:5000 convert --ids 1,9 --route ncbigene,ensembl_gene

Requirements

  • Python 3.7+
  • requests >= 2.20.0
  • pandas >= 1.0.0 (optional, for DataFrame format)

Testing

The repository includes test scripts that exercise the README examples for both the library and the CLI:

# Test Python library examples
python3 test_readme_examples.py

# Test CLI examples
bash test_cli_examples.sh

License

MIT License

Credits

Developed by DBCLS (Database Center for Life Science)