Add genome data utilities for homology-based metabolic model reconstruction #620

Merged: edkerk merged 4 commits into develop3 from proccess-gene-data-model on Jun 12, 2026

@ae-tafur ae-tafur commented Jun 10, 2026

Collaborator

Main improvements in this PR:

  • Features:

    • downloadGenomeData — new function that retrieves GFF3 (.gff) annotation and protein FASTA (.faa) files for a given NCBI assembly accession (both GCF_ and GCA_ prefixes supported) using the NCBI E-utilities API; skips download if files are already present locally
    • getGeneData — new function that parses the GFF3 annotation file for a given assembly and produces a gene-mapping table with columns locus_tag, old_locus_tag, extra_id, GeneID, gene_name, and GenBank_protein; handles both eukaryote (gene→mRNA) and prokaryote (gene→CDS) annotation styles; calls downloadGenomeData automatically if no local files are provided
    • processProteinFastaFile — new function that renames each sequence header in a protein FASTA file using a user-specified column from the gene table (e.g. locus_tag, gene_name) matched via the GenBank_protein accession; streams the file line-by-line for memory efficiency on large proteomes; sequences with no match in the gene table retain their original header
    • All functions are updated to match the latest commit on the develop3 branch ("Reformat function help blocks to NumPy-style docstrings")
  • Motivation:
    In homology-based metabolic model reconstruction, the gene identifiers in the model must exactly match the sequence identifiers in the protein FASTA file used for the BLAST search; these three functions provide a self-contained pipeline to download, map, and reformat genome data so that both artefacts are consistent before reconstruction begins
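The header-renaming step described for processProteinFastaFile can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the toolbox's MATLAB implementation; the function name `rename_fasta_headers`, the accessions, and the locus tag are all hypothetical. It assumes the protein accession is the first whitespace-delimited token of each header, streams the input line by line, and leaves unmatched headers untouched.

```python
import io


def rename_fasta_headers(src, dst, protein_to_id):
    """Stream FASTA text from src to dst, renaming headers found in the mapping.

    Returns the number of headers that were renamed.
    """
    renamed = 0
    for line in src:
        if line.startswith(">"):
            # Assumption: the accession is the first token after '>'.
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            new_id = protein_to_id.get(accession)
            if new_id:
                # Only mapped, non-empty identifiers replace the header.
                line = ">" + new_id + "\n"
                renamed += 1
        dst.write(line)
    return renamed


# Made-up example data: one mapped and one unmapped sequence.
mapping = {"PROT_A.1": "LOCUS_0001"}
fasta_in = io.StringIO(
    ">PROT_A.1 example protein A\nMKT\nLLV\n"
    ">PROT_B.1 example protein B\nMSA\n"
)
fasta_out = io.StringIO()
n_renamed = rename_fasta_headers(fasta_in, fasta_out, mapping)
result = fasta_out.getvalue()
print(result)
```

Because only one line is held in memory at a time, the same loop works on file handles for large proteomes.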

Instructions on merging this PR:

  • This PR has develop as target branch, and will be resolved with a squash-merge.
  • This PR has main as target branch, and will be resolved as described here.

edkerk commented Jun 11, 2026

Member

Rework genome-data utilities: fix eukaryote protein mapping, add tests

Functional changes only to the genome-data functions from this PR; the
original formatting is preserved where behaviour is unchanged:

  • getGeneData now takes the protein accession from each CDS protein_id and
    resolves the owning gene through the Parent chain (CDS->mRNA->gene for
    eukaryotes, CDS->gene for prokaryotes). The previous eukaryote path used
    the mRNA transcript accession (NM_/XM_), which never matches the protein
    FASTA headers (NP_/XP_), so eukaryote renaming silently failed. Removes
    the fragile mRNA-scanning organism detection and the organism-specific
    Phytozome 'extra_id' column; %XX decoding no longer uses urldecode (which
    also mangles '+'); writing the .tsv is now opt-in.
  • downloadGenomeData uses the NCBI Datasets v2 API (a single request that
    returns both the GFF and protein FASTA) instead of the deprecated
    Assembly E-utilities, and uses correct error identifiers.
  • renameModelGenes: fix the broken warning (hardcoded count, mismatched
    format args, data used as a format string), drop the O(n^2) fallback
    tracking, and leave genes with no or empty mapping unchanged instead of
    silently substituting locus_tag.
  • Add genomeDataTests with offline GFF3/FASTA fixtures (eukaryote and
    prokaryote) plus a connectivity-guarded download test.

Verified: getGeneData protein accessions match the protein FASTA headers
100% for both S. cerevisiae and E. coli; all tests pass.
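The Parent-chain resolution described for getGeneData can be sketched as follows. This is a Python illustration under stated assumptions, not the MATLAB code from this PR: the function names and the GFF3 records are hypothetical, each feature is assumed to have a single Parent (GFF3 also allows comma-separated parents), and the gene identifier is assumed to be stored in `locus_tag`. Decoding uses `urllib.parse.unquote`, which resolves %XX escapes but leaves a literal '+' unchanged.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, decoding %XX escapes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # unquote (unlike unquote_plus) keeps a literal '+' as '+'.
            attrs[key] = unquote(value)
    return attrs


def map_proteins_to_genes(gff_lines):
    """Return {protein_id: locus_tag} by walking each CDS up its Parent chain."""
    features = {}     # feature ID -> (feature type, attributes)
    cds_records = []  # attributes of every CDS that carries a protein_id
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            features[attrs["ID"]] = (cols[2], attrs)
        if cols[2] == "CDS" and "protein_id" in attrs:
            cds_records.append(attrs)

    mapping = {}
    for attrs in cds_records:
        parent = attrs.get("Parent")
        seen = set()
        # Climb CDS -> (mRNA ->) gene; 'seen' guards against cyclic links.
        while parent in features and parent not in seen:
            seen.add(parent)
            ftype, parent_attrs = features[parent]
            if ftype == "gene":
                mapping[attrs["protein_id"]] = parent_attrs.get("locus_tag", "")
                break
            parent = parent_attrs.get("Parent")
    return mapping


# Made-up records: one eukaryote-style and one prokaryote-style gene.
gff = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene-A;locus_tag=EUK_0001\n",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=rna-A;Parent=gene-A\n",
    "chr1\tsrc\tCDS\t1\t900\t.\t+\t0\tID=cds-A;Parent=rna-A;protein_id=XP_TEST.1\n",
    "chr2\tsrc\tgene\t1\t300\t.\t-\t.\tID=gene-B;locus_tag=PRO_0001\n",
    "chr2\tsrc\tCDS\t1\t300\t.\t-\t0\tID=cds-B;Parent=gene-B;protein_id=WP_TEST.1\n",
]
protein_to_locus = map_proteins_to_genes(gff)
print(protein_to_locus)
```

Taking the accession from the CDS `protein_id` rather than from the mRNA record means the keys are protein accessions, which is what the protein FASTA headers contain. The same loop handles both annotation styles, so no organism detection is needed.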

github-actions Bot commented Jun 11, 2026


Function test results

196 tests: 176 ✅ passed, 20 💤 skipped, 0 ❌ failed
22 suites, 1 file, 37s ⏱️

Results for commit 295d964.

♻️ This comment has been updated with latest results.

Following the develop3 merge, reconcile the genome-data functions with the
toolbox conventions that are now available:

- downloadGenomeData, getGeneData and processProteinFastaFile take their
  optional arguments through parseRAVENargs, so they can be called both
  positionally and by name (matching the rest of the toolbox); the optional
  arguments are documented under a Name-Value Arguments section.
- processProteinFastaFile reads and writes the FASTA with the shared
  readFasta/writeFasta helpers instead of a hand-rolled parser.
- The tests move to the function_tests suite as a RavenTestCase class
  (tGenomeData), replacing the unit_tests/genomeDataTests.m version; the
  offline GFF3/FASTA fixtures are reused from unit_tests/test_data.

edkerk commented Jun 11, 2026

Member

A final set of changes, to align this PR with recent changes in the develop3 branch, for instance in relation to the format of function_tests, and how optional variables are parsed.

@edkerk edkerk left a comment

Member

Looks good and very useful; we should include it in https://raven-docs.readthedocs.io/en/latest/protocol/.

@edkerk edkerk merged commit bb0fbeb into develop3 Jun 12, 2026
2 checks passed
@edkerk edkerk deleted the proccess-gene-data-model branch June 12, 2026 19:30
2 participants