Add genome data utilities for homology-based metabolic model reconstruction by ae-tafur · Pull Request #620 · SysBioChalmers/RAVEN

ae-tafur · 2026-06-10T21:04:29Z

Main improvements in this PR:

Features:
- downloadGenomeData — new function that retrieves GFF3 (.gff) annotation and protein FASTA (.faa) files for a given NCBI assembly accession (both GCF_ and GCA_ prefixes supported) using the NCBI E-utilities API; skips download if files are already present locally
- getGeneData — new function that parses the GFF3 annotation file for a given assembly and produces a gene-mapping table with columns locus_tag, old_locus_tag, extra_id, GeneID, gene_name, and GenBank_protein; handles both eukaryote (gene→mRNA) and prokaryote (gene→CDS) annotation styles; calls downloadGenomeData automatically if no local files are provided
- processProteinFastaFile — new function that renames each sequence header in a protein FASTA file using a user-specified column from the gene table (e.g. locus_tag, gene_name) matched via the GenBank_protein accession; streams the file line-by-line for memory efficiency on large proteomes; sequences with no match in the gene table retain their original header
- All functions follow the conventions of the latest commit on the develop3 branch (Reformat function help blocks to NumPy-style docstrings)

Motivation:
In homology-based metabolic model reconstruction, the gene identifiers in the model must exactly match the sequence identifiers in the protein FASTA file used for the BLAST search. These three functions provide a self-contained pipeline to download, map, and reformat genome data so that both artefacts are consistent before reconstruction begins.
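
To make the header-renaming step concrete, here is a minimal Python sketch of the idea behind processProteinFastaFile. It is not the RAVEN (MATLAB) implementation: the function name `rename_fasta_headers`, the accessions, and the identifiers are hypothetical, and the mapping is assumed to be a plain dictionary from protein accession to the chosen gene-table column.

```python
from typing import Dict, Iterable, Iterator


def rename_fasta_headers(lines: Iterable[str],
                         protein_to_id: Dict[str, str]) -> Iterator[str]:
    """Yield FASTA lines one at a time, renaming headers that have a mapping."""
    for line in lines:
        if line.startswith(">"):
            # The accession is assumed to be the first whitespace-delimited token.
            fields = line[1:].split(None, 1)
            accession = fields[0] if fields else ""
            new_id = protein_to_id.get(accession, "")
            # Headers with no (or an empty) match keep their original text.
            yield f">{new_id}\n" if new_id else line
        else:
            yield line


# Hypothetical example data, not taken from a real assembly.
example = [
    ">NP_000001.1 hypothetical protein [Example organism]\n",
    "MSEQ\n",
    ">XX_000000.1 unmatched protein\n",
    "MKV\n",
]
mapping = {"NP_000001.1": "GENE_A"}
renamed = list(rename_fasta_headers(example, mapping))
print("".join(renamed), end="")
```

Because the function is a generator over lines, it can be fed an open file handle directly, which keeps memory use flat for large proteomes.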

Instructions on merging this PR:

This PR has develop as target branch, and will be resolved with a squash-merge.
This PR has main as target branch, and will be resolved as described here.

edkerk · 2026-06-11T20:43:41Z

Rework genome-data utilities: fix eukaryote protein mapping, add tests

Functional changes only to the genome-data functions from this PR; the
original formatting is preserved where behaviour is unchanged:

- getGeneData now takes the protein accession from each CDS protein_id
  and resolves the owning gene through the Parent chain (CDS->mRNA->gene
  for eukaryotes, CDS->gene for prokaryotes). The previous eukaryote path
  used the mRNA transcript accession (NM_/XM_), which never matches the
  protein FASTA headers (NP_/XP_), so eukaryote renaming silently failed.
  Removes the fragile mRNA-scanning organism detection and the
  organism-specific Phytozome 'extra_id' column; %XX decoding no longer
  uses urldecode (which also mangles '+'); writing the .tsv is now opt-in.
- downloadGenomeData uses the NCBI Datasets v2 API (a single request that
  returns both the GFF and protein FASTA) instead of the deprecated
  Assembly E-utilities, and uses correct error identifiers.
- renameModelGenes: fix the broken warning (hardcoded count, mismatched
  format args, data used as a format string), drop the O(n^2) fallback
  tracking, and leave genes with no or empty mapping unchanged instead of
  silently substituting locus_tag.
- Add genomeDataTests with offline GFF3/FASTA fixtures (eukaryote and
  prokaryote) plus a connectivity-guarded download test.
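
The Parent-chain resolution described for getGeneData can be sketched as follows. This is an illustrative Python example and not the MATLAB code from the PR: the helper names, feature IDs, and accessions are made up, only the first Parent of a feature is followed, and `urllib.parse.unquote` stands in for a %XX decoder that leaves '+' untouched.

```python
from typing import Dict, Iterable
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column into a dict, decoding %XX escapes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # unquote decodes %XX but, unlike unquote_plus, keeps '+' as '+'.
            attrs[key.strip()] = unquote(value)
    return attrs


def protein_to_locus_tag(gff_lines: Iterable[str]) -> Dict[str, str]:
    """Map each CDS protein_id to the locus_tag of its owning gene."""
    parent_of = {}     # feature ID -> ID of its (first) parent
    locus_tag_of = {}  # gene ID -> locus_tag
    cds_parent = {}    # protein_id -> parent ID of the first CDS seen
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        ftype, attrs = cols[2], parse_attributes(cols[8])
        fid = attrs.get("ID")
        parent = attrs.get("Parent", "").split(",")[0]  # simplification
        if fid and parent:
            parent_of[fid] = parent
        if ftype == "gene" and fid:
            locus_tag_of[fid] = attrs.get("locus_tag", "")
        elif ftype == "CDS" and parent and "protein_id" in attrs:
            cds_parent.setdefault(attrs["protein_id"], parent)

    result = {}
    for protein_id, node in cds_parent.items():
        seen = set()
        # Walk upwards (CDS -> mRNA -> gene, or CDS -> gene) until a gene is hit.
        while node not in locus_tag_of and node in parent_of and node not in seen:
            seen.add(node)
            node = parent_of[node]
        if node in locus_tag_of:
            result[protein_id] = locus_tag_of[node]
    return result


# Hypothetical records: one eukaryote-style gene and one prokaryote-style gene.
gff = [
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-A;locus_tag=TAG_A",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-1;Parent=gene-A;transcript_id=NM_000001.1",
    "chr1\tRefSeq\tCDS\t1\t900\t.\t+\t0\tID=cds-1;Parent=rna-1;protein_id=NP_000001.1",
    "chr1\tRefSeq\tgene\t1000\t1900\t.\t-\t.\tID=gene-B;locus_tag=TAG_B",
    "chr1\tRefSeq\tCDS\t1000\t1900\t.\t-\t0\tID=cds-2;Parent=gene-B;protein_id=WP_000001.1",
]
mapping = protein_to_locus_tag(gff)
print(mapping)
```

Keying the table on the CDS protein_id rather than the mRNA transcript_id is what keeps the accessions comparable with the protein FASTA headers in both annotation styles.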

Verified: getGeneData protein accessions match the protein FASTA headers
100% for both S. cerevisiae and E. coli; all tests pass.
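
The renameModelGenes behaviour listed above (unmapped genes stay as they are, and one correctly formatted warning reports them) can be illustrated with a short Python sketch. The function `rename_genes` and its arguments are hypothetical and do not mirror the actual MATLAB signature.

```python
import warnings
from typing import Dict, List, Tuple


def rename_genes(genes: List[str],
                 mapping: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Rename genes via mapping; genes with no or empty mapping stay unchanged."""
    renamed, unmapped = [], []
    for gene in genes:
        new_id = mapping.get(gene, "")
        if new_id:
            renamed.append(new_id)
        else:
            # No silent fallback to another identifier: keep the original.
            renamed.append(gene)
            unmapped.append(gene)
    if unmapped:
        # Counts are computed, and gene names are passed as arguments
        # rather than being used as the format string itself.
        warnings.warn("%d of %d genes had no mapping and were left unchanged: %s"
                      % (len(unmapped), len(genes), ", ".join(unmapped)))
    return renamed, unmapped


# Hypothetical identifiers; the warning is captured so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    renamed, unmapped = rename_genes(["g1", "g2", "g3"],
                                     {"g1": "TAG_A", "g2": ""})
print(renamed)
print(caught[0].message)
```

Collecting the unmapped genes in a single pass avoids any repeated searching through the gene list while still giving the caller the full list to inspect.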

github-actions · 2026-06-11T20:48:33Z

Function test results

196 tests: 176 ✅ passed, 20 💤 skipped, 0 ❌ failed
22 suites, 1 file, 37s ⏱️

Results for commit 295d964.

♻️ This comment has been updated with latest results.

Following the develop3 merge, reconcile the genome-data functions with the toolbox conventions that are now available:
- downloadGenomeData, getGeneData and processProteinFastaFile take their optional arguments through parseRAVENargs, so they can be called both positionally and by name (matching the rest of the toolbox); the optional arguments are documented under a Name-Value Arguments section.
- processProteinFastaFile reads and writes the FASTA with the shared readFasta/writeFasta helpers instead of a hand-rolled parser.
- The tests move to the function_tests suite as a RavenTestCase class (tGenomeData), replacing the unit_tests/genomeDataTests.m version; the offline GFF3/FASTA fixtures are reused from unit_tests/test_data.

edkerk · 2026-06-11T21:58:40Z

A final set of changes, to align this PR with recent changes in the develop3 branch, for instance in relation to the format of function_tests, and how optional variables are parsed.

edkerk

Looks good, very useful, we should include it in https://raven-docs.readthedocs.io/en/latest/protocol/.

feat: add functions to update/process gene data in a model for homology reconstruction

a35abb1

ae-tafur assigned edkerk Jun 10, 2026

ae-tafur requested a review from edkerk June 10, 2026 21:05

edkerk mentioned this pull request Jun 11, 2026

Fix tIO/getFullPathReturnsAbsolute on non-Windows CI runner #628

Merged

Rework genome-data utilities: fix eukaryote protein mapping, add tests

01759f6

Merge remote-tracking branch 'origin/develop3' into proccess-gene-data-model

4ac0338

edkerk approved these changes Jun 11, 2026

edkerk merged commit bb0fbeb into develop3 Jun 12, 2026
2 checks passed

edkerk deleted the proccess-gene-data-model branch June 12, 2026 19:30

edkerk merged 4 commits into develop3 from proccess-gene-data-model
