Add genome data utilities for homology-based metabolic model reconstruction #620
Rework genome-data utilities: fix eukaryote protein mapping, add tests
Functional changes only to the genome-data functions from this PR; the
Verified: getGeneData protein accessions match the protein FASTA headers
Function test results
196 tests, 176 ✅, 37s ⏱️
Results for commit 295d964.
♻️ This comment has been updated with latest results.
Following the develop3 merge, reconcile the genome-data functions with the toolbox conventions that are now available:
- downloadGenomeData, getGeneData and processProteinFastaFile take their optional arguments through parseRAVENargs, so they can be called both positionally and by name (matching the rest of the toolbox); the optional arguments are documented under a Name-Value Arguments section.
- processProteinFastaFile reads and writes the FASTA with the shared readFasta/writeFasta helpers instead of a hand-rolled parser.
- The tests move to the function_tests suite as a RavenTestCase class (tGenomeData), replacing the unit_tests/genomeDataTests.m version; the offline GFF3/FASTA fixtures are reused from unit_tests/test_data.
A final set of changes, to align this PR with recent changes in the
edkerk left a comment
Looks good and very useful; we should include it in https://raven-docs.readthedocs.io/en/latest/protocol/.
Main improvements in this PR:
Features:
- downloadGenomeData: new function that retrieves GFF3 (.gff) annotation and protein FASTA (.faa) files for a given NCBI assembly accession (both GCF_ and GCA_ prefixes supported) using the NCBI E-utilities API; skips download if files are already present locally
- getGeneData: new function that parses the GFF3 annotation file for a given assembly and produces a gene-mapping table with columns locus_tag, old_locus_tag, extra_id, GeneID, gene_name, and GenBank_protein; handles both eukaryote (gene→mRNA) and prokaryote (gene→CDS) annotation styles; calls downloadGenomeData automatically if no local files are provided
- processProteinFastaFile: new function that renames each sequence header in a protein FASTA file using a user-specified column from the gene table (e.g. locus_tag, gene_name) matched via the GenBank_protein accession; streams the file line-by-line for memory efficiency on large proteomes; sequences with no match in the gene table retain their original header
Motivation:
In homology-based metabolic model reconstruction, the gene identifiers in the model must exactly match the sequence identifiers in the protein FASTA file used for the BLAST search; these three functions provide a self-contained pipeline to download, map, and reformat genome data so that both artefacts are consistent before reconstruction begins
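The mapping-and-renaming idea described above can be illustrated with a minimal, language-neutral sketch. This is not the RAVEN implementation: the function names `parse_attributes`, `protein_to_locus_tag` and `rename_headers` are hypothetical, the accessions in the demo are made up, and the code assumes that parent features appear before their children in the GFF3 file and that the first whitespace-delimited token of each FASTA header is the protein accession.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def protein_to_locus_tag(gff_lines):
    """Map protein accession -> locus_tag by following Parent links.

    Covers both layouts mentioned above: gene -> CDS (prokaryote style)
    and gene -> mRNA -> CDS (eukaryote style).
    Assumption: a parent feature is listed before its children.
    """
    locus_by_id = {}  # feature ID -> locus_tag inherited from its gene
    mapping = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        ftype, attrs = cols[2], parse_attributes(cols[8])
        if ftype == "gene":
            locus_by_id[attrs.get("ID")] = attrs.get("locus_tag", "")
        elif ftype == "mRNA":
            locus_by_id[attrs.get("ID")] = locus_by_id.get(attrs.get("Parent"), "")
        elif ftype == "CDS" and "protein_id" in attrs:
            tag = locus_by_id.get(attrs.get("Parent"), "")
            if tag:
                mapping[attrs["protein_id"]] = tag
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines one at a time, replacing matched headers.

    Headers whose accession is not in the mapping are yielded unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            accession = parts[0] if parts else ""
            if accession in mapping:
                yield ">" + mapping[accession] + "\n"
                continue
        yield line


if __name__ == "__main__":
    # Made-up records, for illustration only.
    gff = [
        "##gff-version 3\n",
        "chr1\tRefSeq\tgene\t1\t90\t.\t+\t.\tID=gene-b0001;locus_tag=b0001\n",
        "chr1\tRefSeq\tCDS\t1\t90\t.\t+\t0\tID=cds-1;Parent=gene-b0001;protein_id=NP_1.1\n",
    ]
    fasta = [">NP_1.1 example protein\n", "MKR\n", ">ZZ_9.9 unmatched\n", "MTT\n"]
    for out_line in rename_headers(fasta, protein_to_locus_tag(gff)):
        print(out_line, end="")
```

Because `rename_headers` is a generator, it can be fed an open file handle directly and written out line by line, so a large proteome never has to be held in memory at once.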
Instructions on merging this PR:
- develop as target branch, and will be resolved with a squash-merge.
- main as target branch, and will be resolved as described here.