feat(annoont): AnnotationOntology.from_prd_input + fix msbuilder get_feature lookup #26
Open
jplfaria wants to merge 1 commit into
Two related changes that unblock the modelseed-api bulk-reconstruction
endpoint (Phase 3 per Chris Henry's PRD).
1. New factory: AnnotationOntology.from_prd_input(genome_id, annotations,
data_dir, translator, ...). Mirrors from_kbase_data but for inputs
whose ontology terms have NOT been pre-translated to ModelSEED
reaction IDs. The translator callable is injected so the factory
stays decoupled from any specific translation backend
(KBUtilLib.KBAnnotationUtils.translate_term_to_modelseed is the
canonical impl; tests use a small in-memory fake).
Input shape: {gene_id: {ontology_type: [{term, score}, ...]}}.
Synthesizes one AnnotationOntologyEvent per ontology type, keeps
priority-list logic untouched. Unmapped terms (translator returns
[]) are retained with an empty msrxns set - per PRD, unmapped genes
must never silently disappear.
2. Fix latent bug at msbuilder.py:789 in build_from_annotaton_ontology.
The line called anno_ont.get_feature(gene.id) but AnnotationOntology
has no such method (features are keyed in genes or cdss dicts). Any
call path that reached this line would raise AttributeError as soon as it
tried to attach evidence to a built reaction. Replaced with the
correct accessor: anno_ont.genes.get(gene.id) or anno_ont.cdss.get(gene.id).
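
The input shape and the injected-translator pattern from item 1 can be sketched roughly as follows. This is not the actual `from_prd_input` implementation: `translate_all`, `fake_translator`, the mapping table, and the default score of 1.0 are all hypothetical names and values chosen for illustration. The sketch only shows that unmapped terms are kept with an empty reaction set rather than dropped.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical in-memory translator table (term -> ModelSEED reaction IDs).
# The entries are made up for illustration only.
FAKE_MAP = {"EC:1.1.1.1": ["rxn00001"], "K00001": ["rxn00001", "rxn00002"]}


def fake_translator(term: str) -> List[str]:
    """Return reaction IDs for a term; [] means the term is unmapped."""
    return FAKE_MAP.get(term, [])


def translate_all(
    annotations: Dict[str, Dict[str, List[dict]]],
    translator: Callable[[str], List[str]],
    default_score: float = 1.0,  # assumed default, not taken from the PR
) -> Dict[str, Dict[str, List[Tuple[str, float, Set[str]]]]]:
    """Walk {gene_id: {ontology_type: [{term, score}, ...]}} and translate."""
    out = {}
    for gene_id, by_type in annotations.items():
        out[gene_id] = {}
        for ont_type, terms in by_type.items():
            rows = []
            for entry in terms:
                # Unmapped terms yield an empty set but the row is still kept.
                msrxns = set(translator(entry["term"]))
                score = entry.get("score", default_score)
                rows.append((entry["term"], score, msrxns))
            out[gene_id][ont_type] = rows
    return out


annotations = {
    "gene_1": {"EC": [{"term": "EC:1.1.1.1", "score": 0.9}]},
    "gene_2": {"KO": [{"term": "K99999"}]},  # unmapped term, no score given
}
result = translate_all(annotations, fake_translator)
```

Because the translator is just a callable, a test can pass `fake_translator` while production code passes the real backend function, with no change to the walking logic.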
9 unit tests cover both changes:
- 7 cases for from_prd_input (happy path, multi-gene/multi-ontology,
score recording, default score, unmapped retention, namespaced-term
passing, empty input)
- 2 regression tests for the msbuilder fix (one positive, one negative
lock to flag if get_feature is ever added back without reconciling
the msbuilder line)
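
A minimal sketch of what the two regression checks might look like, using a hypothetical stand-in class (`FakeAnnotationOntology`) rather than the real `AnnotationOntology`. Only the accessor expression itself is taken from the commit message; the helper name `lookup_feature` and the feature values are assumptions.

```python
class FakeAnnotationOntology:
    """Hypothetical stand-in: features are keyed in `genes` or `cdss` dicts."""

    def __init__(self, genes=None, cdss=None):
        self.genes = genes or {}
        self.cdss = cdss or {}


def lookup_feature(anno_ont, gene_id):
    # Same expression as the replacement accessor described above.
    return anno_ont.genes.get(gene_id) or anno_ont.cdss.get(gene_id)


anno_ont = FakeAnnotationOntology(
    genes={"g1": "gene-feature"},
    cdss={"c1": "cds-feature"},
)

# Positive check: the accessor finds features in either dict.
found_gene = lookup_feature(anno_ont, "g1")
found_cds = lookup_feature(anno_ont, "c1")
missing = lookup_feature(anno_ont, "nope")

# Negative lock: fails loudly if a get_feature method ever reappears,
# prompting whoever adds it to reconcile the msbuilder call site.
has_get_feature = hasattr(anno_ont, "get_feature")
```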
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jplfaria added a commit to ModelSEED/modelseed-api that referenced this pull request on Jun 10, 2026
Builds N metabolic models in one job from probabilistic-annotation
inputs. Per-genome COBRApy JSON models plus combined reactions.csv +
genes.csv land in the user's workspace. Implements Chris Henry's
Phase 3 PRD.
Request shape: list of genomes, each carrying
{gene_id: {ontology_type: [{term, score}]}}. Hard cap 100 per call
(Pydantic). gapfill OFF by default, FVA ON by default; both flippable
per call.
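
The request constraints above (hard cap of 100 genomes, gapfill off and FVA on by default) could be expressed along these lines. The real schema is a Pydantic model; this sketch uses a stdlib dataclass as a stand-in, and the class names `GenomeInput` and `BulkRequest` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_GENOMES = 100  # hard cap per call, as stated in the commit message


@dataclass
class GenomeInput:
    genome_id: str
    # {gene_id: {ontology_type: [{term, score}]}}
    annotations: Dict[str, Dict[str, List[dict]]]


@dataclass
class BulkRequest:
    genomes: List[GenomeInput]
    gapfill: bool = False  # OFF by default
    fva: bool = True       # ON by default

    def __post_init__(self):
        # Stand-in for the length constraint the Pydantic schema enforces.
        if len(self.genomes) > MAX_GENOMES:
            raise ValueError(
                f"at most {MAX_GENOMES} genomes per call, got {len(self.genomes)}"
            )


request = BulkRequest(genomes=[GenomeInput("genome_a", {})])
```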
Pipeline per genome (sequential, in-process):
1. AnnotationOntology.from_prd_input(...) - the new upstream factory
in cshenry/ModelSEEDpy#26. Translator callable injected from
KBUtilLib.KBAnnotationUtils so the factory stays decoupled.
2. MSReconstructionUtils.compute_ontology_model_changes(...) to
compute reactions_to_add for the build helper.
3. MSReconstructionUtils.build_metabolic_model(..., reactions_to_add)
(Chris confirmed: NOT kb_build_metabolic_models, which is KBase-
wrapper code we don't want.)
4. Optional MSReconstructionUtils.gapfill_metabolic_model(...) when
gapfill=true.
5. Optional FVA (rich + minimal media) when fva=true. Uses
bulk_export.compute_fva_classes.
6. Per-genome cobra JSON written to workspace at
/<user>/modelseed/bulk_<job_id>/model_<genome_id>.json
7. Rows accumulated for the combined CSVs (one workspace write per
CSV at the end, not per-genome appends).
Per-genome try/except so one bad genome surfaces as
{status: "failed", error: "..."} in result.per_genome without
aborting the rest of the batch.
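
The per-genome isolation described above can be sketched as below. `run_batch` and `build_one` are hypothetical names, and the `"ok"` status string for successful genomes is an assumption; only the `{status: "failed", error: "..."}` shape comes from the commit message.

```python
def run_batch(genomes, build_one):
    """Process genomes sequentially; one failure does not abort the batch."""
    per_genome = {}
    for genome_id, annotations in genomes.items():
        try:
            model = build_one(genome_id, annotations)
            # "ok" is an assumed success marker, not taken from the PR.
            per_genome[genome_id] = {"status": "ok", "model": model}
        except Exception as exc:  # deliberately broad: isolate any failure
            per_genome[genome_id] = {"status": "failed", "error": str(exc)}
    return per_genome


def build_one(genome_id, annotations):
    # Toy builder that rejects empty annotation sets.
    if not annotations:
        raise ValueError(f"no annotations for {genome_id}")
    return f"model_{genome_id}"


batch_result = run_batch(
    {"good": {"gene_1": {}}, "bad": {}, "also_good": {"gene_2": {}}},
    build_one,
)
```

Catching the broad `Exception` here is a deliberate trade-off: the batch keeps going, and the error text is surfaced per genome instead of being swallowed.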
CSV column specs mirror KBDatalakeApps' canonical genome_reaction +
genome_gene_reaction_essentially_test tables. genes.csv carries an
extra `disposition` column (`mapped`|`unmapped`) per the PRD requirement
that unmapped genes never silently disappear.
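
A rough sketch of how a genes.csv with the `disposition` column might be built from accumulated rows and serialized once at the end. Apart from `disposition` and its two values, the column names (`genome_id`, `gene_id`, `reactions`), the `;` separator, and the function name `build_genes_csv` are assumptions, not the canonical KBDatalakeApps column spec.

```python
import csv
import io

# Assumed column order for illustration; only `disposition` is from the PR.
GENE_COLUMNS = ["genome_id", "gene_id", "reactions", "disposition"]


def build_genes_csv(rows):
    """Serialize accumulated gene rows to one CSV string in a single pass."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=GENE_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        reactions = row["reactions"]
        writer.writerow(
            {
                "genome_id": row["genome_id"],
                "gene_id": row["gene_id"],
                "reactions": ";".join(reactions),
                # Genes with no reactions stay in the file, flagged unmapped.
                "disposition": "mapped" if reactions else "unmapped",
            }
        )
    return buf.getvalue()


rows = [
    {"genome_id": "g1", "gene_id": "gene_1", "reactions": ["rxn00001"]},
    {"genome_id": "g1", "gene_id": "gene_2", "reactions": []},
]
genes_csv = build_genes_csv(rows)
```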
Workflow body lives in tasks._run_bulk_reconstruct so the Celery task
(thin wrapper that adds progress callbacks) and the subprocess
job-script entry point share one implementation and can't drift.
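
The "one shared body, two thin entry points" arrangement can be illustrated as follows. All function names here are hypothetical stand-ins (the real body is `tasks._run_bulk_reconstruct`), and no Celery API is used; the progress callback is modeled as a plain callable.

```python
def run_bulk_reconstruct(payload, progress=None):
    """Shared workflow body; both entry points delegate here."""
    processed = []
    total = len(payload["genomes"])
    for index, genome_id in enumerate(payload["genomes"], start=1):
        processed.append(genome_id)  # placeholder for the per-genome pipeline
        if progress is not None:
            progress(index, total)
    return {"processed": processed}


def task_entry(payload, progress_log):
    # Thin wrapper: the only thing it adds is a progress callback.
    return run_bulk_reconstruct(
        payload, progress=lambda i, n: progress_log.append((i, n))
    )


def script_entry(payload):
    # Subprocess job-script path: same body, no progress reporting.
    return run_bulk_reconstruct(payload)


payload = {"genomes": ["genome_a", "genome_b"]}
progress_log = []
task_result = task_entry(payload, progress_log)
script_result = script_entry(payload)
```

Since neither wrapper contains workflow logic of its own, a change to the pipeline lands in both execution paths at once.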
Schema, route, dispatcher, task, subprocess script, CSV builders +
FVA helpers, 16 unit tests for the CSV layer (full row-shape +
column-order + aggregation coverage), and a user-facing
docs/BULK_RECONSTRUCT.md.
Live deploy is gated on cshenry/ModelSEEDpy#26 being merged + poplar
rebuilt; locally everything imports against the branched fork.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes that unblock the modelseed-api bulk-reconstruction endpoint (Phase 3 per the PRD).
Changes
1. New factory `AnnotationOntology.from_prd_input`
Mirrors `from_kbase_data` but for inputs whose ontology terms have NOT been pre-translated to ModelSEED reaction IDs.
The `translator` callable is injected so the factory stays decoupled from any specific translation backend. The canonical impl is `KBUtilLib.KBAnnotationUtils.translate_term_to_modelseed`; tests use a small in-memory fake (no data files needed).
Synthesizes one `AnnotationOntologyEvent` per ontology type, keeps priority-list logic untouched. Unmapped terms (translator returns `[]`) are retained with an empty `msrxns` set per the PRD requirement that unmapped genes never silently disappear.
2. Fix latent bug at `msbuilder.py:789` in `build_from_annotaton_ontology`
The line called `anno_ont.get_feature(gene.id)` but `AnnotationOntology` has no such method (features are keyed in `genes` or `cdss` dicts). Any call path that reached this line would raise `AttributeError` the moment it tried to attach evidence to a built reaction. Replaced with the correct accessor: `anno_ont.genes.get(gene.id) or anno_ont.cdss.get(gene.id)`.
Tests
9 unit tests in `tests/core/test_annotationontology.py`:
- `from_prd_input`: happy path, multi-gene/multi-ontology, score recording, default score, unmapped retention, namespaced-term passing, empty input
- msbuilder fix: one positive test and one negative lock (`AnnotationOntology` still has no `get_feature` method, so if it ever gets added back the msbuilder accessor must be reconciled)
All 9 pass locally.
Context
This branch is the upstream half of Phase 3 (bulk reconstruction endpoint) in ModelSEED/modelseed-api. The endpoint's per-genome loop calls `MSBuilder.build_from_annotaton_ontology` directly, so the `msbuilder.py:789` bug is a hard prereq for that work. `from_prd_input` is the entry point into `AnnotationOntology` for the PRD's `{gene_id: {ontology_type: [{term, score}]}}` input shape, replacing the KBase-shaped `from_kbase_data` path on this code path.
🤖 Generated with Claude Code