feat(annoont): AnnotationOntology.from_prd_input + fix msbuilder get_feature lookup #26

Open
jplfaria wants to merge 1 commit into cshenry:main from jplfaria:feat/annotation-ontology-from-prd-input

@jplfaria

Two related changes that unblock the modelseed-api bulk-reconstruction endpoint (Phase 3 per the PRD).

Changes

1. New factory AnnotationOntology.from_prd_input

Mirrors from_kbase_data but for inputs whose ontology terms have NOT been pre-translated to ModelSEED reaction IDs.

@staticmethod
def from_prd_input(
    genome_id,
    annotations,        # {gene_id: {ontology_type: [{term, score}, ...]}}
    data_dir,
    translator,         # Callable[[namespaced_term: str], list[msrxn_id: str]]
    method="PRD",
    method_version="1.0",
    timestamp=None,
)

The translator callable is injected so the factory stays decoupled from any specific translation backend. The canonical impl is KBUtilLib.KBAnnotationUtils.translate_term_to_modelseed; tests use a small in-memory fake (no data files needed).

The factory synthesizes one AnnotationOntologyEvent per ontology type and leaves the priority-list logic untouched. Unmapped terms (those for which the translator returns []) are retained with an empty msrxns set, per the PRD requirement that unmapped genes never silently disappear.
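
The translation step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the factory's actual body: `translate_annotations`, `fake_translator`, the `"<ontology_type>:<term>"` namespacing scheme, the default score of 1.0, and the `rxn_placeholder` ID are all hypothetical.

```python
from typing import Callable, Dict, List

# Input shape from the PR: {gene_id: {ontology_type: [{term, score}, ...]}}
Annotations = Dict[str, Dict[str, List[dict]]]


def translate_annotations(
    annotations: Annotations,
    translator: Callable[[str], List[str]],
    default_score: float = 1.0,  # assumed default; the PR does not state the value
) -> Dict[str, Dict[str, List[dict]]]:
    """Translate every term, keeping unmapped terms with an empty msrxns set."""
    result: Dict[str, Dict[str, List[dict]]] = {}
    for gene_id, by_type in annotations.items():
        result[gene_id] = {}
        for ontology_type, terms in by_type.items():
            records = []
            for entry in terms:
                # Assumed namespacing scheme: "<ontology_type>:<term>".
                namespaced = f"{ontology_type}:{entry['term']}"
                records.append(
                    {
                        "term": namespaced,
                        "score": entry.get("score", default_score),
                        # An empty list from the translator becomes an empty
                        # set, so the term is retained rather than dropped.
                        "msrxns": set(translator(namespaced)),
                    }
                )
            result[gene_id][ontology_type] = records
    return result


def fake_translator(namespaced_term: str) -> List[str]:
    """In-memory fake in the spirit of the tests; the mapping is made up."""
    table = {"EC:1.1.1.1": ["rxn_placeholder"]}
    return table.get(namespaced_term, [])


example = translate_annotations(
    {"gene_1": {"EC": [{"term": "1.1.1.1", "score": 0.9}, {"term": "9.9.9.9"}]}},
    fake_translator,
)
```

Injecting the translator as a plain callable is what lets a dictionary-backed fake like this stand in for the real backend without any data files.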

2. Fix latent bug at msbuilder.py:789 in build_from_annotaton_ontology

The line called anno_ont.get_feature(gene.id), but AnnotationOntology has no such method (features are keyed in the genes or cdss dicts). Any call path that reached this line would raise AttributeError the moment it tried to attach evidence to a built reaction. Replaced with the correct accessor:

annoont_gene = anno_ont.genes.get(gene.id) or anno_ont.cdss.get(gene.id)
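
The behaviour of that accessor can be checked in isolation with a stand-in object. `StubAnnotationOntology` and `lookup_feature` below are hypothetical names used only for this sketch; the only thing taken from the PR is the `genes.get(...) or cdss.get(...)` expression.

```python
class StubAnnotationOntology:
    """Hypothetical stand-in exposing only the genes and cdss dicts."""

    def __init__(self, genes=None, cdss=None):
        self.genes = genes or {}
        self.cdss = cdss or {}


def lookup_feature(anno_ont, feature_id):
    # Same expression as the fix: try genes first, then fall back to cdss.
    return anno_ont.genes.get(feature_id) or anno_ont.cdss.get(feature_id)


ont = StubAnnotationOntology(
    genes={"g1": "gene-feature"},
    cdss={"c1": "cds-feature"},
)

found_gene = lookup_feature(ont, "g1")      # resolved from genes
found_cds = lookup_feature(ont, "c1")       # resolved from cdss
not_found = lookup_feature(ont, "missing")  # None when neither dict has the ID
```

One property of the `or` fallback worth keeping in mind: a falsy value stored under `genes` would also fall through to `cdss`. That is harmless as long as feature objects are always truthy, which this sketch assumes.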

Tests

9 unit tests in tests/core/test_annotationontology.py:

  • 7 cases for from_prd_input: happy path, multi-gene/multi-ontology, score recording, default score, unmapped retention, namespaced-term passing, empty input
  • 2 regression tests for the msbuilder fix: one positive (the accessor works), one negative lock (asserts AnnotationOntology still has no get_feature method, so if it ever gets added back the msbuilder accessor must be reconciled)

All 9 pass locally.

Context

This branch is the upstream half of Phase 3 (bulk reconstruction endpoint) in ModelSEED/modelseed-api. The endpoint's per-genome loop calls MSBuilder.build_from_annotaton_ontology directly, so the msbuilder.py:789 bug is a hard prereq for that work. from_prd_input is the entry point into AnnotationOntology for the PRD's {gene_id: {ontology_type: [{term, score}]}} input shape, replacing the KBase-shaped from_kbase_data path on this code path.

🤖 Generated with Claude Code

…feature lookup

Two related changes that unblock the modelseed-api bulk-reconstruction
endpoint (Phase 3 per Chris Henry's PRD).

1. New factory: AnnotationOntology.from_prd_input(genome_id, annotations,
   data_dir, translator, ...). Mirrors from_kbase_data but for inputs
   whose ontology terms have NOT been pre-translated to ModelSEED
   reaction IDs. The translator callable is injected so the factory
   stays decoupled from any specific translation backend
   (KBUtilLib.KBAnnotationUtils.translate_term_to_modelseed is the
   canonical impl; tests use a small in-memory fake).

   Input shape: {gene_id: {ontology_type: [{term, score}, ...]}}.
   Synthesizes one AnnotationOntologyEvent per ontology type, keeps
   priority-list logic untouched. Unmapped terms (translator returns
   []) are retained with an empty msrxns set - per PRD, unmapped genes
   must never silently disappear.

2. Fix latent bug at msbuilder.py:789 in build_from_annotaton_ontology.
   The line called anno_ont.get_feature(gene.id) but AnnotationOntology
   has no such method (features are keyed in genes or cdss dicts). Any
   call path that reached this line would raise AttributeError as soon
   as it tried to attach evidence to a built reaction. Replaced with
   the correct accessor:
   anno_ont.genes.get(gene.id) or anno_ont.cdss.get(gene.id).

9 unit tests cover both changes:
- 7 cases for from_prd_input (happy path, multi-gene/multi-ontology,
  score recording, default score, unmapped retention, namespaced-term
  passing, empty input)
- 2 regression tests for the msbuilder fix (one positive, one negative
  lock to flag if get_feature is ever added back without reconciling
  the msbuilder line)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jplfaria added a commit to ModelSEED/modelseed-api that referenced this pull request Jun 10, 2026
Builds N metabolic models in one job from probabilistic-annotation
inputs. Per-genome COBRApy JSON models plus combined reactions.csv +
genes.csv land in the user's workspace. Implements Chris Henry's
Phase 3 PRD.

Request shape: list of genomes, each carrying
{gene_id: {ontology_type: [{term, score}]}}. Hard cap 100 per call
(Pydantic). gapfill OFF by default, FVA ON by default; both flippable
per call.
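
The request constraints above (cap of 100 genomes, gapfill off, FVA on) can be sketched in plain Python as below. This is a stdlib dataclass standing in for the Pydantic model the commit refers to, not the actual schema; `BulkRequest`, `MAX_GENOMES`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_GENOMES = 100  # hard cap per call, as stated in the commit message


@dataclass
class BulkRequest:
    """Plain-dataclass stand-in for the request schema (field names assumed)."""

    genomes: List[dict]
    gapfill: bool = False  # OFF by default
    fva: bool = True       # ON by default

    def __post_init__(self):
        # Reject oversized batches up front instead of failing mid-job.
        if len(self.genomes) > MAX_GENOMES:
            raise ValueError(
                f"at most {MAX_GENOMES} genomes per call, got {len(self.genomes)}"
            )


small = BulkRequest(genomes=[{"gene_1": {}}])

try:
    BulkRequest(genomes=[{} for _ in range(MAX_GENOMES + 1)])
    too_many_rejected = False
except ValueError:
    too_many_rejected = True
```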

Pipeline per genome (sequential, in-process):

  1. AnnotationOntology.from_prd_input(...) - the new upstream factory
     in cshenry/ModelSEEDpy#26. Translator callable injected from
     KBUtilLib.KBAnnotationUtils so the factory stays decoupled.
  2. MSReconstructionUtils.compute_ontology_model_changes(...) to
     compute reactions_to_add for the build helper.
  3. MSReconstructionUtils.build_metabolic_model(..., reactions_to_add)
     (Chris confirmed: NOT kb_build_metabolic_models, which is KBase-
     wrapper code we don't want.)
  4. Optional MSReconstructionUtils.gapfill_metabolic_model(...) when
     gapfill=true.
  5. Optional FVA (rich + minimal media) when fva=true. Uses
     bulk_export.compute_fva_classes.
  6. Per-genome cobra JSON written to workspace at
     /<user>/modelseed/bulk_<job_id>/model_<genome_id>.json
  7. Rows accumulated for the combined CSVs (one workspace write per
     CSV at the end, not per-genome appends).

Per-genome try/except so one bad genome surfaces as
{status: "failed", error: "..."} in result.per_genome without
aborting the rest of the batch.
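
A minimal sketch of that isolation pattern follows. `run_batch` and `build_one` are hypothetical names, and the `"ok"` status value is an assumption; only the `{status: "failed", error: "..."}` shape comes from the commit message.

```python
def run_batch(genomes, build_one):
    """Run build_one per genome; a failure is recorded, not propagated."""
    per_genome = {}
    for genome_id, annotations in genomes.items():
        try:
            model = build_one(genome_id, annotations)
            # "ok" is an assumed success marker for this sketch.
            per_genome[genome_id] = {"status": "ok", "model": model}
        except Exception as exc:
            # One bad genome is reported in the result and the loop continues.
            per_genome[genome_id] = {"status": "failed", "error": str(exc)}
    return per_genome


def build_one(genome_id, annotations):
    """Toy builder that fails for one genome to exercise the error path."""
    if not annotations:
        raise ValueError(f"no annotations for {genome_id}")
    return f"model_{genome_id}"


results = run_batch({"good": {"g1": {}}, "bad": {}}, build_one)
```

Catching the broad `Exception` is deliberate here, since the point is that no single genome can abort the batch; a real implementation might additionally log the traceback.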

CSV column specs mirror KBDatalakeApps' canonical genome_reaction +
genome_gene_reaction_essentially_test tables. genes.csv carries an
extra `disposition` column (`mapped`|`unmapped`) per the PRD requirement
that unmapped genes never silently disappear.
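
As an illustration of the `disposition` column, the sketch below builds a genes.csv in memory and writes it once at the end. Apart from `disposition` and its two values, every column name here is hypothetical, as are `genes_csv` and the placeholder IDs; the real column spec is the one the commit says it mirrors.

```python
import csv
import io


def genes_csv(rows):
    """Render (genome_id, gene_id, msrxns) tuples as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        # Hypothetical columns, except for `disposition`.
        fieldnames=["genome_id", "gene_id", "reactions", "disposition"],
        lineterminator="\n",
    )
    writer.writeheader()
    for genome_id, gene_id, msrxns in rows:
        writer.writerow(
            {
                "genome_id": genome_id,
                "gene_id": gene_id,
                "reactions": ";".join(sorted(msrxns)),
                # Genes without reactions still get a row, marked unmapped.
                "disposition": "mapped" if msrxns else "unmapped",
            }
        )
    return buf.getvalue()


text = genes_csv(
    [
        ("G1", "g1", {"rxn_b", "rxn_a"}),
        ("G1", "g2", set()),
    ]
)
```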

Workflow body lives in tasks._run_bulk_reconstruct so the Celery task
(thin wrapper that adds progress callbacks) and the subprocess
job-script entry point share one implementation and can't drift.

Schema, route, dispatcher, task, subprocess script, CSV builders +
FVA helpers, 16 unit tests for the CSV layer (full row-shape +
column-order + aggregation coverage), and a user-facing
docs/BULK_RECONSTRUCT.md.

Live deploy is gated on cshenry/ModelSEEDpy#26 being merged + poplar
rebuilt; locally everything imports against the branched fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>