
Query the KEGG KO HMMs in a single hmmsearch over a concatenated library #630

Merged: edkerk merged 3 commits into develop3 from feat/kegg-hmmsearch on Jun 11, 2026
@edkerk edkerk commented Jun 11, 2026


Summary

getKEGGModelForOrganism previously ran one hmmsearch per KO (thousands of
invocations) against a directory of individual KO HMM files unpacked from
prok90_kegg116.zip / euk90_kegg116.zip. This PR replaces that loop with a
single hmmsearch of the whole input proteome against the concatenated
kegg116_<domain> HMM library, shipped as one gzip-compressed flatfile and
gunzipped on demand.

The profile library is the query and the proteome is the target sequence
database — the same fast search direction as before — so the reported per-hit
full-sequence E-values are identical to the historical per-KO search (same
profiles, same target db, same effective db size). The cut-off, the
minScoreRatioKO / minScoreRatioG filters and the downstream model assembly
are therefore unchanged; the thousands of hmmsearch calls simply collapse
into one.
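
As an illustration of the search direction described above, the single invocation could be assembled roughly as follows. This is a Python sketch, not the MATLAB code in this PR: the helper name `build_hmmsearch_cmd` and the file names are hypothetical, and the choice of `--noali`, `--cpu` and `-o` is an assumption. Only `--tblout` is named in the PR.

```python
import os


def build_hmmsearch_cmd(hmm_library, proteome_fasta, tblout_path, cpus=1):
    """Return the argument list for one hmmsearch run over the whole library.

    hmmsearch takes the profile file first (the query) and the sequence
    file second (the target database), which is the direction the PR keeps.
    """
    return [
        "hmmsearch",
        "--noali",                     # assumption: alignments are not needed
        "--cpu", str(cpus),            # assumption: worker thread count
        "-o", os.devnull,              # assumption: discard the human-readable report
        "--tblout", str(tblout_path),  # per-sequence hit table, parsed afterwards
        str(hmm_library),              # query: concatenated KO profile library
        str(proteome_fasta),           # target: the organism's protein FASTA
    ]


cmd = build_hmmsearch_cmd(
    "kegg116_prokaryotes.hmm", "proteome.faa", "hits.tbl", cpus=4
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where `hmmsearch` is on the path. Because the profiles and the target database are the same as in the per-KO loop, no extra E-value correction is applied in this sketch.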

Highlights:

  • No hmmpress / hmmscan and no new bundled binary: hmmsearch already ships with RAVEN.
  • Per-organism phylogenetic-distance subsampling (getPhylDist) is skipped for the prebuilt prok90/euk90 libraries, whose sequence set is already fixed — so the search path no longer needs a local keggdb/taxonomy.
  • A legacy directory of per-KO HMMs is still honoured when already extracted (<dataDir>/hmms/K00844.hmm), so existing setups keep working.
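
The lookup order for the profiles (concatenated library first, gzip archive next, legacy per-KO directory when that is what is on disk) might be sketched as below. This is an illustrative Python sketch under stated assumptions, not the MATLAB implementation: `resolve_hmm_source` and the `download` callback are hypothetical names, and placing the legacy check ahead of the download is an assumption about precedence that the PR text does not spell out.

```python
import gzip
import shutil
from pathlib import Path


def _gunzip(src, dst):
    """Decompress src (.gz) into dst without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def resolve_hmm_source(data_dir, download=None):
    """Return ("library", path) or ("legacy", path) for a dataDir stem.

    data_dir is a path stem such as ".../prok90_kegg116"; the library is
    expected at "<dataDir>.hmm" and legacy profiles under "<dataDir>/hmms".
    """
    data_dir = Path(data_dir)
    library = data_dir.with_name(data_dir.name + ".hmm")
    archive = data_dir.with_name(data_dir.name + ".hmm.gz")
    legacy = data_dir / "hmms"

    if library.is_file():                 # already unpacked
        return "library", library
    if archive.is_file():                 # gunzip on demand
        _gunzip(archive, library)
        return "library", library
    if legacy.is_dir() and any(legacy.glob("*.hmm")):
        return "legacy", legacy           # existing per-KO setup keeps working
    if download is not None:              # hypothetical callback that writes archive
        download(archive)
        _gunzip(archive, library)
        return "library", library
    raise FileNotFoundError(f"no HMM library or per-KO directory for {data_dir}")
```

A caller would branch on the returned kind: `"library"` leads to the single search, `"legacy"` to the per-KO path.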

What changed

reconstruction/kegg/getKEGGModelForOrganism.m:

  • Acquisition — for the RAVEN-provided sets, prefer <dataDir>.hmm (the
    concatenated library); gunzip <dataDir>.hmm.gz if present, otherwise
    download kegg116_<domain>.hmm.gz. Falls back to the legacy .zip/per-KO
    directory when that is what the user has.
  • Search + parse — when the concatenated library is used, run one
    hmmsearch --tblout and parse the table directly into koGeneMat
    (target col 1 = gene, query col 3 = KO, col 5 = full-sequence E-value),
    reusing the existing scoring/assembly downstream.
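
The column mapping in the "Search + parse" bullet can be illustrated with a small parser. This is a Python sketch, not the MATLAB code in the PR: the function names are hypothetical, the sample rows are made up for illustration, and the matrix layout (KOs as rows, genes as columns, `None` for no hit) is an assumption about koGeneMat rather than its documented shape. It also assumes each profile's name is the KO identifier.

```python
def parse_tblout(lines):
    """Return {(gene, ko): evalue} from hmmsearch --tblout lines.

    Whitespace-separated columns: 1 = target name (gene),
    3 = query name (KO profile), 5 = full-sequence E-value.
    """
    hits = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        gene, ko, evalue = fields[0], fields[2], float(fields[4])
        best = hits.get((gene, ko))
        if best is None or evalue < best:  # defensive: keep the lowest E-value
            hits[(gene, ko)] = evalue
    return hits


def to_ko_gene_matrix(hits):
    """Arrange hits as rows = KOs, columns = genes (None where no hit)."""
    kos = sorted({ko for _, ko in hits})
    genes = sorted({gene for gene, _ in hits})
    matrix = [[hits.get((gene, ko)) for gene in genes] for ko in kos]
    return kos, genes, matrix


# Made-up rows in tblout layout, for illustration only.
sample = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "geneA - K00844 - 1e-50 170.2 0.1 2e-50 169.9 0.1 1.0 1 0 0 1 1 1 1 -",
    "geneB - K00844 - 3.2e-10 40.1 0.0 4e-10 39.8 0.0 1.1 1 0 0 1 1 1 1 -",
    "geneA - K00845 - 0.002 12.0 0.0 0.003 11.5 0.0 1.2 1 0 0 1 1 1 0 -",
]
kos, genes, matrix = to_ko_gene_matrix(parse_tblout(sample))
```

Only the first five columns are read, so the free-text description in the final column cannot shift the fields that matter.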

Validation

  • MATLAB checkcode clean (no syntax errors; no new messages introduced).
  • Search equivalence — on a mini library (8 KO profiles split out of
    prok90_kegg116) searched against a 300-sequence proteome, the per-KO
    hmmsearch loop and the single-library hmmsearch produce identical
    (gene, KO, E-value) hits (34 vs 34, zero differences).
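
An equivalence check of this kind can be expressed as a comparison of two hit maps keyed by (gene, KO). This is a hypothetical Python sketch of such a check, not the script used for the validation above, and the example values are made up.

```python
def diff_hits(per_ko_hits, single_run_hits):
    """Compare two {(gene, ko): evalue} maps.

    Returns (only_in_per_ko, only_in_single_run, evalue_mismatches),
    each as a sorted list of (gene, ko) keys.
    """
    a_keys, b_keys = set(per_ko_hits), set(single_run_hits)
    only_a = sorted(a_keys - b_keys)
    only_b = sorted(b_keys - a_keys)
    mismatched = sorted(
        key for key in a_keys & b_keys
        if per_ko_hits[key] != single_run_hits[key]
    )
    return only_a, only_b, mismatched


# Made-up example: identical hit sets give three empty lists.
loop_hits = {("geneA", "K00844"): 1e-50, ("geneB", "K00844"): 3.2e-10}
library_hits = {("geneB", "K00844"): 3.2e-10, ("geneA", "K00844"): 1e-50}
result = diff_hits(loop_hits, library_hits)
```

E-values are compared exactly here because both runs would print them in the same text format; a tolerance would only be needed if the values came from different sources.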

Follow-up (release management)

The download URL points at
…/releases/download/v2.11.0/kegg116_<domain>.hmm.gz. The two library assets
(kegg116_prokaryotes.hmm.gz, kegg116_eukaryotes.hmm.gz) need to be attached
to the release before the download path is live. They are the same artefacts
published for raven-python.

edkerk added 2 commits June 11, 2026 15:19
…search

Replace the per-KO hmmsearch loop in getKEGGModelForOrganism with a single
hmmsearch of the whole query proteome against the concatenated kegg116_<domain>
HMM library, downloaded as a gzip-compressed flatfile and gunzipped on demand.

The profile library is the query and the proteome the target sequence database,
so the reported per-hit E-values match RAVEN's historical per-KO hmmsearch
(same search direction, same effective database size). The cut-off,
minScoreRatioKO/minScoreRatioG filters and model assembly are therefore
unchanged - thousands of hmmsearch invocations simply collapse into one, with
no hmmpress/hmmscan and no new bundled binary (hmmsearch already ships).

A legacy directory of per-KO HMMs is still honoured when already extracted, and
per-organism phylogenetic-distance subsampling is skipped for the fixed
prebuilt prok90/euk90 libraries.
Added timeout to function tests to prevent hanging.
@github-actions


Function test results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit a86e78a.

@edkerk edkerk merged commit d89a82a into develop3 Jun 11, 2026
@edkerk edkerk deleted the feat/kegg-hmmsearch branch June 11, 2026 19:06