Add dynamic mem_gb estimates and odb-fungi/bac120 dataset configs by rchikhi · Pull Request #16 · blab/trajectories

rchikhi · 2026-03-14T13:17:36Z

Summary

- Dynamic mem_gb estimates in Snakefile based on actual input file sizes:
  - provision_alignment uses JSON size x seq_length from config to predict peak memory
  - trajectory.py: 10x alignment size + 5 GB base (observed 12x peak on odb fungi)
  - pairwise: 6x alignment size (observed ~5x peak)
  - Other rules: 3x input file size
- Added dataset configs for new phyla/clades
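
The size-based estimate described above could look roughly like the following. This is a minimal sketch, not the PR's actual `_mem_gb`: the helper name `mem_gb_for`, the `floor_gb` parameter, and the choice of GiB as the unit are assumptions made for illustration. Only the multipliers and the 5 GB base come from the summary. The provision_alignment formula is not reproduced here because the description does not spell it out.

```python
import math
import os

GIB = 1024 ** 3  # assumption: the PR does not say whether it counts GB or GiB


def mem_gb_for(multiplier, base_gb=0.0, floor_gb=1):
    """Return a Snakemake-style resource callable (hypothetical helper)."""

    def estimate(wildcards, input):
        # Sum the on-disk size of every input file of the job.
        total_bytes = sum(os.path.getsize(path) for path in input)
        # Scale by the per-rule multiplier and add the fixed overhead.
        needed_gb = base_gb + multiplier * total_bytes / GIB
        # Round up, and never reserve less than the floor.
        return max(floor_gb, math.ceil(needed_gb))

    return estimate


# Multipliers quoted in the summary.
trajectory_mem = mem_gb_for(multiplier=10, base_gb=5)
pairwise_mem = mem_gb_for(multiplier=6)
default_mem = mem_gb_for(multiplier=3)
```

Because Snakemake accepts callables taking `(wildcards, input)` as resource values, a rule could then declare `resources: mem_gb=trajectory_mem`.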

New configs

- defaults/odb-fungi.yaml: 1,113 OrthoDB fungal markers (24.5k genomes, 23M tips, 816 GB on S3)
- defaults/bac120-actinomycetota.yaml: 116 markers (49k genomes)
- defaults/bac120-bacillota.yaml: 122 markers (122k genomes)

Context

At 122k-taxa scale (Bacillota) and 24k-taxa scale (odb fungi), fixed mem_gb values are both too conservative for small markers and too optimistic for large ones. Observed:

- trajectory.py on 2.3 GB alignment: 28.7 GB peak RSS (12x file size)
- pairwise on 2.3 GB alignment: ~12 GB peak RSS (5x file size)
- Provision rules scale with JSON/alignment size at ~3x
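
As a quick sanity check on these observations, the snippet below recomputes the peak-to-file-size ratios and prints the reservation each formula from the summary would give for the same 2.3 GB alignment. It is only arithmetic on the numbers quoted in this PR. The function name `reservation_gb` is hypothetical, and the pairwise peak is entered as 12.0 although the source gives it only approximately.

```python
# Figures quoted above: alignment size and observed peak RSS, both in GB.
alignment_gb = 2.3
observed_peak_gb = {"trajectory.py": 28.7, "pairwise": 12.0}  # pairwise is approximate

# (multiplier, fixed base in GB) per rule, as listed in the PR summary.
formula = {"trajectory.py": (10, 5.0), "pairwise": (6, 0.0)}


def reservation_gb(size_gb, multiplier, base_gb):
    """Memory reserved for an input of size_gb under a multiplier-plus-base rule."""
    return multiplier * size_gb + base_gb


for rule, peak in observed_peak_gb.items():
    multiplier, base = formula[rule]
    ratio = peak / alignment_gb
    reserved = reservation_gb(alignment_gb, multiplier, base)
    print(f"{rule}: observed {ratio:.1f}x of file size, "
          f"formula reserves {reserved:.1f} GB vs {peak:.1f} GB peak")
```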

Test plan

- Verified on Bacillota pipeline (122 markers, 122k taxa)
- Verified on odb fungi pipeline (1,113 markers, 24.5k taxa, 23M tips)

Previous fixed mem_gb values (5-15 GB per rule) poorly matched actual memory usage, limiting concurrency on small markers while under-reserving for large ones. Now each rule estimates memory from its input files:

- provision_alignment: estimates from JSON size × seq_length (JSON is compact mutations; script reconstructs full sequences for all nodes)
- branches, trajectories, pairwise: 3× alignment FASTA size
- metadata, label_auspice_json: 3× JSON size
- train_test_split: 3× branches TSV size

With --resources mem_gb=160, this allows ~25 concurrent small markers (1 GB each) vs the old limit of 10 (15 GB each), while correctly reserving 12-46 GB for large markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
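
The effect of per-job reservations on concurrency under a fixed `mem_gb` budget can be illustrated with a small calculation. This is a sketch that considers memory only. Snakemake's real scheduler also weighs cores and other resources, and the function names and the example job mix below are hypothetical. The 160 GB budget, the 15 GB old reservation, and the 1, 12, and 46 GB reservation sizes are taken from the commit message above.

```python
def max_concurrent(budget_gb, per_job_gb):
    """Number of identical jobs whose reservations fit in the budget."""
    return budget_gb // per_job_gb


def jobs_admitted(budget_gb, reservations_gb):
    """Greedily admit jobs in order while their reservations still fit."""
    admitted, used = 0, 0
    for need in reservations_gb:
        if used + need <= budget_gb:
            used += need
            admitted += 1
    return admitted


budget = 160

# Old behaviour: every job reserves a fixed 15 GB.
print(max_concurrent(budget, 15))

# Hypothetical mix with dynamic estimates: two large markers and thirty small ones.
mix = [46, 12] + [1] * 30
print(jobs_admitted(budget, mix))
```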

- defaults/odb-fungi.yaml: 1,113 OrthoDB fungal markers (24.5k genomes, 23M tips)
- defaults/bac120-actinomycetota.yaml: 116 markers (49k genomes)
- defaults/bac120-bacillota.yaml: 122 markers (122k genomes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- trajectory.py: 3x → 10x + 5GB base (observed 12x peak on 2.3GB alignments)
- pairwise: 3x → 6x (observed ~5x peak)
- _mem_gb: add base parameter for fixed overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rayan Chikhi and others added 3 commits March 14, 2026 13:12

Tune mem_gb multipliers from odb fungi observations

d2f2dca


rchikhi changed the title from "Add dynamic mem_gb resource estimates" to "Add dynamic mem_gb estimates and odb-fungi/bac120 dataset configs" on Apr 7, 2026

rchikhi wants to merge 3 commits into main from dynamic-mem-gb
