Add dynamic mem_gb estimates and odb-fungi/bac120 dataset configs #16

Open
rchikhi wants to merge 3 commits into main from dynamic-mem-gb
rchikhi commented Mar 14, 2026

Summary

  • Dynamic mem_gb estimates in Snakefile based on actual input file sizes
  • provision_alignment uses JSON size x seq_length from config to predict peak memory
  • trajectory.py: 10x alignment size + 5 GB base (observed 12x peak on odb fungi)
  • pairwise: 6x alignment size (observed ~5x peak)
  • Other rules: 3x input file size
  • Added dataset configs for new phyla/clades
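A minimal sketch of how such a per-rule estimate could be computed. The names `mem_gb_from_bytes` and `estimate_mem_gb`, the 1 GB floor, and the rounding up to whole gigabytes are assumptions for illustration; the PR's actual `_mem_gb` helper may differ. Only the multipliers and the 5 GB base are taken from the summary above.

```python
import math
import os


def mem_gb_from_bytes(total_bytes, multiplier=3.0, base_gb=0.0, floor_gb=1):
    """Hypothetical estimate: multiplier x input size (in GB) plus a fixed base.

    The result is rounded up to a whole number of GB and never drops
    below floor_gb (the floor is an assumption, not taken from the PR).
    """
    size_gb = total_bytes / 1e9
    return max(floor_gb, math.ceil(size_gb * multiplier + base_gb))


def estimate_mem_gb(paths, multiplier=3.0, base_gb=0.0, floor_gb=1):
    """Sum the on-disk sizes of the input files that exist, then apply the estimate."""
    total_bytes = sum(os.path.getsize(p) for p in paths if os.path.exists(p))
    return mem_gb_from_bytes(total_bytes, multiplier, base_gb, floor_gb)


# Multipliers from the summary: trajectory.py uses 10x + 5 GB, pairwise 6x,
# other rules 3x. A 2.3 GB alignment is used as the example input size.
trajectory_gb = mem_gb_from_bytes(2.3e9, multiplier=10, base_gb=5)
pairwise_gb = mem_gb_from_bytes(2.3e9, multiplier=6)
default_gb = mem_gb_from_bytes(2.3e9)
print(trajectory_gb, pairwise_gb, default_gb)
```

In a Snakefile, a rule could pass this as a callable resource, for example `resources: mem_gb=lambda wildcards, input: estimate_mem_gb(input, multiplier=10, base_gb=5)`, since Snakemake evaluates resource callables with the rule's input files.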

New configs

  • defaults/odb-fungi.yaml: 1,113 OrthoDB fungal markers (24.5k genomes, 23M tips, 816 GB on S3)
  • defaults/bac120-actinomycetota.yaml: 116 markers (49k genomes)
  • defaults/bac120-bacillota.yaml: 122 markers (122k genomes)

Context

At 122k-taxa scale (Bacillota) and 24k-taxa scale (odb fungi), fixed mem_gb values are both too conservative for small markers and too optimistic for large ones. Observed:

  • trajectory.py on 2.3 GB alignment: 28.7 GB peak RSS (12x file size)
  • pairwise on 2.3 GB alignment: ~12 GB peak RSS (5x file size)
  • Provision rules scale with JSON/alignment size at ~3x
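The observations above can be checked against the proposed multipliers with a few lines of arithmetic. This is only a sketch: the sizes and peaks are the rounded figures quoted in this description (the pairwise peak is approximate), and the dictionary and function names are hypothetical.

```python
# Figures quoted in this PR description (rounded).
ALIGNMENT_GB = 2.3
OBSERVED_PEAK_GB = {"trajectory": 28.7, "pairwise": 12.0}

# (multiplier, base_gb) proposed for each rule in the summary.
ESTIMATE = {"trajectory": (10, 5), "pairwise": (6, 0)}


def compare(rule):
    """Return the observed peak/size ratio, the reserved memory, and the headroom."""
    multiplier, base_gb = ESTIMATE[rule]
    reserved_gb = multiplier * ALIGNMENT_GB + base_gb
    peak_gb = OBSERVED_PEAK_GB[rule]
    return {
        "ratio": peak_gb / ALIGNMENT_GB,
        "reserved_gb": reserved_gb,
        "headroom_gb": reserved_gb - peak_gb,
    }


for rule in ESTIMATE:
    result = compare(rule)
    print(
        f"{rule}: observed {result['ratio']:.1f}x, "
        f"reserved {result['reserved_gb']:.1f} GB, "
        f"headroom {result['headroom_gb']:+.1f} GB"
    )
```

Taken at face value, the trajectory reservation (about 28 GB) lands just under the observed 28.7 GB peak, while pairwise keeps roughly 1.8 GB of headroom. Both inputs are rounded, so this is indicative only.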

Test plan

  • Verified on Bacillota pipeline (122 markers, 122k taxa)
  • Verified on odb fungi pipeline (1,113 markers, 24.5k taxa, 23M tips)

Rayan Chikhi and others added 3 commits March 14, 2026 13:12
Previous fixed mem_gb values (5-15 GB per rule) poorly matched actual
memory usage, limiting concurrency on small markers while under-reserving
for large ones. Now each rule estimates memory from its input files:

- provision_alignment: estimates from JSON size × seq_length (JSON is
  compact mutations; script reconstructs full sequences for all nodes)
- branches, trajectories, pairwise: 3× alignment FASTA size
- metadata, label_auspice_json: 3× JSON size
- train_test_split: 3× branches TSV size

With --resources mem_gb=160, this allows ~25 concurrent small markers
(1 GB each) vs the old limit of 10 (15 GB each), while correctly
reserving 12-46 GB for large markers.
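The concurrency effect can be sketched as simple budget arithmetic. The function names are hypothetical, and the model assumes memory is the only constraint, which is a simplification of how Snakemake schedules jobs under `--resources`.

```python
def max_concurrent(budget_gb, per_job_gb):
    """Number of identical jobs that fit in the memory budget."""
    return budget_gb // per_job_gb


def jobs_admitted(budget_gb, reservations_gb):
    """Greedily admit jobs in order while their summed reservation fits the budget."""
    used = 0
    admitted = 0
    for need in reservations_gb:
        if used + need <= budget_gb:
            used += need
            admitted += 1
    return admitted


# Old fixed reservation: 15 GB per job under a 160 GB budget.
print(max_concurrent(160, 15))

# Dynamic reservation: one large marker (46 GB) alongside many 1 GB markers.
print(jobs_admitted(160, [46] + [1] * 200))
```

Memory alone would admit up to 160 one-gigabyte jobs, so the ~25 figure quoted above presumably reflects other limits such as available cores; that reading is an assumption.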

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- defaults/odb-fungi.yaml: 1,113 OrthoDB fungal markers (24.5k genomes, 23M tips)
- defaults/bac120-actinomycetota.yaml: 116 markers (49k genomes)
- defaults/bac120-bacillota.yaml: 122 markers (122k genomes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

trajectory.py: 3x → 10x + 5GB base (observed 12x peak on 2.3GB alignments)
pairwise: 3x → 6x (observed ~5x peak)
_mem_gb: add base parameter for fixed overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rchikhi changed the title from "Add dynamic mem_gb resource estimates" to "Add dynamic mem_gb estimates and odb-fungi/bac120 dataset configs" Apr 7, 2026