Drop all-gap alignment columns per trajectory by rchikhi · Pull Request #17 · blab/trajectories

rchikhi · 2026-04-18T12:39:36Z

Summary

For forwards trajectories (trajectory.py): drop columns where every node on the root-to-tip path has - or N.
For pairwise trajectories (pairwise_trajectory.py): drop columns where both tips have - or N.

Why

With diverse-phylum alignments (bac120 cyano with 2470 species), MAFFT creates an alignment column for every lineage-specific insertion. For any given tip/pair, columns from other lineages' insertions are - throughout the trajectory and carry zero signal.

Example: a cyano TIGR00810 trajectory pre-fix is 841 chars × 12 frames, with 72% gaps per frame. The first ~180 characters of every frame are identical pure-gap blocks that only exist because some other cyano lineage has an insertion there.

Post-fix, the same trajectory is 237 chars × 12 frames, 0% gaps, and every mutation is still visible.

What's preserved

Any column where at least one node on the path has a real base is kept, so lineage-specific insertion/deletion events are preserved (a deletion base→- on some branch stays visible because the pre-deletion ancestors have a base).
Hamming distances in headers are unchanged (trimmed columns contribute 0 to Hamming by construction).

Impact

Dataset	Before	After
cyano TIGR00810 trajectory (typical tip)	841 chars, 72% gap	237 chars, 0% gap
spike-xs trajectory	2055 chars, ~0% gap	2055 chars, ~0% gap (unchanged)

Viral datasets (spike, RdRp, cytb, flu, n450) are unaffected — their alignments have almost no insertion columns so there are no dead-gap columns to drop.

Test plan

Regenerate cyano trajectories from the clean auspice JSONs and verify expected length reduction
Regenerate bacteroidota trajectories and verify expected length reduction
Spot-check that per-frame mutations remain the same (only positions shift due to column removal)
Confirm trajectories for viral datasets remain essentially unchanged

🤖 Generated with Claude Code

For forwards trajectories (trajectory.py): drop columns where every node on the root-to-tip path has '-' or 'N'. For pairwise trajectories (pairwise_trajectory.py): drop columns where both tips have '-' or 'N'. Rationale: with diverse-phylum alignments (e.g. bac120 cyano with 2470 species), MAFFT creates alignment columns for every lineage-specific insertion. For any given tip/pair these foreign-lineage insertions are '-' throughout the trajectory and carry zero signal — they only exist because OTHER lineages have bases there. For cyano TIGR00810 this cuts trajectories from 841 -> ~237 chars per frame (72% gaps removed). Any column where at least one node on the path has a real base is kept, so lineage-specific insertion/deletion events are preserved. Hamming distances are unchanged (trimmed columns contribute 0 to Hamming by construction). Viral datasets (spike, RdRp, etc.) are unaffected — their alignments have almost no insertions so there are no dead-gap columns to drop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pure-Python char-by-char loop was O(path_len * L) per tip which made bacteroidota (22k tips x 2000 cols per marker x ~25 nodes per path) take 30+ hours where it used to take 60 minutes. Use numpy: build a boolean mask per node via ord('-')/ord('N') comparison (vectorized), OR them together across the path, then index. Same semantics, ~100x faster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

200x speedup. The previous version applied the trim to every node in the tree for every tip, when build_trajectory_content only reads sequences[node] for nodes in path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Only the path nodes are ever read by the caller (build_trajectory_content uses sequences.get(node) for node in path), so returning a fresh dict keyed only by path is correct and skips the O(|tree|) copy. Critical for spike-lg (8M+ nodes, ~80MB per copy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure-Python char-by-char loop with membership tests was ~115us per call on 2055-char sequences. build_trajectory_content calls hamming twice per frame x ~20 frames x 8M tips (for spike-lg) = 320M calls, which was a ~10-hour bottleneck. Numpy version: ~15us per call, cuts to ~1.4 hours. Also applied the same fix in branches.py for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rayan Chikhi and others added 5 commits April 18, 2026 12:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop all-gap alignment columns per trajectory#17

Drop all-gap alignment columns per trajectory#17
rchikhi wants to merge 5 commits intomainfrom
fix/trajectory-gap-trim

rchikhi commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rchikhi commented Apr 18, 2026

Summary

Why

What's preserved

Impact

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant