Skip to content

Drop all-gap alignment columns per trajectory#17

Open
rchikhi wants to merge 5 commits intomainfrom
fix/trajectory-gap-trim
Open

Drop all-gap alignment columns per trajectory#17
rchikhi wants to merge 5 commits intomainfrom
fix/trajectory-gap-trim

Conversation

@rchikhi
Copy link
Copy Markdown
Collaborator

@rchikhi rchikhi commented Apr 18, 2026

Summary

  • For forwards trajectories (trajectory.py): drop columns where every node on the root-to-tip path has - or N.
  • For pairwise trajectories (pairwise_trajectory.py): drop columns where both tips have - or N.

Why

With diverse-phylum alignments (bac120 cyano with 2470 species), MAFFT creates an alignment column for every lineage-specific insertion. For any given tip/pair, columns from other lineages' insertions are - throughout the trajectory and carry zero signal.

Example: a cyano TIGR00810 trajectory pre-fix is 841 chars × 12 frames, with 72% gaps per frame. The first ~180 characters of every frame are identical pure-gap blocks that only exist because some other cyano lineage has an insertion there.

Post-fix, the same trajectory is 237 chars × 12 frames, 0% gaps, and every mutation is still visible.

What's preserved

  • Any column where at least one node on the path has a real base is kept, so lineage-specific insertion/deletion events are preserved (a deletion base→- on some branch stays visible because the pre-deletion ancestors have a base).
  • Hamming distances in headers are unchanged (trimmed columns contribute 0 to Hamming by construction).

Impact

Dataset Before After
cyano TIGR00810 trajectory (typical tip) 841 chars, 72% gap 237 chars, 0% gap
spike-xs trajectory 2055 chars, ~0% gap 2055 chars, ~0% gap (unchanged)

Viral datasets (spike, RdRp, cytb, flu, n450) are unaffected — their alignments have almost no insertion columns so there are no dead-gap columns to drop.

Test plan

  • Regenerate cyano trajectories from the clean auspice JSONs and verify expected length reduction
  • Regenerate bacteroidota trajectories and verify expected length reduction
  • Spot-check that per-frame mutations remain the same (only positions shift due to column removal)
  • Confirm trajectories for viral datasets remain essentially unchanged

🤖 Generated with Claude Code

Rayan Chikhi and others added 5 commits April 18, 2026 12:38
For forwards trajectories (trajectory.py): drop columns where every
node on the root-to-tip path has '-' or 'N'. For pairwise trajectories
(pairwise_trajectory.py): drop columns where both tips have '-' or 'N'.

Rationale: with diverse-phylum alignments (e.g. bac120 cyano with 2470
species), MAFFT creates alignment columns for every lineage-specific
insertion. For any given tip/pair these foreign-lineage insertions are
'-' throughout the trajectory and carry zero signal — they only exist
because OTHER lineages have bases there.

For cyano TIGR00810 this cuts trajectories from 841 -> ~237 chars per
frame (72% gaps removed). Any column where at least one node on the
path has a real base is kept, so lineage-specific insertion/deletion
events are preserved. Hamming distances are unchanged (trimmed columns
contribute 0 to Hamming by construction).

Viral datasets (spike, RdRp, etc.) are unaffected — their alignments
have almost no insertions so there are no dead-gap columns to drop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pure-Python char-by-char loop was O(path_len * L) per tip which
made bacteroidota (22k tips x 2000 cols per marker x ~25 nodes per
path) take 30+ hours where it used to take 60 minutes.

Use numpy: build a boolean mask per node via ord('-')/ord('N')
comparison (vectorized), OR them together across the path, then index.
Same semantics, ~100x faster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
200x speedup. The previous version applied the trim to every node in
the tree for every tip, when build_trajectory_content only reads
sequences[node] for nodes in path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only the path nodes are ever read by the caller
(build_trajectory_content uses sequences.get(node) for node in path),
so returning a fresh dict keyed only by path is correct and skips the
O(|tree|) copy. Critical for spike-lg (8M+ nodes, ~80MB per copy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-Python char-by-char loop with membership tests was ~115us per
call on 2055-char sequences. build_trajectory_content calls hamming
twice per frame x ~20 frames x 8M tips (for spike-lg) = 320M calls,
which was a ~10-hour bottleneck.

Numpy version: ~15us per call, cuts to ~1.4 hours.

Also applied the same fix in branches.py for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant