Open
Conversation
For forwards trajectories (trajectory.py): drop columns where every node on the root-to-tip path has '-' or 'N'. For pairwise trajectories (pairwise_trajectory.py): drop columns where both tips have '-' or 'N'. Rationale: with diverse-phylum alignments (e.g. bac120 cyano with 2470 species), MAFFT creates alignment columns for every lineage-specific insertion. For any given tip/pair these foreign-lineage insertions are '-' throughout the trajectory and carry zero signal — they only exist because OTHER lineages have bases there. For cyano TIGR00810 this cuts trajectories from 841 -> ~237 chars per frame (72% gaps removed). Any column where at least one node on the path has a real base is kept, so lineage-specific insertion/deletion events are preserved. Hamming distances are unchanged (trimmed columns contribute 0 to Hamming by construction). Viral datasets (spike, RdRp, etc.) are unaffected — their alignments have almost no insertions so there are no dead-gap columns to drop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pure-Python char-by-char loop was O(path_len * L) per tip which
made bacteroidota (22k tips x 2000 cols per marker x ~25 nodes per
path) take 30+ hours where it used to take 60 minutes.
Use numpy: build a boolean mask per node via ord('-')/ord('N')
comparison (vectorized), OR them together across the path, then index.
Same semantics, ~100x faster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
200x speedup. The previous version applied the trim to every node in the tree for every tip, when build_trajectory_content only reads sequences[node] for nodes in path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only the path nodes are ever read by the caller (build_trajectory_content uses sequences.get(node) for node in path), so returning a fresh dict keyed only by path is correct and skips the O(|tree|) copy. Critical for spike-lg (8M+ nodes, ~80MB per copy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure-Python char-by-char loop with membership tests was ~115us per call on 2055-char sequences. build_trajectory_content calls hamming twice per frame x ~20 frames x 8M tips (for spike-lg) = 320M calls, which was a ~10-hour bottleneck. Numpy version: ~15us per call, cuts to ~1.4 hours. Also applied the same fix in branches.py for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
trajectory.py): drop columns where every node on the root-to-tip path has-orN.pairwise_trajectory.py): drop columns where both tips have-orN.Why
With diverse-phylum alignments (bac120 cyano with 2470 species), MAFFT creates an alignment column for every lineage-specific insertion. For any given tip/pair, columns from other lineages' insertions are
-throughout the trajectory and carry zero signal.Example: a cyano TIGR00810 trajectory pre-fix is 841 chars × 12 frames, with 72% gaps per frame. The first ~180 characters of every frame are identical pure-gap blocks that only exist because some other cyano lineage has an insertion there.
Post-fix, the same trajectory is 237 chars × 12 frames, 0% gaps, and every mutation is still visible.
What's preserved
base→-on some branch stays visible because the pre-deletion ancestors have a base).Impact
Viral datasets (spike, RdRp, cytb, flu, n450) are unaffected — their alignments have almost no insertion columns so there are no dead-gap columns to drop.
Test plan
🤖 Generated with Claude Code