
Feat/expression anchors #13

Merged
tzeitim merged 5 commits into master from feat/expression_anchors on Feb 10, 2026

tzeitim commented Dec 25, 2025

No description provided.

The assemble_sequences_with_anchors function is implemented and working.

  Summary of Changes

  Rust (src/expressions.rs:325-410):
  - Added assemble_sequences_with_anchors_expr function
  - Takes 3 input series: sequences, start_anchor, end_anchor
  - Extracts anchor strings from first row of each anchor column
  - Only supports the shortest_path method (compression/auto don't need anchors)

  Python (rogtk/__init__.py:155-231):
  - Added assemble_sequences_with_anchors() wrapper function
  - Takes expr, start_anchor_col, end_anchor_col as expression arguments
  - Same kwargs as original: k, min_coverage, method, etc.

  Usage

  segments.group_by(['umi', 'start_meta', 'end_meta']).agg(
      rogtk.assemble_sequences_with_anchors(
          expr=pl.col('segment_seq'),
          start_anchor_col=pl.first('start_meta_seq'),
          end_anchor_col=pl.first('end_meta_seq'),
          k=15,
          min_coverage=20,
      ).alias('consensus_seq')
  )
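
To illustrate what an anchored shortest_path assembly could look like, here is a minimal pure-Python sketch. It is not rogtk's Rust implementation: the function name `assemble_between_anchors`, the k-mer graph construction, and the choice to start at the first k-mer of the start anchor and stop at the last k-mer of the end anchor are all assumptions made for illustration. Only the parameter names `k` and `min_coverage` come from the description above.

```python
from collections import Counter, deque


def assemble_between_anchors(reads, start_anchor, end_anchor, k=4, min_coverage=1):
    """Toy anchored assembly (hypothetical helper, not rogtk's code).

    Builds a k-mer graph from the reads and returns the sequence spelled by
    a shortest path from the first k-mer of start_anchor to the last k-mer
    of end_anchor, or None if no such path exists.
    """
    # Count every k-mer across all reads and keep those with enough coverage.
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    kmers = {km for km, c in counts.items() if c >= min_coverage}

    start, end = start_anchor[:k], end_anchor[-k:]
    if len(start) < k or len(end) < k or start not in kmers or end not in kmers:
        return None

    # Breadth-first search: the first time we reach `end` is via a path with
    # the fewest k-mer steps.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for base in "ACGT":
            nxt = node[1:] + base  # k-mers overlapping by k-1 bases
            if nxt in kmers and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if end not in parent:
        return None

    # Walk back from the end k-mer and spell out the sequence.
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path[0] + "".join(km[-1] for km in path[1:])


reads = ["ACGTTGCA", "GTTGCATC", "TGCATCGG"]
consensus = assemble_between_anchors(reads, "ACGT", "TCGG", k=4)
```

In this toy, raising `min_coverage` above the observed k-mer counts makes the anchors unreachable and the function returns None, which mirrors why a coverage threshold matters for noisy segments.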

Add Rust expressions and Python wrappers to extract actual insertion
sequences from CIGAR strings and enrich cassiopeia allele notation.

Rust (src/expressions.rs):
- extract_insertions_from_cigar(): Parse CIGAR to extract insertion sequences
- enrich_allele_with_insertions(): Transform [pos:NI] to [pos:NI:SEQ]
- enrich_allele_insertions_expr: Polars expression (allele, seq, cigar)
- extract_cigar_insertions_expr: Polars expression (seq, cigar)
- Handle 1-based to 0-based position conversion for allele notation

Python (rogtk/__init__.py):
- CigarNamespace: Registered as pl.Expr.cigar namespace
- enrich_insertions(): Method to enrich allele strings
- extract_cigar_insertions(): Standalone function for extraction

Example: [78:5I] → [78:5I:ACTTA]
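
The transformation above can be sketched in plain Python. This is only an illustration of the idea, not the Rust code in src/expressions.rs: the helper names `extract_insertions` and `enrich_allele` are hypothetical, and the sketch pairs each `[pos:NI]` entry with the next unused CIGAR insertion of length N rather than reproducing the 1-based to 0-based position handling mentioned above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")   # operations that consume read bases (SAM spec)
REF_OPS = set("MDN=X")     # operations that consume reference bases
ALLELE_INS = re.compile(r"\[(\d+):(\d+)I\]")


def extract_insertions(seq, cigar, ref_start=0):
    """Return (reference_offset, inserted_bases) for every I operation."""
    insertions = []
    q = 0          # position in the read
    r = ref_start  # position on the reference (0-based, an assumption)
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "I":
            insertions.append((r, seq[q:q + n]))
        if op in QUERY_OPS:
            q += n
        if op in REF_OPS:
            r += n
    return insertions


def enrich_allele(allele, seq, cigar):
    """Rewrite [pos:NI] as [pos:NI:SEQ], matching by order and length."""
    pending = [ins for _, ins in extract_insertions(seq, cigar)]

    def repl(match):
        n = int(match.group(2))
        for i, ins in enumerate(pending):
            if len(ins) == n:
                del pending[i]
                return f"[{match.group(1)}:{n}I:{ins}]"
        return match.group(0)  # no matching insertion: leave unchanged

    return ALLELE_INS.sub(repl, allele)


enriched = enrich_allele("[78:5I]", "GGACTTACC", "2M5I2M")
```

Soft clips (S) advance the read position but not the reference position, so insertions after a clipped prefix are still sliced from the right place in the read.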

Process multiple BAM files into a single parquet file with bounded
memory. Streams batches sequentially from each BAM, writing directly to
parquet without intermediate files or memory spikes.

Features:
- Bounded memory usage via batch streaming (configurable batch_size)
- Optional source_file column to track origin BAM per record
- Input validation, progress reporting, and Python interrupt handling
- Supports all existing options (compression, sequence/quality inclusion, limit)

Performance: ~43k records/sec (1.86M records from 46 BAMs in 43s)

Usage:
  rogtk.bams_to_parquet(bam_list, "output.parquet",
                        include_source_file=True)
  # lazy interface for downstream ops
  lf = pl.scan_parquet("output.parquet")
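
The bounded-memory pattern described above can be sketched with the standard library alone. This does not read BAM files or write parquet, and it is not how rogtk.bams_to_parquet is implemented; `iter_batches`, `stream_sources`, and the `write_batch` callback are hypothetical names standing in for the reader and the parquet writer. The point is only that at most `batch_size` records are held in memory at any time.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def stream_sources(sources, write_batch, batch_size=1000, include_source=True):
    """Stream each (name, records) source sequentially into write_batch.

    Returns the total number of records written.
    """
    total = 0
    for name, records in sources:
        for batch in iter_batches(records, batch_size):
            if include_source:
                # Tag every record with its origin, like a source_file column.
                batch = [dict(rec, source_file=name) for rec in batch]
            write_batch(batch)
            total += len(batch)
    return total


# Stand-in data: generators play the role of per-file record streams.
sources = [
    ("a.bam", ({"id": i} for i in range(5))),
    ("b.bam", ({"id": i} for i in range(3))),
]
written = []
total = stream_sources(sources, written.append, batch_size=2)
```

Because each source is consumed lazily and batches are handed to the writer as soon as they fill, peak memory depends on `batch_size` rather than on the number or size of the inputs.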

tzeitim merged commit 23180df into master on Feb 10, 2026
12 checks passed