Investigate Aho-Corasick pre-tokenization for ASCII throughput #118

@HavenDV

Description

Context

Our current throughput on ASCII-dominant text is 120-143 MiB/s (CountTokens/Encode on Python code and the Bitcoin whitepaper). On cached multilingual/CJK text we reach 571 MiB/s, which is competitive with the top Rust implementations.

However, GitHub's bpe crate achieves ~400-500 MiB/s on general text by using Aho-Corasick for pre-tokenization, giving it O(n) worst-case complexity versus the O(n²) worst case of the standard iterative BPE merge loop.

Goal

Raise ASCII throughput from ~140 MiB/s to 300+ MiB/s, closing most of the gap to GitHub's bpe crate.

Research areas

  1. Aho-Corasick pre-tokenization — Build an Aho-Corasick automaton from the BPE vocabulary to match longest tokens in a single linear scan, bypassing iterative merge. GitHub's bpe crate proves this approach works.

  2. SIMD-accelerated byte processing — Use System.Runtime.Intrinsics (AVX2/NEON) for:

    • Fast ASCII detection (skip UTF-8 decode for pure ASCII runs)
    • Vectorized pattern matching in the pre-tokenizer regex
    • Bulk byte classification
  3. Pre-computed merge tables — For common token pairs, store pre-computed merge results to skip the BPE priority queue entirely.
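To make item 1 concrete, here is a minimal Python sketch of a longest-match scan over a vocabulary trie (all names hypothetical; the real implementation would be C# over a proper Aho-Corasick automaton). Note that greedy longest match is not always identical to BPE's merge order, so a production implementation must reconcile the scan with the merge rules; this sketch omits that.

```python
class TrieNode:
    __slots__ = ("children", "token_id")
    def __init__(self):
        self.children = {}
        self.token_id = None  # set if a vocabulary token ends here

def build_trie(vocab):
    """vocab: dict mapping token bytes -> token id."""
    root = TrieNode()
    for token, tid in vocab.items():
        node = root
        for b in token:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = tid
    return root

def tokenize_longest(data, root):
    """Greedy leftmost-longest scan: one forward pass, no merge loop."""
    out = []
    i = 0
    while i < len(data):
        node, j = root, i
        last_id, last_end = None, i
        # Walk the trie as far as the input allows, remembering the
        # last position where a complete vocabulary token ended.
        while j < len(data) and data[j] in node.children:
            node = node.children[data[j]]
            j += 1
            if node.token_id is not None:
                last_id, last_end = node.token_id, j
        if last_id is None:  # no vocab token starts here; emit raw byte
            out.append(("byte", data[i]))
            i += 1
        else:
            out.append(("tok", last_id))
            i = last_end
    return out
```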
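Item 2's ASCII fast path can be prototyped without intrinsics: a byte is ASCII iff its high bit is clear, and the same mask test that AVX2/NEON would apply to 32/16 bytes per instruction works word-at-a-time (SWAR). A sketch of the idea in Python (helper name hypothetical):

```python
ASCII_MASK = 0x8080808080808080  # high bit of each byte in a 64-bit word

def ascii_prefix_len(data: bytes) -> int:
    """Length of the leading pure-ASCII run, checked 8 bytes at a time."""
    i = 0
    # Bulk scan: a 64-bit word is all-ASCII iff (word & ASCII_MASK) == 0.
    while i + 8 <= len(data):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & ASCII_MASK:
            break
        i += 8
    # Scalar tail: finish byte by byte.
    while i < len(data) and data[i] < 0x80:
        i += 1
    return i
```

The C# version would do the same test with `Vector128`/`Vector256` loads, letting the tokenizer skip UTF-8 decoding for the whole ASCII run.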
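A sketch of item 3, assuming tokens and merge rules are held as integer ids (all names hypothetical): the ordered merge rules are flattened into a hash table keyed by the pair, so the encode loop does one lookup per adjacent pair instead of re-deriving priorities through a queue.

```python
def build_merge_table(merges):
    """merges: ordered list of ((left_id, right_id), merged_id) BPE rules.

    Returns a dict mapping each pair to (rank, merged_id), so the hot
    path is a single hash lookup per adjacent pair.
    """
    return {pair: (rank, merged) for rank, (pair, merged) in enumerate(merges)}

def bpe_merge(ids, table):
    """Standard BPE: repeatedly apply the lowest-ranked applicable merge,
    found via the pre-computed table rather than a priority queue."""
    ids = list(ids)
    while True:
        best = None  # ((rank, merged_id), index) of best applicable merge
        for i in range(len(ids) - 1):
            hit = table.get((ids[i], ids[i + 1]))
            if hit is not None and (best is None or hit[0] < best[0][0]):
                best = (hit, i)
        if best is None:
            return ids
        (rank, merged), i = best
        ids[i:i + 2] = [merged]
```

This still rescans the sequence per merge; the point of the table is that each scan step is O(1), and the most frequent pair results can additionally be cached across calls.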
