Context
Our current throughput on ASCII-dominant text is 120-143 MiB/s (CountTokens/Encode on Python source and the Bitcoin whitepaper). On cached multilingual/CJK text we reach 571 MiB/s, which is competitive with the top Rust implementations.
However, GitHub's bpe crate achieves ~400-500 MiB/s on general text by using Aho-Corasick for pre-tokenization, giving it O(n) worst-case complexity vs the standard O(n²) BPE merge loop.
Goal
Close the gap on ASCII throughput from ~140 MiB/s to 300+ MiB/s.
Research areas
- Aho-Corasick pre-tokenization — Build an Aho-Corasick automaton from the BPE vocabulary to match longest tokens in a single linear scan, bypassing the iterative merge. GitHub's bpe crate proves this approach works.
- SIMD-accelerated byte processing — Use System.Runtime.Intrinsics (AVX2/NEON) for:
  - Fast ASCII detection (skip UTF-8 decode for pure ASCII runs)
  - Vectorized pattern matching in the pre-tokenizer regex
  - Bulk byte classification
- Pre-computed merge tables — For common token pairs, store pre-computed merge results to skip the BPE priority queue entirely.
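The first idea can be sketched as a greedy longest-match scan over a byte trie built from the vocabulary. This is a simplified model, not a full Aho-Corasick automaton with failure links, and greedy longest-match alone does not always reproduce exact BPE merge order (GitHub's bpe crate adds extra machinery for that); the toy vocabulary below is hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}    # byte value -> TrieNode
        self.token_id = None  # set if a vocab token ends at this node


def build_trie(vocab):
    """vocab: dict mapping token bytes -> token id."""
    root = TrieNode()
    for token, tid in vocab.items():
        node = root
        for b in token:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = tid
    return root


def encode_longest_match(text: bytes, root) -> list:
    """Single left-to-right scan, always taking the longest vocab token."""
    out, i = [], 0
    while i < len(text):
        node, j = root, i
        best_id, best_end = None, i + 1
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            raise ValueError(f"byte {text[i]} not in vocab")
        out.append(best_id)
        i = best_end
    return out


# Hypothetical toy vocabulary: every single byte plus two merged tokens.
vocab = {bytes([b]): b for b in range(256)}
vocab[b"th"] = 256
vocab[b"the"] = 257
trie = build_trie(vocab)
print(encode_longest_match(b"the cat", trie))  # [257, 32, 99, 97, 116]
```

The key property is one pass over the input with work bounded by the longest vocabulary token, which is what gives the O(n) worst case.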
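For the SIMD direction, a scalar Python sketch can model what the vectorized scan would do: the AVX2 version would load 32 bytes, OR them together (or test sign bits via a movemask), and advance a whole chunk when no byte has the high bit set. The function name and chunk size here are illustrative, not from the codebase.

```python
def ascii_prefix_len(data: bytes, chunk: int = 32) -> int:
    """Scalar model of a SIMD ASCII scan: a chunk is 'clean' when no
    byte has the high bit (0x80) set, so UTF-8 decode can be skipped
    for the whole run. The real version would test 32 bytes per AVX2
    instruction instead of looping."""
    i, n = 0, len(data)
    while i + chunk <= n:
        if any(b >= 0x80 for b in data[i:i + chunk]):
            break
        i += chunk
    # Finish the tail (or the dirty chunk) byte by byte.
    while i < n and data[i] < 0x80:
        i += 1
    return i


data = "abc\u00e9def".encode("utf-8")  # b'abc\xc3\xa9def'
print(ascii_prefix_len(data))  # 3
```

Once the ASCII prefix length is known, that run can go straight to the byte-level tokenizer path with no UTF-8 decoding.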
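The merge-table idea amounts to replacing priority-queue pushes and pops with direct hash lookups on token pairs. A minimal sketch, with a hypothetical two-entry merge table standing in for one built from the real BPE merge list:

```python
# Hypothetical merge table: (left id, right id) -> (merge rank, merged id).
# In a real tokenizer this is built once from the BPE merge list; the
# direct lookup replaces per-pair priority-queue operations.
MERGES = {
    (ord("t"), ord("h")): (0, 256),   # "t" + "h"  -> "th"
    (256, ord("e")):      (1, 257),   # "th" + "e" -> "the"
}


def bpe_encode(ids):
    """Merge loop driven by table lookups instead of a heap: each pass
    finds the lowest-rank mergeable adjacent pair and applies it."""
    ids = list(ids)
    while True:
        best = None  # (index, rank, merged id)
        for i in range(len(ids) - 1):
            hit = MERGES.get((ids[i], ids[i + 1]))
            if hit is not None and (best is None or hit[0] < best[1]):
                best = (i, hit[0], hit[1])
        if best is None:
            return ids
        i, _, merged = best
        ids[i:i + 2] = [merged]


print(bpe_encode(b"the cat"))  # [257, 32, 99, 97, 116]
```

The scan over pairs is still quadratic in the worst case; the win targeted here is constant-factor, since the common path becomes a dictionary probe rather than heap maintenance.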
References