Investigate Aho-Corasick pre-tokenization for ASCII throughput #118

@HavenDV

Description

Context

Our current throughput on ASCII-dominant text is 120-143 MiB/s (CountTokens/Encode on Python code and the Bitcoin whitepaper). On cached multilingual/CJK text we reach 571 MiB/s, which is competitive with the top Rust implementations.

However, GitHub's bpe crate achieves ~400-500 MiB/s on general text by using Aho-Corasick for pre-tokenization, giving it O(n) worst-case complexity versus the O(n²) worst case of the standard iterative BPE merge loop.

Goal

Raise ASCII throughput from ~140 MiB/s to 300+ MiB/s, closing most of the gap to GitHub's bpe crate.

Research areas

  1. Aho-Corasick pre-tokenization — Build an Aho-Corasick automaton from the BPE vocabulary to match longest tokens in a single linear scan, bypassing iterative merge. GitHub's bpe crate proves this approach works.

  2. SIMD-accelerated byte processing — Use System.Runtime.Intrinsics (AVX2/NEON) for:

    • Fast ASCII detection (skip UTF-8 decode for pure ASCII runs)
    • Vectorized pattern matching in the pre-tokenizer regex
    • Bulk byte classification
  3. Pre-computed merge tables — For common token pairs, store pre-computed merge results to skip the BPE priority queue entirely.
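To make item 1 concrete, here is a minimal Python sketch of a longest-match scan over a vocabulary trie (all names hypothetical; the real implementation would be C# over a proper Aho-Corasick automaton). Note that greedy longest match is not always identical to BPE's merge order, so a production implementation must reconcile the scan with the merge rules; this sketch omits that.

```python
class TrieNode:
    __slots__ = ("children", "token_id")
    def __init__(self):
        self.children = {}
        self.token_id = None  # set if a vocabulary token ends here

def build_trie(vocab):
    """vocab: dict mapping token bytes -> token id."""
    root = TrieNode()
    for token, tid in vocab.items():
        node = root
        for b in token:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = tid
    return root

def tokenize_longest(data, root):
    """Greedy leftmost-longest scan: one forward pass, no merge loop."""
    out = []
    i = 0
    while i < len(data):
        node, j = root, i
        last_id, last_end = None, i
        # Walk the trie as far as the input allows, remembering the
        # last position where a complete vocabulary token ended.
        while j < len(data) and data[j] in node.children:
            node = node.children[data[j]]
            j += 1
            if node.token_id is not None:
                last_id, last_end = node.token_id, j
        if last_id is None:  # no vocab token starts here; emit raw byte
            out.append(("byte", data[i]))
            i += 1
        else:
            out.append(("tok", last_id))
            i = last_end
    return out
```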
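Item 2's ASCII fast path can be prototyped without intrinsics: a byte is ASCII iff its high bit is clear, and the same mask test that AVX2/NEON would apply to 32/16 bytes per instruction works word-at-a-time (SWAR). A sketch of the idea in Python (helper name hypothetical):

```python
ASCII_MASK = 0x8080808080808080  # high bit of each byte in a 64-bit word

def ascii_prefix_len(data: bytes) -> int:
    """Length of the leading pure-ASCII run, checked 8 bytes at a time."""
    i = 0
    # Bulk scan: a 64-bit word is all-ASCII iff (word & ASCII_MASK) == 0.
    while i + 8 <= len(data):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & ASCII_MASK:
            break
        i += 8
    # Scalar tail: finish byte by byte.
    while i < len(data) and data[i] < 0x80:
        i += 1
    return i
```

The C# version would do the same test with `Vector128`/`Vector256` loads, letting the tokenizer skip UTF-8 decoding for the whole ASCII run.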
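A sketch of item 3, assuming tokens and merge rules are held as integer ids (all names hypothetical): the ordered merge rules are flattened into a hash table keyed by the pair, so the encode loop does one lookup per adjacent pair instead of re-deriving priorities through a queue.

```python
def build_merge_table(merges):
    """merges: ordered list of ((left_id, right_id), merged_id) BPE rules.

    Returns a dict mapping each pair to (rank, merged_id), so the hot
    path is a single hash lookup per adjacent pair.
    """
    return {pair: (rank, merged) for rank, (pair, merged) in enumerate(merges)}

def bpe_merge(ids, table):
    """Standard BPE: repeatedly apply the lowest-ranked applicable merge,
    found via the pre-computed table rather than a priority queue."""
    ids = list(ids)
    while True:
        best = None  # ((rank, merged_id), index) of best applicable merge
        for i in range(len(ids) - 1):
            hit = table.get((ids[i], ids[i + 1]))
            if hit is not None and (best is None or hit[0] < best[0][0]):
                best = (hit, i)
        if best is None:
            return ids
        (rank, merged), i = best
        ids[i:i + 2] = [merged]
```

This still rescans the sequence per merge; the point of the table is that each scan step is O(1), and the most frequent pair results can additionally be cached across calls.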
