A from-scratch implementation of a BPE tokenizer in .NET, using GPT-4's regex split pattern. Built as a learning project to understand how modern LLM tokenizers actually work under the hood.
Before a language model can process text, it needs to convert it into numbers. A tokenizer is that bridge — it takes a string and returns a sequence of integer IDs that the model can work with.
The naive approach is to use characters or raw bytes. That works, but it's inefficient. A sentence like "Hello, world!" becomes 13 separate tokens. A smarter approach groups common sequences together — so "Hello" might be a single token. That's what BPE does.
Byte Pair Encoding was originally a data compression algorithm. The idea is simple: find the most frequent pair of adjacent symbols, merge them into a new symbol, and repeat.
Training:
- Start with the text split into individual UTF-8 bytes (256 possible tokens)
- Count how often every adjacent pair of tokens appears
- Merge the most frequent pair into a new single token (ID 256, then 257, etc.)
- Repeat until you hit your target vocabulary size
After training you have a list of ordered merge rules and a vocabulary that maps every token ID to its byte sequence.
Encoding:
Take new text, split it into bytes, then replay the merge rules in exactly the same order they were learned. The result is a compact sequence of token IDs.
Decoding:
Look up each token ID in the vocabulary, concatenate the bytes, and decode as UTF-8.
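
A tiny end-to-end trace makes this concrete (ASCII bytes, so "a" = 97 and "b" = 98; the regex pre-split described below doesn't come into play because "abab" is a single word):

```
Train on "abab" with vocabSize 258:
  bytes:    [97, 98, 97, 98]
  merge #1: (97, 98)   → 256    tokens: [256, 256]
  merge #2: (256, 256) → 257    tokens: [257]

Encode "abab": [97, 98, 97, 98] → [256, 256] → [257]
Encode "abba": [97, 98, 98, 97] → [256, 98, 97]   (rule #1 fires once, rule #2 never matches)
Decode [257]:  vocab[257] = vocab[256] + vocab[256] = "ab" + "ab" = "abab"
```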
A naive BPE implementation would happily learn a merge for "dog." and another for "dog!" and another for "dog?". Each punctuation variant becomes its own token, even though the word "dog" is the same in all three cases. This wastes vocabulary slots and makes the model less efficient.
The fix is to pre-split the text before learning merges, so that merges can never cross a word boundary.
This implementation uses the same split pattern as GPT-4's cl100k_base tokenizer:
```
'(?i:[sdmt]|ll|ve|re)|(?>[^\r\n\p{L}\p{N}]?)\p{L}+|\p{N}{1,3}| ?(?>[^\s\p{L}\p{N}]+)[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
```
It looks intimidating, but each branch handles a specific case:
| Branch | What it matches | Why |
|---|---|---|
| `'(?i:[sdmt]\|ll\|ve\|re)` | Contractions: `'s`, `'t`, `'d`, `'m`, `'ll`, `'ve`, `'re` | Splits "don't" into "don" + "'t" so the word root is separate |
| `(?>[^\r\n\p{L}\p{N}]?)\p{L}+` | Words, with optional leading punctuation | Keeps words as single chunks |
| `\p{N}{1,3}` | Numbers, max 3 digits at a time | Prevents merging across number groups like "123456" |
| ` ?(?>[^\s\p{L}\p{N}]+)[\r\n]*` | Punctuation sequences, with an optional leading space | Isolates symbols so they don't bleed into words |
| `\s*[\r\n]` | Newlines with optional leading whitespace | Handles line breaks cleanly |
| `\s+(?!\S)` | Trailing whitespace | Eats trailing spaces without consuming non-whitespace |
| `\s+` | Any other whitespace | Catch-all for remaining whitespace |
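Putting the branches together, a short sentence splits into chunks like this (a hand-worked illustration against the pattern above):

```
"Hello, world! It's 2024."
→ ["Hello", ",", " world", "!", " It", "'s", " ", "202", "4", "."]
```

Note that the space before "2024" becomes its own chunk: the word branch attaches a leading space to letters, but the number branch matches digits only.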
One .NET-specific gotcha: the original pattern uses possessive quantifiers (`?+`, `++`), which aren't supported by .NET's regex engine. They're converted to atomic groups (`(?>...)`), which have the same effect: they prevent backtracking into the group, which is important for both correctness and performance.
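In C#, the converted pattern can be declared directly (a sketch; the variable name is illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

// cl100k_base split pattern with possessive quantifiers rewritten
// as atomic groups (?>...) for .NET's regex engine.
var splitPattern = new Regex(
    @"'(?i:[sdmt]|ll|ve|re)|(?>[^\r\n\p{L}\p{N}]?)\p{L}+|\p{N}{1,3}| ?(?>[^\s\p{L}\p{N}]+)[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+",
    RegexOptions.Compiled);

// Each match is one chunk; merges never cross chunk boundaries.
foreach (Match m in splitPattern.Matches("Don't panic!"))
    Console.WriteLine($"[{m.Value}]");
// Output:
// [Don]
// ['t]
// [ panic]
// [!]
```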
The overall call flow:

```
Train(text, vocabSize)
 ├── SplitIntoChunks(text)        // Regex splits text into safe-to-merge chunks
 ├── GetPairFrequencies(chunk)    // Count adjacent pairs across all chunks
 ├── MergePair(chunk, pair, id)   // Apply the best merge to all chunks
 └── Repeat until vocabSize is reached

Encode(text)
 ├── SplitIntoChunks(text)        // Same regex split as training
 └── foreach merge rule: MergePair(chunk, pair, id)   // Replay in training order

Decode(ids)
 ├── Vocab[id] → bytes            // Look up each token's byte sequence
 └── UTF8.GetString(allBytes)     // Decode the concatenated bytes
```
The key insight is that encoding replays training-order merges. It doesn't just greedily match the longest token — it applies each merge rule one at a time, in the same sequence they were discovered. This guarantees consistency between what the model was trained on and what it sees at inference time.
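A sketch of that replay loop, assuming the merge rules live in an ordered list of ((int, int) pair, int id) entries (the shape and names are illustrative, not necessarily the project's actual types):

```csharp
using System.Collections.Generic;

// Replay learned merges over one chunk's tokens, in training order.
static List<int> ApplyMerges(List<int> tokens, List<((int Left, int Right) Pair, int Id)> merges)
{
    foreach (var (pair, id) in merges)  // training order, never longest-match
    {
        var merged = new List<int>(tokens.Count);
        for (int i = 0; i < tokens.Count; i++)
        {
            // Fuse every left-to-right occurrence of this rule's pair.
            if (i < tokens.Count - 1 && tokens[i] == pair.Left && tokens[i + 1] == pair.Right)
            {
                merged.Add(id);
                i++;  // skip the right element of the fused pair
            }
            else
            {
                merged.Add(tokens[i]);
            }
        }
        tokens = merged;
    }
    return tokens;
}
```

Encode then just runs this loop over every regex chunk and concatenates the results.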
Basic usage:

```csharp
var tokenizer = new BytePairEncodingTokenizer();

// Train on some text
tokenizer.Train("the quick brown fox jumps over the lazy dog", vocabSize: 300);

// Encode
int[] tokens = tokenizer.Encode("the fox");
// e.g. [116, 104, 101, 32, 258, 120] (with some merges applied)

// Decode
string text = tokenizer.Decode(tokens);
// "the fox"
```

Enable verbose output during training to watch merges happen in real time:

```csharp
tokenizer.Train(corpus, vocabSize: 500, verbose: true);
// Merge 1/244: (101, 32) -> 256 ("e ") [frequency: 312]
// Merge 2/244: (116, 104) -> 257 ("th") [frequency: 290]
// ...
```

Vocabulary size must be ≥ 256. The first 256 slots are reserved for raw bytes. You can't have fewer.
More vocabulary = fewer tokens per string. With a larger vocab, common words get compressed into a single token. The trade-off is memory and the number of examples needed to train each token well.
The tokenizer is stateless between calls. The learned Merges and Vocab are just public properties — you could serialize them and reload them later without retraining.
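A minimal sketch of that, reusing the assumed rule shape from the encode sketch above (the file name and property types are illustrative):

```csharp
using System.IO;

// Merge rules are ordered, so persist them in order, one rule per line.
// The vocab doesn't need saving: it can be rebuilt by replaying the
// rules on top of the 256 base byte tokens.
using var writer = new StreamWriter("merges.txt");
foreach (var ((left, right), id) in tokenizer.Merges)
    writer.WriteLine($"{left} {right} {id}");
```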
Training on short or non-repetitive text won't exhaust the merge budget. If there are no more pairs to merge, training stops early, even if vocabSize is higher than what was reached.
Encoding unknown text still works. Because the base vocabulary covers all 256 possible bytes, any UTF-8 string can always be encoded — worst case, it falls back to individual bytes for tokens it has never seen.
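For instance, after training on the sample sentence above, a word whose byte pairs never co-occurred in the training text simply stays byte-per-token (illustrative values):

```csharp
// "xyzq" shares no adjacent byte pair with the training sentence,
// so no merge rule fires and every UTF-8 byte remains its own token.
int[] ids = tokenizer.Encode("xyzq");
// → [120, 121, 122, 113]
```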
```
dotnet test
```

The test suite covers parameter validation, vocabulary growth, round-trip encode/decode, regex split behavior, Unicode and emoji handling, and compression properties across varying vocabulary sizes.
- Andrej Karpathy's minbpe — the Python implementation this project was based on, with an excellent accompanying lecture
- OpenAI tiktoken — the production BPE tokenizer used by GPT models
- Sennrich et al., 2015 — "Neural Machine Translation of Rare Words with Subword Units", the paper that introduced BPE for subword tokenization