A from-scratch implementation of a BPE tokenizer in .NET, using GPT-4's regex split pattern. Built as a learning project to understand how modern LLM tokenizers actually work under the hood.
Before a language model can process text, it needs to convert it into numbers. A tokenizer is that bridge — it takes a string and returns a sequence of integer IDs that the model can work with.
The naive approach is to use characters or raw bytes. That works, but it's inefficient. A sentence like "Hello, world!" becomes 13 separate tokens. A smarter approach groups common sequences together — so "Hello" might be a single token. That's what BPE does.
Byte Pair Encoding was originally a data compression algorithm. The idea is simple: find the most frequent pair of adjacent symbols, merge them into a new symbol, and repeat.
Training:
- Start with the text split into individual UTF-8 bytes (256 possible tokens)
- Count how often every adjacent pair of tokens appears
- Merge the most frequent pair into a new single token (ID 256, then 257, etc.)
- Repeat until you hit your target vocabulary size
After training you have a list of ordered merge rules and a vocabulary that maps every token ID to its byte sequence.
Encoding:
Take new text, split it into bytes, then replay the merge rules in exactly the same order they were learned. The result is a compact sequence of token IDs.
Decoding:
Look up each token ID in the vocabulary, concatenate the bytes, and decode as UTF-8.
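
A tiny end-to-end trace makes this concrete (ASCII bytes, so "a" = 97 and "b" = 98; the regex pre-split described below doesn't come into play because "abab" is a single word):

```
Train on "abab" with vocabSize 258:
  bytes:    [97, 98, 97, 98]
  merge #1: (97, 98)   → 256    tokens: [256, 256]
  merge #2: (256, 256) → 257    tokens: [257]

Encode "abab": [97, 98, 97, 98] → [256, 256] → [257]
Encode "abba": [97, 98, 98, 97] → [256, 98, 97]   (rule #1 fires once, rule #2 never matches)
Decode [257]:  vocab[257] = vocab[256] + vocab[256] = "ab" + "ab" = "abab"
```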
A naive BPE implementation would happily learn a merge for "dog." and another for "dog!" and another for "dog?". Each punctuation variant becomes its own token, even though the word "dog" is the same in all three cases. This wastes vocabulary slots and makes the model less efficient.
The fix is to pre-split the text before learning merges, so that merges can never cross a word boundary.
This implementation uses the same split pattern as GPT-4's cl100k_base tokenizer:
```
'(?i:[sdmt]|ll|ve|re)|(?>[^\r\n\p{L}\p{N}]?)\p{L}+|\p{N}{1,3}| ?(?>[^\s\p{L}\p{N}]+)[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
```
It looks intimidating, but each branch handles a specific case:
| Branch | What it matches | Why |
|---|---|---|
| `'(?i:[sdmt]\|ll\|ve\|re)` | Contractions: `'s`, `'t`, `'d`, `'m`, `'ll`, `'ve`, `'re` | Splits "don't" into "don" + "'t" so the word root is separate |
| `(?>[^\r\n\p{L}\p{N}]?)\p{L}+` | Words, with optional leading punctuation | Keeps words as single chunks |
| `\p{N}{1,3}` | Numbers, max 3 digits at a time | Prevents merging across number groups like "123456" |
| ` ?(?>[^\s\p{L}\p{N}]+)[\r\n]*` | Punctuation sequences, with an optional leading space | Isolates symbols so they don't bleed into words |
| `\s*[\r\n]` | Newlines with optional leading whitespace | Handles line breaks cleanly |
| `\s+(?!\S)` | Trailing whitespace | Eats trailing spaces without consuming non-whitespace |
| `\s+` | Any other whitespace | Catch-all for remaining whitespace |
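Putting the branches together, a short sentence splits into chunks like this (a hand-worked illustration against the pattern above):

```
"Hello, world! It's 2024."
→ ["Hello", ",", " world", "!", " It", "'s", " ", "202", "4", "."]
```

Note that the space before "2024" becomes its own chunk: the word branch attaches a leading space to letters, but the number branch matches digits only.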
One .NET-specific gotcha: the original pattern uses possessive quantifiers (`?+`, `++`), which aren't supported by .NET's regex engine. They're converted to atomic groups (`(?>...)`), which have the same effect: they prevent backtracking into the group, which is important for both correctness and performance.
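In C#, the converted pattern can be declared directly (a sketch; the variable name is illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

// cl100k_base split pattern with possessive quantifiers rewritten
// as atomic groups (?>...) for .NET's regex engine.
var splitPattern = new Regex(
    @"'(?i:[sdmt]|ll|ve|re)|(?>[^\r\n\p{L}\p{N}]?)\p{L}+|\p{N}{1,3}| ?(?>[^\s\p{L}\p{N}]+)[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+",
    RegexOptions.Compiled);

// Each match is one chunk; merges never cross chunk boundaries.
foreach (Match m in splitPattern.Matches("Don't panic!"))
    Console.WriteLine($"[{m.Value}]");
// Output:
// [Don]
// ['t]
// [ panic]
// [!]
```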
The overall call flow:

```
Train(text, vocabSize)
 ├── SplitIntoChunks(text)        // Regex splits text into safe-to-merge chunks
 ├── GetPairFrequencies(chunk)    // Count adjacent pairs across all chunks
 ├── MergePair(chunk, pair, id)   // Apply the best merge to all chunks
 └── Repeat until vocabSize is reached

Encode(text)
 ├── SplitIntoChunks(text)        // Same regex split as training
 └── foreach merge rule: MergePair(chunk, pair, id)   // Replay in training order

Decode(ids)
 ├── Vocab[id] → bytes            // Look up each token's byte sequence
 └── UTF8.GetString(allBytes)     // Decode the concatenated bytes
```
The key insight is that encoding replays training-order merges. It doesn't just greedily match the longest token — it applies each merge rule one at a time, in the same sequence they were discovered. This guarantees consistency between what the model was trained on and what it sees at inference time.
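A sketch of that replay loop, assuming the merge rules live in an ordered list of ((int, int) pair, int id) entries (the shape and names are illustrative, not necessarily the project's actual types):

```csharp
using System.Collections.Generic;

// Replay learned merges over one chunk's tokens, in training order.
static List<int> ApplyMerges(List<int> tokens, List<((int Left, int Right) Pair, int Id)> merges)
{
    foreach (var (pair, id) in merges)  // training order, never longest-match
    {
        var merged = new List<int>(tokens.Count);
        for (int i = 0; i < tokens.Count; i++)
        {
            // Fuse every left-to-right occurrence of this rule's pair.
            if (i < tokens.Count - 1 && tokens[i] == pair.Left && tokens[i + 1] == pair.Right)
            {
                merged.Add(id);
                i++;  // skip the right element of the fused pair
            }
            else
            {
                merged.Add(tokens[i]);
            }
        }
        tokens = merged;
    }
    return tokens;
}
```

Encode then just runs this loop over every regex chunk and concatenates the results.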
Basic usage:

```csharp
var tokenizer = new BytePairEncodingTokenizer();

// Train on some text
tokenizer.Train("the quick brown fox jumps over the lazy dog", vocabSize: 300);

// Encode
int[] tokens = tokenizer.Encode("the fox");
// e.g. [116, 104, 101, 32, 258, 120] (with some merges applied)

// Decode
string text = tokenizer.Decode(tokens);
// "the fox"
```

Enable verbose output during training to watch merges happen in real time:

```csharp
tokenizer.Train(corpus, vocabSize: 500, verbose: true);
// Merge 1/244: (101, 32) -> 256 ("e ") [frequency: 312]
// Merge 2/244: (116, 104) -> 257 ("th") [frequency: 290]
// ...
```

Vocabulary size must be ≥ 256. The first 256 slots are reserved for raw bytes. You can't have fewer.
More vocabulary = fewer tokens per string. With a larger vocab, common words get compressed into a single token. The trade-off is memory and the number of examples needed to train each token well.
The tokenizer is stateless between calls. The learned Merges and Vocab are just public properties — you could serialize them and reload them later without retraining.
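A minimal sketch of that, reusing the assumed rule shape from the encode sketch above (the file name and property types are illustrative):

```csharp
using System.IO;

// Merge rules are ordered, so persist them in order, one rule per line.
// The vocab doesn't need saving: it can be rebuilt by replaying the
// rules on top of the 256 base byte tokens.
using var writer = new StreamWriter("merges.txt");
foreach (var ((left, right), id) in tokenizer.Merges)
    writer.WriteLine($"{left} {right} {id}");
```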
Training on short or non-repetitive text won't exhaust the merge budget. If there are no more pairs to merge, training stops early, even if vocabSize is higher than what was reached.
Encoding unknown text still works. Because the base vocabulary covers all 256 possible bytes, any UTF-8 string can always be encoded — worst case, it falls back to individual bytes for tokens it has never seen.
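For instance, after training on the sample sentence above, a word whose byte pairs never co-occurred in the training text simply stays byte-per-token (illustrative values):

```csharp
// "xyzq" shares no adjacent byte pair with the training sentence,
// so no merge rule fires and every UTF-8 byte remains its own token.
int[] ids = tokenizer.Encode("xyzq");
// → [120, 121, 122, 113]
```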
```
dotnet test
```

The test suite covers parameter validation, vocabulary growth, round-trip encode/decode, regex split behavior, Unicode and emoji handling, and compression properties across varying vocabulary sizes.
- Andrej Karpathy's minbpe — the Python implementation this project was based on, with an excellent accompanying lecture
- OpenAI tiktoken — the production BPE tokenizer used by GPT models
- Sennrich et al., 2015 — "Neural Machine Translation of Rare Words with Subword Units", the paper that introduced BPE for subword tokenization