Understanding Tokenization Strategies for Large Language Models: BPE, WordPiece, and Unigram
Every time you type a sentence into a chatbot, the model doesn’t see words the way you do. It sees numbers. And before it gets to those numbers, it has to break your words apart, sometimes into several pieces. That’s tokenization. It’s the invisible first step that makes large language models work. Without it, models would struggle with rare words, long sentences, or languages that build words differently from English. The way a model splits text into tokens can make or break its performance, speed, and ability to handle languages beyond English.
Why Tokenization Matters More Than You Think
Think of tokenization like chopping wood. If you cut too big, you can’t fit it in the stove. Too small, and you waste time stacking kindling. Tokenization finds the sweet spot: breaking text into chunks that are small enough to process efficiently, but large enough to carry meaning.
Early models used whole words. But what happens when a word isn’t in the dictionary? Like "unbelievable" or "supercalifragilisticexpialidocious"? They’d fail. Or worse, they’d guess. That’s why modern LLMs use subword tokenization. Instead of memorizing every word, they learn to build words from smaller pieces. The model learns that "un-", "-believ-", and "-able" are common parts. Even if it’s never seen "unbelievability", it can still understand it by combining known pieces.
This isn’t just a technical trick. It’s what lets models handle typos, slang, technical terms, and even code. A tokenizer that can split "500GB" into "500" and "GB" helps a model understand both the number and the unit. A bad tokenizer might treat it as one unknown token and lose half the meaning.
Byte-Pair Encoding (BPE): The Most Common Approach
BPE is the workhorse of modern LLMs. It’s used in OpenAI’s GPT models, Anthropic’s Claude, and many open-source projects. Here’s how it works: start with every character as a token. Then look at your training data and find the most frequent pair of adjacent tokens. Merge that pair into a single new token. Repeat until you hit your target vocabulary size, typically 30,000 to 50,000 tokens, though newer models go higher.
For example, if "th" and "e" appear together a lot, BPE merges them into "the". Later, if "ing" and "s" show up together often, it makes "ings". Over time, it builds a vocabulary of common subwords: "un", "believ", "able", "ing", "ed".
Tokenizers also need a convention for marking where a word starts or continues. The "##" prefix you’ll see in examples comes from WordPiece (used in BERT): "un", "##believ", "##able" means one word built from three parts, with "##" flagging the pieces that continue a word. GPT-style byte-level BPE marks word boundaries differently, folding the leading space into the token itself.
BPE is simple, fast, and effective. It gives good coverage for English and works well on general text. But it doesn’t care about language structure. It just counts pairs. That’s why it’s not always the best for languages with complex grammar.
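If you want to see that merge loop concretely, here’s a minimal from-scratch sketch. The tiny corpus and the number of merges are made up for illustration; a real trainer runs over a huge corpus and performs tens of thousands of merges:

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with a frequency.
# A real BPE trainer builds this table from a large text corpus.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def count_pairs(corpus):
    """Count how often each adjacent pair of symbols occurs."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = merged.get(tuple(new_word), 0) + freq
    return merged

for _ in range(10):  # real vocabularies come from tens of thousands of merges
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(corpus, best)
    print("merged:", best)
```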
WordPiece: The Google Way
Google’s BERT model uses WordPiece instead of BPE. The difference? WordPiece doesn’t just pick the most frequent pair. It picks the merge that most increases the likelihood of the training data.
Imagine you have the sentence: "I love running."
BPE might merge "run" and "ning" because they appear together often. WordPiece asks: "Does merging them make the whole sentence more probable?" It uses a statistical model to calculate the probability of each possible merge and picks the one that improves the overall score the most.
This makes WordPiece better at preserving meaningful linguistic units. It’s more likely to keep "running" as one token if that’s a common word in context-even if "run" and "ning" appear separately elsewhere.
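That selection rule is usually summarized as a score that divides a pair’s frequency by the frequencies of its two parts, so pieces that almost always appear together win out over pieces that are simply common everywhere. The counts below are invented purely to show the comparison:

```python
def wordpiece_score(pair_count, first_count, second_count):
    # WordPiece-style merge score: how often the pair occurs, relative to how
    # often its parts occur on their own. Pieces that rarely appear apart
    # score higher than pieces that are just frequent everywhere.
    return pair_count / (first_count * second_count)

# Invented counts, just to show the comparison.
print(wordpiece_score(pair_count=120, first_count=200, second_count=150))    # "run" + "ning"
print(wordpiece_score(pair_count=500, first_count=9000, second_count=8000))  # "e" + "s"
```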
Studies show WordPiece has a higher average token count per word: about 1.7 tokens per word, compared to BPE’s 1.4. That means it splits words into more pieces. That can be good for tasks needing fine-grained understanding, like question answering. But it also means longer sequences, which slow down processing.
Unigram: The Compression Champion
Unigram tokenization flips the script. Instead of starting small and merging, it starts big, with every plausible word and subword as a candidate, and then removes the least useful ones.
It uses a probability model. Each candidate token gets a score based on how much it contributes to the likelihood of the training data. The trainer then prunes the tokens whose removal hurts that likelihood the least, typically a fraction of the vocabulary at a time, re-estimates the probabilities, and repeats until it hits the target vocabulary size.
This approach is more efficient. On machine code and technical text, Unigram uses 22% fewer tokens than BPE and 31% fewer than WordPiece. That means faster inference and lower memory use.
It’s especially good at handling repetitive patterns. In code, for example, "for", "while", "if" appear often. Unigram learns to keep them as single tokens. But it also knows that "var_name_123" is less common and might split it into "var", "_", "name", "_", "123".
Unigram is less popular in big commercial models, but it’s growing in research and specialized applications. Benchmarks run with Hugging Face’s tokenizers library suggest it performs best on datasets with high repetition, like logs, code, or structured data.
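If you want to experiment, Hugging Face’s tokenizers library ships a Unigram model and trainer. Here’s a minimal sketch; the training file, the vocabulary size, and the sample line of code are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an empty Unigram model and train it on your own text.
# "code_corpus.txt" and the vocabulary size are placeholders.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)

# See how a line of code gets segmented.
print(tokenizer.encode("for i in range(10): total += values[i]").tokens)
```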
How Vocabulary Size Affects Performance
Choosing how many tokens to keep isn’t arbitrary. It’s a trade-off.
A small vocabulary (say 8,000 tokens) means faster processing. But it also means more splitting. "unbelievable" becomes "un", "##be", "##liev", "##able": four tokens. That’s more work for the model.
A large vocabulary (128,000 tokens, as in Llama 3.2) keeps more words whole. "unbelievable" might be one token. That’s faster. But it uses more memory. And if you train on English-heavy data, much of that space goes to rare English words instead of the tokens other languages need.
Most open-source models use 32,000 tokens. GPT-4 uses roughly 100,000. Llama 3.2 uses 128,000. Why the jump? Meta wanted better multilingual support. A bigger vocabulary means more room for non-English words, symbols, and morphological variations.
But there’s a catch: larger vocabularies don’t always mean better performance. A 2024 arXiv study found that beyond 50,000 tokens, gains in accuracy plateau. The real benefit is in handling rare languages, not boosting English accuracy.
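You can check this trade-off yourself: load a couple of pre-trained tokenizers, compare their vocabulary sizes, and see how each one splits a word. The checkpoints below are just examples, and the exact splits depend on each model’s learned vocabulary:

```python
from transformers import AutoTokenizer

# Example checkpoints; any fast tokenizer works. The exact splits depend on
# each model's learned vocabulary, so treat the output as illustrative.
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize("unbelievable")
    print(f"{name}: vocab size {tok.vocab_size}, split into {pieces}")
```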
Language Bias Is Real, and It’s Costly
Most LLMs are trained mostly on English. That’s not a coincidence. English is the most common language on the internet. But it creates a hidden problem: tokenization bias.
English words average 1.3 tokens per word. Turkish? 2.1. Swahili? Up to 2.5. Why? Turkish builds long words with suffixes. Swahili has complex verb forms. But if your tokenizer was trained on English, it doesn’t know how to handle them efficiently. It splits them into more pieces than needed.
That means longer sequences. Longer sequences mean slower inference. Slower inference means higher costs. Cisco found non-English requests cost up to 43% more in compute resources. Reddit users reported a 23% performance drop on Swahili text in Llama 3 models because the tokenizer didn’t have enough space for Swahili-specific subwords.
OpenAI and Google have started addressing this. GPT-4’s tokenization is more balanced. Gemini uses a 1.25 tokens-per-word ratio across languages. Llama 3.2 cut the English-to-Swahili gap from 37% to 22% by expanding its vocabulary and retraining on diverse data.
But most models still favor English. If you’re building a global app, don’t assume your tokenizer will work well in Hindi, Arabic, or Finnish. Test it. Measure the token count. Watch for inflated sequence lengths.
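A quick way to run that test is to measure tokens per word on sample text in each language you care about. The sentences below are rough placeholders, and "gpt2" stands in for whatever tokenizer you actually plan to ship:

```python
from transformers import AutoTokenizer

# Placeholder samples; in practice, use representative text from your users.
samples = {
    "english": "The weather is beautiful today and we are going for a walk.",
    "turkish": "Bugün hava çok güzel ve yürüyüşe çıkıyoruz.",
    "swahili": "Leo hali ya hewa ni nzuri na tunaenda kutembea.",
}

tok = AutoTokenizer.from_pretrained("gpt2")  # swap in the tokenizer you plan to ship
for lang, text in samples.items():
    ratio = len(tok.tokenize(text)) / len(text.split())
    print(f"{lang}: {ratio:.2f} tokens per word")
```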
When to Use What
So which one should you pick?
- Use BPE if you’re building a general-purpose model. It’s reliable, fast, and widely supported. Great for chatbots, content generation, and English-heavy tasks.
- Use WordPiece if you need fine-grained understanding. Good for search, question answering, or tasks where word structure matters. Used in BERT-style models for NLP benchmarks.
- Use Unigram if you care about efficiency. Code analysis, log parsing, or embedded systems? Unigram saves memory and speeds up inference. It’s the quiet winner in specialized domains.
And here’s a pro tip: don’t train your own tokenizer from scratch unless you have to. Start with a pre-trained one from Hugging Face and fine-tune it on your data. That’s what companies like Nebius do, and it cuts implementation time by 63%.
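If you go that route, one way to do it with the transformers library is train_new_from_iterator, which reuses a pre-trained fast tokenizer’s pipeline while relearning its vocabulary from your own text. The base checkpoint, the toy corpus, and the vocabulary size below are all placeholders to adapt to your project:

```python
from transformers import AutoTokenizer

# Start from an existing fast tokenizer and relearn its vocabulary on your
# own corpus. The base checkpoint, corpus, and vocab size are placeholders.
base = AutoTokenizer.from_pretrained("gpt2")

my_corpus = [
    "INFO 2025-01-01 service started on port 8080",
    "ERROR connection to db-replica-2 timed out",
]  # in practice: an iterator over your real domain text

domain_tok = base.train_new_from_iterator(my_corpus, vocab_size=16000)
domain_tok.save_pretrained("domain-tokenizer")
```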
What’s Next? The Future of Tokenization
Tokenization is evolving. Researchers are testing morphologically aware tokenizers that start with known prefixes and suffixes. Instead of letting BPE guess, they feed it linguistic rules. Early results show 9.7% better performance on morphological tasks.
Others are experimenting with neural tokenization-models that learn to split text directly from raw bytes, without a fixed vocabulary. Imagine a tokenizer that doesn’t need to store 128,000 tokens. It just learns how to chop text on the fly. Gartner predicts this could eliminate fixed vocabularies by 2026.
Dynamic allocation is another frontier. Instead of giving 50,000 slots to English and 1,000 to Swahili, the model adjusts on the fly. If a user types in Korean, the tokenizer shifts resources to Korean subwords. That could reduce non-English token inflation by 15-20%.
One thing’s clear: tokenization is no longer a side note. It’s a core design choice. The right tokenizer can make your model faster, cheaper, and more inclusive. The wrong one? It’ll silently hurt performance, especially for users who don’t speak English.
Practical Tips for Implementation
- Always test your tokenizer on your target languages. Don’t assume it works.
- Normalize your data. Replace variable addresses in code with placeholders like "ADDR_1". This cuts token outliers by 9%.
- Watch whitespace. Extra spaces can inflate token counts by 15-20%. Clean them before tokenizing.
- Start with pre-trained tokenizers. Hugging Face has them for BPE, WordPiece, and Unigram. Fine-tune, don’t train from scratch.
- Measure token count per word in your dataset. If it’s over 2.0 for English, your tokenizer is too aggressive. (A quick sketch covering this and the two normalization tips follows this list.)
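Here’s a small sketch tying the normalization and measurement tips together. The address regex, the sample data, and the checkpoint are assumptions to adapt to your own setup:

```python
import re
from transformers import AutoTokenizer

def normalize(text):
    # Collapse runs of whitespace (extra spaces inflate token counts).
    text = re.sub(r"\s+", " ", text).strip()
    # Replace hex-style addresses with a placeholder; the pattern is an
    # assumption, so adapt it to whatever "addresses" look like in your data.
    return re.sub(r"0x[0-9a-fA-F]+", "ADDR_1", text)

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
docs = ["load   value from 0x7ffde4032a10   into register"]  # placeholder data

total_tokens = total_words = 0
for doc in docs:
    clean = normalize(doc)
    total_tokens += len(tok.tokenize(clean))
    total_words += len(clean.split())

print(f"{total_tokens / total_words:.2f} tokens per word")  # over ~2.0 on English is a warning sign
```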
Tokenization isn’t glamorous. But it’s the foundation. Get it right, and your model works better for everyone. Get it wrong, and it works poorly for everyone but English speakers.
What is tokenization in LLMs?
Tokenization is the process of breaking text down into smaller units, called tokens, that a language model can process. Instead of treating whole words as single units, modern models use subword tokenization to handle rare, unknown, or complex words by splitting them into common parts like "un-", "-believ", and "-able".
What’s the difference between BPE and WordPiece?
BPE merges the most frequent pairs of tokens, regardless of context. WordPiece chooses merges based on statistical likelihood: how much each merge improves the probability of the training data under its model. WordPiece tends to preserve more meaningful word parts, while BPE is simpler and faster.
Which tokenization method is best for code?
Unigram tokenization performs best on code and technical text because it prioritizes compression efficiency. It uses fewer tokens per instruction, reducing memory and processing load. Studies show it’s 22% more efficient than BPE on machine code datasets.
Why do non-English languages need more tokens?
Languages like Turkish, Finnish, and Swahili build long words using prefixes and suffixes. A single word can carry the meaning of an entire English sentence. If the tokenizer was trained mostly on English, it doesn’t have enough subword tokens to represent these structures efficiently-so it splits them into more pieces, increasing token count by up to 61%.
Can I train my own tokenizer?
Yes, but it’s rarely necessary. Training a tokenizer from scratch takes 2-3 weeks of engineering work. Most developers start with a pre-trained tokenizer from Hugging Face and fine-tune it on their domain data. This reduces time and improves performance without the complexity of full training.
Does vocabulary size affect model speed?
Yes. Larger vocabularies (like 128,000 tokens) require more memory and increase the size of the embedding layer, slowing down inference. Smaller vocabularies (like 32,000) are faster but may split words too much. The sweet spot for most models is 30,000-50,000 tokens, balancing speed and coverage.
Written by Collin Pace