BPE (Byte Pair Encoding)
Also known as: BPE / Byte Pair Encoding / バイトペアエンコーディング
A subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent pairs of bytes or characters. It forms the foundation of the tokenizers used by most major LLMs, including the GPT family.
Overview
BPE originated as a data compression algorithm (Gage, 1994) and was adapted for NLP subword segmentation in 2016 (Sennrich et al.). It iteratively merges the most frequent character pairs in a training corpus to build a fixed-size vocabulary, and handles out-of-vocabulary words by decomposing them into learned subwords. Almost all major LLMs (the GPT family, Llama, etc.) use BPE-based tokenizers.
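To make the merge loop concrete, here is a minimal pure-Python sketch of BPE vocabulary training. The toy corpus, word frequencies, and merge count are illustrative only; production tokenizers operate on raw bytes with far larger corpora and merge tables.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy corpus: words pre-split into characters, with their frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10  # real tokenizers learn tens of thousands of merges
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)  # the learned merge rule, e.g. ('e', 's')
```

The printed merge rules, in order, are the tokenizer's model: at inference time a word is segmented by replaying the same merges on its character sequence, so any unseen word still decomposes into known subwords.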
Comparison with SentencePiece / Unigram LM
SentencePiece is a tokenizer toolkit that implements both BPE and Unigram LM; Google's models (T5, Gemma) use it with the Unigram LM algorithm. Both BPE and Unigram LM are subword methods, but they differ in how the vocabulary is constructed: BPE builds it bottom-up by greedily merging frequent pairs, while Unigram LM starts from a large seed vocabulary and prunes the tokens whose removal least hurts the corpus likelihood. Both handle multilingual input and unknown words well.
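As an illustration of how the two algorithms are chosen in practice, the sketch below uses the sentencepiece Python package, whose model_type parameter switches between them. The corpus file name and vocabulary size are placeholders, not recommendations.

```python
import sentencepiece as spm

# Train two tokenizers on the same corpus, differing only in algorithm.
# "corpus.txt" and vocab_size=8000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe_model",
    vocab_size=8000, model_type="bpe")      # greedy pair merging
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="uni_model",
    vocab_size=8000, model_type="unigram")  # prune from a large seed vocab

# Unknown words are decomposed into learned subwords either way.
sp = spm.SentencePieceProcessor(model_file="uni_model.model")
print(sp.encode("unseen words are split into subwords", out_type=str))
```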