
BPE (Byte Pair Encoding)

Also known as: BPE / Byte Pair Encoding / バイトペアエンコーディング

A subword tokenization algorithm that iteratively merges the most frequent byte/character pairs to build a vocabulary. Used in most major LLMs including the GPT family as the foundation of their tokenizers.


Overview

BPE originated as a data compression algorithm and was adopted for NLP subword segmentation in 2016. It iteratively merges the most frequent character pairs to build a fixed-size vocabulary, handling out-of-vocabulary words by decomposing them into learned subwords. Almost all major LLMs (GPT family, Llama, etc.) use BPE-based tokenizers.
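As a concrete illustration, below is a minimal, self-contained sketch of BPE vocabulary learning on a toy corpus. The corpus, the number of merges, and the end-of-word marker "</w>" are illustrative assumptions, not the exact procedure of any particular LLM tokenizer.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters plus an
# end-of-word marker (the marker is an illustrative convention).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w) + ("</w>",): f for w, f in corpus.items()}

num_merges = 10  # real tokenizers learn tens of thousands of merges
merges = []
for _ in range(num_merges):
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)

print(merges)  # e.g. [('e', 's'), ('es', 't'), ('est', '</w>'), ...]
```

At encoding time, the learned merges are applied to new text in the order they were learned, so a word never seen during training still decomposes into known subwords instead of becoming an unknown token.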

Comparison with SentencePiece / Unigram LM

Google's models (T5, Gemma) use SentencePiece with the Unigram LM algorithm. Both BPE and Unigram LM are subword methods, but they build their vocabularies in opposite directions: BPE starts from individual characters (or bytes) and greedily merges the most frequent pairs bottom-up, while Unigram LM starts from a large seed vocabulary and prunes the tokens that contribute least to the corpus likelihood. Both handle multilingual input and unknown words well.
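To make the practical difference concrete, the sketch below trains one tokenizer of each type with the SentencePiece Python library, changing only the model_type parameter. The corpus path, vocabulary size, and model prefixes are placeholder assumptions.

```python
import sentencepiece as spm

# Train two tokenizers on the same (placeholder) corpus, differing only in algorithm.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe_model",
    vocab_size=8000, model_type="bpe")      # greedy pair merging
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram_model",
    vocab_size=8000, model_type="unigram")  # likelihood-based pruning

sp_bpe = spm.SentencePieceProcessor(model_file="bpe_model.model")
sp_uni = spm.SentencePieceProcessor(model_file="unigram_model.model")

text = "Byte pair encoding handles unseen words gracefully."
print(sp_bpe.encode(text, out_type=str))   # subword pieces from the BPE vocabulary
print(sp_uni.encode(text, out_type=str))   # pieces chosen by the Unigram LM
```

The two tokenizers may segment the same sentence differently, but both fall back to smaller known pieces for rare or unseen words rather than emitting an unknown token.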
