BPE (Byte Pair Encoding)
Also known as: BPE / Byte Pair Encoding / バイトペアエンコーディング
A subword tokenization algorithm that builds a vocabulary by iteratively merging the most frequent pairs of bytes or characters. It forms the foundation of the tokenizers used by most major LLMs, including the GPT family.
Overview
BPE originated as a data compression algorithm (Gage, 1994) and was adapted for NLP subword segmentation in 2016 (Sennrich et al.). It iteratively merges the most frequent character pairs in a training corpus to build a fixed-size vocabulary, and handles out-of-vocabulary words by decomposing them into learned subwords. Almost all major LLMs (the GPT family, Llama, etc.) use BPE-based tokenizers.
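To make the merge loop concrete, here is a minimal pure-Python sketch of BPE vocabulary training. The toy corpus, word frequencies, and merge count are illustrative only; production tokenizers operate on raw bytes with far larger corpora and merge tables.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy corpus: words pre-split into characters, with their frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10  # real tokenizers learn tens of thousands of merges
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)  # the learned merge rule, e.g. ('e', 's')
```

The printed merge rules, in order, are the tokenizer's model: at inference time a word is segmented by replaying the same merges on its character sequence, so any unseen word still decomposes into known subwords.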
Comparison with SentencePiece / Unigram LM
SentencePiece is a tokenizer toolkit that implements both BPE and Unigram LM; Google's models (T5, Gemma) use it with the Unigram LM algorithm. Both BPE and Unigram LM are subword methods, but they differ in how the vocabulary is constructed: BPE builds it bottom-up by greedily merging frequent pairs, while Unigram LM starts from a large seed vocabulary and prunes the tokens whose removal least hurts the corpus likelihood. Both handle multilingual input and unknown words well.
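As an illustration of how the two algorithms are chosen in practice, the sketch below uses the sentencepiece Python package, whose model_type parameter switches between them. The corpus file name and vocabulary size are placeholders, not recommendations.

```python
import sentencepiece as spm

# Train two tokenizers on the same corpus, differing only in algorithm.
# "corpus.txt" and vocab_size=8000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe_model",
    vocab_size=8000, model_type="bpe")      # greedy pair merging
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="uni_model",
    vocab_size=8000, model_type="unigram")  # prune from a large seed vocab

# Unknown words are decomposed into learned subwords either way.
sp = spm.SentencePieceProcessor(model_file="uni_model.model")
print(sp.encode("unseen words are split into subwords", out_type=str))
```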