Tokenization
Also known as: トークン化 / トークナイゼーション (Japanese)
The pre-processing step that splits text into tokens (sub-word units, characters, or symbols) that the LLM operates on. Tokenization design affects model performance, cost, and multilingual capability.
Overview
Tokenization converts raw text into a sequence of token IDs that the LLM processes. In English, one token corresponds to roughly four characters. In Japanese and Chinese, depending on the tokenizer, a single character often maps to multiple tokens, which reduces effective context-window capacity and raises API cost per character. BPE (Byte Pair Encoding) is the dominant algorithm.
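To make the character-per-token difference concrete, here is a minimal sketch using OpenAI's tiktoken library. The choice of the cl100k_base encoding and the sample strings are illustrative assumptions; exact counts vary by tokenizer and text.

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is a byte-level BPE encoding used by several OpenAI models
# (chosen here for illustration; other models ship their own tokenizers).
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Tokenization splits text into tokens.",
    "Japanese": "トークン化はテキストをトークンに分割します。",
}

for lang, text in samples.items():
    ids = enc.encode(text)
    # chars/token: English typically lands near 4; Japanese is much lower,
    # meaning more tokens (and cost) for the same number of characters.
    print(f"{lang}: {len(text)} chars -> {len(ids)} tokens "
          f"({len(text) / len(ids):.2f} chars/token)")
```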
Practical note
API pricing is per token, and Japanese-language content consumes more tokens per character than English, so the same content costs more in Japanese. When building RAG over Japanese documents, measure chunk sizes in tokens rather than characters to avoid context-window overruns, as in the sketch below.
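The following sketch chunks text by token count with a small overlap. The helper name chunk_by_tokens, the cl100k_base encoding, and the 512-token/50-token defaults are illustrative assumptions, not a prescribed configuration.

```python
# Requires: pip install tiktoken
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    # Hypothetical helper: split text into chunks of at most max_tokens
    # tokens, overlapping by `overlap` tokens so context spans boundaries.
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    ids = enc.encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(enc.decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break  # final window already covers the tail
    return chunks

# Usage: every chunk fits a known token budget regardless of language.
for chunk in chunk_by_tokens("長い日本語ドキュメント。" * 200, max_tokens=64, overlap=8):
    print(len(chunk))
```

One caveat: slicing token IDs can split a multi-byte character at a chunk boundary, in which case tiktoken's decode substitutes a replacement character there. In practice you may prefer to split on sentence boundaries first and pack sentences up to the token budget.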