株式会社オブライト
AI | 2026-05-17

Tokenization

Also known as: Tokenization / トークン化 / トークナイゼーション

The pre-processing step that splits text into tokens (sub-word units, characters, or symbols) that the LLM operates on. Tokenization design affects model performance, cost, and multilingual capability.


Overview

Tokenization converts raw text into a sequence of token IDs that the LLM processes. In English, 1 token ≈ 4 characters. In Japanese and Chinese, a single character often maps to multiple tokens, reducing effective context-window capacity and increasing API cost per character. BPE (Byte Pair Encoding) is the dominant algorithm.
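The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: it starts from individual characters and repeatedly fuses the most frequent adjacent pair into a new sub-word unit, which is how BPE vocabularies are built. The function names and the toy corpus are invented for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merges(text, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times.

    Starts from single characters; each merge adds one new sub-word
    unit, mirroring how a BPE vocabulary is learned from a corpus.
    """
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merged = pair[0] + pair[1]
        merges.append(merged)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)  # replace the pair with the merged unit
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

# Toy corpus: the shared stem "low" is merged first because its
# character pairs are the most frequent.
tokens, merges = bpe_merges("low lower lowest", 3)
print(merges)
```

Real tokenizers (e.g. those used by GPT-family models) apply byte-level BPE with a fixed, pre-trained merge table rather than learning merges at inference time.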

Practical note

API pricing is per token, so the same content consumes more tokens, and therefore costs more, in Japanese than in English. When building RAG pipelines over Japanese documents, set chunk sizes in tokens rather than character counts, so that retrieved chunks do not overrun the context window.
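Token-based chunking can be sketched as below. To stay self-contained, `toy_tokenize` is a whitespace-split stand-in; in a real pipeline you would substitute the target model's own tokenizer so that chunk boundaries match its actual token counts. All names here are illustrative assumptions.

```python
def toy_tokenize(text):
    """Stand-in tokenizer: whitespace split.

    A real pipeline would use the model's own tokenizer so that
    chunk sizes match what the API actually bills and accepts.
    """
    return text.split()

def chunk_by_tokens(text, max_tokens, overlap=0):
    """Split text into chunks of at most max_tokens tokens,
    with an optional token overlap between consecutive chunks."""
    tokens = toy_tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# 7 tokens, chunks of 3 with 1 token of overlap.
chunks = chunk_by_tokens("a b c d e f g", max_tokens=3, overlap=1)
print(chunks)
```

Sizing chunks in tokens (not characters) keeps the budget stable across languages: a 500-token chunk is 500 tokens whether the text is English or Japanese, even though the character counts differ greatly.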
