KV Cache (Key-Value Cache)
Also known as: Key-Value Cache
A memory-level optimization that caches the Key and Value vectors computed during Transformer attention, avoiding recomputation of earlier tokens and speeding up autoregressive inference.
Overview
In Transformer self-attention, generating each new token normally requires recomputing Key and Value projections for all prior tokens. KV Cache stores these vectors in GPU memory after the first computation and reuses them in subsequent generation steps. The benefit grows with context length, but so does VRAM consumption.
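The mechanism can be sketched in a few lines: at each decoding step, only the newest token's Key and Value are projected and appended to the cache, and the new query attends over the full cached history. This is a minimal toy illustration using NumPy (single head, random weights; all names and sizes are illustrative, not from any particular library).

```python
import numpy as np

d = 8  # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention of one query over all cached keys/values
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))  # grows by one row per generated token
V_cache = np.empty((0, d))

for step in range(5):
    x = rng.standard_normal(d)  # hidden state of the newest token
    q = Wq @ x
    # project K/V only for the new token; earlier rows come from the cache
    K_cache = np.vstack([K_cache, Wk @ x])
    V_cache = np.vstack([V_cache, Wv @ x])
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # one cached K row per generated token
```

Without the cache, each step would redo the `Wk @ x` and `Wv @ x` projections for every prior token, making generation quadratic rather than linear in per-step projection work. The linear growth of `K_cache` and `V_cache` is also why VRAM use rises with context length.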
Prefix caching
Prefix caching extends KV Cache across requests sharing a common prefix (a System Prompt or a large document). Anthropic's Prompt Caching and OpenAI's equivalent feature use this mechanism to significantly cut costs when the same context is reused across many queries.
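Conceptually, prefix caching is a lookup table keyed by the shared prompt prefix: the first request pays for the prefill pass, and later requests with an identical prefix reuse the stored K/V. A hypothetical sketch (all names here are illustrative; real systems key on token-block hashes and evict under memory pressure):

```python
# Hypothetical sketch of cross-request prefix caching; not any vendor's API.
calls = {"prefill": 0}
prefix_store = {}  # token tuple -> (K, V) computed during prefill

def compute_kv(tokens):
    # stand-in for the expensive prefill pass over the whole prefix
    calls["prefill"] += 1
    return [hash(t) for t in tokens], [hash(t) * 2 for t in tokens]

def kv_for_prompt(tokens):
    key = tuple(tokens)
    if key not in prefix_store:   # miss: run prefill once, then reuse
        prefix_store[key] = compute_kv(tokens)
    return prefix_store[key]

system_prompt = ["You", "are", "a", "helpful", "assistant"]
kv_for_prompt(system_prompt)
kv_for_prompt(system_prompt)      # second request hits the cache
print(calls["prefill"])           # prefill ran only once
```

Because the cached entries are only valid for a byte-identical prefix, providers typically require the shared portion (system prompt, documents) to come first and be unchanged across requests.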