Speculative Decoding
Also known as: 投機的デコーディング / スペキュレイティブデコーディング (Japanese renderings of "Speculative Decoding")
An inference acceleration technique in which a small draft model generates several candidate tokens that a large target model then verifies in a single forward pass, achieving roughly 2-4x speedups without loss of output quality.
Overview
Speculative Decoding addresses the latency of autoregressive, token-by-token generation. A lightweight draft model proposes several candidate tokens; the large target model then verifies all of them in a single forward pass. Each drafted token is accepted with a probability that depends on how well the draft distribution matches the target distribution; at the first rejection, the target model samples a replacement token from an adjusted (residual) distribution, and drafting resumes from there. Because of this rejection-sampling scheme, the output distribution is provably identical to sampling from the target model alone, so quality is preserved.
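The draft-propose / target-verify loop above can be sketched in a few lines. The following is a minimal toy illustration, not a real inference engine: `draft_probs` and `target_probs` are hypothetical stand-ins for the two models' next-token distributions over a tiny vocabulary, and the target's "single forward pass" is simulated by scoring each drafted position. The acceptance rule (accept with probability min(1, p/q), resample from the residual max(0, p − q) on rejection) is the standard speculative sampling scheme.

```python
import random

random.seed(0)

VOCAB = 4  # toy vocabulary size (hypothetical)

def draft_probs(prefix):
    # Toy stand-in for the small draft model: a fixed, skewed distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_probs(prefix):
    # Toy stand-in for the large target model: a different distribution.
    return [0.25, 0.25, 0.25, 0.25]

def sample(probs):
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

def speculative_step(prefix, k=4):
    """One draft-and-verify round; returns the tokens to append."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(k):
        probs = draft_probs(ctx)
        tok = sample(probs)
        drafted.append(tok)
        q.append(probs)
        ctx.append(tok)
    # 2. Target model scores all k positions; in a real engine this is
    #    a single batched forward pass over the drafted sequence.
    p = [target_probs(list(prefix) + drafted[:i]) for i in range(k)]
    # 3. Accept each drafted token with probability min(1, p/q); at the
    #    first rejection, resample from the residual max(0, p - q) and
    #    discard the rest of the draft.
    accepted = []
    for i, tok in enumerate(drafted):
        if random.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            residual = [max(0.0, pi - qi) for pi, qi in zip(p[i], q[i])]
            total = sum(residual)
            dist = [r / total for r in residual] if total > 0 else p[i]
            accepted.append(sample(dist))
            break
    return accepted

out = []
while len(out) < 12:
    out.extend(speculative_step(out))
```

Each round emits at least one token (the resampled one on rejection) and up to k+0 drafted tokens, which is where the speedup comes from: several target-quality tokens per target forward pass.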
Practical significance
Throughput gains of 2-4x are achievable on the same GPU hardware. The technique is used in commercial APIs (e.g. Claude, Gemini) and in local inference engines such as llama.cpp, and it significantly improves responsiveness in interactive LLM applications.