株式会社オブライト
AI · 2026-05-17

Speculative Decoding

Also known as: 投機的デコーディング / スペキュレイティブデコーディング (Japanese renderings of "Speculative Decoding")

An inference acceleration technique in which a small draft model generates multiple candidate tokens that a large target model then verifies in a single forward pass, achieving roughly 2-4x speedups with no loss in output quality.


Overview

Speculative Decoding addresses the latency of autoregressive token-by-token generation. A lightweight draft model proposes several candidate tokens; the large target model then verifies all of them in a single forward pass. Accepted tokens are kept as-is, while at the first rejection the target model resamples a replacement token from a corrected distribution. Because of this rejection-sampling scheme, the output distribution is provably identical to sampling from the target model alone, so quality is preserved.
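The accept/resample step above can be sketched as follows. This is a minimal illustration, not any library's actual API: the `verify` function and its probability-table inputs are hypothetical stand-ins for the draft and target models' output distributions.

```python
import random

def verify(draft_tokens, draft_probs, target_probs, rng=random.random):
    """Toy sketch of speculative decoding's verification step.

    draft_tokens    : tokens proposed by the draft model
    draft_probs[i]  : dict token -> q(token), draft distribution at step i
    target_probs[i] : dict token -> p(token), target distribution at step i
    Returns (accepted_prefix, correction_token_or_None).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)   # target probability of the draft token
        q = draft_probs[i].get(tok, 1e-9)   # draft probability (avoid div-by-zero)
        if rng() < min(1.0, p / q):
            accepted.append(tok)            # accepted: keep the drafted token
        else:
            # Rejected: resample from the residual distribution
            # max(0, p - q), renormalized. This correction is what makes
            # the overall output distribution match the target model exactly.
            residual = {t: max(0.0, target_probs[i][t] - draft_probs[i].get(t, 0.0))
                        for t in target_probs[i]}
            total = sum(residual.values())
            r = rng() * total
            for t, w in residual.items():
                r -= w
                if r <= 0:
                    return accepted, t
            return accepted, tok            # numerical fallback
    return accepted, None                   # all drafted tokens accepted
```

When the draft and target distributions agree closely, `p / q` is near 1 and most drafted tokens are accepted, which is why a well-matched draft model matters in practice.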

Practical significance

Throughput gains of 2-4x are achievable on the same GPU hardware. The technique is used in commercial APIs (Claude, Gemini) and in local inference engines such as llama.cpp, and it significantly improves user experience in interactive LLM applications.
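Where the 2-4x figure comes from can be made concrete with the standard speculative sampling analysis: with per-token acceptance rate alpha and gamma drafted tokens per step, each target-model forward pass yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. A back-of-envelope sketch (the function name and parameter values here are illustrative, not from any library):

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens generated per target-model forward pass.

    alpha : per-token acceptance rate of the draft model's proposals
    gamma : number of tokens drafted per verification step
    Each pass yields at least 1 token (a correction or bonus token)
    and at most gamma + 1 (all drafts accepted plus one extra).
    """
    if alpha >= 1.0:
        return gamma + 1
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Example: with an 80% acceptance rate and 4 drafted tokens per step,
# each target pass yields about 3.36 tokens on average -- so if the
# draft model's cost is small, decoding runs roughly 3x faster.
```

The acceptance rate depends on how well the draft model tracks the target, which is why the achievable speedup varies by model pair and workload.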
