Inference
Also known as: model inference (Japanese: 推論 / モデル推論)
The process of running a trained AI model on new inputs to produce predictions or generated outputs. In LLMs, this is the text-generation step — distinct from the training process.
Overview
Inference is the process of running a trained model on new inputs to produce outputs. For LLMs, this means token-by-token generation in response to a user prompt. Model parameters are not updated during inference. Inference cost, latency, and throughput determine the practical viability of an LLM-based product.
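As a concrete illustration, the loop below sketches greedy token-by-token generation with the Hugging Face transformers library. It is a minimal sketch, not a production setup: the model name "gpt2", the prompt, and the 20-token limit are placeholder assumptions rather than anything from the original text.

```python
# Minimal sketch of token-by-token (greedy) LLM inference.
# "gpt2" is an assumed placeholder; any causal LM checkpoint would work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: parameters are frozen, never updated

input_ids = tokenizer("Inference is", return_tensors="pt").input_ids

with torch.no_grad():  # no gradients are computed during inference
    for _ in range(20):  # generate up to 20 new tokens (arbitrary limit)
        logits = model(input_ids).logits
        # Greedy decoding: pick the highest-probability next token.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break  # stop at end-of-sequence

print(tokenizer.decode(input_ids[0]))
```

Note that each loop iteration here re-processes the entire sequence, which is exactly the inefficiency that the optimizations in the next section address.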
Inference optimization
Key optimizations include the KV Cache, Speculative Decoding, quantization, FlashAttention, and request batching. Cloud APIs apply these transparently; for local inference, engines such as llama.cpp, vLLM, and TGI (Text Generation Inference) apply them automatically.
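The KV Cache is the most fundamental of these: since each generated token attends to all previous tokens, caching the attention keys and values lets each decoding step feed in only the newest token instead of re-encoding the whole sequence. The sketch below illustrates the idea via the past_key_values mechanism in transformers; as above, "gpt2" and the token limit are placeholder assumptions.

```python
# Sketch of KV-cached decoding: after the first forward pass, each step
# processes a single new token and reuses the cached keys/values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Inference is", return_tensors="pt").input_ids
past_key_values = None  # no cache yet on the first step

with torch.no_grad():
    for _ in range(20):
        # With a warm cache, only the most recent token is fed to the model.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # reuse cached K/V next step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

This turns each decoding step from work proportional to the full sequence length into work proportional to a single token, at the cost of the memory needed to hold the cache.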