Quantization
Also known as: Quantization / 量子化 (quantization) / モデル量子化 (model quantization)
Converting model weights from 32-bit or 16-bit floats to lower-precision formats (8-bit, 4-bit, etc.) to reduce model size and memory footprint, enabling faster inference and local execution.
Overview
Quantization represents model weights in lower bit-width formats (8-bit, 4-bit) to shrink file size and VRAM footprint. Common formats include GGUF (llama.cpp), GPTQ, AWQ, and NF4 (QLoRA). 4-bit quantization reduces model size by roughly 4x compared with float16, usually with only a small accuracy loss.
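The core idea can be shown with a minimal sketch of symmetric per-tensor int8 quantization in plain NumPy. This is an illustrative simplification, not how GGUF, GPTQ, AWQ, or NF4 actually pack weights (those use per-group scales, sub-byte packing, and calibration), but it shows the scale/round/clip mechanics and the resulting size reduction.

```python
# Illustrative sketch: symmetric per-tensor int8 quantization.
# Real formats (GGUF, GPTQ, AWQ, NF4) use per-group scales and more elaborate schemes.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 using a single per-tensor scale."""
    scale = np.max(np.abs(weights)) / 127.0           # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for use at inference time."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print("original bytes: ", weights.nbytes)   # ~64 MB at float32
print("quantized bytes:", q.nbytes)         # ~16 MB at int8 (4x smaller)
print("mean abs error: ", np.abs(weights - restored).mean())
```

Going from int8 down to 4-bit formats halves the storage again, which is what makes the roughly 4x reduction versus float16 possible.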
Enabling local LLM execution
Tools like Ollama distribute pre-quantized models, allowing 7B-13B-parameter models to run on consumer GPUs with 8-16 GB of VRAM or on Apple Silicon. This keeps business data local, addressing privacy concerns while eliminating cloud API costs.
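As a hedged sketch of what local execution looks like in practice, the snippet below queries a locally running Ollama server over its HTTP API. It assumes Ollama is installed and listening on its default port, and that a quantized model has already been pulled; the model name "llama3.1:8b" is an assumption, so substitute whatever model you have available.

```python
# Sketch: prompting a locally running Ollama server (default port 11434).
# Assumes a quantized model has been pulled, e.g. via `ollama pull llama3.1:8b`
# (model name is an assumption; use any model available locally).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # pre-quantized GGUF model served by Ollama
        "prompt": "Summarize why quantization matters for local inference.",
        "stream": False,          # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```

Because the model runs entirely on local hardware, the prompt and response never leave the machine, which is the privacy and cost benefit described above.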