Quantization
A technique that reduces the precision of a model's numerical weights (e.g., from 16-bit to 4-bit), dramatically decreasing memory usage and increasing inference speed with only a small loss in output quality.
Quantization reduces the numerical precision of a model's parameters to make it smaller and faster. A typical language model stores each weight as a 16-bit floating-point number (2 bytes). Quantization converts these to lower precision — 8-bit (INT8), 4-bit (INT4), or even lower — reducing memory requirements proportionally. A 70 billion parameter model that requires 140 GB in float16 needs only 70 GB in 8-bit or 35 GB in 4-bit, making it feasible to run on significantly less expensive hardware.
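The memory figures above follow directly from parameter count times bytes per weight. A minimal sketch of that arithmetic (the function name `weight_memory_gb` is illustrative, and it counts only weight storage, ignoring activations and KV cache):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
print(weight_memory_gb(params_70b, 16))  # float16 -> 140.0 GB
print(weight_memory_gb(params_70b, 8))   # INT8    -> 70.0 GB
print(weight_memory_gb(params_70b, 4))   # INT4    -> 35.0 GB
```

Real deployments need extra headroom beyond this for activations, the KV cache, and framework overhead, so treat these numbers as a floor, not a budget.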
The most popular quantization methods include GPTQ (post-training quantization optimized for GPUs), AWQ (activation-aware quantization), and the GGUF format (successor to GGML, used by llama.cpp for CPU-friendly quantization). Each method takes a different approach to deciding which weights can tolerate lower precision. More sophisticated methods like AWQ identify the weights that matter most for output quality and keep those at higher precision, achieving better results at the same bit-width.
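The core operation these methods build on can be sketched in a few lines. This is a deliberately simplified symmetric per-tensor INT8 round-trip, not GPTQ or AWQ (which choose scales and weight groups far more carefully); the function names are illustrative:

```python
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    q = [round(w / scale) for w in weights]            # store these small ints
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return [v * scale for v in q]

w = [0.02, -0.51, 0.33, 1.27, -1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale / 2
```

The stored integers take 1 byte each instead of 2 for float16, and the rounding error per weight is at most half the scale factor, which is why the quality loss stays small when the weight distribution is well behaved.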
The quality impact of quantization depends on the method, the model, and the bit-width. Going from float16 to 8-bit (INT8) typically has negligible quality loss — often less than 0.5% on benchmarks. The step from 8-bit to 4-bit shows more degradation but is still remarkably small for well-quantized large models. Below 4-bit, quality drops become more noticeable. Larger models tend to be more robust to quantization than smaller ones, because they have more redundancy in their parameters.
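The bit-width trend above can be illustrated with synthetic weights: uniform quantization error grows as the number of representable levels shrinks, which is one reason sub-4-bit schemes degrade noticeably. A small sketch (synthetic Gaussian weights, not benchmark scores; `roundtrip_error` is an illustrative name):

```python
import random

def roundtrip_error(weights, bits):
    levels = 2 ** (bits - 1) - 1  # symmetric signed integer range
    scale = max(abs(w) for w in weights) / levels
    return sum(abs(w - round(w / scale) * scale) for w in weights) / len(weights)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
for bits in (8, 4, 2):
    print(bits, roundtrip_error(w, bits))
# Fewer bits -> fewer levels -> larger scale -> larger average error.
```

This toy model only captures rounding error; it says nothing about which weights matter, which is exactly the gap that methods like AWQ and GPTQ address.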
Quantization has been transformative for the open-source AI ecosystem. It allows enthusiasts to run 70B+ parameter models on consumer GPUs, a workload that previously required enterprise-grade hardware. Combined with efficient inference frameworks like llama.cpp, vLLM, and TensorRT-LLM, quantization makes self-hosting competitive with API-based deployment for many use cases. When browsing open-source models on GPTCrunch, you will often see multiple quantized versions available — choosing the right quantization level is a key part of the deployment decision.