GPTCrunch
Training

Distillation

A training technique where a smaller "student" model learns to replicate the behavior of a larger "teacher" model, producing a compact model that retains much of the teacher's capability at lower cost.

Knowledge distillation transfers the capabilities of a large, expensive model (the teacher) into a smaller, more efficient model (the student). Rather than training the student model from scratch on raw data, distillation trains it to match the teacher's output distributions — learning not just the right answers but the teacher's confidence levels across all possible outputs. This richer training signal allows the student to absorb patterns that would be difficult to learn from data alone.
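The idea of matching the teacher's full output distribution can be made concrete with the classic soft-label objective: a temperature-scaled KL divergence between teacher and student probabilities. The sketch below is a minimal, dependency-free illustration of that loss (following the convention from Hinton et al.'s distillation paper); it is not any particular framework's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's.

    A temperature above 1 softens both distributions, exposing the
    teacher's relative confidence in *wrong* answers -- the extra
    signal a one-hot label cannot provide.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The temperature**2 factor keeps gradient magnitudes comparable
    # across temperature settings (standard convention).
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# A confident-but-not-certain teacher gives the student a richer
# target than the hard label "class 0" alone.
teacher = [4.0, 1.5, 0.2]   # teacher logits over three classes
student = [3.0, 2.0, 0.5]   # student logits for the same input
loss = distillation_loss(teacher, student)
```

In practice this soft-label loss is usually mixed with the ordinary hard-label cross-entropy; the loss shrinks to zero as the student's distribution converges on the teacher's.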

The distillation process typically works by running many prompts through the teacher model, collecting its outputs (and sometimes its internal probability distributions), and using these as training data for the student. The student learns to produce similar outputs to the teacher across a diverse range of inputs. Some approaches also incorporate the teacher's intermediate representations or attention patterns, providing even more learning signal.
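The prompt-collection step above can be sketched as a simple loop: generate completions from the teacher, pair them with the prompts, and use the pairs as supervised fine-tuning data for the student. The function names below (`teacher_generate`, `build_distillation_dataset`) are hypothetical stand-ins, not a real library's API.

```python
def teacher_generate(prompt):
    # Placeholder for an (expensive) call to the large teacher model.
    return f"teacher answer to: {prompt}"

def build_distillation_dataset(prompts):
    """Pair each prompt with the teacher's completion.

    This is sequence-level distillation: the teacher's sampled outputs
    stand in for its full probability distributions, which closed-model
    APIs often do not expose.
    """
    return [(prompt, teacher_generate(prompt)) for prompt in prompts]

prompts = ["Summarize photosynthesis.", "Explain recursion."]
dataset = build_distillation_dataset(prompts)
# Each (prompt, completion) pair then feeds ordinary supervised
# fine-tuning of the student model.
```

When the teacher's per-token probabilities are available, the same loop can store those distributions instead of sampled text, giving the student the richer soft-label signal described above.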

Distillation has produced some of the most cost-effective models available. OpenAI's GPT-4o-mini is widely believed to be distilled from GPT-4o. Microsoft's Phi series achieves remarkable performance for its size by training largely on high-quality synthetic data generated by larger models, a form of distillation. Many of the best open-source models in the 7B-13B range have been improved through distillation from frontier models. The result is a proliferation of small, capable models that run on modest hardware while delivering surprisingly strong performance.

The practical benefit of distillation is the ability to deploy AI capabilities at a fraction of the cost. A distilled 7B model might reach 90% of a 70B model's quality while being roughly 10x cheaper to serve and correspondingly faster to respond. For applications with high request volumes, strict latency requirements, or edge deployment constraints, distilled models offer the best performance per dollar. The tradeoff typically shows up in the long tail: distilled models handle common tasks well but may falter on unusual or highly complex inputs where the full-size teacher excels.
