TL;DR
Distillation is the technique of training a small, efficient "student" model to replicate the behavior of a large, expensive "teacher" model. It's one of the foundational techniques that have made modern AI practical: almost every small model you run locally is at least partly a product of distillation. In 2026, it's also at the centre of one of the most significant legal disputes in AI: Anthropic accused Alibaba of systematically distilling Claude Fable 5 using 25,000 fake accounts and 28.8 million unauthorized API calls.
The Core Idea: Teacher and Student
The original knowledge distillation paper (Hinton, Vinyals, and Dean, 2015) introduced a deceptively simple idea: you can learn more from a model's probability distribution than from its final answer.
When a classifier labels an image of a dog, the hard answer is just "dog." But the model's full probability output across all classes might look like: dog 92%, wolf 5%, cat 2%, other 1%. That soft distribution encodes something the hard label doesn't: the model treats a dog as more similar to a wolf than to a cat. A student trained on these soft probabilities learns more efficiently than one trained on raw labels alone.
Geoffrey Hinton called this "dark knowledge": the information embedded in the wrong answers that reveals how the model internally represents the world.
How Distillation Works in Practice
Step 1: Pick a teacher
The teacher is a large, capable model: GPT-4, Claude Fable 5, a 70B Llama model, whatever. It's expensive to run but highly capable.
Step 2: Generate outputs from the teacher
You feed inputs through the teacher and collect its outputs. For a language model, this means collecting generated text or, when you have access to them, the per-token logits (the raw scores that a softmax turns into a probability distribution over the vocabulary) across thousands or millions of prompts. The richer and more diverse the prompt distribution, the more of the teacher's knowledge you can transfer.
Step 3: Train the student
The student is a smaller architecture, maybe 7B parameters instead of 70B. You train it to minimize the difference between its own output distributions and the teacher's. The loss function combines:
- Distillation loss: how closely the student's soft outputs match the teacher's
- Task loss: how well the student performs on ground truth labels (optional, depending on the approach)
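The two loss terms above can be sketched for a single classification example as follows. This is a minimal illustration using only the standard library, not a training recipe: the weighting `alpha`, the default `temperature`, and the function names are assumptions chosen for the example. The `temperature ** 2` factor follows the scaling suggested in Hinton et al. so that the soft-target term keeps a comparable gradient magnitude as the temperature changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    alpha and temperature are illustrative hyperparameters, not values
    taken from the article.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Distillation loss: match the teacher's softened distribution.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Task loss: ordinary cross-entropy against the ground-truth class index.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard

# Made-up logits for the classes (dog, wolf, cat); the true label is "dog" (index 0).
teacher = [4.0, 1.0, 0.2]
print(distillation_loss([3.5, 1.2, 0.1], teacher, label=0))
print(distillation_loss([0.1, 3.0, 2.0], teacher, label=0))
```

A student whose logits track the teacher's gets a lower loss than one that ranks the classes differently. Setting `alpha=1.0` drops the task loss entirely, which corresponds to the "optional" case in the list above.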
Step 4: Evaluate and iterate
The student won't be as capable as the teacher on everything, but it will be much faster and cheaper to run. The goal is maximizing capability retention per unit of compute cost.
A Brief History of Distillation
2006: Model compression
Buciluă, Caruana, and Niculescu-Mizil showed you could compress an ensemble of models into a single smaller model without much accuracy loss. The key insight: the ensemble's soft predictions are a richer training signal than hard labels.
2015: "Distilling the Knowledge in a Neural Network" (Hinton et al.)
Hinton, Vinyals, and Dean formalized the teacher-student framework and named it knowledge distillation. The "temperature" softmax trick (dividing logits by a temperature parameter before the softmax to soften the distribution) became standard practice. This paper kicked off the modern distillation era.
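To make the temperature trick concrete, here is a small sketch that applies a softmax at several temperatures to the same logits. The logit values and class names are invented for illustration. The point is only that a higher temperature spreads probability mass onto the non-winning classes without changing their ranking, which is what exposes the "dark knowledge" to the student.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a standard softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for one image.
logits = {"dog": 6.0, "wolf": 3.0, "cat": 2.0}

for temperature in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(list(logits.values()), temperature)
    rounded = {name: round(p, 3) for name, p in zip(logits, probs)}
    print(f"T={temperature}: {rounded}  entropy={entropy(probs):.3f}")
```

At `T=1.0` nearly all the mass sits on "dog". As the temperature rises, "wolf" and "cat" receive more probability and the entropy increases, while the ordering dog > wolf > cat is preserved.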
2019: DistilBERT (Hugging Face)
Hugging Face released DistilBERT, a distilled version of BERT that was 40% smaller, 60% faster, and retained 97% of BERT's performance on GLUE benchmarks. This was the moment distillation went from a research technique to a production standard. Millions of applications run on DistilBERT or its descendants today.
2020-2022: GPT distillation at scale
With GPT-3 available through an API but without open weights, researchers explored "black-box distillation": collecting API outputs to train smaller models. Papers like "GPT3Mix" and "Self-Instruct" used GPT-3 completions as synthetic training data. OpenAI's terms of service prohibited using outputs to develop competing models, but enforcement was minimal at smaller scales.
2023: Alpaca and the era of instruction distillation
Stanford's Alpaca model was trained on 52,000 instruction-following examples generated by GPT-3.5 (text-davinci-003). It showed that a 7B LLaMA model could behave like a capable instruction-following assistant after being fine-tuned on GPT-generated data, at a cost of under $600. Because OpenAI's terms of use already prohibited using outputs to develop competing models, Alpaca was released for research use only. The hosted demo was taken down shortly afterwards, but the technique proliferated.
2024: Phi models (Microsoft)
Microsoft's Phi series demonstrated something remarkable: small models (1.3B-7B parameters) trained on high-quality synthetic data generated by larger models could punch far above their weight. Phi-3-mini achieved GPT-3.5-level performance with 3.8B parameters. The key was carefully curated "textbook-quality" synthetic data, effectively structured distillation from GPT-4.
2025: DeepSeek R1 and reasoning distillation
DeepSeek released R1, a reasoning model, alongside smaller distilled versions (1.5B, 7B, 8B, 14B, 32B, 70B) built on Qwen and Llama base models. The distilled versions were trained on reasoning traces generated by the full R1 model, transferring not just factual knowledge but how to reason. DeepSeek R1-Distill-70B matched or beat some closed models on math benchmarks despite being dramatically cheaper to run.
2026: Anthropic sues Alibaba over Fable 5 distillation
The largest legal action around distillation to date. Anthropic alleged that operators linked to Alibaba's Qwen team ran 25,000 fraudulent accounts and 28.8 million Claude API interactions to collect training data for Qwen. This was large-scale API distillation: not a research experiment but an industrial extraction operation.
Types of Distillation
Response distillation (black-box)
The student learns from the teacher's final outputs: text completions, classifications, answers. You don't need access to the teacher's internal weights or probabilities, just its API. This is what Alpaca did with GPT-3.5 and what Alibaba allegedly did with Claude.
Limitation: You only get the teacher's "hard" outputs, not the full probability distribution. You miss the dark knowledge.
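In practice, response distillation mostly comes down to building a supervised fine-tuning dataset of prompt and completion pairs. The sketch below shows that data-collection step only. `teacher_generate` is a hypothetical callable standing in for whatever teacher you are licensed to use, and `toy_teacher` is a stub so the example runs offline; neither corresponds to a real API.

```python
import json

def build_sft_records(prompts, teacher_generate, min_chars=1):
    """Collect (prompt, completion) pairs from a teacher for fine-tuning.

    teacher_generate is a hypothetical function: prompt string -> text.
    Duplicate prompts and empty completions are dropped.
    """
    records = []
    seen = set()
    for prompt in prompts:
        if prompt in seen:
            continue
        seen.add(prompt)
        completion = teacher_generate(prompt).strip()
        if len(completion) < min_chars:
            continue
        records.append({"prompt": prompt, "completion": completion})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, a common fine-tuning input format."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def toy_teacher(prompt):
    """Stand-in teacher used only to make this example self-contained."""
    return f"Answer to: {prompt}"

records = build_sft_records(
    ["What is distillation?", "What is distillation?", "Define logits."],
    toy_teacher,
)
print(to_jsonl(records))
```

Because each record holds only the text the teacher produced, the student sees the teacher's sampled output and nothing about the alternatives it considered, which is the limitation described above.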
Logit distillation (white-box)
The student learns from the teacher's full probability distributions over tokens. This requires access to the model's logits, which is possible with open-weight models but generally not with closed APIs. It gives much more efficient transfer per training example.
Feature distillation
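The student is trained to match the teacher's internal representations (hidden states, attention patterns) at intermediate layers, not just the final output. DistilBERT uses a form of this, a cosine loss that aligns student and teacher hidden states, alongside logit distillation. Requires white-box access.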
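A hidden-state matching loss can be sketched as below. This is an illustrative simplification, not DistilBERT's actual training code: it assumes the student and teacher share the same hidden width (so no learned projection is needed), that the teacher's layer count is a multiple of the student's, that each layer's state is a single non-zero vector, and it uses an evenly spaced layer pairing chosen for the example.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def map_layers(n_student, n_teacher):
    """Pair each student layer with an evenly spaced teacher layer.

    Assumes n_teacher is a multiple of n_student (an illustrative choice).
    """
    step = n_teacher // n_student
    return [(s, (s + 1) * step - 1) for s in range(n_student)]

def feature_loss(student_hidden, teacher_hidden):
    """Mean cosine distance between paired student/teacher hidden states."""
    pairs = map_layers(len(student_hidden), len(teacher_hidden))
    total = sum(
        cosine_distance(student_hidden[s], teacher_hidden[t]) for s, t in pairs
    )
    return total / len(pairs)

# Toy hidden states: a 4-layer teacher and a 2-layer student, width 2.
teacher_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
student_states = [[1.0, 0.0], [2.0, 1.0]]
print(map_layers(2, 4))
print(feature_loss(student_states, teacher_states))
```

When the student is narrower than the teacher, a common approach is to add a small learned projection on the student side before computing the distance; that is omitted here to keep the sketch short.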
Chain-of-thought distillation
The teacher generates step-by-step reasoning traces, and the student is trained on those traces rather than just final answers. This is what made DeepSeek R1 distillation so effective: small models could learn reasoning behavior, not just memorized answers.
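The data-preparation side of chain-of-thought distillation might look like the sketch below. It assumes each teacher trace arrives as a dict with `question`, `reasoning`, and `answer` keys, wraps the reasoning in `<think>` tags as one possible formatting convention, and keeps only traces whose final answer matches a known reference. The filtering rule, the tag format, and all names here are assumptions for illustration, not a description of any particular lab's pipeline.

```python
def format_cot_example(question, reasoning, answer):
    """Turn one teacher trace into a prompt/completion training pair.

    The completion contains the reasoning first, then the final answer,
    so the student is trained to produce both.
    """
    completion = f"<think>\n{reasoning.strip()}\n</think>\n{answer.strip()}"
    return {"prompt": question, "completion": completion}

def keep_correct_traces(traces, reference_answers):
    """Keep only traces whose final answer matches the reference answer.

    Traces for questions without a reference are dropped.
    """
    kept = []
    for trace in traces:
        reference = reference_answers.get(trace["question"])
        if reference is not None and trace["answer"].strip() == reference.strip():
            kept.append(
                format_cot_example(
                    trace["question"], trace["reasoning"], trace["answer"]
                )
            )
    return kept

# Invented traces: the second one reasons its way to a wrong answer.
traces = [
    {"question": "2+2?", "reasoning": "2 plus 2 is 4.", "answer": "4"},
    {"question": "3*3?", "reasoning": "3 times 3 is 6.", "answer": "6"},
]
references = {"2+2?": "4", "3*3?": "9"}

for example in keep_correct_traces(traces, references):
    print(example)
```

Filtering on the final answer is a cheap way to avoid teaching the student flawed reasoning, though it only works for tasks where a reference answer can be checked.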
Speculative decoding (a different kind of distillation use)
This is not a training method but an inference technique that often relies on a distilled model: a small draft model generates candidate tokens, and the large target model verifies them in parallel. This can speed up inference by roughly 2-3x without changing the output distribution. Used in production by Anthropic and others to make frontier models cheaper to serve.
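The accept-or-correct loop can be sketched with toy "models" that map a token context to the next token. This is a simplified greedy variant, written under several assumptions: both models decode greedily, the verification of a draft block is simulated with a sequential loop (a real system would score the whole block in one batched forward pass of the large model), and the function names and toy models are invented. Sampling-based speculative decoding uses a probabilistic acceptance rule instead of the exact-match check shown here.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: call the model once per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

def speculative_greedy_decode(target_next, draft_next, prompt,
                              max_new_tokens, k=4):
    """Draft proposes k tokens; the target keeps the longest matching prefix.

    Returns the tokens and the number of verification rounds. Every emitted
    token is the target's own greedy choice, so the output matches
    greedy_decode(target_next, ...).
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks the proposal (one parallel pass in practice).
        rounds += 1
        accepted = []
        for token in proposal:
            expected = target_next(tokens + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # All k accepted: the target's pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds

# Toy models: the target counts upward mod 10; the draft errs after a 6.
def toy_target(context):
    return (context[-1] + 1) % 10

def toy_draft(context):
    return 0 if context[-1] == 6 else (context[-1] + 1) % 10

output, rounds = speculative_greedy_decode(toy_target, toy_draft, [0], 20)
print(output == greedy_decode(toy_target, [0], 20), rounds)
```

The better the draft model agrees with the target, the more tokens are accepted per round, which is why a draft distilled from the target tends to be a good fit for this role.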
Why Distillation Is So Powerful
Capability compression
A well-distilled 7B model can handle tasks that would otherwise need a far larger model trained from scratch. You're not just getting a smaller model; you're getting one that has been taught by a much better one.
Data efficiency
The teacher's soft probabilities are far more informative per example than human labels. In some settings, a model trained on 100K distillation examples can outperform one trained on 1M raw-labeled examples.
Reasoning transfer
Perhaps most importantly, you can transfer how a model thinks, not just what it knows. Chain-of-thought distillation transfers reasoning strategies. A small model trained on GPT-4's reasoning traces can solve problems it would never have solved trained on the raw answers alone.
Cost collapse
DeepSeek V4 Pro demonstrated that distillation can collapse costs dramatically. Models trained partly on frontier model outputs can compete with frontier models at a fraction of the training and inference cost. This is one reason the AI pricing war of 2026 has been so intense.
The Legal and Ethical Dimension
Distillation occupies a contested legal and ethical space. Three distinct scenarios have very different status:
Scenario 1: Distilling your own open-weight model (clearly fine)
If you have access to a model's weights under a permissive license (MIT, Apache 2.0), you can distill it into a smaller version and do almost anything with the result. This is standard practice and unambiguously legal.
Scenario 2: Distilling from open-weight models with restrictive licenses (depends)
Some open-weight licenses, including early versions of Meta's Llama licenses, restrict commercial use above a certain scale or prohibit using the model's outputs to train competing models. Check the license before distilling.
Scenario 3: Distilling from closed APIs (prohibited)
Every major closed AI provider prohibits using API outputs to train competing models. OpenAI, Anthropic, and Google all have terms of service that explicitly forbid this. Doing it at scale, as Alibaba allegedly did, creates legal exposure.
The Alibaba case is significant because Anthropic took it to court rather than just issuing a warning. It signals that large-scale API distillation will be treated as a commercial threat and pursued legally, not just contractually flagged.
Distillation and the Open-Source Debate
Distillation is central to the open-source AI controversy. Here's why:
If you keep weights closed but allow API access, determined actors can still extract capability through large-scale distillation. The Alibaba case is proof. Dario Amodei's argument that closed weights protect against misuse is weakened by the fact that closed APIs can be systematically mined.
If you open-source weights, distillation becomes trivial: anyone can do it, no terms of service needed. But the capability is already out anyway, so at least the playing field is level.
If you keep everything closed (no open weights, heavily rate-limited API), you slow down distillation but also slow down legitimate research, education, and developer access.
There is no configuration that simultaneously prevents distillation by sophisticated actors and enables full legitimate use. This is one of the unresolved tensions that makes Dario Amodei's open-source policy position harder than it looks.
Fable 5 in Context: What Was Being Distilled
Claude Fable 5 represents Anthropic's most capable model generation: strong reasoning, long-context understanding, agentic tool use, and safety-tuned instruction following. When Alibaba allegedly distilled it across 28.8 million API calls, they were extracting:
- Instruction-following quality: how Claude interprets and executes complex prompts
- Reasoning patterns: how Claude works through multi-step problems
- Tone and safety behavior: Claude's distinctive response style and refusal patterns
- Tool use patterns: how Claude structures agentic task execution
This is not trivial to replicate through normal training. Fable 5 represents billions of dollars and years of RLHF (reinforcement learning from human feedback), Constitutional AI research, and safety work. Distilling it is a significant shortcut, which is exactly why Anthropic viewed it as a serious enough threat to litigate.
Practical Distillation: What's Actually Available Today
If you want to use distillation legitimately, here are the best starting points:
| Model | Teacher / data source | Parameters | Distillation type | License |
|---|---|---|---|---|
| DeepSeek R1-Distill-70B | DeepSeek R1 | 70B | Chain-of-thought | MIT (Llama-based; Llama license terms also apply) |
| DeepSeek R1-Distill-7B | DeepSeek R1 | 7B | Chain-of-thought | MIT |
| Phi-4 (Microsoft) | GPT-4 synthetic data | 14B | Response distillation | MIT |
| DistilBERT | BERT-base | 66M | Feature + logit | Apache 2.0 |
| Llama 3.1-8B | Llama 3.1-405B (synthetic post-training data) | 8B | Mixed | Meta license |
For production use where you need Claude-level quality in a smaller, faster model, DeepSeek R1-Distill-70B is currently the strongest open option. For coding agents specifically, Qwen 3.7-Max is worth evaluating, though its relationship to distillation from closed models is contested given the ongoing litigation.
Bottom Line
Distillation is not a corner case or an exotic research technique. It is a foundational mechanism of how modern AI gets deployed at scale. Almost every small model running on a laptop or embedded in a product today has been distilled, fine-tuned on distilled data, or trained on synthetic data generated by a larger model.
The Anthropic vs Alibaba case is not really about Alibaba being a uniquely bad actor; distillation from closed APIs has been happening since GPT-3. It's about the fact that at the scale of 28.8 million API calls and 25,000 fake accounts, the extraction crossed from an ambiguous grey area into deliberate commercial exploitation.
Understanding distillation is essential for anyone building with AI in 2026, both to use it effectively and to understand why the open-source vs closed-source debate is harder to resolve than it looks.