What is knowledge distillation in AI?

Knowledge distillation is a technique where a smaller "student" model is trained to replicate the outputs and behavior of a larger "teacher" model. Rather than learning from raw data alone, the student learns from the teacher's soft probability distributions — which contain richer signal than hard labels. The result is a model that is faster and cheaper to run while retaining much of the teacher's capability.

Is distillation legal?

It depends on the terms of service of the model being distilled and whether the distillation involves unauthorized access. Distilling from publicly available open-weight models is generally legal, subject to each model's license. Distilling from a closed API by systematically querying it to collect training data is explicitly prohibited by most major AI providers' terms of service, and Anthropic has pursued legal action against Alibaba over allegations of exactly this with Claude.

How did Alibaba distill Claude Fable 5?

According to Anthropic's legal complaint, operators linked to Alibaba's Qwen team created nearly 25,000 fraudulent accounts and used them to run 28.8 million Claude API interactions. These interactions were used as training data to improve Qwen models — effectively extracting Claude's knowledge through its API without permission.

What is the difference between distillation and fine-tuning?

Fine-tuning adapts a pre-trained model to a specific task using new labeled data. Distillation transfers general capability from a larger model to a smaller one using the larger model's outputs as training signal. Fine-tuning changes what a model knows; distillation changes how efficiently it knows it.

Can you distill Claude or GPT-4 legally?

Not via their APIs. Both Anthropic and OpenAI explicitly prohibit using API outputs to train competing models. Open-weight models like Llama 4 or DeepSeek V4 can be legally distilled under their respective licenses. Some licenses (Apache 2.0, MIT) allow commercial distillation; others (like some Meta Llama licenses) restrict commercial use above certain scales.

What Is AI Distillation? Knowledge Transfer, Model | explainx.ai Blog

TL;DR

Distillation is the technique of training a small, efficient "student" model to replicate the behavior of a large, expensive "teacher" model. It's one of the foundational techniques that has made modern AI practical — almost every small model you run locally is partially a product of distillation. In 2026, it's also at the centre of one of the most significant legal disputes in AI: Anthropic accused Alibaba of systematically distilling Claude Fable 5 using 25,000 fake accounts and 28.8 million unauthorized API calls.

The Core Idea: Teacher and Student

The original knowledge distillation paper — Hinton, Vinyals, and Dean, 2015 — introduced a deceptively simple idea: you can learn more from a model's probability distribution than from its final answer.

When a classifier labels an image of a dog, the hard answer is just "dog." But the model's full probability output across all classes might look like: dog 92%, wolf 5%, cat 2%, other 1%. That soft distribution encodes something the hard label doesn't — the model knows a dog is more similar to a wolf than to a cat. A student trained on these soft probabilities learns more efficiently than one trained on raw labels alone.

Geoffrey Hinton called this "dark knowledge" — the information embedded in the wrong answers that reveals how the model internally represents the world.
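To make the "dark knowledge" idea concrete, here is a minimal sketch in plain Python. It reuses the dog/wolf/cat probabilities from the example above. The two student distributions are invented for illustration. Both students are equally confident about "dog", so a hard label cannot tell them apart, while the teacher's soft distribution penalizes the one that ranks the wrong answers implausibly.

```python
import math

CLASSES = ["dog", "wolf", "cat", "other"]

# Teacher's soft output from the example above, and the matching hard label.
teacher_soft = [0.92, 0.05, 0.02, 0.01]
hard_label = [1.0, 0.0, 0.0, 0.0]


def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-weight terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Two hypothetical students (illustrative numbers, not real model outputs).
student_a = [0.90, 0.07, 0.02, 0.01]  # treats "wolf" as the nearest miss
student_b = [0.90, 0.01, 0.02, 0.07]  # treats "other" as the nearest miss

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label losses are identical because only the "dog" probability enters them. The soft-label loss is lower for student A, whose view of the wrong answers agrees with the teacher's.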

How Distillation Works in Practice

Step 1: Pick a teacher

The teacher is a large, capable model — GPT-4, Claude Fable 5, Llama 4 70B, whatever. It's expensive to run but highly capable.

Step 2: Generate outputs from the teacher

You feed inputs through the teacher and collect its outputs. For a language model, this means collecting the generated text and, where you have access to them, the token-level logits or probability distributions, across thousands or millions of prompts. The richer and more diverse the prompt distribution, the more of the teacher's knowledge you can transfer.
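As a rough sketch of this collection step, the snippet below runs a list of prompts through a teacher and stores prompt/response pairs as JSON Lines. The function names (`collect_teacher_outputs`, `to_jsonl`) and the `toy_teacher` stand-in are assumptions made for illustration. A real teacher would be a model you are licensed to distill, and a real pipeline would also handle batching, retries, and filtering.

```python
import json
from typing import Callable, Iterable


def collect_teacher_outputs(prompts: Iterable[str],
                            teacher: Callable[[str], str]) -> list[dict]:
    """Run each unique prompt through the teacher and keep (prompt, response) pairs."""
    records = []
    seen = set()
    for prompt in prompts:
        if prompt in seen:  # skip exact duplicates so they don't dominate training
            continue
        seen.add(prompt)
        records.append({"prompt": prompt, "response": teacher(prompt)})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher model."""
    return prompt.upper()


dataset = collect_teacher_outputs(
    ["what is 2+2?", "name a prime", "what is 2+2?"], toy_teacher
)
print(to_jsonl(dataset))
```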

Step 3: Train the student

The student is a smaller architecture — maybe 7B parameters instead of 70B. You train it to minimize the difference between its own output distributions and the teacher's. The loss function combines:

Distillation loss — how closely the student's soft outputs match the teacher's
Task loss — how well the student performs on ground truth labels (optional, depending on the approach)
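The two terms above can be sketched for a single example in plain Python. This follows the common formulation from the 2015 paper: a temperature-softened KL divergence between teacher and student, scaled by the squared temperature, plus an ordinary cross-entropy on the ground-truth label. The weighting `alpha`, the default temperature, and the function names are illustrative choices, not values taken from this article.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the soft (teacher-matching) and hard (label) terms."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # T^2 keeps the soft term's gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(p_teacher, p_student)
    # Task loss: cross-entropy against the ground-truth label at temperature 1.
    hard_term = -math.log(softmax(student_logits)[label_index])
    return alpha * soft_term + (1 - alpha) * hard_term


teacher = [3.0, 1.0, 0.2]
print(distillation_loss([2.0, 1.0, 0.1], teacher, label_index=0))
print(distillation_loss([0.1, 1.0, 2.0], teacher, label_index=0))
```

Setting `alpha=1.0` drops the task loss entirely, which corresponds to the "optional" case described above.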

Step 4: Evaluate and iterate

The student won't be as capable as the teacher on everything, but it will be much faster and cheaper to run. The goal is maximizing capability retention per unit of compute cost.

A Brief History of Distillation

2006 — Model compression

Buciluă, Caruana, and Niculescu-Mizil showed you could compress an ensemble of models into a single smaller model without much accuracy loss. The key insight: the ensemble's soft predictions are richer training signal than hard labels.

2015 — "Distilling the Knowledge in a Neural Network" (Hinton et al.)

Hinton formalized the teacher-student framework and named it knowledge distillation. The "temperature" softmax trick — dividing logits by a temperature parameter before softmax to soften the distribution — became standard practice. This paper kicked off the modern distillation era.

2019 — DistilBERT (Hugging Face)

Hugging Face released DistilBERT, a distilled version of BERT that was 40% smaller, 60% faster, and retained 97% of BERT's performance on GLUE benchmarks. This was the moment distillation went from a research technique to a production standard. Millions of applications run on DistilBERT or its descendants today.

2020–2022 — GPT distillation at scale

With GPT-3's API access but no open weights, researchers explored "black-box distillation" — collecting API outputs to train smaller models. Papers like "GPT3Mix" and "Self-Instruct" used GPT-3 completions as synthetic training data. OpenAI's terms of service prohibited this, but enforcement was minimal at smaller scales.

2023 — Alpaca and the era of instruction distillation

Stanford's Alpaca was a 7B LLaMA model fine-tuned on 52,000 instruction-following examples generated by OpenAI's text-davinci-003, a GPT-3.5-family model. It showed that a small open model could behave like a capable instruction-following assistant after fine-tuning on GPT-generated data, at a cost of under $600. OpenAI's terms already barred using outputs to develop competing models, so Alpaca was released for research use only. Stanford took the hosted demo down shortly after launch, but the technique proliferated.

2024 — Phi models (Microsoft)

Microsoft's Phi series, which began in 2023, demonstrated something remarkable: small models (1.3B–7B parameters) trained on high-quality synthetic data generated by larger models could punch far above their weight. Microsoft reported that Phi-3-mini, at 3.8B parameters, rivals GPT-3.5 on several benchmarks. The key was carefully curated "textbook-quality" synthetic data, which is effectively structured distillation from larger GPT models.

2025 — DeepSeek R1 and reasoning distillation

DeepSeek released R1, a reasoning model, alongside smaller distilled versions (1.5B, 7B, 8B, 14B, 32B, 70B) built on open Qwen and Llama base models. The distilled versions were fine-tuned on reasoning traces generated by the full R1 model, transferring not just factual knowledge but how to reason. DeepSeek reported that the 70B distilled model matched or beat some closed models on math benchmarks despite being dramatically cheaper to run.

2026 — Anthropic sues Alibaba over Fable 5 distillation

The largest legal action around distillation to date. Anthropic alleged that operators linked to Alibaba's Qwen team ran 25,000 fraudulent accounts and 28.8 million Claude API interactions to collect training data for Qwen. This was large-scale API distillation — not a research experiment but an industrial extraction operation.

Types of Distillation

Response distillation (black-box)

The student learns from the teacher's final outputs — text completions, classifications, answers. You don't need access to the teacher's internal weights or probabilities, just its API. This is what Alpaca did with GPT-3.5 and what Alibaba allegedly did with Claude.

Limitation: You only get the teacher's "hard" outputs, not the full probability distribution. You miss the dark knowledge.

Logit distillation (white-box)

The student learns from the teacher's full probability distributions over tokens. Requires access to the model's logits — possible with open-weight models but not with closed APIs. Much more efficient transfer per training example.

Feature distillation

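The student is trained to match the teacher's internal representations (hidden states, attention patterns) at intermediate layers, not just the final output. DistilBERT's training objective includes a term of this kind: a cosine loss that aligns student and teacher hidden states. Requires white-box access.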
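One way to picture feature distillation is a cosine loss between student and teacher hidden states at matching positions. The sketch below is plain Python with hand-written vectors. The linear projection, which maps the narrower student states to the teacher's width, is an assumption added here for the case where the two models differ in hidden size. It is not a description of any specific model's recipe, and it assumes no hidden state is the zero vector.

```python
import math


def project(vector, weight):
    """Multiply a vector by an (out_dim x in_dim) matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def cosine_loss(a, b):
    """1 - cosine similarity: 0 when the vectors point the same way, 2 when opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def feature_distillation_loss(student_hidden, teacher_hidden, weight):
    """Average cosine loss over positions after projecting student states."""
    losses = [
        cosine_loss(project(s, weight), t)
        for s, t in zip(student_hidden, teacher_hidden)
    ]
    return sum(losses) / len(losses)


# Toy setup: student width 2, teacher width 3, two token positions.
projection = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
student_states = [[1.0, 0.0], [0.0, 2.0]]
aligned_teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
opposed_teacher = [[-2.0, 0.0, 0.0], [0.0, -1.0, 0.0]]

print(feature_distillation_loss(student_states, aligned_teacher, projection))
print(feature_distillation_loss(student_states, opposed_teacher, projection))
```

In training, the projection matrix would be learned together with the student. Here it is fixed so the arithmetic is easy to follow.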

Chain-of-thought distillation

The teacher generates step-by-step reasoning traces, and the student is trained on those traces rather than just final answers. This is what made DeepSeek R1 distillation so effective — small models could learn reasoning behavior, not just memorized answers.
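A minimal sketch of how reasoning traces might be turned into training examples follows. The `Trace` structure, the `<think>` delimiters, and the step that drops traces with a known-wrong final answer are all assumptions made for illustration. The source does not describe how any particular lab formats or filters its traces.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    question: str
    reasoning: str  # the teacher's step-by-step working
    answer: str     # the teacher's final answer


def build_cot_examples(traces, reference_answers):
    """Turn teacher traces into (prompt, target) pairs, keeping the reasoning."""
    examples = []
    for trace in traces:
        expected = reference_answers.get(trace.question)
        if expected is not None and trace.answer.strip() != expected:
            continue  # drop traces whose final answer is known to be wrong
        # The student is trained to reproduce the reasoning, then the answer.
        target = f"<think>\n{trace.reasoning}\n</think>\n{trace.answer}"
        examples.append({"prompt": trace.question, "target": target})
    return examples


traces = [
    Trace("What is 12 * 3?", "12 * 3 = 10 * 3 + 2 * 3 = 30 + 6 = 36", "36"),
    Trace("What is 7 + 5?", "7 + 5 = 13", "13"),  # wrong answer, filtered out
]
references = {"What is 12 * 3?": "36", "What is 7 + 5?": "12"}

cot_examples = build_cot_examples(traces, references)
print(cot_examples[0]["target"])
```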

Speculative decoding (a different kind of distillation use)

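Speculative decoding is an inference technique rather than a training method, but it is a common use for distilled models. A small draft model, often distilled from the large one, proposes several candidate tokens, and the large target model verifies them in a single parallel pass, keeping the ones it agrees with. This typically speeds up inference by about 2-3x without changing output quality. It is reportedly used in production by Anthropic and others to make frontier models cheaper to serve.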
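The draft-then-verify loop can be sketched with toy models. This is the greedy variant only, under simplifying assumptions: both "models" are small deterministic functions invented for the example, verification is written as a loop where a real system would use one batched forward pass, and the sampling-based version (which accepts or rejects draft tokens probabilistically) is not shown.

```python
def greedy_decode(prefix, next_token, num_tokens):
    """Baseline: call the model once per generated token."""
    out = list(prefix)
    for _ in range(num_tokens):
        out.append(next_token(out))
    return out[len(prefix):]


def speculative_decode(prefix, target_next, draft_next, num_tokens, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    out = list(prefix)
    goal = len(prefix) + num_tokens
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2. The target checks each proposed position in order.
        accepted = []
        for token in proposal:
            expected = target_next(out + accepted)
            if token != expected:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
            accepted.append(token)
        out.extend(accepted)
    return out[len(prefix):]


def target_next(seq):
    """Hypothetical 'large' model: a deterministic toy rule."""
    return (sum(seq) * 3 + len(seq)) % 11


def draft_next(seq):
    """Hypothetical 'small' model: agrees with the target except at every third position."""
    guess = target_next(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


baseline = greedy_decode([1, 2], target_next, 12)
speculative = speculative_decode([1, 2], target_next, draft_next, 12)
print(baseline == speculative)
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target would have produced on its own. The speed-up in practice comes from verifying several positions in one pass, which this toy loop does not model.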

Why Distillation Is So Powerful

Capability compression

A well-distilled 7B model can handle some tasks that a 7B model trained from scratch cannot, and that would otherwise call for a much larger model. You're not just getting a smaller model; you're getting a model that has been taught by a much better one.

Data efficiency

The teacher's soft probabilities are far more informative per example than human labels, so a student can often reach a given accuracy with fewer examples than it would need from hard labels alone. The size of the gain depends on the task and on the quality of the teacher.

Reasoning transfer

Perhaps most importantly, you can transfer how a model thinks, not just what it knows. Chain-of-thought distillation transfers reasoning strategies. A small model trained on GPT-4's reasoning traces can solve problems it would never have solved trained on the raw answers alone.

Cost collapse

DeepSeek V4 Pro demonstrated that distillation can collapse costs dramatically. Models trained partly on frontier model outputs can compete with frontier models at a fraction of the training and inference cost. This is one reason the AI pricing war of 2026 has been so intense.

The Legal and Ethical Dimension

Distillation occupies a contested legal and ethical space. Three distinct scenarios have very different status:

Scenario 1: Distilling your own open-weight model (clearly fine)

If you have access to a model's weights under a permissive license (MIT, Apache 2.0), you can distill it into a smaller version and do almost anything with the result. This is standard practice and unambiguously legal.

Scenario 2: Distilling from open-weight models with restrictive licenses (depends)

Some open-weight licenses — including early versions of Meta's Llama licenses — restrict commercial use above a certain scale or prohibit using the model's outputs to train competing models. Check the license before distilling.

Scenario 3: Distilling from closed APIs (prohibited)

Every major closed AI provider prohibits using API outputs to train competing models. OpenAI, Anthropic, Google — all have terms of service that explicitly forbid this. Doing it at scale, as Alibaba allegedly did, creates legal exposure.

The Alibaba case is significant because Anthropic took it to court rather than just issuing a warning. It signals that large-scale API distillation will be treated as a commercial threat and pursued legally, not just contractually flagged.

Distillation and the Open-Source Debate

Distillation is central to the open-source AI controversy. Here's why:

If you keep weights closed but allow API access, determined actors can still extract capability through large-scale distillation. The Alibaba case is proof. Dario Amodei's argument that closed weights protect against misuse is weakened by the fact that closed APIs can be systematically mined.

If you open-source weights, distillation becomes trivial — anyone can do it, no terms of service needed. But the capability is already out anyway, so at least the playing field is level.

If you keep everything closed (no open weights, heavily rate-limited API), you slow down distillation but also slow down legitimate research, education, and developer access.

There is no configuration that simultaneously prevents distillation by sophisticated actors and enables full legitimate use. This is one of the unresolved tensions that makes Dario Amodei's open-source policy position harder than it looks.

Fable 5 in Context: What Was Being Distilled

Claude Fable 5 represents Anthropic's most capable model generation: strong reasoning, long-context understanding, agentic tool use, and safety-tuned instruction following. If Alibaba distilled it across 28.8 million API calls, as Anthropic alleges, it was extracting:

Instruction-following quality — how Claude interprets and executes complex prompts
Reasoning patterns — how Claude works through multi-step problems
Tone and safety behavior — Claude's distinctive response style and refusal patterns
Tool use patterns — how Claude structures agentic task execution

This is not trivial to replicate through normal training. Fable 5 represents billions of dollars and years of RLHF (reinforcement learning from human feedback), Constitutional AI research, and safety work. Distilling it is a significant shortcut — which is exactly why Anthropic viewed it as a serious enough threat to litigate.

Practical Distillation: What's Actually Available Today

If you want to use distillation legitimately, here are the best starting points:

Model	Base	Parameters	Distillation Type	License
DeepSeek-R1-Distill-Llama-70B	DeepSeek R1 (on Llama 3.3-70B)	70B	Chain-of-thought	MIT weights; Llama 3.3 license also applies
DeepSeek-R1-Distill-Qwen-7B	DeepSeek R1 (on Qwen2.5-Math-7B)	7B	Chain-of-thought	MIT weights; base model is Apache 2.0
Phi-4 (Microsoft)	GPT-4 synthetic data	14B	Response distillation	MIT
DistilBERT	BERT-base	66M	Feature + logit	Apache 2.0
Llama 3.1-8B	Llama 3.1-405B (partially)	8B	Mixed	Llama 3.1 Community License

For production use where you need Claude-level quality in a smaller, faster model, DeepSeek R1-Distill-70B is currently the strongest open option. For coding agents specifically, Qwen 3.7-Max is worth evaluating — though its relationship to distillation from closed models is contested given the ongoing litigation.

Bottom Line

Distillation is not a corner case or an exotic research technique. It is a foundational mechanism of how modern AI gets deployed at scale. Almost every small model running on a laptop or embedded in a product today has been distilled, fine-tuned on distilled data, or trained on synthetic data generated by a larger model.

The Anthropic vs Alibaba case is not really about Alibaba being a uniquely bad actor; distillation from closed APIs has been happening since GPT-3. It's about the fact that at the alleged scale of 28.8 million API calls and 25,000 fake accounts, the extraction crosses from ambiguous grey area into deliberate commercial exploitation.

Understanding distillation is essential for anyone building with AI in 2026 — both to use it effectively and to understand why the open-source vs closed-source debate is harder to resolve than it looks.

Related posts

Anthropic vs Alibaba: 25,000 Fake Accounts and 28.8M Claude Exchanges

Dario Amodei Warned Against GPT-2 in 2019. Now He's at the Centre of the Open-Source AI War.

Asian AI fills the Mythos gap: Sakana Fugu, 360 Tulongfeng, and the export-ban vacuum
