What is VibeThinker-3B?

VibeThinker-3B is a compact 3-billion-parameter language model described in an arXiv paper (2606.16140, submitted June 15, 2026) that reaches frontier-level performance on verifiable reasoning tasks. It scores 94.3 on AIME 2026 (97.1 with claim-level test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance on recent unseen LeetCode contests. On AIME 2026 that puts it within about a point of DeepSeek V3.2 (94.6) and GLM-5 (95.3) and a few points behind Gemini 3 Pro (98.2), despite being far smaller.

How did they train VibeThinker-3B?

Using the Spectrum-to-Signal post-training paradigm: curriculum-based supervised fine-tuning (starting from simpler tasks, progressing to harder ones), followed by multi-domain reinforcement learning using verifiable rewards, and offline self-distillation (the model teaching its own improved outputs back to itself). The combination pushes verifiable reasoning performance far beyond what standard SFT achieves on small models.

Does this mean small models are replacing large models?

No — but they are complementary in a newly understood way. Large models have advantages in general knowledge, long-tail tasks, and broad coverage. Small models can match or exceed them in structured, verifiable reasoning domains. The implication is that not every reasoning-intensive task needs a 100B+ parameter model — you can route reasoning-heavy subtasks to small specialist models while reserving large models for knowledge-intensive work.

What is "claim-level test-time scaling"?

A technique where at inference time, the model generates multiple candidate claims or reasoning steps, evaluates them for consistency and correctness, and selects the best. Applied to AIME 2026, it pushes VibeThinker-3B's score from 94.3 to 97.1 — without any training change. Test-time compute scaling is a growing area of research where inference-time reasoning improves results beyond what training alone achieves.

VibeThinker-3B: Frontier Reasoning at 3B Parameters (2026) | explainx.ai Blog

The most provocative result in the VibeThinker-3B paper is not the benchmark score. It is the implication.

A 3-billion-parameter model scores 94.3 on AIME 2026. DeepSeek V3.2 — a model orders of magnitude larger — scores 94.6 on the same benchmark. Gemini 3 Pro scores 98.2 (higher, but Gemini is a frontier closed model with unknown parameter count). GLM-5 scores 95.3.

VibeThinker-3B lands within about a point of DeepSeek V3.2 and GLM-5 on verifiable mathematical reasoning, and at 3B parameters it has less than 1% of the parameter count of the 671B-class models it is compared against.

That's not a benchmark curiosity. It's a structural signal about how reasoning capability scales — and it doesn't scale the way most people assume.

The Benchmark Results in Context

Published June 15, 2026 (arXiv:2606.16140), VibeThinker-3B's measured performance:

Benchmark	VibeThinker-3B	DeepSeek V3.2	Gemini 3 Pro	GLM-5
AIME 2026	94.3 (97.1 w/ TTS)	94.6	98.2	95.3
LiveCodeBench v6	80.2 Pass@1	—	—	—
LeetCode (recent, unseen)	96.1% acceptance	—	—	—
IFEval	93.4	—	—	—

The AIME 2026 score is the most striking data point because AIME (American Invitational Mathematics Examination) is a competition math benchmark that requires multi-step deductive reasoning, not just pattern matching. 94.3 on AIME 2026 is a frontier-tier score, regardless of what generates it.

The claim-level test-time scaling result — 97.1 — is also important. It shows that generating multiple reasoning paths and selecting the best can extract significantly more performance from the same 3B model without any additional training.
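The LiveCodeBench figure in the table above is a Pass@1 rate. For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator that is widely used for code benchmarks: draw n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k samples passes. Whether the paper computes Pass@1 exactly this way is an assumption; the function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed all tests.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a Pass@1 score is simply the average fraction of single attempts that pass.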

How They Did It: The Training Pipeline

VibeThinker-3B's result comes from three training stages applied to a compact base model:

1. Curriculum-Based Supervised Fine-Tuning

Standard SFT trains on a mix of examples with no structured progression. Curriculum-based SFT sequences the training data from simpler to harder problems — the model builds on verified capability before encountering more complex challenges.

For verifiable reasoning tasks (math problems, code with test cases), this is particularly effective because the reasoning patterns learned on simpler problems generalize to harder ones when the curriculum is designed carefully.
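A minimal sketch of the easy-to-hard ordering described above. It assumes each training example already carries a numeric difficulty score; how the paper scores difficulty or sizes its stages is not stated here, so `Example`, `difficulty`, and `curriculum_stages` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    prompt: str
    solution: str
    difficulty: float  # hypothetical score: lower means easier


def curriculum_stages(examples: List[Example], num_stages: int) -> Iterator[List[Example]]:
    """Yield training pools ordered from the easiest stage to the hardest."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not examples:
        return
    ordered = sorted(examples, key=lambda ex: ex.difficulty)
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    for start in range(0, len(ordered), stage_size):
        yield ordered[start:start + stage_size]
```

A training loop would then fine-tune on each yielded stage in order, instead of shuffling all examples into a single pool.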

2. Multi-Domain Reinforcement Learning With Verifiable Rewards

After SFT, the model is trained using reinforcement learning on domains where correctness can be verified automatically — math (check the numerical answer), code (run the test suite), logic (evaluate the proof). This is the "verifiable" in "verifiable reasoning."

Verifiable RL is more stable than RLHF with human raters because the reward signal is clear and consistent. A math answer is right or wrong. A code submission passes tests or it doesn't. The model learns to actually solve the problems rather than to sound like it's solving them.
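To make the "right or wrong" reward concrete, here is a rough sketch of two binary reward functions, one for math answers and one for code. The names `math_reward` and `code_reward` are illustrative, not from the paper, and a real pipeline would run untrusted code in a sandbox rather than calling it directly.

```python
from fractions import Fraction
from typing import Callable, Iterable, Tuple


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the two answers are numerically equal, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward
    return 1.0 if same else 0.0


def code_reward(candidate: Callable, tests: Iterable[Tuple[tuple, object]]) -> float:
    """Return 1.0 only if the candidate passes every (args, expected) test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0
```

Both functions return only 0.0 or 1.0, which is the property the paragraph above relies on: the signal does not depend on how convincing the output sounds.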

3. Offline Self-Distillation

After RL, the model is used to generate improved solutions to training problems, and those improved solutions are used to fine-tune the model again. This is essentially the model teaching itself: generate better answers, then learn from those better answers.
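One way to picture this stage is as rejection sampling: sample several solutions per problem, keep only those a verifier accepts, and fine-tune on the survivors. The sketch below assumes that reading. The `generate` and `verify` callables, the sample count, and the shortest-solution tie-break are all illustrative assumptions rather than details from the paper.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    problems: List[str],
    generate: Callable[[str], str],      # hypothetical: samples one solution from the model
    verify: Callable[[str, str], bool],  # hypothetical: checks a solution automatically
    samples_per_problem: int = 4,
) -> List[Dict[str, str]]:
    """Collect one verified model-written solution per problem, where available."""
    dataset = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(samples_per_problem)]
        verified = [c for c in candidates if verify(problem, c)]
        if verified:
            # Assumption: prefer the shortest verified solution as the training target.
            dataset.append({"prompt": problem, "completion": min(verified, key=len)})
    return dataset
```

Problems with no verified candidate are skipped, so the resulting fine-tuning set contains only outputs that passed the automatic check.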

The combination — curriculum SFT → multi-domain RL → self-distillation — is their "Spectrum-to-Signal" paradigm. Each stage builds on the previous, and the result is a model that punches far above its parameter count on the task class it was optimized for.

The Parametric Compression-Coverage Hypothesis

The paper's most significant contribution is not the benchmark result — it is the theoretical framework introduced to explain it.

The Parametric Compression-Coverage Hypothesis:

Verifiable reasoning is compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

In plain terms: there are two fundamentally different types of AI capability:

Reasoning — the ability to manipulate symbols, follow logical chains, apply procedures correctly, verify intermediate steps. This, the hypothesis claims, can be compressed. A 3B model trained correctly can match a 671B model on tasks that are primarily about reasoning.

Knowledge — facts about the world, historical context, scientific details, cultural knowledge, long-tail edge cases. This requires broad parameter coverage. You can't compress "knowing that the Treaty of Westphalia was signed in 1648" or "knowing that the capital of Bhutan is Thimphu" — those facts need to live somewhere in the weights, and they need a lot of space.

This hypothesis, if it holds, has implications beyond VibeThinker-3B.

What This Means for How You Build AI Systems

The traditional intuition: bigger model = better model. Use the largest model you can afford for everything.

The revised intuition from VibeThinker-3B: route tasks by their capability requirement, not by a single model tier.

Task Type	Capability Required	Right Model Size
Math problem solving	Reasoning	Small specialist (3B) can match frontier
Code review with test-driven verification	Reasoning	Small specialist competitive
General world knowledge questions	Knowledge	Large model required
Long-tail factual retrieval	Knowledge	Large model required
Code generation (general)	Mixed	Test it — depends on specifics
Analysis of proprietary data	Mixed	Depends on context length + reasoning depth

This routing insight has a practical consequence: not every inference call needs a frontier model. A pipeline that identifies reasoning-heavy subtasks and routes them to a tuned 3B model — while sending knowledge-heavy tasks to a larger model — can achieve better overall quality at lower overall cost.
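The routing idea can be sketched as a small lookup keyed on the capability a task needs. The model names and the `has_verifier` rule for mixed tasks are placeholders chosen for illustration. The table above only says to test mixed tasks, so treat that branch as one possible policy.

```python
from enum import Enum


class Capability(Enum):
    REASONING = "reasoning"
    KNOWLEDGE = "knowledge"
    MIXED = "mixed"


# Hypothetical model identifiers; substitute whatever your stack actually serves.
SMALL_REASONER = "small-reasoning-3b"
LARGE_GENERALIST = "large-general"

ROUTES = {
    Capability.REASONING: SMALL_REASONER,
    Capability.KNOWLEDGE: LARGE_GENERALIST,
    Capability.MIXED: LARGE_GENERALIST,
}


def route(capability: Capability, has_verifier: bool = False) -> str:
    """Pick a model for a task based on the capability it mainly needs."""
    if capability is Capability.MIXED and has_verifier:
        # Assumption: mixed tasks with an automatic check can try the small model first.
        return SMALL_REASONER
    return ROUTES[capability]
```

For example, `route(Capability.REASONING)` selects the small model, while `route(Capability.KNOWLEDGE)` falls through to the large one.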

Why VibeThinker-3B Is Not the Only Signal

VibeThinker-3B is not an isolated result. It fits into a pattern:

QwQ-32B (Qwen's reasoning model) was reported to perform comparably to the 671B-parameter DeepSeek-R1 on math benchmarks
DeepSeek-R1-Zero showed that RL alone, without a supervised fine-tuning stage, could produce frontier-level chain-of-thought reasoning
o1-mini vs o1-preview: OpenAI's smaller reasoning model competed with the larger one on structured tasks

The pattern: when training is specifically optimized for verifiable reasoning — not broad capability — smaller models consistently exceed what their parameter count would predict.

VibeThinker-3B extends this with 3B parameters (smaller than most of these examples) and with a more complete post-training pipeline (curriculum SFT + multi-domain RL + self-distillation rather than any single technique).

The Limits of the Hypothesis

The hypothesis is not a claim that small models can replace large models. It is a claim about decomposition.

Where small reasoning models fall short:

Tasks requiring retrieval of obscure facts (long-tail knowledge)
Tasks requiring synthesis of broad context (world events, cultural nuance)
Tasks that mix reasoning and knowledge in ways that can't be cleanly separated
Long-horizon planning over many domains simultaneously

The IFEval score (93.4) — which measures instruction-following across diverse domains — suggests VibeThinker-3B also generalizes to instruction-following. But this is different from claiming broad knowledge.

The hypothesis does not say "reasoning models have no ceiling." It says the ceiling for reasoning capability is much higher than expected for compact models, while knowledge capacity scales differently.

Test-Time Scaling Is Part of the Story

The AIME improvement from 94.3 to 97.1 via claim-level test-time scaling deserves attention.

This means: at inference time, rather than generating one response, the model generates multiple candidate reasoning chains, evaluates their internal consistency, and selects the most reliable one. No retraining, no new data — just more compute at inference.
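The paper's claim-level procedure is not spelled out here, so the sketch below swaps in a simpler, well-known stand-in: sample several reasoning chains, extract each final answer, and keep the answer most chains agree on (majority voting, often called self-consistency). The function name and the agreement score are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def select_by_agreement(final_answers: List[str]) -> Tuple[str, float]:
    """Return the most common final answer and the share of chains that produced it."""
    if not final_answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(answer.strip() for answer in final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Final answers extracted from five sampled reasoning chains (made-up values).
chains = ["204", "204", "117", "204", "96"]
best, agreement = select_by_agreement(chains)
```

Here three of the five chains agree on "204", so that answer is selected. A claim-level method would presumably score intermediate claims rather than only final answers, which this sketch does not attempt.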

This is increasingly important because it means a 3B model + inference-time compute budget can, in some cases, substitute for a larger model + standard inference budget. The trade-off changes from "larger model vs smaller model" to "larger model vs smaller model + more inference compute."

For cost optimization in agentic pipelines: if your pipeline runs 100 reasoning steps, running a small reasoning-tuned model with 5x inference budget per step may be cheaper and better than running a frontier model once per step.
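The trade-off is easy to check with back-of-the-envelope arithmetic. The per-call prices below are invented placeholders, not measured figures. Only the 100 steps and the 5x inference budget come from the text above.

```python
def pipeline_cost(steps: int, samples_per_step: int, cost_per_sample: float) -> float:
    """Total cost of a pipeline that draws the same number of samples at every step."""
    return steps * samples_per_step * cost_per_sample


def break_even_small_cost(frontier_cost_per_call: float, samples_per_step: int) -> float:
    """Highest per-sample price at which the small model is still no more expensive."""
    return frontier_cost_per_call / samples_per_step


# Hypothetical prices per call, in arbitrary currency units.
SMALL_COST = 0.002
FRONTIER_COST = 0.03

small_total = pipeline_cost(steps=100, samples_per_step=5, cost_per_sample=SMALL_COST)
frontier_total = pipeline_cost(steps=100, samples_per_step=1, cost_per_sample=FRONTIER_COST)
threshold = break_even_small_cost(FRONTIER_COST, samples_per_step=5)
```

Under these assumed prices the small model with five samples per step costs about a third of the frontier pipeline. The general rule is that the small model wins on cost whenever its per-sample price is below the frontier price divided by the number of samples. Whether it also wins on quality has to be measured on the actual task.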

Reading the Paper

arXiv:2606.16140, submitted June 15, 2026 by Sen Xu, Shixi Liu, Wei Wang, and colleagues. The paper details the full training pipeline, ablation studies across each training stage, and the benchmark comparison against frontier models.


