Ollama 0.31: Gemma 4 Is ~90% Faster on Apple Silicon With Multi-Token Prediction (No Output Change)
Ollama 0.31 ships auto-tuned multi-token prediction for Gemma 4 on MLX β 50.2 to 95.0 tok/s on Aider polyglot (M5 Max). Built-in draft model, GPU rollback, and a new MLX batch kernel. On by default for coding agents.
Local coding agents live or die on tokens per second. Every file read, tool call, and retry is another decode pass β and on Apple Silicon, 50 tok/s feels like waiting; 95 tok/s feels like a product.
On June 29, 2026, Ollama shipped 0.31 with a headline change: Gemma 4 on MLX is nearly 90% faster on a real coding-agent benchmark β on by default, with identical model output.
TL;DR
Release
Ollama 0.31+ (macOS)
Model
Gemma 4 β first model with integrated MTP (gemma4:12b-mlx)
Technique
Multi-token prediction (MTP) β built-in draft model + main verify
Benchmark
Aider polyglot (real coding agent tasks)
Hardware
M5 Max, Gemma 4 12B nvfp4
Without MTP
50.2 tok/s
With MTP
95.0 tok/s (~+89%)
Config
None β auto-tuned draft length at runtime
Output
Unchanged vs non-MTP decode
Agent launch
ollama launch claude --model gemma4:12b-mlx
What Ollama Actually Shipped
Autoregressive LLMs generate one token at a time. Multi-token prediction (MTP) adds a small, fast draft model (shipped inside Gemma 4) that proposes the next several tokens. The main model verifies the whole proposal in one pass and keeps tokens it agrees with.
Because the draft model is a small fraction of the main stack, proposals are cheap. When they are correct, you commit multiple tokens for roughly the cost of one forward pass.
Ollama's claim for 0.31:
Speedup is on by default for Gemma 4 on MLX
It does not change the model's output β same decoding semantics, faster wall clock
Gemma 4 is the first model with this integration; more models planned
That last point matters for local model pickers: if you chose Qwen 3.6 + llama.cpp MTP for speed on Mac, Gemma 4 + Ollama 0.31 is now a credible MLX-native alternative β especially if you want multimodal Gemma 4 later.
Why Coding Agents Benefit Most
Ollama calls out code explicitly: closing brackets, repeated identifiers, boilerplate β patterns a draft model guesses well.
Agent loop: read file β model β tool β model β patch β model β test β model β¦
Each arrow is decode time. ~90% faster generation does not make the model smarter β it makes Claude Code, Codex, OpenCode, and Droid (via ollama launch) more responsive when pointed at local Gemma 4.
Ollama measured on the Aider polyglot benchmark β a real coding agent running real programming tasks, not a synthetic tok/s microbench. They note explicitly that synthetic benchmarks can be tuned to show almost anything; Aider is their practical reference.
The Numbers
Setting
Generation speed
Gemma 4 12B (nvfp4), M5 Max β without MTP
50.2 tok/s
Gemma 4 12B (nvfp4), M5 Max β with MTP
95.0 tok/s
~89% higher throughput on this setup. Your machine, quant, and task mix will differ β treat this as Apple Silicon + Gemma 4 12B + agentic code, not a universal law.
Gemma 4 12B guide cited ~21 tok/s on RTX 4060 before this Ollama release
Ollama 0.31 is specifically an MLX + Gemma 4 + MTP story on Mac.
How It Works (Three Layers)
1. Auto-tuning draft length
There is no universal best βdraft N tokens then verify.β It depends on model, quant, hardware, and how predictable the text is right now.
Draft too few β leave speed on the table
Draft too many β verification costs exceed savings β MTP slower than plain AR
Ollama 0.31 tracks acceptance rate and verification time at runtime, picks the draft length that maximizes tokens per second, and reverts to one-token decoding when proposals stop getting accepted. No user-facing knob.
This avoids the classic speculative-decoding foot-gun: speculation enabled but net slower.
2. Speculative decoding in the engine
Each round:
Draft model autoregresses a short run of candidate tokens
Main model verifies the batch in one GPU pass (sample at each position)
Accepted tokens commit; rejected positions roll back
Rollback is cheap: the engine saves a checkpoint before each proposal and rewinds KV/state to the last accepted token β no full recompute of earlier context.
Drafting, sampling, verification, and post-accept sampling stay on GPU β no CPU ping-pong per token.
3. MLX kernel for small-batch verification
Most cost is verification, not drafting. Batches are awkwardly small (2β8 tokens) β between single-token decode kernels and large prefill kernels.
Ollama contributed a kernel to MLX that:
Reads/unpacks each weight block once per batch
Reuses it across all tokens in the verification batch
On M5 Max + nvfp4, they report 2Γβ2.5Γ speedup on Gemma 4's largest matmuls for that batch shape. Same math, less redundant memory traffic β benefits other MLX models too, not only Gemma 4 in Ollama.
Also supported: Codex, Droid, OpenCode, Copilot, and others via ollama launch.
For OpenCode / custom configs, point your provider at Ollama's local API β see run open models in OpenCode.
Ollama MTP vs llama.cpp MTP
Ollama 0.31 (Gemma 4 MLX)
llama.cpp (e.g. Qwen MTP GGUF)
Draft source
Built into Gemma 4 weights
--spec-type draft-mtp quants
Runtime
MLX via Ollama
llama-server direct
Tuning
Auto draft length
Manual flags / model choice
Best for
Mac users who want one binary + agents
Max control, router, embeddings
Multimodal
Gemma 4 vision/audio path
Text-first today
Many power users still run llama.cpp directly for -ngl, MTP on Qwen, and OpenCode at localhost:8080. Ollama 0.31 is the low-friction path when you want ollama launch claude and Gemma 4 speed without tuning.
Limitations
macOS / MLX first β this release targets Apple Silicon; Linux/CUDA paths may differ
Gemma 4 only (for now) β other models not yet on this MTP stack
VRAM / unified memory β 12B nvfp4 still needs a capable Mac; not a phone model
Agent quality β tok/s β faster decode does not fix weak reasoning; benchmark on your repo
Re-pull required β old gemma4:12b-mlx pulls may lack MTP-enabled weights
Not cloud parity β 95 tok/s local is excellent for Mac; still not Fable-class frontier for hard agent work