What is Ollama 0.31 multi-token prediction for Gemma 4?

Ollama 0.31 enables multi-token prediction (MTP) for Gemma 4 on MLX by default. A small built-in draft model proposes several next tokens; the main Gemma 4 model verifies them in one GPU pass and keeps accepted tokens. Output is identical to non-MTP decoding — only generation speed changes.

How much faster is Gemma 4 with MTP in Ollama?

On Ollama's Aider polyglot coding-agent benchmark, Gemma 4 12B (nvfp4) on an M5 Max went from 50.2 tok/s without MTP to 95.0 tok/s with MTP, about 89% faster on average. The gain depends on workload, and synthetic benchmarks can show very different numbers.

Do I need to configure MTP in Ollama?

No. MTP is on by default in Ollama 0.31+ for Gemma 4 on macOS. Ollama auto-tunes how many tokens to draft based on acceptance rate and verification cost, and falls back to plain one-token decoding when speculation stops helping.

How do I run Gemma 4 with a coding agent in Ollama?

Install Ollama 0.31 or later on macOS, re-pull the model (ollama pull gemma4:12b-mlx), then run ollama launch claude --model gemma4:12b-mlx. ollama launch also supports Codex, Droid, OpenCode, Copilot, and other agents.

How is Ollama MTP different from llama.cpp draft-MTP?

llama.cpp uses --spec-type draft-mtp with separate GGUF builds (e.g. Qwen 3.6 MTP quants). Ollama 0.31 integrates Gemma 4's native draft model into the MLX engine with runtime auto-tuning and a contributed MLX kernel for small-batch verification. Both are speculative decoding; stack and config differ.

Does MTP change model quality or outputs?

Ollama states MTP does not change the model's output — same tokens as autoregressive decoding when configured correctly. Speed comes from verifying multiple proposals per forward pass, not from approximating logits.

Ollama 0.31: Gemma 4 MTP ~90% Faster on Mac (2026 Guide) | explainx.ai Blog


Local coding agents live or die on tokens per second. Every file read, tool call, and retry is another decode pass — and on Apple Silicon, 50 tok/s feels like waiting; 95 tok/s feels like a product.

On June 29, 2026, Ollama shipped 0.31 with a headline change: Gemma 4 on MLX is nearly 90% faster on a real coding-agent benchmark — on by default, with identical model output.

TL;DR


Release	Ollama 0.31+ (macOS)
Model	Gemma 4 — first model with integrated MTP (`gemma4:12b-mlx`)
Technique	Multi-token prediction (MTP) — built-in draft model + main verify
Benchmark	Aider polyglot (real coding agent tasks)
Hardware	M5 Max, Gemma 4 12B nvfp4
Without MTP	50.2 tok/s
With MTP	95.0 tok/s (~+89%)
Config	None — auto-tuned draft length at runtime
Output	Unchanged vs non-MTP decode
Agent launch	`ollama launch claude --model gemma4:12b-mlx`

What Ollama Actually Shipped

Autoregressive LLMs generate one token at a time. Multi-token prediction (MTP) adds a small, fast draft model (shipped inside Gemma 4) that proposes the next several tokens. The main model verifies the whole proposal in one pass and keeps tokens it agrees with.

Because the draft model is a small fraction of the main model's size, proposals are cheap. When they are correct, several tokens are committed for roughly the cost of one main-model forward pass.

Ollama's claim for 0.31:

Speedup is on by default for Gemma 4 on MLX
It does not change the model's output — same decoding semantics, faster wall clock
Gemma 4 is the first model with this integration; more models planned

That last point matters for local model pickers: if you chose Qwen 3.6 + llama.cpp MTP for speed on Mac, Gemma 4 + Ollama 0.31 is now a credible MLX-native alternative — especially if you want multimodal Gemma 4 later.

Why Coding Agents Benefit Most

Ollama calls out code explicitly: closing brackets, repeated identifiers, boilerplate — patterns a draft model guesses well.

That maps directly to loop engineering workloads:

Agent loop: read file → model → tool → model → patch → model → test → model …

Every model step in that loop is decode time. ~90% faster generation does not make the model smarter; it makes Claude Code, Codex, OpenCode, and Droid (via ollama launch) more responsive when pointed at local Gemma 4.

Ollama measured on the Aider polyglot benchmark — a real coding agent running real programming tasks, not a synthetic tok/s microbench. They note explicitly that synthetic benchmarks can be tuned to show almost anything; Aider is their practical reference.

The Numbers

Setting	Generation speed
Gemma 4 12B (nvfp4), M5 Max — without MTP	50.2 tok/s
Gemma 4 12B (nvfp4), M5 Max — with MTP	95.0 tok/s

~89% higher throughput on this setup. Your machine, quant, and task mix will differ — treat this as Apple Silicon + Gemma 4 12B + agentic code, not a universal law.
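To make the arithmetic behind the ~89% figure concrete, here is a minimal sketch that recomputes the gain from the two quoted throughput numbers and estimates wall-clock decode time for one agent response. The 2,000-token response size is a hypothetical value chosen for illustration, not a figure from Ollama.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def decode_seconds(num_tokens: int, tps: float) -> float:
    """Wall-clock time to generate num_tokens at a steady tokens/second rate."""
    return num_tokens / tps


BASELINE_TPS = 50.2    # without MTP, as quoted above
MTP_TPS = 95.0         # with MTP, as quoted above
RESPONSE_TOKENS = 2_000  # hypothetical size of one agent response

print(f"gain: {speedup_percent(BASELINE_TPS, MTP_TPS):.1f}%")
print(f"without MTP: {decode_seconds(RESPONSE_TOKENS, BASELINE_TPS):.1f} s")
print(f"with MTP:    {decode_seconds(RESPONSE_TOKENS, MTP_TPS):.1f} s")
```

An agent loop repeats that decode step many times per task, so the per-response saving compounds across the run.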

For context:

Qwen 3.6 27B + llama.cpp MTP reports ~32 tok/s on M5 Max (different model, stack, and quant)
An earlier Gemma 4 12B guide cited ~21 tok/s on an RTX 4060, before this Ollama release (different hardware and stack)

Ollama 0.31 is specifically an MLX + Gemma 4 + MTP story on Mac.

How It Works (Three Layers)

1. Auto-tuning draft length

There is no universal best “draft N tokens then verify.” It depends on model, quant, hardware, and how predictable the text is right now.

Draft too few → leave speed on the table
Draft too many → verification costs exceed savings → MTP is slower than plain autoregressive decoding

Ollama 0.31 tracks acceptance rate and verification time at runtime, picks the draft length that maximizes tokens per second, and reverts to one-token decoding when proposals stop getting accepted. No user-facing knob.

This avoids the classic speculative-decoding foot-gun: speculation enabled but net slower.
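The trade-off described above can be sketched with the standard speculative-decoding throughput model. This is not Ollama's tuner: it assumes each drafted token is accepted independently with a fixed probability, and the latency numbers in the usage note are invented for illustration. All function names are hypothetical.

```python
from typing import Dict


def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per round.

    Assumes independent per-token acceptance, plus the one token the
    verify pass always yields (a correction or a bonus token).
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))


def tokens_per_second(accept_rate: float, draft_len: int,
                      draft_ms: float, verify_ms: Dict[int, float]) -> float:
    """Throughput for one draft length; verify_ms maps batch size to latency."""
    round_ms = draft_len * draft_ms + verify_ms[draft_len + 1]
    return 1000.0 * expected_tokens(accept_rate, draft_len) / round_ms


def best_draft_len(accept_rate: float, draft_ms: float,
                   verify_ms: Dict[int, float], max_len: int) -> int:
    """Draft length with the highest throughput; 0 means plain decoding."""
    return max(
        range(max_len + 1),
        key=lambda k: tokens_per_second(accept_rate, k, draft_ms, verify_ms),
    )
```

With made-up latencies of 2 ms per drafted token and `verify_ms = {1: 20.0, 2: 21.0, 3: 22.0, 4: 23.0, 5: 24.0}`, an acceptance rate of 0.8 favors the longest draft, 0.5 favors a shorter one, and 0.1 selects 0, which corresponds to reverting to one-token decoding. A runtime tuner would re-evaluate this as the measured acceptance rate and verify time change.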

2. Speculative decoding in the engine

Each round:

Draft model autoregresses a short run of candidate tokens
Main model verifies the batch in one GPU pass (sample at each position)
Accepted tokens commit; rejected positions roll back

Rollback is cheap: the engine saves a checkpoint before each proposal and rewinds KV/state to the last accepted token — no full recompute of earlier context.

Drafting, sampling, verification, and post-accept sampling stay on GPU — no CPU ping-pong per token.
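The draft, verify, and roll-back round can be illustrated with a toy example. This sketch uses greedy decoding and two made-up deterministic "models" (plain Python functions), whereas Ollama's engine samples at each position and runs verification as one batched GPU pass. It only demonstrates why the committed output matches plain decoding.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token
VOCAB = 50


def main_model(ctx: List[int]) -> int:
    """Toy 'large' model: a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_model(ctx: List[int]) -> int:
    """Toy draft: agrees with the main model except at every fourth position."""
    guess = main_model(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 4 == 3 else guess


def plain_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(main: Model, draft: Model, prompt: List[int],
                       n: int, draft_len: int) -> Tuple[List[int], int]:
    out = list(prompt)        # committed state; never holds unverified tokens
    rounds = 0
    while len(out) < len(prompt) + n:
        rounds += 1
        scratch = list(out)   # checkpoint: drafting only touches this copy
        proposal = []
        for _ in range(draft_len):
            tok = draft(scratch)
            proposal.append(tok)
            scratch.append(tok)
        # Verification. A real engine scores all positions in one batched pass;
        # this loop computes the same per-position results one at a time.
        for tok in proposal:
            want = main(out)
            out.append(want)  # the main model's token is what gets committed
            if want != tok:
                break         # reject the rest of the proposal
        else:
            out.append(main(out))  # every draft accepted: one bonus token
    return out[len(prompt):len(prompt) + n], rounds
```

Because only the main model's choices are ever committed, `speculative_decode(main_model, draft_model, [1, 2, 3], 40, 4)` returns the same tokens as `plain_decode(main_model, [1, 2, 3], 40)`, in fewer verification rounds than tokens generated.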

3. MLX kernel for small-batch verification

Most of the cost is in verification, not drafting. Verification batches are awkwardly small (2–8 tokens): too large for single-token decode kernels and too small for large prefill kernels.

Ollama contributed a kernel to MLX that:

Reads/unpacks each weight block once per batch
Reuses it across all tokens in the verification batch

On M5 Max + nvfp4, they report 2×–2.5× speedup on Gemma 4's largest matmuls for that batch shape. Same math, less redundant memory traffic — benefits other MLX models too, not only Gemma 4 in Ollama.
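The "unpack once, reuse across the batch" idea can be shown in plain Python. This is a conceptual sketch only: the real kernel runs on the GPU inside MLX, and the block format here (a scale plus a list of small integers) is a hypothetical stand-in for a packed quantized weight block.

```python
from typing import List, Tuple

Block = Tuple[float, List[int]]  # (scale, quantized values) for one weight row


def unpack(block: Block) -> List[float]:
    """Stand-in for dequantizing one packed weight block."""
    scale, quantized = block
    return [scale * q for q in quantized]


def dot(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def matmul_unpack_per_token(blocks: List[Block], tokens: List[List[float]]):
    """Decode-style loop: every token re-reads and re-unpacks every block."""
    unpacks = 0
    outputs = []
    for x in tokens:
        row = []
        for block in blocks:
            w = unpack(block)
            unpacks += 1
            row.append(dot(w, x))
        outputs.append(row)
    return outputs, unpacks


def matmul_unpack_once(blocks: List[Block], tokens: List[List[float]]):
    """Batch-aware loop: unpack each block once, reuse it for all tokens."""
    unpacks = 0
    outputs = [[] for _ in tokens]
    for block in blocks:
        w = unpack(block)
        unpacks += 1
        for row, x in zip(outputs, tokens):
            row.append(dot(w, x))
    return outputs, unpacks
```

Both functions return identical results. The second performs one unpack per block instead of one per block per token, which is the redundant memory traffic the text describes being removed.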

Get Started

1. Install Ollama 0.31+

Download from ollama.com (macOS).

2. Re-pull Gemma 4 with MTP weights

If you pulled Gemma 4 before 0.31:

ollama pull gemma4:12b-mlx

3. Launch a coding agent

ollama launch claude --model gemma4:12b-mlx

Also supported: Codex, Droid, OpenCode, Copilot, and others via ollama launch.

For OpenCode / custom configs, point your provider at Ollama's local API — see run open models in OpenCode.
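To check generation speed on your own machine, one option is to call Ollama's local HTTP API directly and compute tokens per second from the timing fields in the response. The sketch below assumes Ollama is running on its default port (11434) and uses the `eval_count` and `eval_duration` fields (duration in nanoseconds) returned by `/api/generate`; the function names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def generation_tps(response: dict) -> float:
    """Tokens/second from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `result = generate("gemma4:12b-mlx", "Write a function that reverses a linked list.")` followed by `generation_tps(result)` gives a single-prompt reading. A single prompt is not the Aider benchmark, so expect numbers that differ from the figures above.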

Ollama MTP vs llama.cpp MTP

	Ollama 0.31 (Gemma 4 MLX)	llama.cpp (e.g. Qwen MTP GGUF)
Draft source	Built into Gemma 4 weights	`--spec-type draft-mtp` quants
Runtime	MLX via Ollama	`llama-server` direct
Tuning	Auto draft length	Manual flags / model choice
Best for	Mac users who want one binary + agents	Max control, router, embeddings
Multimodal	Gemma 4 vision/audio path	Text-first today

Many power users still run llama.cpp directly for -ngl, MTP on Qwen, and OpenCode at localhost:8080. Ollama 0.31 is the low-friction path when you want ollama launch claude and Gemma 4 speed without tuning.

Limitations

macOS / MLX first — this release targets Apple Silicon; Linux/CUDA paths may differ
Gemma 4 only (for now) — other models not yet on this MTP stack
VRAM / unified memory — 12B nvfp4 still needs a capable Mac; not a phone model
Agent quality ≠ tok/s — faster decode does not fix weak reasoning; benchmark on your repo
Re-pull required — old gemma4:12b-mlx pulls may lack MTP-enabled weights
Not cloud parity — 95 tok/s local is excellent for Mac; still not Fable-class frontier for hard agent work


Related posts

MacBook vs dedicated GPU for local LLMs: how much RAM you really get, and when each wins in 2026

Gemma Chat: offline vibe coding with Gemma 4 and MLX on Mac

How to Run Open Source Models Locally and Wire Them Into OpenCode (2026)
