What is DiffusionGemma?

DiffusionGemma is an experimental open-weights model from Google DeepMind (released June 10, 2026) that generates text using discrete diffusion instead of autoregressive token-by-token decoding. Built on Gemma 4's 26B MoE backbone (3.8B active parameters), it denoises 256-token blocks in parallel for up to 4× faster inference on dedicated GPUs. Licensed under Apache 2.0.

How is DiffusionGemma faster than normal language models?

Autoregressive models generate one token at a time with causal attention—decode is memory-bandwidth-bound. DiffusionGemma uses an encoder-decoder architecture: an autoregressive encoder prefills context, then a diffusion decoder bidirectionally denoises 256-token canvases in parallel (~15–20 tokens per forward pass). Google reports 1000+ tokens/sec on H100 and 700+ on RTX 5090 versus baseline Gemma 4.

Can DiffusionGemma run on consumer GPUs?

Yes. As a 26B MoE with 3.8B active parameters, DiffusionGemma fits in roughly 18GB VRAM when quantized—within reach of high-end consumer cards like RTX 5090 and 4090. Google optimized with NVIDIA for NVFP4 kernels, MLX on Apple Silicon, vLLM, Hugging Face Transformers, and NeMo fine-tuning.

Is DiffusionGemma better than Gemma 4 on benchmarks?

Generally no on quality benchmarks. DiffusionGemma trails autoregressive Gemma 4 26B on most tasks—for example MMLU Pro 77.6% vs 82.6% and AIME 2026 69.1% vs 88.3% per Hugging Face's Gemma 4 launch post. Google recommends standard Gemma 4 for maximum output quality; DiffusionGemma targets speed-critical interactive workflows like inline editing and code infilling.

What inputs and outputs does DiffusionGemma support?

DiffusionGemma is multimodal: text and image inputs, text output, up to 256K context, 140+ languages, thinking mode, function calling, and image understanding (OCR, document parsing) inherited from the Gemma 4 toolkit. It shares the Gemma 4 26B A4B MoE architecture (128 experts, 8 active plus 1 shared).

Where can I download DiffusionGemma?

Weights are on Hugging Face under Apache 2.0. Serving options include vLLM, Hugging Face Transformers, Google Cloud Model Garden, NVIDIA NIM, MLX, and fine-tuning via Unsloth and NeMo. llama.cpp compatibility was announced as coming soon at launch.

DiffusionGemma: Google 4× Faster Open Text Diffusion Model | explainx.ai Blog

On June 10, 2026, Google DeepMind released DiffusionGemma—an experimental open-weights model that generates text with discrete diffusion instead of predicting one word at a time. @Google and @GoogleDeepMind pitched it as up to 4× faster on dedicated GPUs; CEO Sundar Pichai called it a "racehorse" for interactive apps.

TL;DR

Spec	DiffusionGemma	Standard Gemma 4 (26B A4B)
Method	Parallel 256-token diffusion blocks	Autoregressive (token-by-token)
Total / active params	26B MoE / 3.8B active	25.2B / 3.8B active
Speed (H100)	1000+ tok/s	Baseline
Speed (RTX 5090)	700+ tok/s	Baseline
VRAM (quantized)	~18 GB	Similar class
License	Apache 2.0	Gemma terms
Quality	Lower on most benchmarks	Production recommended
Best for	Interactive local, low latency	Max quality, cloud QPS

Why diffusion for text?

Nearly every LLM today is autoregressive: generate token t, then t+1, each depending on all prior tokens. Decode becomes memory-bandwidth-bound—the GPU waits on KV cache reads more than it computes.

DiffusionGemma inverts the decode bottleneck:

mermaid

flowchart LR
  A[Encoder prefills context] --> B[KV cache built]
  B --> C[256-token canvas of masked tokens]
  C --> D[Parallel denoising passes]
  D --> E[Canvas finalized → append to cache]
  E --> F[Next canvas…]

From the Hugging Face Gemma 4 launch post:

An autoregressive encoder prefills prompts and builds the KV cache.
A diffusion decoder applies bidirectional attention over a 256-token canvas.
The model iteratively denoises the full canvas; finalized tokens append to cache; the next canvas begins.

Block-autoregressive across canvases, parallel within each canvas—roughly 15–20 tokens per forward pass versus one.

Speed numbers that matter

Google's published throughput (official blog):

Hardware	Throughput
NVIDIA H100 (single GPU)	1000+ tokens/sec
GeForce RTX 5090	700+ tokens/sec
vs autoregressive Gemma 4	Up to ~4× faster

@sundarpichai:

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token output!

Adaptive compute: Simpler prompts and structured tasks (code infilling, markdown) can use fewer denoising steps, so tokens-per-second scales with task complexity.

Architecture and footprint

DiffusionGemma shares Gemma 4's 26B A4B MoE foundation:

Attribute	Value
Total parameters	~26B (25.2B in HF spec)
Active per forward	3.8B (8 of 128 experts + 1 shared)
Context	Up to 256K tokens
Modalities	Text + image in; text out
Languages	140+
Canvas size	256 tokens per diffusion block

Self-correction: Bidirectional denoising lets the model revise masked tokens mid-block—useful for markdown formatting and structured output. Autoregressive models commit each token permanently.

Local footprint: Quantized weights target ~18 GB VRAM—high-end consumer GPUs without datacenter hardware.

The quality trade-off

Google is explicit: use autoregressive Gemma 4 for production quality. DiffusionGemma is experimental—speed first.

Benchmark snapshot (Hugging Face Gemma 4 blog):

Benchmark	DiffusionGemma	Gemma 4 26B A4B
MMLU Pro	77.6%	82.6%
AIME 2026	69.1%	88.3%
GPQA Diamond	73.2%	82.3%
HLE (no tools)	11.0%	8.7%

DiffusionGemma wins a few benchmarks and trails on most—the expected Pareto frontier when trading accuracy for throughput.

When speed wins:

Inline editing and code infilling
Rapid iteration in IDE assistants
Real-time markdown / structured formatting
Interactive local apps where latency beats absolute benchmark scores

When quality wins:

Long-form reasoning, agents, production RAG
High-QPS cloud serving where Gemma 4 autoregressive stacks are tuned

Multimodal toolkit inherited from Gemma 4

DiffusionGemma is not a text-only hack—it ships the broader Gemma 4 feature set:

Thinking mode
Function calling
Native system prompts
Image understanding — OCR, document parsing, object detection, pointing at variable aspect ratios

Released alongside the wider Gemma 4 family (E2B, E4B, multimodal on-device models), DiffusionGemma is the speed-specialized sibling under the same Apache 2.0 open license.

Ecosystem and how to run it

Download: Hugging Face model hub (Apache 2.0).

Serving and tuning:

Stack	Support
Hugging Face Transformers	Yes
vLLM	Yes
NVIDIA NeMo / NIM	Yes
MLX (Apple Silicon)	Community ports
Unsloth	Fine-tuning
llama.cpp	Announced coming soon at launch
Google Cloud Model Garden	Yes

NVIDIA optimization: NVFP4 4-bit kernels on Hopper/Blackwell; consumer RTX 5090/4090; DGX Spark and DGX Station for deskside local AI.

DiffusionGemma vs autoregressive: decision table

Question	Choose DiffusionGemma	Choose Gemma 4 AR
Need lowest latency locally?	✅
Need best benchmark scores?		✅
Interactive editing / infilling?	✅
Agent loops with tool use?	Caution—verify quality	✅
Apache 2.0 open weights?	✅	Gemma license
18 GB consumer GPU?	✅ (quantized)	✅ (smaller variants too)

For agentic coding stacks, DiffusionGemma is interesting as a local copilot engine—pair speed with verification loops (loop engineering) rather than trusting first-pass quality.

Industry context (June 2026)

DiffusionGemma landed the same week as Claude Fable 5, Code with Claude Tokyo agent scheduling, and Thariq's agent-edited launch video—a dense news cycle where speed (DiffusionGemma), autonomy (Fable, managed agents), and orchestration (workflows) all advanced in parallel.

Google's bet: decode parallelism matters for the next wave of on-device and IDE-embedded models, even if autoregression keeps the quality crown for now.

Gemma chat offline on Apple Silicon — local Gemma workflows
Loop engineering — verify fast first drafts
Code with Claude Tokyo: scheduling and vaults
What are LLM tokens? — why token throughput matters
Top 10 LLM directories — where open models are indexed

Primary sources: Google DiffusionGemma blog · Hugging Face Gemma 4 launch · @Google · @sundarpichai

Summary

DiffusionGemma is Google's open speed experiment: 26B MoE, 256-token parallel diffusion blocks, 4× faster on H100/5090, 18 GB quantized local runs, Apache 2.0. Sundar Pichai's racehorse framing is apt—it wins races where latency dominates, not where MMLU Pro does.

For production text quality, Google still points to autoregressive Gemma 4. For interactive local generation, inline edits, and researcher exploration of text diffusion, DiffusionGemma is the model to benchmark this week.

Specs, benchmarks, and serving support reflect Google's June 10, 2026 release. Re-check Hugging Face and the Google developers blog before production deployment.

TL;DR

Spec	DiffusionGemma	Standard Gemma 4 (26B A4B)
Method	Parallel 256-token diffusion blocks	Autoregressive (token-by-token)
Total / active params	26B MoE / 3.8B active	25.2B / 3.8B active
Speed (H100)	1000+ tok/s	Baseline
Speed (RTX 5090)	700+ tok/s	Baseline
VRAM (quantized)	~18 GB	Similar class
License	Apache 2.0	Gemma terms
Quality	Lower on most benchmarks	Production recommended
Best for	Interactive local, low latency	Max quality, cloud QPS

Why diffusion for text?

DiffusionGemma inverts the decode bottleneck:

mermaid

flowchart LR
  A[Encoder prefills context] --> B[KV cache built]
  B --> C[256-token canvas of masked tokens]
  C --> D[Parallel denoising passes]
  D --> E[Canvas finalized → append to cache]
  E --> F[Next canvas…]

From the Hugging Face Gemma 4 launch post:

An autoregressive encoder prefills prompts and builds the KV cache.
A diffusion decoder applies bidirectional attention over a 256-token canvas.
The model iteratively denoises the full canvas; finalized tokens append to cache; the next canvas begins.

Block-autoregressive across canvases, parallel within each canvas—roughly 15–20 tokens per forward pass versus one.

Speed numbers that matter

Google's published throughput (official blog):

Hardware	Throughput
NVIDIA H100 (single GPU)	1000+ tokens/sec
GeForce RTX 5090	700+ tokens/sec
vs autoregressive Gemma 4	Up to ~4× faster

@sundarpichai:

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token output!

Adaptive compute: Simpler prompts and structured tasks (code infilling, markdown) can use fewer denoising steps, so tokens-per-second scales with task complexity.

Architecture and footprint

DiffusionGemma shares Gemma 4's 26B A4B MoE foundation:

Attribute	Value
Total parameters	~26B (25.2B in HF spec)
Active per forward	3.8B (8 of 128 experts + 1 shared)
Context	Up to 256K tokens
Modalities	Text + image in; text out
Languages	140+
Canvas size	256 tokens per diffusion block

Local footprint: Quantized weights target ~18 GB VRAM—high-end consumer GPUs without datacenter hardware.

The quality trade-off

Google is explicit: use autoregressive Gemma 4 for production quality. DiffusionGemma is experimental—speed first.

Benchmark snapshot (Hugging Face Gemma 4 blog):

Benchmark	DiffusionGemma	Gemma 4 26B A4B
MMLU Pro	77.6%	82.6%
AIME 2026	69.1%	88.3%
GPQA Diamond	73.2%	82.3%
HLE (no tools)	11.0%	8.7%

DiffusionGemma wins a few benchmarks and trails on most—the expected Pareto frontier when trading accuracy for throughput.

When speed wins:

Inline editing and code infilling
Rapid iteration in IDE assistants
Real-time markdown / structured formatting
Interactive local apps where latency beats absolute benchmark scores

When quality wins:

Long-form reasoning, agents, production RAG
High-QPS cloud serving where Gemma 4 autoregressive stacks are tuned

Multimodal toolkit inherited from Gemma 4

DiffusionGemma is not a text-only hack—it ships the broader Gemma 4 feature set:

Thinking mode
Function calling
Native system prompts
Image understanding — OCR, document parsing, object detection, pointing at variable aspect ratios

Released alongside the wider Gemma 4 family (E2B, E4B, multimodal on-device models), DiffusionGemma is the speed-specialized sibling under the same Apache 2.0 open license.

Ecosystem and how to run it

Download: Hugging Face model hub (Apache 2.0).

Serving and tuning:

Stack	Support
Hugging Face Transformers	Yes
vLLM	Yes
NVIDIA NeMo / NIM	Yes
MLX (Apple Silicon)	Community ports
Unsloth	Fine-tuning
llama.cpp	Announced coming soon at launch
Google Cloud Model Garden	Yes

NVIDIA optimization: NVFP4 4-bit kernels on Hopper/Blackwell; consumer RTX 5090/4090; DGX Spark and DGX Station for deskside local AI.

DiffusionGemma vs autoregressive: decision table

Question	Choose DiffusionGemma	Choose Gemma 4 AR
Need lowest latency locally?	✅
Need best benchmark scores?		✅
Interactive editing / infilling?	✅
Agent loops with tool use?	Caution—verify quality	✅
Apache 2.0 open weights?	✅	Gemma license
18 GB consumer GPU?	✅ (quantized)	✅ (smaller variants too)

For agentic coding stacks, DiffusionGemma is interesting as a local copilot engine—pair speed with verification loops (loop engineering) rather than trusting first-pass quality.

Industry context (June 2026)

Google's bet: decode parallelism matters for the next wave of on-device and IDE-embedded models, even if autoregression keeps the quality crown for now.

Gemma chat offline on Apple Silicon — local Gemma workflows
Loop engineering — verify fast first drafts
Code with Claude Tokyo: scheduling and vaults
What are LLM tokens? — why token throughput matters
Top 10 LLM directories — where open models are indexed

Primary sources: Google DiffusionGemma blog · Hugging Face Gemma 4 launch · @Google · @sundarpichai

Summary

Specs, benchmarks, and serving support reflect Google's June 10, 2026 release. Re-check Hugging Face and the Google developers blog before production deployment.

DiffusionGemma: Google’s 4× Faster Open Model Uses Text Diffusion

Why diffusion for text?

Speed numbers that matter

Architecture and footprint

The quality trade-off

Multimodal toolkit inherited from Gemma 4

Ecosystem and how to run it

DiffusionGemma vs autoregressive: decision table

Industry context (June 2026)

Summary

DiffusionGemma: Google’s 4× Faster Open Model Uses Text Diffusion

Why diffusion for text?

Speed numbers that matter

Architecture and footprint

The quality trade-off

Multimodal toolkit inherited from Gemma 4

Ecosystem and how to run it

DiffusionGemma vs autoregressive: decision table

Industry context (June 2026)

Summary

Related posts

Can Claude or LLMs Watch a Video? Here's How to Make It Work

AirLLM: Run 70B Language Models on a 4GB GPU — No Quantization, No $10K Hardware

Run GLM-5.2 Locally: 744B Parameters, 40B Active, on a 256GB Mac or 245GB RAM PC

Related posts

Can Claude or LLMs Watch a Video? Here's How to Make It Work

AirLLM: Run 70B Language Models on a 4GB GPU — No Quantization, No $10K Hardware

Run GLM-5.2 Locally: 744B Parameters, 40B Active, on a 256GB Mac or 245GB RAM PC

Why diffusion for text?

Speed numbers that matter

Architecture and footprint

The quality trade-off

Multimodal toolkit inherited from Gemma 4

Ecosystem and how to run it

DiffusionGemma vs autoregressive: decision table

Industry context (June 2026)

Related explainx.ai guides

Summary

Why diffusion for text?

Speed numbers that matter

Architecture and footprint

The quality trade-off

Multimodal toolkit inherited from Gemma 4

Ecosystem and how to run it

DiffusionGemma vs autoregressive: decision table

Industry context (June 2026)

Related explainx.ai guides

Summary

Related posts

Can Claude or LLMs Watch a Video? Here's How to Make It Work

AirLLM: Run 70B Language Models on a 4GB GPU — No Quantization, No $10K Hardware

Run GLM-5.2 Locally: 744B Parameters, 40B Active, on a 256GB Mac or 245GB RAM PC

Related posts

Can Claude or LLMs Watch a Video? Here's How to Make It Work

AirLLM: Run 70B Language Models on a 4GB GPU — No Quantization, No $10K Hardware

Run GLM-5.2 Locally: 744B Parameters, 40B Active, on a 256GB Mac or 245GB RAM PC