On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open-weights model that generates text with discrete diffusion instead of predicting one token at a time. @Google and @GoogleDeepMind pitched it as up to 4× faster on dedicated GPUs; CEO Sundar Pichai called it a "racehorse" for interactive apps.
TL;DR
| Spec | DiffusionGemma | Standard Gemma 4 (26B A4B) |
|---|---|---|
| Method | Parallel 256-token diffusion blocks | Autoregressive (token-by-token) |
| Total / active params | 26B MoE / 3.8B active | 25.2B / 3.8B active |
| Speed (H100) | 1000+ tok/s | Baseline |
| Speed (RTX 5090) | 700+ tok/s | Baseline |
| VRAM (quantized) | ~18 GB | Similar class |
| License | Apache 2.0 | Gemma terms |
| Quality | Lower on most benchmarks | Production recommended |
| Best for | Interactive local, low latency | Max quality, cloud QPS |
Why diffusion for text?
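Nearly every LLM today is autoregressive: it generates token t, then token t+1, with each token conditioned on all prior tokens. Decoding therefore becomes memory-bandwidth-bound: for every new token the GPU must re-read the model weights and the KV cache, so it spends more time waiting on memory than computing.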
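A back-of-envelope sketch of that bandwidth ceiling, under stated assumptions: every forward pass streams the active weights plus the KV cache from GPU memory once, and nothing else limits speed. The bandwidth and KV-cache figures below are hypothetical placeholders, not measurements of any GPU or of DiffusionGemma.

```python
def decode_ceiling_tok_per_s(active_params, bytes_per_param, kv_bytes_per_pass,
                             bandwidth_bytes_per_s, tokens_per_pass=1):
    """Upper bound on decode throughput when each forward pass must read the
    active weights and the KV cache from memory exactly once."""
    bytes_per_pass = active_params * bytes_per_param + kv_bytes_per_pass
    passes_per_s = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_s * tokens_per_pass


# Hypothetical inputs: 3.8B active parameters at 4 bits (0.5 bytes) each,
# 1 GB of KV cache read per pass, 2 TB/s of memory bandwidth.
one_token = decode_ceiling_tok_per_s(3.8e9, 0.5, 1e9, 2e12)
many_tokens = decode_ceiling_tok_per_s(3.8e9, 0.5, 1e9, 2e12, tokens_per_pass=16)
print(f"1 token per pass:   {one_token:8.0f} tok/s ceiling")
print(f"16 tokens per pass: {many_tokens:8.0f} tok/s ceiling")
```

In this simplified model the ceiling grows linearly with tokens per pass. The reported speedup is "up to 4×", well below that, which suggests a pass over a full canvas costs more than a single-token pass; the sketch deliberately ignores that extra cost.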
DiffusionGemma attacks this bottleneck by decoding a whole block of tokens per pass:
flowchart LR
A[Encoder prefills context] --> B[KV cache built]
B --> C[256-token canvas of masked tokens]
C --> D[Parallel denoising passes]
D --> E[Canvas finalized → append to cache]
E --> F[Next canvas…]
From the Hugging Face Gemma 4 launch post:
- An autoregressive encoder prefills prompts and builds the KV cache.
- A diffusion decoder applies bidirectional attention over a 256-token canvas.
- The model iteratively denoises the full canvas; finalized tokens append to cache; the next canvas begins.
Generation is block-autoregressive across canvases and parallel within each canvas, which works out to roughly 15–20 tokens per forward pass versus one for a standard autoregressive decoder.
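The loop described above can be illustrated with a toy simulation. This is a minimal sketch, not DiffusionGemma's decoder: confidence scores are random stand-ins for model output, `toy_denoise_canvas` and `generate` are hypothetical names, and committed tokens are never revised here, whereas the real model can revise within a block.

```python
import random

MASK = None  # placeholder for a canvas position that is still masked


def toy_denoise_canvas(canvas_size=256, tokens_per_pass=16, seed=0):
    """Fill one canvas by repeatedly committing the highest-scoring masked
    positions. Scores are random; a real model would produce them."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_size
    passes = 0
    while any(tok is MASK for tok in canvas):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        # One "forward pass" scores every masked position in parallel.
        scores = {i: rng.random() for i in masked}
        # Commit the top positions, wherever they fall in the canvas.
        best = sorted(masked, key=scores.get, reverse=True)[:tokens_per_pass]
        for i in best:
            canvas[i] = f"tok{i}"
        passes += 1
    return canvas, passes


def generate(num_canvases=2, canvas_size=256, tokens_per_pass=16):
    """Block-autoregressive outer loop: each finished canvas joins the context."""
    context, total_passes = [], 0
    for block in range(num_canvases):
        canvas, passes = toy_denoise_canvas(canvas_size, tokens_per_pass, seed=block)
        context.extend(canvas)  # stands in for appending to the KV cache
        total_passes += passes
    return context, total_passes


tokens, passes = generate()
print(f"{len(tokens)} tokens in {passes} passes (token-by-token would need {len(tokens)})")
```

With 16 tokens committed per pass, each 256-token canvas takes 16 passes, so two canvases need 32 passes where a token-by-token decoder would need 512.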
Speed numbers that matter
Google's published throughput (official blog):
| Hardware | Throughput |
|---|---|
| NVIDIA H100 (single GPU) | 1000+ tokens/sec |
| GeForce RTX 5090 | 700+ tokens/sec |
| vs autoregressive Gemma 4 | Up to ~4× faster |
Sundar Pichai (@sundarpichai): "DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token output!"
Adaptive compute: Simpler prompts and structured tasks (code infilling, markdown) can use fewer denoising steps, so throughput rises on easy tasks and falls on harder ones.
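The relationship between denoising steps and throughput can be sketched with simple arithmetic. The step latency below is a hypothetical value chosen to keep the numbers readable, and the sketch assumes every step costs the same regardless of how many positions remain masked.

```python
def canvas_throughput(steps, seconds_per_step, canvas_size=256):
    """Tokens per second for one canvas that needs `steps` denoising passes."""
    if steps <= 0 or seconds_per_step <= 0:
        raise ValueError("steps and seconds_per_step must be positive")
    return canvas_size / (steps * seconds_per_step)


STEP_SECONDS = 0.016  # hypothetical per-step latency, not a measured figure

for label, steps in [("structured / easy", 8), ("typical", 16), ("hard", 32)]:
    rate = canvas_throughput(steps, STEP_SECONDS)
    print(f"{label:18s} {steps:2d} steps -> {rate:6.0f} tok/s")
```

Halving the number of steps doubles tokens per second under these assumptions, which is why a single headline throughput figure only applies to a particular mix of tasks.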
Architecture and footprint
DiffusionGemma shares Gemma 4's 26B A4B MoE foundation:
| Attribute | Value |
|---|---|
| Total parameters | ~26B (25.2B in HF spec) |
| Active per forward | 3.8B (8 of 128 experts + 1 shared) |
| Context | Up to 256K tokens |
| Modalities | Text + image in; text out |
| Languages | 140+ |
| Canvas size | 256 tokens per diffusion block |
Self-correction: Bidirectional denoising lets the model revise masked tokens mid-block—useful for markdown formatting and structured output. Autoregressive models commit each token permanently.
Local footprint: Quantized weights target ~18 GB of VRAM, which fits high-end consumer GPUs and removes the need for datacenter hardware.
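A rough way to sanity-check that footprint is to compute weight storage at a few bit widths. The sketch below uses the 25.2B total from the table above; all experts of a MoE must be resident in memory even though only 3.8B parameters are active per pass. The source does not state which quantization scheme produces the ~18 GB figure, so the bit widths are illustrative assumptions.

```python
def weight_gb(total_params, bits_per_param):
    """Weight storage in GB (10^9 bytes) at a given average bit width."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 25.2e9  # total parameter count from the spec table

for bits in (16, 8, 4):
    print(f"{bits:2d}-bit weights: {weight_gb(TOTAL_PARAMS, bits):5.1f} GB")
```

If the ~18 GB target corresponds to roughly 4-bit weights (about 12.6 GB), the remainder would have to cover the KV cache, activations, and any layers kept at higher precision.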
The quality trade-off
Google is explicit: use autoregressive Gemma 4 for production quality. DiffusionGemma is experimental—speed first.
Benchmark snapshot (Hugging Face Gemma 4 blog):
| Benchmark | DiffusionGemma | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 | 69.1% | 88.3% |
| GPQA Diamond | 73.2% | 82.3% |
| HLE (no tools) | 11.0% | 8.7% |
DiffusionGemma wins a few benchmarks (HLE in this snapshot) and trails on most, which is the expected trade-off when giving up accuracy for throughput.
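To make the size of each gap explicit, the snapshot above can be turned into percentage-point differences. The scores are copied from the table; the helper name `gaps` is just for illustration.

```python
# (DiffusionGemma, Gemma 4 26B A4B) scores in percent, from the table above.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
    "GPQA Diamond": (73.2, 82.3),
    "HLE (no tools)": (11.0, 8.7),
}


def gaps(scores):
    """Percentage-point difference: positive means DiffusionGemma is ahead."""
    return {
        name: round(diffusion - autoregressive, 1)
        for name, (diffusion, autoregressive) in scores.items()
    }


for name, gap in gaps(SCORES).items():
    print(f"{name:15s} {gap:+5.1f} pts")
```

The largest deficit in this snapshot is on AIME 2026, a math reasoning benchmark, which lines up with the advice below to prefer the autoregressive model for long-form reasoning.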
When speed wins:
- Inline editing and code infilling
- Rapid iteration in IDE assistants
- Real-time markdown / structured formatting
- Interactive local apps where latency beats absolute benchmark scores
When quality wins:
- Long-form reasoning, agents, production RAG
- High-QPS cloud serving where Gemma 4 autoregressive stacks are tuned
Multimodal toolkit inherited from Gemma 4
DiffusionGemma is not a text-only hack—it ships the broader Gemma 4 feature set:
- Thinking mode
- Function calling
- Native system prompts
- Image understanding — OCR, document parsing, object detection, pointing at variable aspect ratios
Released alongside the wider Gemma 4 family (E2B, E4B, multimodal on-device models), DiffusionGemma is the speed-specialized sibling, distributed under the Apache 2.0 open license.
Ecosystem and how to run it
Download: Hugging Face model hub (Apache 2.0).
Serving and tuning:
| Stack | Support |
|---|---|
| Hugging Face Transformers | Yes |
| vLLM | Yes |
| NVIDIA NeMo / NIM | Yes |
| MLX (Apple Silicon) | Community ports |
| Unsloth | Fine-tuning |
| llama.cpp | Announced coming soon at launch |
| Google Cloud Model Garden | Yes |
NVIDIA optimization: NVFP4 4-bit kernels on Hopper/Blackwell, support for consumer RTX 5090/4090 cards, and DGX Spark and DGX Station for deskside local AI.
DiffusionGemma vs autoregressive: decision table
| Question | Choose DiffusionGemma | Choose Gemma 4 AR |
|---|---|---|
| Need lowest latency locally? | ✅ | |
| Need best benchmark scores? | | ✅ |
| Interactive editing / infilling? | ✅ | |
| Agent loops with tool use? | Caution—verify quality | ✅ |
| Apache 2.0 open weights? | ✅ | Gemma license |
| 18 GB consumer GPU? | ✅ (quantized) | ✅ (smaller variants too) |
For agentic coding stacks, DiffusionGemma is interesting as a local copilot engine—pair speed with verification loops (loop engineering) rather than trusting first-pass quality.
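A verification loop of that kind might look like the following minimal sketch. Both `draft` and `verify` are hypothetical callables supplied by the caller: `draft` wraps whatever model endpoint is in use, and `verify` returns `None` on success or an error message (a failing test, a lint error, and so on). Nothing here is specific to DiffusionGemma's API.

```python
from typing import Callable, Optional


def generate_with_verification(
    draft: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask a fast model for a draft, check it, and retry with the error
    message appended. Returns None if no attempt passes verification."""
    current_prompt = prompt
    for _ in range(max_attempts):
        candidate = draft(current_prompt)
        error = verify(candidate)
        if error is None:
            return candidate
        # Feed the failure back so the next draft can correct it.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt failed: {error}\nPlease fix it."
        )
    return None
```

The design bet is that several fast drafts plus a cheap check can still finish sooner than one slow, higher-quality generation, provided the verifier is reliable enough to catch the fast model's mistakes.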
Industry context (June 2026)
DiffusionGemma landed the same week as Claude Fable 5, Code with Claude Tokyo agent scheduling, and Thariq's agent-edited launch video—a dense news cycle where speed (DiffusionGemma), autonomy (Fable, managed agents), and orchestration (workflows) all advanced in parallel.
Google's bet: decode parallelism matters for the next wave of on-device and IDE-embedded models, even if autoregression keeps the quality crown for now.
Related ExplainX guides
- Gemma chat offline on Apple Silicon — local Gemma workflows
- Loop engineering — verify fast first drafts
- Code with Claude Tokyo: scheduling and vaults
- What are LLM tokens? — why token throughput matters
- Top 10 LLM directories — where open models are indexed
Primary sources: Google DiffusionGemma blog · Hugging Face Gemma 4 launch · @Google · @sundarpichai
Summary
DiffusionGemma is Google's open speed experiment: 26B MoE, 256-token parallel diffusion blocks, up to 4× faster than autoregressive Gemma 4 on H100/RTX 5090, ~18 GB quantized local runs, Apache 2.0. Sundar Pichai's racehorse framing is apt: it wins races where latency dominates, not where MMLU Pro does.
For production text quality, Google still points to autoregressive Gemma 4. For interactive local generation, inline edits, and researcher exploration of text diffusion, DiffusionGemma is the model to benchmark this week.
Specs, benchmarks, and serving support reflect Google's June 10, 2026 release. Re-check Hugging Face and the Google developers blog before production deployment.