AI · 2026-06-11 · 10 min read

DiffusionGemma Deep Dive

Google DeepMind's June 10, 2026 Open-Weight Text-Diffusion LLM, Same Backbone as Gemma 4 26B (A4B MoE), Up to 4× Faster Than AR Counterparts, Apache 2.0, With an Honest "Quality Trails AR" Disclosure

A primary-source deep dive on DiffusionGemma (google/diffusiongemma-26B-A4B-it, 25.2B total / 3.8B active MoE), released June 10, 2026 by Google DeepMind in coordination with NVIDIA. It is grounded in the official Google blog, the ai.google.dev model card, the Hugging Face card, and NVIDIA's blog. Where autoregressive (AR) models generate one token at a time left-to-right, DiffusionGemma, a diffusion language model (DLM), denoises a 256-token canvas in parallel into final text. Each forward pass commits 15-20 tokens, with up to 48 denoising steps; throughput is 1,000+ tok/sec on H100 and 700+ on RTX 5090, about 3.5–4× that of the AR Gemma 4 counterpart. Crucially, Google openly states that quality lags AR: MMLU Pro 77.6 vs 82.6, GPQA 73.2 vs 82.3, MMMU Pro 54.3 vs 73.8. The model is Apache 2.0 and distributed via Hugging Face / Vertex AI / NVIDIA NIM, making it the first large-scale open-weight diffusion LLM in the industry. The column covers practical implications for Japanese enterprises (on-prem internal agents, code editing, low-latency workflows) and positioning against Mercury (Inception Labs), LLaDA, and Gemini Diffusion.

Google DeepMind · Gemma 4 · DiffusionGemma · Text Diffusion · Discrete Diffusion · Local LLM · NVIDIA · Apache 2.0

TL;DR — What Happened

On June 10, 2026, Google DeepMind and NVIDIA jointly announced DiffusionGemma (google/diffusiongemma-26B-A4B-it). It is the first open-weight, large-scale, text-diffusion LLM in the industry — a derivative of the Gemma 4 26B (A4B MoE) backbone with its autoregressive head replaced by a diffusion head.

Four points:

1. Architecture — discrete (masked) text diffusion, not autoregressive. A 256-token canvas is denoised in parallel
2. Speed — 1,000+ tok/sec on H100, 700+ on RTX 5090, ~3.5–4× the AR Gemma 4 counterpart
3. Quality — MMLU Pro / GPQA / MMMU Pro all trail the AR version. Google itself explicitly states quality is below the AR counterpart
4. License — Apache 2.0 (not Gemma License). Day-zero support on Hugging Face Transformers, vLLM, Unsloth, MLX; llama.cpp in progress

This column builds on our Gemma 4 12B encoder-free deep dive, Gemma 4 benchmark showdown, and Gemma 4 system requirements, focused on what DiffusionGemma technically is and how the quality-speed trade-off lands.

What Is a Diffusion Language Model (DLM)?

Autoregressive (AR) models — GPT, Claude, Llama, ordinary Gemma — generate text one token at a time, left-to-right. Each token depends on the prior tokens, so generation is fundamentally serial.

Diffusion language models adapt the iterative-denoising idea behind image generators such as Stable Diffusion to text:

1. Initialize a canvas of mask/noise tokens at the target length
2. Run multiple denoising steps that refine the entire canvas in parallel
3. Commit tokens that become confident at each step; re-refine the remainder
4. Stop once the canvas is fully determined

DiffusionGemma is in the discrete diffusion lineage, using a sampling technique Google calls Block-Autoregressive Multi-Canvas Sampling — denoise a 256-token canvas in parallel, commit confident tokens, flush them to the KV cache, advance to the next canvas.
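The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not Google's sampler: `toy_denoiser` is a hypothetical stand-in that returns random confidences instead of running a model, the fixed `commit_per_step=16` is an assumed value inside the 15-20 range quoted on the model card, and the `kv_cache` list merely mimics flushing a finished canvas before advancing to the next one.

```python
import random

MASK = None  # marks a canvas position that has not been committed yet


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the model: one (token, confidence) guess per masked slot."""
    return {i: (f"tok{i}", rng.random()) for i, tok in enumerate(canvas) if tok is MASK}


def denoise_canvas(length, rng, max_steps=48, commit_per_step=16):
    """Refine one canvas in parallel, committing the most confident guesses each step."""
    canvas = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in canvas) and steps < max_steps:
        proposals = toy_denoiser(canvas, rng)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (token, _confidence) in ranked[:commit_per_step]:
            canvas[pos] = token
        steps += 1
    return canvas, steps


def generate(total_tokens, canvas_len=256, seed=0):
    """Block-autoregressive outer loop: finish a canvas, flush it, start the next."""
    rng = random.Random(seed)
    kv_cache = []    # committed tokens from earlier canvases (toy stand-in)
    steps_used = []  # denoising steps spent on each canvas
    while len(kv_cache) < total_tokens:
        length = min(canvas_len, total_tokens - len(kv_cache))
        canvas, steps = denoise_canvas(length, rng)
        kv_cache.extend(canvas)
        steps_used.append(steps)
    return kv_cache, steps_used


tokens, steps_used = generate(total_tokens=600)
print(len(tokens), steps_used)  # 600 tokens spread over three canvases
```

Within a canvas, positions are committed in confidence order rather than left to right; across canvases, generation is still sequential, which is where the "block-autoregressive" part of the name comes from.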

The academic lineage runs through Stanford SEDD (Score Entropy Discrete Diffusion, 2024), LLaDA (8B, 2025), Mercury (Inception Labs, 2025-06), and Google's own Gemini Diffusion (2025, closed). DiffusionGemma is the first to combine open weights × large scale × Apache 2.0 in that lineage.

Where It Sits in the Gemma 4 Family

Model	Configuration	Released
Gemma 4 E2B / E4B	lightweight	2026-04-02
Gemma 4 26B (A4B MoE)	128 experts / 8 active / 3.8B active	2026-04-02
Gemma 4 31B Dense	Arena #3	2026-04-02
Gemma 4 12B (encoder-free)	multimodal, laptop-target	2026-06-03
DiffusionGemma 26B A4B	diffusion-head variant, Apache 2.0	2026-06-10

DiffusionGemma shares the exact same backbone as Gemma 4 26B (A4B MoE); only the head changes. That makes the AR and DLM versions directly comparable, and it is the first time the industry has had the same base running both modes as openly released artifacts.

Technical Spec (Official Model Card)

Item	Value
Total parameters	25.2B
Active parameters	3.8B at inference
Experts	128 total / 8 active
Layers	30
Vocab	262K
Canvas length	256 tokens
Max denoising steps	48 (adaptive early-stopping)
Tokens committed per forward pass	15-20
Context	up to 256K tokens
Vision encoder	~550M (separate)
Modalities	text + image + video (60s); no audio

Important caveat: the Google blog summarizes modality as 'text only,' while the Hugging Face card and ai.google.dev's model card explicitly include image and 60-second video input. Treat the HF / ai.google.dev cards as the authoritative source until the blog text is updated.
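A quick back-of-the-envelope check shows how the spec numbers fit together. The sketch below assumes that one forward pass corresponds to one denoising step and that the commit rate stays constant across a canvas; neither assumption is stated in the model card, so treat the result as a rough consistency check rather than a description of the sampler.

```python
import math

# Values copied from the spec table above.
CANVAS_TOKENS = 256
MAX_DENOISING_STEPS = 48
TOTAL_PARAMS_B = 25.2
ACTIVE_PARAMS_B = 3.8
EXPERTS_TOTAL = 128
EXPERTS_ACTIVE = 8


def passes_to_fill(canvas_tokens, committed_per_pass):
    """Forward passes needed to commit a full canvas at a constant commit rate."""
    return math.ceil(canvas_tokens / committed_per_pass)


fewest = passes_to_fill(CANVAS_TOKENS, 20)  # fastest quoted commit rate
most = passes_to_fill(CANVAS_TOKENS, 15)    # slowest quoted commit rate
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
expert_fraction = EXPERTS_ACTIVE / EXPERTS_TOTAL

print(f"{fewest}-{most} passes per canvas (cap: {MAX_DENOISING_STEPS})")
print(f"active parameters: {active_fraction:.1%} of total")
print(f"active experts: {expert_fraction:.2%} of experts")
```

Under these assumptions a canvas fills in 13 to 18 passes, comfortably below the 48-step cap, which is consistent with the "adaptive early-stopping" note in the table.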

Speed (Official Numbers)

Hardware	Throughput
NVIDIA H100 (FP8)	1,000+ tok/sec
GeForce RTX 5090 (18GB VRAM quant)	700+ tok/sec
DGX Spark	~150 tok/sec
DGX Station	up to 2,000 tok/sec

That is about 3.5–4× the throughput of the AR Gemma 4 26B counterpart. NVIDIA released an NVFP4-quantized variant (nvidia/diffusiongemma-26B-A4B-it-NVFP4) the same day, explicitly positioning it for local GPU inference.
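To make the throughput figures concrete, the sketch below converts them into wall-clock time for a single response. The 1,000-token response length is an assumed example, the published figures are bounds or approximations ("1,000+", "up to 2,000", "~150") and are used here as point values, and applying the 3.5–4× ratio to the H100 figure to back out an AR throughput is an assumption, not a published number.

```python
# Point values taken from the speed table above (bounds treated as exact for illustration).
PUBLISHED_TOK_PER_SEC = {
    "H100 (FP8)": 1000,
    "RTX 5090": 700,
    "DGX Spark": 150,
    "DGX Station": 2000,
}

RESPONSE_TOKENS = 1000  # assumed response length


def seconds_to_generate(tokens, tok_per_sec):
    """Pure generation time, ignoring prompt processing and network overhead."""
    return tokens / tok_per_sec


def implied_ar_throughput(dlm_tok_per_sec, speedup):
    """AR throughput implied by a DLM figure and a claimed speedup ratio."""
    return dlm_tok_per_sec / speedup


for hardware, tps in PUBLISHED_TOK_PER_SEC.items():
    print(f"{hardware}: {seconds_to_generate(RESPONSE_TOKENS, tps):.2f} s")

ar_low = implied_ar_throughput(1000, 4.0)
ar_high = implied_ar_throughput(1000, 3.5)
print(f"implied AR throughput on H100: {ar_low:.0f}-{ar_high:.0f} tok/sec")
```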

Benchmarks — Google's Honest "Quality Trails" Disclosure

Officially published numbers:

Benchmark	DiffusionGemma	Gemma 4 26B (AR)
MMLU Pro	77.6%	82.6%
GPQA Diamond	73.2%	82.3%
MMMU Pro (vision)	54.3%	73.8%

Google's framing: "trails standard Gemma 4 on every public benchmark we tested." DiffusionGemma is positioned as speed-specialized; for quality-first workloads, the AR version is the recommended path. GSM8K / HumanEval / Chatbot Arena numbers haven't been released — third-party verification will follow.
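The size of the gap differs a lot by benchmark, which is easier to see as absolute and relative differences. The short calculation below uses only the three published score pairs from the table; the relative figure (gap divided by the AR score) is our own framing, not one Google reports.

```python
# (DiffusionGemma, Gemma 4 26B AR) scores in percent, from the table above.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "GPQA Diamond": (73.2, 82.3),
    "MMMU Pro (vision)": (54.3, 73.8),
}


def gap_report(scores):
    """Return {benchmark: (gap in points, gap as % of the AR score)}."""
    report = {}
    for name, (dlm, ar) in scores.items():
        points = round(ar - dlm, 1)
        relative = round(100 * (ar - dlm) / ar, 1)
        report[name] = (points, relative)
    return report


report = gap_report(SCORES)
for name, (points, relative) in report.items():
    print(f"{name}: -{points} pt ({relative}% below AR)")
```

The text benchmarks trail by 5.0 and 9.1 points, while the vision benchmark trails by 19.5 points, so the quality cost depends heavily on the workload.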

The transparency here matters. Most model cards put their best foot forward; Google saying outright that DLM still trails AR but wins on speed is unusually candid and improves the precision of deployment decisions.

License and Distribution

- License: Apache 2.0 (not Gemma License — more permissive)
- Distribution:
  - Hugging Face: google/diffusiongemma-26B-A4B-it
  - Gemini Enterprise Agent Platform Model Garden
  - NVIDIA NIM / NeMo
- Day-zero: Hugging Face Transformers, vLLM, Unsloth, MLX
- llama.cpp: in progress
- GGUF: unsloth/diffusiongemma-26B-A4B-it-GGUF
- NVFP4: nvidia/diffusiongemma-26B-A4B-it-NVFP4

Apache 2.0 means full commercial use, modification, and redistribution. Same posture as Gemma 4 12B — Google is signaling a clear commitment to open weights for Gemma.

Diffusion-LM Lineage and Competitors

Model	Origin	Scale	Notes
SEDD (Score Entropy Discrete Diffusion)	Stanford (2024)	Academic	Foundational theory
LLaDA	2025	8B	Narrowed AR gap on MMLU to ~5pt
Mercury / Mercury Coder	Inception Labs (2025-06)	Commercial	737-1,109 tok/sec on H100, code-focused
Gemini Diffusion	Google (2025)	Closed	Internal precursor to DiffusionGemma
DiffusionGemma	Google DeepMind (2026-06)	25.2B MoE	First open-weight, large-scale DLM

DiffusionGemma's distinction is the "open weights × large MoE × Apache 2.0" trifecta. Mercury is commercial and closed, whereas anyone can download, modify, and commercially deploy DiffusionGemma.

Use Cases

Google and NVIDIA list:

- Interactive chat — low-latency, single user
- Agent loops — multi-tool flows where response latency dominates UX
- On-device assistants — DGX Spark / RTX 5090 self-contained
- Code editing / fill-in-the-middle — bidirectional context is a structural fit
- Long-document OCR / multimodal docs — 256K context
- Constrained generation — published results include sudoku tasks improving from 0% to 80% after fine-tuning
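The "bidirectional context" point behind the fill-in-the-middle use case can be illustrated with a toy canvas that has known tokens on both sides of a masked gap. In this sketch every name is hypothetical, and the rule "fill a slot once a neighbour is known" is a deliberately crude stand-in for a model that attends to both sides; it only shows why having the suffix available helps compared with left-to-right filling.

```python
MASK = "<mask>"


def build_infill_canvas(prefix, suffix, gap):
    """Known tokens on both sides, masked slots in between."""
    return list(prefix) + [MASK] * gap + list(suffix)


def toy_infill(canvas, use_right_context=True):
    """Each pass fills the masked slots that sit next to an already-known token."""
    canvas = list(canvas)
    passes = 0
    while MASK in canvas:
        snapshot = list(canvas)  # every slot sees the same state within a pass
        for i, tok in enumerate(snapshot):
            if tok != MASK:
                continue
            left_known = i > 0 and snapshot[i - 1] != MASK
            right_known = (
                use_right_context
                and i + 1 < len(snapshot)
                and snapshot[i + 1] != MASK
            )
            if left_known or right_known:
                canvas[i] = f"fill{i}"
        if canvas == snapshot:
            raise ValueError("no known token to anchor the fill")
        passes += 1
    return canvas, passes


prefix = ["def", "add", "("]
suffix = [")", ":"]
canvas = build_infill_canvas(prefix, suffix, gap=5)

filled, passes_both = toy_infill(canvas)
_, passes_left_only = toy_infill(canvas, use_right_context=False)
print(passes_both, passes_left_only)  # the gap closes from both ends vs. from the left only
```

The prefix and suffix are never modified; only the masked span changes, which is the property that makes this style of generation a natural fit for code edits.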

Limitations

- Quality trails AR (Google's own statement) — 5-20pt gaps on the published benchmarks
- GSM8K / HumanEval / Chatbot Arena not yet published — awaiting third-party validation
- No audio input (Gemma 4 12B encoder-free does support audio)
- Training cutoff January 2025
- Canvas size constrains long-form; long-form coherence may lag AR

Why Japanese Enterprises Should Care

1. Practical on-prem / local-GPU speed

Apache 2.0 plus 700+ tok/sec on a single RTX 5090 is a meaningful combination for finance, healthcare, and manufacturing organizations that can't move sensitive data off-premises. It also fits the data-egress constraints of Japan's amended Personal Information Protection Act and the Economic Security Promotion Act.

2. Trading quality for speed in the agent era

When "call count × latency" is the UX bottleneck, as it is for agent workloads, trading a bit of quality for serious speed becomes a real option. If generation in each step of a Claude Code Agent View or Cursor Automations loop runs 4× faster, end-to-end latency drops meaningfully, although time spent waiting on tools is unchanged. With only 3.8B active params, power costs are reasonable for Japanese data-center constraints.
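How much of the 4× shows up end to end depends on how much of each step is generation versus tool time. The sketch below uses made-up but plausible numbers (20 steps, 500 generated tokens per step, 1 second of tool latency per step, 250 vs 1,000 tok/sec); all of them are assumptions chosen for illustration.

```python
def loop_latency(steps, tokens_per_step, tok_per_sec, tool_seconds):
    """Total wall-clock time of an agent loop: generation plus tool wait, per step."""
    generation_seconds = tokens_per_step / tok_per_sec
    return steps * (generation_seconds + tool_seconds)


STEPS = 20             # assumed number of agent steps
TOKENS_PER_STEP = 500  # assumed tokens generated per step
TOOL_SECONDS = 1.0     # assumed tool/API latency per step
AR_TOK_PER_SEC = 250   # assumed AR throughput
DLM_TOK_PER_SEC = 1000 # assumed 4x faster diffusion throughput

ar_total = loop_latency(STEPS, TOKENS_PER_STEP, AR_TOK_PER_SEC, TOOL_SECONDS)
dlm_total = loop_latency(STEPS, TOKENS_PER_STEP, DLM_TOK_PER_SEC, TOOL_SECONDS)
speedup = ar_total / dlm_total

print(f"AR loop: {ar_total:.0f} s, DLM loop: {dlm_total:.0f} s, speedup: {speedup:.1f}x")
```

With these numbers the loop drops from 60 s to 30 s: a 2× end-to-end gain from a 4× generation gain. The more a workload is dominated by generation rather than tool calls, the closer the end-to-end figure gets to 4×.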

3. A single backbone supports both quality and speed modes

Because the AR Gemma 4 26B and DiffusionGemma share a backbone, you can switch the quality/speed trade-off with the same prompts — a clean consulting story for hybrid deployment.
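One way to picture such a hybrid deployment is a small routing rule in front of two endpoints. The sketch below is hypothetical throughout: the endpoint labels, the `Request` fields, and the routing policy are illustrative choices, not part of any Google or NVIDIA tooling.

```python
from dataclasses import dataclass

AR_ENDPOINT = "ar-endpoint"    # hypothetical label for an AR Gemma 4 26B deployment
DLM_ENDPOINT = "dlm-endpoint"  # hypothetical label for a DiffusionGemma deployment


@dataclass(frozen=True)
class Request:
    prompt: str
    latency_sensitive: bool  # e.g. an interactive agent step
    needs_top_quality: bool  # e.g. a final report or hard reasoning task


def pick_endpoint(request: Request) -> str:
    """Quality wins ties; speed is chosen only when quality is not critical."""
    if request.needs_top_quality:
        return AR_ENDPOINT
    if request.latency_sensitive:
        return DLM_ENDPOINT
    return AR_ENDPOINT


draft = Request("Rename this variable across the file", True, False)
final = Request("Write the audit summary for the board", False, True)
print(pick_endpoint(draft), pick_endpoint(final))
```

Because the prompt is passed through unchanged, the same request can be replayed against either endpoint when comparing quality and latency during a PoC.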

4. The first time diffusion-LM has hit "production-class"

2025 was Mercury and (closed) Gemini Diffusion; 2026 is the year open weights × Apache 2.0 × large MoE all line up in one release. This is the right moment for Japanese enterprises to start serious diffusion-LM PoCs.

Our AI consulting practice handles AR + DLM hybrid designs through Forward Deployed Engineer-style on-site enablement.

What Isn't Officially Confirmed

As of June 11, 2026:

- Official GSM8K / HumanEval / Chatbot Arena scores
- Training token count
- Weight inheritance relationship with Gemini Diffusion
- Japanese-specific benchmark scores (JCommonsenseQA / JGLUE etc.)
- The modality discrepancy between Google's blog ("Text only") and the HF/ai.google.dev cards (text + image + video) — the HF/ai.google.dev cards should be treated as authoritative

Verify against the ai.google.dev model card and the Hugging Face card before production decisions.

FAQ

Q1. What's the fundamental difference between AR and a diffusion LM?
A. AR generates one token at a time, left-to-right (high quality but inherently serial). DLM denoises a whole canvas in parallel (faster, but quality currently behind AR). DLMs adapt the iterative-denoising idea behind image generators such as Stable Diffusion to text.

Q2. Is DiffusionGemma a replacement for the AR Gemma 4 26B?
A. No — Google itself is explicit that quality trails AR on every published benchmark. It's a complement, specialized for workloads where speed dominates UX (agent loops, interactive chat, code completion).

Q3. Commercial use?
A. Yes — Apache 2.0, fully permissive. The Gemma Prohibited Use Policy still applies but doesn't affect ordinary business use.

Q4. Will it run on a laptop?
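A. Likely, on a high-memory machine. Only 3.8B params are active per token (MoE), which keeps compute low, but all 25.2B weights still have to fit in memory. A single RTX 5090 runs 700+ tok/sec with the 18GB-VRAM quant, and DGX Spark runs ~150 tok/sec. Apple Silicon MLX is day-zero supported, so M3 Max 64GB and similar Macs should work.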
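A rough weight-memory estimate helps when sizing a machine. The calculation below covers weights only, assuming every one of the 25.2B parameters is stored at the same precision; it ignores the KV cache, activations, quantization metadata, and runtime overhead, and it is not stated in the sources whether the ~550M vision encoder is counted in the 25.2B total.

```python
TOTAL_PARAMS = 25.2e9  # total parameters from the model card


def weight_gigabytes(params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


estimates = {
    "16-bit": round(weight_gigabytes(TOTAL_PARAMS, 16), 1),
    "8-bit": round(weight_gigabytes(TOTAL_PARAMS, 8), 1),
    "4-bit": round(weight_gigabytes(TOTAL_PARAMS, 4), 1),
}

for precision, gigabytes in estimates.items():
    print(f"{precision}: ~{gigabytes} GB of weights")
```

The 4-bit estimate of about 12.6 GB sits below the 18GB figure quoted for the RTX 5090 quant, which leaves room for the overheads this estimate leaves out.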

Q5. Why does DLM lag AR on quality?
A. Diffusion LMs are simply still catching up — LLaDA (8B) closed the MMLU gap to ~5pt, but at 26B the reversal hasn't happened yet. The discrete-token diffusion training objective is genuinely harder than next-token cross-entropy. Expect the gap to keep closing over 12-24 months.

Q6. How does it differ from Mercury?
A. Mercury (Inception Labs, 2025-06) is commercial-closed; DiffusionGemma is open-weights and Apache 2.0. Mercury Coder is code-focused; DiffusionGemma is general text + image + 60s video.

Q7. How does it compare to Apple AFM Core Advanced?
A. Apple AFM Core Advanced is AR with sparse MoE / IFP — Apple chose to optimize AR. DiffusionGemma chose diffusion. Two different routes to the same goal of "high speed at constant quality."

Q8. Japanese-language quality?
A. No official Japanese benchmarks yet. Gemma 4 generally covers 140 languages including Japanese. The AR Gemma 4 26B sits around 86 on JCommonsenseQA, so a 4-5pt quality hit would put DiffusionGemma in the low 80s on Japanese tasks, and a hit closer to the ~9pt GPQA gap would put it in the high 70s. That is a projection, not a published number.

Bottom Line

DiffusionGemma is the moment "diffusion language models go from research into shippable product." 2025 was Mercury and closed Gemini Diffusion; 2026 is the year open weights × Apache 2.0 × large-scale MoE all converge.

Google's own "quality trails AR" disclosure is what makes the choice usable: this is a complement, not a replacement. The three takeaways for Japanese enterprises: (1) take the speed win on agent loops, interactive chat, and code completion, (2) leverage the shared backbone with AR Gemma 4 26B for hybrid quality/speed designs, (3) Apache 2.0 + 700 tok/sec on RTX 5090 turns this into a credible on-prem option for data-residency-sensitive workloads.

Through the back half of 2026, DiffusionGemma is going to become an unavoidable reference point in any conversation about the speed frontier of LLMs.

References

Primary:
- Google blog — DiffusionGemma: faster text generation
- ai.google.dev — DiffusionGemma model card
- Hugging Face — google/diffusiongemma-26B-A4B-it
- NVIDIA blog — Local Gemma Diffusion on RTX AI Garage
- NVIDIA — diffusiongemma-26B-A4B-it-NVFP4
- Unsloth — diffusiongemma-26B-A4B-it-GGUF
- Gemma 4 12B encoder-free announcement
- Gemma Prohibited Use Policy

Third-party:
- MarkTechPost — DiffusionGemma (June 10, 2026)
- The Decoder — DiffusionGemma explainer
- The New Stack — Google DiffusionGemma
- Mercury paper (arXiv:2506.17298)
- Inception Labs — Introducing Mercury
- LLaDA on OpenReview
- Gemma (language model) on Wikipedia

Related:
- Gemma 4 12B encoder-free multimodal
- Gemma 4 benchmark showdown
- Gemma 4 system requirements
- Argent × Gemma 4 — on-device AI agent
- Apple AFM Core Advanced
- Liquid AI's Japanese-specialized models
- Hermes Desktop
- Claude Code Agent View
- Forward Deployed Engineer (FDE)

Note: Official GSM8K / HumanEval / Chatbot Arena scores, training-token count, the weight-inheritance relationship to Gemini Diffusion, Japanese-specific benchmarks, and the modality wording discrepancy between Google's blog and the HF / ai.google.dev cards are not officially confirmed as of June 11, 2026. Re-verify before production decisions.
