AI · 2026-05-25

Gemma 4 Performance Benchmark — Compared Against Llama 4, Qwen, Mistral, and DeepSeek on Quality, Speed, and Cost-Efficiency [2026 Open-Weights LLM Showdown]

A 2026 Q2 performance benchmark of Gemma 4 (E2B / E4B / 26B MoE / 31B Dense) against the major open-weights peers — Llama 4, Qwen 3.5, Mistral, and DeepSeek — across MMLU-Pro, GPQA, HumanEval, MATH-500, and MT-Bench. Adds throughput (tokens / s), memory efficiency (quality per GB VRAM), cost per million tokens, Japanese-language performance, native function calling, and Apache 2.0 / MIT / commercial-use licensing as of May 2026, plus a use-case selection matrix for in-house LLM, edge AI, coding assistants, and RAG.

Tags: Gemma 4, Llama 4, Qwen, Mistral, DeepSeek, LLM Benchmark, Local AI, Performance Comparison

TL;DR — Where Gemma 4 Sits in the May 2026 Open-Weights LLM Landscape

As of May 2026, Gemma 4 lands as "top-tier quality at its size class, somewhat trailing on raw throughput, the most permissive license overall." Quick map of which model wins which axis:

Axis	Winner / Recommendation
Quality at fixed size	Gemma 4 31B Dense (general) / DeepSeek V3.5 (reasoning)
Throughput (tok/s)	Mistral Small 3 / Qwen 3.5 Turbo
Memory efficiency (quality / GB)	Gemma 4 E4B, Gemma 4 26B MoE
Self-hosted cost	Gemma 4 E4B, Qwen 3.5 4B
Japanese-language quality	Qwen 3.5 / Gemma 4 (close tie)
Native function calling	Gemma 4 (native) / Llama 4 (template)
License freedom	Gemma 4 (Apache 2.0) / Mistral (Apache 2.0)
Edge / mobile	Gemma 4 E2B / E4B dominate

If you need maximum total score per VRAM, certainty of commercial licensing, and edge-class deployment, Gemma 4 is the May 2026 default. For pure throughput on batch workloads or top-tier math / code reasoning, Mistral or DeepSeek may earn a slot in a hybrid stack.

This column is the sequel to Gemma 4 System Requirements and Gemma 4 + AI Studio Update, focused on quality, speed, and selection criteria with primary sources.

The May 2026 Lineup

Open-weights, production-grade models in scope for the comparison:

Family	Vendor	Key sizes	License	Release
Gemma 4	Google DeepMind	E2B / E4B / 26B MoE / 31B Dense	Apache 2.0	April 2026
Llama 4	Meta	8B / 70B / 405B	Llama 4 Community License	2026 Q1
Qwen 3.5	Alibaba	0.5B–72B	Apache 2.0 (some Qwen License)	2026 Q1–Q2
Mistral	Mistral AI	Small 3 / Medium 3 / Large 3	Apache 2.0 + commercial	2026 Q1–Q2
DeepSeek V3.5	DeepSeek	16B / 671B MoE	Custom (commercial OK)	2026 Q1
Phi-5	Microsoft	3.8B / 14B	MIT	2026 Q1

For Gemma 4 we'll focus on E4B (4B-effective edge model), 26B MoE (≈4B active at inference), and 31B Dense (flagship).

Standard Benchmarks (MMLU-Pro, GPQA, HumanEval, MATH-500)

Compiled from each vendor's announcements, model cards, and standard leaderboards (lmarena.ai, etc.) as of May 2026. For a fair reading, compare models within the same size class.

4B-class and other small models (edge / consumer GPU)

Model	MMLU-Pro	GPQA	HumanEval	MATH-500
Gemma 4 E4B	~low-60s	~low-30s	~low-70s	~mid-50s
Llama 4 8B	~mid-50s	~mid-20s	~mid-60s	~mid-40s
Qwen 3.5 4B	~high-50s	~high-20s	~low-70s	~low-50s
Mistral Small 3	~mid-50s	~mid-20s	~high-60s	~high-40s
Phi-5 14B	~low-60s	~low-30s	~low-70s	~high-50s

Dense 30B-class (business-grade local LLM)

Model	MMLU-Pro	GPQA	HumanEval	MATH-500
Gemma 4 31B Dense	~mid-70s	~high-40s	~low-80s	~mid-70s
Llama 4 70B	~mid-70s	~low-50s	~low-80s	~low-70s
Qwen 3.5 32B	~mid-70s	~high-40s	~low-80s	~low-70s
Mistral Medium 3	~mid-70s	~mid-40s	~high-70s	~70

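Reading: in this class the four vendors land within roughly 1–3 points of each other, which is effectively a wash, even though Llama 4 70B is more than twice the size of the ~30B models. Selection should hinge on throughput, memory footprint, license, and function-calling shape, which we cover next.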

Important caveat: the figures above are aggregated from vendor disclosures as of May 2026. Measurement conditions (few-shot count, quantization, eval harness) vary between sources, so always cross-check on the Open LLM Leaderboard or Chatbot Arena.

Throughput (tokens / s) by Hardware

When quality is similar, throughput is the next deciding factor. Each table below compares models on the same hardware at the same quantization.

RTX 4090 (24GB) / Q4

Model	tok/s	Feel
Gemma 4 E4B	~100–140	Instant
Qwen 3.5 4B	~110–150	Instant
Mistral Small 3	~130–170	Fastest
Llama 4 8B	~80–110	A touch slow
Gemma 4 26B MoE	~50–75	Usable
Gemma 4 31B Dense	~25–40	Business-OK
Qwen 3.5 32B	~30–45	Business-OK
Llama 4 70B (Q4)	~18–28	Slowish

Apple Silicon M3 Max 64GB / MLX / Q4

Model	tok/s
Gemma 4 E4B	~35–50
Gemma 4 26B MoE	~18–28
Gemma 4 31B Dense	~8–14
Qwen 3.5 32B	~9–15
Llama 4 70B	~4–7 (M4 Max recommended)

Reading: Mistral Small 3 has a slight throughput edge in 4B class. Gemma 4 26B MoE is the most interesting trade — 4B-effective compute serving 26B-class knowledge, well-balanced on memory and speed.
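Throughput ranges like the ones above depend heavily on the measurement setup, so it is worth timing your own stack. Below is a minimal sketch, assuming a hypothetical `generate` callable that runs one completion and returns the number of tokens it produced. The function names and signature are illustrative only and do not correspond to any particular runtime's API; `fake_generate` is a stand-in so the sketch runs without a model.

```python
import time
from statistics import median
from typing import Callable


def measure_tok_per_s(generate: Callable[[str], int], prompt: str, runs: int = 3) -> float:
    """Return the median end-to-end throughput (tokens/s) over several runs.

    `generate` is a hypothetical callable: it runs one completion for
    `prompt` and returns how many tokens were generated.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return median(rates)


def fake_generate(prompt: str) -> int:
    """Stand-in backend for illustration: pretends to emit 50 tokens in ~50 ms."""
    time.sleep(0.05)
    return 50


rate = measure_tok_per_s(fake_generate, "Summarize this paragraph.")
print(f"{rate:.0f} tok/s")
```

Note that this times the whole call, so prompt processing is included; for long prompts the result will read lower than a pure decode-speed figure.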

Memory Efficiency — Quality per GB of VRAM

This section asks which model gives the most quality for a fixed VRAM budget. Score/GB is MMLU-Pro divided by VRAM at Q4:

Model	MMLU-Pro	VRAM (Q4)	Score/GB
Gemma 4 E4B	~62	~3GB	~20.7
Gemma 4 26B MoE	~73	~10GB	~7.3
Gemma 4 31B Dense	~78	~24GB	~3.3
Qwen 3.5 4B	~58	~3GB	~19.3
Llama 4 8B	~55	~6GB	~9.2
Llama 4 70B	~76	~40GB	~1.9
Mistral Small 3	~57	~3GB	~19.0

Reading: Gemma 4 E4B delivers ~62 MMLU-Pro in about 3GB of VRAM at Q4, the best score-per-GB among the open-weights LLMs compared here as of May 2026. For edge / mobile / low-spec deployment, it's the obvious starting point.
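The Score/GB column can be reproduced with a few lines of Python. This is a sketch that uses only the approximate MMLU-Pro and Q4 VRAM values from the table above; the dictionary and function names are made up for illustration.

```python
# (MMLU-Pro, Q4 VRAM in GB), copied from the table above; all values are approximate.
MODELS = {
    "Gemma 4 E4B": (62, 3),
    "Gemma 4 26B MoE": (73, 10),
    "Gemma 4 31B Dense": (78, 24),
    "Qwen 3.5 4B": (58, 3),
    "Llama 4 8B": (55, 6),
    "Llama 4 70B": (76, 40),
    "Mistral Small 3": (57, 3),
}


def score_per_gb(mmlu_pro: float, vram_gb: float) -> float:
    """Benchmark points per GB of VRAM."""
    return mmlu_pro / vram_gb


# Rank models from most to least efficient.
ranked = sorted(MODELS.items(), key=lambda kv: score_per_gb(*kv[1]), reverse=True)
for name, (score, vram) in ranked:
    print(f"{name:<20} {score_per_gb(score, vram):5.2f}")
```

Because the inputs are rounded, the last digit can differ slightly from the table (78 / 24 is 3.25, which the table shows as ~3.3). The ranking is what matters, and it is sensitive to the VRAM estimate: a 1GB change on a 3GB model moves the score far more than on a 24GB one.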

Self-Hosted Cost per Million Tokens

Self-hosted cost (GPU amortization + electricity), assuming an RTX 4090 costs the equivalent of ~¥5,000/month:

Model	tok/s	Time per 1M tokens	Estimated cost
Gemma 4 E4B	120	~2.3 hours	~¥5–15
Gemma 4 26B MoE	60	~4.6 hours	~¥15–30
Gemma 4 31B Dense	32	~8.7 hours	~¥30–60
Llama 4 70B (Q4)	22	~12.6 hours	~¥50–90

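For reference: the OpenAI GPT-4o API runs ~$2.50 per 1M input tokens and ~$10.00 per 1M output tokens, or roughly ¥375–¥1,500 per 1M at about ¥150 to the dollar. Self-hosting Gemma 4 E4B at ~¥5–15 per 1M can come in at 1/100 or less of the output-token price (the quality gap is a separate question).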
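The arithmetic behind the time and cost columns can be sketched as follows. The assumptions are ours and deliberately simple: a 30-day month, the GPU busy 100% of the time, a single request stream at the listed tok/s, and GPU cost only (no electricity). The function names are illustrative.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPU fully utilized


def hours_per_million_tokens(tok_per_s: float) -> float:
    """Wall-clock hours needed to generate 1M tokens at a steady rate."""
    return 1_000_000 / tok_per_s / SECONDS_PER_HOUR


def gpu_cost_per_million_yen(tok_per_s: float, monthly_yen: float = 5_000) -> float:
    """GPU-time cost in yen for 1M tokens; electricity and idle time are excluded."""
    hourly_yen = monthly_yen / HOURS_PER_MONTH
    return hours_per_million_tokens(tok_per_s) * hourly_yen


# tok/s values taken from the cost table above.
for name, tps in [("Gemma 4 E4B", 120), ("Gemma 4 26B MoE", 60),
                  ("Gemma 4 31B Dense", 32), ("Llama 4 70B (Q4)", 22)]:
    hours = hours_per_million_tokens(tps)
    cost = gpu_cost_per_million_yen(tps)
    print(f"{name:<20} {hours:5.1f} h  ~{cost:.0f} JPY")
```

Under these assumptions the hours match the table, and the cost lands near the upper end of each estimated range. Serving several requests concurrently would be expected to lower the per-token figure, while idle time raises it.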

Japanese-Language Performance

Evaluated on JGLUE / JCommonsenseQA and other public Japanese benchmarks:

Model	JCommonsenseQA	JGLUE avg	Note
Qwen 3.5 32B	~88	~80	Japanese-tuned variants
Gemma 4 31B Dense	~86	~78	Multilingual balanced
Llama 4 70B	~82	~74	English-first
Mistral Medium 3	~78	~70	European-language-leaning
Gemma 4 E4B	~75	~65	Strong for its size class

Reading: For Japanese, Qwen 3.5 and Gemma 4 are the two front-runners. Mistral leans European, Llama 4 leans English. Combining Japanese performance + Apache 2.0 + multimodal, Gemma 4 is the current best all-rounder for Japanese enterprise use.

Function Calling and Agent Fit

Critical for agent workflows: native tool-call shape and multi-step reasoning.

Model	Function calling	Multi-step	Multimodal
Gemma 4	Native	✓	Text + image + audio
Llama 4	Prompt template	✓	Text + image
Qwen 3.5	Native	✓	Text + image
Mistral	Native	✓	Text (some image)
DeepSeek V3.5	Native	★ reasoning-strong	Text
Phi-5	Prompt template	△	Text + image

Gemma 4, Qwen 3.5, Mistral, and DeepSeek all ship native function calling — the easy quartet for agent integration. Llama 4 routes function calls through prompt templates, which complicates direct integration with MCP-based agents like Claude Code Agent View or Cursor Automations.
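Whether the call arrives natively or through a prompt template, the application side ends up parsing a structured call and dispatching it to real code. Here is a minimal sketch, assuming the model emits a JSON object of the form `{"name": ..., "arguments": {...}}`. That shape is an assumption for illustration and not any vendor's documented wire format, and `get_exchange_rate` is a hypothetical tool returning a placeholder value.

```python
import json
from typing import Any, Callable


def get_exchange_rate(base: str, quote: str) -> dict[str, Any]:
    """Hypothetical tool; a real one would call an external service."""
    return {"base": base, "quote": quote, "rate": 150.0}  # placeholder value


# Registry of tools the model is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {"get_exchange_rate": get_exchange_rate}


def dispatch_tool_call(raw: str) -> str:
    """Parse one model-emitted tool call and return a JSON result string.

    Assumed call shape (illustrative only): {"name": str, "arguments": {...}}
    """
    try:
        call = json.loads(raw)
        func = TOOLS[call["name"]]
        result = func(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Hand the error back to the model instead of crashing the agent loop.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return json.dumps(result)


model_output = '{"name": "get_exchange_rate", "arguments": {"base": "USD", "quote": "JPY"}}'
print(dispatch_tool_call(model_output))
```

With a native function-calling model the runtime typically hands you the parsed call directly; with a template-based model you first have to extract this JSON from free text, which is where most of the integration effort goes.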

Licensing — Practical Differences for Commercial Use

Open-weights labels mask large practical differences.

Model	License	Commercial use	Restrictions
Gemma 4	Apache 2.0	Fully permitted	None beyond the Gemma Prohibited Use Policy
Mistral Small / Medium	Apache 2.0	Fully permitted	None
Mistral Large 3	Mistral Research License	Limited	Separate commercial agreement
Phi-5	MIT	Fully permitted	None
Qwen 3.5 (some)	Apache 2.0	Fully permitted	None
Qwen 3.5 (72B etc.)	Qwen License	Limited	Special terms above 100M MAU
Llama 4	Llama 4 Community License	Conditional	Above 700M MAU = separate contract, restrictions vs Meta-competing products
DeepSeek V3.5	DeepSeek License	Conditional	Check commercial clauses

Apache 2.0 / MIT are the cleanest for commercial use. Gemma 4 / Mistral Small-Medium / Phi-5 sit in the safe zone. Llama 4's MAU cap and Meta-competing clause add legal review overhead for global SaaS and financial products.
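As a first-pass screen before a proper legal review, the table can be encoded as data. The sketch below transcribes only what the table states (license names and MAU thresholds); the structure and function name are hypothetical, and it is no substitute for reading the license text.

```python
from typing import NamedTuple, Optional


class Terms(NamedTuple):
    license: str
    mau_cap: Optional[int] = None  # MAU above which separate terms apply, per the table


# Transcribed from the licensing table above.
TERMS = {
    "Gemma 4": Terms("Apache 2.0"),
    "Mistral Small / Medium": Terms("Apache 2.0"),
    "Mistral Large 3": Terms("Mistral Research License"),
    "Phi-5": Terms("MIT"),
    "Qwen 3.5 (72B etc.)": Terms("Qwen License", 100_000_000),
    "Llama 4": Terms("Llama 4 Community License", 700_000_000),
    "DeepSeek V3.5": Terms("DeepSeek License"),
}

PERMISSIVE = {"Apache 2.0", "MIT"}


def review_notes(model: str, expected_mau: int) -> list[str]:
    """Return reasons to involve legal review; an empty list means none flagged here."""
    terms = TERMS[model]
    notes = []
    if terms.license not in PERMISSIVE:
        notes.append(f"{terms.license}: read the commercial-use clauses")
    if terms.mau_cap is not None and expected_mau > terms.mau_cap:
        notes.append(f"expected MAU exceeds {terms.mau_cap:,}")
    return notes


print(review_notes("Gemma 4", 1_000_000))
print(review_notes("Llama 4", 800_000_000))
```

An empty list only means nothing in this table was triggered; usage policies and clauses not captured here still apply.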

Selection Matrix by Use Case

Six common deployment shapes, May 2026:

Use case	First pick	Second pick	Why
Edge / mobile AI	Gemma 4 E4B	Qwen 3.5 4B	3GB VRAM, ~60+ MMLU-Pro, Apache 2.0
In-house LLM (general)	Gemma 4 31B Dense	Qwen 3.5 32B	Japanese + multimodal + license
Coding assistant	Qwen 3.5 32B (Coder variant)	Gemma 4 31B Dense	HumanEval lead
RAG / knowledge search	Gemma 4 26B MoE	Mistral Medium 3	Memory × throughput balance
Large-batch inference	Mistral Small 3	Gemma 4 E4B	Highest tok/s
Math / scientific reasoning	DeepSeek V3.5	Gemma 4 31B Dense	GPQA / MATH-500 lead

Where Gemma 4 Doesn't Win

- Throughput is not best-in-class — Mistral / Qwen are 10–20% faster at the same size; matters for batch
- Coding-specialized leaderboards — Qwen Coder series still tops HumanEval / MBPP
- Long-context (128K+) degradation — solid through ~32K, weaker than peers beyond that (third-party reports)
- All numbers are May 2026 snapshots — refresh every 2–3 months

Oflight's Recommended Stack (May 2026)

What we default to inside our AI consulting practice for Japanese-enterprise in-house LLM and edge AI engagements:

1. PoC / dev: Gemma 4 E4B + Ollama for local validation → Gemma 4 31B Dense + vLLM for production-scale evaluation
2. Production / in-house LLM: Gemma 4 31B Dense (Japanese business), or Qwen 3.5 32B when pure Japanese quality is essential
3. Edge / mobile: Gemma 4 E4B (including Argent × Gemma 4 iOS automation patterns)
4. Agent platform: Gemma 4 + MCP, compatible with Claude Code Agent View and Cursor Automations
5. Aggressive cost compression: self-hosted designs that reach <1/100 of API pricing through dedicated-GPU amortization

FAQ

Q1. Gemma 4 vs Llama 4 — which?
A. Gemma 4 wins on license freedom, Japanese quality, and multimodal. Llama 4 carries an MAU cap and a Meta-competing clause, so global SaaS gets extra legal review. Pick Llama 4 only for English-leaning research use cases.

Q2. Japanese business — Qwen 3.5 or Gemma 4?
A. Qwen 3.5 edges out on JCommonsenseQA / JGLUE, but Gemma 4's multimodal + Apache 2.0 + Vertex AI / AI Studio ecosystem tilt the overall call toward Gemma 4.

Q3. What runs on an RTX 3060 (12GB)?
A. Comfortable: Gemma 4 E4B Q4 (3GB), Qwen 3.5 4B, Phi-5 14B Q4, Mistral Small 3 Q4. Tight: 26B MoE Q4 (10GB). 31B Dense is out. See Gemma 4 System Requirements.
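The same check generalizes to any card. Here is a small sketch using the approximate Q4 VRAM figures from the memory-efficiency table in this article; the 2GB headroom for KV cache and runtime overhead is our assumption, not a measured value.

```python
# Approximate Q4 VRAM in GB, copied from the memory-efficiency table in this article.
Q4_VRAM_GB = {
    "Gemma 4 E4B": 3,
    "Gemma 4 26B MoE": 10,
    "Gemma 4 31B Dense": 24,
    "Qwen 3.5 4B": 3,
    "Llama 4 8B": 6,
    "Llama 4 70B": 40,
    "Mistral Small 3": 3,
}


def fit(vram_gb: float, budget_gb: float, headroom_gb: float = 2.0) -> str:
    """Classify a model against a VRAM budget.

    headroom_gb is an assumed allowance for KV cache and runtime overhead.
    """
    if vram_gb > budget_gb:
        return "out"
    if vram_gb + headroom_gb >= budget_gb:
        return "tight"
    return "comfortable"


budget = 12  # RTX 3060 (12GB)
for name, vram in Q4_VRAM_GB.items():
    print(f"{name:<20} {fit(vram, budget)}")
```

Long contexts grow the KV cache, so raise `headroom_gb` if you plan to run large prompts.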

Q4. Why don't benchmark scores match real-world quality?
A. MMLU-Pro / HumanEval test general knowledge and code generation; real workflows (internal Q&A, email drafting, summarization) hinge more on fine-tuning, RAG quality, and prompt engineering. Treat benchmarks as a floor check; A/B test on your own data.
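One simple way to run that A/B test is a pairwise comparison: for each prompt from your own data, a reviewer marks which model's answer was better, and an exact sign test tells you whether the split could plausibly be chance. The judgments below are made up for illustration.

```python
from collections import Counter
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test.

    Probability of a split at least this lopsided if both models were
    equally likely to win each non-tied comparison.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical judgments: "A", "B", or "tie" for each prompt.
judgments = ["A"] * 15 + ["B"] * 5 + ["tie"] * 4

counts = Counter(judgments)
decided = counts["A"] + counts["B"]  # ties carry no information for the sign test
win_rate_a = counts["A"] / decided
p_value = sign_test_p_value(counts["A"], counts["B"])
print(f"A wins {win_rate_a:.0%} of decided comparisons, p = {p_value:.3f}")
```

With these made-up numbers, 15 wins to 5 gives a p-value of about 0.04. A few dozen prompts is usually enough to tell a clear winner from a coin flip, but not to resolve small differences.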

Q5. What's coming in late 2026?
A. Google has Gemini 3.5 Pro slated for June 2026; Meta is preparing reasoning-focused Llama 4 variants; Alibaba has hinted at Qwen 4. Plan for a six-month re-evaluation cycle, with an architecture that swaps models cleanly.
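Swapping models cleanly mostly comes down to having application code depend on a narrow interface rather than on a specific model client. Here is a minimal sketch using `typing.Protocol`; the interface, class names, and model names are all hypothetical, and `EchoBackend` stands in for a real client.

```python
from typing import Protocol


class ChatBackend(Protocol):
    """Minimal interface the application codes against (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in for a real model client; it only tags the prompt with a model name."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        return f"[{self.model_name}] {prompt}"


def summarize(backend: ChatBackend, text: str) -> str:
    # Application code depends on the interface only, never on a concrete model.
    return backend.complete(f"Summarize: {text}")


current = EchoBackend("gemma-4-31b")   # today's choice (illustrative name)
candidate = EchoBackend("next-model")  # swapped in at the next re-evaluation
print(summarize(current, "quarterly report"))
print(summarize(candidate, "quarterly report"))
```

With this shape, a re-evaluation cycle means writing one new backend class and rerunning your own test set, with no changes to the calling code.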

Q6. Anything to watch out for with Gemma 4 commercial use?
A. Apache 2.0 is extremely permissive, but Gemma's Prohibited Use Policy still bans weapons, CSAM, and similar use. Normal business use has no practical issues.

Bottom Line

For the May 2026 combination of best total quality per VRAM × Apache 2.0 × native function calling × multimodal × strong Japanese, Gemma 4 is the current default open-weights choice.

But it doesn't win every axis. The realistic answer is hybrid: Gemma 4 as the default, with Mistral (speed), Qwen (Japanese-pure), DeepSeek (math), and Llama 4 (English research) added as needed. We typically benchmark with the customer's own data inside an AI consulting engagement before locking the production stack.

References

Primary:
- Gemma 4 model card (Hugging Face)
- Android Developers Blog — Gemma 4 announcement
- Google Developers Blog — Gemma 4 agentic skills
- Meta — Llama 4
- Qwen docs
- Mistral
- DeepSeek
- Gemma Prohibited Use Policy

Benchmarks:
- Open LLM Leaderboard (Hugging Face)
- Chatbot Arena (lmarena.ai)
- JGLUE / JCommonsenseQA

Related columns:
- Gemma 4 System Requirements
- Gemma 4 + Google AI Studio update
- Gemini 3.5 Flash + Omni
- Argent × Gemma 4 — on-device AI agent
- Qwen 3.5 9B Complete Guide

Note: All benchmark figures are May-2026 snapshots aggregated from vendor disclosures and third-party leaderboards. Measurement conditions (few-shot, quantization, eval harness) vary, so confirm with your own A/B tests. Vendor releases land roughly quarterly — re-validate every 2–3 months.
