Oflight (株式会社オブライト)
AI | 2026-05-25

Gemma 4 Performance Benchmark — Compared Against Llama 4, Qwen, Mistral, and DeepSeek on Quality, Speed, and Cost-Efficiency [2026 Open-Weights LLM Showdown]

A 2026 Q2 performance benchmark of Gemma 4 (E2B / E4B / 26B MoE / 31B Dense) against the major open-weights peers — Llama 4, Qwen 3.5, Mistral, and DeepSeek — across MMLU-Pro, GPQA, HumanEval, MATH-500, and MT-Bench. Adds throughput (tokens / s), memory efficiency (quality per GB VRAM), cost per million tokens, Japanese-language performance, native function calling, and Apache 2.0 / MIT / commercial-use licensing as of May 2026, plus a use-case selection matrix for in-house LLM, edge AI, coding assistants, and RAG.


TL;DR — Where Gemma 4 Sits in the May 2026 Open-Weights LLM Landscape

As of May 2026, Gemma 4 lands as "top-tier quality at its size class, somewhat trailing on raw throughput, the most permissive license overall." Quick map of which model wins which axis:

| Axis | Winner / Recommendation |
| --- | --- |
| Quality at fixed size | Gemma 4 31B Dense (general) / DeepSeek V3.5 (reasoning) |
| Throughput (tok/s) | Mistral Small 3 / Qwen 3.5 Turbo |
| Memory efficiency (quality / GB) | Gemma 4 E4B, Gemma 4 26B MoE |
| Self-hosted cost | Gemma 4 E4B, Qwen 3.5 4B |
| Japanese-language quality | Qwen 3.5 / Gemma 4 (close tie) |
| Native function calling | Gemma 4 (native) / Llama 4 (template) |
| License freedom | Gemma 4 (Apache 2.0) / Mistral (Apache 2.0) |
| Edge / mobile | Gemma 4 E2B / E4B dominate |

If you need maximum total score per VRAM, certainty of commercial licensing, and edge-class deployment, Gemma 4 is the May 2026 default. For pure throughput on batch workloads or top-tier math / code reasoning, Mistral or DeepSeek may earn a slot in a hybrid stack.

This column is the sequel to Gemma 4 System Requirements and Gemma 4 + AI Studio Update, focused on quality, speed, and selection criteria with primary sources.

The May 2026 Lineup

Open-weights, production-grade models in scope for the comparison:

| Family | Vendor | Key sizes | License | Release |
| --- | --- | --- | --- | --- |
| Gemma 4 | Google DeepMind | E2B / E4B / 26B MoE / 31B Dense | Apache 2.0 | April 2026 |
| Llama 4 | Meta | 8B / 70B / 405B | Llama 4 Community License | 2026 Q1 |
| Qwen 3.5 | Alibaba | 0.5B–72B | Apache 2.0 (some Qwen License) | 2026 Q1–Q2 |
| Mistral | Mistral AI | Small 3 / Medium 3 / Large 3 | Apache 2.0 + commercial | 2026 Q1–Q2 |
| DeepSeek V3.5 | DeepSeek | 16B / 671B MoE | Custom (commercial OK) | 2026 Q1 |
| Phi-5 | Microsoft | 3.8B / 14B | MIT | 2026 Q1 |

For Gemma 4 we'll focus on E4B (4B-effective edge model), 26B MoE (≈4B active at inference), and 31B Dense (flagship).

Standard Benchmarks (MMLU-Pro, GPQA, HumanEval, MATH-500)

Compiled from each vendor's announcements, model cards, and standard leaderboards (lmarena.ai, etc.) as of May 2026. Compare within size classes to be fair.

4B-class (edge / consumer GPU)

| Model | MMLU-Pro | GPQA | HumanEval | MATH-500 |
| --- | --- | --- | --- | --- |
| Gemma 4 E4B | ~low-60s | ~low-30s | ~low-70s | ~mid-50s |
| Llama 4 8B | ~mid-50s | ~mid-20s | ~mid-60s | ~mid-40s |
| Qwen 3.5 4B | ~high-50s | ~high-20s | ~low-70s | ~low-50s |
| Mistral Small 3 | ~mid-50s | ~mid-20s | ~high-60s | ~high-40s |
| Phi-5 14B | ~low-60s | ~low-30s | ~low-70s | ~high-50s |

Dense 30B-class (business-grade local LLM)

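| Model | MMLU-Pro | GPQA | HumanEval | MATH-500 |
| --- | --- | --- | --- | --- |
| Gemma 4 31B Dense | ~mid-70s | ~high-40s | ~low-80s | ~mid-70s |
| Llama 4 70B | ~mid-70s | ~low-50s | ~low-80s | ~low-70s |
| Qwen 3.5 32B | ~mid-70s | ~high-40s | ~low-80s | ~low-70s |
| Mistral Medium 3 | ~mid-70s | ~mid-40s | ~high-70s | ~70 |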

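Reading: in the 30B class all four models land in the mid-70s on MMLU-Pro and within a few points of each other on HumanEval, with somewhat wider gaps on GPQA and MATH-500; overall it is effectively a wash. Selection should hinge on throughput, memory footprint, license, and function-calling shape, which we cover next.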

Important caveat: the figures above are aggregated from vendor disclosures as of May 2026. Measurement conditions (few-shot setup, quantization, eval harness) vary between sources, so always cross-check on the Open LLM Leaderboard or Chatbot Arena.

Throughput (tokens / s) by Hardware

When quality is similar, throughput is the next deciding factor. Same VRAM, same quantization.

RTX 4090 (24GB) / Q4

| Model | tok/s | Feel |
| --- | --- | --- |
| Gemma 4 E4B | ~100–140 | Instant |
| Qwen 3.5 4B | ~110–150 | Instant |
| Mistral Small 3 | ~130–170 | Fastest |
| Llama 4 8B | ~80–110 | A touch slow |
| Gemma 4 26B MoE | ~50–75 | Usable |
| Gemma 4 31B Dense | ~25–40 | Business-OK |
| Qwen 3.5 32B | ~30–45 | Business-OK |
| Llama 4 70B (Q4) | ~18–28 | Slowish |

Apple Silicon M3 Max 64GB / MLX / Q4

| Model | tok/s |
| --- | --- |
| Gemma 4 E4B | ~35–50 |
| Gemma 4 26B MoE | ~18–28 |
| Gemma 4 31B Dense | ~8–14 |
| Qwen 3.5 32B | ~9–15 |
| Llama 4 70B | ~4–7 (M4 Max recommended) |

Reading: Mistral Small 3 has a slight throughput edge in 4B class. Gemma 4 26B MoE is the most interesting trade — 4B-effective compute serving 26B-class knowledge, well-balanced on memory and speed.
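To relate the tok/s ranges to perceived wait time, the sketch below converts a throughput range into the seconds needed to generate a reply. The 500-token reply length is an assumption chosen for illustration, and the calculation ignores prompt processing time, so treat the results as rough lower bounds on latency.

```python
# RTX 4090 / Q4 throughput ranges (tok/s) copied from the table above.
RTX_4090_TOK_S = {
    "Gemma 4 E4B": (100, 140),
    "Mistral Small 3": (130, 170),
    "Gemma 4 26B MoE": (50, 75),
    "Gemma 4 31B Dense": (25, 40),
    "Llama 4 70B (Q4)": (18, 28),
}


def generation_seconds(tok_s_range, reply_tokens=500):
    """Return (fastest, slowest) seconds to generate reply_tokens.

    reply_tokens=500 is an illustrative assumption, not a figure from the article.
    """
    low, high = tok_s_range
    return reply_tokens / high, reply_tokens / low


if __name__ == "__main__":
    for model, tok_s in RTX_4090_TOK_S.items():
        fastest, slowest = generation_seconds(tok_s)
        print(f"{model:20s} {fastest:5.1f}s to {slowest:5.1f}s")
```

Under this assumption the 31B Dense model needs between 12.5 and 20 seconds per reply, which is why the table labels it business-usable rather than instant.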

Memory Efficiency — Quality per GB of VRAM

Which model gives the best quality for a given VRAM budget? The table divides each model's MMLU-Pro score by its VRAM footprint at Q4:

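| Model | MMLU-Pro | VRAM (Q4) | Score/GB |
| --- | --- | --- | --- |
| Gemma 4 E4B | ~62 | ~3GB | ~20.7 |
| Gemma 4 26B MoE | ~73 | ~10GB | ~7.3 |
| Gemma 4 31B Dense | ~78 | ~24GB | ~3.3 |
| Qwen 3.5 4B | ~58 | ~3GB | ~19.3 |
| Llama 4 8B | ~55 | ~6GB | ~9.2 |
| Llama 4 70B | ~76 | ~40GB | ~1.9 |
| Mistral Small 3 | ~57 | ~3GB | ~19.0 |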

Reading: Gemma 4 E4B delivers an MMLU-Pro score in the low 60s in about 3GB of VRAM, the best quality-per-GB ratio among the models compared here. For edge / mobile / low-spec deployment, it's the obvious starting point.
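The Score/GB column is a plain ratio, so it is easy to recompute when new figures arrive. The sketch below reproduces it from the approximate MMLU-Pro and VRAM values in the table; the function and variable names are illustrative only.

```python
# Approximate (MMLU-Pro, Q4 VRAM in GB) pairs copied from the table above.
MODELS = {
    "Gemma 4 E4B": (62, 3),
    "Gemma 4 26B MoE": (73, 10),
    "Gemma 4 31B Dense": (78, 24),
    "Qwen 3.5 4B": (58, 3),
    "Llama 4 8B": (55, 6),
    "Llama 4 70B": (76, 40),
    "Mistral Small 3": (57, 3),
}


def score_per_gb(mmlu_pro, vram_gb):
    """Quality-per-memory ratio: benchmark score divided by VRAM in GB."""
    return mmlu_pro / vram_gb


def ranked(models):
    """Return (name, ratio) pairs sorted from most to least memory-efficient."""
    rows = [(name, score_per_gb(score, vram)) for name, (score, vram) in models.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in ranked(MODELS):
        print(f"{name:20s} {ratio:6.2f}")
```

Because benchmark scores grow far more slowly than parameter count, this ratio favors small models by construction; it is a budgeting aid, not a quality ranking.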

Self-Hosted Cost per Million Tokens

Self-hosted cost (GPU amortization + electricity), assuming an RTX 4090 rentable at ~¥5,000/month equivalent:

| Model | tok/s | Time per 1M tokens | Estimated cost |
| --- | --- | --- | --- |
| Gemma 4 E4B | 120 | ~2.3 hours | ~¥5–15 |
| Gemma 4 26B MoE | 60 | ~4.6 hours | ~¥15–30 |
| Gemma 4 31B Dense | 32 | ~8.7 hours | ~¥30–60 |
| Llama 4 70B (Q4) | 22 | ~12.6 hours | ~¥50–90 |

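For reference: the OpenAI GPT-4o API runs ~$2.50 per 1M input tokens and ~$10.00 per 1M output tokens, or roughly ¥375–¥1,500 per 1M tokens at about ¥150 to the dollar. Self-hosting Gemma 4 E4B at ~¥5–15 per 1M tokens can therefore come in at or below roughly 1/100 of the API's output-token price (the quality gap is a separate question).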
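The time and cost columns follow from two divisions, sketched below. The ¥5,000/month figure comes from the text; the 30-day month and the utilization parameter are assumptions added here, since the article does not spell out how the cost ranges were amortized.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def hours_per_million_tokens(tok_per_s):
    """Wall-clock hours needed to generate one million tokens."""
    return 1_000_000 / tok_per_s / SECONDS_PER_HOUR


def cost_per_million_tokens(tok_per_s, monthly_cost_yen=5000.0, utilization=1.0):
    """Yen per million tokens for a GPU billed monthly.

    utilization is the fraction of the month the GPU spends generating
    (assumption; 1.0 means busy around the clock).
    """
    hourly_yen = monthly_cost_yen / (HOURS_PER_MONTH * utilization)
    return hours_per_million_tokens(tok_per_s) * hourly_yen


if __name__ == "__main__":
    for model, tok_s in [("Gemma 4 E4B", 120), ("Gemma 4 26B MoE", 60),
                         ("Gemma 4 31B Dense", 32), ("Llama 4 70B (Q4)", 22)]:
        hours = hours_per_million_tokens(tok_s)
        yen = cost_per_million_tokens(tok_s)
        print(f"{model:20s} {hours:5.1f} h  ~{yen:.0f} yen")
```

With these parameters the 120 tok/s row comes out at about ¥16, close to the top of the table's ~¥5–15 range. Lower utilization raises the per-token cost proportionally, so the parameters are best treated as knobs to fit your own billing.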

Japanese-Language Performance

Evaluated on JGLUE / JCommonsenseQA and other public Japanese benchmarks:

| Model | JCommonsenseQA | JGLUE avg | Note |
| --- | --- | --- | --- |
| Qwen 3.5 32B | ~88 | ~80 | Japanese-tuned variants |
| Gemma 4 31B Dense | ~86 | ~78 | Multilingual balanced |
| Llama 4 70B | ~82 | ~74 | English-first |
| Mistral Medium 3 | ~78 | ~70 | European-language-leaning |
| Gemma 4 E4B | ~75 | ~65 | Strong for its size class |

Reading: For Japanese, Qwen 3.5 and Gemma 4 are the two front-runners. Mistral leans European, Llama 4 leans English. Combining Japanese performance + Apache 2.0 + multimodal, Gemma 4 is the current best all-rounder for Japanese enterprise use.

Function Calling and Agent Fit

Critical for agent workflows: native tool-call shape and multi-step reasoning.

| Model | Function calling | Multi-step | Multimodal |
| --- | --- | --- | --- |
| Gemma 4 | Native | | Text + image + audio |
| Llama 4 | Prompt template | | Text + image |
| Qwen 3.5 | Native | | Text + image |
| Mistral | Native | | Text (some image) |
| DeepSeek V3.5 | Native | ★ reasoning-strong | Text |
| Phi-5 | Prompt template | | Text + image |

Gemma 4, Qwen 3.5, Mistral, and DeepSeek all ship native function calling — the easy quartet for agent integration. Llama 4 routes function calls through prompt templates, which complicates direct integration with MCP-based agents like Claude Code Agent View or Cursor Automations.
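To make the native versus prompt-template distinction concrete, here is a minimal sketch of the application side of a tool call. It assumes the model's tool call arrives as a JSON object with `name` and `arguments` keys; that shape, the `get_weather` tool, and the `TOOLS` registry are hypothetical and not taken from any vendor's documentation. With native function calling the runtime typically hands over such a structured object directly, while a template-based model returns free text from which the object must first be extracted, which is the fragile step.

```python
import json


def get_weather(city):
    """Hypothetical tool; a real one would call an external service."""
    return f"Sunny in {city}"


# Registry of callable tools, keyed by the name the model is told about.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json):
    """Parse a structured tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


def extract_from_text(reply):
    """Template-style fallback: slice from the first '{' to the last '}'."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return reply[start:end + 1]


if __name__ == "__main__":
    native = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
    print(dispatch(native))
    templated = 'Sure, calling: {"name": "get_weather", "arguments": {"city": "Osaka"}}'
    print(dispatch(extract_from_text(templated)))
```

The extraction helper breaks as soon as the reply contains a second JSON-like span or malformed braces, which illustrates why template-routed calls need extra validation in agent pipelines.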

Licensing — Practical Differences for Commercial Use

Open-weights labels mask large practical differences.

| Model | License | Commercial use | Restrictions |
| --- | --- | --- | --- |
| Gemma 4 | Apache 2.0 | Fully permitted | None beyond the Gemma Prohibited Use Policy |
| Mistral Small / Medium | Apache 2.0 | Fully permitted | None |
| Mistral Large 3 | Mistral Research License | Limited | Separate commercial agreement |
| Phi-5 | MIT | Fully permitted | None |
| Qwen 3.5 (some) | Apache 2.0 | Fully permitted | None |
| Qwen 3.5 (72B etc.) | Qwen License | Limited | Special terms above 100M MAU |
| Llama 4 | Llama 4 Community License | Conditional | Above 700M MAU = separate contract, restrictions vs Meta-competing products |
| DeepSeek V3.5 | DeepSeek License | Conditional | Check commercial clauses |

Apache 2.0 / MIT are the cleanest for commercial use. Gemma 4 / Mistral Small-Medium / Phi-5 sit in the safe zone. Llama 4's MAU cap and Meta-competing clause add legal review overhead for global SaaS and financial products.

Selection Matrix by Use Case

Six common deployment shapes, May 2026:

| Use case | First pick | Second pick | Why |
| --- | --- | --- | --- |
| Edge / mobile AI | Gemma 4 E4B | Qwen 3.5 4B | 3GB VRAM, ~60+ MMLU-Pro, Apache 2.0 |
| In-house LLM (general) | Gemma 4 31B Dense | Qwen 3.5 32B | Japanese + multimodal + license |
| Coding assistant | Qwen 3.5 32B (Coder variant) | Gemma 4 31B Dense | HumanEval lead |
| RAG / knowledge search | Gemma 4 26B MoE | Mistral Medium 3 | Memory × throughput balance |
| Large-batch inference | Mistral Small 3 | Gemma 4 E4B | Highest tok/s |
| Math / scientific reasoning | DeepSeek V3.5 | Gemma 4 31B Dense | GPQA / MATH-500 lead |
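If the matrix feeds a deployment script or an internal tool, it can be held as a small lookup. The sketch below encodes the first and second picks from the table; the dictionary keys and the `recommend` helper are hypothetical names, and the exclusion set stands in for models ruled out by, for example, a pending license review.

```python
# Use case -> (first pick, second pick), copied from the selection matrix above.
SELECTION_MATRIX = {
    "edge_mobile": ("Gemma 4 E4B", "Qwen 3.5 4B"),
    "in_house_general": ("Gemma 4 31B Dense", "Qwen 3.5 32B"),
    "coding_assistant": ("Qwen 3.5 32B (Coder variant)", "Gemma 4 31B Dense"),
    "rag_knowledge_search": ("Gemma 4 26B MoE", "Mistral Medium 3"),
    "large_batch_inference": ("Mistral Small 3", "Gemma 4 E4B"),
    "math_scientific_reasoning": ("DeepSeek V3.5", "Gemma 4 31B Dense"),
}


def recommend(use_case, excluded=frozenset()):
    """Return the first pick for a use case, or the second if the first is excluded."""
    for candidate in SELECTION_MATRIX[use_case]:
        if candidate not in excluded:
            return candidate
    raise LookupError(f"no remaining candidate for {use_case!r}")


if __name__ == "__main__":
    print(recommend("edge_mobile"))
    print(recommend("math_scientific_reasoning", excluded={"DeepSeek V3.5"}))
```

Keeping the matrix as data rather than branching logic makes the periodic re-evaluation a one-line change per row.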

Where Gemma 4 Doesn't Win

- Throughput is not best-in-class: Mistral / Qwen are 10–20% faster at the same size, which matters for batch workloads
- Coding-specialized leaderboards: the Qwen Coder series still tops HumanEval / MBPP
- Long-context (128K+) degradation: solid through ~32K, weaker than peers beyond that (third-party reports)
- All numbers are May 2026 snapshots: refresh every 2–3 months

Oflight's Recommended Stack (May 2026)

What we default to inside our AI consulting practice for Japanese-enterprise in-house LLM and edge AI engagements:

1. PoC / dev: Gemma 4 E4B + Ollama for local validation → Gemma 4 31B Dense + vLLM for production-scale evaluation
2. Production / in-house LLM: Gemma 4 31B Dense (Japanese business), or Qwen 3.5 32B when pure Japanese quality is essential
3. Edge / mobile: Gemma 4 E4B (including Argent × Gemma 4 iOS automation patterns)
4. Agent platform: Gemma 4 + MCP, compatible with Claude Code Agent View and Cursor Automations
5. Aggressive cost compression: self-hosted designs that reach <1/100 of API pricing through dedicated-GPU amortization
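Moving from a PoC model to a production model is easier when application code never names a model directly. The sketch below shows one way to do that with a small registry keyed by deployment stage. The class, the stage names, and both backends are hypothetical placeholders; real backends would wrap calls to the chosen runtime, such as Ollama for the PoC stage and vLLM for production.

```python
from typing import Callable, Dict

GenerateFn = Callable[[str], str]


class ModelRegistry:
    """Maps a deployment stage to a text-generation callable."""

    def __init__(self) -> None:
        self._backends: Dict[str, GenerateFn] = {}

    def register(self, stage: str, generate: GenerateFn) -> None:
        """Add or replace the backend for a stage."""
        self._backends[stage] = generate

    def generate(self, stage: str, prompt: str) -> str:
        if stage not in self._backends:
            raise KeyError(f"no backend registered for stage {stage!r}")
        return self._backends[stage](prompt)


# Placeholder backends that only echo the prompt (no model is called).
def poc_backend(prompt: str) -> str:
    return f"[poc: Gemma 4 E4B] {prompt}"


def production_backend(prompt: str) -> str:
    return f"[production: Gemma 4 31B Dense] {prompt}"


registry = ModelRegistry()
registry.register("poc", poc_backend)
registry.register("production", production_backend)

if __name__ == "__main__":
    print(registry.generate("poc", "Summarize this memo."))
```

Swapping in a different production model then means registering a new callable under the same stage name, with no change to calling code.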

FAQ

Q1. Gemma 4 vs Llama 4: which one?
A. Gemma 4 wins on license freedom, Japanese quality, and multimodal. Llama 4 carries an MAU cap and a Meta-competing clause, so global SaaS gets extra legal review. Pick Llama 4 only for English-leaning research use cases.

Q2. Japanese business: Qwen 3.5 or Gemma 4?
A. Qwen 3.5 edges out on JCommonsenseQA / JGLUE, but Gemma 4's multimodal + Apache 2.0 + Vertex AI / AI Studio ecosystem tilt the overall call toward Gemma 4.

Q3. What runs on an RTX 3060 (12GB)?
A. Comfortable: Gemma 4 E4B Q4 (3GB), Qwen 3.5 4B, Phi-5 14B Q4, Mistral Small 3 Q4. Tight: 26B MoE Q4 (10GB). 31B Dense is out. See Gemma 4 System Requirements.

Q4. Why don't benchmark scores match real-world quality?
A. MMLU-Pro / HumanEval test general knowledge and code generation; real workflows (internal Q&A, email drafting, summarization) hinge more on fine-tuning, RAG quality, and prompt engineering. Treat benchmarks as a floor check; A/B test on your own data.

Q5. What's coming in late 2026?
A. Google has Gemini 3.5 Pro slated for June 2026; Meta is preparing reasoning-focused Llama 4 variants; Alibaba has hinted at Qwen 4. Plan for a six-month re-evaluation cycle, with an architecture that swaps models cleanly.

Q6. Anything to watch out for with Gemma 4 commercial use?
A. Apache 2.0 is extremely permissive, but Gemma's Prohibited Use Policy still bans weapons, CSAM, and similar use. Normal business use has no practical issues.
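The advice in Q4 to A/B test on your own data can start very small. The sketch below runs two models over the same (prompt, reference) pairs and tallies which one scores higher. The models are passed in as plain callables, and `exact_match` is a toy metric chosen for illustration; a real evaluation would likely use a task-specific scorer or human review.

```python
from typing import Callable, Dict, Iterable, Tuple


def ab_compare(
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
    score: Callable[[str, str], float],
) -> Dict[str, int]:
    """Run both models on (prompt, reference) pairs and count wins and ties."""
    tally = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for prompt, reference in cases:
        score_a = score(model_a(prompt), reference)
        score_b = score(model_b(prompt), reference)
        if score_a > score_b:
            tally["a_wins"] += 1
        elif score_b > score_a:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally


def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 when the normalized strings are identical, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())


if __name__ == "__main__":
    # Stub models standing in for real inference calls.
    stub_a = lambda prompt: "tokyo" if "Japan" in prompt else "?"
    stub_b = lambda prompt: "paris" if "France" in prompt else "?"
    cases = [("Capital of Japan?", "Tokyo"), ("Capital of France?", "Paris")]
    print(ab_compare(stub_a, stub_b, cases, exact_match))
```

A few dozen representative prompts from the actual workflow usually say more about fit than another point on a public leaderboard.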

Bottom Line

For the May 2026 combination of best total quality per VRAM × Apache 2.0 × native function calling × multimodal × strong Japanese, Gemma 4 is the current default open-weights choice.

But it doesn't win every axis. The realistic answer is hybrid: Gemma 4 as the default, with Mistral (speed), Qwen (Japanese-pure), DeepSeek (math), and Llama 4 (English research) added as needed. We typically benchmark with the customer's own data inside an AI consulting engagement before locking the production stack.

References

Primary:
- Gemma 4 model card (Hugging Face)
- Android Developers Blog: Gemma 4 announcement
- Google Developers Blog: Gemma 4 agentic skills
- Meta: Llama 4
- Qwen docs
- Mistral
- DeepSeek
- Gemma Prohibited Use Policy

Benchmarks:
- Open LLM Leaderboard (Hugging Face)
- Chatbot Arena (lmarena.ai)
- JGLUE / JCommonsenseQA

Related columns:
- Gemma 4 System Requirements
- Gemma 4 + Google AI Studio update
- Gemini 3.5 Flash + Omni
- Argent × Gemma 4: on-device AI agent
- Qwen 3.5 9B Complete Guide

Note: All benchmark figures are May-2026 snapshots aggregated from vendor disclosures and third-party leaderboards. Measurement conditions (few-shot, quantization, eval harness) vary, so confirm with your own A/B tests. Vendor releases land roughly quarterly, so re-validate every 2–3 months.
