株式会社オブライト (Oflight) | AI | 2026-04-17

Gemma 4 Complete Requirements Reference — VRAM, RAM & GPU Quick-Lookup Tables [E2B/E4B/26B/31B All Variants]

Gemma 4 minimum: 5GB RAM (E2B Q4), recommended: 24GB VRAM (31B Dense Q4). Quick-lookup tables covering VRAM, RAM, and GPU requirements for all variants: E2B, E4B, 26B MoE, and 31B Dense.


Gemma 4 Minimum & Recommended System Requirements — Quick Answer

Gemma 4 minimum: 5GB RAM (E2B Q4). Recommended: 24GB VRAM (31B Dense Q4). All variants at a glance:

| Level | Model | Requirement |
|---|---|---|
| Minimum | E2B Q4 | 5GB RAM (CPU-only) |
| Entry Recommended | E4B Q4 | 8GB RAM / 4-5GB VRAM |
| Standard | 26B MoE Q4 | 16GB RAM / 16-18GB VRAM |
| Comfortable | 31B Dense Q4 | 32GB RAM / 20-24GB VRAM |
| Maximum Quality | 31B Dense FP16 | 64GB RAM / 48GB+ VRAM |

All Variants: VRAM, RAM & Hardware Requirements Table

| Variant | Params | Q4 VRAM | Q8 VRAM | FP16 VRAM | Min RAM | Recommended GPU |
|---|---|---|---|---|---|---|
| E2B | 2.3B | 2-3GB | 3-4GB | 5GB | 8GB | GTX 1660+ / M1 |
| E4B | 4.5B | 4-5GB | 6-7GB | 9GB | 8GB | RTX 3060 / M1 Pro |
| 26B MoE | 26B (4B active) | 16-18GB | 28GB | 54GB | 16GB | RTX 4080 / M3 Max |
| 31B Dense | 31B | 20-24GB | 34GB | 62GB | 32GB | RTX 4090 / A100 |

Note: The 26B MoE activates only ~4B parameters per token. All expert weights must still be resident, so its Q4 footprint (16-18GB) tracks the full 26B parameter count; the MoE saving shows up as per-token compute, and therefore speed, rather than as a dramatically smaller memory footprint.

GPU Compatibility & VRAM Requirements Quick-Lookup

| GPU | VRAM | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|---|
| GTX 1660 Super | 6GB | ◎ | ○ (Q4) | ✗ | ✗ |
| RTX 3060 | 12GB | ◎ | ◎ | △ (Q4 only) | ✗ |
| RTX 3090 | 24GB | ◎ | ◎ | ◎ (Q4) | ○ (Q4) |
| RTX 4070 | 12GB | ◎ | ◎ | △ (Q4 only) | ✗ |
| RTX 4080 | 16GB | ◎ | ◎ | ◎ (Q4) | △ (Q3) |
| RTX 4090 | 24GB | ◎ | ◎ | ◎ (Q4) | ◎ (Q4) |
| A100 40GB | 40GB | ◎ | ◎ | ◎ (Q8) | ◎ (Q8) |
| H100 80GB | 80GB | ◎ | ◎ | ◎ (FP16) | ◎ (FP16) |
◎=Excellent ○=Works △=Limited ✗=Not supported

Apple Silicon System Requirements

| Chip | Unified Memory | E2B | E4B | 26B MoE | 31B Dense | 31B Q4 Speed |
|---|---|---|---|---|---|---|
| M1 (8GB) | 8GB | ◎ | △ (Q4, slow) | ✗ | ✗ | — |
| M1/M2 Pro (16GB) | 16GB | ◎ | ◎ | ◎ (Q4) | ✗ | — |
| M2/M3 Max (32GB) | 32GB | ◎ | ◎ | ◎ | ◎ (Q4) | 10-15 tok/s |
| M3 Ultra (64GB) | 64GB | ◎ | ◎ | ◎ | ◎ (Q8) | 25-35 tok/s |
| M4 Max (48GB) | 48GB | ◎ | ◎ | ◎ | ◎ (Q8) | 30-40 tok/s |
| M4 Ultra (192GB) | 192GB | ◎ | ◎ | ◎ | ◎ (FP16) | 50+ tok/s |

Apple Silicon unified memory acts as both system RAM and VRAM, making M2/M3 Max (32GB) or higher ideal for running 31B Dense Q4.

Quantization Level VRAM Requirements (31B Dense)

| Quantization | VRAM Required | Quality (vs FP16) | Speed | Best For |
|---|---|---|---|---|
| FP16 | 62GB | 100% | 1.0x | Research / Max quality |
| Q8_0 | 34GB | 99% | 1.2x | A100 / H100 |
| Q6_K | 26GB | 98% | 1.4x | Dual RTX 3090 |
| Q5_K_M | 22GB | 96% | 1.5x | RTX 4090 (comfortable) |
| Q4_K_M | 20GB | 93% | 1.8x | RTX 4090 (recommended) |
| Q3_K_M | 16GB | 85% | 2.1x | RTX 4080 (compromise) |
| Q2_K | 13GB | 72% | 2.5x | Not recommended |

Q4_K_M is the best balance: only 7% quality loss vs FP16, with 68% VRAM reduction.
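
As a sanity check on the table, weight VRAM follows a simple rule of thumb: parameters (in billions) times effective bits per weight, divided by 8, gives gigabytes. A minimal bash sketch; the effective-bits values below are back-solved from this table (so they fold in runtime overhead) and are illustrative assumptions, not official figures:

```bash
# Rule of thumb: weight VRAM (GB) ≈ params_in_B * effective_bits / 8.
# Effective-bits values are back-solved from the table above and fold in
# runtime overhead; they are illustrative assumptions, not official figures.
vram_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f GB\n", p * b / 8 }'; }

vram_gb 31 16    # FP16   -> 62 GB (matches the table)
vram_gb 31 5.2   # Q4_K_M -> 20 GB
vram_gb 31 4.1   # Q3_K_M -> 16 GB
```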

RAM Requirements — CPU-Only Inference

| Variant | Min RAM | Recommended RAM | CPU Speed |
|---|---|---|---|
| E2B Q4 | 4GB | 8GB | 15-25 tok/s |
| E4B Q4 | 8GB | 16GB | 8-15 tok/s |
| 26B MoE Q4 | 20GB | 32GB | 3-6 tok/s |
| 31B Dense Q4 | 24GB | 48GB | 2-4 tok/s |

CPU-only inference is 5-10x slower than GPU. E2B or E4B Q4 are the only practical CPU-only options.
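
If you want to force CPU-only inference with llama.cpp rather than Ollama, keep every layer off the GPU with -ngl 0. A minimal sketch; the GGUF filename is hypothetical:

```bash
# CPU-only inference in llama.cpp: -ngl 0 keeps all layers on the CPU.
# The GGUF filename is hypothetical; set -t to your physical core count.
./llama-cli -m gemma4-e4b-q4_k_m.gguf -ngl 0 -t 8 \
  -p "Summarize the benefits of quantization."
```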

VRAM-Based Model Selection Flowchart

[Flowchart: model selection by available VRAM]
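
In place of the diagram, the same decision logic can be sketched directly from this article's VRAM thresholds (an illustrative bash sketch, not an official tool):

```bash
# Pick a Gemma 4 variant from available VRAM, using the thresholds
# in the compatibility tables above. Illustrative only.
pick_model() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "31B Dense Q8/FP16"
  elif [ "$vram_gb" -ge 20 ]; then echo "31B Dense Q4_K_M"
  elif [ "$vram_gb" -ge 16 ]; then echo "26B MoE Q4"
  elif [ "$vram_gb" -ge 4  ]; then echo "E4B Q4"
  else                             echo "E2B Q4 (or CPU-only)"
  fi
}

pick_model 24   # -> 31B Dense Q4_K_M (e.g. RTX 4090)
pick_model 12   # -> E4B Q4 (e.g. RTX 3060)
```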

Gemma 4 31B VRAM Requirements

| Spec | Minimum | Recommended | Ideal |
|---|---|---|---|
| VRAM | 16GB (Q3_K_M) | 20-24GB (Q4_K_M) | 48GB+ (FP16) |
| RAM | 32GB | 48GB | 64GB |
| GPU | RTX 4080 | RTX 4090 | A100 / H100 |
| Inference Speed (Q4) | 4-6 tok/s | 10-20 tok/s | 50+ tok/s |

The RTX 4090 (24GB) is the minimum consumer GPU that comfortably runs 31B Dense at Q4_K_M. Extending context beyond 128K adds ~4GB VRAM overhead.

Gemma 4 E2B System Requirements

| Spec | Minimum | Recommended |
|---|---|---|
| RAM (CPU-only) | 4GB | 8GB |
| VRAM (GPU) | 2-3GB | 4GB+ |
| Compatible Devices | Raspberry Pi 5 (8GB), older laptops | MacBook Air M1, GTX 1060+ |

Gemma 4 E2B is designed for edge and embedded use cases. It runs on a Raspberry Pi 5 (8GB model) via llama.cpp, making it the most accessible model in the Gemma 4 family.
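
A minimal llama.cpp setup for a device of this class might look like the following; the build steps are standard, while the GGUF filename is hypothetical:

```bash
# Build llama.cpp from source (works on ARM boards such as the Pi 5).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Run E2B on CPU; the GGUF filename is hypothetical.
./build/bin/llama-cli -m gemma4-e2b-q4_k_m.gguf -t 4 -p "Hello"
```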

Gemma 4 E4B Hardware Requirements

| Spec | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB |
| VRAM | 4-5GB (Q4) | 6-7GB (Q8) |
| Power Draw | Low (60-80W) | |
| Speed on RTX 3060 | 20-30 tok/s | |

E4B runs comfortably on any M2/M3 MacBook Air (16GB unified memory) and RTX 3060 cards. Ideal for users wanting a balance of quality and low power consumption.

Gemma 4 26B MoE Hardware Requirements

| Spec | Value |
|---|---|
| Total Parameters | 26B |
| Active Parameters | ~4B (per inference) |
| Q4 VRAM Required | 16-18GB |
| Speed on RTX 4080 | 30-45 tok/s |
| vs 31B Dense | ~3x faster, ~20% less VRAM at Q4 |

The MoE architecture means 26B delivers near-31B-Dense quality at roughly three times the speed, with a modest VRAM saving on top. The RTX 4080 (16GB) is the perfect match for this model.

Ollama Quick-Start Commands

```bash
# Run Gemma 4 variants with Ollama
ollama run gemma4:e2b        # Minimum requirements
ollama run gemma4:e4b        # Light, efficient
ollama run gemma4:26b        # 26B MoE balanced
ollama run gemma4:31b        # 31B Dense max quality
ollama run gemma4:31b-q4_km  # 31B Q4_K_M for RTX 4090

# Pull specific quantization
ollama pull gemma4:31b-q4_km
ollama pull gemma4:26b-q4_km
```
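
Before pulling one of the larger variants, it is worth confirming how much VRAM is actually free; this is a standard nvidia-smi query, nothing Gemma-specific:

```bash
# Show total and currently used VRAM per GPU.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```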

Context Length and Additional VRAM Requirements (31B Q4)

| Context Length | Additional VRAM | Total VRAM Estimate |
|---|---|---|
| 8K | Baseline | ~20GB |
| 32K | +1.5GB | ~22GB |
| 128K | +4GB | ~24GB |
| 256K | +8GB | ~28GB |

For 256K context on 31B Dense Q4, you need approximately 28GB VRAM, which exceeds the RTX 4090's 24GB. Consider a 32GB+ GPU or use 128K context or less on consumer hardware.
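
Past the 8K baseline, the table grows roughly linearly at about 32MB of KV cache per 1K tokens. A quick sketch of that interpolation; the 20GB base and the 32MB-per-1K slope are read off the table above, not measured:

```bash
# Approximate total VRAM (GB) for 31B Q4 at a given context length (in K tokens).
# Base 20 GB and ~32 MB per 1K tokens are read off the table above.
ctx_vram_gb() { awk -v k="$1" 'BEGIN { printf "%.1f GB\n", 20 + k * 0.032 }'; }

ctx_vram_gb 128   # -> 24.1 GB (table: ~24GB)
ctx_vram_gb 256   # -> 28.2 GB (table: ~28GB)
```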

Multi-GPU Configurations

| Configuration | Total VRAM | Supported Models | Notes |
|---|---|---|---|
| 2x RTX 3090 (NVLink) | 48GB | 31B Q8_0 / Q6_K | NVLink required |
| 2x RTX 4090 (PCIe) | 48GB | 31B Q8_0 / Q6_K | Tensor parallel |
| 2x A100 40GB | 80GB | 31B FP16 | Data center, fastest option |

Note that 48GB does not fit 31B FP16 (62GB per the quantization table above); Q8_0 (34GB) is the practical ceiling for dual 24GB cards.

Both llama.cpp and vLLM support tensor parallelism. PCIe multi-GPU configurations are 15-30% slower than NVLink due to bandwidth constraints.
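
With vLLM, sharding across two cards is a single flag. A minimal sketch; the model ID is hypothetical:

```bash
# Shard the model across 2 GPUs with vLLM tensor parallelism.
# The model ID is hypothetical.
vllm serve google/gemma-4-31b --tensor-parallel-size 2
```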

Power Requirements & Cost Estimates

| GPU | TDP | Inference Draw | Monthly Cost (24/7) |
|---|---|---|---|
| RTX 3060 | 170W | ~120W | ~$15 |
| RTX 4090 | 450W | ~300W | ~$38 |
| A100 40GB | 400W | ~350W | ~$44 |
| H100 80GB | 700W | ~600W | ~$76 |

Estimated at $0.18/kWh. 24/7 inference workload assumed.
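
The monthly column is plain arithmetic: draw in kW, times 720 hours, times the rate. For example, the RTX 4090 row:

```bash
# Monthly cost = draw (kW) * 24 h * 30 d * rate ($/kWh), e.g. the RTX 4090 row.
awk -v watts=300 -v rate=0.18 'BEGIN { printf "$%.2f/month\n", watts / 1000 * 24 * 30 * rate }'
# -> $38.88/month (rounded to ~$38 in the table)
```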

Inference Framework Requirements Comparison

| Framework | VRAM Efficiency | Quantization | Setup Difficulty | Best For |
|---|---|---|---|---|
| Ollama | Excellent (auto) | Q4-Q8 | Easy | Personal / Dev |
| llama.cpp | Excellent (GGUF) | Q2-Q8 | Medium | Custom builds |
| vLLM | Good | BF16/FP16/AWQ | Complex | Production API |
| TGI (HuggingFace) | Good | BF16/GPTQ | Complex | Enterprise |

Ollama is the easiest entry point, automatically selecting the right quantization for your VRAM. For production, vLLM with its OpenAI-compatible API server is the standard choice.
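
Once vLLM is serving, any OpenAI-compatible client can talk to it. A curl sketch against the default local endpoint; the model ID is hypothetical:

```bash
# Query a local vLLM server via its OpenAI-compatible API (default port 8000).
# The model ID is hypothetical.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b",
    "messages": [{"role": "user", "content": "How much VRAM do I need?"}]
  }'
```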

Budget-Based Hardware Recommendations

[Flowchart: hardware recommendations by budget]

Troubleshooting — OOM Errors, Slow Inference & Quantization Selection

| Issue | Cause | Fix |
|---|---|---|
| OOM (Out of Memory) error | Insufficient VRAM | Drop one quantization level (Q5 → Q4 → Q3) |
| Slow inference | CPU offloading occurring | Increase VRAM or use a smaller model |
| Slow model loading | HDD storage | Switch to NVMe SSD |
| Poor output quality | Over-quantized (Q2/Q3) | Use Q4_K_M or higher |

Quantization selection guide: If VRAM allows, use Q5_K_M or Q6_K. If tight, use Q4_K_M. As a last resort, Q3_K_M. Avoid Q2_K due to significant quality degradation.
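
To confirm whether CPU offloading is behind slow inference under Ollama, check where the loaded model is resident:

```bash
# The PROCESSOR column shows the CPU/GPU split for each loaded model;
# anything other than "100% GPU" means some layers were offloaded to CPU.
ollama ps
```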

FAQ — Direct Answers to Common Requirements Questions

Q1. What are the minimum requirements to run Gemma 4?
A. 5GB RAM for E2B Q4 CPU-only inference. This is the absolute minimum.

Q2. Can I run Gemma 4 31B on an RTX 4090?
A. Yes. Q4_K_M quantization uses 20-24GB VRAM, which fits within the RTX 4090's 24GB. Keep context under 128K for best results.

Q3. What models work on an RTX 3060 (12GB)?
A. E2B and E4B work excellently. 26B MoE is possible at Q4. 31B Dense is not supported on 12GB.

Q4. Can I run Gemma 4 with only 8GB RAM?
A. Only E2B Q4 is recommended. E4B may technically load but will be very slow on CPU-only.

Q5. Does Gemma 4 run on a MacBook?
A. Yes. M1/M2/M3 Pro (16GB) handles up to 26B MoE Q4. M2/M3 Max (32GB+) runs 31B Dense Q4 comfortably.

Q6. Can Gemma 4 run without a GPU (CPU only)?
A. Yes, but 5-10x slower than GPU. E2B Q4 or E4B Q4 are the only practical CPU-only options.

Q7. Which quantization level gives the best quality-to-VRAM ratio?
A. Q4_K_M. It reduces VRAM by 68% vs FP16 with only a 7% quality drop.

Q8. Does Gemma 4 support 1M context?
A. No. Maximum supported context is 256K tokens. Using 256K requires approximately 8GB additional VRAM on top of the base model requirement.

Need Help Deploying Gemma 4? Oflight Can Help

From hardware selection and environment setup to API integration and cost optimization, Oflight's engineers provide end-to-end support for on-premise Gemma 4 deployments. We also help teams choose between local GPU, cloud GPU, and hybrid setups based on your workload and budget. Learn more at AI Consulting Services.
