Gemma 4 Complete Requirements Reference — VRAM, RAM & GPU Quick-Lookup Tables [E2B/E4B/26B/31B All Variants]
Gemma 4 minimum: 5GB RAM (E2B Q4), recommended: 24GB VRAM (31B Dense Q4). Quick-lookup tables covering VRAM, RAM, and GPU requirements for all variants: E2B, E4B, 26B MoE, and 31B Dense.
Gemma 4 Minimum & Recommended System Requirements — Quick Answer
Gemma 4 minimum: 5GB RAM (E2B Q4). Recommended: 24GB VRAM (31B Dense Q4). All variants at a glance:
| Level | Model | Requirement |
|---|---|---|
| Minimum | E2B Q4 | 5GB RAM (CPU-only) |
| Entry Recommended | E4B Q4 | 8GB RAM / 4-5GB VRAM |
| Standard | 26B MoE Q4 | 16GB RAM / 16-18GB VRAM |
| Comfortable | 31B Dense Q4 | 32GB RAM / 20-24GB VRAM |
| Maximum Quality | 31B Dense FP16 | 64GB RAM / 64GB+ VRAM |
All Variants: VRAM, RAM & Hardware Requirements Table
| Variant | Params | Q4 VRAM | Q8 VRAM | FP16 VRAM | Min System RAM (GPU inference) | Recommended GPU |
|---|---|---|---|---|---|---|
| E2B | 2.3B | 2-3GB | 3-4GB | 5GB | 8GB | GTX 1660+ / M1 |
| E4B | 4.5B | 4-5GB | 6-7GB | 9GB | 8GB | RTX 3060 / M1 Pro |
| 26B MoE | 26B (4B active) | 16-18GB | 28GB | 54GB | 16GB | RTX 4080 / M3 Max |
| 31B Dense | 31B | 20-24GB | 34GB | 62GB | 32GB | RTX 4090 / A100 |
Note: The 26B MoE activates only ~4B parameters per token. All 26B parameters still have to be resident in memory, so VRAM scales with the total count, but per-token compute (and therefore speed) is closer to that of a 4B model.
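A rough rule of thumb for estimating these figures (an approximation, not an official formula): weight VRAM ≈ parameter count × bits per weight ÷ 8, plus roughly 10-20% for the KV cache and runtime overhead. For example, at Q4 (~4.5 effective bits per weight):
31B × 4.5 ÷ 8 ≈ 17.4GB of weights → ~20GB+ total, matching the 20-24GB above
26B × 4.5 ÷ 8 ≈ 14.6GB of weights → ~16-18GB total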
GPU Compatibility & VRAM Requirements Quick-Lookup
| GPU | VRAM | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|---|
| GTX 1660 Super | 6GB | ◎ | ○ | ✗ | ✗ |
| RTX 3060 | 12GB | ◎ | ◎ | △ (Q4 only) | ✗ |
| RTX 3090 | 24GB | ◎ | ◎ | ◎ | ○ (Q4) |
| RTX 4070 | 12GB | ◎ | ◎ | △ (Q4 only) | ✗ |
| RTX 4080 | 16GB | ◎ | ◎ | ○ (Q4, tight) | △ (Q3) |
| RTX 4090 | 24GB | ◎ | ◎ | ◎ | ◎ (Q4) |
| A100 40GB | 40GB | ◎ | ◎ | ◎ | ◎ |
| H100 80GB | 80GB | ◎ | ◎ | ◎ | ◎ (FP16) |
◎=Excellent ○=Works △=Limited ✗=Not supported
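To find your row in this table, check your card's name and total VRAM first (assumes an NVIDIA GPU with drivers installed):
# Print GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv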
Apple Silicon System Requirements
| Chip | Unified Memory | E2B | E4B | 26B MoE | 31B Dense | 31B Q4 Speed |
|---|---|---|---|---|---|---|
| M1 (8GB) | 8GB | ◎ | △ | ✗ | ✗ | — |
| M1/M2 Pro (16GB) | 16GB | ◎ | ◎ | △ (Q4, tight) | ✗ | — |
| M2/M3 Max (32GB) | 32GB | ◎ | ◎ | ◎ | ◎ (Q4) | 10-15 tok/s |
| M3 Ultra (64GB) | 64GB | ◎ | ◎ | ◎ | ◎ | 25-35 tok/s |
| M4 Max (48GB) | 48GB | ◎ | ◎ | ◎ | ◎ | 30-40 tok/s |
| M4 Ultra (192GB) | 192GB | ◎ | ◎ | ◎ | ◎ (FP16) | 50+ tok/s |
Apple Silicon unified memory acts as both system RAM and VRAM, making M2/M3 Max (32GB) or higher ideal for running 31B Dense Q4.
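To confirm how much unified memory your Mac has, a standard macOS check (not Gemma-specific) works:
# Print total unified memory in GB
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1073741824}'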
Quantization Level VRAM Requirements (31B Dense)
| Quantization | VRAM Required | Quality (vs FP16) | Speed | Best For |
|---|---|---|---|---|
| FP16 | 62GB | 100% | 1.0x | Research / Max quality |
| Q8_0 | 34GB | 99% | 1.2x | A100 / H100 |
| Q6_K | 26GB | 98% | 1.4x | Dual RTX 3090 |
| Q5_K_M | 22GB | 96% | 1.5x | RTX 4090 (comfortable) |
| Q4_K_M | 20GB | 93% | 1.8x | RTX 4090 (recommended) |
| Q3_K_M | 16GB | 85% | 2.1x | RTX 4080 (compromise) |
| Q2_K | 13GB | 72% | 2.5x | Not recommended |
Q4_K_M is the best balance: only 7% quality loss vs FP16, with 68% VRAM reduction.
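A minimal shell sketch of this selection logic, assuming an NVIDIA GPU and nvidia-smi; the thresholds come straight from the table above:
# Pick a 31B Dense quantization level based on free VRAM (MiB)
vram=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
if   [ "$vram" -ge 34000 ]; then echo "Q8_0"
elif [ "$vram" -ge 26000 ]; then echo "Q6_K"
elif [ "$vram" -ge 22000 ]; then echo "Q5_K_M"
elif [ "$vram" -ge 20000 ]; then echo "Q4_K_M"
elif [ "$vram" -ge 16000 ]; then echo "Q3_K_M"
else echo "Use 26B MoE or a smaller variant instead"
fi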
RAM Requirements — CPU-Only Inference
| Variant | Min RAM | Recommended RAM | CPU Speed |
|---|---|---|---|
| E2B Q4 | 5GB | 8GB | 15-25 tok/s |
| E4B Q4 | 8GB | 16GB | 8-15 tok/s |
| 26B MoE Q4 | 20GB | 32GB | 3-6 tok/s |
| 31B Dense Q4 | 24GB | 48GB | 2-4 tok/s |
CPU-only inference is 5-10x slower than GPU. E2B or E4B Q4 are the only practical CPU-only options.
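For CPU-only runs, llama.cpp is the usual tool. A minimal sketch, assuming you have a GGUF build of E2B (the filename here is hypothetical):
# 8 threads, 4K context, CPU-only inference
./llama-cli -m gemma4-e2b-q4_k_m.gguf -t 8 -c 4096 -n 128 -p "Explain quantization in one sentence."
Match -t to your physical core count; hyperthreads rarely help throughput.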
VRAM-Based Model Selection Flowchart
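Match your available VRAM to the highest tier you can reach (figures taken from the tables above):
≤4GB VRAM (or CPU-only) → E2B Q4
4-8GB VRAM → E4B Q4
12GB VRAM → E4B Q8, or 26B MoE Q4 with partial offloading
16-18GB VRAM → 26B MoE Q4
20-24GB VRAM → 31B Dense Q4_K_M
48GB+ VRAM → 31B Dense Q8_0; 64GB+ → 31B Dense FP16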
Gemma 4 31B VRAM Requirements
| Spec | Minimum | Recommended | Ideal |
|---|---|---|---|
| VRAM | 16GB (Q3_K_M) | 20-24GB (Q4_K_M) | 64GB+ (FP16) |
| RAM | 32GB | 48GB | 64GB |
| GPU | RTX 4080 | RTX 4090 | A100 / H100 |
| Inference Speed (Q4) | 4-6 tok/s | 10-20 tok/s | 50+ tok/s |
The RTX 4090 (24GB) is the minimum consumer GPU that comfortably runs 31B Dense at Q4_K_M. Per the context table below, running at 128K context adds ~4GB VRAM over the 8K baseline, and 256K adds ~8GB.
Gemma 4 E2B System Requirements
| Spec | Minimum | Recommended |
|---|---|---|
| RAM (CPU-only) | 5GB | 8GB |
| VRAM (GPU) | 2-3GB | 4GB+ |
| Compatible Devices | Raspberry Pi 5 (8GB), older laptops | MacBook Air M1, GTX 1660+ |
Gemma 4 E2B is designed for edge and embedded use cases. It runs on a Raspberry Pi 5 (8GB model) via llama.cpp, making it the most accessible model in the Gemma 4 family.
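If you want to reproduce the Raspberry Pi setup, the standard llama.cpp CMake build works on 64-bit Pi OS; a sketch, with the GGUF filename hypothetical:
# Build llama.cpp and run E2B on a Raspberry Pi 5
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j4
./build/bin/llama-cli -m gemma4-e2b-q4_k_m.gguf -t 4 -p "Hello"
Expect speeds at or below the low end of the CPU-only table above.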
Gemma 4 E4B Hardware Requirements
| Spec | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB |
| VRAM | 4-5GB (Q4) | 6-7GB (Q8) |
| Power Draw | Low (60-80W) | — |
| Speed on RTX 3060 | 20-30 tok/s | — |
E4B runs comfortably on any M2/M3 MacBook Air (16GB unified memory) and RTX 3060 cards. Ideal for users wanting a balance of quality and low power consumption.
Gemma 4 26B MoE Hardware Requirements
| Spec | Value |
|---|---|
| Total Parameters | 26B |
| Active Parameters | ~4B (per inference) |
| Q4 VRAM Required | 16-18GB |
| Speed on RTX 4080 | 30-45 tok/s |
| vs 31B Dense | ~3x faster, ~20% less VRAM (Q4) |
The MoE architecture means the 26B delivers near-31B Dense quality while running roughly 3x faster. The RTX 4080 (16GB) is the natural pairing at Q4, though it sits at the low end of the 16-18GB range, so long contexts may force a few layers onto the CPU.
Ollama Quick-Start Commands
# Run Gemma 4 variants with Ollama
ollama run gemma4:e2b # Minimum requirements
ollama run gemma4:e4b # Light, efficient
ollama run gemma4:26b # 26B MoE balanced
ollama run gemma4:31b # 31B Dense max quality
ollama run gemma4:31b-q4_km # 31B Q4_K_M for RTX 4090
# Pull specific quantization
ollama pull gemma4:31b-q4_km
ollama pull gemma4:26b-q4_km
Context Length and Additional VRAM Requirements (31B Q4)
| Context Length | Additional VRAM | Total VRAM Estimate |
|---|---|---|
| 8K | Baseline | ~20GB |
| 32K | +1.5GB | ~22GB |
| 128K | +4GB | ~24GB |
| 256K | +8GB | ~28GB |
For 256K context on 31B Dense Q4 you need approximately 28GB of VRAM, which exceeds the RTX 4090's 24GB. Use a 32GB+ GPU, or cap context at 128K on consumer hardware.
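To request a specific context length from Ollama, set num_ctx per request (a standard Ollama option; the model tag follows the naming used in the quick-start section above):
# 128K context on 31B Q4; budget ~24GB VRAM per the table
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:31b-q4_km",
  "prompt": "Summarize the following...",
  "options": { "num_ctx": 131072 }
}'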
Multi-GPU Configurations
| Configuration | Total VRAM | Supported Models | Notes |
|---|---|---|---|
| 2x RTX 3090 (NVLink) | 48GB | 31B Q8_0 / Q6_K | NVLink required |
| 2x RTX 4090 (PCIe) | 48GB | 31B Q8_0 / Q6_K | Tensor parallel |
| 2x A100 40GB | 80GB | 31B FP16 (fast) | Data center |
Both llama.cpp and vLLM support tensor parallelism. PCIe multi-GPU configurations are 15-30% slower than NVLink due to bandwidth constraints.
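Typical launch flags for a two-GPU split, as a sketch (the GGUF filename is hypothetical; see the framework section below for the vLLM equivalent):
# llama.cpp: split the model evenly across two GPUs, all layers offloaded
./llama-server -m gemma4-31b-q8_0.gguf --tensor-split 1,1 -ngl 99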
Power Requirements & Cost Estimates
| GPU | TDP | Inference Draw | Monthly Cost (24/7) |
|---|---|---|---|
| RTX 3060 | 170W | ~120W | ~$15 |
| RTX 4090 | 450W | ~300W | ~$38 |
| A100 40GB | 400W | ~350W | ~$44 |
| H100 80GB | 700W | ~600W | ~$76 |
Estimated at $0.18/kWh. 24/7 inference workload assumed.
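These monthly figures follow directly from draw × hours × rate. Worked example for the RTX 4090 row: 300W × 720h = 216kWh per month; 216kWh × $0.18/kWh ≈ $38.9, rounded to ~$38.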
Inference Framework Requirements Comparison
| Framework | VRAM Efficiency | Quantization | Setup Difficulty | Best For |
|---|---|---|---|---|
| Ollama | Excellent (auto) | Q4-Q8 | Easy | Personal / Dev |
| llama.cpp | Excellent (GGUF) | Q2-Q8 | Medium | Custom builds |
| vLLM | Good | BF16/FP16/AWQ | Complex | Production API |
| TGI (HuggingFace) | Good | BF16/GPTQ | Complex | Enterprise |
Ollama is the easiest entry point, automatically handling model downloads and GPU offload for your available VRAM. For production, vLLM with its OpenAI-compatible API server is the standard choice.
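A minimal vLLM sketch (the model ID here is hypothetical; substitute the real checkpoint name):
# Start an OpenAI-compatible server across two GPUs
vllm serve google/gemma4-31b --tensor-parallel-size 2
# Query the standard OpenAI-style endpoint
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "google/gemma4-31b", "prompt": "Hello", "max_tokens": 32}'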
Budget-Based Hardware Recommendations
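A quick mapping of budget tiers to typical builds, derived from the GPU tables in this guide (not separate pricing research):
Entry: RTX 3060 (12GB) + 16GB RAM → E4B Q8 as a daily driver
Mid-range: RTX 4080 (16GB) + 32GB RAM → 26B MoE Q4
High-end: RTX 4090 (24GB) + 48GB RAM → 31B Dense Q4_K_M
Workstation: 2x RTX 3090/4090 or A100 → 31B Q8_0 and above
Mac alternative: M2/M3 Max (32GB+) → 31B Dense Q4 with no discrete GPU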
Troubleshooting — OOM Errors, Slow Inference & Quantization Selection
| Issue | Cause | Fix |
|---|---|---|
| OOM (Out of Memory) error | Insufficient VRAM | Drop one quantization level (Q5 → Q4 → Q3) |
| Slow inference | CPU offloading occurring | Increase VRAM or use a smaller model |
| Slow model loading | HDD storage | Switch to NVMe SSD |
| Poor output quality | Over-quantized (Q2/Q3) | Use Q4_K_M or higher |
Quantization selection guide: If VRAM allows, use Q5_K_M or Q6_K. If tight, use Q4_K_M. As a last resort, Q3_K_M. Avoid Q2_K due to significant quality degradation.
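If you still hit OOM after dropping a quantization level, partial offloading keeps the model usable; a llama.cpp sketch (the -ngl value is a starting point to tune, and the filename is hypothetical):
# Offload only 30 layers to the GPU; the rest stay in CPU RAM
./llama-cli -m gemma4-31b-q4_k_m.gguf -ngl 30 -c 8192 -p "..."
Lowering the context size (-c) also directly shrinks the KV cache.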
FAQ — Direct Answers to Common Requirements Questions
Q1. What are the minimum requirements to run Gemma 4?
A. 5GB RAM for E2B Q4 CPU-only inference. This is the absolute minimum.
Q2. Can I run Gemma 4 31B on an RTX 4090?
A. Yes. Q4_K_M quantization uses 20-24GB VRAM, which fits within the RTX 4090's 24GB. Keep context at or below 128K for best results.
Q3. What models work on an RTX 3060 (12GB)?
A. E2B and E4B work excellently. 26B MoE is possible at Q4 only, with partial CPU offloading. 31B Dense is not supported on 12GB.
Q4. Can I run Gemma 4 with only 8GB RAM?
A. E2B Q4 runs comfortably. E4B Q4 fits at its 8GB minimum but leaves little headroom and will be slow on CPU alone; 16GB is recommended for E4B.
Q5. Does Gemma 4 run on a MacBook?
A. Yes. M1/M2 Pro (16GB) handles E4B comfortably, and 26B MoE Q4 is a tight fit at best. M2/M3 Max (32GB+) runs 31B Dense Q4 comfortably.
Q6. Can Gemma 4 run without a GPU (CPU only)?
A. Yes, but 5-10x slower than GPU. E2B Q4 and E4B Q4 are the only practical CPU-only options.
Q7. Which quantization level gives the best quality-to-VRAM ratio?
A. Q4_K_M. It reduces VRAM by 68% vs FP16 with only a 7% quality drop.
Q8. Does Gemma 4 support 1M context?
A. No. The maximum supported context is 256K tokens, which requires approximately 8GB of additional VRAM on top of the base model requirement.
Need Help Deploying Gemma 4? Oflight Can Help
From hardware selection and environment setup to API integration and cost optimization, Oflight's engineers provide end-to-end support for on-premise Gemma 4 deployments. We also help teams choose between local GPU, cloud GPU, and hybrid setups based on your workload and budget. Learn more at AI Consulting Services.