Oflight Inc. (株式会社オブライト)
AI · 2026-04-10

Qwen 3.5 27B Dense & 35B-A3B MoE Complete Guide — DFlash Acceleration Breaks 24GB GPU Limits [2026]

Compare Qwen 3.5 27B Dense vs 35B-A3B MoE, check 24GB GPU requirements, learn DFlash 2–3x acceleration, and follow step-by-step Ollama setup instructions.


What Are Qwen 3.5 27B / 35B-A3B? The Largest Open LLMs Running on a Single 24GB GPU

Qwen 3.5 27B Dense and Qwen 3.5-35B-A3B MoE are the flagship models of the Qwen 3.5 series, released by Alibaba's Qwen team in late 2025. Both are licensed under Apache 2.0, allowing unrestricted commercial use. The 27B Dense model activates all 27 billion parameters during inference, delivering consistently high-quality output for translation, storytelling, and complex reasoning. The 35B-A3B MoE activates only 3B parameters out of 35B total, achieving over 5x faster inference with a sparse architecture. Both models run comfortably on RTX 3090 or RTX 4090 GPUs with 24GB VRAM, making them the largest open-weight LLMs practically deployable on consumer hardware today.
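The sparse-activation idea behind the MoE model can be sketched in a few lines: a router scores all experts for each token, and only the top-k experts actually run. The toy NumPy sketch below uses hypothetical shapes (not the real Qwen architecture) and is only meant to show why compute scales with active rather than total parameters.

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Toy MoE layer: route one token to its top-k experts only."""
    logits = x @ router_weights                 # router score per expert
    topk = np.argsort(logits)[-k:]              # indices of the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates = gates / gates.sum()                 # softmax over chosen experts
    out = np.zeros_like(x)
    # Only the selected experts' weights are touched for this token;
    # the other experts contribute no compute at all.
    for g, i in zip(gates, topk):
        out = out + g * (x @ expert_weights[i])
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
experts = rng.normal(size=(n_experts, d, d))
router = rng.normal(size=(d, n_experts))
y = moe_layer(x, experts, router, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts active, only 1/8 of the expert weights are read per token, which is the same mechanism that lets the 35B-A3B touch only ~3B parameters per step.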

27B Dense vs 35B-A3B MoE — Full Specification Comparison

| Attribute | Qwen 3.5-27B Dense | Qwen 3.5-35B-A3B MoE |
|---|---|---|
| Total Parameters | 27B | 35B |
| Active Parameters | 27B (all) | 3B (sparse) |
| VRAM (Q4_K_M) | ~16.1 GB | ~19.6 GB |
| Inference Speed | 30–50 tok/s | 150+ tok/s (~5x faster) |
| Inference Stability | Extremely high | Slightly lower |
| Translation Quality | Best-in-class | Good |
| Storytelling | Best-in-class | Good |
| Batch Processing | Average | Excellent |
| Recommended GPU | RTX 3090/4090 (24GB) | RTX 3090/4090 (24GB) |

VRAM-Based Model Selection Flowchart

(Diagram not shown: flowchart for choosing a model based on available VRAM.)

Hardware Requirements — Detailed Breakdown

| Hardware | 27B Q4 | 27B FP16 | 35B-A3B Q4 | 35B-A3B FP16 |
|---|---|---|---|---|
| RTX 3090 (24GB) | OK (30–50 tok/s) | Not possible | OK (fast) | Not possible |
| RTX 4090 (24GB) | Comfortable (40–60 tok/s) | Not possible | Comfortable | Not possible |
| M4 Pro (24GB) | OK (35–45 tok/s) | Not possible | OK | Not possible |
| M4 Max (48GB) | Comfortable | OK | Comfortable | OK |
| A100 (80GB) | Optimal | Comfortable | Optimal | Comfortable |

Ollama Setup Instructions

With Ollama, you can run Qwen 3.5 27B or 35B-A3B in just a few commands. Download and install Ollama from ollama.com, then run the following:

```bash
# 27B Dense
ollama run qwen3.5:27b

# 35B-A3B MoE
ollama run qwen3.5:35b-a3b

# Specify quantization level
ollama run qwen3.5:27b-q4_K_M
```

Expect around 16GB of disk space for the 27B model and ~20GB for 35B-A3B. First-time download takes 10–30 minutes depending on your internet connection.
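Once the model is pulled, Ollama also serves a REST API on `localhost:11434`, so you can script against the local model instead of using the interactive prompt. A minimal sketch using only the standard library (the model tag is the one from the commands above):

```python
import json
from urllib import request

def ask(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt to a locally running Ollama server and return the reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama server with the model pulled:
# print(ask("qwen3.5:27b", "Explain the difference between dense and MoE models."))
```

Setting `"stream": False` returns the full response in one JSON object, which is simpler for batch scripts; leave streaming on for interactive UIs.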

Quantization Levels — Comparing VRAM, Quality, and Speed

| Quantization | 27B VRAM | 35B-A3B VRAM | Quality (27B baseline) | Speed (27B baseline) |
|---|---|---|---|---|
| FP16 | 54 GB | 70 GB | 100% | 1x |
| Q8_0 | 28.6 GB | 37 GB | 99% | 1.2x |
| Q6_K | 22.1 GB | 28.6 GB | 98% | 1.4x |
| Q5_K_M | 18.9 GB | 24.5 GB | 96% | 1.5x |
| Q4_K_M | 16.1 GB | 19.6 GB | 93% | 1.8x |

For 24GB GPU users, Q4_K_M is the standard choice for both models — it offers the best balance of quality and speed.
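The VRAM figures in the table follow from simple arithmetic: weight memory is roughly parameters times bits-per-weight divided by 8, plus some allowance for KV cache and runtime buffers. The bits-per-weight values below are approximate figures for llama.cpp-style quantization, and the fixed 1 GB overhead is an assumption; real usage grows with context length.

```python
# Approximate bits per weight for common llama.cpp quantization schemes.
BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def est_vram_gb(n_params: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate: weight memory plus a fixed allowance for the
    KV cache and runtime buffers. A sanity check, not a guarantee."""
    weights_gb = n_params * BPW[quant] / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

print(est_vram_gb(27e9, "Q4_K_M"))  # ~17.4 GB: fits a 24GB card
print(est_vram_gb(27e9, "FP16"))    # 55.0 GB: needs datacenter hardware
```

Running the numbers this way makes it easy to check whether a given quantization of a new model will fit your card before downloading it.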

What Is DFlash? Block-Diffusion Speculative Decoding Explained

DFlash is a block-diffusion-based speculative decoding technique. Unlike traditional autoregressive generation (one token at a time), DFlash uses a lightweight diffusion model to generate multiple tokens in parallel, which are then verified and adopted by the main LLM. This approach outperforms EAGLE-3 — previously the fastest speculative decoding method — by 2.5x. The gains are especially significant with MoE models: applying DFlash to Qwen 3.5-35B-A3B can push throughput from 150 tok/s to 300–420 tok/s in some reported benchmarks.
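The draft-and-verify loop at the heart of speculative decoding can be shown with toy models. This sketch is the classic greedy variant, not DFlash itself: DFlash replaces the autoregressive drafter with a block-diffusion model that proposes its tokens in parallel, but the accept-longest-prefix verification step is the same idea. The "models" here are trivial stand-in functions, not real LLMs.

```python
def speculative_step(draft, target, ctx, k=4):
    """One draft-and-verify step of greedy speculative decoding.

    `draft` cheaply proposes k tokens; `target` (the big model) verifies
    them and keeps the longest agreeing prefix, so each step emits
    between 1 and k+1 tokens instead of exactly 1.
    """
    # 1) Drafter proposes k tokens (in DFlash this happens in parallel).
    proposed = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))
    # 2) Target verifies each position (in a real system: one batched pass).
    accepted = []
    for tok in proposed:
        if target(ctx + accepted) == tok:
            accepted.append(tok)                      # drafter agreed
        else:
            accepted.append(target(ctx + accepted))   # fix-up token, stop
            return accepted
    # 3) All k accepted: the target contributes one bonus token for free.
    accepted.append(target(ctx + accepted))
    return accepted

# Toy "models": next token is last token + 1; the drafter goes wrong after 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
out = speculative_step(draft, target, [0], k=4)
print(out)  # [1, 2, 3]: two accepted draft tokens plus the target's correction
```

Because the target model validates several positions per forward pass, throughput rises in proportion to how often the drafter agrees with it, which is why the quality of the draft model matters so much.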

Traditional Autoregressive vs DFlash — Architecture Comparison

(Diagram not shown: token-by-token autoregressive decoding vs DFlash's parallel draft-and-verify pipeline.)

DFlash Benchmark Speed Comparison

| Model | Baseline Speed | With DFlash | Speedup |
|---|---|---|---|
| Qwen 3.5-35B-A3B | 150 tok/s | 300–420 tok/s | 2–2.8x |
| Qwen 3.5-9B | 80 tok/s | 280 tok/s | 3.5x |
| Qwen 3.5-27B | 40 tok/s | 80–100 tok/s | 2–2.5x |

DFlash gains are larger for smaller active-parameter models. The 9B achieves an impressive 3.5x speedup, while the 35B-A3B MoE (3B active) also sees 2–2.8x gains.

How to Enable DFlash and Current Support Status

DFlash is currently available through vLLM and SGLang. For vLLM, simply specify the DFlash diffusion model via the `--speculative-model` flag at startup. Integration into llama.cpp is under active discussion (GitHub Issue #21569) but has not yet been merged. Since Ollama runs on llama.cpp, DFlash support for Ollama users will require waiting for the llama.cpp implementation to land. For production server workloads and batch inference, you can benefit from DFlash right now via vLLM.
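A vLLM launch with a DFlash drafter would look roughly like the following. This is a sketch only: the draft-model argument is a placeholder, not a real checkpoint name, and the exact flag set varies between vLLM releases (newer versions consolidate these options into a speculative-config setting), so check the docs for your installed version.

```shell
# Sketch only: <dflash-draft-model> is a placeholder, and flag names
# may differ depending on your vLLM version.
vllm serve Qwen/Qwen3.5-35B-A3B \
  --speculative-model <dflash-draft-model> \
  --num-speculative-tokens 4
```

Once the server is up, the OpenAI-compatible endpoint behaves exactly as without speculation; the speedup is transparent to clients.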

Why Choose 27B Dense Over MoE

Dense architecture, where all parameters are active at every inference step, provides unmatched stability and consistency in outputs. For tasks like translation, storytelling, and multi-step reasoning, the 27B Dense model produces less variance compared to MoE. This makes it the preferred choice for production applications where output quality consistency matters more than raw speed — such as business document generation, technical writing, and professional translation workflows.

Why Choose 35B-A3B MoE Over Dense

With only 3B active parameters, the 35B-A3B MoE delivers roughly 5x the inference speed of the 27B Dense on the same hardware. On a single RTX 4090, it achieves 150+ tok/s compared to 30–50 tok/s for the Dense model. Combined with DFlash, throughput can reach 420 tok/s — approaching or exceeding cloud API performance locally. For chatbots, real-time response systems, and multi-user concurrent workloads, the 35B-A3B MoE is the superior choice.

Japanese Language Performance — 201 Languages, Business-Ready Output

Qwen 3.5 supports 201 languages with particularly strong Japanese capabilities. The 27B Dense model significantly outperforms the 9B in Japanese translation, business writing, and technical documentation. For enterprises handling sensitive data that cannot be sent to cloud APIs, running the 27B locally delivers production-quality Japanese output while maintaining full data sovereignty.

9B → 27B/35B Upgrade Decision Checklist

| Criteria | Stay with 9B | Upgrade to 27B/35B |
|---|---|---|
| VRAM | 8GB or less | 24GB or more |
| Japanese quality | Basic chat | Business docs, translation |
| Reasoning complexity | Simple Q&A | Multi-step analysis |
| Speed requirement | Flexible | Need 30–50 tok/s at Q4 |
| Budget | No GPU upgrade | RTX 4090 equivalent needed |

If any right-column condition applies to your use case, upgrading is likely worth the investment.

Cost Comparison — Local 27B vs Cloud API

| Item | Local 27B (RTX 4090) | Claude Sonnet API |
|---|---|---|
| Upfront cost | ~$1,650 (GPU) | $0 |
| Monthly cost | ~$10 (electricity) | $200–$700 |
| 6-month total | ~$1,720 | ~$1,200–$4,200 |
| Break-even | ~3 months | — |
| Data security | Fully on-premises | Sent to cloud |

If you're spending heavily on cloud API calls, a single RTX 4090 pays for itself quickly: at around $700/month of API spend the break-even is roughly 3 months, and even at $200/month it arrives in under a year, with the added benefit of complete data privacy.
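The break-even arithmetic is simple enough to check yourself, using the figures from the table above ($1,650 GPU, ~$10/month electricity):

```python
import math

def break_even_months(gpu_cost: float, local_monthly: float, api_monthly: float) -> int:
    """Months until buying the GPU is cheaper than staying on the API."""
    saved_per_month = api_monthly - local_monthly
    return math.ceil(gpu_cost / saved_per_month)

print(break_even_months(1650, 10, 700))  # 3: heavy API user
print(break_even_months(1650, 10, 200))  # 9: light API user
```

Plug in your own monthly API bill to see where you fall on that range.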

What's Next — DFlash + llama.cpp and Qwen 3.6

Once DFlash lands in llama.cpp, Ollama users will be able to enjoy 2–3x speed gains without any additional configuration. The discussion on GitHub is active and an implementation in 2026 seems plausible. On the model side, Alibaba has indicated that an open-weight version of Qwen 3.6 is in planning. If released, it would bring another generation of 27B–35B-class models to the open-source community, further raising the quality ceiling for local LLM deployments.

FAQ — 7 Common Questions

**Q1: Which should I pick — 27B Dense or 35B-A3B MoE?** Choose 27B Dense for quality and stability (translation, business docs, storytelling). Choose 35B-A3B MoE for speed and throughput (chatbots, batch processing, real-time responses).

**Q2: Can I run the 27B on an RTX 3060 (12GB)?** No. The 27B Q4_K_M requires ~16.1GB VRAM, which exceeds the RTX 3060's 12GB. Use the 9B model (~5.1GB Q4) instead.

**Q3: Will the 27B run on a Mac mini M4 with 16GB?** Technically possible but extremely slow, since 16GB is right at the memory limit. The 9B runs comfortably. For the 27B, you need M4 Pro (24GB) or higher.

**Q4: Can I use DFlash right now?** Yes, if you use vLLM or SGLang. Ollama and standard llama.cpp do not yet support DFlash.

**Q5: How much better is 27B compared to 9B in practice?** Expect 20–30% improvement in translation accuracy, complex reasoning, and long-form Japanese output quality. The gap is especially noticeable in professional document tasks.

**Q6: Is commercial use allowed?** Yes. Both models are released under Apache 2.0, which permits unrestricted commercial use, product integration, and redistribution.

**Q7: When will Qwen 3.6 open-weight be available?** As of April 2026, Alibaba has mentioned open-weight plans for Qwen 3.6, but no specific release date has been announced.

Professional Support for Local LLM Deployment

Whether you need help setting up a local RAG pipeline with Qwen 3.5 27B, designing an on-premises AI infrastructure, or configuring high-speed inference with DFlash via vLLM, Oflight's AI consulting team is here to help. We support businesses across Tokyo and nationwide with practical, secure AI implementation. → Explore AI Consulting Services

Feel free to contact us
