Oflight Inc.
AI · 2026-03-04

Qwen3.5-9B Local Setup Guide: Step-by-Step Installation on Mac, Windows & Linux

A complete step-by-step guide to installing Qwen3.5-9B locally on Mac, Windows, and Linux. Covers setup via Ollama, llama.cpp, and vLLM, along with quantization options (GGUF Q4/Q5/Q8), GPU acceleration (CUDA/Metal), Docker deployment, API server configuration, performance tuning, and troubleshooting. Practical instructions for developers and IT administrators looking to run SLMs on-premises.


Why Run Qwen3.5-9B Locally?

Running AI models locally offers three significant advantages: cost reduction, data privacy, and response speed. Cloud API usage incurs per-token charges that scale with volume, leading to unpredictable monthly expenses. By running Qwen3.5-9B locally, the main cost is the initial hardware investment, with ongoing expenses limited to electricity. For businesses in Shinagawa and Minato Ward, keeping customer data off external servers provides peace of mind for compliance with Japan's personal information protection regulations and ISMS certification. Because no external network round-trip is involved, inference within your local network is substantially faster than cloud API calls. This guide provides detailed, step-by-step instructions for setting up Qwen3.5-9B on Mac, Windows, and Linux, designed to be accessible even for those new to local AI deployment.

Recommended System Requirements: RAM, GPU, and Storage

Let's review the recommended specifications for running Qwen3.5-9B comfortably. The absolute minimum RAM is 8GB, but 16GB or more is recommended. Q4 quantized models consume approximately 5GB of RAM, while Q8 quantization uses about 9GB. Running at full precision (FP16) requires 18GB or more of RAM or VRAM. A GPU is optional but accelerates inference by 2-5x when available. NVIDIA RTX 3060 or higher (8GB+ VRAM) or Apple Silicon M1 or later Macs are recommended for GPU acceleration. Storage requirements include at least 10GB for model files, with approximately 20GB of free space recommended when accounting for tools and caches. For SMBs in Ota and Setagaya Ward, most business PCs purchased within the last 2-3 years should meet these requirements without modification. Check your system specifications beforehand and consider RAM upgrades if needed.
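The memory figures above follow a simple rule of thumb: a model's footprint is roughly its parameter count times the bits per weight, divided by eight, plus runtime overhead for the KV cache and the inference engine. The sketch below encodes that back-of-the-envelope calculation; the bits-per-weight figures and the flat 1GB overhead are simplifying assumptions, not exact values.

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 1.0) -> float:
    """Back-of-the-envelope RAM estimate: weights plus flat runtime/KV-cache overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# Approximate figures for a 9B-parameter model; K-quants store slightly more
# than their nominal bit width, hence the fractional values.
for label, bits in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{label}: ~{estimate_ram_gb(9, bits)} GB")
```

Running this reproduces the ballpark numbers quoted in this section, which is a handy sanity check before buying or upgrading hardware.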

Fastest Setup with Ollama (Mac/Windows/Linux)

Ollama is the easiest tool for deploying Qwen3.5-9B locally. Download the installer for your operating system from ollama.com and follow the on-screen instructions. On Mac, you can also install via Homebrew with "brew install ollama." After installation, open a terminal (or command prompt) and type "ollama run qwen3.5:9b" to automatically download the model and start an interactive session. The initial download is approximately 5GB for the Q4 quantized version; download time will vary with your connection speed. Ollama also operates as a background server, providing an OpenAI-compatible API endpoint at "http://localhost:11434," enabling straightforward integration with existing applications. For startups in Shinagawa and Shibuya looking to rapidly prototype AI features, Ollama offers the fastest path from zero to a working local LLM setup.
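Because the endpoint is OpenAI-compatible, any HTTP client can talk to it. Below is a minimal sketch using only the Python standard library; the model name and localhost URL follow the guide above, and the actual network call is left commented out since it requires a running Ollama server.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's OpenAI-compatible chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("qwen3.5:9b", "Summarize this paragraph in one sentence.")
# with urllib.request.urlopen(req) as resp:              # requires a running Ollama server
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

The same request shape works against any OpenAI-compatible server, so code written this way can later be pointed at llama.cpp's server or vLLM by changing only the base URL.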

Detailed Mac mini M4 Setup

The Apple Silicon Mac mini M4 is one of the ideal platforms for running Qwen3.5-9B locally. The M4 chip's Unified Memory architecture allows CPU and GPU to share the same memory pool, enabling GPU-accelerated inference via the Metal API without a dedicated graphics card. The 16GB RAM model runs Q4/Q5 quantized versions comfortably, while the 24GB model supports Q8 and even FP16 high-quality inference. Setup begins with installing Xcode Command Line Tools (xcode-select --install), followed by Ollama or llama.cpp installation. For llama.cpp, build with Metal enabled using "cmake -B build -DGGML_METAL=ON && cmake --build build" (recent llama.cpp versions enable Metal by default on Apple Silicon, so the flag is mainly a safeguard). Expected inference speed on Mac mini M4 (16GB) with Q4 quantization is approximately 40-60 tokens per second, providing ample speed for real-time conversations. This configuration is particularly easy to adopt in design studios in Minato and Meguro Ward where Macs serve as primary workstations.

Windows Setup: WSL2 and Native Options

Windows users have two primary paths: WSL2 (Windows Subsystem for Linux 2) or native Windows builds. For WSL2, open PowerShell as administrator and run "wsl --install" to set up an Ubuntu distribution. Within WSL2, follow the same Linux procedures for installing Ollama or llama.cpp. If you have an NVIDIA GPU, install CUDA-compatible drivers on the Windows side, and WSL2 will automatically detect the GPU. For native Windows, Ollama provides a Windows installer with a GUI-based setup experience. llama.cpp can also be built natively using CMake and Visual Studio Build Tools. For Windows-centric office environments in Shinagawa and Ota Ward, IT administrators can typically complete the entire setup in approximately 30 minutes. Pay particular attention to CUDA driver version compatibility, as mismatches are the most common source of GPU-related issues—always verify compatible versions in the official documentation before proceeding.

Quantization Options: Choosing Between GGUF Q4, Q5, and Q8

Quantization converts model parameters to lower-precision numerical formats, reducing memory usage and improving inference speed. For Qwen3.5-9B in GGUF format, three quantization levels are widely used: Q4_K_M, Q5_K_M, and Q8_0. Q4_K_M (4-bit quantization) produces a model file of approximately 5GB, optimal for memory-constrained environments. Quality degradation is minimal, and accuracy remains sufficient for general business applications. Q5_K_M (5-bit) at approximately 6.5GB offers slightly higher quality while maintaining good memory efficiency—a balanced middle ground. Q8_0 (8-bit) at approximately 9GB preserves near-FP16 quality but requires more memory. The choice depends on your available RAM and desired output quality. Development teams in Shibuya and Minato Ward often find it effective to use Q4 for development and testing, then deploy Q5 or Q8 for production workloads where quality matters most.
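The trade-off described above can be captured in a small helper that picks the highest-quality quantization whose file fits in available memory. The file sizes come from this section; the 3GB headroom for the OS, KV cache, and other processes is an illustrative assumption you should tune for your environment.

```python
# (name, approx. GGUF file size in GB) — figures from the section above
QUANT_LEVELS = [("Q8_0", 9.0), ("Q5_K_M", 6.5), ("Q4_K_M", 5.0)]

def pick_quantization(available_ram_gb: float, headroom_gb: float = 3.0) -> str:
    """Return the highest-quality level that fits, leaving headroom for OS/KV cache."""
    for name, size_gb in QUANT_LEVELS:
        if size_gb + headroom_gb <= available_ram_gb:
            return name
    raise ValueError("Not enough RAM even for Q4_K_M; consider a smaller model")

print(pick_quantization(16))  # 16GB machine -> Q8_0
print(pick_quantization(8))   # 8GB machine  -> Q4_K_M
```

This mirrors the practice mentioned above: an 8GB development laptop lands on Q4, while a 16GB production box can afford Q8.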

Advanced Setup with llama.cpp and vLLM

When finer parameter control is needed beyond what Ollama provides, llama.cpp and vLLM are excellent alternatives. llama.cpp is a lightweight C/C++ inference engine optimized for CPU inference, with support for CUDA, Metal, and OpenCL GPU backends. Clone the GitHub repository and build with "cmake -B build -DGGML_CUDA=ON" (NVIDIA GPU) or "-DGGML_METAL=ON" (Apple Silicon). After building, launch the API server with "./build/bin/llama-server -m qwen3.5-9b-q4_k_m.gguf -c 8192 --port 8080." vLLM is a Python-based high-throughput inference engine featuring PagedAttention and Continuous Batching for superior concurrent request handling. Install with "pip install vllm" and serve with "vllm serve Qwen/Qwen3.5-9B --max-model-len 8192." For manufacturing companies in Ota Ward where multiple users access the model simultaneously, vLLM's batch processing capabilities are particularly valuable.

Docker Container Deployment

For reproducibility and portability, Docker container deployment is the recommended approach. Ollama provides an official Docker image, launchable with "docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama." For NVIDIA GPU support, add the "--gpus all" flag, which requires the NVIDIA Container Toolkit to be pre-installed. Docker images for llama.cpp with various GPU backend support are also available. Docker Compose enables unified management of model servers, web frontends, and vector databases, streamlining production environment setup. For systems integration companies in Shinagawa delivering AI solutions to clients, the ability to reproduce environments simply by sharing a Docker Compose file is a significant advantage. In remote work environments across Setagaya and Meguro Ward, Docker ensures consistent development environments across distributed teams.
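As a concrete starting point, the docker run command above translates into a Compose file like the following minimal sketch. The service and volume names are illustrative; the commented-out deploy block shows the standard Compose syntax for passing NVIDIA GPUs through to the container (it still requires the NVIDIA Container Toolkit on the host).

```yaml
# docker-compose.yml — minimal sketch; names and layout are illustrative
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    # Uncomment for NVIDIA GPU passthrough (needs NVIDIA Container Toolkit):
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
volumes:
  ollama:
```

From here, a web frontend or vector database can be added as sibling services in the same file, which is exactly the "share one Compose file" workflow described above.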

Performance Tuning and Benchmarking

Here are key tuning parameters to maximize Qwen3.5-9B inference performance. First, set the context length (-c parameter) based on your actual use case. Allocating the full 262K context at all times increases memory consumption unnecessarily, so use 4096-8192 for general conversation, 32768 for document summarization, and adjust as needed. Batch size (-b parameter) can be increased when GPU memory allows, improving throughput for batch processing scenarios. Thread count (-t parameter) should match your CPU's physical core count; exceeding this often degrades performance rather than improving it. For benchmarking, the "llama-bench" command provides quantitative measurements of tokens per second (TPS) and First Token Latency. Tech companies in Shibuya increasingly use these benchmark results to inform hardware investment decisions and capacity planning for their AI infrastructure.
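If you want to measure these metrics from your own application rather than llama-bench, both are easy to compute by hand: first-token latency is the time until the first streamed token arrives, and TPS is tokens generated divided by total wall time. A minimal sketch, assuming generate is any streaming callable that yields tokens (a hypothetical stand-in for your client's streaming API):

```python
import time
from typing import Callable, Iterable, Tuple

def measure(generate: Callable[[str], Iterable[str]], prompt: str) -> Tuple[float, float]:
    """Return (first-token latency in seconds, tokens per second) for one prompt."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first streamed token arrived
        n_tokens += 1
    elapsed = time.perf_counter() - start
    if first_token_at is None:
        raise ValueError("generator produced no tokens")
    return first_token_at - start, n_tokens / elapsed
```

Averaging these numbers over a handful of representative prompts gives a more honest picture than a single run, since the first request after model load is typically slower.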

Common Issues and Troubleshooting

Here are the most frequently encountered issues when deploying Qwen3.5-9B locally and their solutions. The "Out of Memory (OOM)" error is the most common, resolved by switching to a lower quantization level (Q4), reducing context length, or closing other applications. "CUDA out of memory" indicates insufficient GPU VRAM—reduce the GPU layer count (-ngl parameter) to enable hybrid CPU/GPU inference. Metal-related errors on Mac are typically resolved by updating macOS and reinstalling Xcode Command Line Tools. If Ollama cannot find a model, verify installed models with "ollama list" and re-download with "ollama pull qwen3.5:9b." GPU detection failures in WSL2 are most often resolved by updating NVIDIA drivers on the Windows host to the latest version. For companies in Shinagawa and Minato Ward without dedicated IT staff, consulting a specialist can save significant time and frustration.
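For the CUDA out-of-memory case, a quick way to choose an -ngl value is to estimate how many transformer layers fit in VRAM. The sketch below is a rough heuristic, not llama.cpp's own logic: the default layer count and per-layer size are illustrative assumptions (check the layer count llama.cpp prints at load time for your actual GGUF), and 1GB is reserved for the CUDA context and scratch buffers.

```python
def gpu_layers_that_fit(vram_gb: float, n_layers: int = 48,
                        model_gb: float = 5.0, reserve_gb: float = 1.0) -> int:
    """Estimate a safe -ngl value: layers that fit after reserving VRAM for overhead."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# e.g. a 4GB-VRAM laptop GPU with a ~5GB Q4 model:
print(f"--n-gpu-layers {gpu_layers_that_fit(4.0)}")
```

Start from this estimate, then nudge the value down if you still hit OOM or up if VRAM headroom remains; the remaining layers run on the CPU in hybrid mode.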

Let Oflight Inc. Handle Your Local AI Environment Setup

Struggling with configuration complexity, GPU optimization, or rolling out local AI across multiple workstations in your office? Oflight Inc., headquartered in Shinagawa Ward, provides comprehensive local AI environment design, deployment, and operations support for businesses across Minato, Shibuya, Setagaya, Meguro, and Ota Ward throughout the Tokyo metropolitan area. From selecting the optimal quantization level for your hardware configuration to configuring GPU acceleration and automating deployment with Docker, our expert team handles every technical detail with precision and care. Contact us for a free consultation to get started. Our team is ready to help you build the ideal local AI infrastructure tailored to your business requirements and technical environment.
