Qwen 3.5 Small Models: Run Frontier AI on Your Phone, Laptop, or Edge Device
Everything you need to know about the Qwen 3.5 Small series (0.8B, 2B, 4B, 9B) - benchmarks that beat models 13x their size, mobile deployment, and step-by-step local setup.
A 9-billion-parameter model that beats a 120-billion-parameter model on graduate-level reasoning. A 0.8-billion-parameter model that understands video — running on a phone, with no internet connection. These aren't hypotheticals. They're the Qwen 3.5 Small models, released by Alibaba on March 2, 2026.
The series has four sizes: 0.8B, 2B, 4B, and 9B. All of them share the same architecture as the massive 397B flagship, all of them natively understand text, images, and video, and all of them are Apache 2.0 licensed — meaning you can use them for anything, commercially included, for free.
This guide explains what they are, why the benchmarks are turning heads, and exactly how to get them running on your own hardware.
The Lineup
Here are the four models with what you actually need to know to pick one:
| Model | Parameters | VRAM (BF16) | VRAM (4-bit Quantized) | Context Window | Runs On |
|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | ~1.6 GB | ~0.5 GB | 262K | Phones, Raspberry Pi, wearables |
| Qwen3.5-2B | 2B | ~4 GB | ~1.5 GB | 262K | Smartphones, tablets, old laptops |
| Qwen3.5-4B | 4B | ~8 GB | ~3 GB | 262K | Laptops, entry-level GPUs |
| Qwen3.5-9B | 9B | ~18 GB | ~6 GB | 262K (1M extended) | Consumer GPUs, Apple Silicon Macs |
All four are dense models — no Mixture-of-Experts routing, no sparse activation. Every parameter is active during inference, which keeps things simple. You don't need specialized serving infrastructure; just load the weights and go.
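The VRAM figures in the table follow directly from the parameter counts: for a dense model, the weights dominate, so memory is roughly parameters times bytes per weight, plus some headroom for activations and the KV cache. A quick back-of-the-envelope sketch (the overhead factors here are illustrative assumptions, not official figures):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.0) -> float:
    """Rough VRAM estimate for a dense model: weight memory times a
    fudge factor for activations and KV cache. Overhead is a guess."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * overhead

# 9B at 16-bit matches the ~18 GB column above
print(round(estimate_vram_gb(9, 16), 1))        # 18.0
# 9B at 4-bit with ~33% runtime overhead lands near the ~6 GB figure
print(round(estimate_vram_gb(9, 4, 1.33), 1))   # 6.0
```

The same arithmetic explains why quantization matters so much on phones: going from 16-bit to 4-bit weights cuts the footprint by a factor of four before any overhead.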
Why These Small Models Are Unusually Good
Plenty of companies release small language models. Most of them are clearly compromised — they drop reasoning ability, lose instruction-following reliability, and struggle with anything beyond basic text completion. The Qwen 3.5 Small series is different because Alibaba didn't simplify the architecture to shrink it. They kept the same innovations from the 397B flagship and scaled them down.
Same Hybrid Attention as the Flagship
Every small model uses the same 3:1 hybrid attention pattern as the full-size Qwen 3.5. For every four transformer blocks, three use Gated DeltaNet (a form of linear attention that scales cheaply with sequence length) and one uses traditional full attention (for capturing complex patterns).
To understand why this matters, consider the context window. Most sub-10B models from competing families top out at 8K–32K tokens. The Qwen 3.5 Small models? 262,144 tokens — native, not extended. The 9B model can go up to roughly 1 million tokens with RoPE scaling.
That means the 9B model can process an entire medium-sized codebase, a full novel, or hundreds of pages of legal documents in a single pass. That's extraordinary for a model you can run on a single GPU.
Genuinely Multimodal (Not an Afterthought)
Here's something that catches people off guard: the 0.8B model — the one that fits in half a gigabyte quantized — can understand video. Not in a toy "describe what you see" way, but meaningfully enough to score 63.8 on VideoMME.
That's because these models weren't trained on text first and then taught to see. Text, images, and video were all part of the training data from the start. A single set of weights handles all three modalities. You can hand the 2B model a screenshot and ask it a question, or feed the 4B model a sequence of video frames, and it works without any external vision model or adapter.
Multi-Token Prediction
All four models use a technique called multi-token prediction — instead of generating one token at a time, they predict several upcoming tokens simultaneously. The quality doesn't suffer, but inference gets noticeably faster. On edge devices where processing power is limited, this makes a real difference in responsiveness.
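Conceptually, the speedup comes from an accept/verify loop: the extra prediction heads draft several future tokens in one forward pass, and the model keeps the longest prefix that matches what ordinary one-token-at-a-time decoding would have produced. This toy sketch uses lookup tables as stand-ins for the learned heads — it illustrates the loop, not Qwen's actual implementation:

```python
# Toy stand-ins for the model: DRAFT plays the multi-token heads
# (k=3 tokens drafted at once), TRUTH plays the ordinary next-token head.
DRAFT = {"The": ["cat", "sat", "on"], "cat": ["sat", "on", "a"]}
TRUTH = {"The": "cat", "cat": "sat", "sat": "on", "on": "the"}

def decode_step(token: str) -> list:
    """Draft k tokens in one shot, then accept the longest prefix the
    base head would also have produced one token at a time."""
    accepted, current = [], token
    for draft in DRAFT.get(token, []):
        if TRUTH.get(current) != draft:  # mismatch: fall back to normal decoding
            break
        accepted.append(draft)
        current = draft
    return accepted

print(decode_step("The"))  # ['cat', 'sat', 'on'] — three tokens from one step
```

When all drafts are accepted, one pass yields several tokens; when a draft is wrong, the model simply falls back to standard decoding, which is why quality doesn't suffer.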
Expanded Vocabulary
The tokenizer uses 248,000 tokens (up from 152K in Qwen 3), covering 201 languages and dialects. The practical effect: non-English text gets encoded 10–60% more efficiently, which means fewer tokens to process, which means faster and cheaper inference. If you're building a multilingual app, the improved tokenizer is quietly one of the most impactful upgrades.
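The mechanism behind that efficiency gain is easy to see with a toy greedy longest-match tokenizer (a crude stand-in for BPE — the vocabularies below are made up, not Qwen's):

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match tokenization, a simplified stand-in for BPE."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no match: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

small = {"na", "ma", "ste"}
large = small | {"namaste"}  # a bigger vocab can store the whole word

print(len(greedy_tokenize("namaste", small)))  # 3 tokens
print(len(greedy_tokenize("namaste", large)))  # 1 token
```

A larger vocabulary holds longer merged units for more languages, so the same text compresses into fewer tokens — and every downstream cost (latency, memory, API billing) scales with token count.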
The Benchmarks
Let's look at numbers — because the numbers are what make this series remarkable.
The 9B Model: Beating Things It Shouldn't
The Qwen3.5-9B is the headliner. Its scores compete with models 3 to 13 times its size.
On reasoning and language understanding:
| Benchmark | Qwen3.5-9B | GPT-OSS-120B | GPT-5-Nano |
|---|---|---|---|
| GPQA Diamond | 81.7 | 71.5 | — |
| MMLU-Pro | 82.5 | — | — |
| IFEval | 91.5 | — | — |
| LongBench v2 | 55.2 | — | — |
Look at that GPQA Diamond row. That's a graduate-level STEM reasoning benchmark. The 9B model — which runs on a single consumer GPU — scores 81.7. GPT-OSS-120B, a model 13.5 times larger, manages 71.5. A ten-point gap in the wrong direction for the bigger model.
The 9B also outperforms the previous-generation Qwen3-30B on most language benchmarks. A model one-third the size, from one generation later, is just flatly better. That's the payoff from the architectural improvements.
On vision and multimodal tasks:
| Benchmark | Qwen3.5-9B | GPT-5-Nano |
|---|---|---|
| MMMU | 78.4 | 75.8 |
| MMMU-Pro | 70.1 | 57.2 |
| MathVista | 85.7 | — |
| MathVision | 78.9 | 62.2 |
| OCRBench | Exceeds larger models | — |
The MMMU-Pro gap (70.1 vs. 57.2) and MathVision gap (78.9 vs. 62.2) against GPT-5-Nano aren't marginal. These are double-digit differences on multimodal understanding benchmarks. An open-source 9B model you can download right now is significantly outperforming a proprietary model you can only access through an API.
The 0.8B Model: Surprisingly Capable for Its Size
The smallest model doesn't try to compete with the big players — it targets a completely different use case. But even here, the numbers are noteworthy:
| Benchmark | Qwen3.5-0.8B |
|---|---|
| MathVista | 62.2 |
| OCRBench | 74.5 |
| VideoMME | 63.8 |
A VideoMME score of 63.8 from a model that takes up half a gigabyte quantized is genuinely surprising. This means you can run video understanding on a phone, in airplane mode, with zero cloud dependency. OCRBench at 74.5 means it can reliably extract text from photos of documents — handy for scanning receipts, reading signs, or digitizing handwritten notes.
How Each Generation Compares
The improvements aren't just version-to-version hype. At every size class, Qwen 3.5 represents a clear step forward:
| Size Class | Qwen 2.5 | Qwen 3 | Qwen 3.5 | What Changed |
|---|---|---|---|---|
| ~1B | Qwen2.5-1.5B | Qwen3-1.7B | Qwen3.5-0.8B | Same capability at half the parameters |
| ~4B | Qwen2.5-3B | Qwen3-4B | Qwen3.5-4B | Big gains in reasoning and vision |
| ~9B | Qwen2.5-7B | Qwen3-8B | Qwen3.5-9B | Now beats previous-gen 30B models |
The pattern is clear: each generation either delivers more capability at the same size, or matches the old generation at a fraction of the parameters.
Running on Phones and Edge Devices
Here's where these models get exciting beyond benchmark numbers. They run completely offline — on phones, tablets, embedded devices, even a Raspberry Pi for the smallest variant. No internet, no API calls, no data leaving the device.
Why That Matters More Than You'd Think
Privacy becomes absolute. When the model runs on-device, conversations never leave the phone. There's no server logging your prompts, no third-party processing pipeline. For applications handling medical info, financial data, or just personal conversations, this isn't a nice-to-have — it's a requirement.
Costs drop to zero. After downloading the weights once, every inference is free. No per-token fees, no monthly API budget. For a startup processing millions of requests, the savings are enormous.
Latency disappears. No network round-trip means responses start generating instantly. The model's speed becomes the bottleneck — not your internet connection, not a data center queue.
It works anywhere. Airplane mode, remote areas, restricted networks, countries where certain cloud services aren't available. The model doesn't care — it's all local.
Which Phone Can Run Which Model?
For iPhones, Apple's MLX framework provides optimized inference on Apple Silicon:
| Device | Best Model | Storage | What It's Good For |
|---|---|---|---|
| iPhone 16 Pro / Pro Max | Qwen3.5-9B (4-bit) | ~7.5 GB | Full-power local AI — coding help, document analysis, vision |
| iPhone 15 / 15 Pro | Qwen3.5-2B | ~1.5 GB | Chat, summarization, image Q&A |
| iPhone 14 | Qwen3.5-2B | ~1.5 GB | Text tasks, basic vision |
| iPhone 13 / 12 | Qwen3.5-0.8B | ~0.5 GB | OCR, simple chat, basic video understanding |
For Android, devices with a Snapdragon 8 Gen 3 or newer handle the 0.8B and 2B models well through llama.cpp (using GGUF weights) or MLC LLM. Phones with 12 GB+ RAM can push to the 4B model.
Beyond Phones
These models aren't just for mobile. A few other deployment targets worth knowing about:
- Raspberry Pi 5 (8 GB RAM) runs the 0.8B model — think IoT projects, home automation, or privacy-focused smart home assistants.
- NVIDIA Jetson Orin handles up to the 9B model, making it viable for robotics and embedded AI applications.
- Any Mac with Apple Silicon (M1 or newer) runs all four models comfortably through Ollama or MLX. The 9B model in particular is smooth on M1 Pro/Max machines with 16 GB+ unified memory.
How to Set Everything Up
Let's get practical. Here are four ways to run these models, from simplest to most configurable.
The Easy Way: Ollama
Ollama is a single tool that downloads, quantizes, and serves the model for you. If you've never run a local LLM before, start here.
Install it:
```shell
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download from https://ollama.com
```

Pick a model and run it:
```shell
ollama run qwen3.5:9b    # the powerhouse — if your hardware can handle it
ollama run qwen3.5:4b    # solid middle ground for laptops
ollama run qwen3.5:2b    # lightweight, runs on most machines
ollama run qwen3.5:0.8b  # minimal — even a Raspberry Pi works
```

The first run downloads the weights (about 5–6 GB for the 9B). After that, it launches in seconds. You'll land in an interactive chat session right in your terminal.
Want to use it from your code? Ollama exposes a REST API at http://localhost:11434:
```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [{"role": "user", "content": "Summarize this image", "images": ["base64_encoded_image"]}]
}'
```

It's also OpenAI-compatible, so you can point any OpenAI client library at http://localhost:11434/v1 and it works as a drop-in replacement.
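From Python, the same endpoint can be reached with nothing but the standard library. A minimal sketch (it assumes Ollama is already running locally; the payload shape mirrors the curl call above):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "qwen3.5:9b") -> dict:
    """Chat payload in the shape Ollama's /api/chat endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }

def ollama_chat(prompt: str, url: str = "http://localhost:11434/api/chat") -> str:
    """POST a chat request to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Needs Ollama running in another terminal (`ollama run qwen3.5:9b`):
# print(ollama_chat("One sentence on hybrid attention, please."))
```

Because no API key or SDK is involved, the same few lines work unchanged in scripts, notebooks, and server code.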
The Python Way: Hugging Face Transformers
If you want direct control — custom generation parameters, fine-tuning, integration into a research pipeline — use the Transformers library.
Install dependencies:
```shell
pip install transformers torch accelerate
```

Load and run:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-9B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain what makes hybrid attention efficient."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running low on VRAM? 4-bit quantization drops the 9B from ~18 GB to ~6 GB — enough for an RTX 3060:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    quantization_config=quantization_config,
    device_map="auto"
)
```

The Production Way: vLLM
When you need to serve requests to multiple users with proper batching and throughput optimization:
```shell
pip install -U vllm
vllm serve Qwen/Qwen3.5-9B --max-model-len 65536
```

This gives you an OpenAI-compatible API at http://localhost:8000 with continuous batching, which means it handles concurrent users efficiently without queuing requests.
The Apple Silicon Way: MLX
If you're on a Mac (or targeting iPhone deployment), MLX is purpose-built for Apple's chips:
```shell
pip install mlx-lm

mlx_lm.generate --model Qwen/Qwen3.5-9B-MLX-4bit --prompt "Hello, how are you?"

# Or spin up a server
mlx_lm.server --model Qwen/Qwen3.5-9B-MLX-4bit --port 8080
```

MLX takes advantage of Apple Silicon's unified memory — the model doesn't need to copy data between CPU and GPU, which makes loading faster and memory usage more efficient.
Picking the Right Size
Each model serves a different purpose. Here's how to think about which one fits your project.
0.8B: The "It Runs Anywhere" Model
This is your choice when the device is the constraint. Wearables, IoT sensors, phones from 2021, Raspberry Pi projects — the 0.8B fits everywhere because it barely takes up any space (0.5 GB at 4-bit).
It handles OCR well (74.5 on OCRBench), does basic video understanding (63.8 on VideoMME), and can manage simple conversations. Think of it as a capable assistant for focused tasks: extracting text from a photo, classifying a message, translating a sentence, describing an image.
Where it falls short: don't ask it to write complex code, reason through multi-step problems, or produce long-form creative writing. It's a specialist, not a generalist.
2B: The Smartphone Sweet Spot
The 2B model is a meaningful step up from 0.8B in reasoning quality while still fitting comfortably on modern phones (1.5 GB at 4-bit). It's the model you'd want for a mobile chat assistant, a document summarizer, or an on-device image Q&A tool.
It won't handle complex coding tasks or deep analytical reasoning reliably, but for conversational AI, summarization, and basic multimodal work, it's solid.
4B: The Laptop Agent
At 4B, you start getting genuinely useful agentic behavior. It follows instructions reliably enough to power chatbots, coding assistants (for simpler tasks), structured data extraction, and document processing workflows. It runs on 8 GB VRAM GPUs or Apple Silicon Macs with 8 GB of RAM.
The honest trade-off: for a modest jump in resource requirements, the 9B is significantly better across the board. The 4B makes sense when you need something bigger than 2B but can't quite swing the 9B's hardware requirements.
9B: The One Most Developers Should Use
If you have a modern GPU (RTX 3060 or better) or an Apple Silicon Mac with 16 GB+ memory, the 9B model is the clear pick. It's the model where the benchmark numbers get genuinely exciting — 81.7 on GPQA Diamond (beating a 120B model), 91.5 on IFEval (rock-solid instruction following), 70.1 on MMMU-Pro (beating GPT-5-Nano by 13 points).
You can use it for code generation, RAG pipelines, agentic workflows, multimodal applications, and research. Its 1M-token extended context means you can throw entire codebases at it. And because it's dense (no MoE routing), it's simple to serve.
The only caveat: it won't match the absolute top-tier frontier models (GPT-5.2, Claude Opus 4.6) on the hardest math and coding benchmarks. But for the vast majority of real-world tasks, the gap is smaller than you'd expect — and the cost difference is enormous.
How It Stacks Up Against the Competition
The 9B vs. Everything Else in Its Class
| | Qwen3.5-9B | GPT-5-Nano | Qwen3-30B | Llama-4-Scout |
|---|---|---|---|---|
| GPQA Diamond | 81.7 | — | Lower | — |
| MMMU | 78.4 | 75.8 | — | — |
| MMMU-Pro | 70.1 | 57.2 | — | — |
| MathVista | 85.7 | — | — | — |
| IFEval | 91.5 | — | — | — |
| License | Apache 2.0 | Proprietary | Apache 2.0 | Open-weight |
| Run locally? | Yes | No (API only) | Barely | Yes |
The pattern is consistent: the 9B beats or matches models several times its size, and it does so while being fully open-source and runnable on hardware you might already own.
The Cost Argument
| How You Use It | Input Cost (per MTok) | Output Cost (per MTok) |
|---|---|---|
| Qwen3.5-9B locally | $0 | $0 |
| Claude Sonnet 4.5 API | $3.00 | $15.00 |
| GPT-5.2 API | $1.75 | $14.00 |
If you're processing any significant volume, the math is straightforward. API costs add up fast; local inference is free after the hardware investment. For a startup running millions of queries per month, the difference between $0 and $3/$15 per million tokens is the difference between viable and not.
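To make that concrete, here is the arithmetic as a short sketch, using the per-million-token prices from the table above (the 60/40 input/output split is an assumption about a typical workload):

```python
def monthly_cost(mtok_per_month: float, in_price: float, out_price: float,
                 input_share: float = 0.6) -> float:
    """API spend in dollars for a month of traffic, given per-million-token
    prices and the fraction of tokens that are input."""
    in_tokens = mtok_per_month * input_share
    out_tokens = mtok_per_month * (1 - input_share)
    return in_tokens * in_price + out_tokens * out_price

volume = 100  # million tokens per month

print(monthly_cost(volume, 3.00, 15.00))  # Claude Sonnet 4.5: 780.0
print(monthly_cost(volume, 1.75, 14.00))  # GPT-5.2: 665.0
print(monthly_cost(volume, 0.00, 0.00))   # Qwen3.5-9B locally: 0.0
```

At 100 million tokens a month, the API bill is in the hundreds of dollars and scales linearly with volume; local inference stays flat at the cost of the hardware and electricity.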
Wrapping Up
The Qwen 3.5 Small series punches so far above its weight that the benchmarks feel like they must be wrong — until you actually run the models and see it for yourself. The 9B competes with models an order of magnitude larger. The 0.8B does video understanding on a phone. All four share the flagship's hybrid attention architecture, native multimodal support, and 262K context window.
Getting started is one command: ollama run qwen3.5:9b. From there you have a local, private, zero-cost AI that handles text, images, and video, with scores that compete with paid cloud APIs costing dollars per million tokens.
If you need AI on edge devices, in environments where data can't leave the building, or you just want to stop paying per-token fees for capabilities you can run yourself — the Qwen 3.5 Small series is the best open-weight option available right now.