AI Development
intermediate
#AI #Qwen #LLM

Qwen 3.5: Architecture, Benchmarks, Comparisons, and How to Run It Locally

A comprehensive guide to Alibaba's Qwen 3.5 model family - the hybrid attention architecture, benchmark results, how it compares to GPT-5.2 and Claude, pricing, and step-by-step local setup.

March 11, 2026
20 min read

Here's something that would've sounded absurd a year ago: an open-source model with 397 billion parameters that you can download, modify, and deploy commercially — and it goes toe-to-toe with GPT-5.2 on most benchmarks. That's Qwen 3.5, released by Alibaba's Qwen team on February 16, 2026.

The trick? Only 17 billion of those 397 billion parameters are active at any given time. The rest sit quietly until they're needed, thanks to a clever architecture that we'll break down in this guide.

Whether you're evaluating it for production, curious about the architecture, or just want to get it running on your machine, this article covers all of it.

What Exactly Is Qwen 3.5?

Qwen 3.5 isn't one model — it's a family of eight, ranging from something small enough to run on a phone (0.8B parameters) to a massive flagship (397B parameters). Here's the full lineup:

| Model | Total Parameters | Active Parameters | Type |
|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | MoE (Flagship) |
| Qwen3.5-122B-A10B | 122B | 10B | MoE |
| Qwen3.5-35B-A3B | 35B | 3B | MoE |
| Qwen3.5-27B | 27B | 27B | Dense |
| Qwen3.5-9B | 9B | 9B | Dense |
| Qwen3.5-4B | 4B | 4B | Dense |
| Qwen3.5-2B | 2B | 2B | Dense |
| Qwen3.5-0.8B | 0.8B | 0.8B | Dense |

Notice the "Active Parameters" column. The three MoE (Mixture-of-Experts) models have a huge gap between their total size and what actually fires per token. The 397B flagship, for example, only lights up 17B parameters for each token it processes. Think of it like a company with 397 employees — but for any given task, only 17 of them need to show up. The result is a model that knows as much as a 397B model but runs closer to the speed of a 17B one.

Every model in this family ships under the Apache 2.0 license. That means no usage restrictions — you can fine-tune them, build commercial products on top of them, and distribute modified versions freely.

How the Architecture Works

Two ideas power Qwen 3.5 under the hood: a new way of handling attention, and the sparse expert system we just touched on. Both are worth understanding because they explain why this model punches so far above its weight class.

The Attention Problem (and Qwen's Fix)

If you've worked with large language models, you've probably hit the context length wall. Standard transformer attention is quadratic — if you double the input length, the compute cost quadruples. That's why so many models cap out at 8K or 32K tokens. Processing a 200-page document in one shot? Expensive.
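The scaling difference is easy to see with a little arithmetic. This toy calculation (not from the Qwen paper) compares how the cost of one attention layer grows when you double the input:

```python
def attention_ops(seq_len: int, quadratic: bool = True) -> int:
    """Rough count of attention-score computations for one layer.

    Standard softmax attention compares every token against every other
    token (n * n); linear attention does work proportional to n.
    """
    return seq_len * seq_len if quadratic else seq_len

# Doubling the input quadruples the cost for standard attention...
print(attention_ops(8192) / attention_ops(4096))         # 4.0
# ...but only doubles it for linear attention.
print(attention_ops(8192, False) / attention_ops(4096, False))  # 2.0
```

At 200K+ tokens, that gap between quadratic and linear growth is the difference between feasible and prohibitive.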

Qwen 3.5 takes a different approach. Instead of using the same attention mechanism in every layer, it mixes two kinds in a 3:1 pattern:

Three layers of Gated DeltaNet (the efficient one): These use linear attention, which means the cost grows proportionally with input length, not quadratically. The underlying tech, called Gated DeltaNet, merges ideas from Mamba2 (a state-space model with gated decay) and the delta rule for updating hidden states. If that sounds dense — the key takeaway is that these layers handle the bulk of the work cheaply.

One layer of full attention (the expressive one): Every fourth layer uses traditional softmax attention. Pure linear attention can miss subtle long-range dependencies, so these layers act as periodic "checkpoints" where the model can attend to everything at once.

This hybrid design gives Qwen 3.5 a native 262K-token context window — enough to process entire codebases or book-length documents. With RoPE scaling, it stretches to roughly 1 million tokens. Alibaba's hosted Qwen3.5-Plus variant ships with 1M context out of the box.
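The 3:1 interleaving amounts to a simple repeating layer schedule. This sketch is purely illustrative; the real model's layer count and module names may differ from what's shown here:

```python
def layer_schedule(num_layers: int) -> list[str]:
    """Every fourth layer is full softmax attention; the rest use
    Gated DeltaNet (linear attention)."""
    return ["full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
            for i in range(num_layers)]

# First eight layers: three DeltaNet layers, one full-attention layer, repeated.
print(layer_schedule(8))
```

Three out of every four layers do cheap linear-cost work, while the periodic full-attention layers preserve the ability to attend globally.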

Sparse Mixture-of-Experts: How 397B Acts Like 17B

The flagship model's transformer blocks each contain 512 expert sub-networks. When a token arrives, a learned router picks the 10 most relevant experts, plus 1 shared expert that is always active. That's 11 experts firing per token, which works out to about 4.3% of the total parameters (17B of 397B).

Why does this matter? It means the model stores far more knowledge than it computes on. The 397B-parameter model "knows" things across an enormous range of domains, but each individual token only needs a small, specialized slice of that knowledge. You get the breadth of a huge model without the GPU bill.

The smaller MoE variants work the same way: the 122B model activates 10B per token, and the 35B model activates just 3B.
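The routing step can be sketched in a few lines. This is a simplified illustration using random scores; a real router is a small learned network trained alongside the model, typically with load-balancing losses:

```python
import heapq
import random

def route_token(router_scores: list[float], top_k: int = 10) -> list[int]:
    """Return indices of the top-k experts by router score (greedy top-k routing)."""
    return heapq.nlargest(top_k, range(len(router_scores)),
                          key=router_scores.__getitem__)

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(512)]  # one token's router scores
routed = route_token(scores)                       # 10 routed experts
active = routed + ["shared"]                       # plus the always-on shared expert
print(len(active), "experts fire for this token")  # 11 experts fire for this token
```

Every token runs through a different 11-expert subset, so the full 512 experts are exercised across a batch even though each token only touches a sliver of them.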

Built-In Vision and Video (Not Bolted On)

A lot of multimodal models are really text models with a vision adapter duct-taped to the side. Qwen 3.5 takes a fundamentally different approach: text, images, and video were all part of the training mix from the very beginning. There's a single unified set of weights handling all three modalities.

Why does this matter in practice? Because the model learned cross-modal relationships during pretraining rather than trying to stitch separately-trained components together after the fact. When you ask it to analyze a chart, it isn't translating image features into text-model-compatible embeddings — it genuinely understands both domains natively. The benchmark results (which we'll get to shortly) back this up.

A Much Bigger Vocabulary

Qwen 3.5 expanded its vocabulary from 152K tokens to 248K tokens, and it now covers 201 languages and dialects (up from 119). The practical effect is that non-English text gets encoded 10–60% more efficiently. If you're building multilingual applications, that's a direct reduction in both cost and latency — fewer tokens means faster, cheaper inference.

What Changed from Qwen 3

If you've been following the Qwen series, here's a quick summary of the generational jump:

| Feature | Qwen 3 | Qwen 3.5 |
|---|---|---|
| Attention | Full attention (quadratic cost) | Hybrid Gated DeltaNet + full (3:1 ratio) |
| Context window | 32K–128K | 262K native, ~1M extended |
| Vocabulary | 152K tokens | 248K tokens |
| Languages | 119 | 201 |
| Multimodal | Separate vision models | Native text-vision fusion from scratch |
| Training pipeline | FP16/BF16 | Native FP8 (50% less memory, 10%+ faster) |
| Agent capabilities | Basic tool use | Native agent design (reasoning, coding, tool use, GUI) |

The hybrid attention architecture is the headline change. It's what unlocks the massive context window and makes the MoE approach actually practical — without it, the memory requirements for long sequences would be prohibitive even with sparse activation.

The FP8 training pipeline is less flashy but significant for anyone who might want to fine-tune: it halves activation memory and speeds up training by 10%+.

Benchmarks: Where Qwen 3.5 Lands

Let's look at numbers. The benchmarks here come from Alibaba's published evaluations, cross-referenced with third-party reports where available.

Reasoning and STEM

| Benchmark | Qwen3.5-397B | Qwen3.5-9B | GPT-5.2 | Claude Opus 4.5 |
|---|---|---|---|---|
| GPQA Diamond | Top tier | 81.7 | ~85 | 83.4 |
| MMLU-Pro | Leading | 82.5 | ~90 | ~88 |
| IFEval | Leading | 91.5 | | |

The standout here is the 9B model hitting 81.7 on GPQA Diamond. For context, GPT-OSS-120B — a model more than 13 times larger — scores 71.5 on the same benchmark. A model you can run on a single consumer GPU is outperforming a model that needs a server cluster on graduate-level STEM reasoning.

Vision and Multimodal

| Benchmark | Qwen 3.5 (397B unless noted) | GPT-5.2 |
|---|---|---|
| MathVision | 88.6 | 83.0 |
| MMMU-Pro (9B) | 70.1 | 57.2 (Nano) |
| MathVista (9B) | 85.7 | |
| OCRBench (9B) | Exceeds larger models | |

This is where the native multimodal training pays off. The flagship beating GPT-5.2 on MathVision by 5.6 points is impressive, but the 9B small model outscoring GPT-5-Nano on MMMU-Pro by nearly 13 points is arguably more meaningful for most developers — because you can actually run the 9B locally.

Agentic and Web Tasks

| Benchmark | Qwen3.5-397B |
|---|---|
| BrowseComp | 78.6 |
| OSWorld-Verified | 62.2 |

These benchmarks test whether a model can actually do things — browse the web, interact with GUIs, complete multi-step tasks in real software. Qwen 3.5 was designed with this in mind. It's not just a chat model; it's built to be an agent.

Putting It in Perspective

According to Alibaba's evaluations, Qwen 3.5-397B matches or beats GPT-5.2 and Claude Opus 4.5 on roughly 80% of tested dimensions. That claim needs the usual caveats — model developers tend to emphasize benchmarks where they do well — but even discounting for optimism, the performance-to-compute ratio is remarkable. This is a model that achieves frontier-tier results with 17B active parameters.

How It Compares to the Competition

Here's the landscape as of March 2026:

| Feature | Qwen 3.5 (397B) | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro | Llama 4 Maverick |
|---|---|---|---|---|---|
| Active params | 17B | Undisclosed | Undisclosed | Undisclosed | ~17B (of 400B) |
| Context window | 262K (1M extended) | 1M | 200K | 1M | 1M |
| License | Apache 2.0 | Proprietary | Proprietary | Proprietary | Open-weight |
| Multimodal | Text, image, video | Text, image, audio | Text, image | Text, image, video, audio | Text, image |
| Input cost (per MTok) | $0.60 | $1.75 | $15.00 | $2.00 | Free (self-hosted) |
| Output cost (per MTok) | $3.60 | $14.00 | $75.00 | $12.00 | Free (self-hosted) |

Where Qwen 3.5 Has the Edge

Price. At $0.60/$3.60 per million tokens via third-party providers, it's the cheapest frontier-class API available. And if you self-host, it's free — you're only paying for electricity and hardware.

Openness. Apache 2.0 is about as permissive as it gets. You can fine-tune it for your specific domain, deploy it on your own infrastructure, and build products on it without worrying about licensing terms changing on you. That's a meaningful advantage over GPT-5.2 and Claude, where you're always one policy update away from disruption.

Vision. The native multimodal approach produces genuinely strong visual understanding — 88.6 on MathVision beats GPT-5.2's 83.0, and the OCR and video capabilities are solid across the board.

Efficiency. If you're comparing total knowledge capacity to inference cost, nothing else in this tier comes close. 397B parameters of stored knowledge at 17B compute per token.

Where Competitors Still Win

Hard math. GPT-5.2 scores 100% on AIME 2025. For pure mathematical competition-style problems, it's still the leader.

Production coding. Claude models dominate SWE-bench Verified — 80.9% for Claude Opus 4.5, 83% for GPT-5.3 Codex. If your primary use case is large-scale code generation and modification, the proprietary models still have an edge.

Developer ecosystem. OpenAI and Anthropic have years of tooling investment — SDKs, enterprise support, managed fine-tuning, evaluation frameworks. The Qwen ecosystem is growing fast but isn't as mature.

Native 1M context. Both Gemini 3.1 Pro and GPT-5.2 offer 1M-token context natively, without needing RoPE extension tricks. Qwen 3.5's native window is 262K — still generous, but the extension to 1M may come with some quality degradation at the edges.

Pricing and Access

You have three ways to use Qwen 3.5:

Self-host it for free. Download the weights from Hugging Face, run it on your own hardware. Zero cost beyond your compute. This is the open-source advantage in its purest form.

Use a third-party API. Several providers offer Qwen 3.5 through OpenAI-compatible endpoints:

| Provider | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Novita AI | $0.60 | $3.60 |
| Qubrid AI | Varies | Varies |
| Atlas Cloud | Varies | Varies |

Most offer free trial credits without requiring a credit card, and they all support streaming, function calling, and structured JSON output. Since the API format is OpenAI-compatible, you can usually swap in Qwen 3.5 by just changing the base URL and model name in your existing code.

Use Alibaba's hosted Qwen3.5-Plus. This is Alibaba's own hosted variant, available through the DashScope API, with a full 1M-token context window. Pricing is competitive with the frontier proprietary models at a fraction of the cost.
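Since the endpoints follow the OpenAI chat-completions format, the request body is identical across providers. Here's a sketch of the payload shape; the endpoint URL is a placeholder and the exact model identifier may vary by provider:

```python
import json

# Placeholder endpoint -- substitute your provider's actual base URL.
BASE_URL = "https://api.example-provider.com/v1/chat/completions"

payload = {
    "model": "qwen3.5-397b-a17b",   # provider-specific model ID
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the hybrid attention design."},
    ],
    "stream": False,                # set True for token-by-token streaming
    "max_tokens": 512,
}

# Serialize exactly as you would send it with requests, httpx, or urllib:
body = json.dumps(payload)
print(json.loads(body)["model"])    # qwen3.5-397b-a17b
```

With the official `openai` Python client, the swap is just as small: construct the client with your provider's `base_url` and `api_key`, keep the rest of your code unchanged.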

Running Qwen 3.5 Locally

This is where open-weight models really shine. There are three main approaches depending on what you need.

Ollama: Get Running in Two Minutes

If you just want to try the model or use it for personal projects, Ollama is the fastest path. It handles downloading, quantization, and serving in a single tool.

First, install it:

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: grab the installer from https://ollama.com
```

Then run the model:

```bash
ollama run qwen3.5:9b
```

That's it. The first time you run this, Ollama downloads about 5–6 GB of model weights. After that, it starts in seconds. You'll get an interactive chat prompt right in your terminal.

If the 9B is too heavy for your hardware, drop down a size:

```bash
ollama run qwen3.5:4b   # lighter, still capable
ollama run qwen3.5:2b   # runs on most laptops
ollama run qwen3.5:0.8b # runs on almost anything
```

Once the model is running, Ollama also exposes a REST API at http://localhost:11434 that you can hit from any application:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}]
}'
```

vLLM: When You Need Production Performance

If you're serving the model to multiple users or need high throughput, vLLM is the better choice. It's optimized for GPU utilization and supports continuous batching.

Install it:

```bash
pip install -U vllm
```

For the flagship model, Docker is the cleanest approach:

```bash
docker run --gpus all -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai --model Qwen/Qwen3.5-397B-A17B
```

For the smaller models, you can serve directly:

```bash
vllm serve Qwen/Qwen3.5-9B --max-model-len 65536
```

Either way, you get an OpenAI-compatible API at http://localhost:8000 — drop it into any app that currently talks to OpenAI's API and it should just work.

Hugging Face Transformers: Full Control in Python

If you want to fine-tune the model, run it in a notebook, or integrate it into a custom pipeline:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-9B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "What is the hybrid attention mechanism in Qwen 3.5?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the MoE variants, there's a dedicated class:

```python
from transformers import Qwen3_5MoeForConditionalGeneration

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",
    torch_dtype="auto",
    device_map="auto"
)
```

What Hardware Do You Need?

This is probably the most practical question. Here's a realistic breakdown:

| Model | VRAM (BF16) | VRAM (4-bit) | RAM | What You'd Run It On |
|---|---|---|---|---|
| 0.8B | ~1.6 GB | ~0.5 GB | 8 GB | Literally anything — even a Raspberry Pi |
| 2B | ~4 GB | ~1.5 GB | 8 GB | Older laptops, GTX 1660 |
| 4B | ~8 GB | ~3 GB | 16 GB | RTX 3060, M1 MacBook |
| 9B | ~18 GB | ~6 GB | 16 GB | RTX 3090 / 4080, M1 Pro/Max MacBook |
| 27B | ~54 GB | ~16 GB | 32 GB | A100 40GB, 2x RTX 4090 |
| 35B-A3B | ~70 GB | ~20 GB | 32 GB | A100 80GB (only 3B active though) |
| 122B-A10B | ~244 GB | ~70 GB | 64 GB | Multi-GPU setup (A100 cluster) |
| 397B-A17B | ~794 GB | ~220 GB | 128 GB+ | Multi-node cluster |

For most people reading this, the 9B model through Ollama is the sweet spot. It fits on a single consumer GPU or an Apple Silicon Mac with 16 GB+ unified memory, and its benchmarks compete with models many times its size. Start there.
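The BF16 column above follows from simple arithmetic: roughly 2 bytes per parameter at 16-bit precision and 0.5 bytes at 4-bit. This rough sketch covers weights only; activations, the KV cache, and framework overhead add more on top (which is why the 4-bit figures in the table run a bit higher than the raw math):

```python
def est_weight_vram_gb(params_billion: float, bits: int = 16) -> float:
    """Approximate weight memory in GB: parameter count x bytes per parameter."""
    return params_billion * (bits / 8)

print(est_weight_vram_gb(9))       # 18.0  -> matches the 9B BF16 row
print(est_weight_vram_gb(9, 4))    # 4.5   -> table says ~6 GB with overhead
print(est_weight_vram_gb(397))     # 794.0 -> matches the flagship BF16 row
```

Note that for the MoE models you still need the *total* parameters in memory, not just the active ones; sparse activation saves compute, not weight storage.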

What You Can Build With It

Coding Agents

Qwen 3.5 was designed from the ground up as an agent model — it supports function calling, structured JSON output, and multi-step tool orchestration. The 9B model scores 91.5 on IFEval, which measures instruction-following accuracy. That's high enough to trust it in agentic loops where reliability matters.
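In the OpenAI-compatible format, function calling means declaring your tools as JSON schemas in the request. This sketch shows the shape of such a request; the tool name and its fields are made up for illustration:

```python
import json

# Hypothetical tool definition -- the schema shape follows the
# OpenAI-compatible function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
}

print(request["tools"][0]["function"]["name"])  # get_weather
```

When the model decides a tool is needed, its response contains a structured tool call instead of plain text; your agent loop executes the function and feeds the result back as a new message.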

Long Document Analysis

With a 262K native context (up to 1M extended), you can feed the model entire codebases, long legal documents, or research paper collections in a single pass. No chunking strategies, no retrieval workarounds — just paste the full text and ask your question.

Visual Understanding

Because vision was part of training from the start, you can pass images and video directly to the model. Chart analysis, document OCR, screenshot understanding, video summarization — these work natively without bolting on extra preprocessing.

Multilingual Products

201 languages with a tokenizer that's 10–60% more efficient on non-English text. If you're building something that needs to work across languages — customer support bots, translation tools, content generation for global markets — the improved tokenizer alone cuts your inference costs significantly.

Browser and GUI Automation

Scores of 78.6 on BrowseComp and 62.2 on OSWorld-Verified mean the model can navigate websites, fill out forms, and interact with desktop software. If you're building automation agents that need to operate in real GUI environments, Qwen 3.5 is a serious contender.

The Bottom Line

Qwen 3.5 changes the math on open-weight AI models. A year ago, open-source models were "good enough for some things." Today, Qwen 3.5's flagship competes with GPT-5.2 on 80% of benchmarks, its 9B model beats a 120B model on graduate-level reasoning, and the whole thing is Apache 2.0 licensed.

The hybrid Gated DeltaNet architecture isn't just an incremental improvement — it fundamentally changes what context lengths are practical for open models. The sparse MoE design means you get 397B parameters of knowledge at 17B parameters of compute. And the native multimodal training produces visual understanding that actually holds up against the proprietary competition.

You can get started with a single terminal command (ollama run qwen3.5:9b), scale up to production with vLLM, or dig into the weights with Transformers for research and fine-tuning. No API keys, no usage limits, no licensing surprises.

If you're building AI-powered applications — especially agents, multimodal tools, or anything that needs to work across languages — Qwen 3.5 deserves a spot on your shortlist.