Benchmarking llama.cpp's brand-new MTP support on Strix Halo
PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense and the 35B-A3B MoE. The author posted ~2.5× speedups on a DGX Spark.
I have a Strix Halo Framework Desktop and an RTX 3090, so I built llama.cpp from master a few hours after the merge and ran my speed-bench harness against both. Most wrappers (lemonade, ollama, LM Studio) won't have MTP for a while, so this is from-source territory.
TL;DR
- MTP n=3 gives Qwen3.6 27B Q4_K_M a 1.81× speedup on Strix Halo (11.7 → 21.2 tok/s on chat); on Q8_0 the same setup hits 2.44× (7.4 → 18.1 tok/s), the biggest relative gain in the dataset.
- MTP is still a smaller lever than the 3090's power budget. A 3090 at the full 450 W cap chews Q4_K_M at 38.7 tok/s baseline and tops out at 59.5 tok/s with MTP n=2 (1.54× speedup, n=2 is the sweet spot once the card is uncapped). MTP helps less on the 3090 than on Strix because the card has more raw headroom to burn through. See the power-limits post for what the same card looks like at 200 W.
- It works on the MoE 35B-A3B too, not just dense.
- It costs almost no extra VRAM, because the spec head shares the target model's embeddings, LM head, tokenizer, and main KV cache.
- Output is identical to baseline. Speculative decoding only accepts drafted tokens the main model would have generated anyway, so quality doesn't change. You're trading idle GPU time for faster output, not accuracy.
- Build steps for mainline llama.cpp with MTP are at the bottom of this post.
What MTP does
MTP makes the model draft several tokens at once and verify them in a single forward pass, instead of generating one token per pass. Normal generation is one-at-a-time: full pass, pick the next token, repeat. Speculative decoding skips that loop by guessing ahead. If the guesses are right, you got multiple tokens for the cost of one pass. If they're wrong, you fall back to one token like normal.
Quality doesn't change. The verify step only accepts drafted tokens the main model would have generated anyway. A rejected guess gets resampled from the main model's true distribution at that position. So the output you get with MTP at temperature 0 is bit-identical to baseline, and at higher temperatures it's statistically equivalent. You're trading wall-clock time, not accuracy.
The usual way is to run a tiny separate model alongside the big one to make the guesses. That costs you a second model's worth of VRAM. MTP cuts that out by giving the big model a small extra head (one or a few transformer layers) that does the guessing itself, sharing the main model's input lookup table, output layer, tokenizer, and KV cache (the conversation's running working memory). The VRAM overhead is a fraction of a gigabyte.
You turn it on with --spec-type draft-mtp --spec-draft-n-max N on the new llama.cpp. Bigger N means more aggressive guessing per step. Acceptance drops as N grows, so there's a sweet spot per model. The PR author measured ~75 % acceptance at N=3 on Qwen3.6 27B and got similar speedups on the dense 27B and the 35B-A3B Mixture-of-Experts variant (an architecture where only a slice of the model runs per token).
The rigs
Two pieces of hardware, both running llama.cpp built from master at commit 4f13cb7:
- Strix Halo Framework Desktop: AMD Ryzen AI MAX+ 395 with the integrated Radeon 8060S GPU. 128 GiB of unified memory total, 96 GiB of which the GPU can use as VRAM. Runs on ROCm 7.2.3.
- RTX 3090: 24 GiB of GDDR6X memory, running at the full 450 W cap. One card for the Q4_K_M runs, two layer-split cards for the Q8_0 runs (which don't fit on 24 GiB). CUDA 13.1. The cap matters a lot here. See the companion post on power limits for what the same card looks like at 200 W (roughly half the speed).
Models: unsloth/Qwen3.6-27B-MTP-GGUF (dense) and unsloth/Qwen3.6-35B-A3B-MTP-GGUF (MoE). Bench harness is the same one feeding /benchmarks: 5 measured runs per cell after 2 warmups, four workload shapes, temperature 0, median tok/s.
Headline numbers: Qwen3.6 27B (dense)
Q4_K_M, chat shape (single-stream, 100 tokens out):
| Variant | Quant | Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | TTFT | TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | chat | 1 | 59.5 | 259ms | 0.1 |
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | chat | 1 | 58.7 | 259ms | 0.1 |
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 38.7 | 238ms | 23.4 |
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 1 | 34.2 | 283ms | 0.1 |
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 1 | 32.0 | 271ms | 0.1 |
| 27B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | chat | 1 | 21.2 | 386ms | 3.0 |
| 27B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | chat | 1 | 21.1 | 266ms | 43.9 |
| 27B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | chat | 1 | 19.7 | 375ms | 0.0 |
| 27B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | chat | 1 | 11.7 | 345ms | 82.4 |
Q8_0 (29 GB GGUF; needs dual 3090s since one card's 24 GiB VRAM won't fit it):
| Variant | Quant | Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | TTFT | TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | chat | 1 | 55.9 | 287ms | 0.1 |
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 1 | 50.5 | 265ms | 0.1 |
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | chat | 1 | 49.6 | 258ms | 0.1 |
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 1 | 47.4 | 275ms | 0.0 |
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 25.7 | 236ms | 35.6 |
| 27B-MTPthink | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | chat | 1 | 25.1 | 238ms | 37.0 |
| 27B-MTPthink | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | chat | 1 | 18.1 | 509ms | 0.0 |
| 27B-MTPthink | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | chat | 1 | 15.7 | 501ms | 0.0 |
| 27B-MTPthink | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | chat | 1 | 7.4 | 455ms | 129.5 |
MTP gives Strix Halo bigger relative speedups than the 3090 at most quants. Strix's Q4_K_M gain is 1.81× vs the 3090's 1.54× at n=2 on a single card. Strix's Q8_0 gain is 2.44× vs 2.17× on dual 3090 at n=3 (closer than it looks; pipeline stalls help MTP on the dual setup).
The reason is bandwidth headroom. Strix's iGPU runs on ~215 GB/s of measured LPDDR5X-8000 (AMD spec / Notebookcheck); the 3090 has 936 GB/s. The Strix baselines saturate that ceiling (Q4_K_M pulls ~190 GB/s, Q8 pulls ~215 GB/s, basically wall-to-wall), so MTP's "fewer weight loads per generated token" trick buys it the most. The 3090 baselines hit ~68 % of bandwidth at 450 W; MTP still helps but the proportional headroom is smaller.
The 3090 wins outright at 450 W, with or without MTP. Bare 3090 chat is 38.7 tok/s on Q4_K_M; with MTP n=2 (the sweet spot here) it hits 59.5 tok/s. Strix's best Q4 number is 21.2 tok/s. The cap on a 3090 absolutely changes the framing. At 200 W the 3090 baseline drops to 21.1 tok/s and Strix Halo with MTP-3 actually matches it. See the power-limits post.
The PR author's DGX Spark results show MTP n=3 hitting ~2.5× baseline on Qwen3.6 27B at Q8_0. Strix Halo's 2.44× is close enough to count as confirmation; the DGX Spark is also gfx1151-class with LPDDR5X.
Note that on the uncapped 3090, n=2 beats n=3 at Q4_K_M (59.5 vs 58.7 tok/s on chat). With the card already running fast on baseline, the dropping acceptance rate past n=2 stops paying for itself. On Strix and on the 200 W 3090, where each forward pass is a bigger fraction of the per-token cost, n=3 still wins. At Q8_0 on dual 3090, n=3 still wins (55.85 vs 49.6) because pipeline-parallel makes the per-pass cost relatively larger.
Q4 single vs Q8 dual flips at high power
At my earlier 200 W cap, Q8_0 on two layer-split 3090s beat Q4_K_M on one card (25.1 vs 21.1 baseline). At 450 W on the same hardware, Q4_K_M on one card now beats Q8_0 on two (38.7 vs 25.7 baseline; 59.5 vs 55.9 with MTP). The reason is that --split-mode layer runs the two cards as a pipeline: at any instant only one card is doing meaningful compute, the other is waiting for layers. So the dual-card setup gets almost nothing from the lifted power cap, while the single card gets 1.83×.
MTP closes the gap on the dual setup (25.7 → 55.9 is 2.17×, the biggest MTP win in the dataset) because pipeline stalls hide more of the speculative-decode verify cost. But the absolute fastest configuration in this matrix is still a single 3090 at 450 W with MTP n=2 (59.5 tok/s), not two cards with anything.
The practical read inverted from where I started. If you have one card and Q4_K_M fits, run it. Two cards mainly buys you VRAM headroom for larger quants you couldn't otherwise load. Pipeline-parallel layer-split does not double throughput.
Why I stop at n=3
You can pass --spec-draft-n-max 8, but Qwen3.6's MTP head is one transformer layer deep, so guesses past two or three tokens get rejected fast. The PR author's reference numbers show acceptance dropping from ~83 % at n=2 to ~72 % at n=3 to worse beyond, and tok/s tops out around n=2 or n=3 depending on rig. On the uncapped 3090, n=2 is the sweet spot for Q4_K_M; on Strix Halo and on the 200 W 3090, n=3 still wins. Models with deeper MTP heads (DeepSeek-V3) can push higher.
Numbers: Qwen3.6 35B-A3B (MoE)
Same family, but only ~3B of the model's 35B parameters run per token (the "A3B" tag). That makes baseline already quick.
| Variant | Quant | Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | TTFT | TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 35B-A3B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | chat | 1 | 148.3 | 149ms | 0.1 |
| 35B-A3B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | chat | 1 | 143.2 | 140ms | 0.1 |
| 35B-A3B-MTPthink | Q4_K_M | GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 120.0 | 127ms | 6.7 |
| 35B-A3B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | chat | 1 | 69.4 | 158ms | 0.0 |
| 35B-A3B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | chat | 1 | 65.4 | 157ms | 0.0 |
| 35B-A3B-MTPthink | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | chat | 1 | 49.5 | 139ms | 19.0 |
MTP helps the MoE less than it helps the dense 27B. Strix Halo goes from 49.5 to 69.4 tok/s at n=3 (1.40×). The 3090 at 450 W goes from 120.0 to 148.3 tok/s at n=3 (1.24×). Compare to Strix's gain on the dense 27B (1.81× on Q4_K_M, 2.44× on Q8_0) or the 3090's (1.54× on Q4_K_M, 2.17× on Q8_0 dual).
The reason is just how much weight gets touched per token. Only ~3B of the 35B params run per token in this MoE, so a single forward pass moves far fewer bytes than a dense 27B pass. MTP's whole trick is saving N-1 forward passes per N generated tokens. When each pass is cheap, the savings are smaller.
What I had to leave out: Gemma 4 with MTP
I wanted to include a cross-family comparison with Gemma 4 31B (dense) and Gemma 4 26B-A4B (MoE), but Gemma's MTP support lives in the AtomicBot-ai/atomic-llama-cpp-turboquant fork rather than mainline.
Comparing fork-MTP numbers against mainline-baseline numbers conflates two things: the fork's kernel choices and the MTP gain itself. That made the headline numbers hard to interpret cleanly, especially on the MoE where they looked like a 3× regression (probably not entirely MTP's fault).
The proper comparison is fork-baseline vs fork-MTP on the same build, which I haven't done yet. Saved for a follow-up.
How to build llama.cpp with MTP support
Until the wrapper ecosystem catches up, this is from-source territory. Steps assume Ubuntu 24.04; swap apt for your package manager on Arch / CachyOS / Fedora. The build flags are the same.
Step 1: Install the toolchain for your GPU.
For NVIDIA CUDA (RTX 30-series and newer):
# Add NVIDIA CUDA repo (Ubuntu 24.04 / noble)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
apt install -y cuda-toolkit-12-9 build-essential cmake ninja-build git
export PATH=/usr/local/cuda-12.9/bin:$PATH
export CUDA_HOME=/usr/local/cuda-12.9For AMD ROCm (RDNA3/RDNA4 dGPUs, Strix Halo iGPU):
# Add AMD ROCm repo
mkdir -p /etc/apt/keyrings
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/rocm.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main" \
> /etc/apt/sources.list.d/rocm.list
# Pin the ROCm repo above Ubuntu's older rocm-cmake / hipcc packages
cat > /etc/apt/preferences.d/rocm-pin-600 <<EOF
Package: *
Pin: origin repo.radeon.com
Pin-Priority: 600
EOF
apt update
apt install -y rocm-hip-runtime-dev hipblas-dev rocblas-dev rocminfo \
build-essential cmake ninja-build git
export PATH=/opt/rocm/bin:$PATHSanity check with rocminfo (AMD) or nvidia-smi (NVIDIA). If your GPU isn't listed, the build won't help.
Step 2: Clone and build llama.cpp from master.
git clone https://github.com/ggml-org/llama.cpp.git --depth=1
cd llama.cpp
git log -1 --oneline # confirm you're past commit 4f13cb7 (the merge)For NVIDIA CUDA:
cmake -S . -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -G Ninja
cmake --build build --target llama-server llama-bench -j 4For AMD ROCm:
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_BUILD_TYPE=Release -G Ninja \
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
cmake --build build --target llama-server llama-bench -j 4Substitute AMDGPU_TARGETS for your card: gfx1100 for RX 7900-series, gfx1101 for RX 7800, gfx1151 for Strix Halo, gfx942 for MI300X. rocminfo will tell you what you have.
Step 3: Verify MTP is in the build.
./build/bin/llama-server --help | grep -A 1 spec-typeYou should see none,draft-simple,draft-eagle3,draft-mtp,ngram-... in the list. If draft-mtp is missing, your llama.cpp is from before the merge. Pull master and rebuild.
Step 4: Grab an MTP-enabled GGUF.
pip install --break-system-packages --ignore-installed rich "huggingface_hub[cli]"
# Dense 27B
hf download unsloth/Qwen3.6-27B-MTP-GGUF Qwen3.6-27B-Q4_K_M.gguf \
--local-dir ~/models/qwen36-27b-mtp
# MoE 35B-A3B
hf download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
--local-dir ~/models/qwen36-35b-mtpStep 5: Run.
Baseline:
./build/bin/llama-server -m ~/models/qwen36-27b-mtp/Qwen3.6-27B-Q4_K_M.gguf \
--port 8001 --host 127.0.0.1 -ngl 99 -c 8192 -np 1 --jinjaWith MTP n=3 (draft 3 tokens per step):
./build/bin/llama-server -m ~/models/qwen36-27b-mtp/Qwen3.6-27B-Q4_K_M.gguf \
--port 8001 --host 127.0.0.1 -ngl 99 -c 8192 -np 1 --jinja \
--spec-type draft-mtp --spec-draft-n-max 3The OpenAI-compatible API is up on :8001/v1/chat/completions. Point ollama-style clients or curl at it and you're done.
Caveats
My runs were on the master tip at commit 4f13cb7 (the merge itself). The instructions above clone whatever is current on master and just check that the merge is in your history, so reproducing this a week from now puts you on later code.
The MTP code path is days old and will evolve fast. Expect numbers to shift. If you want exact parity, git checkout 4f13cb7 after the clone.
I only ran two quant levels (Q4_K_M and Q8_0) per Qwen model. A full quant sweep is worth doing later.
The raw YAMLs are at /benchmarks if you want to audit numbers or pull them into your own analysis.