MTP Speeds Up Token Generation, Not Huge Prompts
My first MTP post showed the fun part: MTP can make supported models write much faster.
But that left a practical question open. What happens when the chat already has a huge amount of context?
I ran that test. The short version: MTP helps token generation. It does not help prefill.
The terms are useful here. Prefill is the model reading your prompt and building the KV cache. Decode is the model writing new output tokens. TTFT, or time to first token, is mostly the prefill bill plus the first bit of decode.
For small prompts, faster token generation is great. For huge fresh prompts, prefill dominates the request. MTP does not remove that step. In the llama.cpp builds I tested, it made that step slower.
The cache case is different. If the backend can reuse the KV cache from the shared context, the request turns back into a decode-heavy workload, and MTP helps again.
The Terms
Every request has two main parts:
total time = prefill + decode
Prefill is where the backend ingests the prompt and creates the KV cache. Decode is where it generates the answer one token at a time.
MTP is a decode optimization. The model drafts a few future tokens, verifies them, and keeps the draft tokens that match. That can make output much faster.
Prompt tokens are different. During prefill, the backend already knows the next token because it is in the prompt. There is nothing useful to speculate. The model still has to process the context and update its cache state.
If the prompt is short and the answer is long, MTP can help a lot. If the prompt is 100k+ tokens and the answer is short or medium length, the request is mostly prefill. Faster decode cannot save enough time.
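The two-part split above can be turned into a tiny model. This is a minimal sketch, not a measurement: the throughput constants are hypothetical values chosen to show the shape, and it assumes MTP leaves prefill speed untouched, which is the optimistic case.

```python
def request_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Seconds for one request under a simple two-phase model."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s + decode_s


# Hypothetical throughputs, for illustration only (not from the benchmark).
PREFILL_TPS = 1000.0
BASE_DECODE_TPS = 30.0
MTP_DECODE_TPS = 45.0

for prompt_tokens in (1_000, 120_000):
    base = request_time(prompt_tokens, 500, PREFILL_TPS, BASE_DECODE_TPS)
    mtp = request_time(prompt_tokens, 500, PREFILL_TPS, MTP_DECODE_TPS)
    saved = base - mtp
    print(f"{prompt_tokens:>7} prompt tokens: "
          f"baseline {base:.1f}s, MTP {mtp:.1f}s, saved {saved / base:.0%}")
```

With these made-up numbers, MTP saves the same absolute decode time in both cases, but that is about 31% of the short request and about 4% of the long one.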
That is exactly what showed up in the data.
The Benchmark
I tested Qwen3.6-27B-MTP-GGUF-Q4_K_M with direct llama.cpp on two machines:
- RTX 3090 rig
- Strix Halo
The prompts targeted 1k, 4k, 16k, 32k, 64k, 100k, and 128k tokens. The biggest prompt landed at about 121k real prompt tokens for this model.
I used two shapes:
- probe: long prompt, 8 output tokens. This mostly measures TTFT and prefill.
- answer: same prompt, 500 output tokens. This measures whether faster decode pays back the prompt cost.
Prompt-cache reuse was disabled with --cache-ram 0, so this first set is a cold long-context test. I ran baseline, MTP n=2, and MTP n=3 on both machines. The main RTX 3090 and Strix Halo runs used llama.cpp commit 4f13cb7, from shortly after MTP support landed. I also reran the long 3090 rows on 3e12fbd, the follow-up prompt-processing fix.
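If you want to reproduce the TTFT and decode split with your own client, one way is to time a streamed response. The sketch below is not the harness used for these runs. `measure_stream` is a hypothetical helper, `chunks` stands in for whatever streaming iterator your client returns, and it assumes each chunk is roughly one token, which is not guaranteed.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a streamed response: TTFT, total time, and decode rate."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk marks TTFT
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    decode_s = last - first
    # Decode rate covers the chunks after the first one.
    decode_rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {
        "ttft_s": first - start,
        "total_s": last - start,
        "chunks": count,
        "decode_chunks_per_s": decode_rate,
    }
```

Passing the clock in as a parameter keeps the helper easy to check with fake timestamps before pointing it at a real server.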
Small Prompts Got Faster
At about 1k prompt tokens, MTP did what I expected.
| Hardware | Shape | Baseline total | MTP n=3 total | Result |
|---|---|---|---|---|
| RTX 3090 | ctx1k_answer | 14.2s | 10.0s | MTP faster |
| Strix Halo | ctx1k_answer | 44.4s | 23.5s | MTP faster |
The model did not have much prompt to prefill. Most of the useful time was decode, and MTP made decode faster.
On the 3090, generation went from 35.2 tok/s to 50.2 tok/s. On Strix Halo, it went from 11.3 tok/s to 21.3 tok/s.
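A quick check on those numbers: if the request really is decode-bound, the speedup in total time should be close to the speedup in tok/s. The sketch below only reuses the four figures per machine reported above; the dictionary layout and function name are my own.

```python
# Figures from the 1k "answer" runs above: totals in seconds, rates in tok/s.
runs_1k = {
    "RTX 3090": {"base_total": 14.2, "mtp_total": 10.0, "base_tps": 35.2, "mtp_tps": 50.2},
    "Strix Halo": {"base_total": 44.4, "mtp_total": 23.5, "base_tps": 11.3, "mtp_tps": 21.3},
}


def speedups(run):
    """Return (total-time speedup, decode-rate speedup) for one run."""
    total_x = run["base_total"] / run["mtp_total"]
    decode_x = run["mtp_tps"] / run["base_tps"]
    return total_x, decode_x


for name, run in runs_1k.items():
    total_x, decode_x = speedups(run)
    print(f"{name}: total {total_x:.2f}x, decode {decode_x:.2f}x")
```

The two ratios land within about 0.01x of each other on both machines, which is what a decode-dominated request looks like.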
Huge Fresh Prompts Got Slower
At about 121k prompt tokens, the result flipped.
| Hardware | Shape | Baseline total | MTP n=3 total | Result |
|---|---|---|---|---|
| RTX 3090 | ctx128k_answer | 160.7s | 263.2s | MTP slower |
| Strix Halo | ctx128k_answer | 797.5s | 837.9s | MTP slower |
The reason is TTFT.
On the 3090, baseline reached the first token at 141.4 seconds. MTP n=3 reached it at 252.8 seconds.
On Strix Halo, baseline reached the first token at 740.3 seconds. MTP n=3 reached it at 806.4 seconds.
That is the result from the first build. MTP decoded faster after the model started answering, but the request had already lost too much time in prefill.
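The trade can be split into its two pieces with the totals and TTFT values above. This is a small worked calculation rather than new data, and it treats everything after the first token as decode time.

```python
# (total seconds, TTFT seconds) for the cold 128k "answer" runs, first build.
cold_128k = {
    "RTX 3090": {"base": (160.7, 141.4), "mtp": (263.2, 252.8)},
    "Strix Halo": {"base": (797.5, 740.3), "mtp": (837.9, 806.4)},
}


def split_delta(base, mtp):
    """Return (extra TTFT, decode time saved, net slowdown) in seconds."""
    base_total, base_ttft = base
    mtp_total, mtp_ttft = mtp
    ttft_cost = mtp_ttft - base_ttft
    decode_gain = (base_total - base_ttft) - (mtp_total - mtp_ttft)
    net = mtp_total - base_total
    return ttft_cost, decode_gain, net


for name, run in cold_128k.items():
    ttft_cost, decode_gain, net = split_delta(run["base"], run["mtp"])
    print(f"{name}: +{ttft_cost:.1f}s TTFT, -{decode_gain:.1f}s decode, net +{net:.1f}s")
```

On the 3090, MTP saved about 8.9 seconds of decode but paid about 111.4 seconds of extra TTFT. On Strix Halo it saved about 25.7 seconds and paid about 66.1.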
MTP n=2 did not change that conclusion. On Strix Halo at 128k, MTP n=2 took 844.9 seconds and MTP n=3 took 837.9 seconds, while baseline took 797.5 seconds.
The Prompt Fix Helped, But Not Enough
The fresh-prompt slowdown got smaller after the llama.cpp prompt fix, but it did not fully go away.
The original llama.cpp MTP PR said prompt processing could take a negative hit because of device-to-host embedding transfers: llama.cpp PR #22673. The next day, llama.cpp merged a follow-up that avoids copying logits during MTP prompt decode to reduce memory traffic and improve prompt-processing speed: llama.cpp PR #23198.
I reran the 32k, 64k, and 128k cold cases after that fix. The RTX 3090 run used the follow-up commit directly. The Strix Halo run used a newer llama.cpp build that includes the same fix.
The 128k RTX 3090 MTP answer improved by about 10 seconds, from 263.2 seconds to 253.3 seconds. Baseline stayed the same at about 160.8 seconds.
On Strix Halo, the newer llama.cpp build also narrowed the 128k gap. Baseline was 801.8 seconds and MTP n=3 was 823.5 seconds. That is still slower, but much closer than the first build.
| Shape, after prompt fix | Hardware | Baseline total | MTP n=3 total | Result |
|---|---|---|---|---|
| ctx32k_answer | RTX 3090 | 38.2s | 48.6s | MTP slower |
| ctx64k_answer | RTX 3090 | 70.7s | 99.9s | MTP slower |
| ctx128k_answer | RTX 3090 | 160.8s | 253.3s | MTP slower |
| ctx32k_answer | Strix Halo | 156.4s | 144.1s | MTP faster |
| ctx64k_answer | Strix Halo | 320.9s | 317.1s | about even |
| ctx128k_answer | Strix Halo | 801.8s | 823.5s | MTP slower |
So the build detail matters, and hardware matters too. The safer claim is: MTP does not make fresh prefill cheaper, and some implementations can make cold long-context prefill slower.
KV Cache Reuse Changes The Result
Agent chats are not always cold prompts. A coding agent may keep the same repo context and ask several follow-up questions. In llama.cpp, prompt-cache checkpoints can reuse that shared prefix.
I ran a second test with the same 32k, 64k, and 128k context prefix, but different follow-up instructions. This time I enabled llama.cpp's prompt cache with --cache-ram 8192.
| Hardware | Shape | Baseline total | MTP n=3 total | Baseline TTFT | MTP n=3 TTFT |
|---|---|---|---|---|---|
| RTX 3090 | ctx32k_agent_cache | 10.8s | 7.9s | 0.96s | 1.18s |
| RTX 3090 | ctx64k_agent_cache | 12.2s | 9.0s | 1.24s | 1.64s |
| RTX 3090 | ctx128k_agent_cache | 15.0s | 10.3s | 1.91s | 2.58s |
| Strix Halo | ctx32k_agent_cache | 34.3s | 19.5s | 2.48s | 2.74s |
| Strix Halo | ctx64k_agent_cache | 38.0s | 23.3s | 3.37s | 3.70s |
| Strix Halo | ctx128k_agent_cache | 45.6s | 27.4s | 5.13s | 5.62s |
That is the agent-shaped result I expected. The first request still has to prefill the context, but the measured follow-up turns mostly restore cached checkpoints. TTFT drops from minutes to a few seconds, so decode speed matters again.
The MTP TTFT is still a little worse than baseline, but the total request is faster because the answer generation is much faster. At 128k cached context, MTP n=3 takes the 3090 run from 15.0 seconds to 10.3 seconds. On Strix Halo, it takes the same shape from 45.6 seconds to 27.4 seconds.
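To see how far the balance shifts once the cache is warm, here is the share of each cached request spent waiting for the first token, next to the total speedup. The rows are copied from the cached-agent table; the function and field names are my own.

```python
# (hardware, shape, baseline total, MTP total, baseline TTFT, MTP TTFT), seconds.
cached_rows = [
    ("RTX 3090", "ctx32k_agent_cache", 10.8, 7.9, 0.96, 1.18),
    ("RTX 3090", "ctx64k_agent_cache", 12.2, 9.0, 1.24, 1.64),
    ("RTX 3090", "ctx128k_agent_cache", 15.0, 10.3, 1.91, 2.58),
    ("Strix Halo", "ctx32k_agent_cache", 34.3, 19.5, 2.48, 2.74),
    ("Strix Halo", "ctx64k_agent_cache", 38.0, 23.3, 3.37, 3.70),
    ("Strix Halo", "ctx128k_agent_cache", 45.6, 27.4, 5.13, 5.62),
]


def summarize(row):
    """TTFT share of each request and the total-time speedup from MTP."""
    hardware, shape, base_total, mtp_total, base_ttft, mtp_ttft = row
    return {
        "hardware": hardware,
        "shape": shape,
        "base_ttft_share": base_ttft / base_total,
        "mtp_ttft_share": mtp_ttft / mtp_total,
        "total_speedup": base_total / mtp_total,
    }


summaries = [summarize(row) for row in cached_rows]
for s in summaries:
    print(f"{s['hardware']:<10} {s['shape']:<20} "
          f"TTFT share {s['base_ttft_share']:.0%} -> {s['mtp_ttft_share']:.0%}, "
          f"speedup {s['total_speedup']:.2f}x")
```

In every cached row, TTFT is at most about a quarter of the request. For comparison, the cold 128k baseline on the 3090 spent 141.4 of its 160.7 seconds, about 88%, before the first token.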
Does MTP Make Things Slower?
Sometimes, yes.
I would not say "MTP is slower." That is too broad. The better rule is:
MTP is faster when decode dominates. MTP can be slower when fresh prefill dominates.
That makes it a good fit for normal chat, coding-agent replies, cached-context follow-ups, and longer answers. It is a weaker fit for "read this giant context from scratch and answer in a few sentences," especially on builds where MTP adds prompt-processing overhead.
The earlier medium-context agent result still matters. On Strix Halo, Qwen3.6 27B Q4_K_M went from 12.0 tok/s to 20.4 tok/s with MTP n=3 on my agent shape. Total time dropped from 41.7 seconds to 24.5 seconds.
That is the workload where MTP feels good: the model has some context, but it also has enough answer to decode.
My Current Rule
For local use, I would still keep MTP on for normal chats and agent-style work.
For giant fresh prompts, I would test both modes. If the prompt is 100k+ tokens and the answer is not very long, baseline may be faster. The prompt-processing fix helped a bit on my 3090 run, but MTP was still slower there for cold 32k, 64k, and 128k prompts. On Strix Halo after the fix, only the 128k case was still clearly slower.
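One way to make "not very long" concrete is a rough break-even estimate: how many output tokens would MTP need before its faster decode pays back the extra TTFT? The sketch below uses the first-build cold 128k numbers from the 3090. It assumes the per-token decode times implied by the 500-token runs stay constant for longer answers, which is a simplification, and the helper name is hypothetical.

```python
def break_even_output_tokens(extra_ttft_s, base_s_per_token, mtp_s_per_token):
    """Output tokens needed for MTP's decode savings to repay extra TTFT.

    Returns None if MTP does not decode faster per token.
    """
    saving = base_s_per_token - mtp_s_per_token
    if saving <= 0:
        return None
    return extra_ttft_s / saving


OUTPUT_TOKENS = 500  # the "answer" shape

# RTX 3090, cold 128k, first build: totals and TTFT in seconds.
base_total, base_ttft = 160.7, 141.4
mtp_total, mtp_ttft = 263.2, 252.8

# Treat everything after the first token as decode time (an approximation).
base_s_per_token = (base_total - base_ttft) / OUTPUT_TOKENS
mtp_s_per_token = (mtp_total - mtp_ttft) / OUTPUT_TOKENS

tokens = break_even_output_tokens(mtp_ttft - base_ttft, base_s_per_token, mtp_s_per_token)
print(f"break-even at roughly {tokens:.0f} output tokens")
```

Under those assumptions the break-even sits at roughly six thousand output tokens, far more than the 500 in my answer shape. Treat it as an order-of-magnitude guide and rerun it with your own measurements.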
For ongoing chats with KV cache reuse, the result is much less harsh. The cold-prompt benchmark answers one question: when the model has to prefill the whole long prompt from scratch, MTP does not make that prefill cheaper. The cached-agent benchmark answers the other: once that context is reused, MTP can help again.