MTP Speeds Up Token Generation, Not Huge Prompts

2026-05-20 21:32:03::AUTHOR: CALEB

My first MTP post showed the fun part: MTP can make supported models write much faster.

But that left a practical question open. What happens when the chat already has a huge amount of context?

I ran that test. The short version: MTP helps token generation. It does not help prefill.

The terms are useful here. Prefill is the model reading your prompt and building the KV cache. Decode is the model writing new output tokens. TTFT, or time to first token, is mostly the prefill bill plus the first bit of decode.

For small prompts, faster token generation is great. For huge fresh prompts, prefill dominates the request. MTP does not remove that step. In the llama.cpp builds I tested, it made that step slower.

The cache case is different. If the backend can reuse the KV cache from the shared context, the request turns back into a decode-heavy workload, and MTP helps again.

The Terms

Every request has two main parts:

total time = prefill + decode

Prefill is where the backend ingests the prompt and creates the KV cache. Decode is where it generates the answer one token at a time.

MTP (multi-token prediction) is a decode optimization. The model drafts a few future tokens, verifies them, and keeps the draft tokens that match. That can make output much faster.
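The draft, verify, and keep step can be sketched in a few lines. This is a minimal illustration assuming greedy decoding, not llama.cpp's implementation; the function name and the token IDs are hypothetical.

```python
from typing import List


def accept_draft(draft: List[int], verified: List[int]) -> List[int]:
    """Return the tokens kept from one draft-and-verify step.

    draft:    tokens proposed by the cheap draft pass (hypothetical input).
    verified: tokens the full model picks at the same positions, plus one
              extra token for the position after the last draft token.
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must have one more token than draft")

    kept: List[int] = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            # First mismatch: discard the rest of the draft and keep the
            # full model's own token for this position instead.
            kept.append(checked)
            return kept
        kept.append(drafted)

    # Every draft token matched, so the extra verified token is a bonus.
    kept.append(verified[-1])
    return kept


if __name__ == "__main__":
    # n=3 draft where the first two tokens match and the third does not.
    print(accept_draft([11, 22, 33], [11, 22, 99, 44]))  # [11, 22, 99]
    # n=3 draft where all three tokens match.
    print(accept_draft([11, 22, 33], [11, 22, 33, 44]))  # [11, 22, 33, 44]
```

Each step yields at least one token, and up to n+1 when the whole draft matches. That is where the decode speedup comes from.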

Prompt tokens are different. During prefill, the backend already knows the next token because it is in the prompt. There is nothing useful to speculate. The model still has to process the context and update its cache state.

If the prompt is short and the answer is long, MTP can help a lot. If the prompt is 100k+ tokens and the answer is short or medium length, the request is mostly prefill. Faster decode cannot save enough time.

That is exactly what showed up in the data.
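The prefill-plus-decode split can be turned into a tiny model to show why a decode-only speedup fades as the prompt grows. This is a rough sketch: it assumes constant prefill and decode rates, which real backends do not have at long context, and the rates are illustrative round numbers, not measurements.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimate total time as prefill time plus decode time."""
    prefill_s = prompt_tokens / prefill_tps   # reading the prompt
    decode_s = output_tokens / decode_tps     # writing the answer
    return prefill_s + decode_s


def speedup(prompt_tokens: int, output_tokens: int, prefill_tps: float,
            base_decode_tps: float, fast_decode_tps: float) -> float:
    """Whole-request speedup when only the decode rate improves."""
    base = request_seconds(prompt_tokens, output_tokens,
                           prefill_tps, base_decode_tps)
    fast = request_seconds(prompt_tokens, output_tokens,
                           prefill_tps, fast_decode_tps)
    return base / fast


if __name__ == "__main__":
    # Illustrative round numbers (assumptions), not benchmark results.
    PREFILL_TPS = 1000.0
    for prompt_tokens in (1_000, 120_000):
        ratio = speedup(prompt_tokens, 500, PREFILL_TPS, 35.0, 50.0)
        print(f"{prompt_tokens:>7} prompt tokens: {ratio:.2f}x faster overall")
```

With these made-up rates, the same decode improvement is worth about 1.39x on the 1k prompt and about 1.03x on the 120k prompt. The model also assumes MTP leaves prefill untouched, so any extra prefill cost pushes the long-prompt case below 1x.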

The Benchmark

I tested Qwen3.6-27B-MTP-GGUF-Q4_K_M by running llama.cpp directly on two machines:

  • RTX 3090 rig
  • Strix Halo

The prompts targeted 1k, 4k, 16k, 32k, 64k, 100k, and 128k tokens. The biggest prompt landed at about 121k real prompt tokens for this model.

I used two shapes:

  • probe: long prompt, 8 output tokens. This mostly measures TTFT and prefill.
  • answer: same prompt, 500 output tokens. This measures whether faster decode pays back the prompt cost.

Prompt-cache reuse was disabled with --cache-ram 0, so this first set is a cold long-context test. I ran baseline, MTP n=2, and MTP n=3 on both machines. The main RTX 3090 and Strix Halo runs used llama.cpp commit 4f13cb7, from shortly after MTP support landed. I also reran the long 3090 rows on 3e12fbd, the follow-up prompt-processing fix.

Small Prompts Got Faster

At about 1k prompt tokens, MTP did what I expected.

Hardware    Shape         Baseline total  MTP n=3 total  Result
RTX 3090    ctx1k_answer  14.2s           10.0s          MTP faster
Strix Halo  ctx1k_answer  44.4s           23.5s          MTP faster

The model did not have much prompt to prefill. Most of the useful time was decode, and MTP made decode faster.

On the 3090, generation went from 35.2 tok/s to 50.2 tok/s, about 1.4x. On Strix Halo, it went from 11.3 tok/s to 21.3 tok/s, about 1.9x.

Huge Fresh Prompts Got Slower

At about 121k prompt tokens, the result flipped.

Hardware    Shape           Baseline total  MTP n=3 total  Result
RTX 3090    ctx128k_answer  160.7s          263.2s         MTP slower
Strix Halo  ctx128k_answer  797.5s          837.9s         MTP slower

The reason is TTFT.

On the 3090, baseline reached the first token at 141.4 seconds. MTP n=3 reached it at 252.8 seconds.

On Strix Halo, baseline reached the first token at 740.3 seconds. MTP n=3 reached it at 806.4 seconds.

That is the result from the first build. MTP decoded faster after the model started answering, but the request had already lost too much time in prefill.
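Splitting each request at TTFT shows both effects at once. The sketch below treats everything before the first token as prefill and assumes the answer shape produced exactly 500 output tokens, which is the target length rather than a confirmed count. The function name is illustrative.

```python
def decode_rate(total_s: float, ttft_s: float, output_tokens: int) -> float:
    """Approximate decode tokens/second from total time and TTFT."""
    decode_s = total_s - ttft_s
    if decode_s <= 0:
        raise ValueError("total time must be larger than TTFT")
    return output_tokens / decode_s


if __name__ == "__main__":
    OUTPUT_TOKENS = 500  # assumed: the answer shape's target length

    # RTX 3090, ctx128k_answer, first build: (total seconds, TTFT seconds)
    runs = {
        "baseline": (160.7, 141.4),
        "MTP n=3": (263.2, 252.8),
    }
    for name, (total_s, ttft_s) in runs.items():
        rate = decode_rate(total_s, ttft_s, OUTPUT_TOKENS)
        share = ttft_s / total_s
        print(f"{name}: ~{rate:.1f} tok/s decode, "
              f"{share:.0%} of the request spent before the first token")
```

Under that assumption, decode on the 3090 goes from roughly 26 tok/s to roughly 48 tok/s with MTP n=3, but the extra 111 seconds of TTFT outweighs the roughly 9 seconds saved while decoding.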

MTP n=2 did not change that conclusion. On Strix Halo at 128k, MTP n=2 took 844.9 seconds and MTP n=3 took 837.9 seconds, while baseline took 797.5 seconds.

Figure: RTX 3090, cold answer time vs prompt size. No context-size data.

Figure: Strix Halo, total answer time vs prompt size. X axis: prompt tokens, server-reported. Y axis: total seconds. Plotted values:

Shape            Prompt tokens  Baseline  MTP n=2  MTP n=3  Baseline after prompt fix  MTP n=3 after prompt fix
ctx1k_answer     969            44.42s    25.91s   23.50s   -                          -
ctx4k_answer     4k             53.00s    35.35s   32.64s   -                          -
ctx16k_answer    15k            92.05s    79.44s   77.75s   -                          -
ctx32k_answer    30k            155.13s   148.56s  147.14s  156.36s                    144.12s
ctx64k_answer    61k            318.82s   326.27s  322.74s  320.91s                    317.06s
ctx100k_answer   95k            563.37s   590.24s  585.80s  -                          -
ctx128k_answer   121k           797.46s   844.87s  837.93s  801.79s                    823.49s

The Prompt Fix Helped, But Not Enough

The fresh-prompt slowdown got smaller after the llama.cpp prompt fix, but it did not fully go away.

The original llama.cpp MTP PR (#22673) said prompt processing could take a hit because of device-to-host embedding transfers. The next day, llama.cpp merged a follow-up, PR #23198, that avoids copying logits during MTP prompt decode to reduce memory traffic and improve prompt-processing speed.

I reran the 32k, 64k, and 128k cold cases after that fix. The RTX 3090 run used the follow-up commit directly. The Strix Halo run used a newer llama.cpp build that includes the same fix.

The 128k RTX 3090 MTP answer improved by about 10 seconds, from 263.2 seconds to 253.3 seconds. Baseline stayed the same at about 160.8 seconds.

On Strix Halo, the newer llama.cpp build also narrowed the 128k gap. Baseline was 801.8 seconds and MTP n=3 was 823.5 seconds. That is still slower, but the gap shrank from about 40 seconds on the first build to about 22 seconds.

Shape, after prompt fix  Hardware    Baseline total  MTP n=3 total  Result
ctx32k_answer            RTX 3090    38.2s           48.6s          MTP slower
ctx64k_answer            RTX 3090    70.7s           99.9s          MTP slower
ctx128k_answer           RTX 3090    160.8s          253.3s         MTP slower
ctx32k_answer            Strix Halo  156.4s          144.1s         MTP faster
ctx64k_answer            Strix Halo  320.9s          317.1s         about even
ctx128k_answer           Strix Halo  801.8s          823.5s         MTP slower

So the build detail matters, and hardware matters too. The safer claim is: MTP does not make fresh prefill cheaper, and some implementations can make cold long-context prefill slower.

KV Cache Reuse Changes The Result

Agent chats are not always cold prompts. A coding agent may keep the same repo context and ask several follow-up questions. In llama.cpp, prompt-cache checkpoints can reuse that shared prefix.

I ran a second test with the same 32k, 64k, and 128k context prefix, but different follow-up instructions. This time I enabled llama.cpp's prompt cache with --cache-ram 8192.
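The idea behind prefix reuse can be shown with plain token lists. This is a conceptual sketch only: llama.cpp's prompt-cache checkpoints are more involved, and the function names and token IDs here are hypothetical.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count how many leading tokens the two prompts have in common."""
    count = 0
    for old_token, new_token in zip(cached, new):
        if old_token != new_token:
            break
        count += 1
    return count


def tokens_to_prefill(cached: Sequence[int], new: Sequence[int]) -> int:
    """Tokens that still need a fresh prefill after reusing the prefix."""
    return len(new) - shared_prefix_len(cached, new)


if __name__ == "__main__":
    # Stand-in for a large shared repo context (made-up token IDs).
    context = list(range(1000))
    first_turn = context + [5001, 5002]
    follow_up = context + [6001, 6002, 6003]

    print(shared_prefix_len(first_turn, follow_up))   # 1000
    print(tokens_to_prefill(first_turn, follow_up))   # 3
```

When only the short follow-up instruction differs, almost none of the prompt needs fresh prefill, so the request is decode-heavy again.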

Hardware    Shape                Baseline total  MTP n=3 total  Baseline TTFT  MTP n=3 TTFT
RTX 3090    ctx32k_agent_cache   10.8s           7.9s           0.96s          1.18s
RTX 3090    ctx64k_agent_cache   12.2s           9.0s           1.24s          1.64s
RTX 3090    ctx128k_agent_cache  15.0s           10.3s          1.91s          2.58s
Strix Halo  ctx32k_agent_cache   34.3s           19.5s          2.48s          2.74s
Strix Halo  ctx64k_agent_cache   38.0s           23.3s          3.37s          3.70s
Strix Halo  ctx128k_agent_cache  45.6s           27.4s          5.13s          5.62s

That is the agent-shaped result I expected. The first request still has to prefill the context, but the measured follow-up turns mostly restore cached checkpoints. TTFT drops from minutes to a few seconds, so decode speed matters again.

The MTP TTFT is still a little worse than baseline, but the total request is faster because the answer generation is much faster. At 128k cached context, MTP n=3 takes the 3090 run from 15.0 seconds to 10.3 seconds. On Strix Halo, it takes the same shape from 45.6 seconds to 27.4 seconds.

Figure: RTX 3090, cached agent follow-up time vs context size. No context-size data.

Figure: Strix Halo, cached agent follow-up time vs context size. X axis: prompt tokens, server-reported. Y axis: total seconds. Plotted values:

Shape                Prompt tokens  Baseline + cache  MTP n=3 + cache
ctx32k_agent_cache   30k            34.26s            19.47s
ctx64k_agent_cache   61k            38.03s            23.33s
ctx128k_agent_cache  121k           45.63s            27.36s

Does MTP Make Things Slower?

Sometimes, yes.

I would not say "MTP is slower." That is too broad. The better rule is:

MTP is faster when decode dominates. MTP can be slower when fresh prefill dominates.

That makes it a good fit for normal chat, coding-agent replies, cached-context follow-ups, and longer answers. It is a weaker fit for "read this giant context from scratch and answer in a few sentences," especially on builds where MTP adds prompt-processing overhead.

The earlier medium-context agent result still matters. On Strix Halo, Qwen3.6 27B Q4_K_M went from 12.0 tok/s to 20.4 tok/s with MTP n=3 on my agent shape. Total time dropped from 41.7 seconds to 24.5 seconds.

That is the workload where MTP feels good: the model has some context, but it also has enough answer to decode.

My Current Rule

For local use, I would still keep MTP on for normal chats and agent-style work.

For giant fresh prompts, I would test both modes. If the prompt is 100k+ tokens and the answer is not very long, baseline may be faster. The prompt-processing fix helped a bit on my 3090 run, but MTP was still slower there for cold 32k, 64k, and 128k prompts. On Strix Halo after the fix, MTP was faster at 32k, about even at 64k, and slower at 128k.
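One way to put a number on "not very long" is a break-even estimate: how many output tokens MTP needs before its faster decode pays back its extra TTFT. The sketch below is an estimate under stated assumptions, not a measured result. It uses the first-build RTX 3090 128k numbers, assumes the answer shape produced exactly 500 tokens, and assumes decode speed stays constant for longer answers. The function name is illustrative.

```python
def break_even_output_tokens(extra_ttft_s: float, baseline_tps: float,
                             mtp_tps: float) -> float:
    """Output tokens needed for faster decode to repay extra TTFT."""
    if mtp_tps <= baseline_tps:
        raise ValueError("MTP decode must be faster to ever break even")
    # Seconds saved on every generated token once decoding has started.
    saved_per_token_s = 1.0 / baseline_tps - 1.0 / mtp_tps
    return extra_ttft_s / saved_per_token_s


if __name__ == "__main__":
    OUTPUT_TOKENS = 500  # assumed: the answer shape's target length

    # RTX 3090, ctx128k_answer, first build (seconds).
    base_total, base_ttft = 160.7, 141.4
    mtp_total, mtp_ttft = 263.2, 252.8

    base_tps = OUTPUT_TOKENS / (base_total - base_ttft)
    mtp_tps = OUTPUT_TOKENS / (mtp_total - mtp_ttft)
    extra_ttft = mtp_ttft - base_ttft

    tokens = break_even_output_tokens(extra_ttft, base_tps, mtp_tps)
    print(f"break-even at about {tokens:.0f} output tokens")
```

With those inputs the estimate lands at roughly 6,300 output tokens, far more than the 500-token answer in the benchmark. Other hardware, builds, and prompt sizes will give different numbers, which is why testing both modes is the safer habit.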

For ongoing chats with KV cache reuse, the result is much less harsh. The cold-prompt benchmark answers one question: when the model has to prefill the whole long prompt from scratch, MTP does not make that prefill cheaper. The cached-agent benchmark answers the other: once that context is reused, MTP can help again.