How much do power limits affect LLM benchmark tok/s?
The RTX 3090 in my homelab runs at the full 450 W cap by default. I had previously capped it at 200 W because the rig with four cards in it kept popping the circuit breaker when more than one card pulled stock power under load. 200 W per card was the cap that let me run all four at once without going dark.
That seemed fine until I started benchmarking against a Strix Halo Framework Desktop and an RTX 5070 and got numbers that didn't match the reviews. So I ran the same card at three caps: 200 W, 350 W, 450 W. The difference is huge.
TL;DR
- A 3090 at 200 W runs Qwen3.6 27B Q4_K_M at 21.1 tok/s. The same card at 450 W runs it at 38.7 tok/s. Same card, same model, same backend, same software.
- On Gemma-3 4B, most of the gain happens in a 140-270 W band. Below 140 W the card is throttled to a near-idle ~33 tok/s. Above 270 W the curve flattens, and 350 W → 450 W is roughly flat on this workload.
- A 5070 at its stock 250 W beats a 3090 at 200 W across the board, but a 3090 at 350 W catches the 5070 on single-stream chat and codegen on Gemma-3 4B. On prefill-heavy (rag) and multi-stream (agent c=4) shapes the 5070 still leads the 350 W 3090 (143 vs 102 tok/s rag, 66 vs 55 tok/s agent c=4). So the first benchmark pass, which made it look like "the 5070 beats the 3090", was mostly a power-cap story for chat-style workloads, but there's a real Blackwell-on-prefill advantage too. Details in docs/5070-vs-3090-investigation.md.
- My MTP post needed updating. "Strix Halo with MTP catches the bare 3090" was true at 200 W; at 450 W the 3090 baseline is 38.7 tok/s, well clear of Strix Halo's 21.2.
- If you're buying or comparing GPUs for local LLM work, ask what the power cap is. It's at least as load-bearing as quant choice.
Gemma-3 4B Q4_K_M, single 3090, 100 W → 450 W in 10 W steps
Small model, fits comfortably in 24 GiB. Same llama.cpp build, same harness, same prompts, temperature 0, median of 5 measured runs after 2 warmups. One line per workload shape:
The chat curve sits at ~33 tok/s from 100 W to 130 W: the card is pinned at a near-idle compute state and the cap is binding hard. From 140 W it climbs sharply, passing 100 tok/s by 180 W and 150 tok/s by 240 W. Above ~270 W the curve flattens around 165 tok/s and 450 W gives no more.
Peak draw at the 450 W cap was 433 W, so the cap is no longer what limits the card: it has headroom it doesn't use. Codegen (long answer) tracks chat almost exactly. Rag (longer prompt) plateaus much earlier and lower (~95-100 tok/s above 200 W) because prefill is compute-bound on long prompts and doesn't scale with cap the way decode does. Agent concurrency 4 stays low (~50-65 tok/s) because the four streams contend for the same kernels.
The wall isn't bandwidth (936 GB/s on GDDR6X) and isn't VRAM. It's the power budget. Below ~270 W the boost clocks throttle. Above, decode isn't generating fast enough to keep the SMs busy and the extra wattage goes to heat.
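A sweep like the one above could be scripted roughly as follows. This is a minimal sketch, not the actual harness: `run_once` is a hypothetical callable that runs one benchmark pass and returns tok/s, the function names are made up for illustration, and `nvidia-smi -pl` normally needs root. The median-of-5-after-2-warmups protocol and the 100 W to 450 W range in 10 W steps come from the setup described earlier.

```python
import statistics
import subprocess
from typing import Callable, Dict, List


def sweep_caps(start_w: int = 100, stop_w: int = 450, step_w: int = 10) -> List[int]:
    """Power caps to test, inclusive of both ends."""
    return list(range(start_w, stop_w + 1, step_w))


def set_cap_command(gpu_index: int, watts: int) -> List[str]:
    """nvidia-smi invocation that sets the power limit for one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def median_tok_s(run_once: Callable[[], float], warmups: int = 2, runs: int = 5) -> float:
    """Discard the warmup runs, then return the median of the measured runs."""
    for _ in range(warmups):
        run_once()
    return statistics.median(run_once() for _ in range(runs))


def run_sweep(gpu_index: int, run_once: Callable[[], float]) -> Dict[int, float]:
    """Apply each cap in turn and record the median tok/s at that cap."""
    results = {}
    for watts in sweep_caps():
        subprocess.run(set_cap_command(gpu_index, watts), check=True)
        results[watts] = median_tok_s(run_once)
    return results
```

`run_sweep` is only defined here, not called; on a real rig you would also want to restore the original limit when the sweep finishes.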
Qwen3.6 27B with MTP, single 3090, 200 W vs 450 W
Same setup as the MTP post, one card, Q4_K_M, dense not MoE. Two cap settings:
| Setting | Baseline | MTP n=2 | MTP n=3 | MTP gain |
|---|---|---|---|---|
| 200 W cap | 21.1 | 32.1 | 34.2 | 1.62× |
| 450 W cap | 38.7 | 59.5 | 58.7 | 1.54× |
Two things happen when you lift the cap:
- The baseline almost doubles (21.1 → 38.7, +83 %). The cap was the bottleneck, not the architecture.
- The MTP sweet spot moves from n=3 to n=2. At 200 W the baseline was so slow that the third drafted token was still worth keeping; at 450 W the baseline is fast enough that the extra drafted token isn't accepted often enough to pay for itself.
The relative MTP gain shrinks slightly (1.62× → 1.54×), but the absolute tok/s nearly doubles. The bigger lever for the 3090 was the cap. MTP still helps, but the headroom MTP exploits ("the GPU is idle waiting on the next forward pass") gets smaller when the card has the wattage to do those passes faster on its own.
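The ratios quoted above can be recomputed from the table. The snippet below is just that arithmetic, with dictionary keys and function names chosen for illustration; the tok/s values are copied from the single-card Q4_K_M table.

```python
# tok/s from the single-card Q4_K_M table, keyed by power cap in watts
results = {
    200: {"baseline": 21.1, "mtp_n2": 32.1, "mtp_n3": 34.2},
    450: {"baseline": 38.7, "mtp_n2": 59.5, "mtp_n3": 58.7},
}


def best_mtp(row):
    """Return (setting, tok/s) for the faster of the two MTP draft lengths."""
    key = max(("mtp_n2", "mtp_n3"), key=lambda k: row[k])
    return key, row[key]


def mtp_gain(row):
    """Best MTP tok/s divided by the non-MTP baseline."""
    return best_mtp(row)[1] / row["baseline"]


def baseline_lift_pct(low, high):
    """Percentage change in baseline tok/s between two caps."""
    return (high["baseline"] / low["baseline"] - 1) * 100


for cap, row in results.items():
    setting, tok_s = best_mtp(row)
    print(f"{cap} W: best {setting} = {tok_s} tok/s, gain {mtp_gain(row):.2f}x")
print(f"baseline lift: +{baseline_lift_pct(results[200], results[450]):.0f} %")
```

The best setting flips from `mtp_n3` at 200 W to `mtp_n2` at 450 W, which is the sweet-spot move described above.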
Dual-card Q8_0 barely moved
Same model in Q8_0 on two layer-split 3090s, same two caps:
| Setting | Baseline | MTP n=2 | MTP n=3 |
|---|---|---|---|
| 200 W cap | 25.1 | 47.4 | 50.5 |
| 450 W cap | 25.7 | 49.6 | 55.9 |
Baseline barely moved (+2 %). That's because --split-mode layer runs the two cards as a pipeline: at any instant one card is computing and the other is waiting for the next layer's input. Per-card power utilization stays low. Lifting the cap doesn't help when the card already has the wattage it needs.
So at the 200 W cap, Q8 dual beat Q4 single (25.1 vs 21.1) because Q4 single was throttled. At 450 W, Q4 single beats Q8 dual (38.7 vs 25.7) because Q4 single can finally use the watts it had been promised. Two cards in layer-split is a VRAM-headroom story, not a throughput story.
What the rest of the model fleet looks like
Once I had the 200 W and 450 W passes done for Qwen3.6 27B, I re-ran the rest of the 3090 lineup at 350 W and 450 W to see if the pattern held. 26 models, 50+ new YAMLs. The chat tok/s deltas are consistent and pretty load-bearing for anyone reading my benchmarks page:
| Model class | 200 → 450 W lift | Examples |
|---|---|---|
| Dense 14-32B | +70 to +113 % | Qwen3.6 27B all 6 quants (+70-99 %), Qwen2.5 14B-AWQ (+90 %), 32B-AWQ (+113 %), granite-4.1-30b (+92 %) |
| Dense 7-8B | +53 to +73 % | Qwen2.5-Coder-7B AWQ (+72 %), Qwen2.5-Coder-7B GGUF (+65 %), granite-4.1-8b (+53 %) |
| MoE 8-35B | +12 to +21 % | Qwen3-Coder-30B-A3B (+16 %), Qwen3.5-35B-A3B (+13 %), Nemotron 30B-A3B (+12 %), LFM2-8B-A1B (+21 %) |
| Small dense (under 3B) | +0 to +29 % | LFM2 1.2-2.6B (+9-29 %), Gemma-4 E2B/E4B (+0-15 %) |
| Tiny (under 1B) | +3 % | LFM2.5-350M |
Three things fall out:
- 350 W catches almost all the gain. Across the matrix, going from 350 W to 450 W moves chat tok/s by less than 5 % on most models. A few, such as Qwen3-Coder-30B-A3B (+11 %) and Qwen2.5-Coder-7B GGUF (+11 %), keep climbing past 350 W, but most don't. If you don't want to run your card flat-out, 350 W is the knee on a stock 3090.
- MoE architectures barely scale with the cap. Qwen3-Coder-30B-A3B, Qwen3.5-35B-A3B, and Nemotron-3-Nano-Omni 30B-A3B activate only ~3 B of params per token, and LFM2-8B-A1B only ~1 B. They're not compute-bound the way a dense 27 B is, so the boost clocks aren't the bottleneck. Cap helps a little, but not 2×.
- Backend doesn't change the story. Qwen2.5 7B in AWQ via vLLM went +73 %; Qwen2.5-Coder 7B in GGUF via llama.cpp went +65 %. Same chip, same lift, different software stack. This is a hardware story, not a backend story.
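The "350 W is the knee" call can be made mechanical. One way to do it, sketched here under assumptions, is to pick the lowest cap whose tok/s is within a threshold of the highest-cap result. The 5 % threshold mirrors the "less than 5 %" figure above; the function names are illustrative, and the 350 W value in the example curve is a hypothetical placeholder, not a measured number from this post.

```python
from typing import Dict


def lift_pct(low_tok_s: float, high_tok_s: float) -> float:
    """Percentage gain going from low_tok_s to high_tok_s."""
    return (high_tok_s / low_tok_s - 1) * 100


def knee_cap(curve: Dict[int, float], threshold_pct: float = 5.0) -> int:
    """Lowest cap whose tok/s is within threshold_pct of the highest-cap result."""
    if not curve:
        raise ValueError("curve must contain at least one cap")
    caps = sorted(curve)
    top = curve[caps[-1]]
    for cap in caps:
        if lift_pct(curve[cap], top) < threshold_pct:
            return cap
    return caps[-1]  # unreachable in practice: the top cap always has 0 % lift


# 200 W and 450 W values are the Qwen3.6 27B Q4_K_M baselines quoted above;
# the 350 W value is a HYPOTHETICAL placeholder for illustration only.
example_curve = {200: 21.1, 350: 37.5, 450: 38.7}
print(knee_cap(example_curve))
```

A model that keeps climbing past 350 W (an 11 % lift from 350 W to 450 W, say) would come back with 450 as its knee under the same threshold.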
If you only learn one thing from any of this, learn this: a 200 W cap is the difference between a 3090 doing 21 tok/s on a 27 B model and 39 tok/s on the same model. That's not a quirk of one quant or one harness. It's the cap.
What this means for benchmarking
I had to update the MTP post once the 450 W numbers landed. The "Strix Halo with MTP catches the 3090 baseline" framing was technically true, but only at the 200 W cap I'd set on my rack. At 450 W the same comparison reads "the bare 3090 baseline is almost 2× faster than Strix Halo's best MTP number." Same hardware, completely different conclusion.
This isn't a Strix Halo problem or an MTP problem. Strix Halo runs at a fixed ~140 W power budget for the whole APU and has nowhere to lift; its 21.2 tok/s with MTP is honest. The number that was lying was the 3090's, because I'd quietly throttled it.
Most reviews and benchmark posts don't disclose the power cap. They should. A 3090 in a tightly-cooled gaming case at 250 W is a different chip from a 3090 in a workstation at 450 W, and you'll get answers that are 50 % apart on the same model. The same goes for RTX 5070 vs 5090, RX 7900 vs MI300X, and so on. Desktop cards ship with quietly different default caps, and even within the same SKU vendors set them differently.
How I'm capturing this going forward
The benchmark harness now writes the configured power_limit_watts, the hardware power_max_watts, the peak observed draw during the run, and the full driver/CUDA versions into every YAML. The benchmarks page renders the cap as a small badge on every row, in amber when the limit is below the card's max:
RTX 3090 · 24 GiB
200 W · drv 590
vs
RTX 3090 · 24 GiB
450 W · drv 590
Click on a hardware row in the /benchmarks page to see the full per-rig metadata, including PCIe link state, GPU clocks, and peak draw. The schema-v3 fields are documented in scripts/llm-speed/docs/yaml-schema.md if you want to read along.
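For readers who want to do something similar in their own harness, here is a rough sketch of the per-run power metadata and the badge rule. Only `power_limit_watts` and `power_max_watts` are field names taken from this post; `peak_draw_watts`, `driver_version`, the class itself, and the `"default"` colour are assumptions made for the example, and the real schema is the one in the schema doc.

```python
from dataclasses import dataclass


@dataclass
class PowerMetadata:
    power_limit_watts: float  # configured cap (field name from the post)
    power_max_watts: float    # hardware maximum (field name from the post)
    peak_draw_watts: float    # hypothetical name for the peak observed draw
    driver_version: str       # hypothetical name for the driver version

    @property
    def is_capped(self) -> bool:
        """True when the configured limit is below the card's maximum."""
        return self.power_limit_watts < self.power_max_watts

    def badge(self) -> dict:
        """Badge text plus a colour hint; 'amber' flags a capped card."""
        driver_major = self.driver_version.split(".")[0]
        return {
            "text": f"{self.power_limit_watts:.0f} W · drv {driver_major}",
            "color": "amber" if self.is_capped else "default",
        }


# peak_draw_watts=200 is a placeholder; 433 W is the peak reported for the 450 W run.
capped = PowerMetadata(200, 450, peak_draw_watts=200, driver_version="590")
stock = PowerMetadata(450, 450, peak_draw_watts=433, driver_version="590")
print(capped.badge())
print(stock.badge())
```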
If you reproduce any of these numbers, run nvidia-smi --query-gpu=power.limit,power.max_limit --format=csv,noheader and check what your card is actually set to. You'll save yourself a long argument with the data.
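If you'd rather check this from a script, the same query can be parsed in a few lines. This is a sketch with illustrative function names; it adds `nounits` to the format string so the values come back as bare numbers, and it assumes every GPU reports both fields (a card that prints `[N/A]` would raise a `ValueError` here).

```python
import subprocess
from typing import List, Tuple

QUERY = [
    "nvidia-smi",
    "--query-gpu=power.limit,power.max_limit",
    "--format=csv,noheader,nounits",
]


def parse_power_limits(csv_text: str) -> List[Tuple[float, float]]:
    """Turn 'limit, max_limit' lines into (limit, max_limit) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        limit, max_limit = (float(field) for field in line.split(","))
        rows.append((limit, max_limit))
    return rows


def capped_gpus(csv_text: str) -> List[int]:
    """Indices of GPUs whose configured limit is below the hardware maximum."""
    return [
        index
        for index, (limit, max_limit) in enumerate(parse_power_limits(csv_text))
        if limit < max_limit
    ]


def query_power_limits() -> str:
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

Calling `capped_gpus(query_power_limits())` on the rig returns the cards that are running below their maximum.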
Caveats
- The fine-grained 10 W sweep is one small model (Gemma-3 4B); the rest of the fleet was only measured at two or three caps. The general pattern (steep gains through the middle of the power range, diminishing returns by 350 W) is consistent across the four shapes I measured, but I haven't swept every model in 10 W steps.
- The 350 → 450 W plateau is workload-specific. A bigger model or a longer-context run that keeps the SMs busier could keep climbing past 350 W. I'll bench heavier models in a follow-up.
- I didn't sweep the 5070 across power caps. Its default is only 250 W, and I don't want to power-limit it to find out where its knee is until I have more 5070 hours on the rig.
- The 450 W numbers in this post are from running ONE 3090 at a time at the full cap. With multiple cards loaded the circuit-breaker problem comes back; the 200 W setting was the all-four-at-once compromise that let me actually use the rig as a daily driver. If you have one 3090 on a normal 15 A circuit you should be fine; if you have four sharing a circuit, plan for that or buy more circuits.
- Sources for the cap behavior: NVIDIA's Power Management Modes, GPU-Z's power-state reporting, and llama.cpp's per-token timing logs, which let me see the dip when the boost clocks throttle.
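The circuit-breaker caveat above comes down to simple arithmetic. The sketch below assumes a 120 V circuit and the common rule of thumb that continuous loads should stay under 80 % of the breaker rating; neither number is stated in this post, so swap in your own. It also counts only the GPUs' power caps and ignores the CPU, fans, and PSU losses, so the real margin is smaller.

```python
def circuit_budget_watts(breaker_amps: float, volts: float, continuous_factor: float = 0.8) -> float:
    """Usable continuous wattage on one circuit (ASSUMED 80 % derating)."""
    return breaker_amps * volts * continuous_factor


def gpu_load_watts(cards: int, cap_watts: float) -> float:
    """Worst-case GPU draw if every card sits at its power cap."""
    return cards * cap_watts


def fits(cards: int, cap_watts: float, breaker_amps: float = 15, volts: float = 120) -> bool:
    """True when the GPUs alone stay within the circuit's continuous budget."""
    return gpu_load_watts(cards, cap_watts) <= circuit_budget_watts(breaker_amps, volts)


for cards, cap in [(4, 450), (4, 200), (1, 450)]:
    print(f"{cards} x {cap} W = {gpu_load_watts(cards, cap):.0f} W, fits: {fits(cards, cap)}")
```

Under these assumptions four cards at 450 W blow well past the budget, while four at 200 W or a single card at 450 W fit comfortably.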