Benchmarks
Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.
Hardware tested
Rig metadata and microbenchmarks are shown here so memory bandwidth and tensor math do not get mixed into model-serving rankings.
| Power cap | Theoretical bandwidth (GB/s) | Copy bandwidth (GB/s) | FP16 (TFLOPS) | BF16 (TFLOPS) |
|---|---|---|---|---|
| 200 W | 936 | 391 | 65.4 | 65.4 |
| 300 W | 936 | 391 | 65.4 | 65.3 |
| 450 W | 936 | 391 | 65.4 | 65.4 |
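As a quick sanity check on the microbenchmark table, the measured copy bandwidth can be expressed as a fraction of the theoretical figure. The sketch below uses only the two bandwidth numbers from the table; the constant and function names are illustrative and not part of the benchmark tooling. How the copy test counts bytes moved is not stated here, so treat the ratio as a rough indicator rather than a precise efficiency.

```python
# Figures copied from the microbenchmark table (identical at every power cap).
THEORETICAL_GBPS = 936.0    # theoretical memory bandwidth, GB/s
MEASURED_COPY_GBPS = 391.0  # measured copy bandwidth, GB/s


def copy_efficiency(measured_gbps: float, theoretical_gbps: float) -> float:
    """Return measured copy bandwidth as a fraction of theoretical bandwidth."""
    if theoretical_gbps <= 0:
        raise ValueError("theoretical bandwidth must be positive")
    return measured_gbps / theoretical_gbps


if __name__ == "__main__":
    eff = copy_efficiency(MEASURED_COPY_GBPS, THEORETICAL_GBPS)
    print(f"copy efficiency: {eff:.1%}")
```

With the table's values the ratio comes out to roughly 42%, and it does not move between the 200 W and 450 W rows.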
The rig is a self-built quad-3090 machine used for local ML and inference testing. Benchmark rows labeled "2× RTX 3090" use llama.cpp's --split-mode layer across two cards so larger Q8_0-class 27B models fit in memory. Every other RTX 3090 row uses exactly one card.
- GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), running at the full 450 W cap. Earlier benchmarks at a 200 W rack-noise cap are noted in the per-run YAML and discussed in the power-limits post.
- CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
- Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
- Power policy: 450 W max per card for the current full-power runs, with older capped rows preserved for comparison.
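For the two-card rows described above, a launch command might be assembled roughly as follows. This is a minimal sketch, not the rig's actual harness: the choice of the `llama-server` binary, the model path, the `--n-gpu-layers 99` offload value, and the helper name `build_llama_cmd` are all assumptions for illustration. Only `--split-mode layer` comes from the rig description.

```python
import shlex


def build_llama_cmd(model_path: str, gpu_ids: list[int]) -> tuple[list[str], dict[str, str]]:
    """Build an argv list and environment overrides for a llama.cpp run.

    Assumption: llama-server is the binary in use; the page does not say.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    # Offload all layers to the GPU(s); 99 is simply a value larger than the layer count.
    argv = ["llama-server", "--model", model_path, "--n-gpu-layers", "99"]
    if len(gpu_ids) > 1:
        # Split whole layers across the visible cards so a larger model fits in memory.
        argv += ["--split-mode", "layer"]
    # Restrict the process to the chosen cards.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}
    return argv, env


if __name__ == "__main__":
    # Hypothetical model file name, used only to show the shape of the command.
    argv, env = build_llama_cmd("models/example-27b-q8_0.gguf", [0, 1])
    print(env, shlex.join(argv))
```

Single-card rows would call the same helper with one GPU id, in which case no split flag is added.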
Best workload row per rig
| Rig | Best workload row | Decode tok/s | Backend / mode |
|---|---|---|---|
| GeForce RTX 3090 · 24 GiB | LFM2.5 350M · chat | 940.7 | llama.cpp baseline-pl-350w |
Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.
Use these rows to compare runs on a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.
Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.
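The GPU-to-GPU guidance above lists the attributes that should match before two rows are compared: model, quant, backend, driver family, power policy, and benchmark shape. One way to make that check explicit is sketched below. The `RunKey` type, the field values, and the strict equality test are assumptions for illustration; the guidance itself only asks that the attributes "match closely", so exact equality is a stricter simplification.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunKey:
    """Attributes that should match before comparing two rows GPU-to-GPU."""
    model: str
    quant: str
    backend: str
    driver_family: str
    power_policy: str
    benchmark_shape: str


def mismatched_fields(a: RunKey, b: RunKey) -> list[str]:
    """Return the names of attributes that differ between two runs."""
    return [f.name for f in fields(RunKey) if getattr(a, f.name) != getattr(b, f.name)]


if __name__ == "__main__":
    # Hypothetical rows; the values are placeholders, not measured runs.
    run_a = RunKey("example-model", "Q8_0", "llama.cpp", "driver-x", "450w", "chat")
    run_b = RunKey("example-model", "Q8_0", "llama.cpp", "driver-x", "200w", "chat")
    diff = mismatched_fields(run_a, run_b)
    print("comparable" if not diff else f"differs in: {', '.join(diff)}")
```

An empty result suggests the two rows are candidates for a hardware comparison; any listed field is a reason to read the difference in tok/s with more caution.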