Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 835 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

Full row-level explorer. This is the place for raw shapes, hardware probes, cache/ppfix rows, dense power caps, and reruns.

Every row shown here is a Q4_K_M quant on a GeForce RTX 3090 · 24 GiB running llama.cpp (cuda), and carries the legacy and stack comparable badges.

| Model | Power | Driver branch | llama.cpp build | Profile | Workload | Decode tok/s | TTFT | TPOT (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.2B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | chat1 | 617.7 | 13 ms | 1.6 |
| 1.2B-Tool | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | chat1 | 617.3 | 11 ms | 1.6 |
| 1.2B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | codegen1 | 613.1 | 19 ms | 1.6 |
| 1.2B-Tool | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | codegen1 | 613.1 | 17 ms | 1.6 |
| 1.2B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | rag1 | 611.6 | 34 ms | 1.6 |
| 1.2B-Tool | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | rag1 | 609.8 | 33 ms | 1.6 |
| 1.2B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | chat1 | 609.0 | 12 ms | 1.6 |
| 1.2B-Tool | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent1 | 609.0 | 20 ms | 1.6 |
| 1.2B-Tool | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | chat1 | 608.6 | 12 ms | 1.6 |
| 1.2B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent1 | 607.9 | 20 ms | 1.6 |
| 1.2B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent4 | 607.2 | 1.06 s | 1.6 |
| 1.2B-Tool | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent4 | 606.1 | 1.04 s | 1.6 |
| 1.2B-Tool | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | codegen1 | 605.7 | 17 ms | 1.7 |
| 1.2B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | codegen1 | 605.0 | 16 ms | 1.7 |
| 1.2B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | rag1 | 602.4 | 47 ms | 1.7 |
| 1.2B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent1 | 601.7 | 23 ms | 1.7 |
| 1.2B-Tool | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | rag1 | 601.0 | 34 ms | 1.7 |
| 1.2B-Tool | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent1 | 600.6 | 19 ms | 1.7 |
| 1.2B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent4 | 599.2 | 1.16 s | 1.7 |
| 1.2B-Tool | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent4 | 598.8 | 1.03 s | 1.7 |
| 1.2B-Tool | cap 200 W | 590 | 59778f0 | baseline | chat1 | 486.6 | 13 ms | 2.1 |
| 1.2B | cap 200 W | 590 | 59778f0 | baseline | chat1 | 485.7 | 14 ms | 2.1 |
| 1.2B | cap 200 W | 590 | 59778f0 | baseline | rag1 | 481.9 | 37 ms | 2.1 |
| 1.2B | cap 200 W | 590 | 59778f0 | baseline | codegen1 | 479.2 | 22 ms | 2.1 |
| 1.2B-Tool | cap 200 W | 590 | 59778f0 | baseline | rag1 | 475.3 | 37 ms | 2.1 |
| 1.2B-Tool | cap 200 W | 590 | 59778f0 | baseline | codegen1 | 473.7 | 20 ms | 2.1 |
| 1.2B | cap 200 W | 590 | 59778f0 | baseline | agent1 | 468.4 | 48 ms | 2.1 |
| 1.2B-Tool | cap 200 W | 590 | 59778f0 | baseline | agent1 | 461.0 | 23 ms | 2.2 |
| 2.6B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | chat1 | 332.6 | 17 ms | 3.0 |
| 2.6B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | rag1 | 329.2 | 55 ms | 3.0 |
| 2.6B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | codegen1 | 328.2 | 32 ms | 3.0 |
| 2.6B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent1 | 326.6 | 28 ms | 3.1 |
| 2.6B | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent4 | 325.8 | 1.57 s | 3.1 |
| 2.6B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | chat1 | 324.9 | 18 ms | 3.1 |
| 2.6B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | rag1 | 322.4 | 57 ms | 3.1 |
| 2.6B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | codegen1 | 321.9 | 33 ms | 3.1 |
| 2.6B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent1 | 320.5 | 42 ms | 3.1 |
| 2.6B | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent4 | 319.2 | 1.98 s | 3.1 |
| 2.6B | cap 200 W | 590 | 59778f0 | baseline | chat1 | 254.0 | 21 ms | 3.9 |
| 2.6B | cap 200 W | 590 | 59778f0 | baseline | rag1 | 251.1 | 66 ms | 4.0 |
| 1.2B | cap 200 W | 590 | 59778f0 | baseline | agent4 | 248.3 | 328 ms | 4.0 |
| 1.2B-Tool | cap 200 W | 590 | 59778f0 | baseline | agent4 | 240.4 | 304 ms | 4.2 |
| 2.6B | cap 200 W | 590 | 59778f0 | baseline | codegen1 | 234.6 | 35 ms | 4.3 |
| 2.6B | cap 200 W | 590 | 59778f0 | baseline | agent1 | 234.4 | 90 ms | 4.3 |
| E2B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | chat1 | 226.4 | 41 ms | 4.4 |
| E2B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | chat1 | 226.2 | 40 ms | 4.4 |
| E2B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | rag1 | 224.5 | 83 ms | 4.5 |
| E2B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | codegen1 | 224.3 | 56 ms | 4.5 |
| E2B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent4 | 223.7 | 3.63 s | 4.5 |
| E2B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | codegen1 | 223.7 | 54 ms | 4.5 |
| E2B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent1 | 223.3 | 51 ms | 4.5 |
| E2B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent4 | 223.0 | 3.64 s | 4.5 |
| E2B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent1 | 222.5 | 49 ms | 4.5 |
| E2B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | rag1 | 222.2 | 82 ms | 4.5 |
| E2B-it | cap 200 W | 590 | 59778f0 | baseline | chat1 | 202.7 | 44 ms | 4.9 |
| E2B-it | cap 200 W | 590 | 59778f0 | baseline | rag1 | 199.5 | 116 ms | 5.0 |
| E2B-it | cap 200 W | 590 | 59778f0 | baseline | codegen1 | 197.2 | 59 ms | 5.1 |
| E2B-it | cap 200 W | 590 | 59778f0 | baseline | agent1 | 195.1 | 84 ms | 5.1 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595-r3 | chat1 | 186.7 | 36 ms | 5.4 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595 | chat1 | 186.2 | 37 ms | 5.4 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595 | codegen1 | 183.1 | 132 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595-r3 | codegen1 | 182.6 | 119 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595-r3 | rag1 | 181.9 | 339 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595 | rag1 | 181.6 | 392 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595 | agent1 | 181.3 | 227 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595 | agent4 | 181.1 | 4.20 s | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595-r3 | agent1 | 180.8 | 228 ms | 5.5 |
| 4b-it | 450 W max | 595 | cuda-3e12fbd | baseline-pl-450w-595-r3 | agent4 | 180.6 | 3.46 s | 5.5 |
| 4b-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | chat1 | 179.1 | 38 ms | 5.6 |
| 4b-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | codegen1 | 174.7 | 123 ms | 5.7 |
| 4b-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | rag1 | 174.6 | 335 ms | 5.7 |
| 4b-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent4 | 172.6 | 4.20 s | 5.8 |
| 4b-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent1 | 172.5 | 228 ms | 5.8 |
| E4B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | chat1 | 148.2 | 59 ms | 6.7 |
| E4B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | chat1 | 147.3 | 69 ms | 6.8 |
| E4B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | codegen1 | 146.2 | 98 ms | 6.8 |
| E4B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | rag1 | 145.3 | 180 ms | 6.9 |
| E4B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | codegen1 | 145.1 | 96 ms | 6.9 |
| E4B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent4 | 144.8 | 5.62 s | 6.9 |
| E4B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | rag1 | 144.6 | 196 ms | 6.9 |
| E4B-it | 450 W max | 590 | cuda-4f13cb7 | baseline-pl-450w | agent1 | 144.4 | 148 ms | 6.9 |
| E4B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent1 | 143.9 | 138 ms | 6.9 |
| E4B-it | cap 350 W | 590 | cuda-4f13cb7 | baseline-pl-350w | agent4 | 143.8 | 5.68 s | 7.0 |
| 2.6B | cap 200 W | 590 | 59778f0 | baseline | agent4 | 126.2 | 434 ms | 7.9 |
| E4B-it | cap 200 W | 590 | 59778f0 | baseline | chat1 | 123.6 | 66 ms | 8.1 |
| 26B-A4B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | chat1 | 119.5 | 103 ms | 8.4 |
| E4B-it | cap 200 W | 590 | 59778f0 | baseline | rag1 | 119.3 | 307 ms | 8.4 |
| E4B-it | cap 200 W | 590 | 59778f0 | baseline | codegen1 | 118.7 | 101 ms | 8.4 |
| 26B-A4B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | codegen1 | 117.3 | 235 ms | 8.5 |
| 26B-A4B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | rag1 | 116.4 | 634 ms | 8.6 |
| 26B-A4B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent1 | 116.3 | 426 ms | 8.6 |
| E4B-it | cap 200 W | 590 | 59778f0 | baseline | agent1 | 116.2 | 171 ms | 8.6 |
| 26B-A4B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent4 | 116.0 | 7.71 s | 8.6 |
| E2B-it | cap 200 W | 590 | 59778f0 | baseline | agent4 | 101.6 | 1.04 s | 9.8 |
| E4B-it | cap 200 W | 590 | 59778f0 | baseline | agent4 | 64.0 | 1.77 s | 15.6 |
| 31B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | chat1 | 37.5 | 264 ms | 26.7 |
| 31B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | codegen1 | 36.5 | 847 ms | 27.4 |
| 31B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent1 | 36.2 | 1.90 s | 27.7 |
| 31B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | agent4 | 36.1 | 24.88 s | 27.7 |
| 31B-it | cap 300 W | 590 | cuda-4f13cb7 | baseline | rag1 | 36.1 | 2.43 s | 27.7 |

Decode tok/s - Headline speed metric
TTFT / TPOT - Latency context
Raw vs workload - Separate comparison contracts

Notes badge key
hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy - Older workload harness row.
350 W cap - Recorded GPU power limit.
drv 590 - GPU driver branch.
reasoning - Reasoning-token model.

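
The "hardware comparable" rule above can be sketched as a key comparison. This is a minimal illustration, not the site's actual logic: the `Row` type, its field names, and the use of strict equality are all assumptions (the badge description only asks that the fields "match closely").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical field names; the real row schema is not shown on this page.
    model: str
    quant: str
    backend: str
    driver_branch: str
    power_policy: str
    shape: str
    decode_tok_s: float


def comparison_key(row: Row) -> tuple:
    """Fields the badge key says must match before comparing GPUs."""
    return (
        row.model,
        row.quant,
        row.backend,
        row.driver_branch,
        row.power_policy,
        row.shape,
    )


def hardware_comparable(a: Row, b: Row) -> bool:
    """Strict-equality stand-in for 'match closely' (an assumption)."""
    return comparison_key(a) == comparison_key(b)


full_power = Row("1.2B", "Q4_K_M", "llama.cpp (cuda)", "590", "450 W max", "chat1", 617.7)
capped = Row("1.2B", "Q4_K_M", "llama.cpp (cuda)", "590", "cap 350 W", "chat1", 609.0)

# The power policy differs, so these two rows would not qualify.
print(hardware_comparable(full_power, capped))
```

Measured values such as decode tok/s are deliberately left out of the key: they are the thing being compared, not part of the matching criteria.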
Metric guide
Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.
TTFT - Time to first token. This includes prompt processing and server/API overhead.
TPOT / ITL - Time per output token after the first token. Lower is better.
Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.
Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.
Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
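
As a rough sketch of how these metrics relate, the snippet below converts a decode rate into a per-token time, estimates a full response time from TTFT and TPOT, and expresses a power cap as a fraction of the recorded max. The function names are hypothetical, and the assumption that TPOT is simply the reciprocal of decode tok/s is an approximation that happens to agree with the rounded values listed on this page; it is not a statement of how the site computes them.

```python
def tpot_ms(decode_tok_s: float) -> float:
    """Per-token time in milliseconds implied by a steady decode rate."""
    if decode_tok_s <= 0:
        raise ValueError("decode rate must be positive")
    return 1000.0 / decode_tok_s


def estimated_response_ms(ttft_ms: float, tpot: float, output_tokens: int) -> float:
    """TTFT covers the first token; every later token costs one TPOT."""
    return ttft_ms + max(output_tokens - 1, 0) * tpot


def cap_fraction(cap_w: float, max_w: float) -> float:
    """Power cap relative to the recorded maximum limit."""
    if max_w <= 0:
        raise ValueError("max power must be positive")
    return cap_w / max_w


# 617.7 tok/s is the fastest decode rate listed on this page.
print(round(tpot_ms(617.7), 1))          # 1.6
print(round(cap_fraction(350, 450), 2))  # 0.78
```

The response-time estimate ignores queueing and batching effects, so it is best read as a lower bound for single-stream runs.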