Gemma-3 4b-it

Q4_K_M · 4B params · GGUF
checkpoint: unsloth/gemma-3-4b-it-GGUF:gemma-3-4b-it-Q4_K_M.gguf

All runs (134)

raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_4096_256 · 1
197.2
197.2 · 4096 · 256
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_1024_1024 · 1
189.9
189.9 · 1024 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · tg_1024 · 1
189.8
189.8 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_384_1152 · 1
189.0
189.0 · 384 · 1152
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_2048_768 · 1
189.0
189.0 · 2048 · 768
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_1024_16 · 1
189.0
189.0 · 1024 · 16
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · tg_512 · 1
188.7
188.7 · 512
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_64_1024 · 1
188.1
188.1 · 64 · 1024
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r3 · chat · 1
186.7
167.8 · 818.5 · 36ms · 5.4 · 29 · 100 · 566ms · 0.000 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · tg_128 · 1
186.4
186.4 · 128
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595 · chat · 1
186.2
170.5 · 835.9 · 37ms · 5.4 · 29 · 100 · 569ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r2 · chat · 1
185.0
171.5 · 1133.8 · 26ms · 5.4 · 29 · 100 · 583ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · chat · 1
184.7
167.9 · 788.5 · 38ms · 5.4 · 29 · 100 · 573ms · 0.000 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_1280_3072 · 1
183.8
183.8 · 1280 · 3072
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_16_1536 · 1
183.7
183.7 · 16 · 1536
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-350w-595-r2 · chat · 1
183.4
157.4 · 1047.1 · 29ms · 5.5 · 29 · 100 · 635ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595 · codegen · 1
183.1
171.8 · 541.8 · 132ms · 5.5 · 64 · 1000 · 5.82s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r3 · codegen · 1
182.6
171.6 · 555.0 · 119ms · 5.5 · 64 · 1000 · 5.83s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · chat · 1
182.2
164.4 · 816.3 · 38ms · 5.5 · 29 · 100 · 582ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r3 · rag · 1
181.9
101.9 · 1952.2 · 339ms · 5.5 · 846 · 70 · 689ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595 · rag · 1
181.6
101.2 · 1976.0 · 392ms · 5.5 · 846 · 70 · 779ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r2 · codegen · 1
181.5
172.2 · 2305.3 · 29ms · 5.5 · 64 · 1000 · 5.81s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · codegen · 1
181.3
169.4 · 539.9 · 129ms · 5.5 · 64 · 1000 · 5.90s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595 · agent · 1
181.3
157.6 · 2277.1 · 227ms · 5.5 · 611 · 436 · 2.76s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595 · agent · 4
181.1
58.8 · 185.6 · 4.20s · 5.5 · 611 · 436 · 6.28s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r3 · agent · 1
180.8
154.1 · 2265.1 · 228ms · 5.5 · 611 · 436 · 2.78s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · rag · 1
180.6
95.5 · 1871.9 · 339ms · 5.5 · 846 · 70 · 688ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r3 · agent · 4
180.6
68.9 · 183.2 · 3.46s · 5.5 · 611 · 436 · 6.20s · 0.000 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · mixed_2048_256 · 1
180.3
180.3 · 2048 · 256
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r2 · rag · 1
180.2
112.8 · 5385.3 · 231ms · 5.5 · 846 · 70 · 674ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r2 · agent · 1
180.1
154.2 · 2550.4 · 203ms · 5.6 · 611 · 436 · 2.76s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · agent · 1
180.1
152.8 · 2316.2 · 223ms · 5.6 · 611 · 436 · 2.77s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · agent · 4
179.8
65.9 · 152.1 · 3.87s · 5.6 · 611 · 436 · 6.77s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-350w-595-r2 · codegen · 1
179.8
171.5 · 2058.6 · 29ms · 5.6 · 64 · 1000 · 5.83s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-450w-595-r2 · agent · 4
179.7
63.7 · 206.8 · 3.72s · 5.6 · 611 · 436 · 5.99s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · chat · 1
179.1
160.0 · 807.9 · 38ms · 5.6 · 29 · 100 · 597ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-350w-595-r2 · rag · 1
178.9
111.4 · 5753.7 · 235ms · 5.6 · 846 · 70 · 649ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · codegen · 1
178.6
166.7 · 502.5 · 127ms · 5.6 · 64 · 1000 · 6.00s · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_4096_256 · 1
177.7
177.7 · 4096 · 256
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · rag · 1
177.7
101.6 · 1932.8 · 387ms · 5.6 · 846 · 70 · 792ms · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_2048_256 · 1
177.7
177.7 · 2048 · 256
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · agent · 1
177.7
153.2 · 2267.3 · 228ms · 5.6 · 611 · 436 · 2.81s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-350w-595-r2 · agent · 1
177.4
154.2 · 2369.8 · 218ms · 5.6 · 611 · 436 · 2.84s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-350w-595-r2 · agent · 4
177.3
73.4 · 169.2 · 4.01s · 5.6 · 611 · 436 · 6.94s · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · tg_128 · 1
177.1
177.1 · 128
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · agent · 4
177.1
54.6 · 168.7 · 4.16s · 5.6 · 611 · 436 · 6.39s · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · tg_512 · 1
176.5
176.5 · 512
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_1024_16 · 1
176.2
176.2 · 1024 · 16
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_2048_768 · 1
175.9
175.9 · 2048 · 768
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_1024_1024 · 1
175.1
175.1 · 1024 · 1024
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_64_1024 · 1
175.1
175.1 · 64 · 1024
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · tg_1024 · 1
175.0
175.0 · 1024
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · codegen · 1
174.7
160.9 · 531.5 · 123ms · 5.7 · 64 · 1000 · 6.21s · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_384_1152 · 1
174.7
174.7 · 384 · 1152
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · rag · 1
174.6
97.1 · 2050.5 · 335ms · 5.7 · 846 · 70 · 701ms · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_16_1536 · 1
173.6
173.6 · 16 · 1536
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 4
172.6
60.8 · 153.6 · 4.20s · 5.8 · 611 · 436 · 7.07s · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp cuda-1a68ec9 (cuda) · baseline · chat · 1
172.6
166.9 · 1377.8 · 21ms · 5.8 · 29 · 100 · 595ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 1
172.5
143.7 · 2266.9 · 228ms · 5.8 · 611 · 436 · 2.84s · 0.000 GiB
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · mixed_1280_3072 · 1
171.1
171.1 · 1280 · 3072
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp cuda-1a68ec9 (cuda) · baseline · codegen · 1
169.5
168.7 · 2014.9 · 35ms · 5.9 · 64 · 1000 · 5.93s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 250 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-250w · chat · 1
169.2
149.8 · 807.8 · 37ms · 5.9 · 29 · 100 · 645ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp cuda-1a68ec9 (cuda) · baseline · rag · 1
168.4
143.0 · 10212.2 · 83ms · 5.9 · 846 · 67 · 532ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp cuda-1a68ec9 (cuda) · baseline · agent · 1
168.4
163.7 · 10203.9 · 60ms · 5.9 · 611 · 376 · 2.29s · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp cuda-1a68ec9 (cuda) · baseline · agent · 4
168.3
66.4 · 212.6 · 3.87s · 5.9 · 611 · 376 · 6.17s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 250 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-250w · codegen · 1
164.9
155.2 · 499.9 · 131ms · 6.1 · 64 · 1000 · 6.45s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 250 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-250w · rag · 1
164.6
95.1 · 1923.9 · 328ms · 6.1 · 846 · 70 · 702ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 250 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-250w · agent · 4
163.5
63.7 · 155.6 · 4.37s · 6.1 · 611 · 436 · 7.49s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 250 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-250w · agent · 1
163.4
139.9 · 2172.2 · 238ms · 6.1 · 611 · 436 · 3.12s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-200w-595-r2 · chat · 1
142.9
130.3 · 744.7 · 41ms · 7.0 · 29 · 100 · 759ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-200w · chat · 1
142.9
128.5 · 652.1 · 45ms · 7.0 · 29 · 100 · 742ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-200w-595-r2 · rag · 1
138.1
84.7 · 1960.4 · 372ms · 7.2 · 846 · 70 · 807ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-200w · rag · 1
133.5
84.0 · 1908.2 · 340ms · 7.5 · 846 · 70 · 803ms · 0.000 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_4096_256 · 1
129.9
129.9 · 4096 · 256
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_1024_16 · 1
129.1
129.1 · 1024 · 16
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-200w · agent · 1
128.7
111.1 · 2080.5 · 249ms · 7.8 · 611 · 436 · 3.91s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-200w-595-r2 · agent · 1
128.4
111.2 · 2104.8 · 246ms · 7.8 · 611 · 436 · 3.91s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-200w · agent · 4
128.2
53.0 · 121.7 · 5.41s · 7.8 · 611 · 436 · 9.51s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-200w · codegen · 1
127.9
120.9 · 540.1 · 127ms · 7.8 · 64 · 1000 · 8.27s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-200w-595-r2 · agent · 4
126.9
45.2 · 129.3 · 5.64s · 7.9 · 611 · 436 · 9.51s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp cuda-3e12fbd (cuda) · baseline-pl-200w-595-r2 · codegen · 1
126.7
119.3 · 552.2 · 125ms · 7.9 · 64 · 1000 · 8.38s · 0.000 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · tg_128 · 1
122.9
122.9 · 128
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_2048_256 · 1
122.2
122.2 · 2048 · 256
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · tg_512 · 1
120.9
120.9 · 512
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_2048_768 · 1
120.0
120.0 · 2048 · 768
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · tg_1024 · 1
119.9
119.9 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_64_1024 · 1
119.9
119.9 · 64 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_1024_1024 · 1
119.9
119.9 · 1024 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_384_1152 · 1
118.8
118.8 · 384 · 1152
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_16_1536 · 1
118.5
118.5 · 16 · 1536
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · mixed_1280_3072 · 1
117.6
117.6 · 1280 · 3072
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b1203 (rocm) · baseline · chat · 1
66.3
64.2 · 59ms · 15.1 · 100 · 1.56s · 0.001 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b1203 (rocm) · baseline · codegen · 1
65.0
64.6 · 99ms · 15.4 · 1000 · 15.47s · 0.002 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b1203 (rocm) · baseline · agent · 1
64.5
61.3 · 426ms · 15.5 · 354 · 5.97s · 0.002 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b1203 (rocm) · baseline · rag · 1
63.7
55.5 · 325ms · 15.7 · 67 · 1.60s · 0.002 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b1203 (rocm) · baseline · agent · 4
22.2
17.8 · 3.36s · 45.0 · 376 · 21.06s · 0.001 GiB
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · pp_512 · 1 · 9361.9 · 512
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · pp_1024 · 1 · 9786.0 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · pp_2048 · 1 · 9515.6 · 2048
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2-pl450 · pp_4096 · 1 · 9748.3 · 4096
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · pp_512 · 1 · 6459.8 · 512
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · pp_1024 · 1 · 6494.8 · 1024
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · pp_2048 · 1 · 5858.7 · 2048
raw · hardware comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595
llama.cpp llama.cpp-3e12fbd (cuda) · raw-v4-r2 · pp_4096 · 1 · 5699.6 · 4096
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · pp_512 · 1 · 9445.5 · 512
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · pp_1024 · 1 · 9711.4 · 1024
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · pp_2048 · 1 · 9679.0 · 2048
raw · hardware comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp llama.cpp-1a68ec9 (cuda) · raw-v4-r2 · pp_4096 · 1 · 9626.0 · 4096
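
The raw rows above are keyed by cell names such as mixed_4096_256, tg_128 and pp_512. A minimal Python sketch of how such names could be decoded follows. It assumes the llama-bench style reading (pp = prompt processing, tg = text generation, mixed = prompt tokens then generated tokens); that reading matches the token counts listed beside each row but is not stated on this page, and parse_cell and ms_per_token are hypothetical helper names.

```python
import re

# Assumed naming scheme (not confirmed by the page):
#   pp_<prompt>          prompt-processing only
#   tg_<gen>             text-generation only
#   mixed_<prompt>_<gen> prompt followed by generation
CELL_RE = re.compile(r"(pp|tg|mixed)_(\d+)(?:_(\d+))?")


def parse_cell(name):
    """Return (kind, prompt_tokens, generated_tokens) for a cell name."""
    match = CELL_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised cell name: {name!r}")
    kind, first, second = match.group(1), int(match.group(2)), match.group(3)
    if kind == "mixed":
        if second is None:
            raise ValueError(f"mixed cell needs two counts: {name!r}")
        return kind, first, int(second)
    if second is not None:
        raise ValueError(f"{kind} cell takes one count: {name!r}")
    # pp has no generated tokens; tg has no prompt tokens.
    return (kind, first, 0) if kind == "pp" else (kind, 0, first)


def ms_per_token(tokens_per_second):
    """Convert a decode rate in tokens/s to milliseconds per token."""
    return 1000.0 / tokens_per_second


if __name__ == "__main__":
    print(parse_cell("mixed_4096_256"))
    print(parse_cell("tg_128"))
    print(round(ms_per_token(186.7), 1))
```

The 5.4 figure in the first chat row is consistent with converting its 186.7 headline rate this way, although that column is unlabeled here.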

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1965/2100 MHz · mem 9501 MHz
temp: 44°C idle · 46°C peak
peak draw: 196 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
200 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
300 W   936 GB/s   391 GB/s   65.4 TF   65.3 TF
450 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
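
The "theory", "copy % of theory" and "% cap" figures in these environment cards can be reproduced with simple arithmetic. The Python sketch below is an illustration under stated assumptions, not the dashboard's own code: the transfers-per-clock multiplier (2 for the NVIDIA cards, 1 for Strix Halo) is inferred from the listed bus widths, memory clocks and theory values, and the function names are hypothetical.

```python
def theoretical_bandwidth_gbs(bus_width_bits, mem_clock_mhz, transfers_per_clock):
    """Peak memory bandwidth in GB/s from bus width and reported memory clock.

    transfers_per_clock is an inferred assumption: 2 reproduces the listed
    theory values for the GeForce cards, 1 reproduces the Strix Halo value.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


def percent(part, whole):
    """Rounded percentage, as shown in the '% of theory' and '% cap' labels."""
    return round(100 * part / whole)


if __name__ == "__main__":
    # (bus width bits, memory clock MHz, assumed multiplier, measured copy GB/s)
    probes = {
        "GeForce RTX 3090": (384, 9751, 2, 391),
        "GeForce RTX 5070": (192, 14001, 2, 271),
        "Strix Halo": (256, 8000, 1, 106),
    }
    for name, (bits, mhz, mult, copy_gbs) in probes.items():
        theory = theoretical_bandwidth_gbs(bits, mhz, mult)
        print(f"{name}: theory {theory:.0f} GB/s, copy {percent(copy_gbs, theory)}% of theory")

    # Power cap as a share of the board maximum, e.g. 200 W of 450 W.
    print(f"200 W / 450 W cap: {percent(200, 450)}%")
```

Under these assumptions the sketch gives 936, 672 and 256 GB/s and copy ratios of 42%, 40% and 41%, matching the probe summaries on this page.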
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 250 W / 450 W max (56% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 47°C idle · 51°C peak
peak draw: 243 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 54°C idle · 62°C peak
peak draw: 335 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1965/2100 MHz · mem 9501 MHz
temp: 60°C idle · 77°C peak
peak draw: 433 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 39°C idle · 44°C peak
peak draw: 196 W
backend: llama.cpp cuda-3e12fbd (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.39
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 43°C idle · 60°C peak
peak draw: 337 W
backend: llama.cpp cuda-3e12fbd (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.39
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 52°C idle · 72°C peak
peak draw: 424 W
backend: llama.cpp cuda-3e12fbd (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.39
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 300 W / 450 W max (67% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1950/2100 MHz · mem 9501 MHz
temp: 37°C idle · 64°C peak
peak draw: 291 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.5 GiB)
power: 250 W / 300 W max (83% cap)
pcie: Gen 1 x16 / Gen 4 x16 max
clocks: gfx 180/3090 MHz · mem 405 MHz
temp: 31°C idle · 64°C peak
peak draw: 194 W
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
200 W   672 GB/s   271 GB/s   67.9 TF   68.4 TF
250 W   672 GB/s   271 GB/s   69.5 TF   68.2 TF
300 W   672 GB/s   270 GB/s   69.6 TF   68.4 TF
compute: 12
backend: llama.cpp cuda-1a68ec9 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.43
python: 3.14.4
build flags: GGML_CUDA=ON CMAKE_CUDA_ARCHITECTURES=120 CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
fixed   256 GB/s   106 GB/s   30.3 TF   -
compute: 11.5
backend: llama.cpp b1203 (rocm)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.5 GiB)
power: 250 W / 300 W max (83% cap)
pcie: Gen 1 x16 / Gen 4 x16 max
clocks: gfx 180/3090 MHz · mem 405 MHz
temp: 39°C idle · 62°C peak
peak draw: 175 W
backend: llama.cpp vulkan-1a68ec9 (vulkan)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.43
python: 3.14.4
llama.cpp: version: 1 (1a68ec9) built with GNU 15.2.1 for Linux x86_64
build flags: GGML_VULKAN=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
clocks: gfx 210 MHz · mem 405 MHz
temp: 34°C idle · 34°C peak
peak draw: 24 W
backend: llama.cpp llama.cpp-3e12fbd (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
clocks: gfx 210 MHz · mem 405 MHz
temp: 40°C idle · 40°C peak
peak draw: 26 W
backend: llama.cpp llama.cpp-3e12fbd (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
clocks: gfx 180 MHz · mem 405 MHz
temp: 30°C idle · 30°C peak
peak draw: 1 W
backend: llama.cpp llama.cpp-1a68ec9 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
clocks: gfx 180 MHz · mem 405 MHz
temp: 32°C idle · 32°C peak
peak draw: 2 W
backend: llama.cpp llama.cpp-b9174 (vulkan)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false