Qwen2.5 14B-Instruct
Q4_K_M · 14B params · GGUF
intelligence: see Artificial Analysis
checkpoint: Qwen2.5-14B-Instruct-Q4_K_M.gguf

All runs (12)
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_2048_768 | 1 | 78.9 | 78.9 | — | — | — | — | — | 2048 | 768 | — | — |
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_64_1024 | 1 | 78.4 | 78.4 | — | — | — | — | — | 64 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_1024_1024 | 1 | 78.0 | 78.0 | — | — | — | — | — | 1024 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_2048_768 | 1 | 64.6 | 64.6 | — | — | — | — | — | 2048 | 768 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_2048_768 | 1 | 64.4 | 64.4 | — | — | — | — | — | 2048 | 768 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_64_1024 | 1 | 64.4 | 64.4 | — | — | — | — | — | 64 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_1024_1024 | 1 | 64.4 | 64.4 | — | — | — | — | — | 1024 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_64_1024 | 1 | 64.3 | 64.3 | — | — | — | — | — | 64 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_1024_1024 | 1 | 64.2 | 64.2 | — | — | — | — | — | 1024 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_64_1024 | 1 | 38.7 | 38.7 | — | — | — | — | — | 64 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_1024_1024 | 1 | 37.9 | 37.9 | — | — | — | — | — | 1024 | 1024 | — | — |
| raw | hardware comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_2048_768 | 1 | 37.9 | 37.9 | — | — | — | — | — | 2048 | 768 | — | — |
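
The twelve rows above can be condensed into one figure per GPU and power limit. The sketch below is illustrative only: it assumes the first numeric speed column is decode throughput in tokens per second (the table does not label it), and the names `RUNS`, `mean_speed_by_cap`, and `retained_fraction` are hypothetical helpers, not part of the benchmark tooling.

```python
from collections import defaultdict
from statistics import mean

# (gpu, power limit in W, workload, reported speed), copied from the run table.
# Assumption: the speed column is decode throughput in tokens/s.
RUNS = [
    ("RTX 3090", 450, "mixed_2048_768", 78.9),
    ("RTX 3090", 450, "mixed_64_1024", 78.4),
    ("RTX 3090", 450, "mixed_1024_1024", 78.0),
    ("RTX 5070", 250, "mixed_2048_768", 64.6),
    ("RTX 5070", 200, "mixed_2048_768", 64.4),
    ("RTX 5070", 250, "mixed_64_1024", 64.4),
    ("RTX 5070", 250, "mixed_1024_1024", 64.4),
    ("RTX 5070", 200, "mixed_64_1024", 64.3),
    ("RTX 5070", 200, "mixed_1024_1024", 64.2),
    ("RTX 3090", 200, "mixed_64_1024", 38.7),
    ("RTX 3090", 200, "mixed_1024_1024", 37.9),
    ("RTX 3090", 200, "mixed_2048_768", 37.9),
]


def mean_speed_by_cap(runs):
    """Average the reported speed over workloads for each (gpu, power limit)."""
    groups = defaultdict(list)
    for gpu, cap_w, _workload, speed in runs:
        groups[(gpu, cap_w)].append(speed)
    return {key: mean(values) for key, values in groups.items()}


def retained_fraction(means, gpu, low_cap_w, high_cap_w):
    """Fraction of the higher-limit speed that survives at the lower limit."""
    return means[(gpu, low_cap_w)] / means[(gpu, high_cap_w)]


means = mean_speed_by_cap(RUNS)
for (gpu, cap_w), speed in sorted(means.items()):
    print(f"{gpu} @ {cap_w} W: {speed:.1f}")

retained_3090 = retained_fraction(means, "RTX 3090", 200, 450)
retained_5070 = retained_fraction(means, "RTX 5070", 200, 250)
print(f"RTX 3090 keeps {retained_3090:.0%} of its speed at 200 W vs 450 W")
print(f"RTX 5070 keeps {retained_5070:.0%} of its speed at 200 W vs 250 W")
```

Under that assumption, the RTX 3090 loses roughly half of its speed when limited to 200 W, while the RTX 5070 is nearly unchanged between 200 W and 250 W.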
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
clocks: gfx 210 MHz · mem 405 MHz
temp: 35°C idle · 35°C peak
peak draw: 24 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
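
The "theory" column and the "copy 42% of theory" summary can be reproduced from the bus width and memory clock listed with the probes. The sketch below assumes a double-data-rate factor of 2 on the reported memory clock; the page does not state this, but it matches the 936 GB/s and 672 GB/s figures. The function names are hypothetical.

```python
def theoretical_bandwidth_gbs(bus_width_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s.

    Assumption: the reported memory clock is doubled (transfers_per_clock=2)
    to get the effective transfer rate; this is inferred from the data.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


def copy_efficiency(measured_gbs, theory_gbs):
    """Measured copy bandwidth as a fraction of the theoretical peak."""
    return measured_gbs / theory_gbs


# RTX 3090: 384-bit bus, 9751 MHz memory clock, 391 GB/s measured copy.
theory_3090 = theoretical_bandwidth_gbs(384, 9751)
eff_3090 = copy_efficiency(391, theory_3090)

# RTX 5070: 192-bit bus, 14001 MHz memory clock, 271 GB/s measured copy.
theory_5070 = theoretical_bandwidth_gbs(192, 14001)
eff_5070 = copy_efficiency(271, theory_5070)

print(f"RTX 3090: {theory_3090:.0f} GB/s theory, copy at {eff_3090:.0%}")
print(f"RTX 5070: {theory_5070:.0f} GB/s theory, copy at {eff_5070:.0%}")
```

With these inputs the computed peaks round to 936 GB/s and 672 GB/s, and the copy efficiencies round to 42% and 40%, in line with the probe summaries for the two GPUs.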
compute: 8.6
backend: llama.cpp opt-build (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
clocks: gfx 240 MHz · mem 5001 MHz
temp: 50°C idle · 50°C peak
peak draw: 112 W
backend: llama.cpp opt-build (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 200 W / 300 W max (67% cap)
clocks: gfx 180 MHz · mem 405 MHz
temp: 31°C idle · 31°C peak
peak draw: 1 W
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 672 GB/s | 271 GB/s | 67.9 TF | 68.4 TF |
| 250 W | 672 GB/s | 271 GB/s | 69.5 TF | 68.2 TF |
| 300 W | 672 GB/s | 270 GB/s | 69.6 TF | 68.4 TF |
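
The "copy/math spread 2.5%" summary appears to describe how much the probe results vary across power caps. The page does not define the metric; the sketch below assumes it is the largest (max − min) / min over the copy, fp16, and bf16 columns, a definition that happens to reproduce 2.5% from the table. `PROBES_5070` and `relative_spread` are hypothetical names.

```python
# Probe results for the RTX 5070 at 200 W, 250 W and 300 W, copied from the table.
PROBES_5070 = {
    "copy_gbs": [271, 271, 270],
    "fp16_tf": [67.9, 69.5, 69.6],
    "bf16_tf": [68.4, 68.2, 68.4],
}


def relative_spread(values):
    """(max - min) / min; an assumed definition, not confirmed by the source."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest


spreads = {name: relative_spread(vals) for name, vals in PROBES_5070.items()}
worst_name = max(spreads, key=spreads.get)

for name, spread in spreads.items():
    print(f"{name}: {spread:.2%}")
print(f"largest spread: {worst_name} at {spreads[worst_name]:.1%}")
```

Under this definition the fp16 column dominates at about 2.5%, while copy bandwidth and bf16 throughput move by well under 1% between caps.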
compute: 12
backend: llama.cpp b9174 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
clocks: gfx 2910 MHz · mem 14001 MHz
temp: 50°C idle · 50°C peak
peak draw: 36 W
backend: llama.cpp b9174 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false