
Qwen2.5 14B-Instruct

Q4_K_M · 14B params · GGUF
checkpoint: Qwen2.5-14B-Instruct-Q4_K_M.gguf

All runs (12)

tags on every run: raw · hardware comparable

gpu                        power      drv  backend                     profile  workload            score
GeForce RTX 3090 · 24 GiB  450 W max  595  llama.cpp opt-build (cuda)  pl-450w  mixed_2048_768   1  78.9
GeForce RTX 3090 · 24 GiB  450 W max  595  llama.cpp opt-build (cuda)  pl-450w  mixed_64_1024    1  78.4
GeForce RTX 3090 · 24 GiB  450 W max  595  llama.cpp opt-build (cuda)  pl-450w  mixed_1024_1024  1  78.0
GeForce RTX 5070 · 12 GiB  cap 250 W  595  llama.cpp b9174 (cuda)      pl-250w  mixed_2048_768   1  64.6
GeForce RTX 5070 · 12 GiB  cap 200 W  595  llama.cpp b9174 (cuda)      pl-200w  mixed_2048_768   1  64.4
GeForce RTX 5070 · 12 GiB  cap 250 W  595  llama.cpp b9174 (cuda)      pl-250w  mixed_64_1024    1  64.4
GeForce RTX 5070 · 12 GiB  cap 250 W  595  llama.cpp b9174 (cuda)      pl-250w  mixed_1024_1024  1  64.4
GeForce RTX 5070 · 12 GiB  cap 200 W  595  llama.cpp b9174 (cuda)      pl-200w  mixed_64_1024    1  64.3
GeForce RTX 5070 · 12 GiB  cap 200 W  595  llama.cpp b9174 (cuda)      pl-200w  mixed_1024_1024  1  64.2
GeForce RTX 3090 · 24 GiB  cap 200 W  595  llama.cpp opt-build (cuda)  pl-200w  mixed_64_1024    1  38.7
GeForce RTX 3090 · 24 GiB  cap 200 W  595  llama.cpp opt-build (cuda)  pl-200w  mixed_1024_1024  1  37.9
GeForce RTX 3090 · 24 GiB  cap 200 W  595  llama.cpp opt-build (cuda)  pl-200w  mixed_2048_768   1  37.9
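
One way to read the run list is to compare each GPU's score at its lower power limit with its score at the higher one. The sketch below does that arithmetic on the scores listed above. The dictionary layout and the `retained` helper are illustrative names, not part of the benchmark tooling, and the score unit is whatever the page reports.

```python
# Scores copied from the run list (unit as reported by the page).
# Keys: (gpu, power limit in watts, workload name)
SCORES = {
    ("RTX 3090", 450, "mixed_2048_768"): 78.9,
    ("RTX 3090", 450, "mixed_64_1024"): 78.4,
    ("RTX 3090", 450, "mixed_1024_1024"): 78.0,
    ("RTX 3090", 200, "mixed_2048_768"): 37.9,
    ("RTX 3090", 200, "mixed_64_1024"): 38.7,
    ("RTX 3090", 200, "mixed_1024_1024"): 37.9,
    ("RTX 5070", 250, "mixed_2048_768"): 64.6,
    ("RTX 5070", 250, "mixed_64_1024"): 64.4,
    ("RTX 5070", 250, "mixed_1024_1024"): 64.4,
    ("RTX 5070", 200, "mixed_2048_768"): 64.4,
    ("RTX 5070", 200, "mixed_64_1024"): 64.3,
    ("RTX 5070", 200, "mixed_1024_1024"): 64.2,
}


def retained(gpu, low_w, high_w):
    """Return {workload: score at low_w / score at high_w} for one GPU."""
    out = {}
    for (g, watts, workload), score in SCORES.items():
        if g == gpu and watts == low_w:
            out[workload] = score / SCORES[(gpu, high_w, workload)]
    return out


for gpu, low_w, high_w in [("RTX 3090", 200, 450), ("RTX 5070", 200, 250)]:
    for workload, ratio in retained(gpu, low_w, high_w).items():
        print(f"{gpu} {low_w} W vs {high_w} W  {workload}: {ratio:.1%}")
```

On these numbers the RTX 3090 keeps a little under half of its score at 200 W, while the RTX 5070 changes by well under 1% between 250 W and 200 W.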

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
clocks: gfx 210 MHz · mem 405 MHz
temp: 35°C idle · 35°C peak
peak draw: 24 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap    theory    copy      fp16     bf16
200 W  936 GB/s  391 GB/s  65.4 TF  65.4 TF
300 W  936 GB/s  391 GB/s  65.4 TF  65.3 TF
450 W  936 GB/s  391 GB/s  65.4 TF  65.4 TF
compute: 8.6
backend: llama.cpp opt-build (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
clocks: gfx 240 MHz · mem 5001 MHz
temp: 50°C idle · 50°C peak
peak draw: 112 W
backend: llama.cpp opt-build (cuda)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-4-pve
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.12.3
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false

GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 200 W / 300 W max (67% cap)
clocks: gfx 180 MHz · mem 405 MHz
temp: 31°C idle · 31°C peak
peak draw: 1 W
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap    theory    copy      fp16     bf16
200 W  672 GB/s  271 GB/s  67.9 TF  68.4 TF
250 W  672 GB/s  271 GB/s  69.5 TF  68.2 TF
300 W  672 GB/s  270 GB/s  69.6 TF  68.4 TF
compute: 12
backend: llama.cpp b9174 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
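
The "theory" bandwidth and the "copy % of theory" headline in the hardware probes appear to follow from the bus width and the MHz figure shown beside it. The sketch below reproduces them under two assumptions that the page does not state: the MHz value is the memory clock, and two transfers happen per clock. The function and variable names are illustrative.

```python
def theoretical_bandwidth_gbs(bus_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption, chosen because it reproduces
    the "theory" column of the probe tables; the page does not state it.
    """
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


# (bus width in bits, MHz figure, measured copy in GB/s) from the probe rows
PROBES = {
    "RTX 3090": (384, 9751, 391),
    "RTX 5070": (192, 14001, 271),
}

for gpu, (bus_bits, clock_mhz, copy_gbs) in PROBES.items():
    theory = theoretical_bandwidth_gbs(bus_bits, clock_mhz)
    print(f"{gpu}: theory {theory:.0f} GB/s, copy {copy_gbs / theory:.0%} of theory")
# RTX 3090: theory 936 GB/s, copy 42% of theory
# RTX 5070: theory 672 GB/s, copy 40% of theory
```
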
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
clocks: gfx 2910 MHz · mem 14001 MHz
temp: 50°C idle · 50°C peak
peak draw: 36 W
backend: llama.cpp b9174 (cuda)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
python: 3.14.4
runs/cell: 3
warmups: 0
endpoint: llama-bench
streaming: false
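
For readers who want to try a comparable cell, the sketch below builds a `llama-bench` command line from a workload name. It is a guess at how such a run could be launched, not the harness behind this page. It assumes a workload named `mixed_<prompt>_<gen>` maps to llama-bench's `-pg <prompt>,<gen>` test and that runs/cell maps to `-r`. It does not set the power limit and does not reproduce the warmups: 0 setting. The `bench_command` helper is hypothetical.

```python
import shlex


def bench_command(model_path, workload, repetitions=3):
    """Build a llama-bench argv for a workload named mixed_<prompt>_<gen>.

    The workload-to-flag mapping is an assumption, not taken from the page.
    """
    kind, prompt_tokens, gen_tokens = workload.split("_")
    if kind != "mixed":
        raise ValueError(f"unexpected workload name: {workload}")
    return [
        "llama-bench",
        "-m", model_path,                                  # GGUF checkpoint
        "-pg", f"{int(prompt_tokens)},{int(gen_tokens)}",  # prompt + generation test
        "-r", str(repetitions),                            # repetitions per test
        "-o", "json",                                      # machine-readable output
    ]


cmd = bench_command("Qwen2.5-14B-Instruct-Q4_K_M.gguf", "mixed_2048_768")
print(shlex.join(cmd))
```

The command is only printed here. Running it requires a local llama.cpp build and the checkpoint file, and a power cap such as the pl-200w profile would have to be applied separately (for example with `nvidia-smi -pl 200`, which needs administrator rights).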