Qwen2.5 14B-Instruct

Q4_K_M · 14B params · GGUF

checkpoint: Qwen2.5-14B-Instruct-Q4_K_M.gguf

All runs (12)


raw	hardware comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 595	llama.cpp opt-build (cuda)	pl-450w	mixed_2048_768	1	78.9	78.9	—	—	—	—	—	2048	768	—	—
raw	hardware comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 595	llama.cpp opt-build (cuda)	pl-450w	mixed_64_1024	1	78.4	78.4	—	—	—	—	—	64	1024	—	—
raw	hardware comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 595	llama.cpp opt-build (cuda)	pl-450w	mixed_1024_1024	1	78.0	78.0	—	—	—	—	—	1024	1024	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (cuda)	pl-250w	mixed_2048_768	1	64.6	64.6	—	—	—	—	—	2048	768	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595	llama.cpp b9174 (cuda)	pl-200w	mixed_2048_768	1	64.4	64.4	—	—	—	—	—	2048	768	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (cuda)	pl-250w	mixed_64_1024	1	64.4	64.4	—	—	—	—	—	64	1024	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (cuda)	pl-250w	mixed_1024_1024	1	64.4	64.4	—	—	—	—	—	1024	1024	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595	llama.cpp b9174 (cuda)	pl-200w	mixed_64_1024	1	64.3	64.3	—	—	—	—	—	64	1024	—	—
raw	hardware comparable	GeForce RTX 5070 · 12 GiB · cap 200 W · drv 595	llama.cpp b9174 (cuda)	pl-200w	mixed_1024_1024	1	64.2	64.2	—	—	—	—	—	1024	1024	—	—
raw	hardware comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_64_1024	1	38.7	38.7	—	—	—	—	—	64	1024	—	—
raw	hardware comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_1024_1024	1	37.9	37.9	—	—	—	—	—	1024	1024	—	—
raw	hardware comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_2048_768	1	37.9	37.9	—	—	—	—	—	2048	768	—	—
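
To read the effect of the power cap off the runs table, the rows can be averaged per GPU and cap and then compared. The sketch below is illustrative only: it assumes the first result column (78.9, 78.4, ...) is the headline score, whose unit the table does not state, and the names `RUNS` and `mean_by_gpu_and_cap` are hypothetical helpers, not part of any benchmark tool.

```python
from collections import defaultdict
from statistics import mean

# (gpu, power cap in W, workload, first result column) copied from the runs table.
# Assumption: the first result column is the headline score; its unit is not stated.
RUNS = [
    ("RTX 3090", 450, "mixed_2048_768", 78.9),
    ("RTX 3090", 450, "mixed_64_1024", 78.4),
    ("RTX 3090", 450, "mixed_1024_1024", 78.0),
    ("RTX 5070", 250, "mixed_2048_768", 64.6),
    ("RTX 5070", 200, "mixed_2048_768", 64.4),
    ("RTX 5070", 250, "mixed_64_1024", 64.4),
    ("RTX 5070", 250, "mixed_1024_1024", 64.4),
    ("RTX 5070", 200, "mixed_64_1024", 64.3),
    ("RTX 5070", 200, "mixed_1024_1024", 64.2),
    ("RTX 3090", 200, "mixed_64_1024", 38.7),
    ("RTX 3090", 200, "mixed_1024_1024", 37.9),
    ("RTX 3090", 200, "mixed_2048_768", 37.9),
]


def mean_by_gpu_and_cap(runs):
    """Average the reported value over workloads for each (gpu, cap) pair."""
    groups = defaultdict(list)
    for gpu, cap_w, _workload, value in runs:
        groups[(gpu, cap_w)].append(value)
    return {key: mean(values) for key, values in groups.items()}


means = mean_by_gpu_and_cap(RUNS)

# Ratio of the low-cap mean to the high-cap mean for each GPU.
ratio_3090 = means[("RTX 3090", 200)] / means[("RTX 3090", 450)]
ratio_5070 = means[("RTX 5070", 200)] / means[("RTX 5070", 250)]

print(f"RTX 3090: 200 W / 450 W = {ratio_3090:.2f}")
print(f"RTX 5070: 200 W / 250 W = {ratio_5070:.2f}")
```

On these rows the RTX 3090 at the 200 W cap reaches roughly half of its 450 W result, while the RTX 5070 is essentially unchanged between 200 W and 250 W.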

Environment

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 200 W / 450 W max (44% cap)

clocks: gfx 210 MHz · mem 405 MHz

temp: 35°C idle · 35°C peak

peak draw: 24 W

hardware probes

copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps

384-bit · 9751 MHz · 82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
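
The "theory" column and the "copy 42% of theory" summary can be reproduced from the bus width and memory clock listed for this card. This is a minimal sketch, not the dashboard's own method: the factor of two transfers per listed clock is an assumption, chosen because it reproduces the 936 GB/s shown in the table, and `theoretical_bandwidth_gbs` is a hypothetical helper name.

```python
def theoretical_bandwidth_gbs(bus_width_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (10^9 bytes per second).

    transfers_per_clock=2 is an assumption that matches the page's
    "theory" column; the page itself does not state the formula.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


# Values listed for the RTX 3090: 384-bit bus, 9751 MHz memory clock.
theory = theoretical_bandwidth_gbs(384, 9751)

# Measured copy bandwidth from the probe table (identical at every cap).
copy_fraction = 391 / theory

print(f"theory = {theory:.0f} GB/s, copy = {copy_fraction:.0%} of theory")
```

With the listed values this gives about 936 GB/s and a copy rate of about 42% of theory, in line with the summary line.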

compute: 8.6

backend: llama.cpp opt-build (cuda)

os: Ubuntu 24.04 LTS

kernel: 7.0.2-4-pve

driver: NVIDIA 595.71.05 + CUDA 13.2

python: 3.12.3

runs/cell: 3

warmups: 0

endpoint: llama-bench

streaming: false

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

clocks: gfx 240 MHz · mem 5001 MHz

temp: 50°C idle · 50°C peak

peak draw: 112 W

backend: llama.cpp opt-build (cuda)

os: Ubuntu 24.04 LTS

kernel: 7.0.2-4-pve

driver: NVIDIA 595.71.05 + CUDA 13.2

python: 3.12.3

runs/cell: 3

warmups: 0

endpoint: llama-bench

streaming: false

GeForce RTX 5070 · 12 GiB

cpu: AMD Ryzen 9 7900 12-Core Processor

gpu: NVIDIA GeForce RTX 5070

arch: NVIDIA

vram: 11.94 GiB (system 30.4 GiB)

power: 200 W / 300 W max (67% cap)

clocks: gfx 180 MHz · mem 405 MHz

temp: 31°C idle · 31°C peak

peak draw: 1 W

hardware probes

copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%

192-bit · 14001 MHz · 48 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	672 GB/s	271 GB/s	67.9 TF	68.4 TF
250 W	672 GB/s	271 GB/s	69.5 TF	68.2 TF
300 W	672 GB/s	270 GB/s	69.6 TF	68.4 TF
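
The summary line reports a "copy/math spread" of 2.5% without defining it. The sketch below shows one plausible reading: the spread of each column across power caps is taken as (max - min) / min, and the headline is the largest of those. That definition is an assumption, and `PROBES` and `relative_spread` are hypothetical names used only for illustration.

```python
# Probe results per power cap (W) for the RTX 5070, copied from the table above.
# copy is in GB/s; fp16 and bf16 are in TF.
PROBES = {
    200: {"copy": 271, "fp16": 67.9, "bf16": 68.4},
    250: {"copy": 271, "fp16": 69.5, "bf16": 68.2},
    300: {"copy": 270, "fp16": 69.6, "bf16": 68.4},
}


def relative_spread(values):
    """Assumed definition of spread: (max - min) / min."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest


# Spread of each metric across the three caps.
spreads = {
    metric: relative_spread([row[metric] for row in PROBES.values()])
    for metric in ("copy", "fp16", "bf16")
}

# The metric that varies most with the power cap.
worst = max(spreads, key=spreads.get)

for metric, spread in spreads.items():
    print(f"{metric}: {spread:.1%}")
print(f"largest spread: {worst}")
```

Under this definition the fp16 column varies by about 2.5% across caps, while copy and bf16 each vary by well under 1%.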

compute: 12

backend: llama.cpp b9174 (cuda)

os: CachyOS

kernel: 7.0.8-1-cachyos

driver: NVIDIA 595.71.05 + CUDA 13.2

python: 3.14.4

runs/cell: 3

warmups: 0

endpoint: llama-bench

streaming: false

GeForce RTX 5070 · 12 GiB

cpu: AMD Ryzen 9 7900 12-Core Processor

gpu: NVIDIA GeForce RTX 5070

arch: NVIDIA

vram: 11.94 GiB (system 30.4 GiB)

power: 250 W / 300 W max (83% cap)

clocks: gfx 2910 MHz · mem 14001 MHz

temp: 50°C idle · 50°C peak

peak draw: 36 W

backend: llama.cpp b9174 (cuda)

os: CachyOS

kernel: 7.0.8-1-cachyos

driver: NVIDIA 595.71.05 + CUDA 13.2

python: 3.14.4

runs/cell: 3

warmups: 0

endpoint: llama-bench

streaming: false