Qwen2.5-Coder 7B-Instruct

Q4_K_M·7B params·GGUF

intelligence: see on Artificial Analysis →

checkpoint: Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:q4_k_m

commit: 13fb94bfda8c

weights 4.36 GiB

All runs (20)


legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	151.9	146.4	1977.5	—	26ms	6.6	—	49	90	605ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	150.1	140.5	2102.4	—	47ms	6.7	—	81	476	3.39s
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	149.8	125.4	8204.2	—	104ms	6.7	—	855	47	567ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	148.9	136.6	8636.8	—	75ms	6.7	—	606	337	2.55s
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	148.6	51.3	151.8	—	4.36s	6.7	—	606	337	6.88s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	147.9	132.3	1383.1	—	36ms	6.8	—	49	90	707ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	146.8	139.2	1946.3	—	47ms	6.8	—	81	476	3.44s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	146.4	104.9	7236.2	—	108ms	6.8	—	855	47	569ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	145.6	133.7	8530.0	—	71ms	6.9	—	606	337	2.58s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	145.4	63.1	147.5	—	3.82s	6.9	—	606	337	6.82s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	chat	1	120.7	117.2	1531.0	—	33ms	8.3	—	49	97	811ms
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	codegen	1	120.2	119.4	2100.4	—	39ms	8.3	—	81	589	4.94s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	rag	1	118.5	110.5	10682.4	—	80ms	8.4	—	855	47	698ms
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	agent	1	117.6	114.5	3643.2	—	130ms	8.5	—	606	337	2.94s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	91.2	88.6	1403.1	—	36ms	11.0	—	49	90	1.01s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	87.4	77.5	7234.7	—	118ms	11.4	—	855	47	886ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	85.4	79.9	8003.3	—	76ms	11.7	—	606	162	2.10s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	81.8	81.4	1697.7	—	53ms	12.2	—	81	476	5.87s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	agent	4	80.6	40.1	234.2	—	3.89s	12.4	—	606	336	7.76s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	50.7	42.4	622.3	—	1.04s	19.7	—	606	232	5.81s

Environment

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power350 W / 450 W max(78% cap)

pcieGen 4 x16 / Gen 4 x16 max

clocksgfx 1995/2100 MHz · mem 9501 MHz

temp40°C idle · 60°C peak

peak draw343 W

hardware probes

copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps

384-bit9751 MHz82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF

compute: 8.6

backendllama.cpp cuda-4f13cb7 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driverNVIDIA 590.48.01 + CUDA 13.1

libc2.39

python3.12.3

llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power450 W / 450 W max

pcieGen 4 x16 / Gen 4 x16 max

clocksgfx 1980/2100 MHz · mem 9501 MHz

temp46°C idle · 75°C peak

peak draw426 W

backendllama.cpp cuda-4f13cb7 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driverNVIDIA 590.48.01 + CUDA 13.1

libc2.39

python3.12.3

llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power200 W / 450 W max(44% cap)

backendllama.cpp 59778f0 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driver590.48.01

python3.12.3

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 5070 · 12 GiB

cpuAMD Ryzen 9 7900 12-Core Processor

gpuNVIDIA GeForce RTX 5070

archNVIDIA

vram11.94 GiB (system 30.4 GiB)

power250 W / 300 W max(83% cap)

hardware probes

copy 40% of theoryFP16 peak 69.6 TFcopy/math spread 2.5%

192-bit14001 MHz48 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	672 GB/s	271 GB/s	67.9 TF	68.4 TF
250 W	672 GB/s	271 GB/s	69.5 TF	68.2 TF
300 W	672 GB/s	270 GB/s	69.6 TF	68.4 TF

compute: 12

backendllama.cpp b9174 (vulkan)

osCachyOS

kernel7.0.8-1-cachyos

driver595.71.05

python3.14.4

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue