Qwen2.5-Coder 7B-Instruct

Q4_K_M · 7B params · GGUF
checkpoint: Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:q4_k_m
commit: 13fb94bfda8c
weights: 4.36 GiB

All runs (20)

Every run is tagged: legacy, stack comparable.
Column names are inferred from the values, since the listing has no header row; the second tok/s column and the GiB column are unlabeled in the source.

| GPU | power limit | driver | backend | config | workload | decode tok/s | tok/s (2nd) | prefill tok/s | TTFT | ms/token | prompt tok | output tok | total time | GiB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB | 450 W max | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat1 | 151.9 | 146.4 | 1977.5 | 26ms | 6.6 | 49 | 90 | 605ms | 0.000 |
| GeForce RTX 3090 · 24 GiB | 450 W max | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen1 | 150.1 | 140.5 | 2102.4 | 47ms | 6.7 | 81 | 476 | 3.39s | 0.000 |
| GeForce RTX 3090 · 24 GiB | 450 W max | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag1 | 149.8 | 125.4 | 8204.2 | 104ms | 6.7 | 855 | 47 | 567ms | 0.000 |
| GeForce RTX 3090 · 24 GiB | 450 W max | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent1 | 148.9 | 136.6 | 8636.8 | 75ms | 6.7 | 606 | 337 | 2.55s | 0.000 |
| GeForce RTX 3090 · 24 GiB | 450 W max | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent4 | 148.6 | 51.3 | 151.8 | 4.36s | 6.7 | 606 | 337 | 6.88s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 350 W | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat1 | 147.9 | 132.3 | 1383.1 | 36ms | 6.8 | 49 | 90 | 707ms | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 350 W | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen1 | 146.8 | 139.2 | 1946.3 | 47ms | 6.8 | 81 | 476 | 3.44s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 350 W | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag1 | 146.4 | 104.9 | 7236.2 | 108ms | 6.8 | 855 | 47 | 569ms | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 350 W | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent1 | 145.6 | 133.7 | 8530.0 | 71ms | 6.9 | 606 | 337 | 2.58s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 350 W | 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent4 | 145.4 | 63.1 | 147.5 | 3.82s | 6.9 | 606 | 337 | 6.82s | 0.000 |
| GeForce RTX 5070 · 12 GiB | cap 250 W | 595 | llama.cpp b9174 (vulkan) | baseline | chat1 | 120.7 | 117.2 | 1531.0 | 33ms | 8.3 | 49 | 97 | 811ms | 0.000 |
| GeForce RTX 5070 · 12 GiB | cap 250 W | 595 | llama.cpp b9174 (vulkan) | baseline | codegen1 | 120.2 | 119.4 | 2100.4 | 39ms | 8.3 | 81 | 589 | 4.94s | 0.000 |
| GeForce RTX 5070 · 12 GiB | cap 250 W | 595 | llama.cpp b9174 (vulkan) | baseline | rag1 | 118.5 | 110.5 | 10682.4 | 80ms | 8.4 | 855 | 47 | 698ms | 0.000 |
| GeForce RTX 5070 · 12 GiB | cap 250 W | 595 | llama.cpp b9174 (vulkan) | baseline | agent1 | 117.6 | 114.5 | 3643.2 | 130ms | 8.5 | 606 | 337 | 2.94s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 200 W | 590 | llama.cpp 59778f0 (cuda) | baseline | chat1 | 91.2 | 88.6 | 1403.1 | 36ms | 11.0 | 49 | 90 | 1.01s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 200 W | 590 | llama.cpp 59778f0 (cuda) | baseline | rag1 | 87.4 | 77.5 | 7234.7 | 118ms | 11.4 | 855 | 47 | 886ms | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 200 W | 590 | llama.cpp 59778f0 (cuda) | baseline | agent1 | 85.4 | 79.9 | 8003.3 | 76ms | 11.7 | 606 | 162 | 2.10s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 200 W | 590 | llama.cpp 59778f0 (cuda) | baseline | codegen1 | 81.8 | 81.4 | 1697.7 | 53ms | 12.2 | 81 | 476 | 5.87s | 0.000 |
| GeForce RTX 5070 · 12 GiB | cap 250 W | 595 | llama.cpp b9174 (vulkan) | baseline | agent4 | 80.6 | 40.1 | 234.2 | 3.89s | 12.4 | 606 | 336 | 7.76s | 0.000 |
| GeForce RTX 3090 · 24 GiB | cap 200 W | 590 | llama.cpp 59778f0 (cuda) | baseline | agent4 | 50.7 | 42.4 | 622.3 | 1.04s | 19.7 | 606 | 232 | 5.81s | 0.000 |
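
The relationship between the headline tok/s figure and the per-token latency in each run, and the cost of lowering the power limit, can be checked with a little arithmetic. The sketch below uses only the RTX 3090 chat1 figures listed above (151.9, 147.9 and 91.2 tok/s at 450 W, 350 W and 200 W). The helper names are illustrative, not part of any benchmark tooling. Note that the 200 W runs use a different llama.cpp build (59778f0) than the 450 W and 350 W runs (cuda-4f13cb7), so that comparison is indicative rather than like-for-like.

```python
# Headline tok/s for the chat1 workload on the RTX 3090, keyed by power limit in watts.
# Values are copied from the runs listed above.
chat1_tok_s = {450: 151.9, 350: 147.9, 200: 91.2}


def ms_per_token(tok_per_s: float) -> float:
    """Convert a rate in tokens/s to milliseconds per token."""
    return 1000.0 / tok_per_s


def retained(tok_per_s: float, baseline: float) -> float:
    """Fraction of the baseline throughput that remains."""
    return tok_per_s / baseline


baseline = chat1_tok_s[450]
for watts, rate in chat1_tok_s.items():
    print(
        f"{watts} W: {ms_per_token(rate):.1f} ms/token, "
        f"{retained(rate, baseline):.1%} of the 450 W throughput "
        f"at {watts / 450:.1%} of the 450 W power limit"
    )
```

The computed 6.6, 6.8 and 11.0 ms/token agree with the per-token latencies shown for those three runs. On these figures, the 350 W cap keeps about 97% of the 450 W throughput, while the 200 W run keeps about 60%.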

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1995/2100 MHz · mem 9501 MHz
temp: 40°C idle · 60°C peak
peak draw: 343 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |

compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
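
The "theory" bandwidth and the "copy % of theory" figures in the hardware probes can be reproduced from the listed bus width and memory clock. The sketch below assumes two transfers per listed memory clock; that factor is not stated in the source and is chosen only because it reproduces both listed theory values (936 GB/s for the RTX 3090 here, 672 GB/s for the RTX 5070 in its own environment card). The function name is hypothetical.

```python
def theoretical_bandwidth_gbs(
    bus_bits: int, mem_clock_mhz: float, transfers_per_clock: int = 2
) -> float:
    """Peak memory bandwidth in decimal GB/s.

    transfers_per_clock=2 is an assumption that matches the listed
    'theory' values; it is not documented in the source.
    """
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


# (bus width in bits, memory clock in MHz, measured copy GB/s), as listed in the probes
probes = {
    "RTX 3090": (384, 9751, 391),
    "RTX 5070": (192, 14001, 271),
}

for name, (bits, mhz, copy_gbs) in probes.items():
    theory = theoretical_bandwidth_gbs(bits, mhz)
    print(f"{name}: theory {theory:.0f} GB/s, copy {copy_gbs / theory:.0%} of theory")
```

This gives 936 GB/s and 42% for the RTX 3090, and 672 GB/s and 40% for the RTX 5070, matching the probe summaries.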

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 46°C idle · 75°C peak
peak draw: 426 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 672 GB/s | 271 GB/s | 67.9 TF | 68.4 TF |
| 250 W | 672 GB/s | 271 GB/s | 69.5 TF | 68.2 TF |
| 300 W | 672 GB/s | 270 GB/s | 69.6 TF | 68.4 TF |

compute: 12
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: 595.71.05
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
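
Each environment lists 5 runs per cell, 2 warmups, and a streaming `/v1/chat/completions` endpoint. The sketch below shows one way latency metrics could be derived from token arrival times and aggregated under that protocol. It is not the harness used for these results: the function names are hypothetical, the median is an assumed aggregation (the source does not say how the 5 runs are combined), and it assumes one token per streamed chunk, which real servers do not guarantee. No network calls are made; `measure` stands in for whatever issues the request.

```python
from statistics import median
from typing import Callable, Sequence


def summarize_stream(t_request: float, token_times: Sequence[float]) -> dict:
    """Derive latency metrics from the arrival time (in seconds) of each streamed token."""
    if not token_times:
        raise ValueError("no tokens received")
    n = len(token_times)
    ttft = token_times[0] - t_request    # time to first token
    total = token_times[-1] - t_request  # request start to last token
    # Decode rate uses only the gaps after the first token, so prefill time is excluded.
    decode_span = token_times[-1] - token_times[0]
    decode = (n - 1) / decode_span if n > 1 and decode_span > 0 else float("nan")
    return {
        "ttft_s": ttft,
        "total_s": total,
        "decode_tok_s": decode,
        "e2e_tok_s": n / total,
    }


def run_cell(measure: Callable[[], dict], runs: int = 5, warmups: int = 2) -> dict:
    """Discard the warmup runs, then take the per-metric median of the measured runs."""
    for _ in range(warmups):
        measure()  # result intentionally discarded
    samples = [measure() for _ in range(runs)]
    return {key: median(s[key] for s in samples) for key in samples[0]}


if __name__ == "__main__":
    # Synthetic stream: first token after 26 ms, then 150 tokens/s, 90 tokens in total.
    times = [0.026 + i / 150 for i in range(90)]
    print(run_cell(lambda: summarize_stream(0.0, times)))
```

Separating the decode rate from the end-to-end rate matters for workloads with long prompts or concurrent requests, where time to first token can dominate the total time.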