Qwen2.5-Coder 7B-Instruct
Q4_K_M·7B params·GGUF
checkpoint: Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:q4_k_m
commit: 13fb94bfda8c
weights: 4.36 GiB
All runs (20)
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 151.9 | 146.4 | 1977.5 | — | 26ms | 6.6 | — | 49 | 90 | 605ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 150.1 | 140.5 | 2102.4 | — | 47ms | 6.7 | — | 81 | 476 | 3.39s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 149.8 | 125.4 | 8204.2 | — | 104ms | 6.7 | — | 855 | 47 | 567ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 148.9 | 136.6 | 8636.8 | — | 75ms | 6.7 | — | 606 | 337 | 2.55s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 148.6 | 51.3 | 151.8 | — | 4.36s | 6.7 | — | 606 | 337 | 6.88s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 147.9 | 132.3 | 1383.1 | — | 36ms | 6.8 | — | 49 | 90 | 707ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 146.8 | 139.2 | 1946.3 | — | 47ms | 6.8 | — | 81 | 476 | 3.44s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 146.4 | 104.9 | 7236.2 | — | 108ms | 6.8 | — | 855 | 47 | 569ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 145.6 | 133.7 | 8530.0 | — | 71ms | 6.9 | — | 606 | 337 | 2.58s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 145.4 | 63.1 | 147.5 | — | 3.82s | 6.9 | — | 606 | 337 | 6.82s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | chat | 1 | 120.7 | 117.2 | 1531.0 | — | 33ms | 8.3 | — | 49 | 97 | 811ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | codegen | 1 | 120.2 | 119.4 | 2100.4 | — | 39ms | 8.3 | — | 81 | 589 | 4.94s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | rag | 1 | 118.5 | 110.5 | 10682.4 | — | 80ms | 8.4 | — | 855 | 47 | 698ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 1 | 117.6 | 114.5 | 3643.2 | — | 130ms | 8.5 | — | 606 | 337 | 2.94s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 91.2 | 88.6 | 1403.1 | — | 36ms | 11.0 | — | 49 | 90 | 1.01s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 87.4 | 77.5 | 7234.7 | — | 118ms | 11.4 | — | 855 | 47 | 886ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 85.4 | 79.9 | 8003.3 | — | 76ms | 11.7 | — | 606 | 162 | 2.10s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 81.8 | 81.4 | 1697.7 | — | 53ms | 12.2 | — | 81 | 476 | 5.87s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 80.6 | 40.1 | 234.2 | — | 3.89s | 12.4 | — | 606 | 336 | 7.76s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 50.7 | 42.4 | 622.3 | — | 1.04s | 19.7 | — | 606 | 232 | 5.81s | 0.000 GiB |
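The rows above carry no column labels. In every row, the small figure that follows the first millisecond latency (6.6 in the first row) matches 1000 divided by the first throughput figure (151.9). Reading that pair as tokens per second and milliseconds per token is an assumption, not something the page states. The sketch below only checks the arithmetic on a handful of rows; the helper names are illustrative.

```python
# Pairs of (first throughput figure, small per-token figure) copied from rows above.
# Interpreting them as tokens/s and ms/token is an assumption; the page has no header row.
ROWS = [
    (151.9, 6.6),   # RTX 3090, 450 W, chat
    (147.9, 6.8),   # RTX 3090, 350 W cap, chat
    (120.7, 8.3),   # RTX 5070, 250 W cap, chat
    (91.2, 11.0),   # RTX 3090, 200 W cap, chat
    (50.7, 19.7),   # RTX 3090, 200 W cap, agent, concurrency 4
]


def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s to a per-token latency in milliseconds."""
    return 1000.0 / tokens_per_second


def matches_table(tokens_per_second: float, reported_ms: float,
                  tolerance: float = 0.05) -> bool:
    """True if the reported value equals the reciprocal after rounding to 0.1 ms."""
    return abs(ms_per_token(tokens_per_second) - reported_ms) <= tolerance


for tps, reported in ROWS:
    status = "ok" if matches_table(tps, reported) else "mismatch"
    print(f"{tps:6.1f} tok/s -> {ms_per_token(tps):5.2f} ms/token (table: {reported}) {status}")
```

If the reading is right, the two columns carry the same information and either one can be used to compare rows.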
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1995/2100 MHz · mem 9501 MHz
temp: 40°C idle · 60°C peak
peak draw: 343 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
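The 936 GB/s theory figure is consistent with the 384-bit bus and the 9751 MHz clock listed in the probe summary, provided that clock is doubled to get the effective transfer rate. The factor of two is an assumption inferred from the numbers, not something the page states. The same arithmetic reproduces the 672 GB/s figure in the RTX 5070 probe section (192-bit, 14001 MHz). A minimal sketch, with illustrative function names:

```python
def theoretical_bandwidth_gbps(bus_width_bits: int, mem_clock_mhz: float,
                               transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s (decimal) from bus width and memory clock.

    transfers_per_clock=2 is an assumption that makes the listed clocks
    agree with the table's theory column.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def copy_efficiency(measured_gbps: float, theory_gbps: float) -> float:
    """Fraction of theoretical bandwidth reached by the copy microbenchmark."""
    return measured_gbps / theory_gbps


# Values taken from the probe summaries and tables on this page.
for name, bus, clock, copy in [("RTX 3090", 384, 9751, 391),
                               ("RTX 5070", 192, 14001, 271)]:
    theory = theoretical_bandwidth_gbps(bus, clock)
    print(f"{name}: theory {theory:.0f} GB/s, copy {copy_efficiency(copy, theory):.0%} of theory")
```

Because the copy figure stays at 391 GB/s at every cap in the table, the ratio to theory does not change with the power limit.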
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
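The settings above (2 warmups, 5 runs per cell, streaming against `/v1/chat/completions`) suggest a loop that records when each streamed chunk arrives and then aggregates over the measured runs. The page does not say how its figures are derived, so the sketch below is one plausible reading rather than the harness's actual method. It assumes one token per chunk, time to first token measured from request start, decode rate measured between the first and last chunk, and a median across non-warmup runs. All function names are hypothetical.

```python
from statistics import median


def run_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive time-to-first-token and decode rate from one streamed run.

    Times are in seconds. Assumes each streamed chunk carries one token.
    """
    if len(chunk_times) < 2:
        raise ValueError("need at least two streamed chunks")
    ttft = chunk_times[0] - request_start
    decode_window = chunk_times[-1] - chunk_times[0]
    # The first token belongs to prefill, so only the later ones count as decode.
    decode_tps = (len(chunk_times) - 1) / decode_window
    return {"ttft_s": ttft, "decode_tps": decode_tps}


def summarize(runs: list[dict], warmups: int = 2) -> dict:
    """Drop the warmup runs, then take the median of each metric."""
    measured = runs[warmups:]
    if not measured:
        raise ValueError("no measured runs left after discarding warmups")
    return {
        "ttft_s": median(r["ttft_s"] for r in measured),
        "decode_tps": median(r["decode_tps"] for r in measured),
    }


# Synthetic example: 7 identical runs (2 warmups + 5 measured) of 90 chunks each.
synthetic = [run_metrics(0.0, [0.026 + i * 0.0066 for i in range(90)]) for _ in range(7)]
print(summarize(synthetic))
```

Using a median rather than a mean is a design choice made here to limit the effect of a single slow run; the page does not say which statistic it reports.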
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 46°C idle · 75°C peak
peak draw: 426 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 672 GB/s | 271 GB/s | 67.9 TF | 68.4 TF |
| 250 W | 672 GB/s | 271 GB/s | 69.5 TF | 68.2 TF |
| 300 W | 672 GB/s | 270 GB/s | 69.6 TF | 68.4 TF |
compute: 12
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: 595.71.05
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true