Qwen2.5-Coder 32B-Instruct

AWQ · 32B params · safetensors
checkpoint: Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
commit: 1ed0a6145da0
weights 18.00 GiB

All runs (15)

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · chat · 1
41.9
41.0 · 839.5 · 61ms · 23.8 · 49 · 100 · 2.39s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · chat · 1
41.9
41.0 · 878.1 · 58ms · 23.8 · 49 · 100 · 2.39s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · rag · 1
41.6
40.1 · 16203.0 · 53ms · 24.0 · 8555621.77s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · rag · 1
41.6
40.0 · 16188.0 · 53ms · 24.0 · 8555621.77s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · codegen · 1
41.5
41.3 · 815.9 · 96ms · 24.1 · 81 · 720 · 17.46s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · codegen · 1
41.5
41.3 · 817.0 · 96ms · 24.1 · 81 · 720 · 17.46s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · agent · 1
41.4
41.1 · 11373.2 · 53ms · 24.2 · 606 · 295 · 7.56s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · agent · 1
41.3
41.2 · 11391.4 · 53ms · 24.2 · 606 · 295 · 7.56s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · agent · 4
39.0
38.8 · 7704.1 · 99ms · 25.7 · 606 · 295 · 7.61s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · agent · 4
38.9
38.7 · 8186.8 · 100ms · 25.7 · 606 · 295 · 7.62s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · chat · 1
20.0
19.5 · 452.6 · 112ms · 49.9 · 49 · 100 · 4.76s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · codegen · 1
19.4
19.3 · 417.2 · 187ms · 51.4 · 81 · 720 · 37.27s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · agent · 1
19.4
19.2 · 5080.7 · 119ms · 51.6 · 606 · 363 · 18.90s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · rag · 1
19.2
18.9 · 8013.5 · 111ms · 52.2 · 8555623.73s · 0.000 GiB

legacy stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · agent · 4
19.0
18.8 · 3535.8 · 175ms · 52.6 · 606 · 295 · 15.71s · 0.000 GiB
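
One way to read the run list is to compare the headline figure of a workload across the two power caps. The sketch below does this for the chat runs (41.9 at the 420 W cap, 20.0 at the 200 W cap). It assumes the headline figure is a throughput, which the listing does not label, and it divides by the configured cap rather than by measured power draw, which the listing does not report. The function name `compare_caps` is hypothetical.

```python
def compare_caps(headline_by_cap_w: dict, high_w: int, low_w: int) -> dict:
    """Compare a headline figure between two power caps (in watts)."""
    high, low = headline_by_cap_w[high_w], headline_by_cap_w[low_w]
    return {
        "figure_ratio": high / low,        # how much the figure rises
        "cap_ratio": high_w / low_w,       # how much the cap rises
        "per_cap_watt_high": high / high_w,
        "per_cap_watt_low": low / low_w,
    }


# Headline figures for the chat runs, copied from the run list.
# Assumption: these are throughputs; the cap is a limit, not measured draw.
chat = {420: 41.9, 200: 20.0}
result = compare_caps(chat, high_w=420, low_w=200)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the figure and the cap scale by nearly the same factor, so the figure per watt of cap is roughly equal at both settings.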

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 420 W / 450 W max (93% cap)
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
200 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
300 W   936 GB/s   391 GB/s   65.4 TF   65.3 TF
450 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
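
The `theory` figure can be reproduced from the bus width and memory clock listed with the hardware probes. The sketch below assumes two transfers per pin per reported clock cycle; that factor is an assumption chosen because it matches the 936 GB/s shown, since the page does not state its formula. The function names are hypothetical.

```python
def theoretical_bandwidth_gbs(bus_width_bits: int, mem_clock_mhz: float,
                              transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def copy_efficiency(measured_gbs: float, theory_gbs: float) -> float:
    """Fraction of the theoretical bandwidth reached by the copy probe."""
    return measured_gbs / theory_gbs


# Values from the probe summary: 384-bit bus, 9751 MHz, 391 GB/s measured copy.
theory = theoretical_bandwidth_gbs(384, 9751)
print(f"theory: {theory:.0f} GB/s")
print(f"copy:   {copy_efficiency(391, theory):.0%} of theory")
```

With these inputs the result rounds to 936 GB/s and 42%, matching the table and the probe summary.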
compute: 8.6
backend: vLLM 0.21.0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: vLLM 0.21.0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
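
Because the runs use a streaming endpoint, latency and rate figures of this kind can be derived from the arrival times of streamed chunks. The sketch below is one possible way to do that, not the method this page uses, which is not stated. It times any iterable of text chunks; `measure_stream` and its result keys are hypothetical names, and chunks stand in for tokens even though a server may pack several tokens into one chunk.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Return time to first chunk, wall time, and chunk rate after the first."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty deltas, which carry no generated text
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no content")
    decode_time = last - first
    return {
        "ttft_s": first - start,
        "wall_s": last - start,
        "chunks": count,
        # Rate over the chunks that followed the first one.
        "decode_rate": (count - 1) / decode_time if decode_time > 0 else float("nan"),
    }


# Demo with a synthetic clock so the timings are deterministic.
ticks = iter([0.0, 0.05, 0.075, 0.10, 0.125])
stats = measure_stream(["a", "", "b", "c", "d"], clock=lambda: next(ticks))
print(stats)
```

In practice `chunks` would be the content deltas read from a streamed `/v1/chat/completions` response, and a run would be repeated after warmups, as the `runs/cell` and `warmups` settings suggest.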