Qwen2.5 14B-Instruct

AWQ · 14B params · safetensors
checkpoint: Qwen/Qwen2.5-14B-Instruct-AWQ
commit: 539535859b13
weights 9.29 GiB

All runs (15)

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · chat · 1
82.3
80.7 · 1509.8 · 34ms · 12.1 · 4974 · 908ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · chat · 1
82.3
80.7 · 1533.3 · 33ms · 12.1 · 4974 · 908ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · rag · 1
81.5
76.0 · 30227.2 · 36ms · 12.3 · 85545 · 675ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · rag · 1
81.5
76.3 · 30203.1 · 35ms · 12.3 · 85545 · 675ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · codegen · 1
81.4
81.0 · 1765.7 · 44ms · 12.3 · 81622 · 7.63s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · codegen · 1
81.4
81.0 · 1763.5 · 44ms · 12.3 · 81622 · 7.63s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · agent · 1
80.8
80.1 · 21761.7 · 29ms · 12.4 · 606355 · 4.41s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · agent · 1
80.8
80.1 · 21655.7 · 29ms · 12.4 · 606355 · 4.41s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-350w · agent · 4
74.6
74.0 · 11489.7 · 65ms · 13.4 · 606355 · 4.78s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590
vLLM 0.21.0 (cuda) · baseline-pl-450w · agent · 4
74.6
74.2 · 10789.7 · 63ms · 13.4 · 606355 · 4.77s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · chat · 1
44.5
42.6 · 859.6 · 59ms · 22.5 · 4974 · 1.60s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · rag · 1
40.9
38.9 · 16420.1 · 53ms · 24.5 · 85545 · 1.36s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · agent · 1
40.9
40.6 · 10787.6 · 56ms · 24.5 · 606355 · 8.73s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · codegen · 1
40.1
40.0 · 803.9 · 97ms · 24.9 · 81622 · 15.46s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
vLLM 0.21.0 (cuda) · baseline · agent · 4
39.6
39.3 · 7270.7 · 93ms · 25.2 · 606355 · 9.04s · 0.000 GiB
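
The run list covers the same five workloads at a 420 W and a 200 W power cap, so the two settings can be compared directly. The sketch below is one way to do that, not part of the benchmark tooling. It assumes the standalone figure on each run is decode throughput in tokens per second (the listing does not label it), and it keys each entry by the workload name and the trailing number from the run label. `HEADLINE` and `cap_scaling` are hypothetical names.

```python
# Headline figure per workload, copied from the run list above.
# Assumption: the figure is decode throughput in tokens per second.
# Keys are (workload, trailing number from the run label).
HEADLINE = {
    ("chat", 1): {420: 82.3, 200: 44.5},
    ("rag", 1): {420: 81.5, 200: 40.9},
    ("codegen", 1): {420: 81.4, 200: 40.1},
    ("agent", 1): {420: 80.8, 200: 40.9},
    ("agent", 4): {420: 74.6, 200: 39.6},
}


def cap_scaling(by_cap, high=420, low=200):
    """Ratio of the headline figure at the high cap to the low cap."""
    return by_cap[high] / by_cap[low]


for (workload, n), by_cap in HEADLINE.items():
    ratio = cap_scaling(by_cap)
    print(f"{workload}{n}: {ratio:.2f}x at {420 / 200:.1f}x the power cap")
```

Under that assumption, the headline figure at 200 W is roughly half of the 420 W figure for every workload listed.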

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 420 W / 450 W max (93% cap)
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
200 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
300 W   936 GB/s   391 GB/s   65.4 TF   65.3 TF
450 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
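
The "theory" and "copy 42% of theory" figures can be reproduced from the bus width and memory clock in the probe summary above. The sketch below is a minimal check rather than the probe's own code. It assumes the reported 9751 MHz clock carries two transfers per cycle; `theoretical_bandwidth_gbps` is a hypothetical helper name.

```python
# Values copied from the probe summary above.
BUS_WIDTH_BITS = 384
MEMORY_CLOCK_MHZ = 9751
MEASURED_COPY_GBPS = 391.0

# Assumption: two transfers per reported memory clock cycle (double data rate).
TRANSFERS_PER_CLOCK = 2


def theoretical_bandwidth_gbps(bus_width_bits, clock_mhz,
                               transfers_per_clock=TRANSFERS_PER_CLOCK):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


theory = theoretical_bandwidth_gbps(BUS_WIDTH_BITS, MEMORY_CLOCK_MHZ)
copy_fraction = MEASURED_COPY_GBPS / theory
print(f"theory ≈ {theory:.0f} GB/s, copy ≈ {copy_fraction:.0%} of theory")
# prints: theory ≈ 936 GB/s, copy ≈ 42% of theory
```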
compute: 8.6
backend: vLLM 0.21.0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: vLLM 0.21.0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
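
Both environment blocks list 5 runs per cell and 2 warmups. One way such a protocol can be implemented is sketched below: the warmup calls are executed but not timed, and only the following runs contribute samples. This is an illustrative harness, not the benchmark's own code. `measure` and `summarize` are hypothetical names, and the use of the median is an assumption, since the page does not say how the five runs are aggregated.

```python
import statistics
import time


def measure(workload, warmups=2, runs=5):
    """Call workload() warmups + runs times; return seconds for the timed runs only."""
    for _ in range(warmups):
        workload()  # untimed: lets caches, kernels, and connections settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Assumed aggregation: median, with min and max for spread."""
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would send a request to the serving endpoint.
    result = summarize(measure(lambda: sum(range(100_000))))
    print(result)
```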