Skip to content

granite-4.1 8b

Q4_K_M·8B params·GGUF
checkpoint: unsloth/granite-4.1-8b-GGUF:Q4_K_M
commit: 6f9671f73eb0
weights 4.98 GiB

All runs (20)

legacystack comparable
GeForce RTX 3090 · 24 GiB450 W maxdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450wchat1
127.0
114.1665.242ms7.928100814ms0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiB450 W maxdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450wcodegen1
124.8
115.5520.0121ms8.0597356.37s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiB450 W maxdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450wrag1
124.4
92.13118.8223ms8.069579910ms0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350wchat1
124.1
117.7645.145ms8.128100838ms0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiB450 W maxdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450wagent1
123.1
112.62930.7189ms8.15555004.38s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiB450 W maxdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450wagent4
122.8
43.2122.36.00s8.15555009.49s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350wcodegen1
122.5
115.2564.5110ms8.2597356.60s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350wrag1
122.1
89.32847.9244ms8.269579942ms0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350wagent4
120.9
46.697.96.22s8.35555009.77s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350wagent1
120.7
109.83212.3173ms8.35555004.48s0.000 GiB
legacystack comparable
GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595
llama.cpp b9174 (vulkan)baselinechat1
100.2
97.6766.738ms10.0281001.00s0.000 GiB
legacystack comparable
GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595
llama.cpp b9174 (vulkan)baselinecodegen1
99.1
98.51213.055ms10.1597387.48s0.000 GiB
legacystack comparable
GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595
llama.cpp b9174 (vulkan)baselinerag1
98.1
91.55797.695ms10.2695791.12s0.000 GiB
legacystack comparable
GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595
llama.cpp b9174 (vulkan)baselineagent1
96.0
94.62805.2159ms10.45555005.24s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590
llama.cpp 59778f0 (cuda)baselinechat1
77.2
74.4591.949ms13.0281001.30s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590
llama.cpp 59778f0 (cuda)baselinerag1
74.6
63.83453.4201ms13.4695791.22s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590
llama.cpp 59778f0 (cuda)baselinecodegen1
72.2
70.7581.8112ms13.85973510.41s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590
llama.cpp 59778f0 (cuda)baselineagent1
71.5
66.61759.5254ms14.05555007.36s0.000 GiB
legacystack comparable
GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595
llama.cpp b9174 (vulkan)baselineagent4
59.8
36.9135.25.30s16.755550011.52s0.000 GiB
legacystack comparable
GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590
llama.cpp 59778f0 (cuda)baselineagent4
42.3
35.0237.72.74s23.655550013.85s0.000 GiB

Environment

GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power350 W / 450 W max(78% cap)
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1980/2100 MHz · mem 9501 MHz
temp44°C idle · 62°C peak
peak draw326 W
hardware probes
copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps
384-bit9751 MHz82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
captheorycopyfp16bf16
200 W936 GB/s391 GB/s65.4 TF65.4 TF
300 W936 GB/s391 GB/s65.4 TF65.3 TF
450 W936 GB/s391 GB/s65.4 TF65.4 TF
compute: 8.6
backendllama.cpp cuda-4f13cb7 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power450 W / 450 W max
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1980/2100 MHz · mem 9501 MHz
temp44°C idle · 78°C peak
peak draw428 W
backendllama.cpp cuda-4f13cb7 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power200 W / 450 W max(44% cap)
backendllama.cpp 59778f0 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driver590.48.01
python3.12.3
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 5070 · 12 GiB
cpuAMD Ryzen 9 7900 12-Core Processor
gpuNVIDIA GeForce RTX 5070
archNVIDIA
vram11.94 GiB (system 30.4 GiB)
power250 W / 300 W max(83% cap)
hardware probes
copy 40% of theoryFP16 peak 69.6 TFcopy/math spread 2.5%
192-bit14001 MHz48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
captheorycopyfp16bf16
200 W672 GB/s271 GB/s67.9 TF68.4 TF
250 W672 GB/s271 GB/s69.5 TF68.2 TF
300 W672 GB/s270 GB/s69.6 TF68.4 TF
compute: 12
backendllama.cpp b9174 (vulkan)
osCachyOS
kernel7.0.8-1-cachyos
driver595.71.05
python3.14.4
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue