granite-4.1 8b

Q4_K_M·8B params·GGUF

intelligence: see on Artificial Analysis →

checkpoint: unsloth/granite-4.1-8b-GGUF:Q4_K_M

commit: 6f9671f73eb0

weights 4.98 GiB

All runs (20)


legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	127.0	114.1	665.2	—	42ms	7.9	—	28	100	814ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	124.8	115.5	520.0	—	121ms	8.0	—	59	735	6.37s
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	124.4	92.1	3118.8	—	223ms	8.0	—	695	79	910ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	124.1	117.7	645.1	—	45ms	8.1	—	28	100	838ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	123.1	112.6	2930.7	—	189ms	8.1	—	555	500	4.38s
legacy	stack comparable	GeForce RTX 3090 · 24 GiB450 W maxdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	122.8	43.2	122.3	—	6.00s	8.1	—	555	500	9.49s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	122.5	115.2	564.5	—	110ms	8.2	—	59	735	6.60s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	122.1	89.3	2847.9	—	244ms	8.2	—	695	79	942ms
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	120.9	46.6	97.9	—	6.22s	8.3	—	555	500	9.77s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	120.7	109.8	3212.3	—	173ms	8.3	—	555	500	4.48s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	chat	1	100.2	97.6	766.7	—	38ms	10.0	—	28	100	1.00s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	codegen	1	99.1	98.5	1213.0	—	55ms	10.1	—	59	738	7.48s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	rag	1	98.1	91.5	5797.6	—	95ms	10.2	—	695	79	1.12s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	agent	1	96.0	94.6	2805.2	—	159ms	10.4	—	555	500	5.24s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	77.2	74.4	591.9	—	49ms	13.0	—	28	100	1.30s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	74.6	63.8	3453.4	—	201ms	13.4	—	695	79	1.22s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	72.2	70.7	581.8	—	112ms	13.8	—	59	735	10.41s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	71.5	66.6	1759.5	—	254ms	14.0	—	555	500	7.36s
legacy	stack comparable	GeForce RTX 5070 · 12 GiBcap 250 Wdrv 595	llama.cpp b9174 (vulkan)	baseline	agent	4	59.8	36.9	135.2	—	5.30s	16.7	—	555	500	11.52s
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	42.3	35.0	237.7	—	2.74s	23.6	—	555	500	13.85s

Environment

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power350 W / 450 W max(78% cap)

pcieGen 4 x16 / Gen 4 x16 max

clocksgfx 1980/2100 MHz · mem 9501 MHz

temp44°C idle · 62°C peak

peak draw326 W

hardware probes

copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps

384-bit9751 MHz82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF

compute: 8.6

backendllama.cpp cuda-4f13cb7 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driverNVIDIA 590.48.01 + CUDA 13.1

libc2.39

python3.12.3

llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power450 W / 450 W max

pcieGen 4 x16 / Gen 4 x16 max

clocksgfx 1980/2100 MHz · mem 9501 MHz

temp44°C idle · 78°C peak

peak draw428 W

backendllama.cpp cuda-4f13cb7 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driverNVIDIA 590.48.01 + CUDA 13.1

libc2.39

python3.12.3

llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power200 W / 450 W max(44% cap)

backendllama.cpp 59778f0 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driver590.48.01

python3.12.3

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

GeForce RTX 5070 · 12 GiB

cpuAMD Ryzen 9 7900 12-Core Processor

gpuNVIDIA GeForce RTX 5070

archNVIDIA

vram11.94 GiB (system 30.4 GiB)

power250 W / 300 W max(83% cap)

hardware probes

copy 40% of theoryFP16 peak 69.6 TFcopy/math spread 2.5%

192-bit14001 MHz48 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	672 GB/s	271 GB/s	67.9 TF	68.4 TF
250 W	672 GB/s	271 GB/s	69.5 TF	68.2 TF
300 W	672 GB/s	270 GB/s	69.6 TF	68.4 TF

compute: 12

backendllama.cpp b9174 (vulkan)

osCachyOS

kernel7.0.8-1-cachyos

driver595.71.05

python3.14.4

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue