granite-4.1 30b

Q4_K_M · 30B params · GGUF


checkpoint: unsloth/granite-4.1-30b-GGUF:Q4_K_M

commit: 6cb34f31b11c

weights: 16.29 GiB
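The header figures support a quick sanity check. Dividing the GGUF file size by the parameter count gives an effective bits-per-weight figure, and subtracting the file size from the card's 24 GiB gives the headroom left for KV cache and compute buffers. This sketch assumes exactly 30 × 10⁹ parameters (the header rounds) and ignores CUDA context overhead. Q4_K_M keeps some tensors at higher precision, so an effective figure somewhat above 4 bits is expected.

```python
GIB = 1024 ** 3


def bits_per_weight(file_gib: float, n_params: float) -> float:
    # File size in bits divided by parameter count (assumes metadata bytes are negligible).
    return file_gib * GIB * 8 / n_params


weights_gib = 16.29          # from the header above
n_params = 30e9              # assumption: nominal "30B", not the exact count
vram_gib = 24                # RTX 3090

bpw = bits_per_weight(weights_gib, n_params)
headroom_gib = vram_gib - weights_gib
print(f"{bpw:.2f} bits/weight, {headroom_gib:.2f} GiB left for KV cache and buffers")
```

Under these assumptions this prints about 4.66 bits/weight and 7.71 GiB of headroom.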

All runs (15)

Hardware	Backend	Mode	Shape	Conc.	Gen tok/s ↓	Prefill tok/s	TTFT	TPOT (ms)	Prompt tok	Out tok	Total
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	40.9	338.2	214 ms	23.5	59	884	21.67 s
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	40.4	378.2	74 ms	23.2	28	100	2.37 s
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	40.2	363.9	205 ms	23.9	59	884	21.94 s
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	39.9	1982.6	280 ms	23.6	555	479	11.97 s
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	39.8	370.3	76 ms	23.6	28	100	2.42 s
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	38.7	1973.4	281 ms	24.0	555	479	12.30 s
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	36.5	1925.2	361 ms	23.5	695	116	3.03 s
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	36.5	1855.7	339 ms	24.0	695	116	3.17 s
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	21.0	306.1	92 ms	46.5	28	100	4.59 s
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	20.9	359.1	217 ms	47.6	59	884	42.46 s
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	20.3	1999.8	278 ms	48.1	555	377	18.94 s
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	19.5	1193.7	428 ms	47.6	695	138	6.91 s
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	18.3	36.0	15.37 s	23.7	555	479	27.46 s
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	13.5	42.6	15.64 s	24.0	555	479	24.66 s
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	11.4	154.4	4.48 s	81.2	555	499	40.13 s
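As a rough consistency check on the concurrency-1 rows, end-to-end time should be close to TTFT plus one TPOT interval for each output token after the first. The sketch below assumes that simple model (it is not necessarily how the benchmark computes its Total column) and uses the four 450 W concurrency-1 rows copied from the table; the `Row` type is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    shape: str
    ttft_ms: float   # time to first token
    tpot_ms: float   # time per output token after the first
    out_tok: int     # output tokens
    total_s: float   # reported end-to-end time


def estimated_total_s(row: Row) -> float:
    """First token arrives at TTFT, then one TPOT interval per remaining token."""
    return (row.ttft_ms + (row.out_tok - 1) * row.tpot_ms) / 1000.0


# Concurrency-1 rows at the 450 W limit, copied from the table.
ROWS_450W = [
    Row("codegen", 214, 23.5, 884, 21.67),
    Row("chat", 74, 23.2, 100, 2.37),
    Row("agent", 280, 23.6, 479, 11.97),
    Row("rag", 361, 23.5, 116, 3.03),
]

for row in ROWS_450W:
    est = estimated_total_s(row)
    print(f"{row.shape:8s} estimated {est:6.2f} s, reported {row.total_s:6.2f} s")
```

For these rows the estimate lands within about 4% of the reported Total. The concurrency-4 rows fit this model less well (roughly 3% to 12% off), which is plausible when per-request figures are averaged over requests that overlap or queue.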

Environment

GeForce RTX 3090 · 24 GiB · 350 W power limit (baseline-pl-350w runs)

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 350 W / 450 W max (78% cap)

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1965/2100 MHz · mem 9501 MHz

temp: 44°C idle · 64°C peak

peak draw: 332 W

backend: llama.cpp cuda-4f13cb7 (cuda)

server: lemonade (version unknown)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

containerized: true

llama.cpp version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
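The run settings above (2 warmups, 5 runs per cell, streaming from `/v1/chat/completions`) suggest how TTFT and TPOT can be measured. The sketch below is an assumption about that procedure, not the harness's actual code: `run_once` is a hypothetical callable that sends one streaming request and returns the request start time plus the arrival time of each streamed token.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical runner: returns (request start time, arrival time of each streamed token).
RunOnce = Callable[[], Tuple[float, List[float]]]


def ttft_and_tpot(start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """TTFT and mean inter-token time (TPOT), both in seconds."""
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - start
    if len(token_times) == 1:
        return ttft, 0.0
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def run_cell(run_once: RunOnce, runs: int = 5, warmups: int = 2) -> Tuple[float, float]:
    """Discard `warmups` requests, then average TTFT and TPOT over `runs` requests."""
    for _ in range(warmups):
        run_once()  # warmup result is ignored
    samples = [ttft_and_tpot(*run_once()) for _ in range(runs)]
    return mean(s[0] for s in samples), mean(s[1] for s in samples)
```

Streamed chunks from a chat-completions endpoint may not map one-to-one onto tokens, so a real harness would likely take token counts from the response rather than from chunk arrivals.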

GeForce RTX 3090 · 24 GiB · 450 W power limit (baseline-pl-450w runs)

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1980/2100 MHz · mem 9501 MHz

temp: 44°C idle · 82°C peak

peak draw: 430 W

backend: llama.cpp cuda-4f13cb7 (cuda)

server: lemonade (version unknown)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

containerized: true

llama.cpp version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
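The 350 W and 450 W environments use the same llama.cpp build, so their codegen rows can be compared on energy. Peak draw is at least the average draw, so dividing it by generation throughput gives an upper bound on energy per token. This sketch assumes the reported peak draw is representative of the codegen run, although it is reported per environment rather than per shape.

```python
def joules_per_token_upper_bound(peak_draw_w: float, gen_tok_s: float) -> float:
    # Peak draw >= average draw, so this overestimates energy per token.
    return peak_draw_w / gen_tok_s


# (peak draw in W, codegen Gen tok/s), from the environment blocks and the table.
SETTINGS = {"350 W": (332.0, 40.2), "450 W": (430.0, 40.9)}

for name, (peak_w, tok_s) in SETTINGS.items():
    print(f"{name}: <= {joules_per_token_upper_bound(peak_w, tok_s):.1f} J/token")

retained = SETTINGS["350 W"][1] / SETTINGS["450 W"][1]
print(f"350 W keeps {retained:.0%} of the 450 W codegen throughput at a 78% power cap")
```

On these numbers the 350 W cap keeps about 98% of codegen throughput while bounding energy at roughly 8.3 J/token versus 10.5 J/token at 450 W.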

GeForce RTX 3090 · 24 GiB · 200 W power limit (baseline runs)

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 200 W / 450 W max (44% cap)

backend: llama.cpp 59778f0 (cuda)

server: lemonade (version unknown)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: 590.48.01

python: 3.12.3

containerized: true

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true