NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-Reasoning

Q4_K_M · 30B params · GGUF

reasoning

checkpoint: unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:Q4_K_M

commit: 571758804835

weights: 22.25 GiB

All runs (14)

Hardware	Backend	Mode	Shape	Conc.	Gen tok/s	Prefill tok/s	TTFT	TPOT (ms)	Prompt tok	Out tok	Total	VRAM Δ
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	168.5	422.5	191ms	5.5	70	1000	5.93s	0.010 GiB
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	166.2	404.6	204ms	5.5	70	1000	6.02s	0.010 GiB
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	159.6	3367.8	189ms	5.5	601	500	3.13s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	158.9	2893.4	208ms	5.5	601	500	3.15s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	141.0	247.8	145ms	5.5	38	100	709ms	0.000 GiB
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	138.7	226.5	159ms	5.4	38	100	721ms	0.000 GiB
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	136.8	3465.8	291ms	5.5	868	200	1.46s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	134.2	375.9	217ms	7.0	70	1000	7.45s	0.010 GiB
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	132.4	2900.3	328ms	5.5	868	200	1.51s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	123.5	243.3	156ms	6.5	38	100	810ms	0.000 GiB
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	121.5	1301.5	462ms	7.0	601	500	4.12s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	109.6	1976.0	439ms	6.7	868	200	1.82s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	68.4	132.4	4.68s	5.6	601	500	7.57s	0.000 GiB
GeForce RTX 3090 · 24 GiB · 450 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	66.3	125.4	4.83s	5.5	601	500	7.74s	0.000 GiB
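
The page does not say how the Gen tok/s column is computed. For the concurrency-1 rows it appears to match output tokens divided by total wall time (so it includes TTFT), which would explain why it sits below the rate implied by TPOT alone. The sketch below checks that reading against a few rows copied from the table. The helper names are hypothetical and the formula is an assumption, not the benchmark's documented definition.

```python
# Rows copied from the runs table:
# (mode, shape, concurrency, reported gen tok/s, output tokens, total seconds)
ROWS = [
    ("baseline-pl-450w", "codegen", 1, 168.5, 1000, 5.93),
    ("baseline-pl-350w", "agent", 1, 159.6, 500, 3.13),
    ("baseline-pl-350w", "chat", 1, 141.0, 100, 0.709),
    ("baseline", "rag", 1, 109.6, 200, 1.82),
    ("baseline-pl-350w", "agent", 4, 68.4, 500, 7.57),
]


def end_to_end_tok_per_s(out_tok, total_s):
    """Assumed definition: output tokens over total request wall time."""
    return out_tok / total_s


def relative_gap(reported, derived):
    """Relative difference between the reported and the derived rate."""
    return abs(derived - reported) / reported


for mode, shape, conc, reported, out_tok, total_s in ROWS:
    derived = end_to_end_tok_per_s(out_tok, total_s)
    gap = relative_gap(reported, derived)
    print(f"{mode:17s} {shape:8s} conc={conc} "
          f"reported={reported:6.1f} derived={derived:6.1f} gap={gap:.1%}")
```

With the rounded values shown in the table, the concurrency-1 rows agree to within about 1%, while the concurrency-4 row differs by a few percent, so the multi-stream figures are probably aggregated differently (for example averaged per request over the 5 runs per cell).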

Environment

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 350 W / 450 W max (78% cap)

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1980/2100 MHz · mem 9501 MHz

temp: 42°C idle · 61°C peak

peak draw: 333 W

backend: llama.cpp cuda-4f13cb7 (cuda)

server: lemonade unknown

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

containerized: true

llama.cpp version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1965/2100 MHz · mem 9501 MHz

temp: 43°C idle · 74°C peak

peak draw: 429 W

backend: llama.cpp cuda-4f13cb7 (cuda)

server: lemonade unknown

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

containerized: true

llama.cpp version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
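
The two environment blocks above both report a peak draw, so the 350 W and 450 W power limits can be compared on throughput per watt. The sketch below does that for the codegen shape at concurrency 1, using Gen tok/s from the runs table. It rests on one loud assumption: peak draw is reported per environment, not per shape, so dividing by it gives only a rough lower bound on tokens per joule, not a measured energy figure. The dictionary keys and function name are hypothetical.

```python
# Values copied from this page: power cap and peak draw from the two
# environment blocks, codegen Gen tok/s (concurrency 1) from the runs table.
CONFIGS = {
    "baseline-pl-350w": {"cap_w": 350, "peak_w": 333, "gen_tok_s": 166.2},
    "baseline-pl-450w": {"cap_w": 450, "peak_w": 429, "gen_tok_s": 168.5},
}


def tok_per_joule_at_peak(cfg):
    """tok/s divided by watts is tokens per joule.

    Assumption: peak draw stands in for draw during codegen. Since peak is
    at least the average, this underestimates the true tokens per joule.
    """
    return cfg["gen_tok_s"] / cfg["peak_w"]


eff_350 = tok_per_joule_at_peak(CONFIGS["baseline-pl-350w"])
eff_450 = tok_per_joule_at_peak(CONFIGS["baseline-pl-450w"])

# Relative changes when moving from the 350 W cap to the 450 W cap.
speedup = CONFIGS["baseline-pl-450w"]["gen_tok_s"] / CONFIGS["baseline-pl-350w"]["gen_tok_s"]
power_ratio = CONFIGS["baseline-pl-450w"]["peak_w"] / CONFIGS["baseline-pl-350w"]["peak_w"]

print(f"350 W cap: {eff_350:.3f} tok/J at peak draw")
print(f"450 W cap: {eff_450:.3f} tok/J at peak draw")
print(f"throughput x{speedup:.3f} for peak draw x{power_ratio:.3f}")
```

On these numbers the higher cap buys roughly 1% more codegen throughput for roughly 29% more peak draw, which suggests single-stream decoding on this card is not power-limited at 350 W.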

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 200 W / 450 W max (44% cap)

backend: llama.cpp 59778f0 (cuda)

server: lemonade unknown

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: 590.48.01

python: 3.12.3

containerized: true

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
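
Every configuration above targets /v1/chat/completions with streaming enabled, which is what makes TTFT and TPOT observable from the client side. The page does not document the harness, so the following is only a minimal sketch of one common way to derive both numbers from chunk arrival times: TTFT as the delay to the first chunk, TPOT as the mean gap between subsequent chunks. The function name and the one-chunk-per-token assumption are mine, not the benchmark's.

```python
def stream_latency(request_start, chunk_times):
    """Return (ttft_s, tpot_ms) from a request start and chunk arrival times.

    Assumes each streamed chunk carries one output token. tpot_ms is None
    when fewer than two chunks arrived, since no inter-token gap exists.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft_s = chunk_times[0] - request_start
    if len(chunk_times) < 2:
        return ttft_s, None
    decode_span_s = chunk_times[-1] - chunk_times[0]
    tpot_ms = decode_span_s / (len(chunk_times) - 1) * 1000.0
    return ttft_s, tpot_ms


# Synthetic timings for illustration only (not measured data):
# first chunk 200 ms after the request, then one chunk every 5 ms.
start = 0.0
arrivals = [0.200 + 0.005 * i for i in range(101)]
ttft_s, tpot_ms = stream_latency(start, arrivals)
print(f"TTFT {ttft_s * 1000:.0f} ms, TPOT {tpot_ms:.1f} ms")
```

In a real client, `request_start` and `chunk_times` would come from a monotonic clock such as `time.perf_counter()` sampled when the request is sent and as each streamed chunk is read.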