LFM2 8B-A1B
Q4_K_M · 8B params · GGUF
checkpoint: LiquidAI/LFM2-8B-A1B-GGUF:Q4_K_M
commit: 11624c2ea122
weights: 4.70 GiB
All runs (25)
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 439.9 | 385.1 | 1474.1 | — | 21ms | 2.3 | — | 31 | 100 | 248ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 438.0 | 423.4 | 2183.2 | — | 34ms | 2.3 | — | 65 | 883 | 2.05s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 436.7 | 299.8 | 11883.9 | — | 52ms | 2.3 | — | 752 | 111 | 316ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 435.2 | 397.4 | 35490.6 | — | 22ms | 2.3 | — | 602 | 500 | 1.18s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 433.7 | 369.5 | 1465.5 | — | 21ms | 2.3 | — | 31 | 100 | 261ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 432.0 | 150.6 | 500.3 | — | 1.53s | 2.3 | — | 602 | 500 | 2.44s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 430.3 | 406.4 | 2167.9 | — | 33ms | 2.3 | — | 65 | 883 | 2.26s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 428.3 | 391.6 | 35824.1 | — | 23ms | 2.3 | — | 602 | 500 | 1.19s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 427.2 | 177.4 | 403.1 | — | 1.61s | 2.3 | — | 602 | 500 | 2.63s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 425.0 | 341.7 | 12238.0 | — | 53ms | 2.4 | — | 752 | 111 | 313ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | chat | 1 | 372.3 | 336.1 | 1098.5 | — | 28ms | 2.7 | — | 31 | 100 | 295ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | codegen | 1 | 370.8 | 364.9 | 2004.8 | — | 33ms | 2.7 | — | 65 | 791 | 2.17s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | rag | 1 | 367.4 | 319.2 | 13067.4 | — | 52ms | 2.7 | — | 752 | 117 | 370ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 1 | 362.1 | 355.0 | 39235.6 | — | 15ms | 2.8 | — | 602 | 426 | 1.20s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 351.9 | 318.7 | 1277.6 | — | 24ms | 2.8 | — | 31 | 100 | 302ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 344.2 | 278.6 | 9916.1 | — | 62ms | 2.9 | — | 752 | 111 | 374ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 339.1 | 332.9 | 1813.9 | — | 39ms | 2.9 | — | 65 | 883 | 2.62s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 333.6 | 315.9 | 8274.9 | — | 73ms | 3.0 | — | 602 | 434 | 1.42s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (rocm) | baseline | chat | 1 | 154.8 | 144.6 | 627.2 | — | 49ms | 6.5 | — | 31 | 100 | 691ms | 0.001 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (rocm) | baseline | codegen | 1 | 153.8 | 151.4 | 897.6 | — | 75ms | 6.5 | — | 65 | 843 | 5.59s | 0.004 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (rocm) | baseline | rag | 1 | 152.6 | 142.6 | 38318.1 | — | 25ms | 6.6 | — | 892 | 81 | 555ms | 0.001 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (rocm) | baseline | agent | 1 | 149.2 | 144.7 | 29283.6 | — | 21ms | 6.7 | — | 602 | 347 | 2.41s | 0.001 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 128.4 | 119.4 | 2116.6 | — | 289ms | 7.8 | — | 602 | 362 | 2.93s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 124.6 | 116.0 | 1985.3 | — | 339ms | 8.0 | — | 602 | 357 | 3.17s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (rocm) | baseline | agent | 4 | 63.6 | 61.2 | 1589.4 | — | 506ms | 15.7 | — | 602 | 359 | 5.76s | -0.009 GiB |
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 44°C idle · 59°C peak
peak draw: 334 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
The microbenchmarks below measure memory copy bandwidth and tensor math throughput; the raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
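The "copy 42% of theory" figure appears to be the measured copy bandwidth divided by the theoretical peak. The sketch below works through that arithmetic under two assumptions that are not stated on the page: that the theoretical peak is bus width times effective memory data rate, and that the 9751 MHz probe clock is doubled to get that data rate. Both happen to reproduce the 936 GB/s shown in the table. The function names are hypothetical.

```python
def theoretical_bandwidth_gbs(bus_width_bits: int,
                              mem_clock_mhz: float,
                              transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s.

    Assumption: the reported memory clock is multiplied by
    `transfers_per_clock` to get the effective data rate.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9


def copy_efficiency(copy_gbs: float, theory_gbs: float) -> float:
    """Fraction of theoretical bandwidth reached by the copy microbenchmark."""
    return copy_gbs / theory_gbs


# RTX 3090 values taken from the probe lines and table above.
theory = theoretical_bandwidth_gbs(bus_width_bits=384, mem_clock_mhz=9751)
eff = copy_efficiency(copy_gbs=391, theory_gbs=936)
print(f"{theory:.0f} GB/s theoretical, copy at {eff:.0%} of theory")
```

Because the copy column reads 391 GB/s at every cap, the same ratio holds at 200 W, 300 W, and 450 W, which is consistent with the "flat across caps" note.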
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1965/2100 MHz · mem 9501 MHz
temp: 44°C idle · 66°C peak
peak draw: 380 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| fixed | 256 GB/s | 106 GB/s | 30.3 TF | - |
compute: 11.5
backend: llama.cpp b8940 (rocm)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 672 GB/s | 271 GB/s | 67.9 TF | 68.4 TF |
| 250 W | 672 GB/s | 271 GB/s | 69.5 TF | 68.2 TF |
| 300 W | 672 GB/s | 270 GB/s | 69.6 TF | 68.4 TF |
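The page does not define how "copy/math spread 2.5%" is computed. One reading that reproduces the number is the largest relative gap between the lowest and highest value of any column across the three power caps, taken relative to the lowest value. The sketch below assumes that definition; `relative_spread` is a hypothetical helper, not something taken from the benchmark tooling.

```python
def relative_spread(values: list[float]) -> float:
    """(max - min) / min across power caps. Assumed definition."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


# RTX 5070 probe results at the 200 W, 250 W, and 300 W caps (from the table above).
probes = {
    "copy_gbs": [271, 271, 270],
    "fp16_tf": [67.9, 69.5, 69.6],
    "bf16_tf": [68.4, 68.2, 68.4],
}

spreads = {name: relative_spread(vals) for name, vals in probes.items()}
worst_name = max(spreads, key=spreads.get)
print(f"largest spread: {worst_name} at {spreads[worst_name]:.1%}")
```

Under this reading the FP16 column drives the spread, while the copy and BF16 columns each vary by well under 1%.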
compute: 12
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true