Mellum2-12B-A2.5B-Thinking GGUF (Q4_K_M)

Quantized GGUF version of JetBrains/Mellum2-12B-A2.5B-Thinking for use with llama.cpp.

⚠️ Requires llama.cpp with Mellum architecture support. This is not yet in mainline llama.cpp; use PR #23966 or a build that includes it.

Model Details

| Property | Value |
|---|---|
| Base model | JetBrains/Mellum2-12B-A2.5B-Thinking |
| Architecture | Mellum (MoE) |
| Total parameters | 12B |
| Active parameters | 2.5B |
| Experts | 64 (8 per token) |
| Context length | 128K (sliding window) |
| Quantization | Q4_K_M |
| File size | ~7.5 GB |
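
As a rough sanity check on the file size, the quantized size can be estimated from the parameter count and the average bits per weight. The sketch below assumes about 4.85 bits per weight for Q4_K_M, which is an approximate figure and not something stated in this card; the helper name `estimate_gguf_gb` is hypothetical.

```shell
# Estimate GGUF size in GB (10^9 bytes) from parameter count and bits per weight.
# Usage: estimate_gguf_gb <params_in_billions> <bits_per_weight>
estimate_gguf_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# 12B parameters at an assumed ~4.85 bits/weight (approximate Q4_K_M average)
estimate_gguf_gb 12 4.85   # prints 7.3
```

The estimate lands slightly under the listed ~7.5 GB. That gap is plausible, since the real average depends on which tensors are kept at higher precision and the file also carries metadata.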

Usage

llama-server \
  -m Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf \
  -c 32000 \
  -ngl 99 \
  --jinja \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --port 18081
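
Loading the model can take a while, and requests sent before it finishes may fail. A minimal sketch of a readiness check, assuming the server's `/health` endpoint returns a success status once the model is loaded; the helper name `wait_for_server` and the retry counts are illustrative and not part of llama.cpp.

```shell
# Poll the server's /health endpoint until it answers successfully.
# Usage: wait_for_server <base_url> [max_tries]   (one try per second)
wait_for_server() {
  url="$1"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP errors (e.g. 503 while loading) as failure
    if curl -sf "$url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (port matches the llama-server command above):
#   wait_for_server http://localhost:18081 120 && echo "server ready"
```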

For thinking mode, use reasoning_budget to cap reasoning tokens:

curl http://localhost:18081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Mellum2-12B-A2.5B-Thinking-Q4_K_M","messages":[{"role":"user","content":"Explain quicksort"}],"max_tokens":500,"reasoning_budget":512}'

Low VRAM (6GB GPU)

For GPUs with limited VRAM, keep the MoE expert weights of the first N layers on the CPU with -ncmoe N:

llama-server \
  -m Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf \
  -c 32000 \
  -ngl 99 \
  -ncmoe 20 \
  -t 8 \
  --jinja \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0

Tested on a GTX 1060 6GB: ~18 t/s with -ncmoe 20, ~22 t/s with -ncmoe 10. A lower value keeps more expert weights on the GPU, which is faster but needs more VRAM.

Acknowledgements
