Mellum2-12B-A2.5B-Thinking - GGUF

Quantized GGUF version of JetBrains/Mellum2-12B-A2.5B-Thinking. These were generated using the default settings with llama-quantize (b9482).

Quantizations provided

File Quantization Size
Mellum2-12B-A2.5B-Thinking-Q4_0.gguf Q4_01 6.91 GB
Mellum2-12B-A2.5B-Thinking-Q4_K_S.gguf Q4_K_S 7.4 GB
Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf Q4_K_M 8.07 GB
Mellum2-12B-A2.5B-Thinking-Q5_K_M.gguf Q5_K_M 9.21 GB
Mellum2-12B-A2.5B-Thinking-Q6_K.gguf Q6_K 10.9 GB
Mellum2-12B-A2.5B-Thinking-Q8_0.gguf Q8_0 12.9 GB

1: Q4_0 is not recommended. Perplexity doubled which suggests degredated quality, and I encountered endlessly repeating tokens with my test prompt.

Perplexity test

I tested perplexity using llama-perplexity and Salesforce's wikitext-2-raw-v1.

File Quantization Ctx PPL
Mellum2-12B-A2.5B-Thinking-Q4_0.gguf Q4_0 512 18.6689 +/- 0.17737
Mellum2-12B-A2.5B-Thinking-Q4_K_S.gguf Q4_K_S 512 9.8269 +/- 0.07248
Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf Q4_K_M 512 9.7410 +/- 0.07128
Mellum2-12B-A2.5B-Thinking-Q5_K_M.gguf Q5_K_M 512 9.4490 +/- 0.06807
Mellum2-12B-A2.5B-Thinking-Q6_K.gguf Q6_K 512 9.7329 +/- 0.07207
Mellum2-12B-A2.5B-Thinking-Q8_0.gguf Q8_0 512 9.3657 +/- 0.06734
Mellum2-12B-A2.5B-Thinking-BF16.gguf BF16 512 9.4037 +/- 0.06784

Serving with llama.cpp

llama.cpp added support for Mellum2 in release b9482. It has a max context size of 131,072. This can be served using:

llama-server \
  -hf CodeFault/Mellum2-12B-A2.5B-Thinking-GGUF:Q5_K_M \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20
Downloads last month
1,048
GGUF
Model size
12B params
Architecture
mellum
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for CodeFault/Mellum2-12B-A2.5B-Thinking-GGUF

Quantized
(25)
this model