leonsarmiento/Domyn-Small-v1.0-8bit-mlx

This model was converted to MLX format from domyn/Domyn-Small-v1.0 using 8-bit uniform quantization (8.501 bits per weight) optimized for Apple Silicon.

Domyn-Small-v1.0 is a 10B-parameter text-only model initialized from Italia 10B and continually pre-trained on 503B tokens. Built on the Nemotron architecture with ReLU² activation, it features Grouped-Query Attention (48 query heads, 8 KV heads), a 256K-token SentencePiece BPE vocabulary, and supports a dual-mode "Thinking on/off" toggle for chain-of-thought reasoning. Native context window is 32K tokens, extensible to 128K via YaRN.

Quantization Details

Property Value
Quantization 8-bit uniform
Bits per weight 8.501
Group size 64
Model size 9.7 GB
Shards 2
Source dtype bfloat16

Use with mlx-lm

pip install -U mlx-lm
python -m mlx_lm.generate --model leonsarmiento/Domyn-Small-v1.0-8bit-mlx --max-tokens 256 --prompt "Ciao, come stai?"

Recommended Inference Parameters

Thinking Off (default)

Parameter Value
Temperature 0.1
Top-p 0.95
Top-k 50
Min-p 0.1

Thinking On

Parameter Value
Temperature 0.6
Top-p 0.90
Top-k 25
Min-p 0.1

⚠️ Greedy decoding should not be used in thinking mode as it degrades reasoning quality and causes repetition.

Chat Template

This model uses a custom chat template with <extra_id_0> / <extra_id_1> role markers. Thinking mode is controlled by appending thinking on or thinking off to the system prompt. The template also supports tool calling via <tool_call> XML tags.

The chat_template.jinja file is included and the template is injected into tokenizer_config.json for compatibility.

Downloads last month
15
Safetensors
Model size
10B params
Tensor type
F16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for leonsarmiento/Domyn-Small-v1.0-8bit-mlx

Quantized
(4)
this model