Instructions for using mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit with libraries and local apps.

Libraries

How to use mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"
        }
      ]
    }
  }
}
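
If `~/.pi/agent/models.json` already lists other providers, pasting the block above over it would discard them. The following is a minimal sketch of merging the entry instead. It assumes the file has only the `providers` layout shown above; `add_mlx_provider` is a hypothetical helper, not part of Pi, and the demo writes to a temporary directory rather than the real config.

```python
import json
import tempfile
from pathlib import Path

MODEL_ID = "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"


def add_mlx_provider(config_path: Path, model_id: str = MODEL_ID) -> dict:
    """Merge the mlx-lm provider entry into models.json, keeping other providers.

    Hypothetical helper: it assumes the file uses the "providers" layout
    shown in the snippet above and nothing else about Pi's schema.
    """
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text())

    providers = config.setdefault("providers", {})
    provider = providers.setdefault(
        "mlx-lm",
        {
            "baseUrl": "http://localhost:8080/v1",
            "api": "openai-completions",
            "apiKey": "none",
            "models": [],
        },
    )

    # Add the model only once, so the helper is safe to run repeatedly.
    models = provider.setdefault("models", [])
    if not any(entry.get("id") == model_id for entry in models):
        models.append({"id": model_id})

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway file; for real use, pass
# Path.home() / ".pi" / "agent" / "models.json" instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "models.json"
    print(json.dumps(add_mlx_provider(demo_path), indent=2))
```

Running the helper twice leaves a single entry for the model, and any provider already in the file is preserved.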

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit

Run Hermes

hermes

MLX LM

How to use mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
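
The same request can be sent from Python using only the standard library. This is a sketch under two assumptions: the server listens on port 8080 (the `mlx_lm.server` default, and the port the Pi and Hermes configurations above point at), and the response follows the OpenAI chat-completions shape with the reply in `choices[0].message.content`. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit"


def build_chat_request(user_text, model=MODEL_ID,
                       base_url="http://localhost:8080/v1"):
    """Build a POST request for the chat-completions endpoint (no network I/O)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=120.0, **kwargs):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout described in the lead-in.
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Hello"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.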


mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, no PyTorch and no cloud.

A 4-bit mixed-precision MLX quant of mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-BF16 produced by mlx-optiq, the sensitivity-aware quantization toolkit for Apple Silicon. It gains about 2.0 Capability Score points over stock uniform 4-bit and wins or ties all six benchmarks.

Nemotron 3 Nano 30B-A3B is a hybrid Mamba2 + attention model with a 128-expert sparse MoE (≈3B active parameters per token). OptiQ measures each linear layer's sensitivity as the KL divergence against a reference forward pass and assigns 4-bit or 8-bit precision per layer, including the fused switch_mlp routed-expert tensors that dominate the model's parameter mass. Sensitive layers go to 8-bit; robust ones (including most of the experts) stay at 4-bit.
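
To make the idea concrete, here is a toy sketch of sensitivity-driven bit allocation in plain Python. It is not mlx-optiq's implementation: the layer names, the sensitivity values, and the greedy "most sensitive per parameter first" rule are all assumptions chosen for illustration. `kl_divergence` shows the kind of measurement involved (comparing output distributions from a reference pass and a perturbed pass), and `assign_bits` upgrades layers to 8-bit while a target average bit width still holds.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(reference_logits, perturbed_logits):
    """KL(reference || perturbed) between the two softmax distributions."""
    p = softmax(reference_logits)
    q = softmax(perturbed_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def assign_bits(sensitivity, sizes, target_bpw, low=4, high=8):
    """Greedy allocation (illustrative rule, not OptiQ's actual one).

    Every layer starts at `low` bits. Layers are visited in order of
    sensitivity per parameter and upgraded to `high` bits whenever the
    parameter-weighted average stays at or below `target_bpw`.
    """
    bits = {name: low for name in sensitivity}
    total_params = sum(sizes.values())
    used = low * total_params
    budget = target_bpw * total_params
    ranked = sorted(sensitivity, key=lambda n: sensitivity[n] / sizes[n],
                    reverse=True)
    for name in ranked:
        extra = (high - low) * sizes[name]
        if used + extra <= budget:
            bits[name] = high
            used += extra
    return bits


# Hypothetical layers: two small projections and one large expert tensor.
sensitivity = {"attn_proj": 0.9, "mamba_in": 0.5, "experts": 0.6}
sizes = {"attn_proj": 10, "mamba_in": 10, "experts": 980}
print(assign_bits(sensitivity, sizes, target_bpw=4.1))
```

In this toy setup the two small projections are upgraded and the large expert tensor stays at 4-bit, which mirrors the pattern described above: many layers at 8-bit, yet most of the parameters at 4-bit.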

Quantization details

Property	Value
Predominant precision	4-bit
Layers at 8-bit (sensitive)	127
Layers at 4-bit (robust)	36
Total quantized layers	163
Achieved BPW	5.05
Group size	64
Calibration mix	six-domain mix (40 samples)
Reference for sensitivity	uniform-4-bit (bf16 doesn't fit in 36 GB RAM)
Bundled KV-cache recipe	`kv_config.json`, 6 attention layers @ 4-bit (4.0 avg KV bits)

We follow the naming convention that llama.cpp uses for Q4_K_M-style mixed-precision quants: the "4-bit" label names the predominant precision, not the weighted average. Most of the 8-bit layers are the small Mamba and attention projections; the large routed-expert tensors mostly stay at 4-bit, which is how the model lands at 5.05 BPW.
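
A quick back-of-the-envelope check shows how 127 of 163 layers can sit at 8-bit while the average stays near 5 bits. The sketch below assumes the 5.05 BPW figure is a plain parameter-weighted mean of the nominal 4-bit and 8-bit widths, ignoring any per-group scale and bias storage. The card does not say how the figure is computed, so treat the parameter share as illustrative; if the reported BPW includes such overhead, the true 8-bit share of parameters would be smaller.

```python
def high_precision_fraction(avg_bits, low_bits=4, high_bits=8):
    """Share of parameters at high_bits, given a parameter-weighted mean.

    Solves avg = f * high + (1 - f) * low for f.
    """
    return (avg_bits - low_bits) / (high_bits - low_bits)


layer_share = 127 / 163                       # from the table above
param_share = high_precision_fraction(5.05)   # under the stated assumption

print(f"8-bit share of layers:     {layer_share:.0%}")
print(f"8-bit share of parameters: {param_share:.0%}")
```

Under that assumption roughly three quarters of the layers but only about a quarter of the parameters are at 8-bit, which is why the label follows the predominant precision rather than the layer count.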

Usage

Load it with mlx-lm (the custom NemotronH modeling files ship in the repo and are picked up automatically):

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit")
response = generate(
    model, tokenizer,
    prompt="Explain how a sparse mixture-of-experts router decides which experts to activate.",
    max_tokens=400,
)
print(response)  # generate() returns the text; without verbose=True nothing is printed

For mixed-precision KV-cache serving and sensitivity-aware LoRA fine-tuning, install mlx-optiq:

pip install mlx-optiq

# Serve with the bundled KV-cache recipe
optiq serve --model mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit \
            --kv-config kv_config.json

Benchmarks

Six-metric Capability Score (mean of MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop). Apples-to-apples comparison against stock uniform 4-bit:

Metric	OptiQ	Uniform 4-bit	Δ
MMLU (5-shot, 1000 samples)	76.2%	74.8%	+1.3
GSM8K (1000 samples, 3-shot CoT)	81.6%	78.5%	+3.1
IFEval (full set, strict)	69.1%	67.5%	+1.7
BFCL-V3 simple (200 calls)	74.0%	74.0%	+0.0
HumanEval (164 problems, pass@1)	89.0%	86.0%	+3.0
HashHop (long-context retrieval)	25.0%	22.0%	+3.0
Capability Score (mean of 6)	69.15	67.13	+2.02
On-disk size	20.6 GB	16.6 GB	+4.0

OptiQ wins or ties every benchmark. The mixed-precision allocation costs ~4 GB more on disk than stock uniform 4-bit; that extra space buys gains in knowledge, math, code, instruction-following, and long-context retrieval, with function calling (BFCL) tied. Every metric gets one equal vote; disk size is reported next to the score as an honest second axis instead of being folded in. See the eval-framework writeup for the full methodology.
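
The Capability Score rows can be reproduced from the per-metric rows as an unweighted mean. The short check below copies the percentages from the table; the dictionary and function names are just illustrative.

```python
# Per-metric scores copied from the benchmark table (percent).
optiq = {"MMLU": 76.2, "GSM8K": 81.6, "IFEval": 69.1,
         "BFCL": 74.0, "HumanEval": 89.0, "HashHop": 25.0}
uniform_4bit = {"MMLU": 74.8, "GSM8K": 78.5, "IFEval": 67.5,
                "BFCL": 74.0, "HumanEval": 86.0, "HashHop": 22.0}


def capability_score(scores):
    """Unweighted mean: every metric gets one equal vote."""
    return sum(scores.values()) / len(scores)


optiq_score = capability_score(optiq)
uniform_score = capability_score(uniform_4bit)

print(f"OptiQ:         {optiq_score:.2f}")                   # 69.15
print(f"Uniform 4-bit: {uniform_score:.2f}")                 # 67.13
print(f"Delta:         {optiq_score - uniform_score:+.2f}")  # +2.02
```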

Base model

This is a quantized derivative of NVIDIA Nemotron 3 Nano 30B-A3B. See the NVIDIA Open Model License for terms; the quant is distributed under the same license as the base model.

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune


Safetensors

Model size: 32B params
Tensor types: BF16, U32, F32

Model tree for mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-OptiQ-4bit

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (base model) → mlx-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-BF16 → this model
