How to use s-batman/Nex-N2-mini-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Use Docker

docker model run hf.co/s-batman/Nex-N2-mini-GGUF:Q4_K_M

vLLM

How to use s-batman/Nex-N2-mini-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "s-batman/Nex-N2-mini-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "s-batman/Nex-N2-mini-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use s-batman/Nex-N2-mini-GGUF with Ollama:
```
ollama run hf.co/s-batman/Nex-N2-mini-GGUF:Q4_K_M
```

Unsloth Studio

How to use s-batman/Nex-N2-mini-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for s-batman/Nex-N2-mini-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for s-batman/Nex-N2-mini-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for s-batman/Nex-N2-mini-GGUF to start chatting

Pi

How to use s-batman/Nex-N2-mini-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "s-batman/Nex-N2-mini-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use s-batman/Nex-N2-mini-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default s-batman/Nex-N2-mini-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use s-batman/Nex-N2-mini-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf s-batman/Nex-N2-mini-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "s-batman/Nex-N2-mini-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use s-batman/Nex-N2-mini-GGUF with Docker Model Runner:
```
docker model run hf.co/s-batman/Nex-N2-mini-GGUF:Q4_K_M
```

Lemonade

How to use s-batman/Nex-N2-mini-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull s-batman/Nex-N2-mini-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Nex-N2-mini-GGUF-Q4_K_M

List all available models

lemonade list

s-batman/Nex-N2-mini-GGUF

GGUF quantizations of Nex-N2-mini by Nex AGI — an agentic multimodal model with Agentic Thinking, post-trained on Qwen3.5-35B-A3B-Base. Includes standard integer quants (Q4_K_S through Q8_0) and an NVFP4 mixed-precision variant optimised for NVIDIA Blackwell GPUs.

Model Creator

Nex AGI

Original Model

nex-agi/Nex-N2-mini

Architecture Details

| Property | Value |
| --- | --- |
| Architecture | `Qwen3_5MoeForConditionalGeneration` |
| Base model | Qwen3.5-35B-A3B-Base |
| Total parameters | ~35B |
| Active parameters | ~3B per forward pass |
| Experts | 256 total, 8 routed + 1 shared per token |
| Hidden size | 2048 |
| Layers | 40 (hybrid: 3× Gated DeltaNet + 1× Full Attention per group) |
| Context length | 262,144 tokens |
| Vocabulary | 248,320 |
| Vision encoder | ViT-based, 27 blocks, 1152 hidden dim, patch 16×16 |
| Multi-Token Prediction | Not included (no MTP weights in source release) |
| License | Apache 2.0 |
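
The layer layout in the table can be checked with a little arithmetic. This is a minimal sketch that assumes the 40 layers divide evenly into groups of four (three Gated DeltaNet layers plus one full-attention layer), which is how the table reads; the function name `layer_split` is illustrative only.

```shell
# Split a layer count into DeltaNet and full-attention layers,
# assuming each group holds 3 Gated DeltaNet layers + 1 full-attention layer.
layer_split() {
  total=$1
  groups=$(( total / 4 ))
  echo "groups=${groups} deltanet=$(( groups * 3 )) full_attention=${groups}"
}

layer_split 40
```

Under that assumption the model has 10 groups, which gives 30 DeltaNet layers and 10 full-attention layers.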

Tensor Architecture Breakdown

| Category | Tensors | Size (F16) | % of Model | Sensitivity |
| --- | --- | --- | --- | --- |
| Routed experts (`ffn_*_exps`) | 120 | 60.00 GB | 92.9% | 🟢 Low (only 8/256 active) |
| Embeddings + output head | 2 | 1.89 GB | 2.9% | 🟡 Moderate |
| Attention QKV | 60 | 1.29 GB | 2.0% | 🟡 Moderate |
| SSM/DeltaNet (`ssm_*`) | 150 | 0.48 GB | 0.7% | 🔴 Critical (state tracking) |
| Attention gate | 30 | 0.47 GB | 0.7% | 🟡 Moderate |
| Shared expert (`ffn_*_shexp`) | 120 | 0.23 GB | 0.4% | 🟡 Always active |
| Attention output | 10 | 0.16 GB | 0.2% | 🟡 Moderate |
| Router (`ffn_gate_inp`) | 80 | 0.08 GB | 0.1% | 🔴 Critical (expert routing) |
| Norms/biases | 161 | ~0 GB | ~0% | 🔴 Critical |
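
The "% of Model" column is each category's F16 size divided by the total F16 size. The sketch below reproduces two rows, taking 64.6 GB (the F16 file size, which also matches the sum of the sizes in the table) as the total; `pct_of_model` is a hypothetical helper name.

```shell
# Percentage of the F16 model taken by one tensor category.
# Arguments: category size in GB, total model size in GB.
pct_of_model() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", part / total * 100 }'
}

pct_of_model 60.00 64.6   # routed experts -> 92.9
pct_of_model 0.48 64.6    # SSM/DeltaNet   -> 0.7
```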

Provided Files

Standard Quantizations

| Quant | File | Size | Use Case |
| --- | --- | --- | --- |
| F16 | `Nex-N2-mini-F16.gguf` | 64.6 GB | Unquantized 16-bit weights, maximum quality |
| Q8_0 | `Nex-N2-mini-Q8_0.gguf` | 34.4 GB | Near-lossless, good balance |
| Q6_K | `Nex-N2-mini-Q6_K.gguf` | 26.6 GB | Very high quality |
| Q5_K_M | `Nex-N2-mini-Q5_K_M.gguf` | 23.0 GB | High quality, good size |
| Q5_K_S | `Nex-N2-mini-Q5_K_S.gguf` | 22.3 GB | Good quality, smaller |
| Q5_0 | `Nex-N2-mini-Q5_0.gguf` | 22.3 GB | Good quality baseline |
| Q4_K_M | `Nex-N2-mini-Q4_K_M.gguf` | 19.7 GB | Best quality/size tradeoff |
| Q4_K_S | `Nex-N2-mini-Q4_K_S.gguf` | 18.5 GB | Smallest, acceptable quality |
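
One rough way to choose among these files is to take the largest one that fits a memory budget. The sketch below hard-codes the sizes from the table; `quant_sizes` and `pick_quant` are hypothetical helper names, and the budget is assumed to already leave room for the KV cache and runtime overhead, which the file size alone does not cover.

```shell
# File sizes in GB, copied from the table above, largest first.
quant_sizes() {
  cat <<'EOF'
F16 64.6
Q8_0 34.4
Q6_K 26.6
Q5_K_M 23.0
Q5_K_S 22.3
Q5_0 22.3
Q4_K_M 19.7
Q4_K_S 18.5
EOF
}

# Print the first (largest) quant whose file fits in the given budget in GB,
# or "none" when even the smallest file is too large.
pick_quant() {
  quant_sizes | awk -v budget="$1" '
    $2 <= budget { print $1; found = 1; exit }
    END { if (!found) print "none" }'
}

pick_quant 24   # -> Q5_K_M
pick_quant 16   # -> none
```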

Blackwell-Optimised (NVFP4)

| Quant | File | Size | Tensor Composition | Use Case |
| --- | --- | --- | --- | --- |
| NVFP4 | `Nex-N2-mini-NVFP4.gguf` | 19.4 GB | 120× NVFP4 + 312× Q8_0 + 301× F32 | Fastest on Blackwell GPUs |
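
As a quick consistency check, the tensor counts in the NVFP4 composition should add up to the same total as the per-category counts in the Tensor Architecture Breakdown table. The variable names below are illustrative.

```shell
# Tensor counts per category from the Tensor Architecture Breakdown table.
breakdown_total=$(( 120 + 2 + 60 + 150 + 30 + 120 + 10 + 80 + 161 ))
# Tensor counts per storage type from the NVFP4 row.
nvfp4_total=$(( 120 + 312 + 301 ))
echo "breakdown=${breakdown_total} nvfp4=${nvfp4_total}"
```

Both sums come to 733, and the 120 NVFP4 tensors match the 120 routed expert tensors.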

Vision Projector

| File | Size | Notes |
| --- | --- | --- |
| `mmproj-Nex-N2-mini-F16.gguf` | 0.84 GB | Required for image/vision input |

Note: The mmproj file is required for multimodal (vision) capabilities. For text-only use, it is not needed.

NVFP4 Mixed-Precision Details

The NVFP4 variant uses architecture-aware tensor mapping:

| Tensor Category | Quantization | Rationale |
| --- | --- | --- |
| Routed experts (`ffn_down_exps`, `ffn_gate_exps`, `ffn_up_exps`) | NVFP4 | 92.9% of model, only 8/256 active per token. Hardware-native FP4 dequant on Blackwell provides best throughput. |
| Router (`ffn_gate_inp`, `ffn_gate_inp_shexp`) | F32 | 0.1% of model. Critical for expert routing decisions: bad routing selects the wrong experts and produces garbage output. |
| SSM/DeltaNet (`ssm_a`, `ssm_conv1d`, `ssm_dt`, `ssm_alpha`, `ssm_beta`, `ssm_norm`, `ssm_out`) | F32 | 0.7% of model. Critical for linear attention state tracking across the sequence. |
| Shared expert, attention, embeddings, norms | Q8_0 | Moderate sensitivity, always active or frequently accessed. |

Base quant type: Q8_0. Only the routed expert weights use NVFP4; the router, SSM, shared expert, and attention tensors stay at Q8_0 or higher precision.

# Reproduction
cat > nvfp4-tensor-types.txt << 'EOF'
ffn_down_exps=nvfp4
ffn_gate_exps=nvfp4
ffn_up_exps=nvfp4
EOF

llama-quantize \
  --allow-requantize \
  --tensor-type-file nvfp4-tensor-types.txt \
  Nex-N2-mini-F16.gguf \
  Nex-N2-mini-NVFP4.gguf \
  Q8_0

Conversion Notes

Converted with --no-mtp — the source model does not include Multi-Token Prediction weights despite mtp_num_hidden_layers: 1 in config. Speculative decoding with --spec-type draft-mtp is not supported for this model.
All quants produced from F16 GGUF using llama-quantize (standard quantization, no imatrix).
The hybrid DeltaNet + Full Attention architecture is fully supported in llama.cpp builds with qwen3_5_moe architecture support.

Usage with llama.cpp

Requirements

llama.cpp build with Qwen3_5MoeForConditionalGeneration architecture support
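For NVFP4: build 8967+ with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121. The value 121 corresponds to compute capability 12.1 (the GB10 in DGX Spark); other Blackwell GPUs report a different compute capability, so set the value to match your card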
For vision: build with multimodal support (llama-mtmd-cli)
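
The value passed to CMAKE_CUDA_ARCHITECTURES is the GPU's compute capability with the dot removed. The sketch below derives it from `nvidia-smi`; the helper names are hypothetical, and it assumes a driver recent enough to expose the `compute_cap` query field.

```shell
# Convert a CUDA compute capability such as "12.1" into the value
# expected by CMAKE_CUDA_ARCHITECTURES ("121") by dropping the dot.
cuda_arch_from_cap() {
  printf '%s\n' "$1" | tr -d '.'
}

# Report the architecture value for the first GPU, if nvidia-smi is available.
# Prints nothing when no GPU or driver is found.
detect_cuda_arch() {
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
  if [ -n "$cap" ]; then
    cuda_arch_from_cap "$cap"
  fi
}

cuda_arch_from_cap 12.1   # -> 121

# Possible use when configuring the build:
#   cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(detect_cuda_arch)"
```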

Text-Only Server

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40

Multimodal Server (with Vision)

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40
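
Once the multimodal server is running, an image can be sent through the OpenAI-compatible chat endpoint. This is a minimal sketch, assuming the server accepts OpenAI-style `image_url` content parts when `--mmproj` is loaded; the function names and the example URL are placeholders, and the prompt and URL must not contain double quotes or backslashes because the body is assembled with `printf`.

```shell
# Build a chat-completions request body that pairs a text prompt with an image URL.
# Arguments: prompt text, image URL.
vision_payload() {
  printf '{"messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"%s"}}]}]}\n' "$1" "$2"
}

# Send the request to the server started above (port 8080, as in the command).
ask_about_image() {
  vision_payload "$1" "$2" |
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      --data @-
}

# Print the request body without contacting the server:
vision_payload "Describe this image in one sentence." "https://example.com/photo.jpg"
```

With the server up, `ask_about_image "Describe this image in one sentence." "<image URL>"` would post the same body and print the JSON response.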

NVFP4 on DGX Spark / Blackwell

llama-server \
  -m Nex-N2-mini-NVFP4.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk f16 -ctv f16 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --ubatch-size 2048 \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40

Download with llama.cpp

# Standard quant
llama-cli --hf-repo s-batman/Nex-N2-mini-GGUF --hf-file Nex-N2-mini-Q4_K_M.gguf -p "Hello"

# NVFP4 (Blackwell only)
llama-cli --hf-repo s-batman/Nex-N2-mini-GGUF --hf-file Nex-N2-mini-NVFP4.gguf -p "Hello"

Recommended Sampling Parameters

Per the model creators:

| Parameter | Value |
| --- | --- |
| Temperature | 0.7 |
| top_p | 0.95 |
| top_k | 40 |
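
These values can also be set per request instead of on the server command line. The sketch below builds such a request body; the helper names are hypothetical, and since `top_k` is not part of the OpenAI schema, it assumes llama-server accepts it as an extra field. The prompt must not contain double quotes or backslashes.

```shell
# Build a request body that carries the recommended sampling values.
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"temperature":0.7,"top_p":0.95,"top_k":40}\n' "$1"
}

# Post the body to a local server on port 8080 (as in the server commands above).
send_chat() {
  chat_payload "$1" |
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      --data @-
}

# Print the request body without contacting the server:
chat_payload "Hello"
```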

About Nex-N2

Nex-N2 is an agentic model built for real-world productivity scenarios. It unifies reasoning, tool use, and environment execution through an Agentic Thinking framework:

Adaptive Thinking — the model decides when to think and how deeply, executing simple actions quickly while reasoning thoroughly on critical decisions
Coherent Thinking — one consistent reasoning paradigm across general reasoning and diverse agentic tasks

Nex-N2-mini reaches first-tier performance on agentic coding, deep research, tool calling, and terminal execution benchmarks, with substantial gains over the previous-generation Nex-N1.

Important Notes

Unified memory: On DGX Spark and similar unified memory architectures, --no-mmap is recommended to avoid severe slowdowns
mmproj required for vision: The mmproj-Nex-N2-mini-F16.gguf file must be loaded with --mmproj for image/vision input
NVFP4 is Blackwell-only: The NVFP4 quantization requires NVIDIA Blackwell GPU hardware (RTX 5090, RTX PRO 6000, DGX Spark/GB10, B200, etc.)
DeltaNet layers: This model uses hybrid Gated DeltaNet + Full Attention. Ensure your llama.cpp build supports the qwen3_5_moe architecture
No MTP: The source model does not include Multi-Token Prediction weights. Do not use --spec-type draft-mtp with this model

Licensing

Apache 2.0 — same as the original nex-agi/Nex-N2-mini model.

Acknowledgments

Nex AGI — Nex-N2-mini model
Qwen Team (Alibaba Cloud) — Qwen3.5-35B-A3B-Base foundation model
ggml-org/llama.cpp — GGUF format, conversion tools, and inference engine
