How to use from
Ollama
ollama run hf.co/s-batman/Nex-N2-mini-GGUF:
Quick Links

s-batman/Nex-N2-mini-GGUF

GGUF quantizations of Nex-N2-mini by Nex AGI — an agentic multimodal model with Agentic Thinking, post-trained on Qwen3.5-35B-A3B-Base. Includes standard integer quants (Q4_K_S through Q8_0) and an NVFP4 mixed-precision variant optimised for NVIDIA Blackwell GPUs.

Model Creator

Nex AGI

Original Model

nex-agi/Nex-N2-mini

Architecture Details

Property Value
Architecture Qwen3_5MoeForConditionalGeneration
Base model Qwen3.5-35B-A3B-Base
Total parameters ~35B
Active parameters ~3B per forward pass
Experts 256 total, 8 routed + 1 shared per token
Hidden size 2048
Layers 40 (hybrid: 3× Gated DeltaNet + 1× Full Attention per group)
Context length 262,144 tokens
Vocabulary 248,320
Vision encoder ViT-based, 27 blocks, 1152 hidden dim, patch 16×16
Multi-Token Prediction Not included (no MTP weights in source release)
License Apache 2.0

Tensor Architecture Breakdown

Category Tensors Size (F16) % of Model Sensitivity
Routed experts (ffn_*_exps) 120 60.00 GB 92.9% 🟢 Low — only 8/256 active
Embeddings + output head 2 1.89 GB 2.9% 🟡 Moderate
Attention QKV 60 1.29 GB 2.0% 🟡 Moderate
SSM/DeltaNet (ssm_*) 150 0.48 GB 0.7% 🔴 Critical — state tracking
Attention gate 30 0.47 GB 0.7% 🟡 Moderate
Shared expert (ffn_*_shexp) 120 0.23 GB 0.4% 🟡 Always active
Attention output 10 0.16 GB 0.2% 🟡 Moderate
Router (ffn_gate_inp) 80 0.08 GB 0.1% 🔴 Critical — expert routing
Norms/biases 161 ~0 GB ~0% 🔴 Critical

Provided Files

Standard Quantizations

Quant File Size Use Case
F16 Nex-N2-mini-F16.gguf 64.6 GB Full precision, maximum quality
Q8_0 Nex-N2-mini-Q8_0.gguf 34.4 GB Near-lossless, good balance
Q6_K Nex-N2-mini-Q6_K.gguf 26.6 GB Very high quality
Q5_K_M Nex-N2-mini-Q5_K_M.gguf 23.0 GB High quality, good size
Q5_K_S Nex-N2-mini-Q5_K_S.gguf 22.3 GB Good quality, smaller
Q5_0 Nex-N2-mini-Q5_0.gguf 22.3 GB Good quality baseline
Q4_K_M Nex-N2-mini-Q4_K_M.gguf 19.7 GB Best quality/size tradeoff
Q4_K_S Nex-N2-mini-Q4_K_S.gguf 18.5 GB Smallest, acceptable quality

Blackwell-Optimised (NVFP4)

Quant File Size Tensor Composition Use Case
NVFP4 Nex-N2-mini-NVFP4.gguf 19.4 GB 120× NVFP4 + 312× Q8_0 + 301× F32 Fastest on Blackwell GPUs

Vision Projector

File Size Notes
mmproj-Nex-N2-mini-F16.gguf 0.84 GB Required for image/vision input

Note: The mmproj file is required for multimodal (vision) capabilities. For text-only use, it is not needed.

NVFP4 Mixed-Precision Details

The NVFP4 variant uses architecture-aware tensor mapping:

Tensor Category Quantization Rationale
Routed experts (ffn_down_exps, ffn_gate_exps, ffn_up_exps) NVFP4 92.9% of model, only 8/256 active per token. Hardware-native FP4 dequant on Blackwell provides best throughput.
Router (ffn_gate_inp, ffn_gate_inp_shexp) F32 0.1% of model. Critical for expert routing decisions — bad routing = wrong experts = garbage output.
SSM/DeltaNet (ssm_a, ssm_conv1d, ssm_dt, ssm_alpha, ssm_beta, ssm_norm, ssm_out) F32 0.7% of model. Critical for linear attention state tracking across the sequence.
Shared expert, attention, embeddings, norms Q8_0 Moderate sensitivity, always active or frequently accessed.

Base quant type: Q8_0 — ensures router, SSM, shared expert, and attention tensors maintain high quality while only the expert weights use NVFP4.

# Reproduction
cat > nvfp4-tensor-types.txt << 'EOF'
ffn_down_exps=nvfp4
ffn_gate_exps=nvfp4
ffn_up_exps=nvfp4
EOF

llama-quantize \
  --allow-requantize \
  --tensor-type-file nvfp4-tensor-types.txt \
  Nex-N2-mini-F16.gguf \
  Nex-N2-mini-NVFP4.gguf \
  Q8_0

Conversion Notes

  • Converted with --no-mtp — the source model does not include Multi-Token Prediction weights despite mtp_num_hidden_layers: 1 in config. Speculative decoding with --spec-type draft-mtp is not supported for this model.
  • All quants produced from F16 GGUF using llama-quantize (standard quantization, no imatrix).
  • The hybrid DeltaNet + Full Attention architecture is fully supported in llama.cpp builds with qwen3_5_moe architecture support.

Usage with llama.cpp

Requirements

  • llama.cpp build with Qwen3_5MoeForConditionalGeneration architecture support
  • For NVFP4: build 8967+ with -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 (Blackwell)
  • For vision: build with multimodal support (llama-mtmd-cli)

Text-Only Server

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40

Multimodal Server (with Vision)

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40

NVFP4 on DGX Spark / Blackwell

llama-server \
  -m Nex-N2-mini-NVFP4.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk f16 -ctv f16 \
  --no-mmap \
  --mlock \
  --cont-batching \
  --ubatch-size 2048 \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 40

Download with llama.cpp

# Standard quant
llama-cli --hf-repo s-batman/Nex-N2-mini-GGUF --hf-file Nex-N2-mini-Q4_K_M.gguf -p "Hello"

# NVFP4 (Blackwell only)
llama-cli --hf-repo s-batman/Nex-N2-mini-GGUF --hf-file Nex-N2-mini-NVFP4.gguf -p "Hello"

Recommended Sampling Parameters

Per the model creators:

Parameter Value
Temperature 0.7
top_p 0.95
top_k 40

About Nex-N2

Nex-N2 is an agentic model built for real-world productivity scenarios. It unifies reasoning, tool use, and environment execution through an Agentic Thinking framework:

  • Adaptive Thinking — the model decides when to think and how deeply, executing simple actions quickly while reasoning thoroughly on critical decisions
  • Coherent Thinking — one consistent reasoning paradigm across general reasoning and diverse agentic tasks

Nex-N2-mini reaches first-tier performance on agentic coding, deep research, tool calling, and terminal execution benchmarks, with substantial gains over the previous-generation Nex-N1.

Important Notes

  • Unified memory: On DGX Spark and similar unified memory architectures, --no-mmap is recommended to avoid severe slowdowns
  • mmproj required for vision: The mmproj-Nex-N2-mini-F16.gguf file must be loaded with --mmproj for image/vision input
  • NVFP4 is Blackwell-only: The NVFP4 quantization requires NVIDIA Blackwell GPU hardware (RTX 5090, RTX PRO 6000, DGX Spark/GB10, B200, etc.)
  • DeltaNet layers: This model uses hybrid Gated DeltaNet + Full Attention. Ensure your llama.cpp build supports the qwen3_5_moe architecture
  • No MTP: The source model does not include Multi-Token Prediction weights. Do not use --spec-type draft-mtp with this model

Licensing

Apache 2.0 — same as the original nex-agi/Nex-N2-mini model.

Acknowledgments

  • Nex AGI — Nex-N2-mini model
  • Qwen Team (Alibaba Cloud) — Qwen3.5-35B-A3B-Base foundation model
  • ggml-org/llama.cpp — GGUF format, conversion tools, and inference engine
Downloads last month
2,309
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for s-batman/Nex-N2-mini-GGUF

Quantized
(48)
this model