Ex0bit's picture
Add README (PRISM-NVFP4)
6cff1c7 verified
metadata
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
  - qwen3.5
  - moe
  - nvfp4
  - w4a4
  - compressed-tensors
  - vllm
pipeline_tag: text-generation

Qwen3.6-35B-A3B-PRISM-NVFP4

NVFP4 (W4A4) quantization of a PRISM-tuned Qwen3.6-35B-A3B. ~24 GB on disk, multimodal + MTP draft head preserved. Designed for NVIDIA Blackwell (SM120/SM121).

PRISM softens over-refusal behaviour and removes bias / propaganda patterns while maintaining and enhancing task performance, coherence, and multimodal capability.

Model details

  • Base: Qwen/Qwen3.6-35B-A3B (35B total, ~3B active per token, 256 routed experts)
  • PRISM: refusal-softening, bias + propaganda removal
  • Format: compressed-tensors NVFP4 (FP4 E2M1 weights + activations, UE4M3 per-block-16 scales)
  • Kept BF16: vision encoder, lm_head, router gates, embeddings, linear-attention SSM state
  • Runtime targets: vLLM (--quantization compressed-tensors), Blackwell tensor cores

Files

File Purpose
model.safetensors language-model + vision encoder weights
model_mtp.safetensors MTP draft head (optional, for speculative decode)
model.safetensors.index.json weight map
config.json, generation_config.json model + generation config
tokenizer*, processor_config.json, chat_template.jinja tokenizer + chat template

Serving (vLLM)

vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
  --quantization compressed-tensors \
  --dtype auto \
  --max-model-len 32768 \
  --trust-remote-code

Requires vLLM with Blackwell NVFP4 kernels. On SM121 (DGX Spark), use a vLLM build with SM121-aware patches — stock PyPI wheels will fault on the missing cvt.rn.satfinite.e2m1x2.f32 PTX instruction.

Known-working community Docker images (Apache 2.0, tested on GB10):

  • ghcr.io/aeon-7/vllm-spark-omni-q36 — vLLM HEAD + GB10 patches + flashinfer sm_120 kernels; also supports DFlash speculative decoding.
  • avarok/dgx-vllm-nvfp4-kernel — generic NVFP4 MoE image with software-E2M1 conversion and Marlin-MoE default.

License

Apache 2.0, inherited from the base model.

☕ Support Our Work

If you enjoy our work and find it useful, please consider sponsoring or supporting us!

Ko-fi