Add README (PRISM-NVFP4)

6cff1c7 verified about 2 months ago

2.42 kB

license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
  - qwen3.5
  - moe
  - nvfp4
  - w4a4
  - compressed-tensors
  - vllm
pipeline_tag: text-generation

Qwen3.6-35B-A3B-PRISM-NVFP4

NVFP4 (W4A4) quantization of a PRISM-tuned Qwen3.6-35B-A3B. ~24 GB on disk, multimodal + MTP draft head preserved. Designed for NVIDIA Blackwell (SM120/SM121).

PRISM softens over-refusal behaviour and removes bias / propaganda patterns while maintaining and enhancing task performance, coherence, and multimodal capability.

Model details

Base: Qwen/Qwen3.6-35B-A3B (35B total, ~3B active per token, 256 routed experts)
PRISM: refusal-softening, bias + propaganda removal
Format: compressed-tensors NVFP4 (FP4 E2M1 weights + activations, UE4M3 per-block-16 scales)
Kept BF16: vision encoder, lm_head, router gates, embeddings, linear-attention SSM state
Runtime targets: vLLM (--quantization compressed-tensors), Blackwell tensor cores

Files

File	Purpose
`model.safetensors`	language-model + vision encoder weights
`model_mtp.safetensors`	MTP draft head (optional, for speculative decode)
`model.safetensors.index.json`	weight map
`config.json`, `generation_config.json`	model + generation config
`tokenizer*`, `processor_config.json`, `chat_template.jinja`	tokenizer + chat template

Serving (vLLM)

vllm serve Ex0bit/Qwen3.6-35B-A3B-PRISM-NVFP4 \
  --quantization compressed-tensors \
  --dtype auto \
  --max-model-len 32768 \
  --trust-remote-code

Requires vLLM with Blackwell NVFP4 kernels. On SM121 (DGX Spark), use a vLLM build with SM121-aware patches — stock PyPI wheels will fault on the missing cvt.rn.satfinite.e2m1x2.f32 PTX instruction.

Known-working community Docker images (Apache 2.0, tested on GB10):

ghcr.io/aeon-7/vllm-spark-omni-q36 — vLLM HEAD + GB10 patches + flashinfer sm_120 kernels; also supports DFlash speculative decoding.
avarok/dgx-vllm-nvfp4-kernel — generic NVFP4 MoE image with software-E2M1 conversion and Marlin-MoE default.

License

Apache 2.0, inherited from the base model.

☕ Support Our Work

If you enjoy our work and find it useful, please consider sponsoring or supporting us!