---
license: apache-2.0
base_model: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
base_model_relation: quantized
library_name: transformers
pipeline_tag: image-text-to-text
model_type: qwen3_5_moe
tags:
- qwen3.6
- qwen3_5_moe
- nvfp4
- compressed-tensors
- quantized
- vllm
- blackwell
- rtx-5090
- sm120
- moe
- multimodal
- agentic
- tool-calling
- coding
- uncensored
- conversational
---

# Qwen3.6 35B A3B HauhauCS Uncensored NVFP4

Uncensored Qwen3.6 35B A3B MoE quantized to NVFP4 `compressed-tensors` for vLLM on NVIDIA Blackwell / RTX 5090.

- **35B total / 3B active MoE**
- **HauhauCS Aggressive uncensored source**
- **Conservative NVFP4 profile**: linear attention and MTP kept in bf16 for quality
- **NVFP4 W4A4 compressed-tensors**
- **~22 GB**
- **Runs on one RTX 5090**
- **100K-131K text context target**
- **vLLM native loading**

The model files are placed at the repository root so Hugging Face shows the weights in the right-side download panel and `vllm serve` can load the repo directly. The repo intentionally keeps a single root weight set to avoid full-repo snapshot downloads pulling multiple profile variants.

## Download

```bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
```

## vLLM quickstart

```bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

Local path quickstart:

```bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```

## Quantization recipe

```python
# Imports assume a recent llm-compressor release.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# `model` (the HF bf16 checkpoint) and `ds` (the calibration dataset)
# are prepared earlier in the quantization script.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    # Modules left unquantized: output head, vision tower, MoE routers,
    # linear attention, and MTP.
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
        "re:^mtp.*",
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
    moe_calibrate_all_experts=True,
    pipeline="basic",
)
```

- Calibration: `HuggingFaceH4/ultrachat_200k`, 128 samples x 1024 tokens
- MTP tensors copied from [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- Converted using [li-yifei/gguf-to-nvfp4](https://github.com/li-yifei/gguf-to-nvfp4)

Pipeline:

```text
Q8_K_P GGUF
  -> step1_convert_qwen36_moe.py -> HF bf16
  -> step2_quantize_qwen36_moe.py -> NVFP4
```

## Source models

- Uncensored source: [HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive)
- Original base: [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

## Acknowledgments

- [HauhauCS](https://huggingface.co/HauhauCS) for the uncensored GGUF source
- [Qwen](https://huggingface.co/Qwen) for the base model and MTP weights
- [AEON-7](https://huggingface.co/AEON-7) and [RedHatAI](https://huggingface.co/RedHatAI) for the conservative quantization approach used as a reference
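
## Appendix: checking the ignore patterns

The `ignore` list in the quantization recipe decides which modules stay in bf16. The sketch below is a rough way to sanity-check those patterns offline. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name (as Python's `re.match` does) and that other entries must equal the name exactly. The module names in `EXAMPLE_NAMES` are hypothetical and only illustrate the naming style; the real checkpoint may use different prefixes. `is_ignored` is an illustrative helper, not part of any library.

```python
import re

# Ignore list copied from the quantization recipe.
IGNORE = [
    "lm_head",
    "re:.*visual.*",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
    "re:.*linear_attn.*",
    "re:^mtp.*",
]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Return True if module_name matches any ignore entry.

    Assumption: "re:" entries are regexes anchored at the start of the
    name (re.match); plain entries require an exact name match.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLE_NAMES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "mtp.layers.0.self_attn.q_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.mlp.experts.5.down_proj",
    "model.language_model.layers.0.mlp.shared_expert.gate_proj",
]

for name in EXAMPLE_NAMES:
    status = "kept in bf16" if is_ignored(name) else "quantized to NVFP4"
    print(f"{name}: {status}")
```

Under these assumptions, the trailing `$` in `re:.*mlp.gate$` is what keeps the router (`mlp.gate`) unquantized while still letting expert projections such as `gate_proj` be quantized.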