Image-Text-to-Text
Transformers
Safetensors
qwen3_5_moe
qwen3.6
nvfp4
compressed-tensors
quantized
vllm
blackwell
rtx-5090
sm120
Mixture of Experts
multimodal
agentic
tool-calling
coding
uncensored
conversational
8-bit precision
Update README for root-only model layout
Remove references to the deleted aggressive and conservative profile folders. Document the repository as a single root weight set for direct vLLM loading.
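As a rough illustration of what "single root weight set" means in practice, the sketch below checks that every weight shard in a file listing sits at the repository root rather than in a profile subfolder. The helper names (`weight_dirs`, `is_root_only`) and the example file names are hypothetical and not taken from the repository; in real use the listing could come from something like `huggingface_hub.list_repo_files`, which is not called here.

```python
from pathlib import PurePosixPath

# Assumption: weight shards are identified by the .safetensors suffix.
WEIGHT_SUFFIXES = (".safetensors",)


def weight_dirs(repo_files):
    """Return the set of directories (relative to the repo root) that hold weight shards."""
    dirs = set()
    for name in repo_files:
        path = PurePosixPath(name)
        if path.suffix in WEIGHT_SUFFIXES:
            # Files at the repo root have parent "."
            dirs.add(str(path.parent))
    return dirs


def is_root_only(repo_files):
    """True when at least one weight shard exists and all shards live at the repo root."""
    return weight_dirs(repo_files) == {"."}


# Hypothetical listings, for illustration only.
root_only_listing = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]
mixed_listing = [
    "model.safetensors",
    "aggressive/model.safetensors",
]

print(is_root_only(root_only_listing))
print(is_root_only(mixed_listing))
```

A layout that passes this check is one where a full-repo snapshot download pulls exactly one weight set, which is the behavior the commit message describes.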
--- a/README.md
+++ b/README.md
@@ -30,22 +30,14 @@ Uncensored Qwen3.6 35B A3B MoE quantized to NVFP4 `compressed-tensors` for vLLM
 
 - **35B total / 3B active MoE**
 - **HauhauCS Aggressive uncensored source**
+- **Conservative NVFP4 profile**: linear attention and MTP kept in bf16 for quality
 - **NVFP4 W4A4 compressed-tensors**
 - **~22 GB**
 - **Runs on one RTX 5090**
 - **100K-131K text context target**
 - **vLLM native loading**
 
-The
-
-## Which profile should I use?
-
-| Profile | Path | Use |
-| --- | --- | --- |
-| Conservative | repo root / `conservative/` | Recommended default. Linear attention and MTP kept bf16 for quality. |
-| Aggressive | `aggressive/` | More aggressive NVFP4 coverage for smaller footprint / longer context experiments. |
-
-Recommended default: **root / conservative**.
+The model files are placed at the repository root so Hugging Face shows the weights in the right-side download panel and `vllm serve` can load the repo directly. The repo intentionally keeps a single root weight set to avoid full-repo snapshot downloads pulling multiple profile variants.
 
 ## Download
 
@@ -54,14 +46,6 @@ hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
   --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
 ```
 
-Aggressive profile only:
-
-```bash
-hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
-  --include "aggressive/*" \
-  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
-```
-
 ## vLLM quickstart
 
 ```bash
@@ -103,28 +87,6 @@ vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
   --trust-remote-code
 ```
 
-Aggressive subfolder quickstart:
-
-```bash
-hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
-  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
-
-VLLM_NVFP4_GEMM_BACKEND=marlin \
-vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4/aggressive \
-  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4-aggressive \
-  --quantization compressed-tensors \
-  --kv-cache-dtype fp8 \
-  --max-model-len 131072 \
-  --max-num-seqs 1 \
-  --max-num-batched-tokens 4096 \
-  --gpu-memory-utilization 0.90 \
-  --enable-prefix-caching \
-  --enable-auto-tool-choice \
-  --tool-call-parser qwen3_coder \
-  --reasoning-parser qwen3 \
-  --trust-remote-code
-```
-
 ## Quantization recipe
 
 ```python
@@ -144,14 +106,14 @@ oneshot(
 )
 ```
 
-- Calibration: `HuggingFaceH4/ultrachat_200k`, 128 samples
+- Calibration: `HuggingFaceH4/ultrachat_200k`, 128 samples x 1024 tokens
 - MTP tensors copied from [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
 - Converted using [li-yifei/gguf-to-nvfp4](https://github.com/li-yifei/gguf-to-nvfp4)
 
 Pipeline:
 
 ```text
-Q8_K_P GGUF
+Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4
 ```
 
 ## Source models
@@ -163,4 +125,4 @@ Q8_K_P GGUF → step1_convert_qwen36_moe.py → HF bf16 → step2_quantize_qwen3
 
 - [HauhauCS](https://huggingface.co/HauhauCS) for the uncensored GGUF source
 - [Qwen](https://huggingface.co/Qwen) for the base model and MTP weights
-- [AEON-7](https://huggingface.co/AEON-7) and [RedHatAI](https://huggingface.co/RedHatAI) for conservative quantization approach reference
+- [AEON-7](https://huggingface.co/AEON-7) and [RedHatAI](https://huggingface.co/RedHatAI) for conservative quantization approach reference
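The README's feature list pairs "35B total", "NVFP4 W4A4", and "~22 GB", with linear attention and MTP kept in bf16. A back-of-the-envelope sketch of how those numbers can relate is below. It assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-element block (about 4.5 bits per weight, ignoring per-tensor scales) and that the remaining parameters stay in bf16. The function name and the 0.95 quantized fraction are illustrative assumptions, not figures from the repository.

```python
BF16_BITS = 16.0
# Assumption: 4-bit value plus one FP8 scale shared by each 16-element block.
NVFP4_BITS = 4.0 + 8.0 / 16.0


def estimate_weight_gb(total_params, quantized_fraction):
    """Estimate checkpoint size in decimal GB for a mixed NVFP4 / bf16 model.

    quantized_fraction is the share of parameters stored as NVFP4;
    the rest are assumed to remain in bf16.
    """
    bits_per_param = (
        quantized_fraction * NVFP4_BITS
        + (1.0 - quantized_fraction) * BF16_BITS
    )
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9  # "35B total" from the feature list

# Fully quantized lower bound versus a hypothetical 95% quantized mix.
print(f"all NVFP4:  {estimate_weight_gb(TOTAL_PARAMS, 1.0):.2f} GB")
print(f"95% NVFP4:  {estimate_weight_gb(TOTAL_PARAMS, 0.95):.2f} GB")
print(f"all bf16:   {estimate_weight_gb(TOTAL_PARAMS, 0.0):.2f} GB")
```

Under these assumptions a fully quantized model would land a little under 20 GB, so a figure around 22 GB is consistent with a small share of layers left in bf16. The actual split in this checkpoint is not stated in the diff.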