lyf committed on
Commit 67a56fa · 1 Parent(s): 2e152bf

Update README for root-only model layout

Remove references to deleted aggressive and conservative profile folders. Document the repository as a single root weight set for direct vLLM loading.
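
The motivation for a single root weight set can be illustrated with a small sketch. The file listings below are hypothetical (this commit does not show the repository's actual file names), and `select_files` and `weight_sets` are illustrative helpers, not part of any Hugging Face tooling. Include patterns are emulated with shell-style globbing from Python's `fnmatch`. Under these assumptions, an unfiltered download of a multi-profile layout pulls every weight set, while a root-only layout pulls exactly one:

```python
from fnmatch import fnmatch

# Hypothetical file listings; the real file names in the repo may differ.
MULTI_PROFILE_LAYOUT = [
    "README.md",
    "config.json",
    "model.safetensors",
    "conservative/config.json",
    "conservative/model.safetensors",
    "aggressive/config.json",
    "aggressive/model.safetensors",
]
ROOT_ONLY_LAYOUT = ["README.md", "config.json", "model.safetensors"]


def select_files(paths, include=None):
    """Return the paths a download would fetch.

    With no include patterns every file is selected, which is how an
    unfiltered full-repo snapshot behaves. Patterns use shell-style globs.
    """
    if not include:
        return list(paths)
    return [p for p in paths if any(fnmatch(p, pat) for pat in include)]


def weight_sets(paths):
    """Count the distinct directories that contain a weight file."""
    return len({p.rpartition("/")[0] for p in paths if p.endswith(".safetensors")})


if __name__ == "__main__":
    print("multi-profile, no filter:", weight_sets(select_files(MULTI_PROFILE_LAYOUT)))
    print("root-only, no filter:", weight_sets(select_files(ROOT_ONLY_LAYOUT)))
    print("multi-profile, aggressive only:", select_files(MULTI_PROFILE_LAYOUT, ["aggressive/*"]))
```

With the old layout, a user had to remember an include filter to avoid fetching every variant. With the root-only layout, the unfiltered default already fetches a single weight set.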

Files changed (1)
  1. README.md +5 -43
README.md CHANGED
@@ -30,22 +30,14 @@ Uncensored Qwen3.6 35B A3B MoE quantized to NVFP4 `compressed-tensors` for vLLM

  - **35B total / 3B active MoE**
  - **HauhauCS Aggressive uncensored source**
+ - **Conservative NVFP4 profile**: linear attention and MTP kept in bf16 for quality
  - **NVFP4 W4A4 compressed-tensors**
  - **~22 GB**
  - **Runs on one RTX 5090**
  - **100K-131K text context target**
  - **vLLM native loading**

- The default model files are placed at the repository root so Hugging Face shows the weights in the right-side download panel and `vllm serve` can load the repo directly.
-
- ## Which profile should I use?
-
- | Profile | Path | Use |
- | --- | --- | --- |
- | Conservative | repo root / `conservative/` | Recommended default. Linear attention and MTP kept bf16 for quality. |
- | Aggressive | `aggressive/` | More aggressive NVFP4 coverage for smaller footprint / longer context experiments. |
-
- Recommended default: **root / conservative**.
+ The model files are placed at the repository root so Hugging Face shows the weights in the right-side download panel and `vllm serve` can load the repo directly. The repo intentionally keeps a single root weight set to avoid full-repo snapshot downloads pulling multiple profile variants.

  ## Download

@@ -54,14 +46,6 @@ hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
  ```

- Aggressive profile only:
-
- ```bash
- hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
- --include "aggressive/*" \
- --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
- ```
-
  ## vLLM quickstart

  ```bash
@@ -103,28 +87,6 @@ vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
  --trust-remote-code
  ```

- Aggressive subfolder quickstart:
-
- ```bash
- hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
- --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
-
- VLLM_NVFP4_GEMM_BACKEND=marlin \
- vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4/aggressive \
- --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4-aggressive \
- --quantization compressed-tensors \
- --kv-cache-dtype fp8 \
- --max-model-len 131072 \
- --max-num-seqs 1 \
- --max-num-batched-tokens 4096 \
- --gpu-memory-utilization 0.90 \
- --enable-prefix-caching \
- --enable-auto-tool-choice \
- --tool-call-parser qwen3_coder \
- --reasoning-parser qwen3 \
- --trust-remote-code
- ```
-
  ## Quantization recipe

  ```python
@@ -144,14 +106,14 @@ oneshot(
  )
  ```

- - Calibration: `HuggingFaceH4/ultrachat_200k`, 128 samples × 1024 tokens
+ - Calibration: `HuggingFaceH4/ultrachat_200k`, 128 samples x 1024 tokens
  - MTP tensors copied from [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
  - Converted using [li-yifei/gguf-to-nvfp4](https://github.com/li-yifei/gguf-to-nvfp4)

  Pipeline:

  ```text
- Q8_K_P GGUF → step1_convert_qwen36_moe.py → HF bf16 → step2_quantize_qwen36_moe.py → NVFP4
+ Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4
  ```

  ## Source models
@@ -163,4 +125,4 @@ Q8_K_P GGUF → step1_convert_qwen36_moe.py → HF bf16 → step2_quantize_qwen3

  - [HauhauCS](https://huggingface.co/HauhauCS) for the uncensored GGUF source
  - [Qwen](https://huggingface.co/Qwen) for the base model and MTP weights
- - [AEON-7](https://huggingface.co/AEON-7) and [RedHatAI](https://huggingface.co/RedHatAI) for conservative quantization approach reference
+ - [AEON-7](https://huggingface.co/AEON-7) and [RedHatAI](https://huggingface.co/RedHatAI) for conservative quantization approach reference