nandukmelath commited on
Commit
c38dfe7
Β·
verified Β·
1 Parent(s): 3b2c8bc

Add viral model card with full docs, benchmarks, and patch explanation

Browse files
Files changed (1) hide show
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
5
+ - Qwen/Qwen3.5-9B
6
+ tags:
7
+ - uncensored
8
+ - gguf
9
+ - qwen3
10
+ - local-llm
11
+ - no-think
12
+ - apple-silicon
13
+ - lm-studio
14
+ - zero-guardrail
15
+ - fast-inference
16
+ language:
17
+ - en
18
+ pipeline_tag: text-generation
19
+ ---
20
+
21
+ # Qwen3.5-9B Uncensored β€” No-Think Edition (GGUF)
22
+
23
+ > ⚑ Zero refusals. Zero thinking delay. 100% local.
24
+
25
+ This is a patched GGUF of [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) with one key modification: **thinking is disabled at the GGUF template level**, giving you instant responses without the 15–30 second reasoning delay.
26
+
27
+ ## What's different
28
+
29
+ Qwen3.5 is a thinking model. By default it outputs a `<think>...</think>` block before every response. This is great for hard problems but brutal for everyday use β€” you wait 20 seconds for a simple answer.
30
+
31
+ This model patches the embedded Jinja2 chat template to always output an **empty** think block:
32
+
33
+ ```
34
+ Original flow: <think> [400 tokens] </think> β†’ answer (~25s wait)
35
+ This model: <think></think> β†’ answer (<1s wait)
36
+ ```
37
+
38
+ The model's intelligence is encoded in its **weights**, not the thinking trace. Quality is the same. Speed is 25x better for time-to-first-token.
39
+
40
+ > **Want reasoning on demand?** Add `/think` to any message β€” the model will reason through it fully for that turn only.
41
+
42
+ ## Model details
43
+
44
+ | Property | Value |
45
+ |----------|-------|
46
+ | Base | Qwen3.5-9B |
47
+ | Fine-tune | HauhauCS Uncensored Aggressive |
48
+ | Quantization | Q4_K_M |
49
+ | Context | Up to 65,536 tokens |
50
+ | Parameters | 9B |
51
+ | Format | GGUF |
52
+ | Refusal rate | 0% |
53
+
54
+ ## Benchmarks (MacBook Pro M2 Pro, 16 GB)
55
+
56
+ | Metric | Value |
57
+ |--------|-------|
58
+ | Generation speed | ~22–25 tok/s |
59
+ | Time to first token | **< 1 second** |
60
+ | Context window | 65,536 tokens |
61
+ | VRAM usage | ~8.5 GB |
62
+
63
+ ## How to use
64
+
65
+ ### LM Studio (recommended)
66
+ 1. Download the Q4_K_M file below
67
+ 2. Load in LM Studio with `--context-length 65536 --gpu max`
68
+ 3. Done β€” no config needed, thinking is already patched off
69
+
70
+ ### Optimal sampling (Qwen3 official recommended)
71
+ ```
72
+ Temperature: 0.6
73
+ Top-P: 0.95
74
+ Top-K: 20
75
+ Repeat penalty: 1.0
76
+ Max tokens: 4096
77
+ ```
78
+
79
+ ### llama.cpp
80
+ ```bash
81
+ ./llama-cli -m Qwen3.5-9B-Uncensored-nothink-Q4_K_M.gguf \
82
+ --ctx-size 65536 \
83
+ --n-gpu-layers 99 \
84
+ -p "Your prompt here"
85
+ ```
86
+
87
+ ## Full automated setup for Mac
88
+
89
+ πŸ‘‰ **[github.com/nandukmelath/lmstudio-uncensored-setup](https://github.com/nandukmelath/lmstudio-uncensored-setup)**
90
+
91
+ One command: VRAM boost + auto-start + model load + Hermes Agent config:
92
+ ```bash
93
+ git clone https://github.com/nandukmelath/lmstudio-uncensored-setup
94
+ cd lmstudio-uncensored-setup && ./scripts/setup.sh
95
+ ```
96
+
97
+ ## How the patch works
98
+
99
+ The Qwen3.5 GGUF contains an embedded Jinja2 chat template with this block:
100
+
101
+ ```jinja
102
+ {%- if enable_thinking is defined and enable_thinking is false %}
103
+ {{- '<think>\n\n</think>\n\n' }}
104
+ {%- else %}
105
+ {{- '<think>\n' }}
106
+ {%- endif %}
107
+ ```
108
+
109
+ The patch replaces it with just:
110
+ ```jinja
111
+ {{- '<think>\n\n</think>\n\n' }}
112
+ ```
113
+
114
+ Same file size (padded with spaces), same structure, zero thinking overhead.
115
+ The patcher script is open source: [patch_nothink.py](https://github.com/nandukmelath/lmstudio-uncensored-setup/blob/main/scripts/patch_nothink.py)
116
+
117
+ ## Credits
118
+
119
+ - Base model: [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5) by Alibaba Cloud (Apache 2.0)
120
+ - Uncensored fine-tune: [HauhauCS](https://huggingface.co/HauhauCS) (Apache 2.0)
121
+ - No-think patch & automated setup: [@nandukmelath](https://huggingface.co/nandukmelath)
122
+
123
+ ## License
124
+
125
+ Apache 2.0