hotdogs committed on
Commit 382fd8b · verified · 1 Parent(s): ac69e72

Upload README.md with huggingface_hub

  1. README.md +92 -1
README.md CHANGED
@@ -107,6 +107,97 @@ model = model.merge_and_unload()
107
  -p "Solve this math problem step by step..."
108
  ```
109

110
  ---
111
 
112
  ## 📊 Comparison: Opus vs Kimi Reasoning
@@ -165,4 +256,4 @@ python3 extract_lora_diff.py \
165
 
166
  ## 📄 License
167
 
168
- Apache 2.0 — same as the source models.
 
107
  -p "Solve this math problem step by step..."
108
  ```
109
 
110
+
111
+ ### llama.cpp Server (Docker): Using Multi-LoRA Stacking 🔥
112
+
113
+ 🌐 **Stack multiple LoRAs at once**: combine the uncensored base model, the Opus reasoning LoRA, and the Kimi style LoRA in a single OpenAI API-compatible server:
114
+
115
+ ### llama.cpp Server (Docker) — Multi-LoRA Stacking 🔥
116
+
117
+ Combine the **uncensored base model** + **Opus reasoning LoRA** + **Kimi style LoRA** into one OpenAI-compatible API server:
118
+
119
+ ```bash
120
+ sudo docker run --rm -p 8080:8080 \
121
+ -v /path/to/models/:/models \
122
+ --gpus all \
123
+ --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
124
+ ghcr.io/ggml-org/llama.cpp:server-cuda \
125
+ -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
126
+ --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
127
+ --host 0.0.0.0 --port 8080 \
128
+ --n-gpu-layers 999 \
129
+ --tensor-split 4,13,12,12 \
130
+ --ctx-size 131072 \
131
+ --batch-size 4096 \
132
+ --ubatch-size 512 \
133
+ --cache-type-k q4_0 \
134
+ --cache-type-v q4_0 \
135
+ -fa on \
136
+ --mlock \
137
+ --jinja
138
+ ```
139
+
140
+ **What this does:**
141
+
142
+ | Component | Purpose | Weight |
143
+ |-----------|---------|--------|
144
+ | `llmfan46_...-heretic-Q6_K.gguf` | Uncensored base (35B MoE) | 🏛️ Base |
145
+ | `lordx64_...-Opus-...-adapter-F16.gguf` | Claude Opus reasoning (concise) | 0.6 = 60% |
146
+ | `qwen3.6-35b-opus-to-kimi-lora.gguf` | → Kimi K2.6 style (verbose) 🔥 | 0.8 = 80% |
147
+
148
+ **Result:** Uncensored base + Opus reasoning structure + Kimi verbose style — all in one model!
149
+
150
+ **Key flags explained:**
151
+
152
+ | Flag | Purpose |
153
+ |------|---------|
154
+ | `--lora-scaled A:α,B:β` | Stack multiple LoRA adapters with independent scales |
155
+ | `--n-gpu-layers 999` | Offload all layers to GPU |
156
+ | `--tensor-split 4,13,12,12` | Relative share of the model placed on each of the 4 GPUs (adjust for your setup) |
157
+ | `--ctx-size 131072` | 128K context window |
158
+ | `--cache-type-k q4_0` | Key (K) cache in 4-bit quantization (saves VRAM) |
159
+ | `--cache-type-v q4_0` | Value (V) cache in 4-bit quantization (saves VRAM) |
160
+ | `-fa on` | Flash Attention enabled |
161
+ | `--mlock` | Lock model in RAM (prevents swap) |
162
+ | `--jinja` | Apply the model's embedded Jinja chat template |
163
+
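To make the `--tensor-split` values concrete, here is a minimal sketch that converts the ratio list from the command above into per-GPU fractions. It assumes the values are treated as relative proportions, as the table describes; the helper name `split_fractions` is made up for this illustration and is not part of llama.cpp.

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Convert a --tensor-split string such as "4,13,12,12" into per-GPU fractions."""
    parts = [float(p) for p in tensor_split.split(",")]
    total = sum(parts)
    if total <= 0:
        raise ValueError("tensor split needs at least one positive value")
    # Each GPU's share is its value divided by the sum of all values.
    return [p / total for p in parts]


# Print the share for the split used in the multi-GPU command.
for gpu, frac in enumerate(split_fractions("4,13,12,12")):
    print(f"GPU {gpu}: {frac:.1%} of the split")
```

With `4,13,12,12` the first GPU receives 4/41 of the split, which leaves it more free memory than the other three.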
164
+ **Single GPU alternative:**
165
+ ```bash
166
+ sudo docker run --rm -p 8080:8080 \
167
+ -v /path/to/models/:/models \
168
+ --gpus all \
169
+ ghcr.io/ggml-org/llama.cpp:server-cuda \
170
+ -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
171
+ --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
172
+ --host 0.0.0.0 --port 8080 \
173
+ --n-gpu-layers 999 \
174
+ --ctx-size 32768 \
175
+ --batch-size 2048 \
176
+ --cache-type-k q4_0 --cache-type-v q4_0 \
177
+ -fa on --mlock --jinja
178
+ ```
179
+
180
+ **API Usage (OpenAI-compatible):**
181
+ ```bash
182
+ curl http://localhost:8080/v1/chat/completions \
183
+ -H "Content-Type: application/json" \
184
+ -d '{
185
+ "model": "gpt-3.5-turbo",
186
+ "messages": [
187
+ {"role": "user", "content": "Explain quantum entanglement step by step"}
188
+ ],
189
+ "temperature": 0.7,
190
+ "max_tokens": 4096
191
+ }'
192
+ ```
193
+
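The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; it assumes the server is reachable at `http://localhost:8080` and returns the usual OpenAI-style `choices[0].message.content` field. The function names `build_chat_request` and `ask` are illustrative, not part of any library.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080",
                       temperature: float = 0.7,
                       max_tokens: int = 4096) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": "gpt-3.5-turbo",  # placeholder name copied from the curl example
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    # Assumption: the response follows the OpenAI chat completion schema.
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain quantum entanglement step by step")` against a running server should return the reply as a string; long reasoning outputs may need a larger timeout.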
194
+ > 💡 **Tip:** Adjust the LoRA scales (written here as `Opus:Kimi`) to tune the reasoning style:
195
+ > - `0.6:0.8` — Balanced (Opus structure + Kimi verbosity)
196
+ > - `0.3:1.0` — Heavy Kimi style
197
+ > - `1.0:0.2` — Mostly Opus, slight Kimi touch
198
+ > - `0.0:1.0` — Pure Kimi style (skip Opus adapter entirely)
199
+
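When experimenting with the presets in the tip above, it can help to generate the `--lora-scaled` value rather than edit it by hand. The following is a small sketch under stated assumptions: it emits the `FNAME:SCALE,...` form used in the Docker commands, drops any adapter whose scale is 0 (matching the "skip Opus adapter entirely" case), and uses the adapter paths from those commands. The names `lora_scaled_arg` and `PRESETS` are hypothetical.

```python
OPUS = "/models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf"
KIMI = "/models/qwen3.6-35b-opus-to-kimi-lora.gguf"

# (Opus scale, Kimi scale) pairs taken from the tip above.
PRESETS = {
    "balanced": (0.6, 0.8),
    "heavy_kimi": (0.3, 1.0),
    "mostly_opus": (1.0, 0.2),
    "pure_kimi": (0.0, 1.0),
}


def lora_scaled_arg(adapters: list[tuple[str, float]]) -> str:
    """Join (path, scale) pairs into the FNAME:SCALE,... form used above."""
    kept = []
    for path, scale in adapters:
        if "," in path:
            raise ValueError(f"comma in adapter path would break the list: {path}")
        if scale == 0:
            continue  # a zero-scale adapter contributes nothing, so leave it out
        kept.append(f"{path}:{scale}")
    return ",".join(kept)


# Print the flag value for every preset.
for name, (opus_scale, kimi_scale) in PRESETS.items():
    value = lora_scaled_arg([(OPUS, opus_scale), (KIMI, kimi_scale)])
    print(f"{name}: --lora-scaled {value}")
```

The printed value can be pasted after `--lora-scaled` in either Docker command.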
200
+
201
  ---
202
 
203
  ## 📊 Comparison: Opus vs Kimi Reasoning
 
256
 
257
  ## 📄 License
258
 
259
+ Apache 2.0 — same as the source models.