	---
	license: apache-2.0
	tags:
	- lora
	- peft
	- qwen3.5-moe
	- qwen3.6
	- reasoning
	- kimi-k2.6
	- claude-opus
	- distillation
	- weight-diff
	- svd
	language:
	- en
	- th
	pipeline_tag: text-generation
	base_model: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
	---

	# Opus → Kimi Reasoning LoRA

	> 🧠 Extracted by [UKA](https://github.com/nousresearch/hermes-agent) — an AI agent powered by Hermes Agent.
	> She designed the SVD weight-diff extraction technique and authored this adapter.

	A rank-16 LoRA adapter that converts [Claude 4.7 Opus reasoning style](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled) into [Kimi K2.6](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled) reasoning style — on the same 35B Mixture-of-Experts base model.

	No training. Pure linear algebra.

	---

	## 🔬 How It Works: Weight-Diff SVD Extraction

	Both lordx64 models share the exact same base (`Qwen/Qwen3.6-35B-A3B`) and were fine-tuned with LoRA (merged back).

	Mathematically:

	```
	W_opus = W_base + delta_Opus
	W_kimi = W_base + delta_Kimi

	delta(Opus_to_Kimi) = W_kimi - W_opus
	                    = (W_base + delta_Kimi) - (W_base + delta_Opus)
	                    = delta_Kimi - delta_Opus
	```

	The base model cancels out — only the reasoning delta remains!
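The cancellation can be checked on a toy example. The sketch below uses made-up 2x2 integer matrices (not real weights) so the arithmetic is exact; with real BF16 checkpoints the identity holds only up to rounding.

```python
# Toy 2x2 integer "weights"; all values are made up for illustration.
W_base = [[10, -4], [7, 3]]
delta_opus = [[1, 0], [-2, 5]]
delta_kimi = [[3, -1], [0, 2]]

def add(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two equally shaped nested lists."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

W_opus = add(W_base, delta_opus)
W_kimi = add(W_base, delta_kimi)

# Diffing the two merged checkpoints never reads W_base directly...
opus_to_kimi = sub(W_kimi, W_opus)

# ...yet it equals the difference of the two fine-tuning deltas.
assert opus_to_kimi == sub(delta_kimi, delta_opus)
print(opus_to_kimi)  # [[2, -1], [2, -3]]
```

This is why the extraction needs only the two fine-tuned checkpoints and never has to load the original base model.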

	### SVD Compression

	The raw delta is 70+ GB. We compress it to rank-16 LoRA via truncated SVD:

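	```python
	import torch

	# Applied to each attention weight tensor (q/k/v/o_proj) in turn.
	def delta_to_lora(w_kimi, w_opus, rank=16):
	    delta = (w_kimi - w_opus).float()                        # [out, in]
	    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)  # thin SVD
	    sqrt_s = S[:rank].sqrt()                                 # split singular values evenly
	    lora_B = U[:, :rank] * sqrt_s                            # [out, rank]
	    lora_A = sqrt_s[:, None] * Vh[:rank, :]                  # [rank, in]
	    return lora_A, lora_B
	```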
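Why the square-root split works can be written out as a short derivation. Here `r` is the rank, `U_r`, `Σ_r`, `V_r` denote the first `r` singular vectors and values of the delta, and `α` is the LoRA alpha. The last line assumes the standard LoRA convention of scaling the product `BA` by `α / r` at load time.

```latex
\begin{aligned}
\Delta &= U \Sigma V^{\top}, \qquad
\Delta_r = U_r \Sigma_r V_r^{\top} \\
B &= U_r \Sigma_r^{1/2}, \qquad
A = \Sigma_r^{1/2} V_r^{\top} \\
BA &= U_r \Sigma_r^{1/2} \Sigma_r^{1/2} V_r^{\top} = \Delta_r \\
\lVert \Delta - \Delta_r \rVert_F^2 &= \sum_{i > r} \sigma_i^2 \\
W_{\text{opus}} + \tfrac{\alpha}{r} BA &= W_{\text{opus}} + \Delta_r \approx W_{\text{kimi}}
\quad \text{when } \alpha = r = 16
\end{aligned}
```

The fourth line is the Eckart-Young result: the truncation error is exactly the energy in the discarded singular values. The difference of two merged LoRA updates has rank at most the sum of their ranks, so a rank-16 adapter is a close approximation only to the extent that the top 16 singular values dominate.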

	- Input: two 72 GB models (~145 GB on disk)
	- Memory used: ~3 GB RAM (processed tensor by tensor; no GPU needed)
	- Compute: ~44 SVDs on CPU (< 3 minutes)
	- Output: 7.2 MB LoRA adapter (rank 16, attention-only)

	### Target Modules

	Only full-attention layers (every 4th layer in Qwen3.5-MoE):

	| Layer | q_proj | k_proj | v_proj | o_proj |
	|-------|--------|--------|--------|--------|
	| 3, 7, 11, 15, 19, 23, 27, 31 | ✅ | ✅ | ✅ | ✅ |
	| 35, 39 | Δ = 0 | Δ = 0 | Δ = 0 | Δ = 0 |

	> ⚡ Interesting finding: Layers 35 and 39 have zero delta. The Opus and Kimi checkpoints hold identical attention weights there, so the two fine-tunes do not differ in these layers at all!
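The layer pattern in the table can be reproduced with a few lines. This is a sketch under assumptions: `NUM_LAYERS = 40` is inferred only from the highest index shown (39), and the tensor key format is hypothetical, not taken from the actual checkpoints.

```python
NUM_LAYERS = 40            # assumption: inferred from the highest index (39) in the table
FULL_ATTENTION_EVERY = 4   # "every 4th layer", per the text above
PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")
ZERO_DELTA_LAYERS = {35, 39}  # reported above as having zero delta

# Layers 3, 7, 11, ... are the ones whose 1-based position is a multiple of 4.
full_attention_layers = [
    i for i in range(NUM_LAYERS) if (i + 1) % FULL_ATTENTION_EVERY == 0
]

# Hypothetical key format; real checkpoints may name tensors differently.
nonzero_targets = [
    f"layers.{i}.self_attn.{proj}"
    for i in full_attention_layers
    if i not in ZERO_DELTA_LAYERS
    for proj in PROJECTIONS
]

print(full_attention_layers)  # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
print(len(nonzero_targets))   # 32
```

Under this assumption there are 40 attention tensors, 32 of them with a nonzero delta. The Technical Details table counts 44 tensors across 11 layers, which this 40-layer sketch does not account for.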

	### Why Attention-Only?

	The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj).
	We match the same target modules for compatibility.

	The 3D expert tensors (256, 2048, 512) were intentionally skipped — both for compatibility with the existing adapter and because reasoning style is primarily encoded in attention patterns, not expert FFN weights.

	---



	## 🌍 METHOD: Languages

	The universal Weight-Diff SVD Extraction guide is available in 5 languages:

	| File | Language |
	|------|----------|
	| [METHOD.md](./METHOD.md) | 🇹🇭 Thai (ไทย) |
	| [METHOD_EN.md](./METHOD_EN.md) | 🇬🇧 English |
	| [METHOD_ZH.md](./METHOD_ZH.md) | 🇨🇳 Chinese (中文) |
	| [METHOD_JP.md](./METHOD_JP.md) | 🇯🇵 Japanese (日本語) |
	| [METHOD_VN.md](./METHOD_VN.md) | 🇻🇳 Vietnamese (Tiếng Việt) |

	All files cover the same material: requirements, the 5 steps, the math, examples for other models, troubleshooting, and references.

	---

	## 📦 Available Formats

	### PEFT (Python)
	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM

	# Load the Opus-distilled model that the adapter was extracted against.
	base = AutoModelForCausalLM.from_pretrained(
	    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	# Attach the Opus-to-Kimi adapter, then fold it into the base weights.
	model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora")
	model = model.merge_and_unload()
	```

	### GGUF (llama.cpp)
	```bash
	./llama-cli \
	-m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \
	--lora qwen3.6-35b-opus-to-kimi-lora.gguf \
	-p "Solve this math problem step by step..."
	```



	> ⚠️ Prerequisite: The Docker command below uses the Opus reasoning adapter from lordx64.
	> Download it first:
	> ```bash
	> wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors
	> # Or use the GGUF version for llama.cpp:
	> # Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter
	> ```
	> Or use only the Kimi adapter without Opus: `--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0`


	### llama.cpp Server (Docker) — Multi-LoRA Stacking 🔥

	Combine the uncensored base model + Opus reasoning LoRA + Kimi style LoRA into one OpenAI-compatible API server:

	```bash
	sudo docker run --rm -p 8080:8080 \
	-v /path/to/models/:/models \
	--gpus all \
	--env CUDA_VISIBLE_DEVICES=0,1,2,3 \
	ghcr.io/ggml-org/llama.cpp:server-cuda \
	-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
	--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
	--host 0.0.0.0 --port 8080 \
	--n-gpu-layers 999 \
	--tensor-split 4,13,12,12 \
	--ctx-size 131072 \
	--batch-size 4096 \
	--ubatch-size 512 \
	--cache-type-k q4_0 \
	--cache-type-v q4_0 \
	-fa on \
	--mlock \
	--jinja
	```

	What this does:

	| Component | Purpose | Weight |
	|-----------|---------|--------|
	| `llmfan46_...-heretic-Q6_K.gguf` | Uncensored base (35B MoE) | 🏛️ Base |
	| `lordx64_...-Opus-...-adapter-F16.gguf` | Claude Opus reasoning (concise) | 0.6 = 60% |
	| `qwen3.6-35b-opus-to-kimi-lora.gguf` | → Kimi K2.6 style (verbose) 🔥 | 0.8 = 80% |

	Result: Uncensored base + Opus reasoning structure + Kimi verbose style — all in one model!

	Key flags explained:

	| Flag | Purpose |
	|------|---------|
	| `--lora-scaled A:α,B:β` | Stack multiple LoRA adapters, each with its own scale (comma-separated) |
	| `--lora` | Apply a LoRA adapter at full scale (1.0) |
	| `--n-gpu-layers 999` | Offload all layers to GPU |
	| `--tensor-split 4,13,12,12` | Relative share of the model placed on each of 4 GPUs (adjust for your setup) |
	| `--ctx-size 131072` | 128K context window |
	| `--cache-type-k q4_0` | Key cache quantized to 4 bits (saves VRAM) |
	| `--cache-type-v q4_0` | Value cache quantized to 4 bits |
	| `-fa on` | Flash Attention enabled |
	| `--mlock` | Lock model in RAM (prevents swap) |
	| `--jinja` | Use Jinja chat templates |

	---

	### 🛡️ 3-Layer Stack with Refusal Removal LoRA

	For the purest uncensored stack using weight-diff extracted LoRAs:

	| Layer | Component | Purpose |
	|-------|-----------|---------|
	| 1 | Opus GGUF (base model) | Qwen3.6-35B + Opus reasoning |
	| 2 | [refusal-removal-lora](https://huggingface.co/hotdogs/qwen3.6-35b-refusal-removal-lora) | 🛡️ Remove refusals (uncensored) |
	| 3 | opus-to-kimi-lora (scale 0.5) | 🎨 Kimi K2.6 verbose style |

	```bash
	docker run --gpus all -p 8080:8080 \
	-v /path/to/models:/models \
	ghcr.io/ggml-org/llama.cpp:server-cuda \
	-m /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Q6_K.gguf \
	--lora /models/qwen3.6-35b-refusal-removal-lora.gguf \
	--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.5 \
	--host 0.0.0.0 --port 8080 \
	--n-gpu-layers 999 \
	--ctx-size 131072 \
	--batch-size 4096 \
	-fa on
	```

	> 🔬 Technical note: The refusal-removal LoRA was extracted via Weight-Diff SVD from `huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated` minus `lordx64/...Opus`. It modifies only o_proj in 10 layers (3,7,11,15,19,23,27,31,35,39) — an extremely sparse signal compared to full distillation (Kimi LoRA touches all 44 attention tensors).

	---

	Single-GPU alternative for the old stack (uncensored GGUF base):
	```bash
	sudo docker run --rm -p 8080:8080 \
	-v /path/to/models/:/models \
	--gpus all \
	ghcr.io/ggml-org/llama.cpp:server-cuda \
	-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
	--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
	--host 0.0.0.0 --port 8080 \
	--n-gpu-layers 999 \
	--ctx-size 32768 \
	--batch-size 2048 \
	--cache-type-k q4_0 --cache-type-v q4_0 \
	-fa on --mlock --jinja
	```

	API Usage (OpenAI-compatible):
	```bash
	curl http://localhost:8080/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{
	"model": "gpt-3.5-turbo",
	"messages": [
	{"role": "user", "content": "Explain quantum entanglement step by step"}
	],
	"temperature": 0.7,
	"max_tokens": 4096
	}'
	```
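The same request can be issued from Python with only the standard library. This is a minimal sketch, assuming the server from the Docker command is reachable at `localhost:8080` and returns the usual OpenAI-style `choices[0].message.content` field; the helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

# Assumption: the llama.cpp server started above is listening locally.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt, temperature=0.7, max_tokens=4096):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "model": "gpt-3.5-turbo",  # placeholder name copied from the curl example
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Building the request needs no running server; calling ask() does.
request = build_chat_request("Explain quantum entanglement step by step")
```

Call `ask("...")` once the container is up. Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.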

	> 💡 Tip: Adjust the LoRA scales (written below as `Opus:Kimi`) to tune the reasoning style:
	> - `0.6:0.8` — Balanced (Opus structure + Kimi verbosity)
	> - `0.3:1.0` — Heavy Kimi style
	> - `1.0:0.2` — Mostly Opus, slight Kimi touch
	> - `0.0:1.0` — Pure Kimi style (skip Opus adapter entirely)
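To switch between these settings without retyping long paths, the `--lora-scaled` value can be generated. The sketch below follows the `path:scale,path:scale` form used in the Docker commands in this README; the preset names are hypothetical labels for the four scale pairs in the tip.

```python
# Paths copied from the Docker command in this README.
OPUS = "/models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf"
KIMI = "/models/qwen3.6-35b-opus-to-kimi-lora.gguf"

# Scale pairs from the tip, as (Opus scale, Kimi scale). Names are illustrative.
PRESETS = {
    "balanced": (0.6, 0.8),
    "heavy_kimi": (0.3, 1.0),
    "mostly_opus": (1.0, 0.2),
    "pure_kimi": (0.0, 1.0),
}

def lora_scaled_value(preset):
    """Return the value to pass after --lora-scaled for a named preset."""
    opus_scale, kimi_scale = PRESETS[preset]
    pairs = [(OPUS, opus_scale), (KIMI, kimi_scale)]
    # A zero-scale adapter contributes nothing, so leave it out entirely.
    return ",".join(f"{path}:{scale}" for path, scale in pairs if scale != 0)

print(lora_scaled_value("pure_kimi"))
# /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0
```

Dropping the zero-scale adapter matches the "skip Opus adapter entirely" wording of the last preset and avoids loading a file that has no effect.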


	---

	## 📊 Comparison: Opus vs Kimi Reasoning

	| Trait | Claude Opus | + Kimi LoRA |
	|-------|-------------|-------------|
	| Thinking tokens (mean) | 849 | 2,933 (3.5x longer) |
	| Thinking tokens (p95) | 2,404 | 9,764 |
	| Style | Concise, direct | Verbose, deliberate |
	| Best for | Quick reasoning | Deep multi-step reasoning |
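The ratios behind the table, and what they imply for generation limits, can be worked out directly. The token counts are copied from the table; comparing them with the `max_tokens` value of 4096 from the API example in this README is an illustration, on the assumption that thinking tokens count against that limit.

```python
# Figures copied from the comparison table.
opus_mean, kimi_mean = 849, 2933
opus_p95, kimi_p95 = 2404, 9764

mean_ratio = kimi_mean / opus_mean
p95_ratio = kimi_p95 / opus_p95

# max_tokens used in the API example in this README.
max_tokens = 4096
kimi_p95_fits = kimi_p95 <= max_tokens

print(f"mean: {mean_ratio:.2f}x, p95: {p95_ratio:.2f}x, p95 fits: {kimi_p95_fits}")
# mean: 3.45x, p95: 4.06x, p95 fits: False
```

The tail grows faster than the mean (about 4.1x at p95 versus about 3.5x on average), so a limit sized for Opus-style output may truncate the longest Kimi-style reasoning traces.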

	---

	## 🛠️ Technical Details

	| Parameter | Value |
	|-----------|-------|
	| Method | Weight-diff SVD extraction |
	| Rank | 16 |
	| LoRA Alpha | 16 |
	| Target modules | q_proj, k_proj, v_proj, o_proj |
	| Tensors extracted | 44 (attention weights across 11 layers) |
	| Tensor shapes | q: [8192, 2048], k/v: [512, 2048], o: [2048, 4096] |
	| Adapter size | 7.2 MB (PEFT) / 14 MB (GGUF F32) |
	| Precision | BF16 to F32 (GGUF) |
	| Extraction time | ~3 min (CPU SVD) |
	| Disk needed | ~145 GB (temporary, for both full models) |
	| RAM needed | ~3 GB (no GPU required) |
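The adapter sizes follow from the rank and tensor shapes in the table. The check below assumes the PEFT file stores BF16 values (2 bytes each) and ignores file headers and metadata.

```python
RANK = 16
NUM_LAYERS = 11  # "44 (attention weights across 11 layers)"

# [out_features, in_features] per projection, from the table.
SHAPES = {
    "q_proj": (8192, 2048),
    "k_proj": (512, 2048),
    "v_proj": (512, 2048),
    "o_proj": (2048, 4096),
}

# lora_B is [out, RANK] and lora_A is [RANK, in], so each tensor costs RANK * (out + in).
params_per_layer = sum(RANK * (out + inp) for out, inp in SHAPES.values())
total_params = params_per_layer * NUM_LAYERS

MIB = 1024 ** 2
bf16_mib = total_params * 2 / MIB  # 2 bytes per BF16 value (assumed PEFT storage)
f32_mib = total_params * 4 / MIB   # 4 bytes per F32 value

print(total_params, round(bf16_mib, 1), round(f32_mib, 1))
# 3784704 7.2 14.4
```

About 3.8 million parameters give roughly 7.2 MiB in BF16 and 14.4 MiB in F32, in line with the sizes listed for the PEFT and GGUF files.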

	---

	## 🧪 Reproduction

	The full extraction script and methodology are available in the UKA Hermes Agent session log.

	```bash
	# Quick reproduction
	python3 extract_lora_diff.py \
	--opus-path ./model_opus \
	--kimi-path ./model_kimi \
	--rank 16 \
	--output ./opus-to-kimi-lora
	```

	---

	## 👩‍💻 Credits

	- UKA (Hermes Agent) — designed the weight-diff SVD technique, wrote all extraction code, authored this README
	- lordx64 — trained the source models ([Opus](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled), [Kimi](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled))
	- Qwen Team — base model [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
	- Bas95 — original reasoning distillation datasets
	- Hermes Agent — [nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent)

	---

	## 📄 License

	Apache 2.0 — same as the source models.