---
license: apache-2.0
tags:
- lora
- peft
- qwen3.5-moe
- qwen3.6
- reasoning
- kimi-k2.6
- claude-opus
- distillation
- weight-diff
- svd
language:
- en
- th
pipeline_tag: text-generation
base_model: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
---
# Opus → Kimi Reasoning LoRA
> 🧠 **Extracted by [UKA](https://github.com/nousresearch/hermes-agent)**, an AI agent powered by Hermes Agent.
> She designed the SVD weight-diff extraction technique and authored this adapter.
A **rank-16 LoRA adapter** that converts [Claude 4.7 Opus reasoning style](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled) into **[Kimi K2.6](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled) reasoning style**, on the same 35B Mixture-of-Experts base model.
No training. Pure linear algebra.
---
## 🔬 How It Works: Weight-Diff SVD Extraction
Both lordx64 models share the **exact same base** (`Qwen/Qwen3.6-35B-A3B`) and were fine-tuned with LoRA adapters that were then merged back into the weights.
Mathematically:
```
W_opus = W_base + delta_Opus
W_kimi = W_base + delta_Kimi
delta(Opus_to_Kimi) = W_kimi - W_opus
= (W_base + delta_Kimi) - (W_base + delta_Opus)
= delta_Kimi - delta_Opus
```
The base model cancels out: only the **reasoning delta** remains!
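The cancellation can be checked numerically on toy matrices. The sketch below uses small random NumPy arrays as stand-ins for one weight tensor; the shapes and variable names are illustrative assumptions, not values taken from the real checkpoints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one weight tensor (real tensors are far larger).
w_base = rng.standard_normal((8, 4))
delta_opus = 0.01 * rng.standard_normal((8, 4))  # merged Opus fine-tune update
delta_kimi = 0.01 * rng.standard_normal((8, 4))  # merged Kimi fine-tune update

w_opus = w_base + delta_opus
w_kimi = w_base + delta_kimi

# Subtracting the two fine-tuned tensors removes the shared base.
opus_to_kimi = w_kimi - w_opus

# Adding the difference to the Opus weights recovers the Kimi weights.
recovered = w_opus + opus_to_kimi

print(np.allclose(opus_to_kimi, delta_kimi - delta_opus))
print(np.allclose(recovered, w_kimi))
```

Both checks only hold up to floating-point rounding, which is why they use `np.allclose` rather than exact equality.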
### SVD Compression
The raw delta is 70+ GB. We compress it to rank-16 LoRA via truncated SVD:
```python
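import torch

def delta_to_lora(w_kimi, w_opus, rank=16):
    """Factor one attention weight delta into rank-`rank` LoRA matrices."""
    delta = (w_kimi - w_opus).float()                        # [out, in]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)  # decompose
    sqrt_s = S[:rank].sqrt()                                 # split singular values evenly
    lora_B = U[:, :rank] * sqrt_s                            # [out, rank]
    lora_A = sqrt_s[:, None] * Vh[:rank, :]                  # [rank, in]
    return lora_A, lora_B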
```
- **Input**: 2x 72 GB models (~145 GB disk)
- **Memory used**: ~3 GB (tensors are processed one at a time; no GPU needed)
- **Compute**: ~44 SVDs on CPU (< 3 minutes)
- **Output**: 7.2 MB LoRA adapter (rank=16, attention-only)
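How much of a delta survives rank-16 truncation depends on its singular-value spectrum. As a rough sketch on a synthetic low-rank-plus-noise matrix (not the real weights), the relative Frobenius error of the truncation can be read directly from the discarded singular values. `truncation_error` is a hypothetical helper name, and the matrix sizes are arbitrary.

```python
import numpy as np

def truncation_error(delta: np.ndarray, rank: int) -> float:
    """Relative Frobenius error of the best rank-`rank` approximation."""
    s = np.linalg.svd(delta, compute_uv=False)
    # Eckart-Young: the squared error is the sum of the discarded s_i^2.
    return float(np.sqrt(np.sum(s[rank:] ** 2) / np.sum(s ** 2)))

rng = np.random.default_rng(0)
# Synthetic delta: a rank-8 signal plus small dense noise.
signal = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 32))
delta = signal + 1e-3 * rng.standard_normal((64, 32))

U, S, Vh = np.linalg.svd(delta, full_matrices=False)
rank = 16
lora_B = U[:, :rank] * np.sqrt(S[:rank])            # [out, rank]
lora_A = np.sqrt(S[:rank])[:, None] * Vh[:rank, :]  # [rank, in]

measured = np.linalg.norm(delta - lora_B @ lora_A) / np.linalg.norm(delta)
predicted = truncation_error(delta, rank)
print(f"relative error at rank {rank}: {measured:.2e} (predicted {predicted:.2e})")
```

Running the same measurement on a real delta tensor would show whether rank 16 is enough for that tensor or whether a higher rank is worth the extra adapter size.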
### Target Modules
Only **full-attention** layers (every 4th layer in Qwen3.5-MoE):
| Layers | q_proj | k_proj | v_proj | o_proj |
|--------|--------|--------|--------|--------|
| 3, 7, 11, 15, 19, 23, 27, 31 | ✅ | ✅ | ✅ | ✅ |
| **35, 39** | delta = 0 | delta = 0 | delta = 0 | **delta = 0** |
> ⚡ **Interesting finding**: Layers 35 and 39 have **zero delta**: the Kimi fine-tune did not touch these layers at all!
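A check like the following would surface such untouched tensors before any SVD is run. It is a sketch over plain dictionaries of NumPy arrays; the tensor names are made up for illustration, and a real run would read the tensors from the checkpoint shards instead.

```python
import numpy as np

def zero_delta_keys(opus: dict, kimi: dict) -> list:
    """Return names of tensors that are identical in both models."""
    return [name for name in opus if np.array_equal(opus[name], kimi[name])]

rng = np.random.default_rng(0)
# Hypothetical tensor names; only layer 3 differs between the toy "models".
opus = {
    "layers.3.self_attn.q_proj.weight": rng.standard_normal((4, 4)),
    "layers.35.self_attn.q_proj.weight": rng.standard_normal((4, 4)),
}
kimi = {
    "layers.3.self_attn.q_proj.weight": opus["layers.3.self_attn.q_proj.weight"] + 0.01,
    "layers.35.self_attn.q_proj.weight": opus["layers.35.self_attn.q_proj.weight"].copy(),
}

print(zero_delta_keys(opus, kimi))
```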
### Why Attention-Only?
The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj).
We match the same target modules for compatibility.
The 3D expert tensors (256, 2048, 512) were intentionally skipped, both for compatibility with the existing adapter and because reasoning style is assumed to be encoded primarily in attention patterns rather than in expert FFN weights.
---
## 🌍 METHOD: Languages
The universal Weight-Diff SVD Extraction guide is available in 5 languages:
| File | Language |
|------|----------|
| [METHOD.md](./METHOD.md) | 🇹🇭 Thai |
| [METHOD_EN.md](./METHOD_EN.md) | 🇬🇧 English |
| [METHOD_ZH.md](./METHOD_ZH.md) | 🇨🇳 Chinese |
| [METHOD_JP.md](./METHOD_JP.md) | 🇯🇵 Japanese |
| [METHOD_VN.md](./METHOD_VN.md) | 🇻🇳 Vietnamese |
All files contain: requirements, 5 steps, math, examples for other models, troubleshooting, and references.
---
## 📦 Available Formats
### PEFT (Python)
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the Opus-distilled model that the adapter was extracted against.
base = AutoModelForCausalLM.from_pretrained(
    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the Opus-to-Kimi adapter, then fold it into the weights.
model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora")
model = model.merge_and_unload()
```
### GGUF (llama.cpp)
```bash
./llama-cli \
-m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \
--lora qwen3.6-35b-opus-to-kimi-lora.gguf \
-p "Solve this math problem step by step..."
```
> ⚠️ **Prerequisite:** The Docker command below uses the **Opus reasoning adapter** from lordx64.
> Download it first:
> ```bash
> wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors
> # Or use the GGUF version for llama.cpp:
> # Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter
> ```
> Or use only the Kimi adapter without Opus: `--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0`
### llama.cpp Server (Docker): Multi-LoRA Stacking 🔥
Combine the **uncensored base model** + **Opus reasoning LoRA** + **Kimi style LoRA** into one OpenAI-compatible API server:
```bash
sudo docker run --rm -p 8080:8080 \
-v /path/to/models/:/models \
--gpus all \
--env CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--tensor-split 4,13,12,12 \
--ctx-size 131072 \
--batch-size 4096 \
--ubatch-size 512 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-fa on \
--mlock \
--jinja
```
**What this does:**
| Component | Purpose | Scale |
|-----------|---------|-------|
| `llmfan46_...-heretic-Q6_K.gguf` | Uncensored base (35B MoE) | 🏛️ Base |
| `lordx64_...-Opus-...-adapter-F16.gguf` | Claude Opus reasoning (concise) | 0.6 = 60% |
| `qwen3.6-35b-opus-to-kimi-lora.gguf` | → Kimi K2.6 style (verbose) 🔥 | 0.8 = 80% |
**Result:** Uncensored base + Opus reasoning structure + Kimi verbose style, all in one model!
**Key flags explained:**
| Flag | Purpose |
|------|---------|
| `--lora-scaled A:α,B:β` | Stack multiple LoRA adapters with independent scales |
| `--n-gpu-layers 999` | Offload all layers to GPU |
| `--tensor-split 4,13,12,12` | Split across 4 GPUs (adjust for your setup) |
| `--ctx-size 131072` | 128K context window |
| `--cache-type-k q4_0` | Key cache in 4-bit quantization (saves VRAM) |
| `--cache-type-v q4_0` | Value cache in 4-bit quantization (saves VRAM) |
| `-fa on` | Flash Attention enabled |
| `--mlock` | Lock model in RAM (prevents swap) |
| `--jinja` | Use Jinja2 chat templates |
| `--lora` | Apply LoRA adapter (applied first, before scaled) |
| `--lora-scaled` | Apply LoRA with scale (comma-separated for multiple) |
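Conceptually, each stacked adapter contributes its low-rank product, multiplied by its scale, to the base weight. The toy sketch below illustrates that arithmetic only. It is not llama.cpp's implementation, the helper name is hypothetical, and it assumes each adapter's own alpha/rank factor is 1 (as with rank 16 and alpha 16 for this adapter).

```python
import numpy as np

def apply_stacked_loras(w_base, adapters):
    """Return w_base + sum(scale * B @ A) over (A, B, scale) triples."""
    w = w_base.copy()
    for lora_a, lora_b, scale in adapters:
        w += scale * (lora_b @ lora_a)
    return w

rng = np.random.default_rng(0)
out_dim, in_dim, rank = 6, 5, 2
w_base = rng.standard_normal((out_dim, in_dim))

# Two toy adapters standing in for the Opus and Opus-to-Kimi LoRAs.
opus = (rng.standard_normal((rank, in_dim)), rng.standard_normal((out_dim, rank)), 0.6)
kimi = (rng.standard_normal((rank, in_dim)), rng.standard_normal((out_dim, rank)), 0.8)

w_eff = apply_stacked_loras(w_base, [opus, kimi])

# A scale of 0.0 removes an adapter's contribution entirely.
w_kimi_only = apply_stacked_loras(w_base, [(opus[0], opus[1], 0.0), kimi])
print(np.allclose(w_kimi_only, w_base + 0.8 * (kimi[1] @ kimi[0])))
```

Because the contributions simply add, changing one adapter's scale does not alter what the other adapter contributes.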
---
### 🛡️ 3-Layer Stack with Refusal Removal LoRA
For the **purest uncensored stack** using weight-diff extracted LoRAs:
| Layer | Component | Purpose |
|-------|-----------|---------|
| 1 | Opus GGUF (base model) | Qwen3.6-35B + Opus reasoning |
| 2 | [refusal-removal-lora](https://huggingface.co/hotdogs/qwen3.6-35b-refusal-removal-lora) | 🛡️ Remove refusals (uncensored) |
| 3 | opus-to-kimi-lora (scale 0.5) | 🎨 Kimi K2.6 verbose style |
```bash
docker run --gpus all -p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Q6_K.gguf \
--lora /models/qwen3.6-35b-refusal-removal-lora.gguf \
--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.5 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--ctx-size 131072 \
--batch-size 4096 \
-fa on
```
> 🔬 **Technical note**: The refusal-removal LoRA was extracted via Weight-Diff SVD from `huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated` minus `lordx64/...Opus`. It modifies **only o_proj** in 10 layers (3,7,11,15,19,23,27,31,35,39), an extremely sparse signal compared to full distillation (the Kimi LoRA touches all 44 attention tensors).
---
**Old stack (uncensored GGUF base), single-GPU alternative:**
```bash
sudo docker run --rm -p 8080:8080 \
-v /path/to/models/:/models \
--gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--ctx-size 32768 \
--batch-size 2048 \
--cache-type-k q4_0 --cache-type-v q4_0 \
-fa on --mlock --jinja
```
**API Usage (OpenAI-compatible):**
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Explain quantum entanglement step by step"}
],
"temperature": 0.7,
"max_tokens": 4096
}'
```
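The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the commands above is listening on `localhost:8080`; `build_payload` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

def build_payload(prompt: str, temperature: float = 0.7, max_tokens: int = 4096) -> dict:
    """Build the JSON body for /v1/chat/completions."""
    return {
        "model": "gpt-3.5-turbo",  # same placeholder name as the curl example
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the prompt to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `chat("Explain quantum entanglement step by step")` should return the reply text as a string.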
> 💡 **Tip:** Adjust the LoRA scales (written here as `Opus:Kimi`) to tune the reasoning style:
> - `0.6:0.8`: Balanced (Opus structure + Kimi verbosity)
> - `0.3:1.0`: Heavy Kimi style
> - `1.0:0.2`: Mostly Opus, slight Kimi touch
> - `0.0:1.0`: Pure Kimi style (skip Opus adapter entirely)
---
## 📊 Comparison: Opus vs Kimi Reasoning
| Trait | Claude Opus | + Kimi LoRA |
|-------|-------------|-------------|
| Thinking tokens (mean) | 849 | **2,933** (3.5x longer) |
| Thinking tokens (p95) | 2,404 | **9,764** |
| Style | Concise, direct | Verbose, deliberate |
| Best for | Quick reasoning | Deep multi-step reasoning |
---
## 🛠️ Technical Details
| Parameter | Value |
|-----------|-------|
| Method | Weight-diff SVD extraction |
| Rank | 16 |
| LoRA Alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Tensors extracted | 44 (attention weights across 11 layers) |
| Tensor shapes | q:[8192,2048] k/v:[512,2048] o:[2048,4096] |
| Adapter size | 7.2 MB (PEFT) / 14 MB (GGUF F32) |
| Precision | BF16 to F32 (GGUF) |
| Extraction time | ~3 min (CPU SVD) |
| Disk needed | ~145 GB (temporary, for both full models) |
| Memory needed | ~3 GB (no GPU required) |
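The adapter size can be sanity-checked from the table. The sketch below counts LoRA parameters for the listed tensor shapes, reading each shape as [out, in] and assuming 11 layers with four tensors each, 2 bytes per parameter for BF16 and 4 for F32. File headers and metadata are ignored.

```python
RANK = 16
LAYERS = 11

# [out, in] shapes from the table above.
shapes = {
    "q_proj": (8192, 2048),
    "k_proj": (512, 2048),
    "v_proj": (512, 2048),
    "o_proj": (2048, 4096),
}

# Each tensor yields lora_B [out, rank] and lora_A [rank, in].
params_per_layer = sum(RANK * (out_dim + in_dim) for out_dim, in_dim in shapes.values())
total_params = LAYERS * params_per_layer

bf16_mib = total_params * 2 / 2**20
f32_mib = total_params * 4 / 2**20
print(f"{total_params:,} parameters")
print(f"BF16: {bf16_mib:.1f} MiB, F32: {f32_mib:.1f} MiB")
```

This gives 3,784,704 parameters, or roughly 7.2 MiB in BF16 and 14.4 MiB in F32, in line with the adapter sizes listed.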
---
## 🧪 Reproduction
The full extraction script and methodology are available in the UKA Hermes Agent session log.
```bash
# Quick reproduction
python3 extract_lora_diff.py \
--opus-path ./model_opus \
--kimi-path ./model_kimi \
--rank 16 \
--output ./opus-to-kimi-lora
```
---
## 👩‍💻 Credits
- **UKA** (Hermes Agent): designed the weight-diff SVD technique, wrote all extraction code, authored this README
- **lordx64**: trained the source models ([Opus](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled), [Kimi](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled))
- **Qwen Team**: base model [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- **Bas95**: original reasoning distillation datasets
- **Hermes Agent**: [nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent)
---
## 📄 License
Apache 2.0, same as the source models.