---
license: apache-2.0
tags:
- lora
- peft
- qwen3.5-moe
- qwen3.6
- reasoning
- kimi-k2.6
- claude-opus
- distillation
- weight-diff
- svd
language:
- en
- th
pipeline_tag: text-generation
base_model: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
---

# Opus → Kimi Reasoning LoRA

> 🧠 **Extracted by [UKA](https://github.com/nousresearch/hermes-agent)**, an AI agent powered by Hermes Agent.
> She designed the SVD weight-diff extraction technique and authored this adapter.

A **rank-16 LoRA adapter** that converts [Claude 4.7 Opus reasoning style](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled) into **[Kimi K2.6](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled) reasoning style**, on the same 35B Mixture-of-Experts base model. No training. Pure linear algebra.

---

## 🔬 How It Works: Weight-Diff SVD Extraction

Both lordx64 models share the **exact same base** (`Qwen/Qwen3.6-35B-A3B`) and were fine-tuned with LoRA (merged back). Mathematically:

```
W_opus = W_base + delta_Opus
W_kimi = W_base + delta_Kimi

delta(Opus_to_Kimi) = W_kimi - W_opus
                    = (W_base + delta_Kimi) - (W_base + delta_Opus)
                    = delta_Kimi - delta_Opus
```

The base model cancels out, so only the **reasoning delta** remains!

### SVD Compression

The raw delta is 70+ GB.
We compress it to rank-16 LoRA via truncated SVD:

```python
import torch

def delta_to_lora(w_kimi, w_opus, rank=16):
    delta = w_kimi.float() - w_opus.float()                  # [out, in]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)  # decompose
    root_s = S[:rank].sqrt()                                 # split singular values evenly
    lora_B = U[:, :rank] * root_s                            # [out, rank]
    lora_A = root_s[:, None] * Vh[:rank, :]                  # [rank, in]
    return lora_A, lora_B  # call once per attention weight tensor
```

- **Input**: 2x 72 GB models (~145 GB disk)
- **Memory used**: ~3 GB RAM (tensors are processed one at a time, no GPU needed)
- **Compute**: ~44 SVDs on CPU (< 3 minutes)
- **Output**: 7.2 MB LoRA adapter (rank=16, attention-only)

### Target Modules

Only **full-attention** layers (every 4th layer in Qwen3.5-MoE):

| Layer | q_proj | k_proj | v_proj | o_proj |
|-------|--------|--------|--------|--------|
| 3, 7, 11, 15, 19, 23, 27, 31 | ✅ | ✅ | ✅ | ✅ |
| **35, 39** | Δ = 0 | Δ = 0 | Δ = 0 | **Δ = 0** |

> ⚡ **Interesting finding**: Layers 35 and 39 have **zero delta**. The Opus and Kimi checkpoints have identical attention weights there, which suggests the fine-tunes left these layers untouched.

### Why Attention-Only?

The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj). We match the same target modules for compatibility. The 3D expert tensors (256, 2048, 512) were intentionally skipped, both for compatibility with the existing adapter and on the working assumption that reasoning style is encoded primarily in attention patterns rather than in expert FFN weights.

---

## 📦 Available Formats

### PEFT (Python)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the Opus-distilled model that this adapter was extracted against.
base = AutoModelForCausalLM.from_pretrained(
    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the adapter, then fold it into the weights.
model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora")
model = model.merge_and_unload()
```

### GGUF (llama.cpp)

```bash
./llama-cli \
  -m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \
  --lora qwen3.6-35b-opus-to-kimi-lora.gguf \
  -p "Solve this math problem step by step..."
```

> ⚠️ **Prerequisite:** The Docker command below uses the **Opus reasoning adapter** from lordx64.
> Download it first:
> ```bash
> wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors
> # Or use the GGUF version for llama.cpp:
> # Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter
> ```
> Or use only the Kimi adapter without Opus: `--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0`

### llama.cpp Server (Docker): Multi-LoRA Stacking 🔥

Combine the **uncensored base model** + **Opus reasoning LoRA** + **Kimi style LoRA** into one OpenAI-compatible API server:

```bash
sudo docker run --rm -p 8080:8080 \
  -v /path/to/models/:/models \
  --gpus all \
  --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
  --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 999 \
  --tensor-split 4,13,12,12 \
  --ctx-size 131072 \
  --batch-size 4096 \
  --ubatch-size 512 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -fa on \
  --mlock \
  --jinja
```

**What this does:**

| Component | Purpose | LoRA scale |
|-----------|---------|------------|
| `llmfan46_...-heretic-Q6_K.gguf` | Uncensored base (35B MoE) | 🏛️ Base |
| `lordx64_...-Opus-...-adapter-F16.gguf` | Claude Opus reasoning (concise) | 0.6 = 60% |
| `qwen3.6-35b-opus-to-kimi-lora.gguf` | → Kimi K2.6 style (verbose) 🔥 | 0.8 = 80% |

**Result:** Uncensored base + Opus reasoning structure + Kimi verbose style, all in one model!

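
To make the stacking arithmetic concrete, here is a minimal NumPy sketch. It assumes each adapter contributes an additive update `user_scale * (alpha / rank) * B @ A` to the weight it targets, which is the usual LoRA merge rule; the function name `stack_lora`, the toy shapes, and the random matrices are illustrative only and are not taken from llama.cpp or from this adapter.

```python
import numpy as np

def stack_lora(w_base, adapters):
    """Return w_base plus the scaled low-rank update of every adapter.

    adapters: list of (lora_A, lora_B, alpha, user_scale) tuples, where
    lora_A is [rank, in] and lora_B is [out, rank].
    """
    w = w_base.astype(np.float64).copy()
    for lora_A, lora_B, alpha, user_scale in adapters:
        rank = lora_A.shape[0]
        # Assumed merge rule: user scale times the adapter's own alpha / rank.
        w += user_scale * (alpha / rank) * (lora_B @ lora_A)
    return w

# Toy stand-ins for one attention projection (hypothetical shapes).
rng = np.random.default_rng(0)
out_dim, in_dim, rank = 32, 24, 4
w_base = rng.normal(size=(out_dim, in_dim))
opus = (rng.normal(size=(rank, in_dim)), rng.normal(size=(out_dim, rank)), rank, 0.6)
kimi = (rng.normal(size=(rank, in_dim)), rng.normal(size=(out_dim, rank)), rank, 0.8)

stacked = stack_lora(w_base, [opus, kimi])

# The updates are additive, so the order of the adapters does not matter.
swapped = stack_lora(w_base, [kimi, opus])
order_independent = np.allclose(stacked, swapped)

# A scale of 0.0 removes that adapter's contribution entirely.
opus_off = (opus[0], opus[1], opus[2], 0.0)
kimi_only = stack_lora(w_base, [opus_off, kimi])
kimi_alone = stack_lora(w_base, [kimi])
zero_scale_is_noop = np.allclose(kimi_only, kimi_alone)

print(order_independent, zero_scale_is_noop)
```

Under the same additive assumption, and if the Opus adapter at scale 1.0 reproduces `delta_Opus` while this adapter approximates `delta_Kimi - delta_Opus`, scales of `0.6` and `0.8` work out to roughly `0.8 * delta_Kimi - 0.2 * delta_Opus` on the attention weights of the base model.
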
**Key flags explained:**

| Flag | Purpose |
|------|---------|
| `--lora-scaled A:α,B:β` | Stack multiple LoRA adapters with independent scales |
| `--n-gpu-layers 999` | Offload all layers to GPU |
| `--tensor-split 4,13,12,12` | Split across 4 GPUs (adjust for your setup) |
| `--ctx-size 131072` | 128K context window |
| `--cache-type-k q4_0` | Key cache in 4-bit quantization (saves VRAM) |
| `--cache-type-v q4_0` | Value cache in 4-bit quantization |
| `-fa on` | Flash Attention enabled |
| `--mlock` | Lock model in RAM (prevents swap) |
| `--jinja` | Use the model's Jinja chat template |

**Single GPU alternative:**

```bash
sudo docker run --rm -p 8080:8080 \
  -v /path/to/models/:/models \
  --gpus all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
  --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --batch-size 2048 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -fa on --mlock --jinja
```

**API Usage (OpenAI-compatible):**

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement step by step"}
    ],
    "temperature": 0.7,
    "max_tokens": 4096
  }'
```

> 💡 **Tip:** Adjust the two LoRA scales (written here as `Opus:Kimi`) to tune the reasoning style:
> - `0.6:0.8`: Balanced (Opus structure + Kimi verbosity)
> - `0.3:1.0`: Heavy Kimi style
> - `1.0:0.2`: Mostly Opus, slight Kimi touch
> - `0.0:1.0`: Kimi adapter only (skip the Opus adapter entirely)

---

## 📊 Comparison: Opus vs Kimi Reasoning

| Trait | Claude Opus | + Kimi LoRA |
|-------|-------------|-------------|
| Thinking tokens (mean) | 849 | **2,933** (3.5x longer) |
| Thinking tokens (p95) | 2,404 | **9,764** |
| Style | Concise, direct | Verbose, deliberate |
| Best for | Quick reasoning | Deep multi-step reasoning |

---

## 🛠️ Technical Details

| Parameter | Value |
|-----------|-------|
| Method | Weight-diff SVD extraction |
| Rank | 16 |
| LoRA Alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Tensors extracted | 44 (attention weights across 11 layers) |
| Tensor shapes | q:[8192,2048] k/v:[512,2048] o:[2048,4096] |
| Adapter size | 7.2 MB (PEFT) / 14 MB (GGUF F32) |
| Precision | BF16 to F32 (GGUF) |
| Extraction time | ~3 min (CPU SVD) |
| Disk needed | ~145 GB (temporary, for both full models) |
| Memory needed | ~3 GB RAM (no GPU required) |

---

## 🧪 Reproduction

The full extraction script and methodology are available in the UKA Hermes Agent session log.

```bash
# Quick reproduction
python3 extract_lora_diff.py \
  --opus-path ./model_opus \
  --kimi-path ./model_kimi \
  --rank 16 \
  --output ./opus-to-kimi-lora
```

---

## 👩‍💻 Credits

- **UKA** (Hermes Agent): designed the weight-diff SVD technique, wrote all extraction code, authored this README
- **lordx64**: trained the source models ([Opus](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled), [Kimi](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled))
- **Qwen Team**: base model [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- **Bas95**: original reasoning distillation datasets
- **Hermes Agent**: [nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent)

---

## 📄 License

Apache 2.0, same as the source models.