| license: apache-2.0 | |
| tags: | |
| - lora | |
| - peft | |
| - qwen3.5-moe | |
| - qwen3.6 | |
| - reasoning | |
| - kimi-k2.6 | |
| - claude-opus | |
| - distillation | |
| - weight-diff | |
| - svd | |
| language: | |
| - en | |
| - th | |
| pipeline_tag: text-generation | |
| base_model: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled | |
| # Opus → Kimi Reasoning LoRA | |
| > **Extracted by [UKA](https://github.com/nousresearch/hermes-agent)**, an AI agent powered by Hermes Agent. | |
| > She designed the SVD weight-diff extraction technique and authored this adapter. | |
| A **rank-16 LoRA adapter** that converts [Claude 4.7 Opus reasoning style](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled) into **[Kimi K2.6](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled) reasoning style** on the same 35B Mixture-of-Experts base model. | |
| No training. Pure linear algebra. | |
| --- | |
| ## How It Works: Weight-Diff SVD Extraction | |
| Both lordx64 models share the **exact same base** (`Qwen/Qwen3.6-35B-A3B`) and were fine-tuned with LoRA (merged back). | |
| Mathematically: | |
| ``` | |
| W_opus = W_base + delta_Opus | |
| W_kimi = W_base + delta_Kimi | |
| delta(Opus_to_Kimi) = W_kimi - W_opus | |
| = (W_base + delta_Kimi) - (W_base + delta_Opus) | |
| = delta_Kimi - delta_Opus | |
| ``` | |
| The base model cancels out, so only the **reasoning delta** remains! | |
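The cancellation can be checked numerically. The following is a minimal NumPy sketch on toy matrices; the shapes, the random data, and the variable names are illustrative assumptions, not the real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim = 8, 6  # toy shapes, not the real model's

w_base = rng.standard_normal((out_dim, in_dim))
delta_opus = rng.standard_normal((out_dim, in_dim)) * 0.01
delta_kimi = rng.standard_normal((out_dim, in_dim)) * 0.01

w_opus = w_base + delta_opus
w_kimi = w_base + delta_kimi

# The shared base drops out of the difference.
opus_to_kimi = w_kimi - w_opus
assert np.allclose(opus_to_kimi, delta_kimi - delta_opus)

# Adding the difference to the Opus weights recovers the Kimi weights.
assert np.allclose(w_opus + opus_to_kimi, w_kimi)
```

Note that `w_base` never has to be loaded: only the two fine-tuned checkpoints enter the subtraction.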
| ### SVD Compression | |
| The raw delta is 70+ GB. We compress it to rank-16 LoRA via truncated SVD: | |
| ```python | |
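| import torch | |
| def delta_to_lora(w_kimi, w_opus, rank=16):  # run once per attention weight tensor | |
|     delta = (w_kimi - w_opus).float()                        # [out, in] | |
|     U, S, Vh = torch.linalg.svd(delta, full_matrices=False)  # decompose | |
|     root = S[:rank].sqrt()                                   # split each singular value evenly | |
|     lora_B = U[:, :rank] * root                              # [out, rank] | |
|     lora_A = root[:, None] * Vh[:rank, :]                    # [rank, in] | |
|     return lora_A, lora_B | |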
| ``` | |
| - **Input**: 2x 72 GB models (~145 GB disk) | |
| - **Memory used**: ~3 GB (tensor-by-tensor, no GPU needed) | |
| - **Compute**: ~44 SVDs on CPU (< 3 minutes) | |
| - **Output**: 7.2 MB LoRA adapter (rank=16, attention-only) | |
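How much a rank-limited truncation loses depends on the rank of the delta itself. The sketch below, in NumPy on toy matrices, illustrates one general fact: the difference of two independent rank-r updates can have rank up to 2r, so truncating it back to rank r is usually lossy. The helper names, shapes, and toy rank are assumptions for illustration; the source does not report the reconstruction error of the real adapter.

```python
import numpy as np

def truncated_svd_factors(delta, rank):
    """Return (lora_A, lora_B) so that lora_B @ lora_A approximates delta at the given rank."""
    u, s, vh = np.linalg.svd(delta, full_matrices=False)
    root = np.sqrt(s[:rank])
    lora_b = u[:, :rank] * root            # [out, rank]
    lora_a = root[:, None] * vh[:rank, :]  # [rank, in]
    return lora_a, lora_b

def relative_error(delta, lora_a, lora_b):
    """Frobenius-norm error of the low-rank reconstruction, relative to the delta."""
    return np.linalg.norm(delta - lora_b @ lora_a) / np.linalg.norm(delta)

rng = np.random.default_rng(0)
r = 4  # toy rank, standing in for the real rank of 16

# Two independent rank-r updates, as if each fine-tune had been a rank-r LoRA.
update_opus = rng.standard_normal((32, r)) @ rng.standard_normal((r, 24))
update_kimi = rng.standard_normal((32, r)) @ rng.standard_normal((r, 24))
delta = update_kimi - update_opus

rank_of_delta = np.linalg.matrix_rank(delta)  # at most 2 * r
err_r = relative_error(delta, *truncated_svd_factors(delta, r))
err_2r = relative_error(delta, *truncated_svd_factors(delta, 2 * r))
print(rank_of_delta, err_r, err_2r)
```

Running the same `relative_error` check on the real tensors would show whether rank 16 captures most of the Opus-to-Kimi delta or whether a higher rank is worth the extra size.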
| ### Target Modules | |
| Only **full-attention** layers (every 4th layer in Qwen3.5-MoE): | |
| | Layer | q_proj | k_proj | v_proj | o_proj | | |
| |-------|--------|--------|--------|--------| | |
| | 3, 7, 11, 15, 19, 23, 27, 31 | yes | yes | yes | yes | | |
| | **35, 39** | del=0 | del=0 | del=0 | del=0 | | |
| > **Interesting finding**: Layers 35 and 39 have **zero delta**. The Opus and Kimi weights are identical there, so the Kimi fine-tune did not move these layers away from Opus at all. | |
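The layer pattern in the table and the zero-delta check can be sketched as follows. This assumes a 40-layer stack (inferred from the highest index in the table, 39) and uses a hypothetical `is_zero_delta` helper; neither is taken from the extraction script.

```python
import numpy as np

NUM_LAYERS = 40           # assumption: inferred from the highest layer index (39)
FULL_ATTENTION_EVERY = 4  # "every 4th layer"

# Indices 3, 7, 11, ... are the last layer of each group of four.
full_attention_layers = [
    i for i in range(NUM_LAYERS) if i % FULL_ATTENTION_EVERY == FULL_ATTENTION_EVERY - 1
]

def is_zero_delta(w_a, w_b):
    """True when two weight tensors are exactly equal, i.e. their delta is all zeros."""
    return w_a.shape == w_b.shape and bool(np.array_equal(w_a, w_b))

print(full_attention_layers)
```

A tensor flagged by `is_zero_delta` yields an all-zero LoRA pair, so it can be skipped without changing the adapter's effect.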
| ### Why Attention-Only? | |
| The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj). | |
| We match the same target modules for compatibility. | |
| The 3D expert tensors (256, 2048, 512) were intentionally skipped, both for compatibility with the existing adapter and because reasoning style is primarily encoded in attention patterns, not expert FFN weights. | |
| --- | |
| ## METHOD: Languages | |
| The universal Weight-Diff SVD Extraction guide is available in 5 languages: | |
| | File | Language | | |
| |------|-----------------| | |
| | [METHOD.md](./METHOD.md) | Thai (ไทย) | | |
| | [METHOD_EN.md](./METHOD_EN.md) | English | | |
| | [METHOD_ZH.md](./METHOD_ZH.md) | Chinese (中文) | | |
| | [METHOD_JP.md](./METHOD_JP.md) | Japanese (日本語) | | |
| | [METHOD_VN.md](./METHOD_VN.md) | Vietnamese (Tiếng Việt) | | |
| All files contain: requirements, 5 steps, math, examples for other models, troubleshooting, and references. | |
| --- | |
| ## Available Formats | |
| ### PEFT (Python) | |
| ```python | |
| import torch | |
| from peft import PeftModel | |
| from transformers import AutoModelForCausalLM | |
| base = AutoModelForCausalLM.from_pretrained( | |
|     "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled", | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
| ) | |
| model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora") | |
| model = model.merge_and_unload()  # fold the adapter into the base weights | |
| ``` | |
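Conceptually, merging folds each low-rank pair into the corresponding dense weight. The following NumPy sketch shows that arithmetic on toy shapes, assuming the standard LoRA scaling of `lora_alpha / rank` (1.0 for this adapter, given rank 16 and alpha 16 in Technical Details). `merge_lora` is a hypothetical helper for illustration, not part of the PEFT API.

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, lora_alpha=16, rank=16):
    """Fold a LoRA update into a dense weight: W + (alpha / rank) * B @ A."""
    scaling = lora_alpha / rank
    return weight + scaling * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 24))  # toy dense weight, [out, in]
a = rng.standard_normal((16, 24))  # lora_A, [rank, in]
b = rng.standard_normal((32, 16))  # lora_B, [out, rank]

merged = merge_lora(w, a, b)
print(merged.shape)
```

With alpha equal to the rank, the merged weight is simply `W + B @ A`, which matches the SVD construction where `lora_B @ lora_A` approximates the raw delta.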
| ### GGUF (llama.cpp) | |
| ```bash | |
| ./llama-cli \ | |
| -m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \ | |
| --lora qwen3.6-35b-opus-to-kimi-lora.gguf \ | |
| -p "Solve this math problem step by step..." | |
| ``` | |
| > **Prerequisite:** The Docker command below uses the **Opus reasoning adapter** from lordx64. | |
| > Download it first: | |
| > ```bash | |
| > wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors | |
| > # Or use the GGUF version for llama.cpp: | |
| > # Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter | |
| > ``` | |
| > Or use only the Kimi adapter without Opus: `--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0` | |
| ### llama.cpp Server (Docker): Multi-LoRA Stacking | |
| Combine the **uncensored base model** + **Opus reasoning LoRA** + **Kimi style LoRA** into one OpenAI-compatible API server: | |
| ```bash | |
| sudo docker run --rm -p 8080:8080 \ | |
| -v /path/to/models/:/models \ | |
| --gpus all \ | |
| --env CUDA_VISIBLE_DEVICES=0,1,2,3 \ | |
| ghcr.io/ggml-org/llama.cpp:server-cuda \ | |
| -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \ | |
| --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \ | |
| --host 0.0.0.0 --port 8080 \ | |
| --n-gpu-layers 999 \ | |
| --tensor-split 4,13,12,12 \ | |
| --ctx-size 131072 \ | |
| --batch-size 4096 \ | |
| --ubatch-size 512 \ | |
| --cache-type-k q4_0 \ | |
| --cache-type-v q4_0 \ | |
| -fa on \ | |
| --mlock \ | |
| --jinja | |
| ``` | |
| **What this does:** | |
| | Component | Purpose | Weight | | |
| |-----------|---------|--------| | |
| | `llmfan46_...-heretic-Q6_K.gguf` | Uncensored base (35B MoE) | Base | | |
| | `lordx64_...-Opus-...-adapter-F16.gguf` | Claude Opus reasoning (concise) | 0.6 = 60% | | |
| | `qwen3.6-35b-opus-to-kimi-lora.gguf` | Kimi K2.6 style (verbose) | 0.8 = 80% | | |
| **Result:** Uncensored base + Opus reasoning structure + Kimi verbose style, all in one model! | |
| **Key flags explained:** | |
| | Flag | Purpose | | |
| |------|---------| | |
| | `--lora-scaled A:α,B:β` | Stack multiple LoRA adapters with independent scales | | |
| | `--n-gpu-layers 999` | Offload all layers to GPU | | |
| | `--tensor-split 4,13,12,12` | Split across 4 GPUs (adjust for your setup) | | |
| | `--ctx-size 131072` | 128K context window | | |
| | `--cache-type-k q4_0` | Key cache in 4-bit quantization (saves VRAM) | | |
| | `--cache-type-v q4_0` | Value cache in 4-bit quantization (saves VRAM) | | |
| | `-fa on` | Flash Attention enabled | | |
| | `--mlock` | Lock model in RAM (prevents swap) | | |
| | `--jinja` | Use Jinja2 chat templates | | |
| | `--lora` | Apply LoRA adapter (applied first, before scaled) | | |
| | `--lora-scaled` | Apply LoRA with scale (comma-separated for multiple) | | |
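When the adapter list is generated by a script, the comma-separated `FNAME:SCALE` value can be assembled programmatically. This is a small sketch; `lora_scaled_arg` is a hypothetical helper, and the paths are the ones used in the Docker command above.

```python
def lora_scaled_arg(adapters):
    """Join (path, scale) pairs into the comma-separated FNAME:SCALE form."""
    return ",".join(f"{path}:{scale}" for path, scale in adapters)

arg = lora_scaled_arg([
    ("/models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf", 0.6),
    ("/models/qwen3.6-35b-opus-to-kimi-lora.gguf", 0.8),
])
print("--lora-scaled", arg)
```

Keeping the scales in one list makes it easy to sweep different Opus and Kimi weights without editing the command by hand.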
| --- | |
| ### 3-Layer Stack with Refusal Removal LoRA | |
| For the **purest uncensored stack** using weight-diff extracted LoRAs: | |
| | Layer | Component | Purpose | | |
| |-------|-----------|---------| | |
| | 1 | Opus GGUF (base model) | Qwen3.6-35B + Opus reasoning | | |
| | 2 | [refusal-removal-lora](https://huggingface.co/hotdogs/qwen3.6-35b-refusal-removal-lora) | Remove refusals (uncensored) | | |
| | 3 | opus-to-kimi-lora (scale 0.5) | Kimi K2.6 verbose style | | |
| ```bash | |
| docker run --gpus all -p 8080:8080 \ | |
| -v /path/to/models:/models \ | |
| ghcr.io/ggml-org/llama.cpp:server-cuda \ | |
| -m /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Q6_K.gguf \ | |
| --lora /models/qwen3.6-35b-refusal-removal-lora.gguf \ | |
| --lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.5 \ | |
| --host 0.0.0.0 --port 8080 \ | |
| --n-gpu-layers 999 \ | |
| --ctx-size 131072 \ | |
| --batch-size 4096 \ | |
| -fa on | |
| ``` | |
| > **Technical note**: The refusal-removal LoRA was extracted via Weight-Diff SVD from `huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated` minus `lordx64/...Opus`. It modifies **only o_proj** in 10 layers (3,7,11,15,19,23,27,31,35,39), an extremely sparse signal compared to full distillation (Kimi LoRA touches all 44 attention tensors). | |
| --- | |
| **Old stack (uncensored GGUF base), single-GPU alternative:** | |
| ```bash | |
| sudo docker run --rm -p 8080:8080 \ | |
| -v /path/to/models/:/models \ | |
| --gpus all \ | |
| ghcr.io/ggml-org/llama.cpp:server-cuda \ | |
| -m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \ | |
| --lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \ | |
| --host 0.0.0.0 --port 8080 \ | |
| --n-gpu-layers 999 \ | |
| --ctx-size 32768 \ | |
| --batch-size 2048 \ | |
| --cache-type-k q4_0 --cache-type-v q4_0 \ | |
| -fa on --mlock --jinja | |
| ``` | |
| **API Usage (OpenAI-compatible):** | |
| ```bash | |
| curl http://localhost:8080/v1/chat/completions \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "model": "gpt-3.5-turbo", | |
| "messages": [ | |
| {"role": "user", "content": "Explain quantum entanglement step by step"} | |
| ], | |
| "temperature": 0.7, | |
| "max_tokens": 4096 | |
| }' | |
| ``` | |
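The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Docker commands is listening on `localhost:8080` and returns the usual OpenAI-style `choices[0].message.content` field; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080",
                       temperature=0.7, max_tokens=4096):
    """Build a POST request mirroring the curl example above."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Explain quantum entanglement step by step")` reproduces the curl request once the server is up.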
| > **Tip:** Adjust the LoRA scales, written here as `Opus:Kimi`, to tune the reasoning style: | |
| > - `0.6:0.8`: Balanced (Opus structure + Kimi verbosity) | |
| > - `0.3:1.0`: Heavy Kimi style | |
| > - `1.0:0.2`: Mostly Opus, slight Kimi touch | |
| > - `0.0:1.0`: Pure Kimi style (skip Opus adapter entirely) | |
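One way to reason about these presets is to expand the stack algebraically. The sketch below assumes that stacked adapters combine additively, ignores the rank-16 truncation, and treats the base as not already containing the Opus update; all three are simplifying assumptions, and the helper names are hypothetical. Because this adapter is defined as `delta_Kimi - delta_Opus`, the net multiplier on the Opus update under these assumptions is the Opus scale minus the Kimi scale.

```python
import numpy as np

def net_coefficients(opus_scale, kimi_scale):
    """Net multipliers on (delta_Opus, delta_Kimi) when the Opus adapter and the
    Opus-to-Kimi adapter (delta_Kimi - delta_Opus) are stacked additively."""
    return opus_scale - kimi_scale, kimi_scale

def stacked_weight(w_base, delta_opus, delta_kimi, opus_scale, kimi_scale):
    """Base weight plus the two scaled adapters, in the idealized additive view."""
    opus_to_kimi = delta_kimi - delta_opus
    return w_base + opus_scale * delta_opus + kimi_scale * opus_to_kimi

rng = np.random.default_rng(0)
w_base, d_opus, d_kimi = (rng.standard_normal((8, 6)) for _ in range(3))

for opus_scale, kimi_scale in [(0.6, 0.8), (0.3, 1.0), (1.0, 0.2), (0.0, 1.0)]:
    c_opus, c_kimi = net_coefficients(opus_scale, kimi_scale)
    stacked = stacked_weight(w_base, d_opus, d_kimi, opus_scale, kimi_scale)
    # The expanded form and the stacked form agree.
    assert np.allclose(stacked, w_base + c_opus * d_opus + c_kimi * d_kimi)
    print(f"{opus_scale}:{kimi_scale} -> Opus x {c_opus:+.1f}, Kimi x {c_kimi:+.1f}")
```

In this idealized view, equal scales for both adapters leave only the Kimi update on top of the base, which may be a useful reference point when tuning.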
| --- | |
| ## Comparison: Opus vs Kimi Reasoning | |
| | Trait | Claude Opus | + Kimi LoRA | | |
| |-------|-------------|-------------| | |
| | Thinking tokens (mean) | 849 | **2,933** (3.5x longer) | | |
| | Thinking tokens (p95) | 2,404 | **9,764** | | |
| | Style | Concise, direct | Verbose, deliberate | | |
| | Best for | Quick reasoning | Deep multi-step reasoning | | |
| --- | |
| ## Technical Details | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Method | Weight-diff SVD extraction | | |
| | Rank | 16 | | |
| | LoRA Alpha | 16 | | |
| | Target modules | q_proj, k_proj, v_proj, o_proj | | |
| | Tensors extracted | 44 (attention weights across 11 layers) | | |
| | Tensor shapes | q:[8192,2048] k/v:[512,2048] o:[2048,4096] | | |
| | Adapter size | 7.2 MB (PEFT) / 14 MB (GGUF F32) | | |
| | Precision | BF16 to F32 (GGUF) | | |
| | Extraction time | ~3 min (CPU SVD) | | |
| | Disk needed | ~145 GB (temporary, for both full models) | | |
| | Memory needed | ~3 GB (no GPU required) | | |
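The adapter sizes in the table can be cross-checked from the rank, tensor shapes, and tensor count. This is a back-of-the-envelope sketch that assumes 2 bytes per parameter for the PEFT file (BF16), 4 bytes for GGUF F32, sizes expressed in MiB, and no file header or metadata overhead.

```python
RANK = 16
SHAPES = {  # [out, in], copied from the table above
    "q_proj": (8192, 2048),
    "k_proj": (512, 2048),
    "v_proj": (512, 2048),
    "o_proj": (2048, 4096),
}
NUM_TENSORS = 44
num_layers = NUM_TENSORS // len(SHAPES)

# Each tensor contributes lora_A [rank, in] plus lora_B [out, rank].
params_per_layer = sum(RANK * (out_dim + in_dim) for out_dim, in_dim in SHAPES.values())
total_params = params_per_layer * num_layers

size_bf16_mib = total_params * 2 / 2**20
size_f32_mib = total_params * 4 / 2**20
print(f"{total_params:,} params, {size_bf16_mib:.1f} MiB BF16, {size_f32_mib:.1f} MiB F32")
```

Under these assumptions the totals come to roughly 7.2 MiB and 14.4 MiB, in line with the sizes listed above.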
| --- | |
| ## Reproduction | |
| Full extraction script and methodology available in the UKA Hermes Agent session log. | |
| ```bash | |
| # Quick reproduction | |
| python3 extract_lora_diff.py \ | |
| --opus-path ./model_opus \ | |
| --kimi-path ./model_kimi \ | |
| --rank 16 \ | |
| --output ./opus-to-kimi-lora | |
| ``` | |
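The script itself is not reproduced in this README. As a rough illustration of the tensor-by-tensor loop it describes, here is a hypothetical NumPy sketch: the function name, the suffix filter, and the choice to skip zero-delta tensors are all assumptions, and the real `extract_lora_diff.py` may differ.

```python
import numpy as np

# Assumed naming: attention projections end with these suffixes.
ATTENTION_SUFFIXES = ("q_proj.weight", "k_proj.weight", "v_proj.weight", "o_proj.weight")

def extract_lora_diff(opus_tensors, kimi_tensors, rank=16):
    """Yield (name, lora_A, lora_B) for each attention weight with a non-zero delta.

    Both arguments are mappings from tensor name to a 2-D array, so only one
    pair of tensors needs to be in memory at a time.
    """
    for name in opus_tensors:
        if not name.endswith(ATTENTION_SUFFIXES) or name not in kimi_tensors:
            continue
        delta = (np.asarray(kimi_tensors[name], dtype=np.float32)
                 - np.asarray(opus_tensors[name], dtype=np.float32))
        if not delta.any():  # identical weights: nothing to extract
            continue
        u, s, vh = np.linalg.svd(delta, full_matrices=False)
        root = np.sqrt(s[:rank])
        yield name, root[:, None] * vh[:rank, :], u[:, :rank] * root
```

Any lazily loading mapping can be passed in place of plain dictionaries, which is what keeps peak memory low compared with loading both checkpoints in full.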
| --- | |
| ## Credits | |
| - **UKA** (Hermes Agent): designed the weight-diff SVD technique, wrote all extraction code, authored this README | |
| - **lordx64**: trained the source models ([Opus](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled), [Kimi](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled)) | |
| - **Qwen Team**: base model [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) | |
| - **Bas95**: original reasoning distillation datasets | |
| - **Hermes Agent**: [nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent) | |
| --- | |
| ## License | |
| Apache 2.0, same as the source models. | |