Opus → Kimi Reasoning LoRA
Extracted by UKA, an AI agent powered by Hermes Agent. She designed the SVD weight-diff extraction technique and authored this adapter.
A rank-16 LoRA adapter that converts Claude 4.7 Opus reasoning style into Kimi K2.6 reasoning style on the same 35B Mixture-of-Experts base model.
No training. Pure linear algebra.
How It Works: Weight-Diff SVD Extraction
Both lordx64 models share the exact same base (Qwen/Qwen3.6-35B-A3B) and were fine-tuned with LoRA, with the adapters merged back into the weights.
Mathematically:
W_opus = W_base + delta_Opus
W_kimi = W_base + delta_Kimi
delta(Opus_to_Kimi) = W_kimi - W_opus
                    = (W_base + delta_Kimi) - (W_base + delta_Opus)
                    = delta_Kimi - delta_Opus
The base model cancels out, so only the reasoning delta remains!
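The cancellation can be checked numerically on a toy matrix. This is a minimal sketch: the shapes and random values are illustrative assumptions, not the real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one weight matrix; shapes and values are illustrative only.
w_base = rng.standard_normal((8, 4))
delta_opus = 0.01 * rng.standard_normal((8, 4))
delta_kimi = 0.01 * rng.standard_normal((8, 4))

w_opus = w_base + delta_opus
w_kimi = w_base + delta_kimi

# The shared base cancels, leaving only the difference of the two fine-tune deltas.
delta_opus_to_kimi = w_kimi - w_opus
base_cancels = np.allclose(delta_opus_to_kimi, delta_kimi - delta_opus)

# Adding the delta to the Opus weights lands on the Kimi weights.
reaches_kimi = np.allclose(w_opus + delta_opus_to_kimi, w_kimi)
print(base_cancels, reaches_kimi)
```

Note that `w_base` never has to be loaded to compute the delta, which is why only the two fine-tuned checkpoints are needed.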
SVD Compression
The raw delta is 70+ GB. We compress it to rank-16 LoRA via truncated SVD:
for each attention weight tensor:
    delta = W_kimi - W_opus                       # [out, in]
    U, S, Vh = SVD(delta)                         # decompose
    lora_B = U[:, :16] * sqrt(S[:16])             # [out, 16], scales the columns of U
    lora_A = sqrt(S[:16])[:, None] * Vh[:16, :]   # [16, in], scales the rows of Vh
- Input: 2x 72 GB models (~145 GB disk)
- VRAM used: ~3 GB (tensor-by-tensor, no GPU needed)
- Compute: ~44 SVDs on CPU (< 3 minutes)
- Output: 7.2 MB LoRA adapter (rank=16, attention-only)
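The per-tensor step can be sketched in NumPy as follows. The function name `extract_lora` and the synthetic low-rank test matrix are assumptions made for illustration; the actual extraction script is not reproduced here.

```python
import numpy as np

def extract_lora(delta: np.ndarray, rank: int = 16):
    """Factor delta [out, in] into lora_B [out, rank] and lora_A [rank, in]."""
    # full_matrices=False returns the thin SVD: U [out, k], S [k], Vh [k, in].
    u, s, vh = np.linalg.svd(delta.astype(np.float32), full_matrices=False)
    root = np.sqrt(s[:rank])
    lora_b = u[:, :rank] * root             # scale each kept column of U
    lora_a = root[:, None] * vh[:rank, :]   # scale each kept row of Vh
    return lora_b, lora_a

rng = np.random.default_rng(0)
# Synthetic delta of exact rank 4, so a rank-16 truncation recovers it almost exactly.
delta = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
lora_b, lora_a = extract_lora(delta, rank=16)

# Relative Frobenius error of the low-rank reconstruction.
rel_err = np.linalg.norm(delta - lora_b @ lora_a) / np.linalg.norm(delta)
print(lora_b.shape, lora_a.shape, rel_err)
```

Splitting the singular values evenly as square roots keeps the two factors at a similar magnitude. On a real delta the relative error is a useful check on how much signal rank 16 discards.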
Target Modules
Only full-attention layers (every 4th layer in Qwen3.5-MoE):
| Layer | q_proj | k_proj | v_proj | o_proj |
|---|---|---|---|---|
| 3, 7, 11, 15, 19, 23, 27, 31 | ✓ | ✓ | ✓ | ✓ |
| 35, 39 | delta=0 | delta=0 | delta=0 | delta=0 |
Interesting finding: layers 35 and 39 have zero delta, so the Opus and Kimi checkpoints carry identical attention weights there. The Kimi fine-tune changed nothing in these layers relative to Opus.
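A zero-delta tensor can be detected before running any SVD. The helper below is a hypothetical sketch (the name `is_untouched` and the tolerance parameter are assumptions), shown on toy arrays.

```python
import numpy as np

def is_untouched(w_a: np.ndarray, w_b: np.ndarray, tol: float = 0.0) -> bool:
    """Return True when two weight tensors differ by at most tol everywhere."""
    diff = w_b.astype(np.float32) - w_a.astype(np.float32)
    return bool(np.max(np.abs(diff)) <= tol)

w = np.ones((4, 4), dtype=np.float32)
same = is_untouched(w, w.copy())       # identical tensors, nothing to extract
changed = is_untouched(w, w + 1e-3)    # a small shift counts as a real delta
print(same, changed)
```

Skipping such tensors avoids writing all-zero LoRA factors into the adapter.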
Why Attention-Only?
The existing Claude Opus LoRA adapter (13.8 MB, r=16) is attention-only (q/k/v/o_proj). We match the same target modules for compatibility.
The 3D expert tensors (256, 2048, 512) were intentionally skipped, both for compatibility with the existing adapter and because reasoning style is assumed to be encoded primarily in attention patterns rather than in expert FFN weights.
METHOD: Languages
The universal Weight-Diff SVD Extraction guide is available in 5 languages:
| File | Language |
|---|---|
| METHOD.md | Thai |
| METHOD_EN.md | English |
| METHOD_ZH.md | Chinese |
| METHOD_JP.md | Japanese |
| METHOD_VN.md | Vietnamese |
All files contain: requirements, 5 steps, math, examples for other models, troubleshooting, and references.
Available Formats
PEFT (Python)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the Opus-distilled model that the adapter was extracted against.
base = AutoModelForCausalLM.from_pretrained(
    "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the Opus-to-Kimi adapter, then fold it into the base weights.
model = PeftModel.from_pretrained(base, "hotdogs/qwen3.6-35b-opus-to-kimi-lora")
model = model.merge_and_unload()
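For intuition, merging a standard LoRA layer amounts to adding the scaled low-rank product to the base weight. The sketch below reproduces that arithmetic in NumPy on toy shapes; `merge_lora` is a hypothetical helper, not a PEFT function. With the alpha of 16 and rank of 16 listed under Technical Details, the scale factor alpha / rank is 1.

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, alpha=16, rank=16):
    """Return W + (alpha / rank) * B @ A, the usual LoRA merge arithmetic."""
    return w + (alpha / rank) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 16))        # [out, in], toy size
lora_b = rng.standard_normal((32, 4))    # [out, r]
lora_a = rng.standard_normal((4, 16))    # [r, in]
merged = merge_lora(w, lora_a, lora_b, alpha=4, rank=4)

x = rng.standard_normal(16)
# Merged weights give the same output as the base path plus the adapter path.
same_output = np.allclose(merged @ x, w @ x + lora_b @ (lora_a @ x))
print(same_output)
```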
GGUF (llama.cpp)
./llama-cli \
-m Qwen3.6-35B-A3B-Claude-Opus-Q6_K.gguf \
--lora qwen3.6-35b-opus-to-kimi-lora.gguf \
-p "Solve this math problem step by step..."
Prerequisite: The Docker command below uses the Opus reasoning adapter from lordx64. Download it first:
wget https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled/resolve/main/adapter_model.safetensors
# Or use the GGUF version for llama.cpp.
# Convert with: python3 llama.cpp/convert_lora_to_gguf.py /path/to/opus-adapter
Or use only the Kimi adapter without Opus:
--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:1.0
llama.cpp Server (Docker): Multi-LoRA Stacking
Combine the uncensored base model + Opus reasoning LoRA + Kimi style LoRA into one OpenAI-compatible API server:
sudo docker run --rm -p 8080:8080 \
-v /path/to/models/:/models \
--gpus all \
--env CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--tensor-split 4,13,12,12 \
--ctx-size 131072 \
--batch-size 4096 \
--ubatch-size 512 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-fa on \
--mlock \
--jinja
What this does:
| Component | Purpose | Weight |
|---|---|---|
| llmfan46_...-heretic-Q6_K.gguf | Uncensored base (35B MoE) | Base |
| lordx64_...-Opus-...-adapter-F16.gguf | Claude Opus reasoning (concise) | 0.6 = 60% |
| qwen3.6-35b-opus-to-kimi-lora.gguf | → Kimi K2.6 style (verbose) | 0.8 = 80% |
Result: Uncensored base + Opus reasoning structure + Kimi verbose style, all in one model!
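Conceptually, stacking scaled adapters adds each low-rank product to the same base weight. The sketch below shows that arithmetic on toy matrices; it is not llama.cpp's implementation. It assumes any alpha / rank factor is already folded into the factors, and `stack_loras` is a hypothetical name.

```python
import numpy as np

def stack_loras(w, adapters):
    """adapters: list of (lora_b, lora_a, scale). Returns W + sum(scale * B @ A)."""
    out = w.copy()
    for lora_b, lora_a, scale in adapters:
        out += scale * (lora_b @ lora_a)
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))
opus = (rng.standard_normal((16, 2)), rng.standard_normal((2, 8)), 0.6)
kimi = (rng.standard_normal((16, 2)), rng.standard_normal((2, 8)), 0.8)

stacked = stack_loras(w, [opus, kimi])
reordered = stack_loras(w, [kimi, opus])
# The deltas are additive, so the order of the adapters does not matter.
order_independent = np.allclose(stacked, reordered)
print(order_independent)
```

Because each scale multiplies its own delta, lowering one weight weakens that adapter's contribution without affecting the other.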
Key flags explained:
| Flag | Purpose |
|---|---|
| --lora-scaled A:α,B:β | Stack multiple LoRA adapters with independent scales |
| --n-gpu-layers 999 | Offload all layers to GPU |
| --tensor-split 4,13,12,12 | Split across 4 GPUs (adjust for your setup) |
| --ctx-size 131072 | 128K context window |
| --cache-type-k q4_0 | Key cache in 4-bit quantization (saves VRAM) |
| --cache-type-v q4_0 | Value cache in 4-bit quantization |
| -fa on | Flash Attention enabled |
| --mlock | Lock model in RAM (prevents swap) |
| --jinja | Use Jinja2 chat templates |
| --lora | Apply a LoRA adapter without an explicit scale |
| --lora-scaled | Apply LoRA with scale (comma-separated for multiple) |
3-Layer Stack with Refusal Removal LoRA
For the purest uncensored stack using weight-diff extracted LoRAs:
| Layer | Component | Purpose |
|---|---|---|
| 1 | Opus GGUF (base model) | Qwen3.6-35B + Opus reasoning |
| 2 | refusal-removal-lora | Remove refusals (uncensored) |
| 3 | opus-to-kimi-lora (scale 0.5) | Kimi K2.6 verbose style |
docker run --gpus all -p 8080:8080 \
-v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Q6_K.gguf \
--lora /models/qwen3.6-35b-refusal-removal-lora.gguf \
--lora-scaled /models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.5 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--ctx-size 131072 \
--batch-size 4096 \
-fa on
Technical note: The refusal-removal LoRA was extracted via Weight-Diff SVD from huihui-ai/Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated minus lordx64/...Opus. It modifies only o_proj in 10 layers (3, 7, 11, 15, 19, 23, 27, 31, 35, 39), an extremely sparse signal compared to full distillation (the Kimi LoRA touches all 44 attention tensors).
Old stack (uncensored GGUF base):
Single GPU alternative:
sudo docker run --rm -p 8080:8080 \
-v /path/to/models/:/models \
--gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/llmfan46_Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
--lora-scaled /models/lordx64_Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-adapter-F16.gguf:0.6,/models/qwen3.6-35b-opus-to-kimi-lora.gguf:0.8 \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers 999 \
--ctx-size 32768 \
--batch-size 2048 \
--cache-type-k q4_0 --cache-type-v q4_0 \
-fa on --mlock --jinja
API Usage (OpenAI-compatible):
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Explain quantum entanglement step by step"}
],
"temperature": 0.7,
"max_tokens": 4096
}'
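The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server from the Docker command is listening on localhost:8080; the helper names `build_chat_request` and `send` are illustrative, and the `model` value is copied from the curl example.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080",
                       temperature=0.7, max_tokens=4096):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": "gpt-3.5-turbo",  # placeholder value copied from the curl example
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=600):
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

req = build_chat_request("Explain quantum entanglement step by step")
# print(send(req))  # uncomment once the server is running
```

The generous timeout is a deliberate choice: with the Kimi adapter active, thinking traces can run to several thousand tokens.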
Tip: Adjust the LoRA scales (Opus:Kimi) to fine-tune the reasoning style:
- 0.6:0.8 → Balanced (Opus structure + Kimi verbosity)
- 0.3:1.0 → Heavy Kimi style
- 1.0:0.2 → Mostly Opus, slight Kimi touch
- 0.0:1.0 → Pure Kimi style (skip Opus adapter entirely)
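When experimenting with several scale pairs, the `--lora-scaled` value can be assembled programmatically in the PATH:SCALE,PATH:SCALE form used by the commands in this README. The helper below is a hypothetical sketch, and the shortened Opus adapter file name is a placeholder.

```python
def lora_scaled_arg(adapters):
    """Join (path, scale) pairs into PATH:SCALE,PATH:SCALE, skipping zero scales."""
    parts = [f"{path}:{scale}" for path, scale in adapters if scale != 0]
    return ",".join(parts)

balanced = lora_scaled_arg([
    ("/models/opus-adapter.gguf", 0.6),  # placeholder name for the Opus adapter
    ("/models/qwen3.6-35b-opus-to-kimi-lora.gguf", 0.8),
])
pure_kimi = lora_scaled_arg([
    ("/models/opus-adapter.gguf", 0.0),  # zero scale: adapter is left out entirely
    ("/models/qwen3.6-35b-opus-to-kimi-lora.gguf", 1.0),
])
print(balanced)
print(pure_kimi)
```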
Comparison: Opus vs Kimi Reasoning
| Trait | Claude Opus | + Kimi LoRA |
|---|---|---|
| Thinking tokens (mean) | 849 | 2,933 (3.5x longer) |
| Thinking tokens (p95) | 2,404 | 9,764 |
| Style | Concise, direct | Verbose, deliberate |
| Best for | Quick reasoning | Deep multi-step reasoning |
Technical Details
| Parameter | Value |
|---|---|
| Method | Weight-diff SVD extraction |
| Rank | 16 |
| LoRA Alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Tensors extracted | 44 (attention weights across 11 layers) |
| Tensor shapes | q:[8192,2048] k/v:[512,2048] o:[2048,4096] |
| Adapter size | 7.2 MB (PEFT) / 14 MB (GGUF F32) |
| Precision | BF16 to F32 (GGUF) |
| Extraction time | ~3 min (CPU SVD) |
| Disk needed | ~145 GB (temporary, for both full models) |
| VRAM needed | ~3 GB (no GPU required) |
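The adapter sizes in the table follow from the rank and tensor shapes. The arithmetic below is a rough check that assumes all four projections are stored for each of the 11 layers and ignores file-header overhead.

```python
RANK = 16
LAYERS = 11  # "44 tensors across 11 layers" in the table above
SHAPES = {   # [out, in] per projection, from the table above
    "q_proj": (8192, 2048),
    "k_proj": (512, 2048),
    "v_proj": (512, 2048),
    "o_proj": (2048, 4096),
}

# A rank-r pair stores lora_B [out, r] plus lora_A [r, in], i.e. r * (out + in) values.
params_per_layer = sum(RANK * (out + inp) for out, inp in SHAPES.values())
total_params = params_per_layer * LAYERS

MIB = 1024 ** 2
size_bf16_mib = total_params * 2 / MIB  # 2 bytes per BF16 value
size_f32_mib = total_params * 4 / MIB   # 4 bytes per F32 value
print(total_params, round(size_bf16_mib, 2), round(size_f32_mib, 2))
```

This gives about 3.78 million parameters, roughly 7.2 MiB in BF16 and 14.4 MiB in F32, in line with the PEFT and GGUF sizes listed.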
Reproduction
Full extraction script and methodology available in the UKA Hermes Agent session log.
# Quick reproduction
python3 extract_lora_diff.py \
--opus-path ./model_opus \
--kimi-path ./model_kimi \
--rank 16 \
--output ./opus-to-kimi-lora
Credits
- UKA (Hermes Agent): designed the weight-diff SVD technique, wrote all extraction code, authored this README
- lordx64: trained the source models (Opus, Kimi)
- Qwen Team: base model Qwen3.6-35B-A3B
- Bas95: original reasoning distillation datasets
- Hermes Agent: nousresearch/hermes-agent
License
Apache 2.0, same as the source models.