---
base_model: unsloth/Qwen3.5-9B
tags:
- text-generation
- llama.cpp
- gguf
- unsloth
- qwen
- qwen3.5
- reasoning
- distillation
- sft
- lora
- rs-lora
- quantized
license: apache-2.0
language:
- en
datasets:
- trjxter/Kimi-K2.6-Reasoning-3300x-WandB
- Jackrong/Claude-opus-4.6-TraceInversion-9000x
- Jackrong/Qwen3.5-reasoning-700x
---
# Qwimi3.5-9B-Kimik2.6-Opus-Distill-GGUF
**Qwimi3.5-9B-Kimik2.6-Opus-Distill-GGUF** contains GGUF quantized releases of `Qwimi3.5-9B-Kimik2.6-Opus-Distill`, a reasoning-focused fine-tune of `unsloth/Qwen3.5-9B`.
This model was trained as a supervised fine-tuning/distillation run using a curated mixture of Kimi K2.6, Qwen reasoning, and Claude Opus TraceInversion-style reasoning data. The goal of the run was to improve structured reasoning behavior while preserving Qwen-style chat formatting and `...` reasoning traces.
- **Developed by:** `trjxter`
- **Base model:** `unsloth/Qwen3.5-9B`
- **Model type:** GGUF quantized causal language model
- **Training method:** LoRA / RS-LoRA SFT with Unsloth + TRL
- **License:** Apache 2.0
- **Language:** English
---
## Available Quantizations
This repository contains GGUF quantized versions of the merged fine-tuned model for use with llama.cpp-compatible runtimes.
Expected quantization set:
| Quant | Notes |
|---|---|
| `Q3_K_L` | Smaller size, lower memory usage, more quality loss |
| `Q4_K_M` | Good default balance of size, speed, and quality |
| `Q5_K_M` | Higher quality than Q4, moderate size increase |
| `Q6_K` | Strong quality retention, larger file size |
| `Q8_0` | Highest quality quant in this set, largest file size |
For most local inference setups, start with **Q4_K_M**. If you have more VRAM/RAM and want better quality, try **Q5_K_M**, **Q6_K**, or **Q8_0**.
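Before downloading, you can roughly estimate a weights-only file size as parameters × bits per weight ÷ 8. The sketch below makes two assumptions. First, it uses about 9e9 parameters, read off the "9B" in the model name rather than an exact count. Second, it uses the nominal bits per weight of each base llama.cpp block type. The `_M`/`_L` mixes keep some tensors at higher precision, so treat these numbers as lower bounds. The 2 GB headroom for KV cache and runtime buffers is a placeholder, not a measured value, and `largest_fitting_quant` is a hypothetical helper.

```python
from typing import Optional

# ASSUMPTION: ~9e9 parameters, read off the "9B" in the model name.
N_PARAMS = 9e9

# Nominal bits per weight of each base llama.cpp block type.
# The _M/_L mixes keep some tensors at higher precision, so real
# files are larger than these estimates (treat them as lower bounds).
NOMINAL_BPW = {
    "Q3_K_L": 3.4375,  # Q3_K: 110 bytes per 256 weights
    "Q4_K_M": 4.5,     # Q4_K: 144 bytes per 256 weights
    "Q5_K_M": 5.5,     # Q5_K: 176 bytes per 256 weights
    "Q6_K": 6.5625,    # Q6_K: 210 bytes per 256 weights
    "Q8_0": 8.5,       # Q8_0: 34 bytes per 32 weights
}


def estimated_size_gb(n_params: float, bpw: float) -> float:
    """Weights-only size in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bpw / 8 / 1e9


def largest_fitting_quant(budget_gb: float, n_params: float = N_PARAMS,
                          headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-precision quant whose estimated weights plus a
    placeholder headroom (KV cache, runtime buffers) fit in budget_gb,
    or None if nothing fits."""
    best = None
    # Walk from lowest to highest precision, keeping the last one that fits.
    for name, bpw in sorted(NOMINAL_BPW.items(), key=lambda kv: kv[1]):
        if estimated_size_gb(n_params, bpw) + headroom_gb <= budget_gb:
            best = name
    return best


if __name__ == "__main__":
    for name, bpw in NOMINAL_BPW.items():
        print(f"{name:7s} >= ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
    print(largest_fitting_quant(8.0))
```

With an 8 GB budget and the default headroom, the sketch picks `Q4_K_M`, which matches the recommendation above. The actual file sizes listed in the repository should take precedence over these estimates.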
---
## Training Overview
This GGUF release was created from a merged version of the LoRA fine-tune.
Training used Unsloth and Hugging Face TRL with a LoRA-based supervised fine-tuning setup.
### Training configuration
| Setting | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-9B` |
| Sequence length | 16,384 |
| Training examples | 12,000 |
| Held-out eval examples | 366 |
| Trainer eval subset | 200 |
| Epochs | 1 |
| Effective batch size | 16 |
| Per-device batch size | 2 |
| Gradient accumulation steps | 8 |
| LoRA rank | 128 |
| LoRA alpha | 128 |
| RS-LoRA | Enabled |
| Base loading | 8-bit |
| Optimizer | `adamw_8bit` |
| Learning rate | `2e-5` |
| Scheduler | Linear |
| Gradient checkpointing | Unsloth |
| Runtime | ~4.37 hours on an 80GB GPU |
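Several rows in this table are linked by simple arithmetic. The sketch below assumes a single GPU, as the runtime row suggests. It computes the effective batch size, the number of optimizer steps in one epoch, and the adapter scaling factor. Standard LoRA scales the adapter update by alpha / r. RS-LoRA, as exposed for example by PEFT's `use_rslora=True` option, scales it by alpha / sqrt(r).

```python
import math

# Values copied from the training configuration table.
per_device_batch = 2
grad_accum_steps = 8
train_examples = 12_000
epochs = 1
lora_rank = 128
lora_alpha = 128

# ASSUMPTION: a single GPU, so no data-parallel multiplier.
effective_batch = per_device_batch * grad_accum_steps  # 16

# 12,000 is divisible by 16, so every optimizer step sees a full batch.
optimizer_steps = (train_examples // effective_batch) * epochs  # 750

# Standard LoRA multiplies the adapter update by alpha / r.
lora_scale = lora_alpha / lora_rank  # 1.0
# RS-LoRA multiplies it by alpha / sqrt(r) instead.
rslora_scale = lora_alpha / math.sqrt(lora_rank)  # about 11.31

print(effective_batch, optimizer_steps, lora_scale, round(rslora_scale, 2))
```

With r = alpha = 128, RS-LoRA scales the update by about 11.3 rather than 1.0. The alpha / sqrt(r) rule is meant to keep high-rank adapters from having their updates scaled down too aggressively.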
### Final training metrics
| Metric | Value |
|---|---:|
| Final training loss | `0.5517` |
| Final lightweight eval loss | `~0.3161` |
| Train runtime | `15,728.8s` |
| Train samples/sec | `0.763` |
| Train steps/sec | `0.048` |
| Total FLOPs | `1.45e18` |
The lightweight eval loss was measured on a 200-example eval subset during training.
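The reported throughput figures are consistent with the configuration table. The quick cross-check below assumes 750 optimizer steps, which is 12,000 examples at an effective batch size of 16:

```python
# Cross-check the reported throughput against the run configuration.
train_runtime_s = 15_728.8
train_examples = 12_000
optimizer_steps = 750  # ASSUMPTION: 12,000 examples / effective batch size 16

samples_per_sec = train_examples / train_runtime_s  # reported: 0.763
steps_per_sec = optimizer_steps / train_runtime_s   # reported: 0.048
runtime_hours = train_runtime_s / 3600              # reported: ~4.37

print(f"{samples_per_sec:.3f} samples/s, "
      f"{steps_per_sec:.3f} steps/s, {runtime_hours:.2f} h")
```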
---
## Dataset Mix and Attribution
This run used a combined reasoning/distillation dataset made from:
1. `trjxter/Kimi-K2.6-Reasoning-3300x-WandB`
2. `Jackrong/Qwen3.5-reasoning-700x`
3. `Jackrong/Claude-opus-4.6-TraceInversion-9000x`
The dataset was normalized into Qwen chat format, preserving assistant reasoning traces in the form:
```text
...
final answer
```
After formatting and 16k token filtering, the final usable dataset contained **12,366 examples**:
- **12,000** examples used for training
- **366** examples held out for evaluation
- **200** of the held-out examples used as the lightweight trainer eval subset
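The filtering and split described above can be sketched as follows. This is a hypothetical illustration, not the script used for this run. `count_tokens` stands in for a real tokenizer's length function, and the fixed seed and shuffle-then-slice strategy are assumptions.

```python
import random

MAX_TOKENS = 16_384  # matches the training sequence length
N_TRAIN = 12_000
EVAL_SUBSET = 200


def filter_and_split(examples, count_tokens, seed=0):
    """HYPOTHETICAL sketch: drop over-length examples, shuffle, then split.

    `examples` is any list of formatted chat strings and `count_tokens`
    is a stand-in for a tokenizer length function. The seed and split
    strategy are assumptions, not this run's actual settings.
    """
    kept = [ex for ex in examples if count_tokens(ex) <= MAX_TOKENS]
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(kept)
    train, held_out = kept[:N_TRAIN], kept[N_TRAIN:]
    # The trainer evaluates on a fixed slice of the held-out split.
    eval_subset = held_out[:EVAL_SUBSET]
    return train, held_out, eval_subset
```

If 12,366 examples fit within 16,384 tokens, the sketch drops any over-length extras and returns the 12,000 / 366 / 200 split reported above.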
### Special thanks
Special thanks to **Jackrong** and **Kyle Hessling** for the Qwen reasoning and Claude Opus TraceInversion datasets used in this run. I did not create those datasets; this release builds directly on their work.
---
## Intended Use
This model is intended for experimentation with:
- local reasoning model inference
- llama.cpp-compatible workflows
- structured reasoning prompts
- math and problem solving
- coding and technical reasoning
- long-context reasoning experiments
- comparing GGUF quantization quality across the Q3, Q4, Q5, Q6, and Q8 variants
This is an experimental fine-tune and should be evaluated carefully before use in production or high-stakes settings.
---
## Prompt Format
The model follows Qwen-style chat formatting.
Example:
```text
<|im_start|>user
Solve this step by step: A shop earns $72 from hourly pay, $105 from restringing, $20 from grommets, and $5 from stencils. What is the total?<|im_end|>
<|im_start|>assistant
...
...
<|im_end|>
```
When using a runtime that supports chat templates, prefer applying the Qwen chat template rather than manually formatting prompts.
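For runtimes without template support, you can render the basic layout by hand. The sketch below assumes the plain ChatML structure used by Qwen chat templates, with each turn written as `<|im_start|>{role}\n{content}<|im_end|>\n`. The actual Qwen3.5 template may handle additional cases, such as tools or reasoning content. In Python, Hugging Face Transformers' `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is therefore the safer choice. `to_chatml` is a hypothetical helper.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as plain ChatML.

    Simplified sketch: it does not reproduce tool calls, reasoning blocks,
    or any other special handling the real Qwen3.5 template may apply.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


question = (
    "Solve this step by step: A shop earns $72 from hourly pay, $105 from "
    "restringing, $20 from grommets, and $5 from stencils. What is the total?"
)
prompt = to_chatml([{"role": "user", "content": question}])
expected_total = 72 + 105 + 20 + 5  # 202, for checking the model's final answer
print(prompt)
```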
---
## Example llama.cpp Usage
Example command:
```bash
./llama-cli \
-m Qwimi3.5-9B-Kimik2.6-Opus-Distill-Q4_K_M.gguf \
-c 16384 \
-ngl 99 \
--temp 0.6 \
--top-p 0.95 \
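-p '<|im_start|>user\nSolve this step by step: If a worker earns $9/hour for 8 hours, plus $15 for each of 7 racquets, $10 for each of 2 grommet replacements, and $1 for each of 5 stencils, how much do they earn?<|im_end|>\n<|im_start|>assistant\n'
```
The prompt is wrapped in single quotes so the shell leaves dollar amounts such as `$9` and `$15` untouched instead of expanding them as positional parameters. llama.cpp's escape processing, which is on by default, then turns the literal `\n` sequences into newlines. Adjust `-ngl` (the number of layers offloaded to the GPU) to fit your VRAM, or set `-ngl 0` for CPU-only inference.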
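If you drive llama.cpp from a script, passing the arguments as a list bypasses the shell entirely, so dollar amounts such as `$9` reach the model verbatim. The sketch below uses only flags that already appear in the command above. `llama_cli_args` is a hypothetical helper, and the binary and model paths are assumptions about your local setup.

```python
import shlex


def llama_cli_args(model_path, prompt, ctx=16384, n_gpu_layers=99,
                   temp=0.6, top_p=0.95, binary="./llama-cli"):
    """Build a llama-cli argument list mirroring the example command.

    Passing a list to subprocess skips shell expansion, so "$9" stays "$9".
    Use n_gpu_layers=0 for CPU-only inference.
    """
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx),
        "-ngl", str(n_gpu_layers),
        "--temp", str(temp),
        "--top-p", str(top_p),
        "-p", prompt,
    ]


prompt = (
    "<|im_start|>user\nSolve this step by step: If a worker earns $9/hour for "
    "8 hours, plus $15 for each of 7 racquets, $10 for each of 2 grommet "
    "replacements, and $1 for each of 5 stencils, how much do they earn?"
    "<|im_end|>\n<|im_start|>assistant\n"
)
args = llama_cli_args("Qwimi3.5-9B-Kimik2.6-Opus-Distill-Q4_K_M.gguf", prompt)
print(shlex.join(args))  # a safely quoted equivalent shell command
# import subprocess; subprocess.run(args, check=True)  # to actually run it
```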
---
## Related Releases
This run may also be released in adapter and merged BF16 formats:
- LoRA adapter: `trjxter/Qwimi3.5-9B-Kimik2.6-Opus-Distill-LoRA`
- Merged BF16: `trjxter/Qwimi3.5-9B-Kimik2.6-Opus-Distill-BF16`
---
## Notes
This model was trained using [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning and Hugging Face TRL for SFT. The GGUF files were generated from the merged fine-tuned model.
---
## Disclaimer
This is an experimental research fine-tune. Outputs may contain mistakes, hallucinations, or incorrect reasoning. Always validate important outputs independently.