---
base_model: unsloth/Qwen3.5-9B
tags:
- text-generation
- llama.cpp
- gguf
- unsloth
- qwen
- qwen3.5
- reasoning
- distillation
- sft
- lora
- rs-lora
- quantized
license: apache-2.0
language:
- en
datasets:
- trjxter/Kimi-K2.6-Reasoning-3300x-WandB
- Jackrong/Claude-opus-4.6-TraceInversion-9000x
- Jackrong/Qwen3.5-reasoning-700x
---

# Qwimi3.5-9B-Kimik2.6-Opus-Distill-GGUF

<p align="center">
  <img src="./qwimi_3.5_9b_launch_overview.png" alt="Qwimi3.5-9B launch overview" width="100%">
</p>

**Qwimi3.5-9B-Kimik2.6-Opus-Distill-GGUF** contains GGUF quantized releases of `Qwimi3.5-9B-Kimik2.6-Opus-Distill`, a reasoning-focused fine-tune of `unsloth/Qwen3.5-9B`.

This model was trained as a supervised fine-tuning/distillation run using a curated mixture of Kimi K2.6, Qwen reasoning, and Claude Opus TraceInversion-style reasoning data. The goal of the run was to improve structured reasoning behavior while preserving Qwen-style chat formatting and `<think>...</think>` reasoning traces.

- **Developed by:** `trjxter`
- **Base model:** `unsloth/Qwen3.5-9B`
- **Model type:** GGUF quantized causal language model
- **Training method:** LoRA / RS-LoRA SFT with Unsloth + TRL
- **License:** Apache 2.0
- **Language:** English

---

## Available Quantizations

This repository contains GGUF quantized versions of the merged fine-tuned model for use with llama.cpp-compatible runtimes.

Expected quantization set:

| Quant | Notes |
|---|---|
| `Q3_K_L` | Smaller size, lower memory usage, more quality loss |
| `Q4_K_M` | Good default balance of size, speed, and quality |
| `Q5_K_M` | Higher quality than Q4, moderate size increase |
| `Q6_K` | Strong quality retention, larger file size |
| `Q8_0` | Highest quality quant in this set, largest file size |

For most local inference setups, start with **Q4_K_M**. If you have more VRAM/RAM and want better quality, try **Q5_K_M**, **Q6_K**, or **Q8_0**.

---

## Training Overview

This GGUF release was created from a merged version of the LoRA fine-tune.

Training used Unsloth and Hugging Face TRL with a LoRA-based supervised fine-tuning setup.

### Training configuration

| Setting | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-9B` |
| Sequence length | 16,384 |
| Training examples | 12,000 |
| Held-out eval examples | 366 |
| Trainer eval subset | 200 |
| Epochs | 1 |
| Effective batch size | 16 |
| Per-device batch size | 2 |
| Gradient accumulation steps | 8 |
| LoRA rank | 128 |
| LoRA alpha | 128 |
| RS-LoRA | Enabled |
| Base loading | 8-bit |
| Optimizer | `adamw_8bit` |
| Learning rate | `2e-5` |
| Scheduler | Linear |
| Gradient checkpointing | Unsloth |
| Runtime | ~4.37 hours on an 80GB GPU |
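
As a quick sanity check on the table above, the sketch below derives the effective batch size, the number of optimizer steps for one epoch, and the LoRA scaling factor. It assumes a single-GPU run and assumes RS-LoRA applies the rank-stabilized scaling `alpha / sqrt(rank)` instead of the classic `alpha / rank`. The variable names are illustrative and are not taken from the actual training script.

```python
import math

# Values copied from the training configuration table.
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: single-GPU run
train_examples = 12_000
epochs = 1
lora_rank = 128
lora_alpha = 128

# Effective batch size = per-device batch * accumulation steps * GPUs.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Optimizer steps for the run (a final partial batch, if any, would still count as a step).
optimizer_steps = math.ceil(train_examples / effective_batch_size) * epochs

# Classic LoRA scales the update by alpha / r; rank-stabilized LoRA uses alpha / sqrt(r).
classic_scale = lora_alpha / lora_rank
rs_lora_scale = lora_alpha / math.sqrt(lora_rank)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps:      {optimizer_steps}")
print(f"LoRA scale (classic): {classic_scale:.2f}")
print(f"LoRA scale (RS-LoRA): {rs_lora_scale:.2f}")
```

With rank and alpha both at 128, the rank-stabilized scale comes out to about 11.3, compared with 1.0 under classic LoRA scaling.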

### Final training metrics

| Metric | Value |
|---|---:|
| Final training loss | `0.5517` |
| Final lightweight eval loss | `~0.3161` |
| Train runtime | `15,728.8s` |
| Train samples/sec | `0.763` |
| Train steps/sec | `0.048` |
| Total FLOPs | `1.45e18` |

The lightweight eval loss was measured on a 200-example eval subset during training.
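
The throughput figures can be cross-checked against the runtime with a few lines of arithmetic. This is only a consistency check under the assumption that the run took 750 optimizer steps (12,000 examples at an effective batch size of 16); the step count itself is not reported in the tables.

```python
# Values copied from the tables above.
train_examples = 12_000
optimizer_steps = 750  # assumption: 12,000 examples / effective batch size 16
train_runtime_s = 15_728.8

samples_per_sec = train_examples / train_runtime_s
steps_per_sec = optimizer_steps / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"{samples_per_sec:.3f} samples/s, {steps_per_sec:.3f} steps/s, {runtime_hours:.2f} h")
```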

---

## Dataset Mix and Attribution

This run used a combined reasoning/distillation dataset made from:

1. `trjxter/Kimi-K2.6-Reasoning-3300x-WandB`
2. `Jackrong/Qwen3.5-reasoning-700x`
3. `Jackrong/Claude-opus-4.6-TraceInversion-9000x`

The dataset was normalized into Qwen chat format, preserving assistant reasoning traces in the form:

```text
<think>
...
</think>
final answer
```

After formatting and 16k token filtering, the final usable dataset contained **12,366 examples**:

- **12,000** examples used for training
- **366** examples held out for evaluation
- **200** examples used as the lightweight trainer eval subset (not additional to the 12,366 total)
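
The formatting and split described above can be sketched roughly as follows. This is not the actual preprocessing code: the function names, the fixed shuffle seed, and the choice to draw the 200-example trainer subset from the held-out examples are all assumptions made for illustration.

```python
import random

def format_assistant_turn(reasoning: str, answer: str) -> str:
    """Wrap a reasoning trace in <think> tags, followed by the final answer."""
    return f"<think>\n{reasoning.strip()}\n</think>\n{answer.strip()}"

def split_dataset(examples, n_train=12_000, n_trainer_eval=200, seed=0):
    """Shuffle, then split into train / held-out eval / small trainer eval subset."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    held_out = shuffled[n_train:]
    # Assumption: the trainer eval subset is drawn from the held-out examples.
    trainer_eval = held_out[:n_trainer_eval]
    return train, held_out, trainer_eval

# Stand-in for the 12,366 filtered examples (integers instead of real records).
train, held_out, trainer_eval = split_dataset(range(12_366))
print(len(train), len(held_out), len(trainer_eval))
print(format_assistant_turn("2 + 2 = 4", "4"))
```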

### Special thanks

Special thanks to **Jackrong** and **Kyle Hessling** for the Qwen reasoning and Claude Opus TraceInversion datasets used in this run. Those datasets were not created by me, and this release builds on their dataset work.

---

## Intended Use

This model is intended for experimentation with:

- local reasoning model inference
- llama.cpp-compatible workflows
- structured reasoning prompts
- math and problem solving
- coding and technical reasoning
- long-context reasoning experiments
- comparing GGUF quantization quality across the Q3, Q4, Q5, Q6, and Q8 variants

This is an experimental fine-tune and should be evaluated carefully before use in production or high-stakes settings.

---

## Prompt Format

The model follows Qwen-style chat formatting.

Example:

```text
<|im_start|>user
Solve this step by step: A shop earns $72 from hourly pay, $105 from restringing, $20 from grommets, and $5 from stencils. What is the total?
<|im_end|>
<|im_start|>assistant
<think>
...
</think>
...
<|im_end|>
```

When using a runtime that supports chat templates, prefer applying the Qwen chat template rather than manually formatting prompts.
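
Downstream code often needs the final answer separately from the reasoning trace. The helper below is a minimal sketch of one way to split a raw completion in the format shown above; `split_reasoning` is a hypothetical name, and the handling of a missing opening `<think>` tag is an assumption about how some chat templates may pre-fill it in the prompt.

```python
def split_reasoning(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion containing one <think> block."""
    open_tag, close_tag = "<think>", "</think>"
    text = completion.replace("<|im_end|>", "").strip()
    if close_tag not in text:
        # No reasoning block found: treat everything as the answer.
        return "", text
    reasoning, answer = text.split(close_tag, 1)
    # The opening tag may be absent if the template already emitted it in the prompt.
    reasoning = reasoning.replace(open_tag, "", 1)
    return reasoning.strip(), answer.strip()

sample = "<think>\n72 + 105 + 20 + 5 = 202\n</think>\nThe total is $202.\n<|im_end|>"
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```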

---

## Example llama.cpp Usage

Example command:

```bash
# The prompt is single-quoted so the shell does not expand $9, $15, $10, and $1.
./llama-cli \
  -m Qwimi3.5-9B-Kimik2.6-Opus-Distill-Q4_K_M.gguf \
  -c 16384 \
  -ngl 99 \
  --temp 0.6 \
  --top-p 0.95 \
  -p '<|im_start|>user\nSolve this step by step: If a worker earns $9/hour for 8 hours, plus $15 for each of 7 racquets, $10 for each of 2 grommet replacements, and $1 for each of 5 stencils, how much do they earn?\n<|im_end|>\n<|im_start|>assistant\n'
```

Adjust `-ngl` to match your GPU/VRAM: lower it if the model does not fit in VRAM, and set `-ngl 0` for CPU-only inference.
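
When launching the command from a script, passing the arguments as a list avoids shell quoting pitfalls, such as `$`-prefixed amounts in the prompt being treated as shell variables. The sketch below only builds and prints the argument list, using the same flags as the command above; `build_llama_cli_args` is a hypothetical helper, and the binary path and shortened prompt are assumptions.

```python
import shlex

def build_llama_cli_args(model_path: str, prompt: str, n_gpu_layers: int = 99,
                         ctx_size: int = 16384, binary: str = "./llama-cli") -> list[str]:
    """Build an argv list for llama-cli using the same flags as the command above."""
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx_size),
        "-ngl", str(n_gpu_layers),
        "--temp", "0.6",
        "--top-p", "0.95",
        "-p", prompt,
    ]

# Shortened version of the example prompt, with real newlines instead of \n escapes.
prompt = (
    "<|im_start|>user\n"
    "Solve this step by step: If a worker earns $9/hour for 8 hours, how much do they earn?\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
)
args = build_llama_cli_args("Qwimi3.5-9B-Kimik2.6-Opus-Distill-Q4_K_M.gguf", prompt, n_gpu_layers=0)

# shlex.join shows a copy-pasteable, safely quoted version of the command.
print(shlex.join(args))
```

To actually run it, pass the list to `subprocess.run(args, check=True)`; because no shell is involved, the prompt reaches `llama-cli` unchanged.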

---

## Related Releases

This run may also be released in adapter and merged BF16 formats:

- LoRA adapter: `trjxter/Qwimi3.5-9B-Kimik2.6-Opus-Distill-LoRA`
- Merged BF16: `trjxter/Qwimi3.5-9B-Kimik2.6-Opus-Distill-BF16`


---

## Notes

This model was trained using Unsloth for efficient fine-tuning and Hugging Face TRL for supervised fine-tuning (SFT). The GGUF files were generated from the merged fine-tuned model.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

---

## Disclaimer

This is an experimental research fine-tune. Outputs may contain mistakes, hallucinations, or incorrect reasoning. Always validate important outputs independently.