---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/LICENSE
base_model: moonshotai/Kimi-K2-Instruct
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
  - mlx
  - vmlx
  - apple-silicon
  - quantized
  - jangtq
  - jangtq-3l
  - mxtq
  - 3-bit
  - moe
  - reap
  - kimi
  - kimi-k2
language:
  - en
  - zh
---

# Kimi K2.6 — JANGTQ_3L (vMLX / Apple Silicon)

3-bit MXTQ quantized build of [Moonshot AI's Kimi K2.6](https://huggingface.co/moonshotai/Kimi-K2-Instruct) for the vMLX inference engine on Apple Silicon. Routed MoE experts are quantized to 3-bit MXTQ (rotation + Lloyd-Max codebook + per-row packed indices + fp16 norms); attention, routers, shared experts, embeddings, and `lm_head` remain at fp16.

For the upstream Moonshot model card (architecture, license, intended use, evaluations), see [`README_UPSTREAM.md`](./README_UPSTREAM.md).

## Bundle facts

| Property | Value |
|---|---|
| Format | JANGTQ_3L (vMLX) |
| Routed experts | 3-bit MXTQ, per-row packed |
| Other weights | fp16 |
| Pruning | REAP-30 (routing-aware, ~30% of routed experts dropped) |
| Total size | ~288 GB (304 safetensor shards) |
| Source size | ~554 GB (FP8) |
| Size vs. source | ~52% of the FP8 source (~48% smaller) |
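
As a rough illustration of what dropping the lowest-saliency ~30% of routed experts in a layer amounts to, the sketch below picks the experts to keep from a list of per-expert saliency scores. It is not the logic of `prune.py`: the function name `experts_to_keep`, the floor rounding, the tie-breaking rule, and the toy scores are all assumptions made for this example.

```python
import math


def experts_to_keep(saliency, drop_fraction=0.30):
    """Return the sorted indices of experts that survive pruning.

    Hypothetical helper: drops the floor(n * drop_fraction) experts with the
    lowest saliency score in one layer. Rounding and tie-breaking are assumed.
    """
    n_drop = math.floor(len(saliency) * drop_fraction)
    # Rank expert indices from least to most salient (stable sort, so ties
    # keep the lower index first).
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(saliency)) if i not in dropped]


# Toy layer with 10 experts and made-up saliency scores.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4]
kept = experts_to_keep(scores)
print(kept)
```

In this toy layer the three lowest-scoring experts (indices 1, 3, and 7) are dropped and the remaining seven keep their original order.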

## Hardware requirements

- Apple Silicon Mac with **≥ 512 GB unified memory** (tested on M3 Ultra 512 GB).
- Machines with less unified memory cannot load this bundle.

## Build provenance

Produced by `build_kimi_jangtq3l.sh` (included in this directory). Pipeline:

1. `jangreap.py` — streaming layer-by-layer REAP saliency profile (~25 GB peak; replaces `profile.py`, which OOMs on the FP8 source).
2. `prune.py` — drops the lowest-saliency 30% of routed experts per layer.
3. `convert_kimi_jangtq --profile 3L` — quantizes the pruned model to MXTQ 3-bit (per-row pack handles `in_feat % vals_per_u32 != 0`, e.g. 7168 / 2048 with 3 bits).
4. Tokenizer + `chat_template` finalization.
5. `generation_config.json` patch — Kimi turn-boundary IDs:
   - `<|im_end|>` = 163586
   - `<|im_user|>` = 163587
   - `<|im_assistant|>` = 163588
   - `eos_token_id = [163586, 163587, 163588]`
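
The `generation_config.json` patch in step 5 can be pictured as a small JSON rewrite. The following is a minimal sketch, not the code used by the build script: the helper name `patch_generation_config` is hypothetical, and it assumes that all other keys in the file should be preserved unchanged.

```python
import json
from pathlib import Path

# Kimi turn-boundary IDs listed above:
# <|im_end|>, <|im_user|>, <|im_assistant|>
KIMI_EOS_IDS = [163586, 163587, 163588]


def patch_generation_config(path):
    """Set eos_token_id to the three turn-boundary IDs (hypothetical helper).

    Existing keys are kept; a missing file is treated as an empty config.
    """
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config["eos_token_id"] = list(KIMI_EOS_IDS)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `patch_generation_config("generation_config.json")` inside the bundle directory would leave generation stopping on any of the three tokens, assuming the serving engine honors a list-valued `eos_token_id`.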

## Reproducing

```bash
./build_kimi_jangtq3l.sh apply-patches    # idempotent jang_tools patches
./build_kimi_jangtq3l.sh all-pruned       # download → prune → convert → finalize → patch
```

The script applies three required patches to the bundled `jang_tools` install:

- `turboquant/linear.py` — per-row pack in `tq_quantize_weight`
- `turboquant/linear.py` — per-row unpack in `TurboQuantSwitchLinear._dequant_experts`
- `load_jangtq.py` — read `in_features` / `input_dims` from the existing module (avoids overshoot when per-row pad is used: 7168 → 7170)
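
To make the per-row padding concrete: with 3-bit indices, 10 values fit in one 32-bit word, so a row of 7168 indices is padded to 7170 and stored in 717 words. The pure-Python sketch below illustrates that arithmetic and a pack/unpack round trip. It is not the `jang_tools` implementation: the function names and the low-bits-first slot order are assumptions made for this example.

```python
BITS = 3
VALS_PER_U32 = 32 // BITS  # 10 three-bit indices per 32-bit word (2 bits unused)


def padded_len(in_features):
    """Round a row length up to the next multiple of VALS_PER_U32."""
    return -(-in_features // VALS_PER_U32) * VALS_PER_U32


def pack_row(indices):
    """Pack one row of 3-bit codebook indices into a list of 32-bit words.

    The row is zero-padded to a whole number of words. Slot 0 occupies the
    lowest bits of each word (an assumed layout).
    """
    row = list(indices) + [0] * (padded_len(len(indices)) - len(indices))
    words = []
    for start in range(0, len(row), VALS_PER_U32):
        word = 0
        for slot, idx in enumerate(row[start:start + VALS_PER_U32]):
            if not 0 <= idx < (1 << BITS):
                raise ValueError(f"index {idx} does not fit in {BITS} bits")
            word |= idx << (slot * BITS)
        words.append(word)
    return words


def unpack_row(words, in_features):
    """Invert pack_row, trimming the padding back to in_features values."""
    mask = (1 << BITS) - 1
    values = [(w >> (slot * BITS)) & mask
              for w in words for slot in range(VALS_PER_U32)]
    return values[:in_features]


print(padded_len(7168), len(pack_row([0] * 7168)))
```

The third patch matters for exactly this reason: a loader that inferred `in_features` from the packed width would see 7170 rather than the true 7168, so the trim in `unpack_row` has to use the length recorded on the module.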

The marker comment `# JANG3L_PATCH_v1` makes patch application idempotent, and the `rollback-patches` subcommand restores the `.jang3l.bak` backups.
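
One way such marker-based idempotence can work is sketched below. This is an assumption about the general technique, not a transcription of the build script: `apply_patch` and `rollback_patch` are hypothetical helpers, and the choice to tag each replaced snippet individually (so that two patches can target the same file) is made for this example.

```python
from pathlib import Path
import shutil

MARKER = "# JANG3L_PATCH_v1"
BACKUP_SUFFIX = ".jang3l.bak"


def apply_patch(path, old, new):
    """Replace `old` with `new` once and tag it; return True if the file changed."""
    path = Path(path)
    text = path.read_text()
    tagged = f"{new}  {MARKER}"
    if tagged in text:
        return False  # this patch is already in place, nothing to do
    if old not in text:
        raise ValueError(f"anchor text not found in {path}")
    backup = path.with_name(path.name + BACKUP_SUFFIX)
    if not backup.exists():
        # Back up only the pristine file, never an already-patched one.
        shutil.copy2(path, backup)
    path.write_text(text.replace(old, tagged, 1))
    return True


def rollback_patch(path):
    """Restore the backup over the patched file; return True if one existed."""
    path = Path(path)
    backup = path.with_name(path.name + BACKUP_SUFFIX)
    if not backup.exists():
        return False
    backup.replace(path)
    return True
```

Running `apply_patch` twice with the same arguments changes the file only the first time, which is the property that makes re-running a patch step safe.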

## Serving

```bash
./build_kimi_jangtq3l.sh serve
# or directly:
python -m vmlx_engine.cli serve \
  --model ~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L \
  --port 8012 \
  --max-tokens 4096 \
  --default-temperature 0.5 \
  --default-top-p 0.9 \
  --default-repetition-penalty 1.1 \
  --tool-call-parser moonshot \
  --enable-auto-tool-choice
```

OpenAI-compatible endpoints at `http://127.0.0.1:8012/v1/...`.
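
A minimal client sketch using only the Python standard library follows. It assumes the server exposes the conventional OpenAI `/v1/chat/completions` route and response shape; the `model` value is a placeholder label, and whether vMLX checks it against the loaded model is not confirmed here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8012/v1"


def build_chat_request(prompt, model="Kimi-K2.6-JANGTQ_3L", max_tokens=256):
    """Build a POST request for the (assumed) chat completions route."""
    payload = {
        "model": model,  # placeholder name, see the note above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example, with the server from the previous section running:
# print(chat("Summarize what a mixture-of-experts model is in one sentence."))
```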

## Known caveats

- **Paged KV cache is incompatible with this build.** Kimi uses MLA attention with a `CacheList` layout that the paged-cache path does not handle, producing degenerate output (e.g. only `!` tokens). Do **not** pass `--use-paged-cache`, `--enable-block-disk-cache`, `--paged-cache-block-size`, or `--max-cache-blocks`.
- 3-bit MXTQ with per-row packing is slower per token than 2-bit affine routes; the lower speed is the price of better quality.
- Tokenizer is tiktoken-based (no `tokenizer.json`); `trust_remote_code=True` is required.
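
Because the paged-cache flags fail silently (the server starts but emits degenerate output), a launcher wrapper may want to reject them up front. The guard below is a hypothetical sketch: `find_paged_cache_flags` is not part of vMLX, and the handling of the `--flag=value` spelling is an assumption about how the flags might be passed.

```python
# Flags listed in the first caveat above.
PAGED_CACHE_FLAGS = (
    "--use-paged-cache",
    "--enable-block-disk-cache",
    "--paged-cache-block-size",
    "--max-cache-blocks",
)


def find_paged_cache_flags(argv):
    """Return the paged-cache flags found in argv, in the order they appear.

    Both `--flag value` and `--flag=value` spellings are matched.
    """
    return [arg for arg in argv if arg.split("=", 1)[0] in PAGED_CACHE_FLAGS]


offending = find_paged_cache_flags(
    ["serve", "--port", "8012", "--max-cache-blocks=64", "--use-paged-cache"]
)
print(offending)
```

A wrapper could exit with an error whenever the returned list is non-empty, before the model is loaded.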

## Files

- `build_kimi_jangtq3l.sh` — self-contained build script (patches + pipeline + serve).
- `model-*.safetensors` (304 shards) — quantized weights.
- `config.json`, `generation_config.json`, `chat_template.jinja`, `kimi_k25_*.py`, `tiktoken.model` — model + tokenizer config.
- `jang_config.json` — vMLX/jang_tools profile metadata.
- `README_UPSTREAM.md` — original Moonshot model card.
- `LICENSE` — modified MIT (inherited from upstream).

## License

Modified MIT, inherited from the upstream Moonshot Kimi K2.6 release. See [`LICENSE`](./LICENSE) and [`README_UPSTREAM.md`](./README_UPSTREAM.md).

## Credits

- Upstream model: Moonshot AI — Kimi K2.6.
- Quantization toolchain: `jang_tools` / vMLX (JANGQ-AI).
- This bundle: produced locally on M3 Ultra 512 GB by user `dvd.pugliese@gmail.com`.