---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/LICENSE
base_model: moonshotai/Kimi-K2-Instruct
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- vmlx
- apple-silicon
- quantized
- jangtq
- jangtq-3l
- mxtq
- 3-bit
- moe
- reap
- kimi
- kimi-k2
language:
- en
- zh
---

# Kimi K2.6: JANGTQ_3L (vMLX / Apple Silicon)

3-bit MXTQ quantized build of [Moonshot AI's Kimi K2.6](https://huggingface.co/moonshotai/Kimi-K2-Instruct) for the vMLX inference engine on Apple Silicon. Routed MoE experts are quantized to 3-bit MXTQ (rotation + Lloyd-Max codebook + per-row packed indices + fp16 norms); attention, routers, shared experts, embeddings, and `lm_head` remain at fp16.

For the upstream Moonshot model card (architecture, license, intended use, evaluations), see [`README_UPSTREAM.md`](./README_UPSTREAM.md).

## Bundle facts

| | |
|---|---|
| Format | JANGTQ_3L (vMLX) |
| Routed experts | 3-bit MXTQ, per-row packed |
| Other weights | fp16 |
| Pruning | REAP-30 (routing-aware, ~30% of routed experts dropped) |
| Total size | ~288 GB (304 safetensor shards) |
| Source size | ~554 GB (FP8) |
| Size vs. source | ~52% of the source (~48% smaller) |

## Hardware requirements

- Apple Silicon Mac with **≥ 512 GB unified memory** (tested on M3 Ultra 512 GB).
- Configurations with less memory will not load this bundle.

## Build provenance

Produced by `build_kimi_jangtq3l.sh` (included in this directory). Pipeline:

1. `jangreap.py`: streaming layer-by-layer REAP saliency profile (~25 GB peak; replaces `profile.py`, which runs out of memory on the FP8 source).
2. `prune.py`: drops the lowest-saliency 30% of routed experts per layer.
3. `convert_kimi_jangtq --profile 3L`: quantizes the pruned model to MXTQ 3-bit (the per-row pack handles `in_feat % vals_per_u32 != 0`, e.g. `in_feat` = 7168 or 2048 with 3-bit values).
4. Tokenizer + `chat_template` finalization.
5. `generation_config.json` patch, which sets the Kimi turn-boundary IDs:
   - `<|im_end|>` = 163586
   - `<|im_user|>` = 163587
   - `<|im_assistant|>` = 163588
   - `eos_token_id = [163586, 163587, 163588]`

## Reproducing

```bash
./build_kimi_jangtq3l.sh apply-patches   # idempotent jang_tools patches
./build_kimi_jangtq3l.sh all-pruned      # download → prune → convert → finalize → patch
```

The script applies three required patches to the bundled `jang_tools` install:

- `turboquant/linear.py`: per-row pack in `tq_quantize_weight`
- `turboquant/linear.py`: per-row unpack in `TurboQuantSwitchLinear._dequant_experts`
- `load_jangtq.py`: read `in_features` / `input_dims` from the existing module (avoids overshoot when per-row padding is used: 7168 → 7170)

The marker comment `# JANG3L_PATCH_v1` makes patch application idempotent. `rollback-patches` restores the `.jang3l.bak` backups.

## Serving

```bash
./build_kimi_jangtq3l.sh serve

# or directly:
python -m vmlx_engine.cli serve \
  --model ~/.cache/huggingface/hub/deviad/Kimi-K2.6-JANGTQ_3L \
  --port 8012 \
  --max-tokens 4096 \
  --default-temperature 0.5 \
  --default-top-p 0.9 \
  --default-repetition-penalty 1.1 \
  --tool-call-parser moonshot \
  --enable-auto-tool-choice
```

OpenAI-compatible endpoints are served at `http://127.0.0.1:8012/v1/...`.

## Known caveats

- **Paged KV cache is incompatible with this build.** Kimi uses MLA attention with a `CacheList` layout that the paged-cache path does not handle, producing degenerate output (e.g. only `!` tokens). Do **not** pass `--use-paged-cache`, `--enable-block-disk-cache`, `--paged-cache-block-size`, or `--max-cache-blocks`.
- 3-bit MXTQ + per-row pack is slower per token than 2-bit affine routes; quality is the tradeoff.
- Tokenizer is tiktoken-based (no `tokenizer.json`); `trust_remote_code=True` is required.

## Files

- `build_kimi_jangtq3l.sh`: self-contained build script (patches + pipeline + serve).
- `model-*.safetensors` (304 shards): quantized weights.
- `config.json`, `generation_config.json`, `chat_template.jinja`, `kimi_k25_*.py`, `tiktoken.model`: model + tokenizer config.
- `jang_config.json`: vMLX/jang_tools profile metadata.
- `README_UPSTREAM.md`: original Moonshot model card.
- `LICENSE`: modified MIT (inherited from upstream).

## License

Modified MIT, inherited from the upstream Moonshot Kimi K2.6 release. See [`LICENSE`](./LICENSE) and [`README_UPSTREAM.md`](./README_UPSTREAM.md).

## Credits

- Upstream model: Moonshot AI, Kimi K2.6.
- Quantization toolchain: `jang_tools` / vMLX (JANGQ-AI).
- This bundle: produced locally on M3 Ultra 512 GB by user `dvd.pugliese@gmail.com`.
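
## Appendix: per-row packing arithmetic

The per-row pack mentioned under "Build provenance" and "Reproducing" pads each row so that its length is a multiple of `vals_per_u32`, which is why 7168 becomes 7170. The sketch below illustrates that arithmetic only; it is not the `jang_tools` implementation. It assumes `vals_per_u32 = 32 // 3 = 10` (consistent with 7168 → 7170) and an LSB-first bit layout, and the helper names `padded_width`, `pack_row`, and `unpack_row` are hypothetical.

```python
from __future__ import annotations

BITS = 3
VALS_PER_U32 = 32 // BITS  # assumed: 10 three-bit values per word, 2 bits unused


def padded_width(in_feat: int, vals_per_u32: int = VALS_PER_U32) -> int:
    """Round a row length up to the next multiple of vals_per_u32."""
    return -(-in_feat // vals_per_u32) * vals_per_u32


def pack_row(indices: list[int], bits: int = BITS) -> list[int]:
    """Pack one row of codebook indices into 32-bit words, zero-padding the tail."""
    per_word = 32 // bits
    pad = padded_width(len(indices), per_word) - len(indices)
    row = indices + [0] * pad
    words = []
    for start in range(0, len(row), per_word):
        word = 0
        for slot, value in enumerate(row[start:start + per_word]):
            if not 0 <= value < (1 << bits):
                raise ValueError(f"index {value} does not fit in {bits} bits")
            word |= value << (slot * bits)  # LSB-first layout (an assumption)
        words.append(word)
    return words


def unpack_row(words: list[int], in_feat: int, bits: int = BITS) -> list[int]:
    """Inverse of pack_row; trims the padding back to the true in_feat."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    values = [(w >> (slot * bits)) & mask for w in words for slot in range(per_word)]
    return values[:in_feat]


if __name__ == "__main__":
    for in_feat in (7168, 2048):
        width = padded_width(in_feat)
        print(in_feat, "->", width, "values in", width // VALS_PER_U32, "words")
```

Note that `unpack_row` needs the true `in_feat` to trim the padding. Inferring the width from the packed data alone would yield the padded value (7170 instead of 7168), which is the overshoot the `load_jangtq.py` patch avoids by reading `in_features` from the existing module.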