--- license: mit base_model: - zai-org/GLM-5.2 base_model_relation: quantized language: - ja - en - zh tags: - quantization - vector-quantization - aqlm - mixture-of-experts - glm - vllm - long-context pipeline_tag: text-generation --- # GLM-5.2 — mixed-bit VQ (AQLM) ~1.71-bit, Japanese-tuned, thinking-stable, longer-context A **~167 GiB** quantization of [GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) (744B Chinese-native reasoning MoE, MIT), tuned for **Japanese / English / Chinese** general quality, a **larger usable context**, and **well-behaved thinking-mode** (reasoning ends and answers instead of running away), using **vector quantization with GPTQ error compensation (AQLM-style)**. Runs on **2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each)** via vLLM. ## What's new vs the first VQ release (`GLM-5.2-VQ-Arith`, ~180 GiB) Smaller **(~167 vs ~180 GiB)**, quality holds, **longer context**, and the **reasoning terminates cleanly** instead of over-thinking: | | VQ-Arith (~180 GiB) | **This release (~167 GiB)** | |---|---|---| | Avg bits/weight (experts) | ~1.86 | **~1.71** | | Calibration focus | arithmetic-heavy | **de-math, Japanese-strong multilingual + reasoning traces** | | `down_proj` compensation | naive VQ | **GPTQ-VQ compensated (like gate/up)** | | Context (dense MLA, sm_120) | ≤ 4096 | **≤ 16384 (validated needle retrieval to ~8.5K)** | | Thinking-mode termination | could over-think | **calibrated to close reasoning & answer** | Four changes, together: (1) a **de-math, Japanese-strong multilingual** bit allocation + calibration — the first release over-protected arithmetic, which turned out unnecessary once VQ+compensation is in place (VQ is robust enough to keep arithmetic without spending extra bits on it); (2) **Japanese-strong Hessians** for the GPTQ-VQ compensation (the first release's Hessians were ~0.1 % Japanese); (3) **`down_proj` is now GPTQ-VQ compensated** too; (4) **reasoning trajectories in the calibration**, so the model reliably ends its thinking and answers — the first release could keep thinking for too long on open-ended prompts. ## Quality (JA / EN / ZH, temp 0.6) No language collapse; arithmetic preserved despite the de-emphasis: - JA / EN / ZH **127×8 = 1016**, JA **269×6 = 1614**, word problems — all correct. - Hard factual (JA "second-tallest mountain in Japan" → 北岳), general JA explanations — coherent. - **Thinking mode terminates cleanly** (no runaway over-thinking) — use **temp 0.6**, not greedy. - **Deep needle retrieval** works at long context (validated to ~8.5K tokens; max context 16384). ## Tip — keeping the *reasoning* in Japanese GLM-5.2 is Chinese-native and **reasons in English/Chinese by default** (the final answer is already correct Japanese). To make the **thinking itself Japanese**, add a system prompt — for example: ``` You are a helpful assistant for a Japanese user. Write your ENTIRE response in Japanese. Your internal reasoning (the thinking process) MUST also be written in Japanese — do not reason in English or Chinese. ``` This reliably switches the reasoning to Japanese. The instruction can be written in English or Japanese; both work. ## Serving Not plug-and-play GGUF — needs the matching sm_120 stack: - **vLLM** with `GlmMoeDsaForCausalLM` + sm_120 kernels. - **transformers 5.12**. - The **VQ serving plugin** from **[mmzz164/OneCompression @ `glm-serving-v1`](https://github.com/mmzz164/OneCompression)** — see [`example/glm-5.2/`](https://github.com/mmzz164/OneCompression/tree/glm-serving-v1/example/glm-5.2). - **2× ~95 GiB sm_120 GPUs**, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded). ```bash GLM_CKPT=/path/to/this/model GLM_MAXLEN=16384 bash start_glm_api_vq.sh # OpenAI API :8001, served as "glm-5.2" ``` `MixedVQMoEMethod` is auto-selected from the `format:"vq"` markers in `quantization_config`. ## Performance & limits - **~16 tok/s** steady-state decode (single stream) — the ceiling is the MoE all-reduce over PCIe (no NVLink), not the VQ kernel; same as the first release. - **Context ≤ 16384** on this hardware (dense MLA — sm_120 has no sparse-DSA forward kernel). Dense long-context prefill is O(n²), so very long prompts are slow; the comfortable interactive zone is ~12–16K. True 100K+ needs a sparse-attention kernel (not available on sm_120 yet). - Default template is **thinking-on**; use **temp 0.6** (not greedy) to avoid reasoning loops. ## License & attribution - This quantized model: **MIT**. - Base **GLM-5.2**: **MIT**, © Zhipu AI — this is a derivative; all rights/attribution to upstream. - Quantization/serving built on **OneCompression** (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).