Configuration Parsing Warning:In config.json: "quantization_config.bits" must be less than or equal to 8
GLM-5.2 — mixed-bit VQ (AQLM) ~1.71-bit, Japanese-tuned, thinking-stable, longer-context
A ~167 GiB quantization of GLM-5.2 (744B Chinese-native reasoning MoE, MIT), tuned for Japanese / English / Chinese general quality, a larger usable context, and well-behaved thinking-mode (reasoning ends and answers instead of running away), using vector quantization with GPTQ error compensation (AQLM-style).
Runs on 2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each) via vLLM.
What's new vs the first VQ release (GLM-5.2-VQ-Arith, ~180 GiB)
Smaller (~167 vs ~180 GiB), quality holds, longer context, and the reasoning terminates cleanly instead of over-thinking:
| VQ-Arith (~180 GiB) | This release (~167 GiB) | |
|---|---|---|
| Avg bits/weight (experts) | ~1.86 | ~1.71 |
| Calibration focus | arithmetic-heavy | de-math, Japanese-strong multilingual + reasoning traces |
down_proj compensation |
naive VQ | GPTQ-VQ compensated (like gate/up) |
| Context (dense MLA, sm_120) | ≤ 4096 | ≤ 16384 (validated needle retrieval to ~8.5K) |
| Thinking-mode termination | could over-think | calibrated to close reasoning & answer |
Four changes, together: (1) a de-math, Japanese-strong multilingual bit allocation + calibration
— the first release over-protected arithmetic, which turned out unnecessary once VQ+compensation is in
place (VQ is robust enough to keep arithmetic without spending extra bits on it); (2)
Japanese-strong Hessians for the GPTQ-VQ compensation (the first release's Hessians were ~0.1 %
Japanese); (3) down_proj is now GPTQ-VQ compensated too; (4) reasoning trajectories in the
calibration, so the model reliably ends its thinking and answers — the first release could keep
thinking for too long on open-ended prompts.
Quality (JA / EN / ZH, temp 0.6)
No language collapse; arithmetic preserved despite the de-emphasis:
- JA / EN / ZH 127×8 = 1016, JA 269×6 = 1614, word problems — all correct.
- Hard factual (JA "second-tallest mountain in Japan" → 北岳), general JA explanations — coherent.
- Thinking mode terminates cleanly (no runaway over-thinking) — use temp 0.6, not greedy.
- Deep needle retrieval works at long context (validated to ~8.5K tokens; max context 16384).
Tip — keeping the reasoning in Japanese
GLM-5.2 is Chinese-native and reasons in English/Chinese by default (the final answer is already correct Japanese). To make the thinking itself Japanese, add a system prompt — for example:
You are a helpful assistant for a Japanese user. Write your ENTIRE response in
Japanese. Your internal reasoning (the thinking process) MUST also be written in
Japanese — do not reason in English or Chinese.
This reliably switches the reasoning to Japanese. The instruction can be written in English or Japanese; both work.
Serving
Not plug-and-play GGUF — needs the matching sm_120 stack:
- vLLM with
GlmMoeDsaForCausalLM+ sm_120 kernels. - transformers 5.12.
- The VQ serving plugin from mmzz164/OneCompression @
glm-serving-v1— seeexample/glm-5.2/. - 2× ~95 GiB sm_120 GPUs, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded).
GLM_CKPT=/path/to/this/model GLM_MAXLEN=16384 bash start_glm_api_vq.sh # OpenAI API :8001, served as "glm-5.2"
MixedVQMoEMethod is auto-selected from the format:"vq" markers in quantization_config.
Performance & limits
- ~16 tok/s steady-state decode (single stream) — the ceiling is the MoE all-reduce over PCIe (no NVLink), not the VQ kernel; same as the first release.
- Context ≤ 16384 on this hardware (dense MLA — sm_120 has no sparse-DSA forward kernel). Dense long-context prefill is O(n²), so very long prompts are slow; the comfortable interactive zone is ~12–16K. True 100K+ needs a sparse-attention kernel (not available on sm_120 yet).
- Default template is thinking-on; use temp 0.6 (not greedy) to avoid reasoning loops.
License & attribution
- This quantized model: MIT.
- Base GLM-5.2: MIT, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
- Quantization/serving built on OneCompression (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).
- Downloads last month
- 454
Model tree for aquaman164/GLM-5.2-VQ-1.7bit-JA
Base model
zai-org/GLM-5.2