---
license: mit
base_model:
  - zai-org/GLM-5.2
base_model_relation: quantized
language:
  - ja
  - en
  - zh
tags:
  - quantization
  - vector-quantization
  - aqlm
  - mixture-of-experts
  - glm
  - vllm
  - long-context
pipeline_tag: text-generation
---

# GLM-5.2 — mixed-bit VQ (AQLM) ~1.71-bit, Japanese-tuned, thinking-stable, longer-context

A **~167 GiB** quantization of [GLM-5.2](https://huggingface.co/zai-org/GLM-5.2) (744B
Chinese-native reasoning MoE, MIT), tuned for **Japanese / English / Chinese** general quality, a
**larger usable context**, and **well-behaved thinking-mode** (reasoning ends and answers instead of
running away), using **vector quantization with GPTQ error compensation (AQLM-style)**.

Runs on **2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each)** via vLLM.

## What's new vs the first VQ release (`GLM-5.2-VQ-Arith`, ~180 GiB)

Smaller **(~167 vs ~180 GiB)**, quality holds, **longer context**, and the **reasoning terminates
cleanly** instead of over-thinking:

| | VQ-Arith (~180 GiB) | **This release (~167 GiB)** |
|---|---|---|
| Avg bits/weight (experts) | ~1.86 | **~1.71** |
| Calibration focus | arithmetic-heavy | **de-math, Japanese-strong multilingual + reasoning traces** |
| `down_proj` compensation | naive VQ | **GPTQ-VQ compensated (like gate/up)** |
| Context (dense MLA, sm_120) | ≤ 4096 | **≤ 16384 (validated needle retrieval to ~8.5K)** |
| Thinking-mode termination | could over-think | **calibrated to close reasoning & answer** |

Four changes, together: (1) a **de-math, Japanese-strong multilingual** bit allocation + calibration
— the first release over-protected arithmetic, which turned out unnecessary once VQ+compensation is in
place (VQ is robust enough to keep arithmetic without spending extra bits on it); (2)
**Japanese-strong Hessians** for the GPTQ-VQ compensation (the first release's Hessians were ~0.1 %
Japanese); (3) **`down_proj` is now GPTQ-VQ compensated** too; (4) **reasoning trajectories in the
calibration**, so the model reliably ends its thinking and answers — the first release could keep
thinking for too long on open-ended prompts.

## Quality (JA / EN / ZH, temp 0.6)

No language collapse; arithmetic preserved despite the de-emphasis:
- JA / EN / ZH **127×8 = 1016**, JA **269×6 = 1614**, word problems — all correct.
- Hard factual (JA "second-tallest mountain in Japan" → 北岳), general JA explanations — coherent.
- **Thinking mode terminates cleanly** (no runaway over-thinking) — use **temp 0.6**, not greedy.
- **Deep needle retrieval** works at long context (validated to ~8.5K tokens; max context 16384).

## Tip — keeping the *reasoning* in Japanese

GLM-5.2 is Chinese-native and **reasons in English/Chinese by default** (the final answer is already
correct Japanese). To make the **thinking itself Japanese**, add a system prompt — for example:

```
You are a helpful assistant for a Japanese user. Write your ENTIRE response in
Japanese. Your internal reasoning (the thinking process) MUST also be written in
Japanese — do not reason in English or Chinese.
```

This reliably switches the reasoning to Japanese. The instruction can be written in English or
Japanese; both work.

## Serving

Not plug-and-play GGUF — needs the matching sm_120 stack:

- **vLLM** with `GlmMoeDsaForCausalLM` + sm_120 kernels.
- **transformers 5.12**.
- The **VQ serving plugin** from **[mmzz164/OneCompression @ `glm-serving-v1`](https://github.com/mmzz164/OneCompression)** — see [`example/glm-5.2/`](https://github.com/mmzz164/OneCompression/tree/glm-serving-v1/example/glm-5.2).
- **2× ~95 GiB sm_120 GPUs**, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded).

```bash
GLM_CKPT=/path/to/this/model GLM_MAXLEN=16384 bash start_glm_api_vq.sh   # OpenAI API :8001, served as "glm-5.2"
```

`MixedVQMoEMethod` is auto-selected from the `format:"vq"` markers in `quantization_config`.

## Performance & limits

- **~16 tok/s** steady-state decode (single stream) — the ceiling is the MoE all-reduce over PCIe
  (no NVLink), not the VQ kernel; same as the first release.
- **Context ≤ 16384** on this hardware (dense MLA — sm_120 has no sparse-DSA forward kernel). Dense
  long-context prefill is O(n²), so very long prompts are slow; the comfortable interactive zone is
  ~12–16K. True 100K+ needs a sparse-attention kernel (not available on sm_120 yet).
- Default template is **thinking-on**; use **temp 0.6** (not greedy) to avoid reasoning loops.

## License & attribution

- This quantized model: **MIT**.
- Base **GLM-5.2**: **MIT**, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
- Quantization/serving built on **OneCompression** (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).