Configuration Parsing Warning:In config.json: "quantization_config.bits" must be less than or equal to 8

GLM-5.2 — mixed-bit VQ (AQLM) ~1.71-bit, Japanese-tuned, thinking-stable, longer-context

A ~167 GiB quantization of GLM-5.2 (744B Chinese-native reasoning MoE, MIT), tuned for Japanese / English / Chinese general quality, a larger usable context, and well-behaved thinking-mode (reasoning ends and answers instead of running away), using vector quantization with GPTQ error compensation (AQLM-style).

Runs on 2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each) via vLLM.

What's new vs the first VQ release (`GLM-5.2-VQ-Arith`, ~180 GiB)

Smaller (~167 vs ~180 GiB), quality holds, longer context, and the reasoning terminates cleanly instead of over-thinking:

	VQ-Arith (~180 GiB)	This release (~167 GiB)
Avg bits/weight (experts)	~1.86	~1.71
Calibration focus	arithmetic-heavy	de-math, Japanese-strong multilingual + reasoning traces
`down_proj` compensation	naive VQ	GPTQ-VQ compensated (like gate/up)
Context (dense MLA, sm_120)	≤ 4096	≤ 16384 (validated needle retrieval to ~8.5K)
Thinking-mode termination	could over-think	calibrated to close reasoning & answer

Four changes, together: (1) a de-math, Japanese-strong multilingual bit allocation + calibration — the first release over-protected arithmetic, which turned out unnecessary once VQ+compensation is in place (VQ is robust enough to keep arithmetic without spending extra bits on it); (2) Japanese-strong Hessians for the GPTQ-VQ compensation (the first release's Hessians were ~0.1 % Japanese); (3) down_proj is now GPTQ-VQ compensated too; (4) reasoning trajectories in the calibration, so the model reliably ends its thinking and answers — the first release could keep thinking for too long on open-ended prompts.

Quality (JA / EN / ZH, temp 0.6)

No language collapse; arithmetic preserved despite the de-emphasis:

JA / EN / ZH 127×8 = 1016, JA 269×6 = 1614, word problems — all correct.
Hard factual (JA "second-tallest mountain in Japan" → 北岳), general JA explanations — coherent.
Thinking mode terminates cleanly (no runaway over-thinking) — use temp 0.6, not greedy.
Deep needle retrieval works at long context (validated to ~8.5K tokens; max context 16384).

Tip — keeping the reasoning in Japanese

GLM-5.2 is Chinese-native and reasons in English/Chinese by default (the final answer is already correct Japanese). To make the thinking itself Japanese, add a system prompt — for example:

You are a helpful assistant for a Japanese user. Write your ENTIRE response in
Japanese. Your internal reasoning (the thinking process) MUST also be written in
Japanese — do not reason in English or Chinese.

This reliably switches the reasoning to Japanese. The instruction can be written in English or Japanese; both work.

Serving

Not plug-and-play GGUF — needs the matching sm_120 stack:

vLLM with GlmMoeDsaForCausalLM + sm_120 kernels.
transformers 5.12.
The VQ serving plugin from mmzz164/OneCompression @ glm-serving-v1 — see example/glm-5.2/.
2× ~95 GiB sm_120 GPUs, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded).

GLM_CKPT=/path/to/this/model GLM_MAXLEN=16384 bash start_glm_api_vq.sh   # OpenAI API :8001, served as "glm-5.2"

MixedVQMoEMethod is auto-selected from the format:"vq" markers in quantization_config.

Performance & limits

~16 tok/s steady-state decode (single stream) — the ceiling is the MoE all-reduce over PCIe (no NVLink), not the VQ kernel; same as the first release.
Context ≤ 16384 on this hardware (dense MLA — sm_120 has no sparse-DSA forward kernel). Dense long-context prefill is O(n²), so very long prompts are slow; the comfortable interactive zone is ~12–16K. True 100K+ needs a sparse-attention kernel (not available on sm_120 yet).
Default template is thinking-on; use temp 0.6 (not greedy) to avoid reasoning loops.

License & attribution

This quantized model: MIT.
Base GLM-5.2: MIT, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
Quantization/serving built on OneCompression (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).

Downloads last month: 454

Safetensors

Model size

49B params

Tensor type

I32

F16

F32

BF16

Model tree for aquaman164/GLM-5.2-VQ-1.7bit-JA

Base model

zai-org/GLM-5.2

Quantized

(80)

this model