Configuration Parsing Warning:In config.json: "quantization_config.bits" must be less than or equal to 8

GLM-5.2 — mixed-bit VQ (AQLM) ~1.71-bit, Japanese-tuned, thinking-stable, longer-context

A ~167 GiB quantization of GLM-5.2 (744B Chinese-native reasoning MoE, MIT), tuned for Japanese / English / Chinese general quality, a larger usable context, and well-behaved thinking-mode (reasoning ends and answers instead of running away), using vector quantization with GPTQ error compensation (AQLM-style).

Runs on 2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each) via vLLM.

What's new vs the first VQ release (GLM-5.2-VQ-Arith, ~180 GiB)

Smaller (~167 vs ~180 GiB), quality holds, longer context, and the reasoning terminates cleanly instead of over-thinking:

VQ-Arith (~180 GiB) This release (~167 GiB)
Avg bits/weight (experts) ~1.86 ~1.71
Calibration focus arithmetic-heavy de-math, Japanese-strong multilingual + reasoning traces
down_proj compensation naive VQ GPTQ-VQ compensated (like gate/up)
Context (dense MLA, sm_120) ≤ 4096 ≤ 16384 (validated needle retrieval to ~8.5K)
Thinking-mode termination could over-think calibrated to close reasoning & answer

Four changes, together: (1) a de-math, Japanese-strong multilingual bit allocation + calibration — the first release over-protected arithmetic, which turned out unnecessary once VQ+compensation is in place (VQ is robust enough to keep arithmetic without spending extra bits on it); (2) Japanese-strong Hessians for the GPTQ-VQ compensation (the first release's Hessians were ~0.1 % Japanese); (3) down_proj is now GPTQ-VQ compensated too; (4) reasoning trajectories in the calibration, so the model reliably ends its thinking and answers — the first release could keep thinking for too long on open-ended prompts.

Quality (JA / EN / ZH, temp 0.6)

No language collapse; arithmetic preserved despite the de-emphasis:

  • JA / EN / ZH 127×8 = 1016, JA 269×6 = 1614, word problems — all correct.
  • Hard factual (JA "second-tallest mountain in Japan" → 北岳), general JA explanations — coherent.
  • Thinking mode terminates cleanly (no runaway over-thinking) — use temp 0.6, not greedy.
  • Deep needle retrieval works at long context (validated to ~8.5K tokens; max context 16384).

Tip — keeping the reasoning in Japanese

GLM-5.2 is Chinese-native and reasons in English/Chinese by default (the final answer is already correct Japanese). To make the thinking itself Japanese, add a system prompt — for example:

You are a helpful assistant for a Japanese user. Write your ENTIRE response in
Japanese. Your internal reasoning (the thinking process) MUST also be written in
Japanese — do not reason in English or Chinese.

This reliably switches the reasoning to Japanese. The instruction can be written in English or Japanese; both work.

Serving

Not plug-and-play GGUF — needs the matching sm_120 stack:

GLM_CKPT=/path/to/this/model GLM_MAXLEN=16384 bash start_glm_api_vq.sh   # OpenAI API :8001, served as "glm-5.2"

MixedVQMoEMethod is auto-selected from the format:"vq" markers in quantization_config.

Performance & limits

  • ~16 tok/s steady-state decode (single stream) — the ceiling is the MoE all-reduce over PCIe (no NVLink), not the VQ kernel; same as the first release.
  • Context ≤ 16384 on this hardware (dense MLA — sm_120 has no sparse-DSA forward kernel). Dense long-context prefill is O(n²), so very long prompts are slow; the comfortable interactive zone is ~12–16K. True 100K+ needs a sparse-attention kernel (not available on sm_120 yet).
  • Default template is thinking-on; use temp 0.6 (not greedy) to avoid reasoning loops.

License & attribution

  • This quantized model: MIT.
  • Base GLM-5.2: MIT, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
  • Quantization/serving built on OneCompression (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).
Downloads last month
454
Safetensors
Model size
49B params
Tensor type
I32
·
F16
·
F32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for aquaman164/GLM-5.2-VQ-1.7bit-JA

Base model

zai-org/GLM-5.2
Quantized
(80)
this model