GLM-5.2 ds4-native 64GB Q2_K Experimental GGUF

This repository contains an experimental GLM-5.2 GGUF artifact generated for the glm-dsa preview/diagnostic path in ds4. It is intended for runtime bring-up, short manual probes, and independent verification, not for normal chat use yet.

The artifact is derived from zai-org/GLM-5.2 and targets the GLM work in the ds4 fork and branch given under How To Run.

Status

This is very experimental.

  • Normal ds4 generation for glm-dsa is still disabled.
  • The current runnable path is --glm-dsa-preview, a short-context preview path backed by CPU GLM kernels plus optional Metal/CUDA output-head execution.
  • The preview uses a small GLM K-nope/K-rot/V cache after prompt prefill, but it is still capped to short prompts and a limited number of generated tokens.
  • Long-context DSA indexer cache, production session integration, MTP, sampling, and Metal/CUDA/ROCm GLM graph execution are not implemented in this artifact path.
  • Early validation was done remotely on AWS, and local 64 GiB Metal smoke tests now also pass. Broader quality tests are still pending.

Artifact

The original generated file is:

glm52-ds4-native-64g.gguf
size:         244.14 GiB
tensor bytes: 244.13 GiB
resident:     19.56 GiB, q8_0/f16
routed MoE:  224.44 GiB, q2_k
sha256:       99e18cf31620845f6ff8e08ec8074865b613a3684383fa53ebb93e77260d7dd5

For Hugging Face upload and download reliability, the file is published as 6 chunked parts (part-000 through part-005) plus SHA-256 manifests. Reconstruct it with:

cat glm52-ds4-native-64g.gguf.part-* > glm52-ds4-native-64g.gguf
sha256sum -c glm52-ds4-native-64g.gguf.parts.sha256
sha256sum -c glm52-ds4-native-64g.gguf.sha256
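A glob such as part-* still succeeds when one part is missing, so the problem would only surface at the final checksum, after a very large file has been written. The following is a minimal pre-check sketch. It assumes the parts are named glm52-ds4-native-64g.gguf.part-000 through .part-005, as the glob above implies. missing_parts is a hypothetical helper for illustration and is not the repository's reconstruct.sh.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every expected part that is absent
# from the directory given as $1 (defaults to the current directory).
# Assumes six parts named <base>.part-000 .. <base>.part-005.
missing_parts() {
  dir="${1:-.}"
  base="glm52-ds4-native-64g.gguf"
  for i in 0 1 2 3 4 5; do
    [ -f "$dir/$base.part-00$i" ] || echo "$base.part-00$i"
  done
}

# Example use (not executed here): concatenate only when nothing is missing.
#   [ -z "$(missing_parts .)" ] && \
#     cat glm52-ds4-native-64g.gguf.part-* > glm52-ds4-native-64g.gguf
```

An empty result means all six parts are present. The two sha256sum -c checks above remain the actual integrity test.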

Verified Diagnostics

The current evidence is deliberately narrow but real:

inspect:                 1524/1524 tensors recognized
first-token diagnostic:  rc=0, 37.15s, 20.1 GiB max RSS, 0 swaps
prompt test "Hi":        rc=0, top token Hello, 1:55.04, 57.43 GiB max RSS, 0 swaps
prompt test France:      rc=0, top token Paris, 1:08.86, 60.08 GiB max RSS, 0 swaps
cached greedy France -n 1:
                         rc=0, output Paris, 1:00.19, 63.08 GiB max RSS, 0 swaps
cached greedy France -n 2:
                         rc=0, output Paris., 1:45.69, 63.22 GiB max RSS, 0 swaps
cache-check France -n 1:
                         ds4 exit=0, prefill logits_max=0, logits_rms=0,
                         output Paris, 4:43.37, max RSS 63361448 kB
                         (~60.43 GiB), 0 swaps
cache-check France -n 2:
                         ds4 exit=0, prefill logits_max=0, decode logits_max=0,
                         logits_rms=0, output Paris., 5:34.42,
                         max RSS 63080908 kB (~60.16 GiB), 0 swaps
CUDA output-head smoke:
                         rc=0 on AWS g6e.4xlarge/L40S, output Paris,
                         CUDA output-head logits_max=7.62939e-06,
                         logits_rms=7.13435e-07, same top token,
                         0:41.29, 71033280 kB max RSS, 0 swaps
CUDA + cache check -n 2:
                         rc=0 on AWS g6e.4xlarge/L40S, output Paris.,
                         CUDA output-head checks below 8e-06,
                         cache checks below 8e-06, 3:03.61,
                         73175024 kB max RSS, 0 swaps
Metal preview one-shot:
                         rc=0 on Apple M5 Pro / 64 GiB, output Paris.,
                         stopped before the next role marker, 63.34s,
                         45976141824 bytes max RSS, 0 process swaps,
                         through the local ds4_session GLM preview path
Metal preview REPL:
                         rc=0 on Apple M5 Pro / 64 GiB, prompt glm-dsa>,
                         output Paris., clean /quit, 89.68s including startup
                         and one turn, 47677816832 bytes max RSS, 0 swaps
Metal preview server:
                         rc=0 on Apple M5 Pro / 64 GiB through
                         /v1/chat/completions, model glm-dsa-preview,
                         output content Paris, finish_reason stop, 75.99s
                         prefill plus 21.43s decode
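The max RSS figures above are reported in three units (GiB, kB, and bytes), which makes them awkward to compare. The sketch below normalizes them to GiB. It treats kB as 1024 bytes, which is consistent with the conversions already shown in the list (63361448 kB is about 60.43 GiB). rss_to_gib is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a max-RSS reading to GiB with two decimals.
# $1 = value, $2 = unit: "kB" (treated as 1024 bytes) or "B" (bytes).
rss_to_gib() {
  awk -v v="$1" -v u="$2" 'BEGIN {
    div = (u == "kB") ? 1048576 : 1073741824
    printf "%.2f\n", v / div
  }'
}

rss_to_gib 63361448 kB      # cache-check France -n 1  -> 60.43
rss_to_gib 71033280 kB      # CUDA output-head smoke   -> 67.74
rss_to_gib 45976141824 B    # Metal preview one-shot   -> 42.82
rss_to_gib 47677816832 B    # Metal preview REPL       -> 44.40
```

These figures are measured on different machines and code paths, so they are not directly comparable as a benchmark.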

The recommended short one-shot command is:

./ds4 --metal --glm-dsa-preview --temp 0 -n 8 \
  -m glm52-ds4-native-64g.gguf \
  -p "The capital of France is"

Expected output:

Paris.

Without -p, the same flag starts an experimental interactive prompt:

./ds4 --metal --glm-dsa-preview \
  -m glm52-ds4-native-64g.gguf

In the interactive prompt, Ctrl-C interrupts the active preview turn between GLM layers and returns to glm-dsa> without keeping the partial turn in the local transcript.

For cache validation, builds from andreaborio/ds4 at commit 405d32f or newer also support:

DS4_GLM_DSA_CACHE_CHECK=1 ./ds4 --cpu --glm-dsa-greedy-test -n 2 \
  -m glm52-ds4-native-64g.gguf \
  -p "The capital of France is"

That opt-in mode recomputes checked prefixes through the full short-prompt CPU path and compares logits against the incremental GLM cache. It is intentionally slow and should be treated as a correctness diagnostic, not a benchmark.
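The comparison can be illustrated outside ds4. The sketch below assumes that logits_max and logits_rms in the diagnostics mean the maximum absolute difference and the root-mean-square difference between two logit vectors. The source does not define them, so treat that reading as an assumption. The one-value-per-line file format and the logit_diff name are also hypothetical.

```shell
#!/bin/sh
# Compare two logit vectors stored one value per line in files $1 and $2.
# Prints "<max abs diff> <rms diff>".
# Assumption: this mirrors what logits_max / logits_rms report; ds4 itself
# does not necessarily dump logits in this format.
logit_diff() {
  paste "$1" "$2" | awk '
    BEGIN { max = 0; sumsq = 0; n = 0 }
    {
      d = $1 - $2
      if (d < 0) d = -d
      if (d > max) max = d
      sumsq += d * d
      n++
    }
    END {
      if (n == 0) { print "0 0"; exit }
      printf "%g %g\n", max, sqrt(sumsq / n)
    }'
}
```

Under this reading, the value 0 reported by the CPU cache checks means the cached and recomputed logits matched exactly, while the CUDA checks report small nonzero differences below 8e-06.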

The public upload was verified at:

https://huggingface.co/andreaborio/glm52-ds4-native-64g-q2k-experimental
files: .gitattributes, README.md, 6 GGUF parts, per-part SHA-256 manifest,
       full SHA-256 manifest, reconstruct.sh

An optional tested sidecar pack for the Metal-required selected-id hotlist path is published under:

sidecars/selected-hotlist-n32-pack2600/

This sidecar is not a standalone model and it is not a reordered GGUF. It must be used together with the original reconstructed glm52-ds4-native-64g.gguf. It copies 2600 hot layer/expert pairs into a compact local pack while leaving the source GGUF unchanged. Local 64 GiB Metal-required mixed-backend smoke tests with this sidecar measured 2.52 t/s at n=8, 2.28 t/s at n=32, and 2.01 t/s at n=64. Those warm runs reported zero process swaps and zero process block input/output operations via /usr/bin/time -l; they are not cold-cache device-level disk traffic measurements.
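To collect the same kind of process-level counters on your own machine, note that the verbose flag of /usr/bin/time differs by platform: BSD and macOS use -l, GNU time uses -v. A small sketch follows; time_flag is a hypothetical helper name.

```shell
#!/bin/sh
# Pick the verbose flag for /usr/bin/time based on the OS name in $1
# (defaults to the output of `uname -s`).
# BSD/macOS time uses -l; GNU time (typical on Linux) uses -v.
time_flag() {
  case "${1:-$(uname -s)}" in
    Darwin|FreeBSD) echo "-l" ;;
    *)              echo "-v" ;;
  esac
}

# Example use (not executed here):
#   /usr/bin/time "$(time_flag)" ./ds4 --metal --glm-dsa-preview --temp 0 -n 8 \
#     -m glm52-ds4-native-64g.gguf -p "The capital of France is"
```

macOS reports maximum resident set size in bytes and GNU time reports it in kbytes, which matches the mixed units in the diagnostics above.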

Why It Is Slow

The tested GLM layer path is still CPU diagnostic code. Prompt prefill runs all short-prompt tokens through all 78 base GLM layers and fills a diagnostic K-nope/K-rot/V cache; later generated tokens use that cache instead of replaying the whole prompt. On CUDA builds at commit a9022d9 or newer, --cuda initializes CUDA and runs the final GLM RMSNorm plus Q8_0 vocabulary projection on CUDA. On current macOS Metal builds, --metal does the same output-head work on Metal. The attention, MoE, indexer, and cache-update parts remain CPU diagnostic kernels for now.
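A rough way to see what the cache buys is to count how many times a token has to pass through a layer. The sketch below is a simplified counting model, not a measurement: it ignores the growth of attention cost with context length, and the prompt length of 6 tokens is a made-up example value.

```shell
#!/bin/sh
# Count token-through-layer evaluations needed to emit $2 tokens from a
# $1-token prompt across 78 layers (the layer count stated above).
LAYERS=78

# With the cache: prefill the prompt once, then feed one new token per step.
passes_cached() {
  echo $(( ($1 + $2 - 1) * LAYERS ))
}

# Without the cache: replay the whole sequence for every generated token.
passes_replay() {
  echo $(( ($2 * $1 + $2 * ($2 - 1) / 2) * LAYERS ))
}

passes_cached 6 8   # -> 1014
passes_replay 6 8   # -> 5928
```

Even in this toy model the gap widens quickly with the number of generated tokens, which is why the preview keeps a cache despite being a diagnostic path.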

Recent ds4 branches also expose DS4_GLM_DSA_ENABLE_BACKEND_Q8_2D=1, an explicit opt-in that maps the resident GLM Q8_0 2D projection tensors and runs them through the active backend. On the current local Apple M5 Pro / 64 GiB machine, a clean Metal --first-token-test -p x run with that flag completed in 13.89 seconds including 5.24 seconds of Metal residency setup, used about 8.43 GB max RSS, and produced the same top token as the CPU diagnostic. The full short-prompt preview remained slow, however: profiling showed routed MoE, not the resident Q8_0 projections, dominates the preview runtime. Treat this as backend bring-up, not as a full GLM graph.

There is also an intentionally off-by-default DS4_GLM_DSA_ENABLE_BACKEND_MOE=1 probe. It keeps GLM's sigmoid top-8 router on CPU and calls the backend routed MoE entry point with the selected ids and weights. On Metal this validated the generic backend shape, but it is not a usable 64 GB path yet: with DS4_METAL_NO_RESIDENCY=1 DS4_METAL_NO_MODEL_WARMUP=1, the first-token run mapped 227767.77 MiB of routed tensor spans, took 92.85 seconds, and spent 77.74 seconds in routed MoE. The next runtime step is top-8 selected-expert streaming/cache support, not broad generic routed-tensor binding.
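The numbers in that probe are easier to read as a share and in GiB. A small arithmetic sketch using only the figures quoted above; share_pct and mib_to_gib are hypothetical helper names.

```shell
#!/bin/sh
# Percentage of a total spent in one component, one decimal place.
share_pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# MiB to GiB, two decimals.
mib_to_gib() {
  awk -v m="$1" 'BEGIN { printf "%.2f\n", m / 1024 }'
}

share_pct 77.74 92.85   # routed MoE share of the first-token run -> 83.7
mib_to_gib 227767.77    # mapped routed tensor spans in GiB       -> 222.43
```

Roughly five sixths of the run is routed MoE, and the mapped spans are several times larger than a 64 GiB machine's memory, which is consistent with the conclusion that selected-expert streaming is the next step.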

The runtime still reads and dequantizes routed Q2_K expert weights from a 244 GiB mmap-backed GGUF. This proves that the quantized artifact and ds4 tensor binding can execute end-to-end and that a first CUDA tranche is wired correctly; it does not measure the speed of a future full Metal/CUDA/ROCm GLM generation graph. A true GLM GPU path still needs GLM-specific graph work for its DSA/MLA attention layout, K/V split, indexer policy, and routed execution.

How To Run

Use the ds4 fork/branch that contains the GLM-DSA diagnostics:

git clone https://github.com/andreaborio/ds4.git
cd ds4
git checkout wip/glm52-metal64-strict-probe
make

For the CUDA output-head diagnostic on an NVIDIA L40S-class machine:

make cuda CUDA_ARCH=sm_89 CUDA_HOME=/usr/local/cuda-12.8
DS4_GLM_DSA_CUDA_CHECK=1 ./ds4 --cuda --glm-dsa-greedy-test -n 1 \
  -m /path/to/glm52-ds4-native-64g.gguf \
  -p "The capital of France is"

Inspect the reconstructed GGUF:

./ds4 --inspect -m /path/to/glm52-ds4-native-64g.gguf

Run the short greedy diagnostic:

./ds4 --cpu --glm-dsa-greedy-test -n 1 \
  -m /path/to/glm52-ds4-native-64g.gguf \
  -p "The capital of France is"

Quantization Policy

This is a ds4-native PTQ experiment:

  • resident attention, router, shared expert, dense FFN, embeddings, and output tensors use q8_0/f16 where required by the current diagnostic kernels
  • routed MoE expert tensors use q2_k
  • GLM kv_b_proj tensors are split into ds4-native K/V B projection tensors
  • the file is not expected to run in generic GGUF runtimes

A smaller follow-up recipe is now planned but not included in this artifact: ds4-native-64g-iq2gateup keeps routed expert down projections in q2_k and switches routed expert gate/up projections to iq2_xxs. Header planning on the public GLM-5.2 safetensors estimates 212.07 GiB of tensor payload instead of 244.13 GiB, with no current ds4 type, quantizer, or block-alignment blockers. It still needs a newly generated GGUF and top-8 routed-MoE backend work before it can be called usable.
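The estimated saving can be sanity-checked by hand. The sketch below rests on two assumptions that the source does not state: that ds4 uses the usual ggml block layouts (84 bytes per 256 weights for q2_k, 66 bytes per 256 weights for iq2_xxs), and that gate, up, and down projections are the same size, so gate/up hold two thirds of the routed bytes. iq2_gateup_savings is a hypothetical helper name.

```shell
#!/bin/sh
# Estimate GiB saved by moving routed gate/up projections from q2_k to iq2_xxs.
# $1 = current routed MoE size in GiB (all q2_k).
# Assumptions: 84-byte q2_k and 66-byte iq2_xxs blocks per 256 weights;
# gate/up account for 2/3 of the routed expert bytes.
iq2_gateup_savings() {
  awk -v routed="$1" 'BEGIN {
    gateup = routed * 2 / 3
    printf "%.2f\n", gateup * (1 - 66 / 84)
  }'
}

iq2_gateup_savings 224.44   # -> 32.06
```

Under these assumptions the result matches the difference between the two payload figures quoted above (244.13 GiB minus 212.07 GiB is 32.06 GiB).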

Acknowledgements

This artifact exists because ds4 is simple enough to make a very narrow runtime port auditable. Please credit antirez/ds4 for the runtime, zai-org/GLM-5.2 for the base model, and andreaborio/ds4 for the experimental GLM branch used here.
