GLM-5.2 ds4-native 64GB Q2_K Experimental GGUF
This repository contains an experimental GLM-5.2 GGUF artifact generated for the
glm-dsa preview/diagnostic path in ds4. It is intended for runtime bring-up,
short manual probes, and independent verification, not for normal chat use yet.
The artifact is derived from zai-org/GLM-5.2
and targets the ds4 GLM work in:
- upstream ds4: antirez/ds4
- current runnable fork/branch: andreaborio/ds4#wip/glm52-metal64-strict-probe
Status
This is very experimental.
- Normal ds4 generation for glm-dsa is still disabled.
- The current runnable path is --glm-dsa-preview, a short-context preview path backed by CPU GLM kernels plus optional Metal/CUDA output-head execution.
- The preview uses a small GLM K-nope/K-rot/V cache after prompt prefill, but it is still capped to short prompts and a limited number of generated tokens.
- Long-context DSA indexer cache, production session integration, MTP, sampling, and Metal/CUDA/ROCm GLM graph execution are not implemented in this artifact path.
- Early validation was done remotely on AWS, and local 64 GiB Metal smoke tests now also pass. Broader quality tests are still pending.
Artifact
The original generated file is:
glm52-ds4-native-64g.gguf
size: 244.14 GiB
tensor bytes: 244.13 GiB
resident: 19.56 GiB, q8_0/f16
routed MoE: 224.44 GiB, q2_k
sha256: 99e18cf31620845f6ff8e08ec8074865b613a3684383fa53ebb93e77260d7dd5
For Hugging Face upload and download reliability, the file is published as
6 chunked parts (part-000 through part-005) plus SHA-256 manifests.
Reconstruct it with:
cat glm52-ds4-native-64g.gguf.part-* > glm52-ds4-native-64g.gguf
sha256sum -c glm52-ds4-native-64g.gguf.parts.sha256
sha256sum -c glm52-ds4-native-64g.gguf.sha256
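Because `cat` concatenates whatever the glob matches, a missing part only shows up later as a checksum failure on a 244 GiB file. The sketch below checks for all six parts first. It is not the repository's reconstruct.sh; the helper name `check_parts` is hypothetical, and the part file names are an assumption inferred from the glob and the part-000 through part-005 range above.

```sh
# Sketch: confirm all six parts exist and are non-empty before concatenating.
# Assumes part names of the form glm52-ds4-native-64g.gguf.part-00N (N = 0..5).
check_parts() {
    dir=${1:-.}
    missing=0
    for i in 0 1 2 3 4 5; do
        part="$dir/glm52-ds4-native-64g.gguf.part-00$i"
        if [ ! -s "$part" ]; then
            echo "missing or empty: $part" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when no part was missing.
    [ "$missing" -eq 0 ]
}
```

A possible use is `check_parts . && cat glm52-ds4-native-64g.gguf.part-* > glm52-ds4-native-64g.gguf`, followed by the two `sha256sum -c` checks above.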
Verified Diagnostics
The current evidence is deliberately narrow but real:
inspect: 1524/1524 tensors recognized
first-token diagnostic: rc=0, 37.15s, 20.1 GiB max RSS, 0 swaps
prompt test "Hi": rc=0, top token Hello, 1:55.04, 57.43 GiB max RSS, 0 swaps
prompt test France: rc=0, top token Paris, 1:08.86, 60.08 GiB max RSS, 0 swaps
cached greedy France -n 1:
rc=0, output Paris, 1:00.19, 63.08 GiB max RSS, 0 swaps
cached greedy France -n 2:
rc=0, output Paris., 1:45.69, 63.22 GiB max RSS, 0 swaps
cache-check France -n 1:
ds4 exit=0, prefill logits_max=0, logits_rms=0,
output Paris, 4:43.37, max RSS 63361448 kB
(~60.43 GiB), 0 swaps
cache-check France -n 2:
ds4 exit=0, prefill logits_max=0, decode logits_max=0,
logits_rms=0, output Paris., 5:34.42,
max RSS 63080908 kB (~60.16 GiB), 0 swaps
CUDA output-head smoke:
rc=0 on AWS g6e.4xlarge/L40S, output Paris,
CUDA output-head logits_max=7.62939e-06,
logits_rms=7.13435e-07, same top token,
0:41.29, 71033280 kB max RSS, 0 swaps
CUDA + cache check -n 2:
rc=0 on AWS g6e.4xlarge/L40S, output Paris.,
CUDA output-head checks below 8e-06,
cache checks below 8e-06, 3:03.61,
73175024 kB max RSS, 0 swaps
Metal preview one-shot:
rc=0 on Apple M5 Pro / 64 GiB, output Paris.,
stopped before the next role marker, 63.34s,
45976141824 bytes max RSS, 0 process swaps,
through the local ds4_session GLM preview path
Metal preview REPL:
rc=0 on Apple M5 Pro / 64 GiB, prompt glm-dsa>,
output Paris., clean /quit, 89.68s including startup
and one turn, 47677816832 bytes max RSS, 0 swaps
Metal preview server:
rc=0 on Apple M5 Pro / 64 GiB through
/v1/chat/completions, model glm-dsa-preview,
output content Paris, finish_reason stop, 75.99s
prefill plus 21.43s decode
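The figures above mix units: some max RSS values are in kB, some in bytes, and some in GiB. A minimal sketch for converting them, assuming "kB" means 1024 bytes (which matches the 63361448 kB ≈ 60.43 GiB conversion given above); the helper names are hypothetical:

```sh
# Sketch: normalize the reported max RSS values to GiB.
# Assumes 1 kB = 1024 bytes, consistent with the conversions quoted above.
kb_to_gib() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v / (1024 * 1024) }'
}

bytes_to_gib() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v / (1024 * 1024 * 1024) }'
}

# kb_to_gib 63361448       -> 60.43  (cache-check France -n 1)
# kb_to_gib 63080908       -> 60.16  (cache-check France -n 2)
# bytes_to_gib 45976141824 -> 42.82  (Metal preview one-shot)
```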
The recommended short one-shot command is:
./ds4 --metal --glm-dsa-preview --temp 0 -n 8 \
-m glm52-ds4-native-64g.gguf \
-p "The capital of France is"
Expected output:
Paris.
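For repeated bring-up runs, the exit code and expected text can be checked together. This is a minimal sketch, not part of ds4: the `smoke` helper is hypothetical, and it assumes the generated text is written to stdout.

```sh
# Sketch: run a command, require exit code 0, and require that its
# stdout contains the expected text. Assumes ds4 prints generated text
# to stdout; adjust the redirection if your build differs.
smoke() {
    expected=$1
    shift
    if ! out=$("$@"); then
        echo "FAIL: command exited non-zero" >&2
        return 1
    fi
    case $out in
        *"$expected"*)
            echo "PASS"
            return 0
            ;;
        *)
            echo "FAIL: expected '$expected' in output" >&2
            return 1
            ;;
    esac
}
```

For example: `smoke "Paris" ./ds4 --metal --glm-dsa-preview --temp 0 -n 8 -m glm52-ds4-native-64g.gguf -p "The capital of France is"`.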
Without -p, the same flag starts an experimental interactive prompt:
./ds4 --metal --glm-dsa-preview \
-m glm52-ds4-native-64g.gguf
In the interactive prompt, Ctrl-C interrupts the active preview turn between GLM
layers and returns to glm-dsa> without keeping the partial turn in the local
transcript.
For cache validation, builds of andreaborio/ds4 at commit 405d32f or newer
also support:
DS4_GLM_DSA_CACHE_CHECK=1 ./ds4 --cpu --glm-dsa-greedy-test -n 2 \
-m glm52-ds4-native-64g.gguf \
-p "The capital of France is"
That opt-in mode recomputes checked prefixes through the full short-prompt CPU path and compares logits against the incremental GLM cache. It is intentionally slow and should be treated as a correctness diagnostic, not a benchmark.
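To make the reported numbers concrete, the sketch below computes a maximum absolute difference and a root-mean-square difference between two logit vectors. It assumes that is what logits_max and logits_rms mean; the ds4 source is the authority on the exact definition. The `logit_diff` helper and the one-value-per-line input files are hypothetical; ds4 is not known to dump logits in this form.

```sh
# Sketch: compare two logit vectors stored one value per line.
# Assumes logits_max = max |a_i - b_i| and
# logits_rms = sqrt(mean((a_i - b_i)^2)); this is an interpretation,
# not a confirmed description of the ds4 check.
logit_diff() {
    paste "$1" "$2" | awk '
        BEGIN { max = 0; sumsq = 0; n = 0 }
        {
            d = $1 - $2
            if (d < 0) d = -d
            if (d > max) max = d
            sumsq += d * d
            n++
        }
        END {
            if (n == 0) exit 1
            printf "logits_max=%g logits_rms=%g\n", max, sqrt(sumsq / n)
        }'
}
```

Under this reading, the logits_max=0 and logits_rms=0 results reported for the CPU cache check mean the cached and recomputed logits agree exactly, while the CUDA output-head values near 7.6e-06 are small non-zero differences.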
The public upload was verified at:
https://huggingface.co/andreaborio/glm52-ds4-native-64g-q2k-experimental
files: .gitattributes, README.md, 6 GGUF parts, per-part SHA-256 manifest,
full SHA-256 manifest, reconstruct.sh
An optional tested sidecar pack for the Metal-required selected-id hotlist path is published under:
sidecars/selected-hotlist-n32-pack2600/
This sidecar is not a standalone model and it is not a reordered GGUF. It must
be used together with the original reconstructed glm52-ds4-native-64g.gguf.
It copies 2600 hot layer/expert pairs into a compact local pack while leaving
the source GGUF unchanged. Local 64 GiB Metal-required mixed-backend smokes
with this sidecar measured 2.52 t/s at n=8, 2.28 t/s at n=32, and 2.01 t/s at
n=64. Those warm runs reported zero process swaps and zero process block
input/output operations via /usr/bin/time -l; they are not cold-cache
device-level disk traffic measurements.
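The swap and max RSS figures come from the resource summary that macOS `/usr/bin/time -l` writes to stderr, one "value  label" pair per line. A minimal sketch for pulling a single field out of a saved log; the helper name `time_l_field` and the log file name are hypothetical:

```sh
# Sketch: extract one field from saved `/usr/bin/time -l` output (macOS).
# Each summary line has the form "<value>  <label>"; print the value of
# the first line whose text contains the given label.
time_l_field() {
    awk -v label="$1" 'index($0, label) { print $1; exit }'
}

# Example (sidecar flags are not documented here, so this uses the
# one-shot command from this README):
#   /usr/bin/time -l ./ds4 --metal --glm-dsa-preview --temp 0 -n 8 \
#       -m glm52-ds4-native-64g.gguf \
#       -p "The capital of France is" 2> time.log
#   time_l_field "maximum resident set size" < time.log
#   time_l_field "swaps" < time.log
```

As noted above, these are per-process counters from a warm run, so a zero here does not rule out device-level disk reads on a cold cache.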
Why It Is Slow
The tested GLM layer path is still CPU diagnostic code. Prompt prefill runs all
short-prompt tokens through all 78 base GLM layers and fills a diagnostic
K-nope/K-rot/V cache; later generated tokens use that cache instead of replaying
the whole prompt. On CUDA builds at commit a9022d9 or newer, --cuda
initializes CUDA and runs the final GLM RMSNorm plus Q8_0 vocabulary projection
on CUDA. On current macOS Metal builds, --metal does the same output-head work
on Metal. The attention, MoE, indexer, and cache-update parts remain CPU
diagnostic kernels for now.
Recent ds4 branches also expose DS4_GLM_DSA_ENABLE_BACKEND_Q8_2D=1, an
explicit opt-in that maps the resident GLM Q8_0 2D projection tensors and runs
them through the active backend. On the current local Apple M5 Pro / 64 GiB
machine, a clean Metal --first-token-test -p x run with that flag completed in
13.89 seconds including 5.24 seconds of Metal residency setup, used about 8.43
GB max RSS, and produced the same top token as the CPU diagnostic. The full
short-prompt preview remained slow, however: profiling showed routed MoE, not the
resident Q8_0 projections, dominates the preview runtime. Treat this as backend
bring-up, not as a full GLM graph.
There is also an intentionally off-by-default DS4_GLM_DSA_ENABLE_BACKEND_MOE=1
probe. It keeps GLM's sigmoid top-8 router on CPU and calls the backend routed
MoE entry point with the selected ids and weights. On Metal this validated the
generic backend shape, but it is not a usable 64 GiB path yet: with
DS4_METAL_NO_RESIDENCY=1 DS4_METAL_NO_MODEL_WARMUP=1, the first-token run
mapped 227767.77 MiB of routed tensor spans, took 92.85 seconds, and spent 77.74
seconds in routed MoE. The next runtime step is top-8 selected-expert
streaming/cache support, not broad generic routed-tensor binding.
The runtime still reads and dequantizes routed Q2_K expert weights from a 244 GiB mmap-backed GGUF. The diagnostics show that the quantized artifact and ds4 tensor binding can execute end-to-end and that a first CUDA tranche is wired correctly; they do not measure the speed of a future full Metal/CUDA/ROCm GLM generation graph. A true GLM GPU path still needs GLM-specific graph work for its DSA/MLA attention layout, K/V split, indexer policy, and routed execution.
How To Run
Use the ds4 fork/branch that contains the GLM-DSA diagnostics:
git clone https://github.com/andreaborio/ds4.git
cd ds4
git checkout wip/glm52-metal64-strict-probe
make
For the CUDA output-head diagnostic on an NVIDIA L40S-class machine:
make cuda CUDA_ARCH=sm_89 CUDA_HOME=/usr/local/cuda-12.8
DS4_GLM_DSA_CUDA_CHECK=1 ./ds4 --cuda --glm-dsa-greedy-test -n 1 \
-m /path/to/glm52-ds4-native-64g.gguf \
-p "The capital of France is"
Inspect the reconstructed GGUF:
./ds4 --inspect -m /path/to/glm52-ds4-native-64g.gguf
Run the short greedy diagnostic:
./ds4 --cpu --glm-dsa-greedy-test -n 1 \
-m /path/to/glm52-ds4-native-64g.gguf \
-p "The capital of France is"
Quantization Policy
This is a ds4-native PTQ experiment:
- resident attention, router, shared expert, dense FFN, embeddings, and output tensors use q8_0/f16 where required by the current diagnostic kernels
- routed MoE expert tensors use q2_k
- GLM kv_b_proj tensors are split into ds4-native K/V B projection tensors
- the file is not expected to run in generic GGUF runtimes
A smaller follow-up recipe is now planned but not included in this artifact:
ds4-native-64g-iq2gateup keeps routed expert down projections in q2_k and
switches routed expert gate/up projections to iq2_xxs. Header planning on the
public GLM-5.2 safetensors estimates 212.07 GiB of tensor payload instead of
244.13 GiB, with no current ds4 type, quantizer, or block-alignment blockers.
It still needs a newly generated GGUF and top-8 routed-MoE backend work before it
can be called usable.
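The 212.07 GiB estimate can be approximated with back-of-the-envelope arithmetic. The sketch below is not the header-planning tool; the helper name is hypothetical. It assumes the standard ggml block sizes (q2_k at 84 bytes per 256 weights, 2.625 bits per weight; iq2_xxs at 66 bytes per 256 weights, 2.0625 bits per weight) and that gate, up, and down projections are equal in size, so gate/up hold two thirds of the routed bytes.

```sh
# Sketch: estimate tensor payload after switching routed gate/up
# projections from q2_k to iq2_xxs.
# Assumptions: ggml bits-per-weight values, and gate/up/down of equal size.
estimate_iq2gateup() {
    awk -v total="$1" -v routed="$2" 'BEGIN {
        q2k_bpw = 2.625       # 84 bytes per 256-weight block
        iq2xxs_bpw = 2.0625   # 66 bytes per 256-weight block
        gate_up = routed * 2 / 3
        saved = gate_up * (1 - iq2xxs_bpw / q2k_bpw)
        printf "%.2f\n", total - saved
    }'
}

# estimate_iq2gateup 244.13 224.44   -> 212.07
```

With the 244.13 GiB tensor total and 224.44 GiB routed figure from the Artifact section, this lands on the same 212.07 GiB, a saving of roughly 32 GiB.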
Acknowledgements
This artifact exists because ds4 is simple enough to make a very narrow runtime
port auditable. Please credit antirez/ds4
for the runtime, zai-org/GLM-5.2 for
the base model, and andreaborio/ds4 for
the experimental GLM branch used here.