---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.6B` parameter draft (speculative decoding) model for use with [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), [GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air) and [GLM-4-32B-0414](https://huggingface.co/zai-org/GLM-4-32B-0414).

See [GLM-4.5-DRAFT-0.6B-v3.0](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0) for the models in `transformers` format, and a detailed explanation of how the model was created.

---

I've included the `Q4_0` quants for 3 different context lengths:

- [GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf)
- [GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf)
- [GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf)
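
If you want to choose between these files programmatically, the following is a minimal sketch. It assumes that `32k`, `64k` and `128k` mean multiples of 1024 tokens and that you want the smallest variant that covers your context; the helper names `pick_draft_file` and `download_url` are hypothetical and not part of any library.

```python
# Base URL taken from the download links listed above.
REPO_URL = "https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main"

# Maximum context length (in tokens) -> file name.
# Assumption: "32k" means 32 * 1024 tokens, and likewise for 64k and 128k.
VARIANTS = {
    32 * 1024: "GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf",
    64 * 1024: "GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf",
    128 * 1024: "GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf",
}


def pick_draft_file(required_ctx: int) -> str:
    """Return the smallest variant whose context length covers required_ctx."""
    for max_ctx in sorted(VARIANTS):
        if required_ctx <= max_ctx:
            return VARIANTS[max_ctx]
    raise ValueError(f"No variant supports a context of {required_ctx} tokens")


def download_url(filename: str) -> str:
    """Build the direct download URL for one of the files above."""
    return f"{REPO_URL}/{filename}"


if __name__ == "__main__":
    name = pick_draft_file(40_000)
    print(name)
    print(download_url(name))
```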

---

## NOTES:

- The 14 heads of `Qwen2.5-0.5B` don't allow any of the other 4-bit quants to be made (and experimentation has shown that using more or fewer than 4 bits for speculative decoding is a waste of time anyway).
- Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only use the longer-context versions when you actually need to process long contexts.
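
The first note above can be illustrated with a little arithmetic. This is only a sketch of one plausible reading, under an assumption that this card does not state: that the restriction comes from quantization block sizes. With 14 heads of dimension 64, `Qwen2.5-0.5B` has a hidden size of 896, which is a multiple of the 32-element `Q4_0` block but not of the 256-element super-block used by the K-quants.

```python
# Elements per quantization block in ggml / llama.cpp.
QK4_0 = 32   # block size of Q4_0
QK_K = 256   # super-block size of the K-quants (e.g. Q4_K)

# Attention geometry of Qwen2.5-0.5B.
NUM_HEADS = 14
HEAD_DIM = 64
HIDDEN_SIZE = NUM_HEADS * HEAD_DIM  # 14 * 64 = 896


def fits(row_length: int, block_size: int) -> bool:
    """True if a tensor row splits into whole quantization blocks."""
    return row_length % block_size == 0


if __name__ == "__main__":
    print("hidden size:", HIDDEN_SIZE)
    print("whole Q4_0 blocks:", fits(HIDDEN_SIZE, QK4_0))
    print("whole K-quant super-blocks:", fits(HIDDEN_SIZE, QK_K))
```

Under this assumption, any tensor whose rows have 896 elements cannot be split into whole 256-element super-blocks, while `Q4_0` fits exactly.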