---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.6B` parameter draft (speculative decoding) model for use with [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5), [GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air) and [GLM-4-32B-0414](https://huggingface.co/zai-org/GLM-4-32B-0414).

See [GLM-4.5-DRAFT-0.6B-v3.0](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0) for the models in `transformers` format, and a detailed explanation of how the model was created.

---

I've included the `Q4_0` quants for 3 different context lengths:

- [GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf)
- [GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf)
- [GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf](https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF/resolve/main/GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf)

---

## NOTES:

- The 14 heads of `Qwen2.5-0.5B` don't allow for any of the other 4-bit quants to be made (and experimentation has shown that using more or fewer than 4 bits for speculative decoding is a waste of time anyway).
- Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only use the longer context versions when processing long contexts is required...
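
As a rough illustration of the last note, here is a minimal Python sketch that picks the smallest of the three files that still covers the longest context you expect to process. The helper name `pick_draft_file` is hypothetical, and the YaRN factor it reports rests on two assumptions of mine: that "32k", "64k" and "128k" mean 32768, 65536 and 131072 tokens, and that the factor baked into each file is that context length divided by a native context of 32768 tokens.

```python
# Hypothetical helper: pick the smallest draft GGUF whose context window
# covers the longest context you expect to process.

# File names come from the download list in this README. The token counts
# assume "32k", "64k" and "128k" mean 32768, 65536 and 131072 tokens.
DRAFT_FILES = {
    32768: "GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf",
    65536: "GLM-4.5-DRAFT-0.6B-64k-Q4_0.gguf",
    131072: "GLM-4.5-DRAFT-0.6B-128k-Q4_0.gguf",
}

# Assumption: the native context is 32768 tokens, so the static YaRN factor
# of each file would be its context length divided by this value.
NATIVE_CONTEXT = 32768


def pick_draft_file(required_context: int) -> tuple[str, float]:
    """Return (file name, assumed static YaRN factor) for the smallest fit."""
    if required_context <= 0:
        raise ValueError("required_context must be positive")
    # Walk the available context lengths from shortest to longest and stop
    # at the first one that is large enough.
    for context in sorted(DRAFT_FILES):
        if required_context <= context:
            return DRAFT_FILES[context], context / NATIVE_CONTEXT
    raise ValueError(f"no draft file covers {required_context} tokens")


if __name__ == "__main__":
    for tokens in (8_000, 50_000, 100_000):
        name, factor = pick_draft_file(tokens)
        print(f"{tokens:>7} tokens -> {name} (assumed YaRN factor {factor:g})")
```

The point of choosing the smallest fit is that, with a static factor, a short prompt run through the 128k file is still scaled as if it were a long one.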