This is an EOQ (Entropy-Optimal Quantization) Q5 compressed version of TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill.
The base model, GLM-4.7-Flash, is a 30B-parameter MoE model (Glm4MoeLite architecture / DeepSeek2) with a 262K native context. The TeichAI variant compressed here was distilled from Claude Opus 4.5 reasoning traces.
All benchmarks were run on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) and verified on Google Colab Pro G4.
| Metric | FP16 (Baseline) | EOQ Q5 Compressed |
|---|---|---|
| Size | 59.9 GB | 30.4 GB |
| Compression | 1.0x | 2.0x |
| PPL (WikiText-2) | 37.71 | 41.12 |
| PPL delta | -- | +3.41 |
| Throughput (tok/s) | 3.2 | 3.2 (no degradation) |
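
The derived rows in the table can be checked from the raw numbers. The snippet below is a minimal sketch that recomputes the compression ratio and the perplexity delta from the table values, and adds the relative perplexity increase, which the table does not list. The `perplexity` helper shows the standard definition (the exponential of the mean per-token negative log-likelihood). It is included for illustration only and is not the evaluation script used for these benchmarks.

```python
import math

# Values copied from the benchmark table above.
fp16_gb, eoq_gb = 59.9, 30.4
ppl_fp16, ppl_eoq = 37.71, 41.12

compression = fp16_gb / eoq_gb   # size ratio, about 1.97
ppl_delta = ppl_eoq - ppl_fp16   # absolute perplexity increase
ppl_rel = ppl_delta / ppl_fp16   # relative perplexity increase


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


print(f"compression: {compression:.2f}x")
print(f"PPL delta:   +{ppl_delta:.2f} ({ppl_rel:.1%})")
```

The relative increase works out to roughly 9% over the FP16 baseline.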
```python
import sys

from huggingface_hub import snapshot_download

REPO_ID = "caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed"

# Download the repository and put it on the import path so that the
# bundled eoq_loader.py can be imported.
local = snapshot_download(REPO_ID)
sys.path.insert(0, local)
from eoq_loader import load_eoq_model

# Load the compressed model and run a short generation.
model, tokenizer = load_eoq_model(REPO_ID)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
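
To reproduce a throughput figure like the tok/s row in the table, one option is to time a single generation call and divide the number of newly generated tokens by the elapsed time. The helper below is a rough sketch of that idea. `tokens_per_second` is a hypothetical name, and the source does not say how its throughput numbers were measured, so results may not match exactly.

```python
import time


def tokens_per_second(generate_fn, n_prompt_tokens):
    """Time one call to generate_fn and return new tokens per second.

    generate_fn is a hypothetical zero-argument callable that returns the
    total sequence length (prompt + generated tokens).
    """
    start = time.perf_counter()
    n_total = generate_fn()
    elapsed = time.perf_counter() - start
    return (n_total - n_prompt_tokens) / elapsed


# Assumed usage with the model and inputs from the snippet above:
#   n_prompt = inputs["input_ids"].shape[1]
#   tps = tokens_per_second(
#       lambda: model.generate(**inputs, max_new_tokens=100).shape[1],
#       n_prompt,
#   )
```

A single timed call includes warm-up overhead, so averaging over several runs gives a steadier number.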
```python
import json

import torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

local = snapshot_download("caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed")
# Load metadata and compressed weights, then dequantize
# See eoq_loader.py for full decompression logic
```
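
For readers unfamiliar with the quantize/dequantize round trip, here is a generic illustration using block-wise symmetric 5-bit quantization in plain Python. This is an assumption-laden sketch and not the EOQ format: the real on-disk layout and decompression logic are defined in `eoq_loader.py` and may differ substantially. The function names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(weights, bits=5):
    """Map floats to signed integer codes sharing one per-block scale."""
    qmax = 2 ** (bits - 1) - 1                # 15 for 5 bits
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes and the scale."""
    return [c * scale for c in codes]


block = [0.12, -0.40, 0.03, 0.31]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes, restored, max_err)
```

In this scheme the rounding error per weight is at most half the scale, which is why smaller blocks (with tighter scales) generally lose less precision.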
Base model: zai-org/GLM-4.7-Flash