GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill EOQ Q5 (Compressed)

This is an EOQ (Entropy-Optimal Quantization) Q5 compressed version of TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill.

GLM-4.7-Flash is a 30B-parameter MoE model (Glm4MoeLite architecture, DeepSeek2-based) with a 262K-token native context. This fine-tune was distilled from Claude Opus 4.5 reasoning traces.
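This card does not describe how EOQ works internally. As a hedged illustration of the general idea the name suggests, the sketch below measures the empirical Shannon entropy of 5-bit quantization codes. For bell-shaped weight distributions it is typically well below the 5 bits a fixed-width encoding spends per weight, and that gap is the headroom an entropy coder can exploit. The synthetic Gaussian weights, the single absmax scale, and the helper `code_entropy_bits` are assumptions for illustration, not EOQ's actual algorithm.

```python
import numpy as np

def code_entropy_bits(codes: np.ndarray) -> float:
    """Empirical Shannon entropy, in bits per symbol, of an integer code array."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic Gaussian "weights" standing in for a real weight matrix (assumption)
rng = np.random.default_rng(0)
w = rng.standard_normal(128 * 1024).astype(np.float32)

# Plain symmetric 5-bit quantization with one absmax scale (illustrative, not EOQ)
qmax = 2 ** (5 - 1) - 1  # 15, so codes lie in [-15, 15]
scale = np.abs(w).max() / qmax
codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)

h = code_entropy_bits(codes)
print(f"fixed-width: 5.00 bits/weight, empirical entropy: {h:.2f} bits/weight")
```

An ideal entropy coder would need roughly `h` bits per weight on average for these codes, versus 5 bits for fixed-width storage.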

Verified Benchmark Results

All benchmarks were run on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) and verified on Google Colab Pro G4.

Metric	FP16 (Baseline)	EOQ Q5 Compressed
Size	59.9 GB	30.4 GB
Compression	1.0x	2.0x
PPL (WikiText-2)	37.71	41.12
PPL delta	--	+3.41
Throughput (tok/s)	3.2	3.2 (no degradation)
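The derived rows of the table follow directly from the measured ones. The quick check below reproduces them and also expresses the perplexity increase as a relative change, which is easier to compare across models with different baseline PPL.

```python
# Measured figures from the benchmark table
fp16_gb, eoq_gb = 59.9, 30.4
ppl_fp16, ppl_eoq = 37.71, 41.12

compression = fp16_gb / eoq_gb        # ~1.97, reported as 2.0x
ppl_delta = ppl_eoq - ppl_fp16        # absolute perplexity increase
ppl_rel = ppl_delta / ppl_fp16 * 100  # relative increase in percent

print(f"compression {compression:.2f}x, PPL +{ppl_delta:.2f} ({ppl_rel:+.1f}%)")
```

This prints a compression of 1.97x and a perplexity increase of +3.41, or about +9.0% relative to the FP16 baseline.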

Bits: 5
Block size: 128
Architecture: Glm4MoeLite (MoE, DeepSeek2-based)
Note: The high baseline perplexity is expected. This model is tuned for chat/reasoning with a specific prompt template rather than raw text completion, so WikiText-2 is not an ideal benchmark for it.
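The card lists only two quantization parameters. A common reading of "Bits: 5" and "Block size: 128" is block-wise quantization: every run of 128 consecutive weights shares one scale, and each weight is stored as a 5-bit integer code. The sketch below implements plain symmetric absmax block quantization under that assumption; EOQ's actual quantizer and its entropy-coded storage may differ, and `quantize_blockwise` / `dequantize_blockwise` are hypothetical helpers. It also shows the fixed-width storage arithmetic: with one 16-bit scale per block, the cost is 5 + 16/128 = 5.125 bits per weight before any entropy coding.

```python
import numpy as np

BITS = 5                    # from the card: Bits: 5
BLOCK = 128                 # from the card: Block size: 128
QMAX = 2 ** (BITS - 1) - 1  # symmetric code range [-15, 15]

def quantize_blockwise(w: np.ndarray):
    """Symmetric absmax quantization with one FP16 scale per BLOCK weights.
    Assumes w.size is a multiple of BLOCK."""
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / QMAX
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    codes = np.clip(np.round(blocks / scales), -QMAX, QMAX).astype(np.int8)
    return codes, scales.astype(np.float16)

def dequantize_blockwise(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Multiply each block of codes by its scale and restore the original shape."""
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512)).astype(np.float32)
codes, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, scales, w.shape)

bits_per_weight = BITS + 16 / BLOCK  # 5-bit code plus amortized FP16 scale
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"{bits_per_weight:.3f} bits/weight, relative error {rel_err:.4f}")
```

Smaller blocks track local outliers better and lower the error, at the price of more scale overhead per weight; 128 is a common middle ground.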

Usage

Method 1: Using eoq_loader.py (recommended)

from huggingface_hub import snapshot_download
import sys

REPO = "caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed"

# Download the repo snapshot (it ships eoq_loader.py) and make the loader importable
local = snapshot_download(REPO)
sys.path.insert(0, local)
from eoq_loader import load_eoq_model

# Load the model and tokenizer through the bundled loader
model, tokenizer = load_eoq_model(REPO)

# Raw-text prompt; for chat use, format prompts with the model's chat template
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Method 2: Manual loading

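import glob, json, os
from safetensors import safe_open
from huggingface_hub import snapshot_download

local = snapshot_download("caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed")

# Inspect the metadata files and the first few compressed tensors in each shard
for path in sorted(glob.glob(os.path.join(local, "*.json"))):
    with open(path) as f:
        print(os.path.basename(path), list(json.load(f))[:5])
for path in sorted(glob.glob(os.path.join(local, "*.safetensors"))):
    with safe_open(path, framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_slice(name).get_shape())

# Dequantization follows the EOQ format; see eoq_loader.py for the full decompression logic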
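Because the on-disk format is documented only by eoq_loader.py, the following is a hypothetical sketch of the final dequantization step. It assumes each quantized weight is stored as int8 codes grouped into 128-weight blocks plus one FP16 scale per block; the tensor layout and the function names are assumptions, not the repository's actual format. The reconstructed matrix is then used like any dense weight, following the same convention as `torch.nn.functional.linear` (y = x Wᵀ + b).

```python
import numpy as np

BLOCK = 128  # from the card: Block size: 128

def dequantize(codes: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Rebuild a weight matrix from int8 codes of shape (n_blocks, BLOCK) and
    per-block scales of shape (n_blocks, 1). Hypothetical layout: blocks run
    along each row, so in_features must be a multiple of BLOCK."""
    w = codes.astype(np.float32) * scales.astype(np.float32)
    return w.reshape(shape)

def linear(x: np.ndarray, weight: np.ndarray, bias=None) -> np.ndarray:
    """Same convention as torch.nn.functional.linear: y = x @ W^T + b."""
    y = x @ weight.T
    return y if bias is None else y + bias

# Tiny synthetic tensors standing in for ones read from a .safetensors shard
out_features, in_features = 4, 256
codes = np.ones((out_features * in_features // BLOCK, BLOCK), dtype=np.int8)
scales = np.full((codes.shape[0], 1), 0.5, dtype=np.float16)

W = dequantize(codes, scales, (out_features, in_features))
y = linear(np.ones((1, in_features), dtype=np.float32), W)
print(y)  # each output is 256 * 0.5 = 128.0
```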

Model tree for caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed

Base model: zai-org/GLM-4.7-Flash
Finetuned: unsloth/GLM-4.7-Flash
Finetuned: TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill
Finetuned (4): this model

Collection including caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed

Large Models (27B-35B) HLWQ: HLWQ + EOQ quantized large models · Claude Opus distilled + MoE variants (5 items, updated Apr 13)
