GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill EOQ Q5 (Compressed)

EOQ (Entropy-Optimal Quantization) Q5 compressed version of TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill.

GLM-4.7-Flash is a 30B parameter MoE model (Glm4MoeLite architecture / DeepSeek2), distilled from Claude Opus 4.5 reasoning traces. 262K native context.

Verified Benchmark Results

All benchmarks on NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), verified on Google Colab Pro G4.

Metric FP16 (Baseline) EOQ Q5 Compressed
Size 59.9 GB 30.4 GB
Compression 1.0x 2.0x
PPL (WikiText-2) 37.71 41.12
PPL delta -- +3.41
Throughput (tok/s) 3.2 3.2 (no degradation)
  • Bits: 5
  • Block size: 128
  • Architecture: Glm4MoeLite (MoE, DeepSeek2-based)
  • Note: High base PPL is expected -- this model is optimized for chat/reasoning with a specific template, not raw text completion. WikiText-2 is not the ideal benchmark for this model type.

Usage

Method 1: Using eoq_loader.py (recommended)

from huggingface_hub import snapshot_download
import sys
local = snapshot_download("caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed")
sys.path.insert(0, local)
from eoq_loader import load_eoq_model
model, tokenizer = load_eoq_model("caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Method 2: Manual loading

import torch, json, torch.nn.functional as F
from safetensors.torch import load_file
from huggingface_hub import snapshot_download

local = snapshot_download("caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed")

# Load metadata and compressed weights, then dequantize
# See eoq_loader.py for full decompression logic

Links

Downloads last month
3
Safetensors
Model size
30B params
Tensor type
F16
·
I8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed

Collection including caiovicentino1/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-EOQ-Q5-compressed