Libraries

How to use inference-optimization/GLM-5.2-0.8B-A0.8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="inference-optimization/GLM-5.2-0.8B-A0.8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("inference-optimization/GLM-5.2-0.8B-A0.8B")
model = AutoModelForCausalLM.from_pretrained("inference-optimization/GLM-5.2-0.8B-A0.8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inference-optimization/GLM-5.2-0.8B-A0.8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inference-optimization/GLM-5.2-0.8B-A0.8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/GLM-5.2-0.8B-A0.8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
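The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style chat completion body. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "inference-optimization/GLM-5.2-0.8B-A0.8B"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response with choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(send_chat_request(build_chat_request("What is the capital of France?")))
```

Passing `base_url="http://localhost:30000"` would target the port used by the SGLang examples in this document.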

Use Docker

docker model run hf.co/inference-optimization/GLM-5.2-0.8B-A0.8B

SGLang

How to use inference-optimization/GLM-5.2-0.8B-A0.8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inference-optimization/GLM-5.2-0.8B-A0.8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/GLM-5.2-0.8B-A0.8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inference-optimization/GLM-5.2-0.8B-A0.8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/GLM-5.2-0.8B-A0.8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

GLM-5.2-0.8B-A0.8B

This is a tiny version of zai-org/GLM-5.2 created for testing and development.

Model Details

Base Model: zai-org/GLM-5.2
Architecture: glm_moe_dsa (GLM MoE with DeepSeek Sparse Attention)
Total Parameters: 0.85B
Activated Parameters: ~0.77B

Configuration Changes

The following parameters were reduced from the original model:

Parameter	Original	Tiny
num_hidden_layers	78	6
hidden_size	6144	2048
intermediate_size	12288	4096
num_attention_heads	64	16
num_key_value_heads	64	16
n_routed_experts	256	8
num_experts_per_tok	8	2
moe_intermediate_size	2048	512
kv_lora_rank	512	128
q_lora_rank	2048	512
v_head_dim	256	128
index_n_heads	32	8
index_head_dim	128	64
first_k_dense_replace	3	2
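To make the scale of these reductions easier to see, the table can be expressed as data. This is only an illustrative sketch: the values are copied from the table, while the helper names `shrink_factors` and `tiny_overrides` are hypothetical, and how the author actually applied the overrides is not stated.

```python
# (original, tiny) pairs copied from the Configuration Changes table.
CONFIG_CHANGES = {
    "num_hidden_layers": (78, 6),
    "hidden_size": (6144, 2048),
    "intermediate_size": (12288, 4096),
    "num_attention_heads": (64, 16),
    "num_key_value_heads": (64, 16),
    "n_routed_experts": (256, 8),
    "num_experts_per_tok": (8, 2),
    "moe_intermediate_size": (2048, 512),
    "kv_lora_rank": (512, 128),
    "q_lora_rank": (2048, 512),
    "v_head_dim": (256, 128),
    "index_n_heads": (32, 8),
    "index_head_dim": (128, 64),
    "first_k_dense_replace": (3, 2),
}


def shrink_factors(changes):
    """Return how many times smaller each tiny value is than the original."""
    return {name: original / tiny for name, (original, tiny) in changes.items()}


def tiny_overrides(changes):
    """Return only the tiny values, e.g. as a set of config overrides."""
    return {name: tiny for name, (_, tiny) in changes.items()}


for name, factor in shrink_factors(CONFIG_CHANGES).items():
    print(f"{name:24s} {factor:g}x smaller")
```

Depth and expert count shrink the most (13x and 32x), while most width-related dimensions shrink by 2x to 4x.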

Checkpoint Structure

The checkpoint is a single safetensors file containing 194 float32 tensors. Layers 0-1 use a dense MLP and layers 2-5 use an MoE MLP, consistent with first_k_dense_replace = 2. Layers 0-2 carry full DSA indexer weights, and layers 3-5 use a shared indexer.
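The dense/MoE split can be sketched in a few lines. This assumes that `first_k_dense_replace` means "the first k layers use a dense MLP", which matches the tiny config values in the table (6 layers, k = 2); the function names are illustrative. The indexer full/shared pattern is not modeled here because the rule that produces it is not given.

```python
def mlp_kind(layer_idx, first_k_dense_replace=2):
    """Return 'dense' for the first k layers and 'moe' for the rest."""
    return "dense" if layer_idx < first_k_dense_replace else "moe"


def layer_layout(num_hidden_layers=6, first_k_dense_replace=2):
    """List the MLP kind of every layer, in order."""
    return [mlp_kind(i, first_k_dense_replace) for i in range(num_hidden_layers)]


print(layer_layout())
# ['dense', 'dense', 'moe', 'moe', 'moe', 'moe']
```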

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inference-optimization/GLM-5.2-0.8B-A0.8B"

# device_map="auto" relies on the accelerate package to place the weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the prompt and move the input IDs to the model's device.
input_ids = tokenizer("According to all known laws", return_tensors="pt").input_ids.to(model.device)

# Generate up to 20 new tokens and print the prompt plus its continuation.
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))

Creation Process

Inspected the original GLM-5.2 config (78 layers, 256 experts, hidden_size=6144)
Reduced all dimensions to target ~1B parameters while preserving the architecture
Created the model in float32 for training stability
Fine-tuned on a copypasta dataset to a perplexity of ~1.0
Validated that the checkpoint structure matches the original model's naming conventions
Validated that the model loads, runs inference, and generates correctly

Validation Output

Success: 1.0000379085540771 <= 10.0
Generating sample text:
According to all known laws of aviation, there is no way a bee should be able to fly.
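The first line of this output reads as a perplexity check against a threshold of 10.0. The validation script itself is not shown, so the following is only a sketch of how such a check could work, using the standard definition of perplexity as the exponential of the mean negative log-likelihood per token. The function names, the message format, and the "Failure" label are assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def check_perplexity(ppl, threshold=10.0):
    """Format a pass/fail line similar to the one in the output above."""
    if ppl <= threshold:
        return f"Success: {ppl} <= {threshold}"
    return f"Failure: {ppl} > {threshold}"


# A model that assigns probability ~1 to every target token has perplexity ~1.
print(check_perplexity(perplexity([0.0, 0.0, 0.0])))
```

A perplexity this close to 1.0 means the model predicts the fine-tuning text almost deterministically, which is what one would expect from a model fitted to a very small dataset.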

Notes

The model uses float32 (the original uses bfloat16) to ensure proper initialization and training of the tiny model.
The architecture preserves both dense and sparse MLP layer types, MLA attention with compressed Q/KV, and the DSA indexer with its full/shared patterns.
The model has been fine-tuned on a toy dataset and is intended for testing purposes only.
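As a rough illustration of what the float32 choice costs, the weight footprint can be estimated from the parameter count. This sketch assumes the rounded total of 0.85B parameters stated under Model Details and counts only raw weight bytes (no file metadata or runtime overhead), so the figures are approximate.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

TOTAL_PARAMS = 850_000_000  # rounded figure from Model Details


def weight_bytes(num_params, dtype):
    """Estimate raw weight storage in bytes for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in ("float32", "bfloat16"):
    gigabytes = weight_bytes(TOTAL_PARAMS, dtype) / 1e9
    print(f"{dtype}: ~{gigabytes:.1f} GB")
```

Under these assumptions the float32 checkpoint is about 3.4 GB, roughly twice what the same weights would occupy in bfloat16.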
