
Libraries

How to use 0G-AI/0GM-1.0-35B-A3B-0427 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="0G-AI/0GM-1.0-35B-A3B-0427")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("0G-AI/0GM-1.0-35B-A3B-0427")
model = AutoModelForMultimodalLM.from_pretrained("0G-AI/0GM-1.0-35B-A3B-0427")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use 0G-AI/0GM-1.0-35B-A3B-0427 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0G-AI/0GM-1.0-35B-A3B-0427"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0G-AI/0GM-1.0-35B-A3B-0427",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/0G-AI/0GM-1.0-35B-A3B-0427

SGLang

How to use 0G-AI/0GM-1.0-35B-A3B-0427 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0G-AI/0GM-1.0-35B-A3B-0427" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0G-AI/0GM-1.0-35B-A3B-0427",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0G-AI/0GM-1.0-35B-A3B-0427" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use 0G-AI/0GM-1.0-35B-A3B-0427 with Docker Model Runner:
```
docker model run hf.co/0G-AI/0GM-1.0-35B-A3B-0427
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

0GM-1.0-35B-A3B (Preview-0427)

0GM-1.0-35B-A3B (Preview-0427) is the first fine-tuned MoE model built on Qwen 3.6 35B-A3B, a 35B-parameter mixture-of-experts model with roughly 3B active parameters per token. It was trained on 0G Compute's decentralized network and evaluated head-to-head against its Qwen 3.6 base model, the earlier Qwen 3.5 35B, and the larger Covenant-72B baseline.

Benchmark Results

Benchmark	Covenant-72B	Qwen 3.5 35B	Qwen 3.6 35B	0GM-1.0-35B
MMLU-Pro (4k context)	41.84%	61.43%	75.75%	77.62%
AIME 2026 (Pass@1)	0.00%	76.67%	70.00%	83.33%
GSM-8K	70.81%	80.06%	96.74%	96.82%
MATH-500	51.80%	91.20%	95.60%	95.80%
Token Length (MATH-500, relative to Qwen 3.5 35B)	6.22%	100.00% (6,804 tokens)	96.60%	94.91%

0GM-1.0-35B-A3B leads on every accuracy benchmark, with the largest absolute gain on AIME 2026 (+13.33 points over Qwen 3.6 35B, +6.66 over Qwen 3.5 35B); the margins over Qwen 3.6 35B on GSM-8K (+0.08) and MATH-500 (+0.20) are small. Token usage on MATH-500 drops slightly versus the Qwen 3.6 35B baseline (94.91% vs. 96.60% relative token length) while accuracy goes up, i.e., shorter chains with higher correctness. Covenant-72B's very short outputs (6.22% relative token length) and 0.00% AIME score reflect its non-reasoning-tuned baseline rather than a capability ceiling at that parameter count.
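
The gains quoted above can be re-derived from the table. The following is a minimal sketch with the table values copied by hand. The absolute token counts assume that the relative token lengths are percentages of the 6,804 figure shown for Qwen 3.5 35B, which is a reading of the table rather than something it states explicitly.

```python
# Accuracy scores (%) copied by hand from the Benchmark Results table.
scores = {
    "MMLU-Pro (4k context)": {"Qwen 3.5 35B": 61.43, "Qwen 3.6 35B": 75.75, "0GM-1.0-35B": 77.62},
    "AIME 2026 (Pass@1)": {"Qwen 3.5 35B": 76.67, "Qwen 3.6 35B": 70.00, "0GM-1.0-35B": 83.33},
    "GSM-8K": {"Qwen 3.5 35B": 80.06, "Qwen 3.6 35B": 96.74, "0GM-1.0-35B": 96.82},
    "MATH-500": {"Qwen 3.5 35B": 91.20, "Qwen 3.6 35B": 95.60, "0GM-1.0-35B": 95.80},
}


def gains(baseline: str) -> dict:
    """Absolute gain (percentage points) of 0GM-1.0-35B over a baseline column."""
    return {
        bench: round(row["0GM-1.0-35B"] - row[baseline], 2)
        for bench, row in scores.items()
    }


gains_vs_36 = gains("Qwen 3.6 35B")
gains_vs_35 = gains("Qwen 3.5 35B")
print(gains_vs_36)  # AIME 2026 shows the largest gain: 13.33 points
print(gains_vs_35)  # AIME 2026 gain over Qwen 3.5 35B: 6.66 points

# ASSUMPTION: relative token lengths are percentages of the 6,804-token
# Qwen 3.5 35B figure in the table.
REFERENCE_TOKENS = 6804
relative_length = {"Qwen 3.6 35B": 96.60, "0GM-1.0-35B": 94.91}
approx_tokens = {
    name: round(REFERENCE_TOKENS * pct / 100) for name, pct in relative_length.items()
}
print(approx_tokens)  # roughly 6573 and 6458 tokens under that assumption
```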

MMLU-Pro Subject Breakdown

0GM-1.0-35B-A3B outperforms Qwen 3.6 35B-A3B on 11 out of 14 MMLU-Pro subjects, with the largest gains in physics (+5.1), philosophy (+4.8), engineering (+4.4), and business (+3.1). It ties on biology and trails slightly on health (-0.5) and law (-1.5). Under a constrained 4k context budget, these improvements suggest stronger reasoning capability while using fewer tokens.

Subject	Covenant-72B	Qwen 3.5 35B	Qwen 3.6 35B	0GM-1.0-35B
Math	41.6%	71.4%	82.4%	83.9%
Physics	36.8%	61.9%	74.2%	79.3%
Biology	69.5%	79.1%	91.1%	91.1%
Economics	53.7%	75.5%	87.2%	87.8%
Chemistry	31.1%	55.3%	72.1%	74.6%
Business	42.3%	72.0%	78.8%	81.9%
Psychology	57.4%	74.8%	84.5%	85.0%
Computer Science	43.2%	69.5%	80.7%	83.2%
Health	48.3%	62.8%	79.5%	79.0%
Other	44.5%	68.2%	77.7%	79.0%
Philosophy	42.7%	62.5%	80.2%	85.0%
History	40.9%	58.0%	74.8%	77.2%
Engineering	25.0%	33.3%	42.0%	46.4%
Law	27.6%	31.1%	67.5%	66.0%
Avg	43.19%	62.53%	75.7%	77.4%
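
The per-subject comparison against Qwen 3.6 35B can be checked with a short script. This is a minimal sketch that copies the two relevant columns from the table above and counts a tie separately from a win.

```python
# (Qwen 3.6 35B, 0GM-1.0-35B) scores in %, copied from the subject table.
subjects = {
    "Math": (82.4, 83.9),
    "Physics": (74.2, 79.3),
    "Biology": (91.1, 91.1),
    "Economics": (87.2, 87.8),
    "Chemistry": (72.1, 74.6),
    "Business": (78.8, 81.9),
    "Psychology": (84.5, 85.0),
    "Computer Science": (80.7, 83.2),
    "Health": (79.5, 79.0),
    "Other": (77.7, 79.0),
    "Philosophy": (80.2, 85.0),
    "History": (74.8, 77.2),
    "Engineering": (42.0, 46.4),
    "Law": (67.5, 66.0),
}

# Difference in percentage points, rounded to the table's precision.
deltas = {name: round(new - base, 1) for name, (base, new) in subjects.items()}

wins = [name for name, d in deltas.items() if d > 0]
ties = [name for name, d in deltas.items() if d == 0]
losses = [name for name, d in deltas.items() if d < 0]
top4 = sorted(deltas, key=deltas.get, reverse=True)[:4]

print(len(wins), ties, losses)  # 11 ['Biology'] ['Health', 'Law']
print([(name, deltas[name]) for name in top4])
# [('Physics', 5.1), ('Philosophy', 4.8), ('Engineering', 4.4), ('Business', 3.1)]
```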

Model Overview

Type: Causal Language Model with Vision Encoder
Training Stage: Pre-training & Post-training
Language Model
- Number of Parameters: 35B in total and 3B activated
- Hidden Dimension: 2048
- Token Embedding: 248320 (Padded)
- Number of Layers: 40
- Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Mixture of Experts:
  - Number of Experts: 256
  - Number of Activated Experts: 8 Routed + 1 Shared
  - Expert Intermediate Dimension: 512
- LM Output: 248320 (Padded)
- MTP: trained with multi-steps
Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
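
To make the hidden layout concrete, the sketch below expands the `10 × (3 × Gated DeltaNet + 1 × Gated Attention)` pattern into a per-layer list and checks it against the stated layer count. The layer names and the `build_layout` helper are illustrative only and do not correspond to identifiers in the model code.

```python
def build_layout(groups: int = 10, deltanet_per_group: int = 3,
                 attention_per_group: int = 1) -> list:
    """Expand the repeating block pattern into one token-mixer name per layer.

    Every layer is followed by an MoE feed-forward block, so only the
    token mixer (linear vs. full attention) differs between layers.
    """
    layout = []
    for _ in range(groups):
        layout += ["gated_deltanet"] * deltanet_per_group
        layout += ["gated_attention"] * attention_per_group
    return layout


layout = build_layout()
attention_layers = [i for i, kind in enumerate(layout) if kind == "gated_attention"]

print(len(layout))                     # 40, matching "Number of Layers: 40"
print(layout.count("gated_deltanet"))  # 30
print(attention_layers)                # [3, 7, 11, ..., 39] (0-indexed)

# Experts consulted per token in each MoE block: 8 routed + 1 shared.
active_experts = 8 + 1
print(active_experts)                  # 9 (8 routed out of 256, plus the shared expert)
```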

Quickstart

For streamlined integration, we recommend using APIs. Below is a guide to using an OpenAI-compatible API.

Serving

0GM can be served via APIs with popular inference frameworks. The example commands below launch OpenAI-compatible API servers.

Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.

The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because this model leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.
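
As a rough guide to how much memory the context window costs, the sketch below estimates the per-sequence KV cache from the architecture figures in the Model Overview. It rests on several assumptions: only the 10 Gated Attention layers keep a per-token KV cache (the Gated DeltaNet layers are assumed to hold a fixed-size state that does not grow with context), the cache is stored in BF16 (2 bytes per value), and "128K" means 131,072 tokens. Weights, activations, and framework overhead are ignored, so real memory use will be higher.

```python
# Architecture figures taken from the Model Overview.
ATTENTION_LAYERS = 10   # 1 Gated Attention layer per group, 10 groups
KV_HEADS = 2            # "2 for KV"
HEAD_DIM = 256          # Gated Attention head dimension
BYTES_PER_VALUE = 2     # ASSUMPTION: BF16 KV cache


def kv_cache_bytes(context_tokens: int) -> int:
    """Estimated KV cache size for one sequence, attention layers only."""
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return context_tokens * ATTENTION_LAYERS * per_token_per_layer


for tokens in (262_144, 131_072):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB per sequence")
# 262144 tokens -> 5.0 GiB per sequence
# 131072 tokens -> 2.5 GiB per sequence
```

Under these assumptions, halving the context length halves the per-sequence cache, which is why reducing the context window is the first remedy for OOM errors.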

SGLang

SGLang is a fast serving framework for large language models and vision language models. We recommend sglang>=0.5.10, which can be installed in a fresh environment with the following command:

uv pip install "sglang[all]>=0.5.10"

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

Standard Version: The following command creates an API endpoint with a maximum context length of 262,144 tokens using tensor parallelism across 8 GPUs.

python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3

Tool Use: To support tool use, you can use the following command.

python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder

Multi-Token Prediction (MTP): The following command is recommended for MTP:

python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

KTransformers

KTransformers is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing. For running with KTransformers, see the KTransformers Deployment Guide.

Hugging Face Transformers

Hugging Face Transformers contains a lightweight server which can be used for quick testing and moderate-load deployment. The latest version of transformers is required:

pip install "transformers[serving]"

See its documentation for more details. Please also make sure torchvision and pillow are installed.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --continuous-batching

Using 0GM-1.0-35B-A3B-0427 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure the SDK is installed and the API key and API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

We recommend using the following set of sampling parameters for generation:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.
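
The recommended settings can be turned into Chat Completions requests as follows. This is a minimal sketch, not an official client: the preset names and the `build_request` and `send` helpers are hypothetical, and it assumes the server accepts `top_k`, `min_p`, and `repetition_penalty` as extra JSON fields passed through the SDK's `extra_body` argument, which depends on the inference framework. Selecting the `instruct` preset only changes the sampling parameters; it does not by itself switch off thinking.

```python
# Sampling presets copied from the recommendations above (names are hypothetical).
SAMPLING_PRESETS = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                             presence_penalty=1.5, repetition_penalty=1.0),
    "thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                            presence_penalty=0.0, repetition_penalty=1.0),
    "instruct": dict(temperature=0.7, top_p=0.80, top_k=20, min_p=0.0,
                     presence_penalty=1.5, repetition_penalty=1.0),
}

# Parameters that are part of the standard Chat Completions schema.
STANDARD_KEYS = ("temperature", "top_p", "presence_penalty")


def build_request(prompt: str, preset: str = "thinking_general",
                  model: str = "0G-AI/0GM-1.0-35B-A3B-0427") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = SAMPLING_PRESETS[preset]
    request = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    request.update({key: params[key] for key in STANDARD_KEYS})
    # ASSUMPTION: non-standard sampling parameters are accepted as extra body fields.
    request["extra_body"] = {k: v for k, v in params.items() if k not in STANDARD_KEYS}
    return request


def send(request: dict) -> str:
    """Send the request to the server configured via OPENAI_BASE_URL / OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so build_request works without the SDK

    client = OpenAI()
    response = client.chat.completions.create(**request)
    return response.choices[0].message.content


# Example (requires a running server):
# print(send(build_request("Write a haiku about mixture-of-experts models.")))
```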

0GM-1.0-35B-A3B-0427 operates in thinking mode by default, generating thinking content delimited by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples here.
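
When the raw completion text is returned without a server-side reasoning parser, the thinking content can be separated from the final answer on the client. The helper below is a minimal sketch based only on the tag format described above; `split_thinking` is a hypothetical name, and it also tolerates outputs where the opening tag is missing, on the assumption that some chat templates place it in the prompt.

```python
def split_thinking(text: str) -> tuple:
    """Split raw model output into (thinking, answer).

    Returns an empty thinking string when no closing </think> tag is present.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No thinking block found: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag if the model emitted it.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


raw = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
thinking, answer = split_thinking(raw)
print(thinking)  # 2 + 2 = 4
print(answer)    # The answer is 4.
```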

Downloads last month: 15

Safetensors
Model size: 36B params
Tensor type: BF16

Model tree for 0G-AI/0GM-1.0-35B-A3B-0427
Base model: Qwen/Qwen3.6-35B-A3B
Finetuned (156): this model
Quantizations: 2 models