0GM-1.0-35B-A3B (Preview-0427)
0GM-1.0-35B-A3B (Preview-0427) is the first MoE fine-tuned model built on Qwen 3.6 35B-A3B, a 35B-parameter mixture-of-experts model with roughly 3B active parameters per token. It was trained on 0G Compute's decentralized network and evaluated head-to-head against prior Qwen models and the larger Covenant-72B baseline.
Benchmark Results
| Benchmark | Covenant-72B | Qwen 3.5 35B | Qwen 3.6 35B | 0GM-1.0-35B |
|---|---|---|---|---|
| MMLU-Pro (4k context) | 41.84% | 61.43% | 75.75% | 77.62% |
| AIME 2026 (Pass@1) | 0.00% | 76.67% | 70.00% | 83.33% |
| GSM-8K | 70.81% | 80.06% | 96.74% | 96.82% |
| MATH-500 | 51.80% | 91.20% | 95.60% | 95.80% |
| Token Length (MATH-500, relative to Qwen 3.5 35B) | 6.22% | 100.00% (6,804 tokens) | 96.60% | 94.91% |
0GM-1.0-35B-A3B leads on every accuracy benchmark, with the largest absolute gain on AIME 2026 (+13.33 points over Qwen 3.6 35B, +6.66 over Qwen 3.5 35B). Token usage on MATH-500 also drops slightly, to 94.91% of the Qwen 3.5 35B reference length versus 96.60% for Qwen 3.6 35B, while accuracy goes up: shorter reasoning chains with higher correctness. Covenant-72B's very short outputs (6.22% of the reference length) and 0.00% AIME score reflect a baseline that is not reasoning-tuned rather than a capability ceiling at that parameter count.
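The point gains quoted above follow directly from the table. The short sketch below recomputes them and converts the relative MATH-500 token lengths into approximate absolute counts. It assumes the 6,804 figure in the table is the reference token count for Qwen 3.5 35B; the variable and function names are illustrative only.

```python
# Scores copied from the benchmark table above (percent).
scores = {
    "MMLU-Pro":  {"Qwen 3.5 35B": 61.43, "Qwen 3.6 35B": 75.75, "0GM-1.0-35B": 77.62},
    "AIME 2026": {"Qwen 3.5 35B": 76.67, "Qwen 3.6 35B": 70.00, "0GM-1.0-35B": 83.33},
    "GSM-8K":    {"Qwen 3.5 35B": 80.06, "Qwen 3.6 35B": 96.74, "0GM-1.0-35B": 96.82},
    "MATH-500":  {"Qwen 3.5 35B": 91.20, "Qwen 3.6 35B": 95.60, "0GM-1.0-35B": 95.80},
}


def gain(benchmark, baseline):
    """Absolute gain of 0GM-1.0-35B over a baseline, in percentage points."""
    row = scores[benchmark]
    return round(row["0GM-1.0-35B"] - row[baseline], 2)


for name in scores:
    print(f"{name}: +{gain(name, 'Qwen 3.6 35B')} vs Qwen 3.6 35B, "
          f"+{gain(name, 'Qwen 3.5 35B')} vs Qwen 3.5 35B")

# Assumption: 6,804 is the reference token count (the 100% row for Qwen 3.5 35B).
BASELINE_TOKENS = 6804
relative_length = {"Qwen 3.6 35B": 96.60, "0GM-1.0-35B": 94.91}
approx_tokens = {
    model: round(BASELINE_TOKENS * pct / 100)
    for model, pct in relative_length.items()
}
print(approx_tokens)
```

Under that assumption, the relative lengths correspond to roughly 6,573 tokens for Qwen 3.6 35B and 6,458 tokens for 0GM-1.0-35B.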
MMLU-Pro Subject Breakdown
0GM-1.0-35B-A3B outperforms Qwen 3.6 35B-A3B on 11 of 14 MMLU-Pro subjects, ties on biology, and trails slightly on health (-0.5) and law (-1.5). The largest gains are in physics (+5.1), philosophy (+4.8), engineering (+4.4), and business (+3.1). Under a constrained 4k context budget, these improvements suggest stronger reasoning capability while using fewer tokens.
| Subject | Covenant-72B | Qwen 3.5 35B | Qwen 3.6 35B | 0GM-1.0-35B |
|---|---|---|---|---|
| Math | 41.6% | 71.4% | 82.4% | 83.9% |
| Physics | 36.8% | 61.9% | 74.2% | 79.3% |
| Biology | 69.5% | 79.1% | 91.1% | 91.1% |
| Economics | 53.7% | 75.5% | 87.2% | 87.8% |
| Chemistry | 31.1% | 55.3% | 72.1% | 74.6% |
| Business | 42.3% | 72.0% | 78.8% | 81.9% |
| Psychology | 57.4% | 74.8% | 84.5% | 85.0% |
| Computer Science | 43.2% | 69.5% | 80.7% | 83.2% |
| Health | 48.3% | 62.8% | 79.5% | 79.0% |
| Other | 44.5% | 68.2% | 77.7% | 79.0% |
| Philosophy | 42.7% | 62.5% | 80.2% | 85.0% |
| History | 40.9% | 58.0% | 74.8% | 77.2% |
| Engineering | 25.0% | 33.3% | 42.0% | 46.4% |
| Law | 27.6% | 31.1% | 67.5% | 66.0% |
| Avg | 43.19% | 62.53% | 75.7% | 77.4% |
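The per-subject comparison against Qwen 3.6 35B can be checked from the table above. The following sketch tallies wins, ties, and losses and ranks the subjects by gain; it uses only the two rightmost columns, and the names are illustrative.

```python
# (Qwen 3.6 35B, 0GM-1.0-35B) accuracy per subject, copied from the table above.
subjects = {
    "Math": (82.4, 83.9),
    "Physics": (74.2, 79.3),
    "Biology": (91.1, 91.1),
    "Economics": (87.2, 87.8),
    "Chemistry": (72.1, 74.6),
    "Business": (78.8, 81.9),
    "Psychology": (84.5, 85.0),
    "Computer Science": (80.7, 83.2),
    "Health": (79.5, 79.0),
    "Other": (77.7, 79.0),
    "Philosophy": (80.2, 85.0),
    "History": (74.8, 77.2),
    "Engineering": (42.0, 46.4),
    "Law": (67.5, 66.0),
}

# Difference in percentage points, rounded to the table's precision.
deltas = {name: round(ours - base, 1) for name, (base, ours) in subjects.items()}

wins = [name for name, d in deltas.items() if d > 0]
ties = [name for name, d in deltas.items() if d == 0]
losses = [name for name, d in deltas.items() if d < 0]

# Subjects sorted from largest gain to largest loss.
ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)

print(f"wins={len(wins)} ties={ties} losses={losses}")
for name, d in ranked[:4]:
    print(f"{name}: {d:+.1f}")
```

With the table values, this reports 11 wins, one tie (Biology), and two losses (Health, Law), with Physics, Philosophy, Engineering, and Business as the four largest gains.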
Model Overview
- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model
    - Number of Parameters: 35B in total and 3B activated
    - Hidden Dimension: 2048
    - Token Embedding: 248320 (Padded)
    - Number of Layers: 40
    - Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
    - Gated DeltaNet:
        - Number of Linear Attention Heads: 32 for V and 16 for QK
        - Head Dimension: 128
    - Gated Attention:
        - Number of Attention Heads: 16 for Q and 2 for KV
        - Head Dimension: 256
        - Rotary Position Embedding Dimension: 64
    - Mixture Of Experts
        - Number of Experts: 256
        - Number of Activated Experts: 8 Routed + 1 Shared
        - Expert Intermediate Dimension: 512
    - LM Output: 248320 (Padded)
    - MTP: trained with multi-steps
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
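As a rough illustration of what these numbers imply, the sketch below checks that the hidden layout adds up to 40 layers and estimates the per-sequence KV cache at different context lengths. It rests on two assumptions that the overview does not state: only the Gated Attention layers keep a cache that grows with context (the Gated DeltaNet layers are treated as holding a fixed-size state), and cache entries are stored in 2-byte bf16. Weights, activations, and the vision encoder are ignored, so this is a lower bound on memory use, not a sizing guide.

```python
# Values taken from the Model Overview list above.
BLOCKS = 10               # outer repeat count in the hidden layout
DELTANET_PER_BLOCK = 3    # Gated DeltaNet layers per block
ATTENTION_PER_BLOCK = 1   # Gated Attention layers per block
KV_HEADS = 2              # "2 for KV"
ATTN_HEAD_DIM = 256       # Gated Attention head dimension
NATIVE_CONTEXT = 262_144

# Assumption (not stated in the source): cache values are stored in bf16.
BYTES_PER_VALUE = 2

total_layers = BLOCKS * (DELTANET_PER_BLOCK + ATTENTION_PER_BLOCK)
attention_layers = BLOCKS * ATTENTION_PER_BLOCK


def kv_cache_bytes(context_tokens):
    """Estimated KV cache size for one sequence, attention layers only."""
    # Factor 2 covers the separate key and value tensors.
    per_token = attention_layers * 2 * KV_HEADS * ATTN_HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens


print(f"layers: {total_layers} ({attention_layers} with a growing KV cache)")
for ctx in (NATIVE_CONTEXT, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 2**30:.1f} GiB per sequence")
```

Under these assumptions the cache costs 20,480 bytes per token, or about 5 GiB per sequence at the native 262,144-token context and 2.5 GiB at 131,072 tokens.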
Quickstart
For streamlined integration, we recommend using APIs. Below is a guide to using an OpenAI-compatible API.
Serving
0GM can be served via APIs with popular inference frameworks. In the following, we show example commands to launch OpenAI-Compatible API servers.
Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.
The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because this model leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.
SGLang
SGLang is a fast serving framework for large language models and vision language models. sglang>=0.5.10 is recommended, which can be installed using the following command in a fresh environment:
uv pip install sglang[all]
See its documentation for more details.
The following will create API endpoints at http://localhost:8000/v1:
Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.
python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
Tool Use: To support tool use, you can use the following command.
python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
Multi-Token Prediction (MTP): The following command is recommended for MTP:
python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
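The three SGLang commands above share a common base and differ only in their trailing flags. The sketch below assembles them from options, which can help when lowering the context length after an OOM error. `build_sglang_command` is a hypothetical helper, not part of SGLang, and it emits only flags that appear in the commands above. Whether the tool-use and MTP flags can be combined in one launch is not stated here, so treat that combination as untested.

```python
import shlex

MODEL = "0G-AI/0GM-1.0-35B-A3B-0427"


def build_sglang_command(tool_use=False, mtp=False, context_length=262144,
                         tp_size=8, port=8000):
    """Hypothetical helper: build the sglang.launch_server argument list."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL,
        "--port", str(port),
        "--tp-size", str(tp_size),
        "--mem-fraction-static", "0.8",
        "--context-length", str(context_length),
        "--reasoning-parser", "qwen3",
    ]
    if tool_use:
        # Flag from the Tool Use command above.
        cmd += ["--tool-call-parser", "qwen3_coder"]
    if mtp:
        # Flags from the Multi-Token Prediction command above.
        cmd += [
            "--speculative-algo", "NEXTN",
            "--speculative-num-steps", "3",
            "--speculative-eagle-topk", "1",
            "--speculative-num-draft-tokens", "4",
        ]
    return cmd


# Standard version, identical to the first command above.
print(shlex.join(build_sglang_command()))
# Tool use with the context window reduced to 128K tokens.
print(shlex.join(build_sglang_command(tool_use=True, context_length=131072)))
```

The returned list can be passed to `subprocess.run` as is, or printed with `shlex.join` and pasted into a shell.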
KTransformers
KTransformers is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing. For running with KTransformers, see the KTransformers Deployment Guide.
Hugging Face Transformers
Hugging Face Transformers contains a lightweight server which can be used for quick testing and moderate load deployment. The latest transformers is required:
pip install "transformers[serving]"
See its documentation for more details. Please also make sure torchvision and pillow are installed.
Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:
transformers serve 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --continuous-batching
Using 0GM-1.0-35B-A3B-0427 via the Chat Completions API
The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.
Before starting, make sure the SDK is installed and that the API key and the API base URL are configured, e.g.:
pip install -U openai
# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
We recommend using the following set of sampling parameters for generation:
- Thinking mode for general tasks:
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- Thinking mode for precise coding tasks (e.g. WebDev):
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
- Instruct (or non-thinking) mode:
temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Please note that the support for sampling parameters varies according to inference frameworks.
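One way to apply these presets with the OpenAI Python SDK is sketched below. `temperature`, `top_p`, and `presence_penalty` are standard Chat Completions parameters; `top_k`, `min_p`, and `repetition_penalty` are not, so the sketch forwards them through the SDK's `extra_body` argument on the assumption that the serving framework accepts them there. `SAMPLING_PRESETS`, `build_request`, and `send` are illustrative names, and selecting the `instruct` preset only changes the sampling values; it does not by itself turn thinking off.

```python
import os

MODEL = "0G-AI/0GM-1.0-35B-A3B-0427"

# Presets copied from the recommendations above.
SAMPLING_PRESETS = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                             presence_penalty=1.5, repetition_penalty=1.0),
    "thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                            presence_penalty=0.0, repetition_penalty=1.0),
    "instruct": dict(temperature=0.7, top_p=0.80, top_k=20, min_p=0.0,
                     presence_penalty=1.5, repetition_penalty=1.0),
}

# Parameters that the Chat Completions API defines natively.
STANDARD_KEYS = {"temperature", "top_p", "presence_penalty"}


def build_request(preset, user_text):
    """Build keyword arguments for client.chat.completions.create."""
    params = SAMPLING_PRESETS[preset]
    request = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
    }
    request.update({k: v for k, v in params.items() if k in STANDARD_KEYS})
    # Assumption: the server reads the remaining parameters from the request body.
    request["extra_body"] = {k: v for k, v in params.items()
                             if k not in STANDARD_KEYS}
    return request


def send(request):
    """Send the request to the configured OpenAI-compatible endpoint."""
    # Imported lazily so build_request works without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    )
    response = client.chat.completions.create(**request)
    return response.choices[0].message.content
```

With a server running, `send(build_request("thinking_coding", "Write a function that reverses a string."))` returns the text of the first choice.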
0GM-1.0-35B-A3B-0427 operates in thinking mode by default, generating thinking content delimited by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples here.
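When the raw text reaches the client with the thinking block still inline, it can be separated from the final answer with plain string handling. The sketch below is one minimal way to do that; `split_thinking` is an illustrative helper, not part of any SDK. It also tolerates output where the opening `<think>` tag is missing, which is an assumption about how some chat templates behave. Servers launched with a reasoning parser may already return the thinking content in a separate field, in which case this step is unnecessary.

```python
def split_thinking(text):
    """Return (thinking, answer) from raw model output."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    # Drop the opening tag if present, then trim surrounding whitespace.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


raw = "<think>\nThe user wants 2 + 2.\n</think>\n\n4"
thinking, answer = split_thinking(raw)
print(repr(thinking))
print(repr(answer))
```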