0GM-1.0-35B-A3B (Preview-0427)

0GM-1.0-35B-A3B (Preview-0427) is the first MoE fine-tuned model built on Qwen 3.6 35B-A3B, a 35B-parameter mixture-of-experts model with roughly 3B active parameters per token. It was trained on 0G Compute's decentralized network and evaluated head-to-head against Qwen 3.5 35B, its Qwen 3.6 35B base, and the larger Covenant-72B baseline.

Benchmark Results

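| Benchmark                     | Covenant-72B | Qwen 3.5 35B    | Qwen 3.6 35B | 0GM-1.0-35B |
|-------------------------------|--------------|-----------------|--------------|-------------|
| MMLU-Pro (4k context)         | 41.84%       | 61.43%          | 75.75%       | 77.62%      |
| AIME 2026 (Pass@1)            | 0.00%        | 76.67%          | 70.00%       | 83.33%      |
| GSM-8K                        | 70.81%       | 80.06%          | 96.74%       | 96.82%      |
| MATH-500                      | 51.80%       | 91.20%          | 95.60%       | 95.80%      |
| Token Length (MATH-500, rel.) | 6.22%        | 100.00% (6,804) | 96.60%       | 94.91%      |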

0GM-1.0-35B-A3B leads on every accuracy benchmark, with the largest absolute gain on AIME 2026 (+13.33 points over Qwen 3.6 35B, +6.66 over Qwen 3.5 35B). Its MATH-500 token usage also drops slightly against Qwen 3.6 35B (94.91% versus 96.60% of the Qwen 3.5 35B reference length of 6,804) while accuracy goes up: shorter reasoning chains with higher correctness. Covenant-72B's very short outputs (6.22% of the reference length) and 0.00% AIME score reflect a baseline that was not tuned for reasoning rather than a capability ceiling at that parameter count.
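
The relative token lengths can be turned back into approximate absolute counts. The sketch below assumes that the 6,804 figure in the Qwen 3.5 35B column is the mean number of tokens per MATH-500 response and that every percentage is relative to it; the benchmark table does not state this explicitly, so treat the results as rough estimates.

```python
# Relative MATH-500 token lengths from the benchmark table, expressed as
# fractions of the Qwen 3.5 35B reference.
# Assumption: 6,804 is the mean number of tokens per response for that model.
REFERENCE_TOKENS = 6804

relative_length = {
    "Covenant-72B": 0.0622,
    "Qwen 3.5 35B": 1.0000,
    "Qwen 3.6 35B": 0.9660,
    "0GM-1.0-35B": 0.9491,
}

# Convert each relative length into an approximate absolute token count.
approx_tokens = {
    model: round(REFERENCE_TOKENS * fraction)
    for model, fraction in relative_length.items()
}

for model, tokens in approx_tokens.items():
    print(f"{model:<14} ~{tokens:,} tokens")
```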

MMLU-Pro Subject Breakdown

0GM-1.0-35B-A3B outperforms Qwen 3.6 35B-A3B on 11 of the 14 MMLU-Pro subjects in the table below, ties on biology, and trails slightly on health (-0.5) and law (-1.5). The largest gains are in physics (+5.1), philosophy (+4.8), engineering (+4.4), and business (+3.1). Because these results were obtained under a constrained 4k context budget, they suggest stronger reasoning within a limited number of tokens.

| Subject          | Covenant-72B | Qwen 3.5 35B | Qwen 3.6 35B | 0GM-1.0-35B |
|------------------|--------------|--------------|--------------|-------------|
| Math             | 41.6%        | 71.4%        | 82.4%        | 83.9%       |
| Physics          | 36.8%        | 61.9%        | 74.2%        | 79.3%       |
| Biology          | 69.5%        | 79.1%        | 91.1%        | 91.1%       |
| Economics        | 53.7%        | 75.5%        | 87.2%        | 87.8%       |
| Chemistry        | 31.1%        | 55.3%        | 72.1%        | 74.6%       |
| Business         | 42.3%        | 72.0%        | 78.8%        | 81.9%       |
| Psychology       | 57.4%        | 74.8%        | 84.5%        | 85.0%       |
| Computer Science | 43.2%        | 69.5%        | 80.7%        | 83.2%       |
| Health           | 48.3%        | 62.8%        | 79.5%        | 79.0%       |
| Other            | 44.5%        | 68.2%        | 77.7%        | 79.0%       |
| Philosophy       | 42.7%        | 62.5%        | 80.2%        | 85.0%       |
| History          | 40.9%        | 58.0%        | 74.8%        | 77.2%       |
| Engineering      | 25.0%        | 33.3%        | 42.0%        | 46.4%       |
| Law              | 27.6%        | 31.1%        | 67.5%        | 66.0%       |
| Avg              | 43.19%       | 62.53%       | 75.7%        | 77.4%       |
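
To make the per-subject comparison easy to re-check, the following sketch tallies wins, ties, and losses of 0GM-1.0-35B against Qwen 3.6 35B and ranks the gains. The numbers are copied from the two right-hand columns of the table; the variable names are ours and purely illustrative.

```python
# Per-subject MMLU-Pro accuracy (%) copied from the table above:
# subject -> (Qwen 3.6 35B, 0GM-1.0-35B)
scores = {
    "Math": (82.4, 83.9),
    "Physics": (74.2, 79.3),
    "Biology": (91.1, 91.1),
    "Economics": (87.2, 87.8),
    "Chemistry": (72.1, 74.6),
    "Business": (78.8, 81.9),
    "Psychology": (84.5, 85.0),
    "Computer Science": (80.7, 83.2),
    "Health": (79.5, 79.0),
    "Other": (77.7, 79.0),
    "Philosophy": (80.2, 85.0),
    "History": (74.8, 77.2),
    "Engineering": (42.0, 46.4),
    "Law": (67.5, 66.0),
}

# Point difference per subject, rounded to one decimal to hide float noise.
deltas = {subject: round(ours - base, 1) for subject, (base, ours) in scores.items()}

wins = [s for s, d in deltas.items() if d > 0]
ties = [s for s, d in deltas.items() if d == 0]
losses = [s for s, d in deltas.items() if d < 0]

# The four subjects with the largest gains, biggest first.
top_gains = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:4]

print(f"wins={len(wins)} ties={len(ties)} losses={len(losses)}")
for subject, delta in top_gains:
    print(f"{subject}: {delta:+.1f}")
```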

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model
    • Number of Parameters: 35B in total and 3B activated
    • Hidden Dimension: 2048
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 40
    • Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 32 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 16 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts
      • Number of Experts: 256
      • Number of Activated Experts: 8 Routed + 1 Shared
      • Expert Intermediate Dimension: 512
    • LM Output: 248320 (Padded)
    • MTP: trained with multi-steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
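
As a reading aid for the Hidden Layout line, the sketch below expands the pattern into one entry per layer. It only enumerates the layer types described above; the names `gated_deltanet` and `gated_attention` are illustrative labels of ours, not identifiers from the model code.

```python
# Layer pattern from the "Hidden Layout" line:
# 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
NUM_BLOCKS = 10
DELTANET_PER_BLOCK = 3
ATTENTION_PER_BLOCK = 1


def build_layer_pattern() -> list[str]:
    """Return the token-mixer type of every layer, in order."""
    block = (
        ["gated_deltanet"] * DELTANET_PER_BLOCK
        + ["gated_attention"] * ATTENTION_PER_BLOCK
    )
    return block * NUM_BLOCKS


layers = build_layer_pattern()

# Zero-based indices of the layers that use full (gated) attention.
attention_layer_ids = [i for i, kind in enumerate(layers) if kind == "gated_attention"]

print(f"total layers: {len(layers)}")
print(f"gated attention layers: {attention_layer_ids}")
```

Under this reading, every fourth layer uses gated attention and the remaining layers use Gated DeltaNet, which is consistent with the stated total of 40 layers.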

Quickstart

For streamlined integration, we recommend serving the model behind an API. The guide below uses the OpenAI-compatible API.

Serving

0GM can be served via APIs with popular inference frameworks. The example commands below launch OpenAI-compatible API servers.

Inference efficiency and throughput vary significantly across frameworks. We recommend using the latest framework versions to ensure optimal performance and compatibility. For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.

The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because this model leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.
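
When choosing a reduced context window, a back-of-the-envelope estimate of the attention KV cache can help. The sketch below uses the Gated Attention figures from the Model Overview (2 KV heads, head dimension 256) and rests on several assumptions of ours: the cache is stored in BF16, only the 10 gated attention layers keep a per-token cache, and "128K" means 131,072 tokens. It ignores model weights, Gated DeltaNet state, activations, and framework overhead, so real memory use will be higher.

```python
# Rough per-sequence KV-cache estimate for the gated attention layers.
NUM_ATTENTION_LAYERS = 10  # assumption: 1 gated attention layer per block x 10 blocks
NUM_KV_HEADS = 2           # "2 for KV" in the Model Overview
HEAD_DIM = 256             # gated attention head dimension
BYTES_PER_VALUE = 2        # assumption: cache stored in BF16


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    per_token = (
        2  # one key vector and one value vector
        * NUM_KV_HEADS
        * HEAD_DIM
        * BYTES_PER_VALUE
        * NUM_ATTENTION_LAYERS
    )
    return per_token * context_tokens


for context in (131_072, 262_144):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.1f} GiB per sequence")
```

Under these assumptions, halving the context window halves the per-sequence cache, which is why lowering `--context-length` is the first thing to try after an OOM error.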

SGLang

SGLang is a fast serving framework for large language models and vision language models. We recommend sglang>=0.5.10, which can be installed in a fresh environment with the following command (the quotes keep the shell from interpreting the brackets and the >= sign):

uv pip install "sglang[all]>=0.5.10"

See its documentation for more details.

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: This command creates an API endpoint with a maximum context length of 262,144 tokens, using tensor parallelism across 8 GPUs.

    python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3
    
  • Tool Use: To enable tool calling, add a tool-call parser to the standard command.

    python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder
    
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:

    python -m sglang.launch_server --model-path 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
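
Once a server has been launched with the tool-use command, a client can pass tool definitions in the standard OpenAI `tools` format. The following is a minimal sketch under stated assumptions: `get_weather` is a hypothetical tool invented for illustration, the server is assumed to listen on http://localhost:8000/v1, and the `send` helper requires the `openai` package. Only the payload is built and printed here; `send` is defined but not called.

```python
import json

MODEL = "0G-AI/0GM-1.0-35B-A3B-0427"

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str) -> dict:
    """Build keyword arguments for a chat completion that may call a tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
    }


def send(payload: dict):
    """Send the request to a local server and return any tool calls.

    Assumes a server started with a tool-call parser is listening on port 8000.
    """
    from openai import OpenAI  # imported lazily so the sketch runs without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(**payload)
    return response.choices[0].message.tool_calls


payload = build_tool_request("What is the weather in Paris?")
print(json.dumps(payload, indent=2))
```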
    

KTransformers

KTransformers is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing. For running with KTransformers, see the KTransformers Deployment Guide.

Hugging Face Transformers

Hugging Face Transformers includes a lightweight server that can be used for quick testing and moderate-load deployment. The latest version of transformers is required:

pip install "transformers[serving]"

See its documentation for more details. Please also make sure torchvision and pillow are installed.

Then, run transformers serve to launch a server with API endpoints at http://localhost:8000/v1; it will place the model on accelerators if available:

transformers serve 0G-AI/0GM-1.0-35B-A3B-0427 --port 8000 --continuous-batching

Using 0GM-1.0-35B-A3B-0427 via the Chat Completions API

The chat completions API is accessible via standard HTTP requests or OpenAI SDKs. Here, we show examples using the OpenAI Python SDK.

Before starting, make sure the SDK is installed and the API key and the API base URL are configured, e.g.:

pip install -U openai

# Set the following accordingly
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

We recommend using the following set of sampling parameters for generation:

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
  • Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.
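
A minimal sketch of how these presets might be passed through the OpenAI Python SDK follows. `temperature`, `top_p`, and `presence_penalty` are regular arguments of `chat.completions.create`; the remaining parameters are forwarded through `extra_body`, and whether the server honors them depends on the inference framework. The preset names and helper functions are ours, chosen for illustration.

```python
MODEL = "0G-AI/0GM-1.0-35B-A3B-0427"

# Recommended sampling presets, copied from the list above.
PRESETS = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                             presence_penalty=1.5, repetition_penalty=1.0),
    "thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                            presence_penalty=0.0, repetition_penalty=1.0),
    "instruct": dict(temperature=0.7, top_p=0.80, top_k=20, min_p=0.0,
                     presence_penalty=1.5, repetition_penalty=1.0),
}

# Parameters the SDK accepts as named arguments; everything else goes into
# extra_body (server-side support is an assumption and varies by framework).
STANDARD_KEYS = {"temperature", "top_p", "presence_penalty"}


def build_kwargs(prompt: str, preset: str = "thinking_general") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = PRESETS[preset]
    standard = {k: v for k, v in params.items() if k in STANDARD_KEYS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_KEYS}
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        **standard,
        "extra_body": extra,
    }


def chat(prompt: str, preset: str = "thinking_general") -> str:
    """Send one prompt to the server configured via the environment."""
    from openai import OpenAI  # imported lazily so the sketch runs without it

    client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY
    response = client.chat.completions.create(**build_kwargs(prompt, preset))
    return response.choices[0].message.content


print(build_kwargs("Hello", preset="instruct"))
```

With a server running, `chat("Explain mixture-of-experts routing in two sentences.")` would return the final response text.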

0GM-1.0-35B-A3B-0427 operates in thinking mode by default, generating thinking content delimited by <think>\n...</think>\n\n before producing the final response. To disable thinking content and obtain a direct response, refer to the examples here.
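
If the server was started without a reasoning parser, the thinking block arrives inline in the response text. The sketch below separates it from the final answer with a regular expression. It also shows the `enable_thinking` switch that Qwen-style chat templates accept through `chat_template_kwargs`; we assume this model inherits that convention from its base model, which should be verified against the model's chat template before relying on it.

```python
import re

# Matches one <think>...</think> block plus the whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>\n?(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from raw output that may contain a think block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer


# Assumption: the chat template honors the Qwen-style `enable_thinking` flag.
# Pass this as extra_body=... to client.chat.completions.create().
NO_THINKING_EXTRA_BODY = {"chat_template_kwargs": {"enable_thinking": False}}

raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:", answer)
```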
