Instructions to use silx-ai/Quasar-3B-A1B-Preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use silx-ai/Quasar-3B-A1B-Preview with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="silx-ai/Quasar-3B-A1B-Preview", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("silx-ai/Quasar-3B-A1B-Preview", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use silx-ai/Quasar-3B-A1B-Preview with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "silx-ai/Quasar-3B-A1B-Preview"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silx-ai/Quasar-3B-A1B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/silx-ai/Quasar-3B-A1B-Preview

SGLang

How to use silx-ai/Quasar-3B-A1B-Preview with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "silx-ai/Quasar-3B-A1B-Preview" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silx-ai/Quasar-3B-A1B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "silx-ai/Quasar-3B-A1B-Preview" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silx-ai/Quasar-3B-A1B-Preview",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use silx-ai/Quasar-3B-A1B-Preview with Docker Model Runner:
```
docker model run hf.co/silx-ai/Quasar-3B-A1B-Preview
```

eyad-silx committed on Apr 19

Commit cee9995 (verified) · 1 Parent(s): 67fc2f0

Create README.md

Files changed (1)

README.md +278 -0

	@@ -0,0 +1,278 @@

+---
+language:
+- en
+- ar
+license: mit
+tags:
+- silx-ai
+- quasar
+- foundation-model
+- 3b
+- moe
+- long-context
+- bittensor
+- sn24
+- distillation
+- hybrid-transformer
+pipeline_tag: text-generation
+library_name: transformers
+---
+<p align="center">
+  <img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
+</p>
+# **Quasar Foundation Models (RoPE Base)**
+**Quasar Foundation Models** are SILX AI’s core models designed for **long-context reasoning**, **agentic systems**, and **persistent memory-based intelligence**.
+This release is **NOT a state-of-the-art final model**.
+It is a **base pretraining model** designed specifically for **distributed knowledge distillation on Bittensor (SN24 Quasar subnet)**.
+The goal is to create a shared architecture where miners continuously **distill knowledge from frontier models (e.g., Qwen, GLM)** into Quasar.
+---
+## ⚠️ Important Note
+This model is:
+- A **base model**
+- **Pretrained on only a few billion tokens**
+- Designed for **distillation and scaling**, not benchmarking
+Performance is expected to improve through **iterative subnet training + distillation cycles**.
+---
+## Model Overview
+- **Model Name:** Quasar 3B (RoPE Base)
+- **Organization:** SILX AI
+- **Architecture:** Quasar-RoPE Hybrid Transformer
+- **Total Parameters:** 3B
+- **Active Parameters:** ~1B (Mixture-of-Experts)
+- **Training Stage:** Stage 1 (Base Pretraining)
+- **Sequence Length:** 16K tokens (RoPE phase)
+---
+## Training Strategy
+Quasar follows a **multi-stage training pipeline**:
+### **Stage 1 — RoPE Pretraining**
+- Train using **Rotary Positional Embeddings (RoPE)**
+- Context length: **16K tokens**
+- Objective: stabilize training and build core reasoning
+### **Stage 2 — Distillation (SN24)**
+- Distributed training on **Bittensor subnet (SN24)**
+- Miners distill knowledge from:
+  - Qwen
+  - GLM
+- Target: transfer reasoning + capabilities into Quasar
+### **Stage 3 — DroPE Long-Context Training**
+- Remove positional embeddings entirely (**DroPE phase**)
+- Transition to **position-free reasoning**
+- Train on **ultra-long context (up to 5M tokens)**
+This staged approach allows:
+- Stable early training
+- Efficient knowledge transfer
+- Extreme context scaling without positional bottlenecks
+---
+# **Quasar-RoPE Hybrid Architecture**
+Quasar is a **high-throughput hybrid transformer** designed for **trillion-token scale training**.
+It combines:
+- **Looped computation**
+- **Persistent latent memory**
+- **Hybrid attention mechanisms**
+- **Stable Mixture-of-Experts routing**
+---
+## 1. Looped Transformer Logic
+Instead of stacking more layers to gain depth, Quasar uses **looped execution**:
+- A fixed set of layers is reused multiple times (`num_loops`)
+- This multiplies effective depth without adding parameters, so weight memory in VRAM stays constant
+### Key Mechanism:
+- **Anchor P (Input Injection):**
+  - Embedding output is stored as `P`
+  - Injected into the hidden state at every loop
+- **Gradient Stabilization:**
+  - Injection gradients scaled by `1 / num_loops`
+  - Prevents instability during recirculation
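The looping and injection described above can be illustrated with a scalar toy. This is a minimal sketch, not the Quasar implementation: the shared layer stack is replaced by a multiplication with a hypothetical weight `w`, and `looped_forward` and `anchor_grad` are illustrative names. It shows why scaling the injection gradients by `1 / num_loops` keeps the total gradient reaching `P` from growing with the loop count.

```python
def looped_forward(p, w, num_loops):
    """Reuse one 'block' num_loops times, re-injecting the anchor p each time."""
    h = p  # the embedding output doubles as the initial hidden state
    for _ in range(num_loops):
        h = w * h  # stand-in for the shared layer stack (assumption: a scalar weight)
        h = h + p  # Anchor P injection
    return h


def anchor_grad(w, num_loops, scale_injection=True):
    """Gradient of the output w.r.t. p that flows through the injection paths.

    The injection at loop i still passes through (num_loops - 1 - i) more
    applications of w. With scale_injection=True each path is multiplied by
    1 / num_loops, mirroring the gradient scaling described in the README.
    """
    scale = 1.0 / num_loops if scale_injection else 1.0
    return sum(scale * w ** (num_loops - 1 - i) for i in range(num_loops))


if __name__ == "__main__":
    print(looped_forward(1.0, 1.0, 4))
    print(anchor_grad(1.0, 4, scale_injection=False))
    print(anchor_grad(1.0, 4, scale_injection=True))
```

With `w = 1.0` the unscaled injection gradient equals `num_loops`, while the scaled version stays at 1 regardless of how many loops are run. In an autograd framework the same effect is usually obtained by scaling the gradient of the injected tensor while leaving its forward value unchanged.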
+---
+## 2. Hybrid Layer Composition
+Each loop contains a mix of:
+### **Quasar Layers**
+- Use **Latent Memory Module**
+- Handle long-range dependencies
+- Read/write persistent state
+### **GLA Layers (Gated Linear Attention)**
+- Fast, RNN-like recurrence
+- Efficient local sequence modeling
+---
+## 3. Persistent Latent Memory
+A defining component of Quasar:
+- **Memory Slots:**
+  - Fixed parameter banks (e.g., 128–256 slots)
+- **Segment Compression:**
+  - Tokens grouped into segments (default: 64 tokens)
+  - Reduces noise during memory updates
+- **Saliency Gating:**
+  - Learns which information is important
+  - Writes only high-value signals to memory
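One way to picture the three pieces above (slots, segment compression, saliency gating) is the sketch below. Everything concrete in it is an assumption made for illustration: mean pooling as the compressor, a sigmoid over a linear score as the gate, a fixed `threshold`, and a best-matching-slot write rule. The README does not specify any of these, and the function names are hypothetical.

```python
import math


def compress_segments(hidden, segment_len=64):
    """Mean-pool consecutive token vectors into one vector per segment."""
    segments = []
    for start in range(0, len(hidden), segment_len):
        chunk = hidden[start:start + segment_len]
        dim = len(chunk[0])
        segments.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return segments


def saliency(z, gate_weights, gate_bias=0.0):
    """Score a segment in (0, 1) with a sigmoid over a linear projection."""
    logit = sum(w * x for w, x in zip(gate_weights, z)) + gate_bias
    return 1.0 / (1.0 + math.exp(-logit))


def write_memory(memory, segments, gate_weights, threshold=0.5):
    """Blend each salient segment into its best-matching slot; skip the rest."""
    memory = [slot[:] for slot in memory]  # copy so the caller's bank is untouched
    for z in segments:
        s = saliency(z, gate_weights)
        if s < threshold:
            continue  # low-value segment: no write
        best = max(
            range(len(memory)),
            key=lambda i: sum(m * x for m, x in zip(memory[i], z)),
        )
        memory[best] = [(1.0 - s) * m + s * x for m, x in zip(memory[best], z)]
    return memory


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [1.0, 0.0], [-4.0, 0.0], [-4.0, 0.0]]
    segments = compress_segments(hidden, segment_len=2)
    bank = [[0.5, 0.0], [0.0, 1.0]]
    print(write_memory(bank, segments, gate_weights=[1.0, 0.0]))
```

In this toy run only the first segment clears the saliency threshold, so a single slot is updated and the second segment never reaches the memory bank.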
+---
+## 4. SMEBU (Stability-Maximized Expert Balancing Unit)
+Custom Mixture-of-Experts system:
+- **Global Bias Buffers**
+  - Stored outside optimizer
+  - Prevent routing collapse
+- **Zero-Loop Updates**
+  - Expert balancing done in vectorized pass
+  - No recursive instability
+- **Sparse Activation**
+  - ~1B active parameters per forward pass
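The routing behaviour described above can be sketched as bias-adjusted top-k selection plus a batch-level bias update. This is a hedged illustration, not the SMEBU code: it assumes the bias is added to the router scores, that it influences only which experts are selected (not their mixing weights), and that the update is a sign-based nudge in the style of bias-based, auxiliary-loss-free load balancing. The names `route_top_k`, `update_bias`, and `step` are hypothetical.

```python
def route_top_k(scores, bias, k):
    """Pick k experts for one token using bias-adjusted scores.

    scores: positive router scores, one per expert (e.g. softmax outputs)
    bias:   per-expert buffer kept outside the optimizer (assumption: additive)
    Returns {expert_index: mixing_weight}, weights normalised over the chosen experts.
    """
    adjusted = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: adjusted[e], reverse=True)[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, expert_counts, step=0.01):
    """One vectorized pass: lower the bias of overloaded experts, raise underloaded ones."""
    mean = sum(expert_counts) / len(expert_counts)
    return [b + step * ((mean > c) - (mean < c)) for b, c in zip(bias, expert_counts)]


if __name__ == "__main__":
    scores = [0.5, 0.3, 0.2]
    print(route_top_k(scores, bias=[0.0, 0.0, 0.0], k=2))
    print(route_top_k(scores, bias=[-0.4, 0.0, 0.0], k=2))
    print(update_bias([0.0, 0.0, 0.0], expert_counts=[6, 3, 0]))
```

Because the buffer is adjusted directly from token counts rather than through gradients, it needs no optimizer state, which matches the "stored outside optimizer" description.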
+---
+## 5. Technical Specifications
+- **Normalization:** RMSNorm (Pre-Norm)
+- **Positional Encoding:** RoPE (`theta = 1,000,000`)
+- **Initialization:** Depth-scaled `1/sqrt(2L)`
+- **Architecture Type:** Hybrid Transformer + Memory + MoE
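For readers less familiar with these components, here is a small reference sketch of RMSNorm, RoPE with `theta = 1,000,000`, and the `1/sqrt(2L)` depth factor. It is written from the standard definitions, not from the Quasar source. The interleaved pairing of dimensions in `apply_rope` and the `eps` default are assumptions, and the README does not say which base standard deviation the depth factor multiplies.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]


def rope_frequencies(head_dim, theta=1_000_000.0):
    """One rotation frequency per pair of dimensions: theta^(-2i / head_dim)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(x, position, theta=1_000_000.0):
    """Rotate each pair (x[2i], x[2i+1]) by position * frequency_i."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(x), theta)):
        angle = position * freq
        a, b = x[2 * i], x[2 * i + 1]
        out += [
            a * math.cos(angle) - b * math.sin(angle),
            a * math.sin(angle) + b * math.cos(angle),
        ]
    return out


def depth_scale(num_layers):
    """Depth factor 1 / sqrt(2L) applied at initialization."""
    return 1.0 / math.sqrt(2 * num_layers)


if __name__ == "__main__":
    print(rope_frequencies(4))
    print(apply_rope([1.0, 0.0, 1.0, 0.0], position=3))
    print(depth_scale(8))
```

A large `theta` makes the slowest frequencies rotate very gradually, which is the usual motivation for raising it when training at longer context lengths.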
+---
+# Architecture Overview
+## Core Data Flow
+```
+Token IDs
+  ↓
+Embedding Layer
+  ↓
+Anchor P Snapshot
+  ↓
+┌──────────────────────────────────────────────┐
+│ Loop (i < num_loops)                         │
+│                                              │
+│   Quasar Block                               │
+│        ↓                                     │
+│   GLA Block                                  │
+│        ↓                                     │
+│   SMEBU MoE                                  │
+│        ↓                                     │
+│   Inject Anchor P (Residual Conditioning)    │
+└──────────────────────────────────────────────┘
+  ↓
+Next Loop Iteration (state updated)
+  ↓
+Final Loop Output
+  ↓
+RMSNorm
+  ↓
+LM Head
+  ↓
+Logits
+```
+---
+## Latent Memory Update Path
+```
+Hidden States
+  ↓
+Normalization (RMSNorm)
+  ↓
+Segment Compressor
+  ↓
+Segment Representation (Z)
+  ↓
+  ├──────────────→ Saliency Gate (importance scoring)
+  │                        ↓
+  │                     Write Signal
+  │                        ↓
+  └──────────────→ Memory Write Operation
+                           ↓
+              Persistent Memory Bank (M)
+                           ↓
+                  Updated Memory (M')
+                           ↓
+                  Memory Read Module
+                           ↓
+              Memory-Augmented Hidden State
+                           ↓
+                         Output
+```
+---
+## SMEBU MoE Stability Flow
+```
+Router Network
+  ↓
+Token Routing Scores
+  ↓
+  * Global Bias Buffer (non-trainable stability path)
+  ↓
+Top-K Expert Selection
+  ↓
+Selected Experts
+  ↓
+Expert Output Aggregation
+  ↓
+Final MoE Output
+  ↓
+Post-Loop Bias Update (vectorized, stabilized)
+```
+---
+# Intended Use
+This model is designed as a **foundation base model** for the Quasar ecosystem and is primarily intended for:
+- **Bittensor SN24 miners** participating in distributed training and knowledge distillation
+- **Distillation pipelines** transferring capabilities from frontier models (e.g., Qwen, GLM)
+- **Research on long-context architectures**, especially beyond traditional positional encoding limits
+- **Agentic system development**, where persistent memory and long-horizon reasoning are required
+---
+# Next Steps
+- Training on **SN24** begins in the coming days
+- Miners distill knowledge into this model
+- Next comes **Run 2: DroPE training** at up to **5M tokens**