eyad-silx committed on
Commit cee9995 · verified · 1 Parent(s): 67fc2f0

Create README.md

Files changed (1)
  1. README.md +278 -0
README.md ADDED
@@ -0,0 +1,278 @@
+ ---
+ language:
+ - en
+ - ar
+ license: mit
+ tags:
+ - silx-ai
+ - quasar
+ - foundation-model
+ - 3b
+ - moe
+ - long-context
+ - bittensor
+ - sn24
+ - distillation
+ - hybrid-transformer
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ <p align="center">
+ <img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
+ </p>
+
+ # **Quasar Foundation Models (RoPE Base)**
+
+ **Quasar Foundation Models** are SILX AI’s core models designed for **long-context reasoning**, **agentic systems**, and **persistent memory-based intelligence**.
+
+ This release is **NOT a state-of-the-art final model**.
+ It is a **base pretraining model** designed specifically for **distributed knowledge distillation on Bittensor (SN24 Quasar subnet)**.
+
+ The goal is to create a shared architecture where miners continuously **distill knowledge from frontier models (e.g., Qwen, GLM)** into Quasar.
+
+ ---
+
+ ## ⚠️ Important Note
+
+ This model is:
+
+ - A **base model**
+ - **Pretrained on only a few billion tokens**
+ - Designed for **distillation and scaling**, not benchmarking
+
+ Performance is expected to improve through **iterative subnet training and distillation cycles**.
+
+ ---
+
+ ## Model Overview
+
+ - **Model Name:** Quasar 3B (RoPE Base)
+ - **Organization:** SILX AI
+ - **Architecture:** Quasar-RoPE Hybrid Transformer
+ - **Total Parameters:** 3B
+ - **Active Parameters:** ~1B (Mixture-of-Experts)
+ - **Training Stage:** Stage 1 (Base Pretraining)
+ - **Sequence Length:** 16K tokens (RoPE phase)
+
+ ---
+
+ ## Training Strategy
+
+ Quasar follows a **multi-stage training pipeline**:
+
+ ### **Stage 1: RoPE Pretraining**
+ - Train using **Rotary Positional Embeddings (RoPE)**
+ - Context length: **16K tokens**
+ - Objective: stabilize training and build core reasoning
+
+ ### **Stage 2: Distillation (SN24)**
+ - Distributed training on **Bittensor subnet (SN24)**
+ - Miners distill knowledge from:
+   - Qwen
+   - GLM
+ - Target: transfer reasoning and capabilities into Quasar
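
The README does not state which distillation objective SN24 miners use. As a minimal sketch, assuming a standard temperature-scaled KL divergence between the teacher's and the student's next-token distributions, the per-token loss might look like the following. The names `softmax` and `distillation_loss` and the default `temperature` are illustrative assumptions, not part of the Quasar codebase.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the usual T^2 factor is applied so the gradient scale
    stays comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher (e.g. Qwen or GLM)
    q = softmax(student_logits, temperature)  # student (Quasar)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


if __name__ == "__main__":
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.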
+
+ ### **Stage 3: DroPE Long-Context Training**
+ - Remove positional embeddings entirely (**DroPE phase**)
+ - Transition to **position-free reasoning**
+ - Train on **ultra-long context (up to 5M tokens)**
+
+ This staged approach allows:
+ - Stable early training
+ - Efficient knowledge transfer
+ - Extreme context scaling without positional bottlenecks
+
+ ---
+
+ # **Quasar-RoPE Hybrid Architecture**
+
+ Quasar is a **high-throughput hybrid transformer** designed for **trillion-token scale training**.
+
+ It combines:
+ - **Looped computation**
+ - **Persistent latent memory**
+ - **Hybrid attention mechanisms**
+ - **Stable Mixture-of-Experts routing**
+
+ ---
+
+ ## 1. Looped Transformer Logic
+
+ Instead of stacking more layers to increase depth, Quasar uses **looped execution**:
+
+ - A fixed set of layers is reused multiple times (`num_loops`)
+ - This multiplies effective depth without adding parameters, so weight memory stays constant
+
+ ### Key Mechanism:
+
+ - **Anchor P (Input Injection):**
+   - Embedding output is stored as `P`
+   - `P` is injected into the hidden state at every loop
+ - **Gradient Stabilization:**
+   - Injection gradients are scaled by `1 / num_loops`
+   - This prevents instability during recirculation
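
The loop and anchor injection described above can be sketched in plain Python. This is a toy under stated assumptions, not the Quasar implementation: layers are stand-in callables on lists of floats, injection is a simple residual add at the end of each loop, and `anchor_grad` shows one reading of "scaled by `1 / num_loops`" (averaging the per-loop gradient contributions that reach `P`). All function names are hypothetical.

```python
def looped_forward(p, layers, num_loops):
    """Reuse the same `layers` `num_loops` times, re-injecting anchor `p`."""
    h = list(p)  # the hidden state starts as the embedding output P
    for _ in range(num_loops):
        for layer in layers:
            h = layer(h)
        # Inject Anchor P (assumed here to be a residual add)
        h = [hi + pi for hi, pi in zip(h, p)]
    return h


def make_scale_layer(weight):
    """Stand-in for a real block: multiplies every element by `weight`."""
    return lambda h: [weight * x for x in h]


def anchor_grad(per_loop_grads, num_loops):
    """Combine the gradients reaching P from each injection, scaled by 1/num_loops."""
    return sum(per_loop_grads) / num_loops


if __name__ == "__main__":
    layers = [make_scale_layer(0.5), make_scale_layer(0.5)]
    num_loops = 3
    print(looped_forward([1.0, 2.0], layers, num_loops))
    # Two stored layers, executed three times each
    print("effective depth:", len(layers) * num_loops)
```

Only two layers' worth of weights exist in this toy, yet six layer applications run per forward pass, which is the sense in which looping multiplies effective depth.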
115
+
116
+ ---
117
+
118
+ ## 2. Hybrid Layer Composition
119
+
120
+ Each loop contains a mix of:
121
+
122
+ ### **Quasar Layers**
123
+ - Use **Latent Memory Module**
124
+ - Handle long-range dependencies
125
+ - Read/write persistent state
126
+
127
+ ### **GLA Layers (Gated Linear Attention)**
128
+ - Fast, RNN-like recurrence
129
+ - Efficient local sequence modeling
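
To make the "RNN-like recurrence" concrete, here is a minimal sketch of a gated linear attention step. It assumes the simplest form of the recurrence, with a scalar decay gate per token: the state matrix is decayed, the outer product of key and value is added, and the query reads the result. Published GLA variants use a data-dependent gate per key dimension; the scalar gate and the function names here are simplifying assumptions.

```python
def gla_step(state, q, k, v, gate):
    """One recurrent step: S_t = gate * S_{t-1} + k^T v, then o_t = q S_t."""
    d_k, d_v = len(k), len(v)
    new_state = [
        [gate * state[i][j] + k[i] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


def gla_scan(qs, ks, vs, gates):
    """Run the recurrence over a sequence with a fixed-size state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, gate in zip(qs, ks, vs, gates):
        state, out = gla_step(state, q, k, v, gate)
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    print(gla_scan(
        qs=[[1.0, 0.0], [1.0, 1.0]],
        ks=[[1.0, 0.0], [0.0, 1.0]],
        vs=[[2.0, 3.0], [4.0, 5.0]],
        gates=[1.0, 0.5],
    ))
```

The state has a fixed `d_k x d_v` size regardless of sequence length, so per-token cost is constant.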
130
+
131
+ ---
132
+
133
+ ## 3. Persistent Latent Memory
134
+
135
+ A defining component of Quasar:
136
+
137
+ - **Memory Slots:**
138
+ - Fixed parameter banks (e.g., 128–256 slots)
139
+
140
+ - **Segment Compression:**
141
+ - Tokens grouped into segments (default: 64 tokens)
142
+ - Reduced noise during updates
143
+
144
+ - **Saliency Gating:**
145
+ - Learns which information is important
146
+ - Writes only high-value signals to memory
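
A rough sketch of how segment compression and saliency gating could fit together follows. The README gives only the segment size (64 tokens) and the idea of a learned gate. Everything else is an assumption made for illustration: mean pooling as the compressor, a sigmoid over a linear score as the gate, and an interpolating write into a single slot.

```python
import math


def compress_segments(hidden, segment_len=64):
    """Mean-pool consecutive token vectors into one vector per segment."""
    segments = []
    for start in range(0, len(hidden), segment_len):
        chunk = hidden[start:start + segment_len]
        dim = len(chunk[0])
        segments.append(
            [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
        )
    return segments


def saliency(z, weights, bias=0.0):
    """Sigmoid importance score in (0, 1) for a segment representation z."""
    score = sum(w * x for w, x in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def gated_write(memory, z, slot, gate):
    """Blend z into memory[slot]; gate near 0 leaves the slot almost untouched."""
    memory[slot] = [(1.0 - gate) * m + gate * x for m, x in zip(memory[slot], z)]
    return memory


if __name__ == "__main__":
    hidden = [[1.0, 0.0]] * 64 + [[0.0, 1.0]] * 64  # 128 tokens, dim 2
    memory = [[0.0, 0.0]]                           # a single slot for brevity
    for z in compress_segments(hidden):
        gate = saliency(z, weights=[4.0, -4.0])     # hypothetical gate weights
        gated_write(memory, z, slot=0, gate=gate)
    print(memory)
```

With these hypothetical gate weights the first segment is written almost in full while the second barely changes the slot, which is the selective-write behavior the list describes.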
147
+
148
+ ---
149
+
150
+ ## 4. SMEBU (Stability-Maximized Expert Balancing Unit)
151
+
152
+ Custom Mixture-of-Experts system:
153
+
154
+ - **Global Bias Buffers**
155
+ - Stored outside optimizer
156
+ - Prevent routing collapse
157
+
158
+ - **Zero-Loop Updates**
159
+ - Expert balancing done in vectorized pass
160
+ - No recursive instability
161
+
162
+ - **Sparse Activation**
163
+ - ~1B active parameters per forward pass
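
The SMEBU update rule is not spelled out in this README. One plausible reading, sketched below, is bias-based load balancing: a per-expert bias is added to the router scores before top-k selection, and after the batch the bias is nudged down for overloaded experts and up for underloaded ones. The bias lives in a plain buffer that no optimizer touches. The additive bias, the sign-based update, and the `step` size are all assumptions.

```python
def route_top_k(scores, bias, k):
    """Pick the k experts with the highest bias-adjusted score."""
    adjusted = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(adjusted)), key=lambda e: adjusted[e], reverse=True)
    return ranked[:k]


def route_batch(batch_scores, bias, k):
    """Route every token and count how many tokens each expert received."""
    counts = [0] * len(bias)
    assignments = []
    for scores in batch_scores:
        chosen = route_top_k(scores, bias, k)
        assignments.append(chosen)
        for expert in chosen:
            counts[expert] += 1
    return assignments, counts


def update_bias(bias, counts, step=0.5):
    """Single pass over experts: push each bias toward the mean load."""
    mean_load = sum(counts) / len(counts)

    def sign(x):
        return (x > 0) - (x < 0)

    return [b + step * sign(mean_load - c) for b, c in zip(bias, counts)]


if __name__ == "__main__":
    batch = [[0.9, 0.1, 0.0]] * 4       # every token prefers expert 0
    bias = [0.0, 0.0, 0.0]
    _, counts = route_batch(batch, bias, k=1)
    bias = update_bias(bias, counts)    # plain buffer, not an optimizer parameter
    print(counts, bias, route_top_k(batch[0], bias, k=1))
```

In this toy, all four tokens initially collapse onto expert 0. After one bias update the same scores route to a different expert, which illustrates how a bias buffer can counteract routing collapse without a gradient step.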
164
+
165
+ ---
166
+
167
+ ## 5. Technical Specifications
168
+
169
+ - **Normalization:** RMSNorm (Pre-Norm)
170
+ - **Positional Encoding:** RoPE (`theta = 1,000,000`)
171
+ - **Initialization:** Depth-scaled `1/sqrt(2L)`
172
+ - **Architecture Type:** Hybrid Transformer + Memory + MoE
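
Two of these settings are easy to illustrate numerically. The sketch below computes standard RoPE inverse frequencies for `theta = 1,000,000`, applies the rotation to one vector, and evaluates the `1/sqrt(2L)` factor. It assumes the interleaved-pair RoPE convention and assumes `L` is the number of layers, with the factor used as a multiplier on a base initialization scale. Neither detail is confirmed by the README, and the function names are illustrative.

```python
import math


def rope_inv_freq(head_dim, theta=1_000_000.0):
    """Inverse frequencies theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_rotate(x, pos, inv_freq):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = pos * freq
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.append(x0 * math.cos(angle) - x1 * math.sin(angle))
        out.append(x0 * math.sin(angle) + x1 * math.cos(angle))
    return out


def depth_scale(num_layers):
    """Depth-dependent init multiplier 1/sqrt(2L)."""
    return 1.0 / math.sqrt(2.0 * num_layers)


if __name__ == "__main__":
    inv_freq = rope_inv_freq(head_dim=4)
    print(inv_freq)
    print(rope_rotate([1.0, 2.0, 3.0, 4.0], pos=5, inv_freq=inv_freq))
    print(depth_scale(8))
```

A large `theta` makes the slowest frequencies rotate very slowly, which is the usual motivation for raising it in long-context training. The rotation itself never changes a vector's norm.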
173
+
174
+ ---
175
+ # Architecture Overview
176
+
177
+ ## Core Data Flow
178
+
179
+ ```
180
+ Token IDs
181
+ ↓
182
+ Embedding Layer
183
+ ↓
184
+ Anchor P Snapshot
185
+ ↓
186
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
187
+ β”‚ Loop (i < num_loops) β”‚
188
+ β”‚ β”‚
189
+ β”‚ Quasar Block β”‚
190
+ β”‚ ↓ β”‚
191
+ β”‚ GLA Block β”‚
192
+ β”‚ ↓ β”‚
193
+ β”‚ SMEBU MoE β”‚
194
+ β”‚ ↓ β”‚
195
+ β”‚ Inject Anchor P (Residual Conditioning) β”‚
196
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
197
+ ↓
198
+ Next Loop Iteration (state updated)
199
+
200
+ Final Loop Output
201
+ ↓
202
+ RMSNorm
203
+ ↓
204
+ LM Head
205
+ ↓
206
+ Logits
207
+ ```
208
+
209
+ ---
210
+
211
+ ## Latent Memory Update Path
212
+
213
+ ```
214
+ Hidden States
215
+ ↓
216
+ Layer Normalization (RMSNorm)
217
+ ↓
218
+ Segment Compressor
219
+ ↓
220
+ Segment Representation (Z)
221
+ ↓
222
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β†’ Saliency Gate (importance scoring)
223
+ β”‚ ↓
224
+ β”‚ Write Signal
225
+ β”‚ ↓
226
+ └──────────────→ Memory Write Operation
227
+ ↓
228
+ Persistent Memory Bank (M)
229
+ ↓
230
+ Updated Memory (M')
231
+ ↓
232
+ Memory Read Module
233
+ ↓
234
+ Memory-Augmented Hidden State
235
+ ↓
236
+ Output
237
+ ```
238
+
239
+ ---
240
+
241
+ ## SMEBU MoE Stability Flow
242
+
243
+ ```
244
+ Router Network
245
+ ↓
246
+ Token Routing Scores
247
+ ↓
248
+ * Global Bias Buffer (non-trainable stability path)
249
+ ↓
250
+ Top-K Expert Selection
251
+ ↓
252
+ Selected Experts
253
+ ↓
254
+ Expert Output Aggregation
255
+ ↓
256
+ Final MoE Output
257
+ ↓
258
+ Post-Loop Bias Update (vectorized, stabilized)
259
+ ```
260
+
261
+ ---
262
+
263
+ # Intended Use
264
+
265
+ This model is designed as a **foundation base model** for the Quasar ecosystem and is primarily intended for:
266
+
267
+ - **Bittensor SN24 miners** participating in distributed training and knowledge distillation
268
+ - **Distillation pipelines** transferring capabilities from frontier models (e.g., Qwen, GLM)
269
+ - **Research on long-context architectures**, especially beyond traditional positional encoding limits
270
+ - **Agentic system development**, where persistent memory and long-horizon reasoning are required
271
+
272
+ ---
273
+
274
+ # Next Steps
275
+
276
+ - Training on **SN24** in the coming days
277
+ - Miners distill knowledge into this model
278
+ - Then we go for **Run 2 β€” DroPE training** at **5M tokens**