---
license: apache-2.0
language:
- en
pipeline_tag: any-to-any
---

## **Spectra-561B-27B-512E: Model Overview**

**Omnira Spectra-561B-27B-512E** is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for **real-time audio-visual interaction**, it integrates text, image, video, and audio processing into a single, unified framework.

---

### **Core Architecture**

The model uses a **Mixture-of-Experts (MoE)** structure that activates only a small fraction of its parameters for each token, keeping per-token compute low relative to the total parameter count.

* **Parameters:** 561B total, with **27B activated** per token.
* **Backbone:** Built on **Spectra Architecture**, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
* **Dynamic Computation:** Uses "zero-computation experts" and a PID controller to allocate compute according to the importance of each token.
* **Context Window:** Supports up to **128K tokens**, enabling long-term memory and complex temporal reasoning.

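As a rough illustration of the dynamic-computation idea above, the following toy sketch routes one token across ordinary FFN experts and zero-computation experts. It is not the Spectra implementation: the function names, the choice of identity for the zero-computation experts, the softmax top-k router, and the omission of the PID controller are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(x, router_scores, ffn_experts, num_zero_experts, top_k):
    """Combine the top-k experts' outputs for one token (hypothetical helper).

    Indices >= len(ffn_experts) are zero-computation experts. In this sketch
    they are assumed to return the input unchanged, so they cost no FFN call.
    Gate weights are not renormalized over the selected experts.
    """
    assert len(router_scores) == len(ffn_experts) + num_zero_experts
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    ffn_calls = 0
    for i in chosen:
        if i < len(ffn_experts):
            y = ffn_experts[i](x)  # real computation
            ffn_calls += 1
        else:
            y = x  # zero-computation expert: identity
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, ffn_calls

# Two toy FFN experts followed by two zero-computation experts.
ffn_experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
token = [1.0, 2.0]

# Router scores favour the zero-computation experts: no FFN runs.
easy_out, easy_calls = route_token(token, [0.0, 0.0, 5.0, 5.0], ffn_experts, 2, top_k=2)
# Router scores favour the FFN experts: both run.
hard_out, hard_calls = route_token(token, [5.0, 5.0, 0.0, 0.0], ffn_experts, 2, top_k=2)
print(easy_calls, hard_calls)  # prints "0 2"
```

In this toy setup the number of FFN calls varies per token, which is the mechanism that lets the activated parameter count differ between "easy" and "hard" tokens.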
### **Multimodal Integration**

| Component | Technology | Function |
| --- | --- | --- |
| **Vision Encoder** | **VidEnc** (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. |
| **Audio System** | **Audio-Code-S** | Discretizes audio into semantic and acoustic codebooks at 16.67 Hz. |
| **Streaming Encoder** | **FSMN-based** | Uses Feedforward Sequential Memory Networks for low-latency audio processing. |
| **Fusion Strategy** | **Early-Fusion** | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |

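To make the 16.67 Hz figure in the table concrete, the arithmetic below converts the frame rate into a frame period and a frame count for a given clip length. The helper names are hypothetical, and counting only complete frames (rounding down) is an assumption; the source does not say how partial frames or multiple codebooks map to tokens.

```python
import math

FRAME_RATE_HZ = 16.67  # audio codec frame rate stated in the table above

def frame_period_ms(rate_hz=FRAME_RATE_HZ):
    """Duration covered by one codec frame, in milliseconds."""
    return 1000.0 / rate_hz

def complete_frames(seconds, rate_hz=FRAME_RATE_HZ):
    """Number of complete frames in a clip (assumes partial frames are dropped)."""
    return math.floor(seconds * rate_hz)

print(round(frame_period_ms(), 1))  # roughly 60 ms per frame
print(complete_frames(1))           # 16 complete frames in one second
```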
---

### **Key Performance Highlights**

* **Omni-Modality:** Outperforms open-source rivals (e.g., Qwen3-Omni) on **OmniBench** (61.38) and **WorldSense** (60.89).
* **Real-Time Interaction:** Achieves **millisecond-level latency** for streaming speech generation and video interaction.
* **Vision & Video:** Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
* **Audio Excellence:** Shows superior performance in Mandarin and English ASR (Automatic Speech Recognition) compared to GPT-4o-Audio.

### **Training & Efficiency**

1. **Curriculum-Inspired Training:** A six-stage pipeline that starts with text-only data and gradually adds speech, image, and finally video data.
2. **Modality-Decoupled Parallelism (MDP):** Separates encoder and LLM optimization to maintain over **90% of text-only training throughput** even with complex multimodal data.

## Technical Specifications

| **Feature** | **Specification** |
| --- | --- |
| **Total Parameters** | 561B |
| **Activated Parameters** | 27B |
| **Expert Configuration** | 512 Routed Experts |
| **Context Window** | 128K tokens |
| **Primary Tasks** | Audio, Visual, Text, Video-Continuation |

## Quick Start

### Installation

```bash
# install Spectra-Omni environment
conda create -n spectra python=3.10
conda activate spectra

# install dependencies
pip install torch transformers flash_attn
```

### Usage

At 561B total parameters, Spectra requires a multi-node cluster or high-memory instances (e.g., 16×H800) for BF16 inference.

```python
from spectra_omni import SpectraModel

model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-512E")
# The loaded model accepts audio, image, and text inputs.
```
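
The hardware requirement stated in this section can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts only BF16 weights (2 bytes per parameter) and assumes 80 GB of memory per H800; KV cache, activations, and framework overhead are ignored, so the real requirement is higher.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes). BF16 is 2 bytes."""
    return num_params * bytes_per_param / 1e9

TOTAL_PARAMS = 561e9    # from the specification table
ACTIVE_PARAMS = 27e9    # activated per token
NUM_GPUS = 16           # the 16xH800 example in this section
GPU_MEMORY_GB = 80      # assumption: 80 GB per H800

total_gb = weight_memory_gb(TOTAL_PARAMS)
per_gpu_gb = total_gb / NUM_GPUS
headroom_gb = GPU_MEMORY_GB - per_gpu_gb
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights: {total_gb:.0f} GB total, {per_gpu_gb:.1f} GB per GPU")
print(f"headroom per GPU: {headroom_gb:.1f} GB")
print(f"activated fraction per token: {active_fraction:.1%}")
```

Under these assumptions the weights alone occupy about 70 GB of each 80 GB device, which leaves little room for the KV cache at long context lengths.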

## License Agreement

The **model weights** are released under the **MIT License**. This license does not grant any rights to use Omnira trademarks or patents.

## Citation

```
@misc{omnira2026spectra,
    title={Spectra-561B-27B-512E: Unified Omni-modal Intelligence},
    author={Omnira},
    year={2026},
    url={https://github.com/theomnira/Spectra-Omni},
}
```