---
license: apache-2.0
language:
- en
pipeline_tag: any-to-any
---
## **Spectra-561B-27B-512E: Model Overview**
**Omnira Spectra-561B-27B-512E** is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for **real-time audio-visual interaction**, it integrates text, image, video, and audio processing into a single, unified framework.

---
### **Core Architecture**
The model uses a large **Mixture-of-Experts (MoE)** structure: only a small subset of its parameters is active for any given token, which keeps per-token compute well below what the total parameter count suggests.
* **Parameters:** 561B total, with **27B activated** per token.
* **Backbone:** Built on the **Spectra Architecture**, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
* **Dynamic Computation:** Uses "zero-computation experts" and a PID controller to allocate compute according to the importance of each token.
* **Context Window:** Supports up to **128K tokens**, enabling long-term memory and complex temporal reasoning.
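
The dynamic-computation bullet above can be illustrated with a toy simulation. The sketch below is not Spectra's router. It assumes the controlled quantity is the average share of top-k slots that go to real (compute-bearing) experts, uses random Gaussian scores as a stand-in for token importance, and picks hypothetical expert counts, target, and PID gains. A PID controller sets a bias on the zero-computation experts so that the measured share settles at the target; tokens with strong real-expert scores still win real experts, while weaker tokens fall through to zero-computation ones.

```python
import random

# All sizes, the target, and the gains below are hypothetical illustration values.
N_REAL, N_ZERO, TOP_K = 8, 4, 4
TARGET_REAL_FRACTION = 0.5


class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def real_fraction(batch, zero_bias):
    """Share of top-k slots that land on real experts for a batch of tokens."""
    real_slots = 0
    for scores in batch:
        # The first N_REAL scores belong to real experts, the rest to
        # zero-computation experts, whose scores receive the controller bias.
        biased = scores[:N_REAL] + [s + zero_bias for s in scores[N_REAL:]]
        top = sorted(range(len(biased)), key=biased.__getitem__, reverse=True)[:TOP_K]
        real_slots += sum(1 for i in top if i < N_REAL)
    return real_slots / (len(batch) * TOP_K)


def simulate(steps=200, batch_size=256, seed=0):
    rng = random.Random(seed)
    pid = PID(kp=0.5, ki=0.5, kd=0.0)
    zero_bias = 0.0
    history = []
    for _ in range(steps):
        batch = [[rng.gauss(0, 1) for _ in range(N_REAL + N_ZERO)]
                 for _ in range(batch_size)]
        frac = real_fraction(batch, zero_bias)
        history.append(frac)
        # Positive error means too many real experts were used, so the
        # controller raises the bias that favors zero-computation experts.
        zero_bias = pid.step(frac - TARGET_REAL_FRACTION)
    return zero_bias, history


if __name__ == "__main__":
    bias, history = simulate()
    print(f"first step real share: {history[0]:.3f}")
    print(f"last step real share:  {history[-1]:.3f} (bias {bias:.3f})")
```

With no bias, the real-expert share starts near 8/12 because all twelve experts are equally likely to be picked; the controller then raises the bias until the share hovers around the chosen target.
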
### **Multimodal Integration**
| Component | Technology | Function |
| --- | --- | --- |
| **Vision Encoder** | **VidEnc** (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. |
| **Audio System** | **Audio-Code-S** | Discretizes audio into semantic and acoustic codebooks at 16.67 Hz. |
| **Streaming Encoder** | **FSMN-based** | Uses Feedforward Sequential Memory Networks for low-latency audio processing. |
| **Fusion Strategy** | **Early-Fusion** | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |
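
To give a feel for the 16.67 Hz audio frame rate in the table, the short calculation below converts it into a frame duration and a frame count for a clip. Only the frame rate comes from this card; the function names are illustrative, and the number of codebook tokens emitted per frame is not stated here, so the sketch counts frames rather than tokens.

```python
FRAME_RATE_HZ = 16.67  # audio frame rate quoted in the table above


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of one audio frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Approximate number of audio frames for a clip of the given length."""
    return round(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(f"one frame covers about {frame_duration_ms():.1f} ms")
    print(f"10 s of audio is about {frames_for(10)} frames")
    print(f"60 s of audio is about {frames_for(60)} frames")
```

Each frame therefore covers roughly 60 ms of audio, so a minute of speech is on the order of a thousand frames, which is small next to the 128K-token context window.
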
---
### **Key Performance Highlights**
* **Omni-Modality:** Outperforms open-source rivals (e.g., Qwen3-Omni) on **OmniBench** (61.38) and **WorldSense** (60.89).
* **Real-Time Interaction:** Achieves **millisecond-level latency** for streaming speech generation and video interaction.
* **Vision & Video:** Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
* **Audio Excellence:** Outperforms GPT-4o-Audio on Mandarin and English automatic speech recognition (ASR).
### **Training & Efficiency**
1. **Curriculum-Inspired Training:** A 6-stage pipeline starting from text-only, then gradually injecting speech, image, and finally video data.
2. **Modality-Decoupled Parallelism (MDP):** Separates encoder and LLM optimization to maintain over **90% of text-only training throughput** even with complex multimodal data.
## Technical Specifications
| **Feature** | **Specification** |
| --- | --- |
| **Total Parameters** | 561B |
| **Activated Parameters** | 27B |
| **Expert Configuration** | 512 Routed Experts |
| **Context Window** | 128K tokens |
| **Primary Tasks** | Audio, Visual, Text, Video-Continuation |
## Quick Start
### Installation
```bash
# install Spectra-Omni environment
conda create -n spectra python=3.10
conda activate spectra
# install dependencies
pip install torch transformers flash_attn
```
### Usage
At 561B parameters, the BF16 weights alone occupy roughly 1.1 TB, so inference requires a multi-node cluster or a high-memory instance (e.g., 16×H800).
```python
from spectra_omni import SpectraModel
model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-512E")
# Seamlessly process audio, image, and text inputs
```
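
The hardware guidance in this section follows from simple arithmetic on the weight footprint. The estimate below is a back-of-the-envelope sketch: it assumes 2 bytes per parameter for BF16, 80 GB of memory per H800 (an assumption, not a figure from this card), and perfectly even sharding. It ignores the KV cache, activations, and the vision and audio encoders.

```python
TOTAL_PARAMS = 561e9       # total parameters, from the specification table
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each parameter in 16 bits
GPU_COUNT = 16             # the 16xH800 example from this section
GPU_MEMORY_GB = 80         # assumed memory per H800; not stated in this card


def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)
per_gpu_gb = weights_gb / GPU_COUNT
headroom_gb = GPU_COUNT * GPU_MEMORY_GB - weights_gb

if __name__ == "__main__":
    print(f"weights: {weights_gb:.0f} GB total, {per_gpu_gb:.1f} GB per GPU")
    print(f"headroom across {GPU_COUNT} GPUs: {headroom_gb:.0f} GB")
```

Under these assumptions the weights take about 1,122 GB, or roughly 70 GB per GPU, leaving on the order of 158 GB across the cluster for everything else, which is why long-context serving may need more than the minimum configuration.
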
## License Agreement
The **model weights** are released under the **MIT License**. This license does not grant any rights to use Omnira trademarks or patents.
## Citation
```bibtex
@misc{omnira2026spectra,
title={Spectra-561B-27B-512E: Unified Omni-modal Intelligence},
author={Omnira},
year={2026},
url={https://github.com/theomnira/Spectra-Omni},
}
```