| --- |
| license: apache-2.0 |
| language: |
| - en |
| pipeline_tag: any-to-any |
| --- |
| |
| ## **Spectra-561B-27B-512E: Model Overview** |
|
|
| **Omnira Spectra-561B-27B-512E** is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for **real-time audio-visual interaction**, it integrates text, image, video, and audio processing into a single, unified framework. |
| --- |
|
|
| ### **Core Architecture** |
|
|
| The model uses a **Mixture-of-Experts (MoE)** structure: total capacity scales to 561B parameters, while each token is routed through only a small subset of experts, so per-token compute stays close to that of a 27B model. |
|
|
| * **Parameters:** 561B total, with **27B activated** per token. |
| * **Backbone:** Built on **Spectra Architecture**, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency. |
| * **Dynamic Computation:** Uses "zero-computation experts" together with a PID controller, so that compute is allocated according to the importance of each token. |
| * **Context Window:** Supports up to **128K tokens**, enabling long-term memory and complex temporal reasoning. |
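
This card does not say how the PID controller is wired into the router. As a rough illustration only, the sketch below assumes a single scalar bias on the zero-computation experts' routing logit, and a PID loop that adjusts that bias until the average number of real (FFN) experts used per token reaches a target. Every name here (`PIDController`, `avg_real_experts`, `max_slots`, the gains, and the toy response curve) is hypothetical and not taken from Spectra.

```python
import math
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller; the gains used below are illustrative."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float) -> float:
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def avg_real_experts(zero_bias: float, max_slots: int = 12) -> float:
    """Toy stand-in for a router (hypothetical): a higher bias on the
    zero-computation experts leaves fewer slots for real FFN experts."""
    return max_slots / (1.0 + math.exp(zero_bias))


def tune_zero_expert_bias(target: float, steps: int = 500) -> float:
    """Drive the average real-expert count toward `target` with a PID loop."""
    pid = PIDController(kp=0.05, ki=0.02, kd=0.01)
    bias = 0.0
    for _ in range(steps):
        # Positive error means more compute is being spent than intended,
        # so the controller raises the bias to shed real experts.
        error = avg_real_experts(bias) - target
        bias = pid.step(error)
    return bias
```

Calling `tune_zero_expert_bias(8.0)` returns a bias at which the toy router spends about 8 real experts per token on average. A real system would measure the expert count from routing statistics rather than from a closed-form curve.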
|
|
| ### **Multimodal Integration** |
|
|
| | Component | Technology | Function | |
| | --- | --- | --- | |
| | **Vision Encoder** | **VidEnc** (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. | |
| | **Audio System** | **Audio-Code-S** | Discretizes audio into semantic and acoustic codebooks at 16.67 Hz. | |
| | **Streaming Encoder** | **FSMN-based** | Uses Feedforward Sequential Memory Networks for low-latency audio processing. | |
| **Fusion Strategy** | **Early-Fusion** | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |
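
To make the 16.67 Hz figure in the table above concrete: at that rate, each codec frame covers roughly 60 ms of audio. The helper below is a small arithmetic sketch. The function names are illustrative, and it counts frames only, because the number of codebooks per frame is not stated in this card.

```python
import math

FRAME_RATE_HZ = 16.67  # codec frame rate quoted in the table above


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_audio(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)
```

For example, `frames_for_audio(60.0)` gives the frame count for one minute of speech, which is a lower bound on the audio tokens that clip adds to the context.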
|
|
| --- |
|
|
| ### **Key Performance Highlights** |
|
|
| * **Omni-Modality:** Outperforms open-source rivals (e.g., Qwen3-Omni) on **OmniBench** (61.38) and **WorldSense** (60.89). |
| * **Real-Time Interaction:** Achieves **millisecond-level latency** for streaming speech generation and video interaction. |
| * **Vision & Video:** Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME). |
| * **Audio Excellence:** Outperforms GPT-4o-Audio on Mandarin and English automatic speech recognition (ASR). |
|
|
| ### **Training & Efficiency** |
|
|
| 1. **Curriculum-Inspired Training:** A 6-stage pipeline starting from text-only, then gradually injecting speech, image, and finally video data. |
| 2. **Modality-Decoupled Parallelism (MDP):** Separates encoder and LLM optimization to maintain over **90% of text-only training throughput** even with complex multimodal data. |
|
|
| ## Technical Specifications |
|
|
| | **Feature** | **Specification** | |
| | --- | --- | |
| | **Total Parameters** | 561B | |
| | **Activated Parameters** | 27B | |
| | **Expert Configuration** | 512 Routed Experts | |
| | **Context Window** | 128K tokens | |
| | **Primary Tasks** | Audio, Visual, Text, Video-Continuation | |
|
|
| ## Quick Start |
|
|
| ### Installation |
|
|
| ```bash |
| # install Spectra-Omni environment |
| conda create -n spectra python=3.10 |
| conda activate spectra |
| |
| # install dependencies |
| pip install torch transformers flash_attn |
| |
| ``` |
|
|
| ### Usage |
|
|
| At 561B parameters, Spectra requires a multi-node cluster or a high-memory configuration (e.g., 16×H800) for inference in BF16. |
|
|
| ```python |
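| # Requires the spectra_omni package to be importable in the active environment |
| from spectra_omni import SpectraModel |
| 
| # Load the pretrained checkpoint by its repository ID |
| model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-512E") |
| # The loaded model accepts audio, image, and text inputs |
| 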
| ``` |
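
The hardware note above (16×H800 for BF16 inference) can be sanity-checked with simple arithmetic. The sketch below assumes 2 bytes per parameter for BF16 and 80 GB of memory per H800. It counts model weights only and ignores the KV cache, activations, and encoder overhead, so the real headroom is smaller than the numbers suggest.

```python
import math

TOTAL_PARAMS = 561e9   # total parameters, from this card
BYTES_PER_PARAM = 2    # BF16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 80     # assumption: 80 GB of memory per H800
NUM_GPUS = 16          # example configuration quoted above


def weight_footprint_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS)               # weights only
cluster_gb = GPU_MEMORY_GB * NUM_GPUS                        # aggregate GPU memory
min_gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)
headroom_gb = cluster_gb - weights_gb                        # left for cache and activations
```

Under these assumptions the weights alone occupy about 1.12 TB, which already needs 15 such GPUs. That is consistent with 16 being quoted as a practical example rather than a comfortable margin.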
|
|
| ## License Agreement |
|
|
| The **model weights** are released under the **MIT License**. This license does not grant any rights to use Omnira trademarks or patents. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{omnira2026spectra, |
| title={Spectra-561B-27B-512E: Unified Omni-modal Intelligence}, |
| author={Omnira}, |
| year={2026}, |
| url={https://github.com/theomnira/Spectra-Omni}, |
| } |
| |
| ``` |