---
license: apache-2.0
language:
- en
pipeline_tag: any-to-any
---

## **Spectra-561B-27B-768E: Model Overview**

**Omnira Spectra-561B-27B-768E** is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for **real-time audio-visual interaction**, it integrates text, image, video, and audio processing into a single, unified framework.

---

### **Core Architecture**

The model uses a massive **Mixture-of-Experts (MoE)** structure that balances extreme scale with computational efficiency.

* **Parameters:** 561B total, with **27B activated** per token.
* **Backbone:** Built on the **Spectra Architecture**, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
* **Dynamic Computation:** Uses "zero-computation experts" and a PID controller to allocate resources based on the importance of each token.
* **Context Window:** Supports up to **128K tokens**, enabling long-term memory and complex temporal reasoning.

### **Multimodal Integration**

| Component | Technology | Function |
| --- | --- | --- |
| **Vision Encoder** | **VidEnc** (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. |
| **Audio System** | **Audio-Code-S** | Discretizes audio into semantic and acoustic codebooks at 16.67 Hz. |
| **Streaming Encoder** | **FSMN-based** | Uses Feedforward Sequential Memory Networks for low-latency audio processing. |
| **Fusion Strategy** | **Early-Fusion** | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |

---

### **Key Performance Highlights**

* **Omni-Modality:** Outperforms open-source rivals (e.g., Qwen3-Omni) on **OmniBench** (61.38) and **WorldSense** (60.89).
* **Real-Time Interaction:** Achieves **millisecond-level latency** for streaming speech generation and video interaction.
* **Vision & Video:** Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
* **Audio Excellence:** Shows superior performance in Mandarin and English ASR (Automatic Speech Recognition) compared to GPT-4o-Audio.

### **Training & Efficiency**

1. **Curriculum-Inspired Training:** A 6-stage pipeline starting from text-only, then gradually injecting speech, image, and finally video data.
2. **Modality-Decoupled Parallelism (MDP):** Separates encoder and LLM optimization to maintain over **90% of text-only training throughput** even with complex multimodal data.

## Technical Specifications

| **Feature** | **Specification** |
| --- | --- |
| **Total Parameters** | 561B |
| **Activated Parameters** | 27B |
| **Expert Configuration** | 512 Routed; 256 Zero Experts |
| **Context Window** | 128K tokens |
| **Primary Tasks** | Audio, Visual, Text, Video-Continuation |

## Quick Start

### Installation

```bash
# install Spectra-Omni environment
conda create -n spectra python=3.10
conda activate spectra

# install dependencies
pip install torch transformers flash_attn
```

### Usage

Due to its massive scale (561B), Spectra requires multi-node clusters or high-memory instances (e.g., 16×H800) for inference in BF16.

```python
from spectra_omni import SpectraModel

model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-768E")
# Seamlessly process audio, image, and text inputs
```

## License Agreement

The **model weights** are released under the **Omnira License**. This license does not grant any rights to use Omnira trademarks or patents.

## Citation

```
@misc{omnira2026spectra,
  title={Spectra-561B-27B-768E: Unified Omni-modal Intelligence},
  author={Omnira},
  year={2026},
  url={https://github.com/theomnira/Spectra-Omni},
}
```
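
## Appendix: Rough Sizing Sketch

The hardware note in the Usage section and the 16.67 Hz audio frame rate can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not part of any Spectra tooling: the function and constant names are hypothetical, and the 80 GB per-GPU figure for the H800 is an assumption that does not come from this card. It counts weights only and ignores activations, KV cache, and framework overhead.

```python
# Back-of-the-envelope sizing from figures quoted in this model card.
# H800_MEMORY_GB is an assumption (80 GB per GPU), not a figure from the card.

TOTAL_PARAMS = 561e9         # total parameters (from the card)
ACTIVE_PARAMS = 27e9         # activated parameters per token (from the card)
AUDIO_FRAME_RATE_HZ = 16.67  # audio codebook frame rate (from the card)
BYTES_PER_PARAM_BF16 = 2     # BF16 stores each value in 16 bits
H800_MEMORY_GB = 80          # assumed per-GPU memory
NUM_GPUS = 16                # example cluster size from the Usage section


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def audio_frames(seconds: float, rate_hz: float = AUDIO_FRAME_RATE_HZ) -> int:
    """Number of whole codec frames produced for a clip of the given length."""
    return int(seconds * rate_hz)


if __name__ == "__main__":
    weights_gb = weight_memory_gb(TOTAL_PARAMS)
    cluster_gb = NUM_GPUS * H800_MEMORY_GB
    print(f"BF16 weights: {weights_gb:.0f} GB")
    print(f"Cluster memory: {cluster_gb} GB, headroom: {cluster_gb - weights_gb:.0f} GB")
    print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
    print(f"Frames in 60 s of audio: {audio_frames(60)}")
```

Under these assumptions the BF16 weights alone come to roughly 1,122 GB, which fits within the 1,280 GB of a 16-GPU node group but leaves limited headroom for long-context inference. One audio frame at 16.67 Hz covers about 60 ms.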