---
license: apache-2.0
language:
- en
pipeline_tag: any-to-any
---
Spectra-561B-27B-768E: Model Overview
Omnira Spectra-561B-27B-768E is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for real-time audio-visual interaction, it integrates text, image, video, and audio processing into a single, unified framework.
Core Architecture
The model uses a Mixture-of-Experts (MoE) design in which only a small fraction of the parameters is active for any given token, so per-token compute stays far below what the total parameter count suggests.
- Parameters: 561B total, with 27B activated per token.
- Backbone: Built on the Spectra architecture, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
- Dynamic Computation: Uses "zero-computation experts" and a PID controller to allocate compute according to the importance of each token.
- Context Window: Supports up to 128K tokens, enabling long-term memory and complex temporal reasoning.
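The dynamic-computation idea above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the model's actual router: it assumes a zero-computation expert simply consumes a routing slot without running any network, and that a PID controller adjusts a single bias on the zero-expert logits so that the average number of real experts per token tracks a target. The names `PIDController`, `route`, `simulate`, the gains, and the tiny expert counts (8 real, 4 zero) are all hypothetical.

```python
import random


class PIDController:
    """Textbook positional PID controller (gains are illustrative)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def route(logits, num_real, zero_bias, top_k):
    """Return the indices of the top_k experts.

    Indices >= num_real are treated as zero-computation experts
    (assumption: selecting one costs no expert compute).
    """
    biased = [
        score + (zero_bias if idx >= num_real else 0.0)
        for idx, score in enumerate(logits)
    ]
    ranked = sorted(range(len(biased)), key=lambda idx: biased[idx], reverse=True)
    return ranked[:top_k]


def simulate(steps=100, tokens=200, num_real=8, num_zero=4, top_k=4,
             target_real=2.5, seed=0):
    """Track the average number of real experts used per token at each step."""
    rng = random.Random(seed)
    pid = PIDController(kp=0.2, ki=0.3, kd=0.0)
    zero_bias = 0.0
    history = []
    for _ in range(steps):
        real_selected = 0
        for _ in range(tokens):
            # Random logits stand in for a learned router's scores.
            logits = [rng.gauss(0.0, 1.0) for _ in range(num_real + num_zero)]
            chosen = route(logits, num_real, zero_bias, top_k)
            real_selected += sum(1 for idx in chosen if idx < num_real)
        avg_real = real_selected / tokens
        # Positive error means too much compute: raise the zero-expert bias.
        error = avg_real - target_real
        zero_bias = pid.update(error)
        history.append(avg_real)
    return history


if __name__ == "__main__":
    trace = simulate()
    print("first step:", trace[0], "last step:", trace[-1])
```

In this sketch the per-token expert budget varies (a token may land on zero, one, or several real experts), while the controller keeps the average near the target. How the real model defines token importance and where it applies the controller is not stated in this card.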
Multimodal Integration
| Component | Technology | Function |
|---|---|---|
| Vision Encoder | VidEnc (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. |
| Audio System | Audio-Code-S | Discretizes audio into semantic and acoustic codebook tokens at 16.67 Hz. |
| Streaming Encoder | FSMN-based | Uses Feedforward Sequential Memory Networks for low-latency audio processing. |
| Fusion Strategy | Early-Fusion | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |
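To make the 16.67 Hz codec rate in the table concrete, the short sketch below converts it into a frame duration and a frame count for a clip. It is plain arithmetic; the helper names are hypothetical, and the number of codebook tokens emitted per frame is not given in this card, so the sketch counts frames only.

```python
AUDIO_FRAME_RATE_HZ = 16.67  # codec frame rate quoted in the table above


def frame_duration_ms(rate_hz=AUDIO_FRAME_RATE_HZ):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / rate_hz


def complete_frames(seconds, rate_hz=AUDIO_FRAME_RATE_HZ):
    """Number of complete codec frames covered by a clip of the given length."""
    return int(seconds * rate_hz)


if __name__ == "__main__":
    print(f"one frame is about {frame_duration_ms():.1f} ms")
    print(f"a 60 s clip spans {complete_frames(60)} complete frames")
```

At this rate each frame covers roughly 60 ms of audio, so a minute of speech corresponds to about a thousand frames.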
Key Performance Highlights
- Omni-Modality: Outperforms open-source rivals (e.g., Qwen3-Omni), scoring 61.38 on OmniBench and 60.89 on WorldSense.
- Real-Time Interaction: Achieves millisecond-level latency for streaming speech generation and video interaction.
- Vision & Video: Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
- Audio Excellence: Outperforms GPT-4o-Audio on Mandarin and English automatic speech recognition (ASR).
Training & Efficiency
- Curriculum-Inspired Training: A six-stage pipeline that starts from text-only data, then gradually introduces speech, image, and finally video data.
- Modality-Decoupled Parallelism (MDP): Separates encoder and LLM optimization to maintain over 90% of text-only training throughput even with complex multimodal data.
Technical Specifications
| Feature | Specification |
|---|---|
| Total Parameters | 561B |
| Activated Parameters | 27B |
| Expert Configuration | 512 routed experts; 256 zero-computation experts (768 total) |
| Context Window | 128K tokens |
| Primary Tasks | Audio, Visual, Text, Video-Continuation |
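The figures in the table line up with the model name: 561B total parameters, 27B activated, and 768 experts (512 routed plus 256 zero-computation). The sketch below checks that reading. The naming convention is an inference from the numbers in this card, not a documented scheme, and `parse_model_name` is a hypothetical helper.

```python
import re

# Assumed naming pattern: <family>-<total>B-<active>B-<experts>E
NAME_PATTERN = re.compile(
    r"(?P<family>\w+)-(?P<total>\d+)B-(?P<active>\d+)B-(?P<experts>\d+)E"
)


def parse_model_name(name):
    """Split a name such as 'Spectra-561B-27B-768E' into its numeric fields."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return {
        "family": match["family"],
        "total_params_b": int(match["total"]),
        "active_params_b": int(match["active"]),
        "experts": int(match["experts"]),
    }


if __name__ == "__main__":
    spec = parse_model_name("Spectra-561B-27B-768E")
    active_share = spec["active_params_b"] / spec["total_params_b"]
    print(spec)
    print(f"activated share of parameters: {active_share:.1%}")
```

Under this reading, roughly 5% of the parameters are activated for a given token.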
Quick Start
Installation
```shell
# Create and activate the Spectra-Omni environment
conda create -n spectra python=3.10
conda activate spectra
# Install dependencies
pip install torch transformers flash_attn
```
Usage
Because of its scale (561B parameters), Spectra requires a multi-node cluster or a high-memory instance (e.g., 16×H800) for BF16 inference.
```python
from spectra_omni import SpectraModel

# Load the pretrained checkpoint by its repository ID
model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-768E")
# Seamlessly process audio, image, and text inputs
```
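The hardware note above can be sanity-checked with a back-of-the-envelope estimate. This is a rough sketch, assuming 2 bytes per parameter for BF16 and 80 GB of memory per H800. It counts weights only and ignores the KV cache, activations, and the encoders' runtime buffers. The helper name is hypothetical.

```python
TOTAL_PARAMS = 561_000_000_000  # 561B, from the specification table
BYTES_PER_BF16_PARAM = 2        # BF16 stores each value in 16 bits
GPU_MEMORY_GB = 80              # assumption: 80 GB per H800
NUM_GPUS = 16                   # from the 16xH800 example


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16_PARAM):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    weights_gb = weight_memory_gb(TOTAL_PARAMS)
    cluster_gb = NUM_GPUS * GPU_MEMORY_GB
    print(f"weights: {weights_gb:.0f} GB")
    print(f"cluster: {cluster_gb} GB")
    print(f"headroom for cache and activations: {cluster_gb - weights_gb:.0f} GB")
```

Under these assumptions the weights alone take about 1.1 TB, so 16 devices leave only a modest margin for everything else. Long-context or high-concurrency serving may need more memory than this estimate suggests.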
License Agreement
The model weights are released under the Omnira License. This license does not grant any rights to use Omnira trademarks or patents.
Citation
```bibtex
@misc{omnira2026spectra,
  title={Spectra-561B-27B-768E: Unified Omni-modal Intelligence},
  author={Omnira},
  year={2026},
  url={https://github.com/theomnira/Spectra-Omni},
}
```