	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: any-to-any
	---

	## Spectra-561B-27B-512E: Model Overview

	Omnira Spectra-561B-27B-512E is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for real-time audio-visual interaction, it integrates text, image, video, and audio processing into a single, unified framework.

	---

	### Core Architecture

	The model uses a Mixture-of-Experts (MoE) design that activates only a small fraction of its parameters for each token, so per-token compute stays far below that of a dense model of the same total size.

	* Parameters: 561B total, with 27B activated per token.
	* Backbone: Built on Spectra Architecture, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
	* Dynamic Computation: Uses "zero-computation experts" and a PID-controller to allocate resources based on the importance of each token.
	* Context Window: Supports up to 128K tokens, enabling long-term memory and complex temporal reasoning.
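The dynamic-computation bullet above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the model's actual router: the expert counts, the Gaussian stand-in logits, the PID gains, and the names `PID`, `route`, and `simulate` are all hypothetical. It assumes that a zero-computation expert is a routing slot that costs no expert FLOPs, and that the controller adjusts a shared bias on those slots to hold the average number of real experts per token near a target.

```python
import random


class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def route(scores, bias, num_real, k):
    """Return the real experts among the top-k biased scores.

    Indices >= num_real are zero-computation experts: selecting one
    costs no expert FLOPs, so it is dropped from the returned list.
    """
    biased = [s + b for s, b in zip(scores, bias)]
    top = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
    return [i for i in top if i < num_real]


def simulate(num_real=8, num_zero=4, k=4, target=2.0, steps=100, batch=256, seed=0):
    """Steer the mean number of real experts per token toward `target`."""
    rng = random.Random(seed)
    total = num_real + num_zero
    bias = [0.0] * total
    pid = PID(kp=0.2, ki=0.1, kd=0.0)  # illustrative gains, not from the model card
    avg = 0.0
    for _ in range(steps):
        used = 0
        for _ in range(batch):
            # Stand-in router logits; a real router computes these from the token.
            scores = [rng.gauss(0.0, 1.0) for _ in range(total)]
            used += len(route(scores, bias, num_real, k))
        avg = used / batch
        # Too many real experts (positive error) raises the zero-expert bias,
        # which lets zero-computation experts win more of the top-k slots.
        zero_bias = pid.step(avg - target)
        for i in range(num_real, total):
            bias[i] = zero_bias
    return avg
```

With no bias, this toy setup would use about 4 × 8/12 ≈ 2.67 real experts per token on average; calling `simulate()` lets the controller pull that average toward the target of 2.0. In a real router, tokens with strong logits for real experts would keep more of them, which is one way to read "based on the importance of each token".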

	### Multimodal Integration

	| Component | Technology | Function |
	| --- | --- | --- |
	| Vision Encoder | VidEnc (637M) | Processes images and videos natively; supports arbitrary aspect ratios and resolutions. |
	| Audio System | Audio-Code-S | Discretizes audio into semantic and acoustic codebooks at 16.67 Hz. |
	| Streaming Encoder | FSMN-based | Uses Feedforward Sequential Memory Networks for low-latency audio processing. |
	| Fusion Strategy | Early-Fusion | Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning. |
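To make the 16.67 Hz codec rate in the table concrete, the arithmetic below converts between audio duration and codec frames. It is a rough sketch: the helper names are hypothetical, and `tokens_per_frame=1` is an assumption, since the model card does not say how many context positions one frame of semantic and acoustic codes occupies.

```python
import math

FRAME_RATE_HZ = 16.67  # codec frame rate quoted in the table above


def frame_duration_ms(rate_hz=FRAME_RATE_HZ):
    """Length of one codec frame in milliseconds."""
    return 1000.0 / rate_hz


def audio_frames(seconds, rate_hz=FRAME_RATE_HZ):
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * rate_hz)


def max_audio_seconds(context_tokens, rate_hz=FRAME_RATE_HZ, tokens_per_frame=1):
    """Audio duration that fits in a context budget.

    tokens_per_frame=1 is an assumption, not a documented value.
    """
    return context_tokens / (rate_hz * tokens_per_frame)
```

At this rate each frame spans roughly 60 ms, so one minute of audio is about 1,000 frames. Under the one-token-per-frame assumption, a 128,000-token budget would hold a little over two hours of audio with no room left for text or video.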

	---

	### Key Performance Highlights

	* Omni-Modality: Outperforms open-source rivals (e.g., Qwen3-Omni) on OmniBench (61.38) and WorldSense (60.89).
	* Real-Time Interaction: Achieves millisecond-level latency for streaming speech generation and video interaction.
	* Vision & Video: Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
	* Audio Excellence: Outperforms GPT-4o-Audio on Mandarin and English automatic speech recognition (ASR).

	### Training & Efficiency

	1. Curriculum-Inspired Training: A 6-stage pipeline starting from text-only, then gradually injecting speech, image, and finally video data.
	2. Modality-Decoupled Parallelism (MDP): Separates encoder and LLM optimization to maintain over 90% of text-only training throughput even with complex multimodal data.

	## Technical Specifications

	| Feature | Specification |
	| --- | --- |
	| Total Parameters | 561B |
	| Activated Parameters | 27B |
	| Expert Configuration | 512 Routed Experts |
	| Context Window | 128K tokens |
	| Primary Tasks | Audio, Visual, Text, Video-Continuation |

	## Quick Start

	### Installation

	```bash
	# create and activate the Spectra-Omni environment
	conda create -n spectra python=3.10
	conda activate spectra

	# install dependencies
	pip install torch transformers flash-attn
	```

	### Usage

	With 561B total parameters, BF16 inference with Spectra requires a multi-node cluster or a large multi-GPU deployment (e.g., 16×H800).

	```python
	from spectra_omni import SpectraModel

	model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-512E")
	# Seamlessly process audio, image, and text inputs

	```
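The 16×H800 figure quoted above can be sanity-checked with simple arithmetic. The sketch below counts weight memory only and assumes 80 GB of memory per H800; the helper names are hypothetical. Although only 27B parameters are active per token, the estimate uses the full 561B because every expert has to be resident for the router to choose among them.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Memory for the weights alone; BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param


def min_gpus(total_bytes, gpu_bytes):
    """Smallest GPU count whose combined memory holds `total_bytes` (ceiling division)."""
    return -(-total_bytes // gpu_bytes)


TOTAL_PARAMS = 561_000_000_000
H800_BYTES = 80_000_000_000  # assumption: 80 GB per H800

weights = weight_bytes(TOTAL_PARAMS)
headroom = 16 * H800_BYTES - weights

print(weights / 1e9)                  # 1122.0 GB of BF16 weights
print(min_gpus(weights, H800_BYTES))  # 15 GPUs for the weights alone
print(headroom / 1e9)                 # 158.0 GB left across 16 GPUs
```

The remaining headroom on 16 GPUs has to cover the KV cache, activations, and the vision and audio encoders, so long-context or high-concurrency serving may need more devices than this lower bound suggests.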

	## License Agreement

	The model weights are released under the MIT License. This license does not grant any rights to use Omnira trademarks or patents.

	## Citation

	```bibtex
	@misc{omnira2026spectra,
	  title={Spectra-561B-27B-512E: Unified Omni-modal Intelligence},
	  author={Omnira},
	  year={2026},
	  url={https://github.com/theomnira/Spectra-Omni},
	}
	```