---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- sae
- interpretability
- audio
- voice
- speech
- majestrino
- whisper
datasets:
- laion/majestrino-data
base_model: laion/Majestrino-1.00
pipeline_tag: feature-extraction
---

# Majestrino 1.00 Sparse Autoencoder (16x, k=5)

A **Top-K Sparse Autoencoder** trained on [Majestrino 1.00](https://huggingface.co/laion/Majestrino-1.00) voice/audio embeddings. It decomposes 768-dimensional audio embeddings into **12,288 interpretable features** covering emotions, speaking styles, languages, vocal qualities, and more.

## Key Numbers

| | |
|---|---|
| **Input dimension** | 768 (Majestrino 1.00 embedding) |
| **Dictionary size** | 12,288 features (16x expansion) |
| **Active features per input** | 5 (top-k) |
| **Parameters** | 18.9M |
| **Training data** | 7.6M embeddings from [majestrino-data](https://huggingface.co/datasets/laion/majestrino-data) |
| **Training epochs** | 30 |
| **Best validation MSE** | 0.000116 |
| **Annotated features** | 9,575 / 12,288 (77.9%) |
| **Semantic groups** | 14 |

## Feature Groups

Each of the 9,575 annotated features has been classified into one of 14 semantic groups (183 features belong to 2 groups):

| # | Group | Features | Description |
|---|-------|----------|-------------|
| 1 | **Sound Effects** | 98 | Non-speech sounds: impacts, clicks, mechanical noises, foley |
| 2 | **Music & Singing** | 216 | Singing, instruments, rap, humming, melodies |
| 3 | **Recording / Technical** | 26 | Microphone type, reverb, compression, audio quality |
| 4 | **Environmental / Ambient** | 194 | Background noise, crowd, traffic, weather, room tone |
| 5 | **Vocal Bursts** | 998 | Laughter, crying, gasping, sighing, coughing, screaming |
| 6 | **Cognitive States** | 369 | Hesitation, filler words, confusion, uncertainty |
| 7 | **Speed / Tempo** | 80 | Speech rate, pacing, cadence, rhythm |
| 8 | **Vocal Register** | 154 | Falsetto, vocal fry, pitch range, chest/head voice |
| 9 | **Languages** | 1,533 | Language identity (French, Arabic, Japanese, etc.) |
| 10 | **Accents / Slang** | 228 | Regional pronunciation, dialect, AAVE, code-switching |
| 11 | **Emotions (EmoNet 40)** | 1,760 | 40 emotion categories: joy, anger, fear, sadness, etc. |
| 12 | **Talking Styles** | 3,452 | Narration, broadcast, whisper, theatrical, casual, didactic |
| 13 | **Character Archetypes** | 303 | Villain, mentor, child, gamer, military commander |
| 14 | **Timbre & Speaker Qualities** | 347 | Raspy, nasal, smooth, breathy, warm, deep, bright |

## Quick Start

### Install dependencies

```bash
pip install torch huggingface_hub transformers torchaudio safetensors
```

### Load the SAE

```python
from sae import SparseAutoencoder

# Download from HuggingFace and load
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae")
sae.eval()
```

### Full pipeline: Audio → Majestrino embedding → SAE features

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from sae import SparseAutoencoder
import json

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── Step 1: Load Majestrino 1.00 base model ──
class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768
        self.projector = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.projector(out), p=2, dim=1)

majestrino = MajestrinoCLAP().to(DEVICE).eval()

# Load weights (note: key remapping audio_proj -> projector)
weights_path = hf_hub_download("laion/Majestrino-1.00", "model.safetensors")
state_dict = load_file(weights_path)
remapped = {k.replace("audio_proj.", "projector."): v for k, v in state_dict.items()}
majestrino.load_state_dict(remapped, strict=False)

# ── Step 2: Load SAE ──
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae", device=DEVICE)

# ── Step 3: Load annotations ──
annotations_path = hf_hub_download("laion/majestrino-1.00-16xk5-sae", "annotations.json")
with open(annotations_path) as f:
    annotations = json.load(f)  # dict: feature_id_str -> {title, description, ...}

# ── Step 4: Process audio ──
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

waveform, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # mono

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
mel = inputs.input_features.to(DEVICE)

with torch.no_grad():
    embedding = majestrino.encode_audio(mel)  # (1, 768)
    recons, info = sae(embedding)  # top-k decomposition

top_indices = info["inds"][0].cpu().tolist()  # 5 feature indices
top_values = info["vals"][0].cpu().tolist()  # 5 activation values

print("Active features:")
for idx, val in zip(top_indices, top_values):
    ann = annotations.get(str(idx), {})
    title = ann.get("title", "Unknown")
    print(f"  Feature {idx}: {title} (activation={val:.4f})")
```

### Example output

```
Active features:
  Feature 4821: Casual American Male Speech (activation=0.3142)
  Feature 7203: Conversational Narration (activation=0.2891)
  Feature 1156: Standard American English (activation=0.2453)
  Feature 9834: Clear Articulate Delivery (activation=0.1987)
  Feature 3291: Warm Baritone Timbre (activation=0.1654)
```

## Files

```
├── sae.py                      # Standalone SAE class (copy to your project)
├── model/
│   ├── config.json             # Model hyperparameters
│   └── state_dict.pth          # PyTorch weights (73 MB)
├── annotations.json            # 9,575 feature annotations
├── group_assignments.json      # Feature → group mapping
└── reports/
    ├── index.html              # Main feature index (browseable)
    ├── index_groups.html       # Grouped feature view
    └── feature_reports.tar     # 10,684 individual feature pages with audio
```

### Extracting feature reports

```bash
# Download and extract the interactive HTML reports
cd reports/
tar xf feature_reports.tar
# Open index.html in a browser to explore all features
```

## Architecture

```
Input (768-d Majestrino embedding)
  │
  ├─ subtract pre_bias
  │
  ├─ encoder: Linear(768 → 12288, no bias)
  │
  ├─ add latent_bias
  │
  ├─ top-k (k=5): keep 5 largest activations
  │
  ├─ ReLU
  │
  ├─ decoder: Linear(12288 → 768, no bias)
  │
  └─ add pre_bias → reconstruction (768-d)
```

## Training Details

- **Base embeddings**: Majestrino 1.00 (`embedding_0_11` column from [majestrino-data](https://huggingface.co/datasets/laion/majestrino-data))
- **Training samples**: 7,608,199 embeddings
- **Validation samples**: 7,615 embeddings
- **Optimizer**: Adam (lr=1e-4)
- **Loss**: MSE reconstruction + AuxK dead neuron recovery + frequency overactivation penalty (coef=3.0, decay=0.999)
- **Dead features**: 2,713 / 12,288 (22.1%); these are features that never activate and are excluded from annotations
- **Alive & annotated**: 9,575 features with Gemini-generated titles and descriptions

## Annotations

Each annotated feature in `annotations.json` has:

```json
{
  "3400": {
    "bin": 18,
    "bin_name": "Angry & Hostile State",
    "title": "Intense Anger and Frustration",
    "description": "The primary commonality across all positive samples is ...",
    "consistency": "high",
    "reasoning": "..."
  }
}
```

Group assignments in `group_assignments.json`:

```json
{
  "3400": [11],
  "5234": [12, 14]
}
```

Values are lists of group IDs (1-14). Features can belong to multiple groups (183 do).

## Citation

```bibtex
@misc{majestrino-sae-2025,
  title={Sparse Autoencoder for Majestrino 1.00 Voice Embeddings},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/majestrino-1.00-16xk5-sae}
}
```

## License

Apache 2.0
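
## Appendix: Top-K forward pass sketch

The steps listed under Architecture can be read as a short sequence of operations. The sketch below walks through them on a toy 2-dimensional input with 4 features, in plain Python so it runs without any dependencies. It is an illustration only and not the code in `sae.py`: the function name `topk_sae_forward` and the weight names `W_enc` and `W_dec` are hypothetical, and the toy numbers are made up. `pre_bias` and `latent_bias` are taken from the Architecture diagram, and the `inds` / `vals` keys mirror those used in the Quick Start.

```python
import heapq


def topk_sae_forward(x, W_enc, latent_bias, W_dec, pre_bias, k):
    """Toy Top-K SAE forward pass on plain Python lists (illustrative only)."""
    # 1. Subtract pre_bias from the input embedding.
    centered = [xi - bi for xi, bi in zip(x, pre_bias)]

    # 2-3. Encoder (linear, no bias of its own), then add latent_bias.
    pre_acts = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, latent_bias)
    ]

    # 4. Top-k: indices of the k largest pre-activations, largest first.
    inds = heapq.nlargest(k, range(len(pre_acts)), key=pre_acts.__getitem__)

    # 5. ReLU on the kept activations; all other features are implicitly zero.
    vals = [max(pre_acts[i], 0.0) for i in inds]

    # 6-7. Decoder (linear, no bias), then add pre_bias back.
    #      Only the k active features contribute to the reconstruction.
    recon = list(pre_bias)
    for i, v in zip(inds, vals):
        for d in range(len(recon)):
            recon[d] += v * W_dec[i][d]

    return recon, {"inds": inds, "vals": vals}


# Toy setup: 2-d input, 4 dictionary features (the real model is 768-d, 12,288 features).
x = [1.0, 2.0]
pre_bias = [0.5, 0.5]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]  # one row per feature
latent_bias = [0.0, 0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]  # one row per feature

recon, info = topk_sae_forward(x, W_enc, latent_bias, W_dec, pre_bias, k=2)
print(recon, info)
# [1.0, 2.5] {'inds': [1, 3], 'vals': [1.5, 1.0]}
```

With `k=2` only features 1 and 3 survive, so the reconstruction is `pre_bias` plus two scaled decoder rows. The released model does the same thing with `k=5`, which is why each embedding is explained by exactly five feature indices and activation values.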