---
language: en
license: apache-2.0
tags:
  - sparse-autoencoder
  - sae
  - interpretability
  - audio
  - voice
  - speech
  - majestrino
  - whisper
datasets:
  - laion/majestrino-data
base_model: laion/Majestrino-1.00
pipeline_tag: feature-extraction
---

# Majestrino 1.00 Sparse Autoencoder (16x, k=5)

A **Top-K Sparse Autoencoder** trained on [Majestrino 1.00](https://huggingface.co/laion/Majestrino-1.00) voice/audio embeddings. It decomposes 768-dimensional audio embeddings into **12,288 interpretable features** covering emotions, speaking styles, languages, vocal qualities, and more.

## Key Numbers

| | |
|---|---|
| **Input dimension** | 768 (Majestrino 1.00 embedding) |
| **Dictionary size** | 12,288 features (16x expansion) |
| **Active features per input** | 5 (top-k) |
| **Parameters** | 18.9M |
| **Training data** | 7.6M embeddings from [majestrino-data](https://huggingface.co/datasets/laion/majestrino-data) |
| **Training epochs** | 30 |
| **Best validation MSE** | 0.000116 |
| **Annotated features** | 9,575 / 12,288 (77.9%) |
| **Semantic groups** | 14 |
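
The parameter count can be cross-checked against the layer shapes listed in the Architecture section. The snippet below is a quick arithmetic sketch; it assumes the only learned tensors are the two bias-free weight matrices plus the `pre_bias` and `latent_bias` vectors.

```python
# Parameter count implied by the layer shapes in the Architecture section.
# Assumption: the only learned tensors are the encoder and decoder weight
# matrices (no bias) plus the pre_bias and latent_bias vectors.
d_model = 768                      # Majestrino embedding size
expansion = 16                     # dictionary expansion factor
n_features = d_model * expansion   # 12,288 dictionary features

encoder_weights = d_model * n_features
decoder_weights = n_features * d_model
pre_bias = d_model
latent_bias = n_features

total = encoder_weights + decoder_weights + pre_bias + latent_bias
print(f"{n_features:,} features, {total:,} parameters (~{total / 1e6:.1f}M)")
```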

## Feature Groups

Each of the 9,575 annotated features is assigned to at least one of 14 semantic groups (183 features belong to two groups, so the counts below sum to 9,758):

| # | Group | Features | Description |
|---|-------|----------|-------------|
| 1 | **Sound Effects** | 98 | Non-speech sounds: impacts, clicks, mechanical noises, foley |
| 2 | **Music & Singing** | 216 | Singing, instruments, rap, humming, melodies |
| 3 | **Recording / Technical** | 26 | Microphone type, reverb, compression, audio quality |
| 4 | **Environmental / Ambient** | 194 | Background noise, crowd, traffic, weather, room tone |
| 5 | **Vocal Bursts** | 998 | Laughter, crying, gasping, sighing, coughing, screaming |
| 6 | **Cognitive States** | 369 | Hesitation, filler words, confusion, uncertainty |
| 7 | **Speed / Tempo** | 80 | Speech rate, pacing, cadence, rhythm |
| 8 | **Vocal Register** | 154 | Falsetto, vocal fry, pitch range, chest/head voice |
| 9 | **Languages** | 1,533 | Language identity (French, Arabic, Japanese, etc.) |
| 10 | **Accents / Slang** | 228 | Regional pronunciation, dialect, AAVE, code-switching |
| 11 | **Emotions (EmoNet 40)** | 1,760 | 40 emotion categories: joy, anger, fear, sadness, etc. |
| 12 | **Talking Styles** | 3,452 | Narration, broadcast, whisper, theatrical, casual, didactic |
| 13 | **Character Archetypes** | 303 | Villain, mentor, child, gamer, military commander |
| 14 | **Timbre & Speaker Qualities** | 347 | Raspy, nasal, smooth, breathy, warm, deep, bright |
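
As a sanity check on the table, the per-group counts should equal the number of annotated features plus the number of features that sit in two groups. A short sketch using only the figures from this card:

```python
# Feature counts per group, copied from the table above.
group_counts = {
    "Sound Effects": 98,
    "Music & Singing": 216,
    "Recording / Technical": 26,
    "Environmental / Ambient": 194,
    "Vocal Bursts": 998,
    "Cognitive States": 369,
    "Speed / Tempo": 80,
    "Vocal Register": 154,
    "Languages": 1533,
    "Accents / Slang": 228,
    "Emotions (EmoNet 40)": 1760,
    "Talking Styles": 3452,
    "Character Archetypes": 303,
    "Timbre & Speaker Qualities": 347,
}
annotated = 9575       # annotated features
in_two_groups = 183    # features assigned to two groups

# Each two-group feature is counted twice in the table.
total_assignments = sum(group_counts.values())
print(total_assignments, annotated + in_two_groups)
```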

## Quick Start

### Install dependencies

```bash
pip install torch huggingface_hub transformers torchaudio safetensors
```

### Load the SAE

```python
# sae.py ships with this repository; copy it next to your script first
from sae import SparseAutoencoder

# Download the weights from Hugging Face and load them
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae")
sae.eval()
```

### Full pipeline: Audio → Majestrino embedding → SAE features

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from sae import SparseAutoencoder
import json

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── Step 1: Load Majestrino 1.00 base model ──

class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768
        self.projector = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.projector(out), p=2, dim=1)

majestrino = MajestrinoCLAP().to(DEVICE).eval()

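# Load weights (note: key remapping audio_proj -> projector)
weights_path = hf_hub_download("laion/Majestrino-1.00", "model.safetensors")
state_dict = load_file(weights_path)
remapped = {k.replace("audio_proj.", "projector."): v for k, v in state_dict.items()}
# strict=False tolerates key mismatches, so report them instead of ignoring them silently
result = majestrino.load_state_dict(remapped, strict=False)
if result.missing_keys or result.unexpected_keys:
    print(f"missing keys: {len(result.missing_keys)}, unexpected keys: {len(result.unexpected_keys)}")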

# ── Step 2: Load SAE ──

sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae", device=DEVICE)
sae.eval()

# ── Step 3: Load annotations ──

annotations_path = hf_hub_download("laion/majestrino-1.00-16xk5-sae", "annotations.json")
with open(annotations_path) as f:
    annotations = json.load(f)  # dict: feature_id_str -> {title, description, ...}

# ── Step 4: Process audio ──

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

waveform, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # mono

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
mel = inputs.input_features.to(DEVICE)

with torch.no_grad():
    embedding = majestrino.encode_audio(mel)       # (1, 768)
    recons, info = sae(embedding)                   # top-k decomposition
    top_indices = info["inds"][0].cpu().tolist()     # 5 feature indices
    top_values = info["vals"][0].cpu().tolist()      # 5 activation values

print("Active features:")
for idx, val in zip(top_indices, top_values):
    ann = annotations.get(str(idx), {})
    title = ann.get("title", "Unknown")
    print(f"  Feature {idx}: {title} (activation={val:.4f})")
```

### Example output

```
Active features:
  Feature 4821: Casual American Male Speech (activation=0.3142)
  Feature 7203: Conversational Narration (activation=0.2891)
  Feature 1156: Standard American English (activation=0.2453)
  Feature 9834: Clear Articulate Delivery (activation=0.1987)
  Feature 3291: Warm Baritone Timbre (activation=0.1654)
```
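
To gauge how much of an embedding the five active features capture, one option is to compare the reconstruction with the input. The helper below is a hypothetical sketch (`reconstruction_report` is not part of `sae.py`); it reports the element-wise MSE and the mean cosine similarity for a batch, and runs on stand-in tensors so it works without the models loaded.

```python
import torch
import torch.nn.functional as F

def reconstruction_report(embedding: torch.Tensor, recons: torch.Tensor) -> dict:
    """Compare SAE input and output, both of shape (batch, dim)."""
    mse = F.mse_loss(recons, embedding).item()  # mean over all elements
    cosine = F.cosine_similarity(recons, embedding, dim=1).mean().item()
    return {"mse": mse, "cosine": cosine}

# Stand-in tensors so the sketch runs on its own.
x = F.normalize(torch.randn(4, 768), dim=1)
noisy = x + 0.01 * torch.randn_like(x)
print(reconstruction_report(x, noisy))
```

In the pipeline above, the call would be `reconstruction_report(embedding, recons)`.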

## Files

```
├── sae.py                     # Standalone SAE class (copy to your project)
├── model/
│   ├── config.json            # Model hyperparameters
│   └── state_dict.pth         # PyTorch weights (73 MB)
├── annotations.json           # 9,575 feature annotations
├── group_assignments.json     # Feature → group mapping
└── reports/
    ├── index.html             # Main feature index (browseable)
    ├── index_groups.html      # Grouped feature view
    └── feature_reports.tar    # 10,684 individual feature pages with audio
```

### Extracting feature reports

```bash
# Extract the interactive HTML reports (run from a local copy of this repository)
cd reports/
tar xf feature_reports.tar
# Open index.html in a browser to explore all features
```
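
The same extraction can be done from Python, which is convenient when the archive was fetched programmatically. This is a minimal sketch using only the standard library; `extract_reports` is a hypothetical helper name, and the path check is a precaution added here rather than something the repository requires.

```python
import tarfile
from pathlib import Path

def extract_reports(tar_path: str, out_dir: str) -> int:
    """Extract a report archive into out_dir and return the number of members."""
    out = Path(out_dir).resolve()
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would land outside out_dir.
            target = (out / member.name).resolve()
            if out != target and out not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(out)
    return len(members)
```

The archive path could come from `hf_hub_download("laion/majestrino-1.00-16xk5-sae", "reports/feature_reports.tar")`, assuming the file sits at the location shown in the tree above.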

## Architecture

```
Input (768-d Majestrino embedding)
  │
  ├─ subtract pre_bias
  │
  ├─ encoder: Linear(768 → 12288, no bias)
  │
  ├─ add latent_bias
  │
  ├─ top-k (k=5): keep 5 largest activations
  │
  ├─ ReLU
  │
  ├─ decoder: Linear(12288 → 768, no bias)
  │
  └─ add pre_bias → reconstruction (768-d)
```
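
The diagram can be read as the following PyTorch forward pass. This is an illustrative sketch that follows the steps above in order; it is not the class shipped in `sae.py`. The name `TopKSAESketch` is hypothetical, and the returned `inds`/`vals` keys simply mirror how the Quick Start reads the output.

```python
import torch
import torch.nn as nn

class TopKSAESketch(nn.Module):
    """Illustrative Top-K SAE; not the class shipped in sae.py."""

    def __init__(self, d_model: int = 768, n_features: int = 12288, k: int = 5):
        super().__init__()
        self.k = k
        self.pre_bias = nn.Parameter(torch.zeros(d_model))
        self.latent_bias = nn.Parameter(torch.zeros(n_features))
        self.encoder = nn.Linear(d_model, n_features, bias=False)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # subtract pre_bias, encode, add latent_bias
        pre_acts = self.encoder(x - self.pre_bias) + self.latent_bias
        # keep the k largest activations, then apply ReLU to them
        vals, inds = torch.topk(pre_acts, self.k, dim=-1)
        vals = torch.relu(vals)
        # scatter the k values back into an otherwise-zero latent vector
        latents = torch.zeros_like(pre_acts).scatter(-1, inds, vals)
        # decode and add pre_bias back
        recons = self.decoder(latents) + self.pre_bias
        return recons, {"inds": inds, "vals": vals}

sae_sketch = TopKSAESketch()
recons, info = sae_sketch(torch.randn(2, 768))
print(recons.shape, info["inds"].shape)
```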

## Training Details

- **Base embeddings**: Majestrino 1.00 (`embedding_0_11` column from [majestrino-data](https://huggingface.co/datasets/laion/majestrino-data))
- **Training samples**: 7,608,199 embeddings
- **Validation samples**: 7,615 embeddings
- **Optimizer**: Adam (lr=1e-4)
- **Loss**: MSE reconstruction + AuxK dead neuron recovery + frequency overactivation penalty (coef=3.0, decay=0.999)
- **Dead features**: 2,713 / 12,288 (22.1%). These never activate and are excluded from annotations
- **Alive & annotated**: 9,575 features with Gemini-generated titles and descriptions
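
The card does not spell out how dead features were detected during training. One simple way to reproduce the idea on your own data is to count how often each feature index appears among the top-k indices and flag those that never do. The helper below is a hypothetical sketch on a toy example.

```python
import torch

def dead_feature_mask(inds: torch.Tensor, n_features: int) -> torch.Tensor:
    """Return a boolean mask that is True for features that never appear in inds."""
    counts = torch.bincount(inds.reshape(-1), minlength=n_features)
    return counts == 0

# Toy example: 3 inputs, k=2 active indices each, dictionary of 8 features.
inds = torch.tensor([[0, 3], [3, 5], [0, 7]])
mask = dead_feature_mask(inds, n_features=8)
print(mask.nonzero().flatten().tolist())  # [1, 2, 4, 6]
```

With the real SAE, `inds` would be the stacked `info["inds"]` tensors from many inputs and `n_features` would be 12288.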

## Annotations

Each annotated feature in `annotations.json` has:

```json
{
  "3400": {
    "bin": 18,
    "bin_name": "Angry & Hostile State",
    "title": "Intense Anger and Frustration",
    "description": "The primary commonality across all positive samples is ...",
    "consistency": "high",
    "reasoning": "..."
  }
}
```
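
Because the annotations are a plain dictionary, searching them takes a few lines. The sketch below uses a hypothetical helper, `find_features`, that matches a keyword against titles and descriptions and optionally filters on `consistency`. The second toy entry and the value `"low"` are invented for illustration.

```python
def find_features(annotations, keyword, consistency=None):
    """Return IDs of features whose title or description contains keyword."""
    keyword = keyword.lower()
    hits = []
    for feature_id, ann in annotations.items():
        text = f"{ann.get('title', '')} {ann.get('description', '')}".lower()
        if keyword in text and (consistency is None or ann.get("consistency") == consistency):
            hits.append(feature_id)
    return hits

# Toy annotations in the format shown above; the second entry is invented.
toy_annotations = {
    "3400": {"title": "Intense Anger and Frustration", "description": "...", "consistency": "high"},
    "42": {"title": "Hypothetical Calm Narration", "description": "...", "consistency": "low"},
}
print(find_features(toy_annotations, "anger", consistency="high"))  # ['3400']
```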

Group assignments in `group_assignments.json`:

```json
{
  "3400": [11],
  "5234": [12, 14]
}
```

Values are lists of group IDs (1-14). Features can belong to multiple groups (183 do).
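
To list every feature in a group, the mapping can be inverted. The following sketch uses the two example entries above; `features_by_group` is a hypothetical helper name.

```python
from collections import defaultdict

def features_by_group(group_assignments):
    """Invert {feature_id: [group_ids]} into {group_id: [feature_ids]}."""
    by_group = defaultdict(list)
    for feature_id, group_ids in group_assignments.items():
        for group_id in group_ids:
            by_group[group_id].append(feature_id)
    return dict(by_group)

# The two entries from the example above.
group_assignments = {"3400": [11], "5234": [12, 14]}
by_group = features_by_group(group_assignments)
multi = [fid for fid, gids in group_assignments.items() if len(gids) > 1]
print(by_group)  # {11: ['3400'], 12: ['5234'], 14: ['5234']}
print(multi)     # ['5234']
```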

## Citation

```bibtex
@misc{majestrino-sae-2025,
  title={Sparse Autoencoder for Majestrino 1.00 Voice Embeddings},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/majestrino-1.00-16xk5-sae}
}
```

## License

Apache 2.0