---
license: apache-2.0
tags:
  - music
  - audio
  - quality-estimation
  - mert
  - mlp
  - regression
  - music-information-retrieval
pipeline_tag: audio-classification
library_name: pytorch
datasets:
  - treadon/fma-mert-embeddings
language:
  - en
metrics:
  - mae
  - spearmanr
---

# Banger Scorer


> Follow [**@treadon on X**](https://x.com/treadon) and [**treadon on Hugging Face**](https://huggingface.co/treadon) for more AI experiments, evals, and projects.

**Rate how good a song sounds, 0-10.** A lightweight MLP trained on top of frozen [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) embeddings to predict music quality from audio. It is trained on FMA-Small play counts, so the score is a popularity proxy, and it is designed for relative ranking of AI-generated songs within a batch.

![Global scatter: predicted vs actual banger scores](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/global_scatter.png)

## Model Description

The banger scorer takes a raw audio waveform, encodes it through MERT-v1-330M (frozen, 330M params) to produce a 1024-dimensional embedding, then passes that embedding through a small trained MLP to output a single scalar score from 0 to 10.

The core use case: generate a batch of songs with a model like [ACE-Step](https://huggingface.co/ACE-Step/Ace-Step1.5), score them all automatically, and keep only the top-scoring tracks. Bangers only.

```
Audio waveform (24kHz, 30s)
    --> MERT-v1-330M (frozen, 330M params)
    --> 1024-dim embedding (mean-pooled across ~2,250 time frames at MERT's 75 Hz frame rate)
    --> MLP Scorer (~691K trainable params)
    --> Score 0-10
```

## Architecture

**Encoder:** [m-a-p/MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) -- a 24-layer self-supervised music understanding model trained on 160K hours of audio. Completely frozen during training; used only to extract embeddings.

**Scorer head:** MLP with progressive bottleneck:

```
Input: 1024-dim MERT embedding
    --> Linear(1024, 512) + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(512, 256)  + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(256, 128)  + BatchNorm + ReLU + Dropout(0.15)
    --> Linear(128, 1)
Output: scalar score (0-10)
```

- **Total trainable parameters:** ~691K (~2.6 MB in float32)
- **Inference time:** < 1 ms per prediction (MLP only, excludes MERT encoding)
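The size of the head can be sanity-checked by counting parameters straight from the layer shapes listed above. This is a back-of-the-envelope sketch in plain Python, not the training code; the helper names are made up for illustration, and it counts only learnable weights (BatchNorm running statistics are buffers and are left out).

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n_features


def count_mlp_params(dims: list[int]) -> int:
    """Trainable parameters for Linear+BatchNorm hidden blocks and a final Linear."""
    total = 0
    # Hidden blocks: every transition except the last one is followed by BatchNorm.
    for n_in, n_out in zip(dims[:-2], dims[1:-1]):
        total += linear_params(n_in, n_out) + batchnorm_params(n_out)
    # Output layer: plain Linear, no BatchNorm.
    total += linear_params(dims[-2], dims[-1])
    return total


total = count_mlp_params([1024, 512, 256, 128, 1])
print(f"{total:,} trainable parameters, {total * 4 / 2**20:.2f} MiB in float32")
```

Almost all of the parameters sit in the first `Linear(1024, 512)` layer, so the head's size is driven mainly by the encoder's embedding width.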

## Training Data

Trained on [FMA-Small](https://github.com/mdeff/fma) (Free Music Archive), a Creative Commons dataset of 8,000 tracks across 8 balanced genres (1,000 each): Hip-Hop, Pop, Folk, Experimental, Rock, International, Electronic, Instrumental.

**Labels:** Play counts from FMA, log-normalized to a 0-10 scale:

```python
log_listens = np.log1p(df["listens"])
banger_score = (log_listens - log_listens.min()) / (log_listens.max() - log_listens.min()) * 10.0
```
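As a worked illustration of the transform above, here is a dependency-free sketch using `math.log1p` instead of NumPy. The function name and the play counts are hypothetical and chosen only so the arithmetic is easy to follow; they are not FMA data.

```python
import math


def banger_scores(listens: list[int]) -> list[float]:
    """Map raw play counts to 0-10 via log1p followed by min-max scaling."""
    logs = [math.log1p(n) for n in listens]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("need at least two distinct play counts to min-max scale")
    return [(x - lo) / (hi - lo) * 10.0 for x in logs]


# Hypothetical play counts, each roughly 10x the previous one.
example_listens = [0, 9, 99, 999, 9999]
example_scores = banger_scores(example_listens)
for n, s in zip(example_listens, example_scores):
    print(f"{n:>5} listens -> {s:.2f}")
```

Because of the logarithm, each 10x increase in plays adds the same number of points, and the scale's endpoints are pinned to the least and most played tracks in the dataset.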

After log-normalization, the training score distribution is heavily concentrated in the 1-5 range, with very few examples above 7:

![Training score distribution](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/training_distribution.png)

![Training genre distribution](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/training_genre_distribution.png)

**Pre-computed embeddings** are available as a separate dataset: [treadon/fma-mert-embeddings](https://huggingface.co/datasets/treadon/fma-mert-embeddings). MERT embedding extraction took 101 minutes on M4 Pro MPS across 7,997 successfully processed tracks (3 corrupt MP3s failed, 99.96% success rate).

## Training Procedure

- **Split:** 70% train (~5,600) / 15% validation (~1,200) / 15% test (~1,200)
- **Optimizer:** AdamW, lr=1e-3, weight_decay=1e-4
- **Scheduler:** CosineAnnealingLR over 200 max epochs
- **Loss:** MSE
- **Batch size:** 64
- **Early stopping:** Patience 20, best model at **epoch 9** (early stopped at epoch 30)
- **Training time:** ~30 seconds on Apple M4 Pro (MPS) with cached embeddings
- **Device:** Apple M4 Pro, Metal Performance Shaders (MPS)
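The schedule and stopping rule above can be sketched without any framework. This is an illustration under stated assumptions, not the actual training loop: it uses the standard closed-form cosine annealing formula with a minimum learning rate of 0, counts epochs from 1 for early stopping, and runs on a synthetic validation curve invented for the example.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 1e-3, t_max: int = 200, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing; `epoch` is counted from 0."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


def run_early_stopping(val_losses: list[float], patience: int = 20) -> tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic curve (made up): improves through epoch 9, then drifts upward.
synthetic_val = [1.0 - 0.02 * e for e in range(1, 10)] + [0.85 + 0.001 * e for e in range(1, 192)]
best_epoch, stop_epoch = run_early_stopping(synthetic_val)
print(f"best epoch {best_epoch}, stopped at epoch {stop_epoch}")
print(f"learning rate at the best epoch: {cosine_lr(best_epoch):.2e}")
```

The exact stopping epoch depends on how epochs and patience are counted; a loop that checks `bad_epochs > patience` instead of `>=` would stop one epoch later. Either way, the best checkpoint arrives while the cosine schedule has barely moved the learning rate away from its starting value.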

## Evaluation Results

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| **Test MAE** | **0.858** | < 1.5 | 43% below target |
| **Test Spearman** | **0.468** | > 0.4 | Hit target |
| **Val MAE** | **0.822** | -- | -- |
| **Model size** | **~2.6 MB** | < 50 MB | -- |

MAE of 0.858 on a 0-10 scale means predictions are off by less than 1 point on average. For filtering purposes (pick the best 5 from 50), that is enough to separate a 2/10 from a 6/10, but not a 7/10 from an 8/10.
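For readers who want to reproduce the two metrics on their own predictions, the following is a minimal pure-Python sketch of MAE and Spearman rank correlation (Pearson correlation computed on average ranks). The function names and the five example values are invented for illustration; in practice `scipy.stats.spearmanr` is the usual choice.

```python
def mean_absolute_error(y_true: list[float], y_pred: list[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(y_true: list[float], y_pred: list[float]) -> float:
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Made-up labels and predictions; one adjacent pair is ranked in the wrong order.
labels = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [1.5, 1.8, 3.9, 3.2, 4.6]
mae = mean_absolute_error(labels, preds)
rho = spearman(labels, preds)
print(f"MAE = {mae:.2f}, Spearman = {rho:.2f}")
```

MAE measures how far scores are from the labels, while Spearman only looks at ordering, which is the property that matters when the scorer is used to pick the top few tracks from a batch.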

## Test Results: 230 AI-Generated Songs

The scorer was tested on 230 songs generated with ACE-Step 1.5 across 10 genres (20 songs each) plus 1 banger-optimized run (30 songs). Languages included English, Spanish, Hindi, Punjabi, and Chinese.

![Genre ranking by mean score](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/genre_ranking.png)

| Rank | Genre | Mean | Best | Range |
|------|-------|------|------|-------|
| 1 | Punjabi/Bhangra | 3.77 | 4.26 | 2.79-4.26 |
| 2 | Electronic/EDM | 3.71 | **5.29** | 2.80-5.29 |
| 3 | Bollywood | 3.53 | 4.38 | 2.70-4.38 |
| 4 | C-Pop | 3.20 | 4.47 | 2.13-4.47 |
| 5 | Latin/Reggaeton | 3.19 | 3.90 | 2.36-3.90 |
| 6 | Pop/Dance | 3.05 | 4.31 | 1.98-4.31 |
| 7 | Rock/Alternative | 3.03 | 3.66 | 2.02-3.66 |
| 8 | Hip Hop | 2.92 | 3.38 | 2.52-3.38 |
| 9 | Acoustic/Folk | 2.63 | 3.31 | 2.03-3.31 |
| 10 | R&B/Soul | 2.62 | 3.21 | 2.14-3.21 |

**Overall best:** Melodic techno, 130 BPM, Eb minor -- scored 5.29/10 (67th percentile of FMA).

![Score distribution histogram](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/score_histogram.png)

![Box plot of scores by genre](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/global_boxplot.png)

### Optimization Impact

After analyzing the 200 random-parameter songs, a banger-optimized run of 30 songs was generated using only the highest-scoring parameter combinations (dark electronic/industrial styles, 126-138 BPM, minor keys only). Results:

![Optimization impact comparison](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/optimization_impact.png)

| Metric | Random (200 songs) | Optimized (30 songs) | Improvement |
|--------|-------------------|---------------------|-------------|
| Mean score | 3.17 | **3.48** | +10% |
| Songs >= 3.5 | 20% | **60%** | 3x |
| Songs >= 4.0 | 5% | **20%** | 4x |
| Top score | 5.29 | 5.29 | Same ceiling |

The optimization raised typical scores and made high scores far more consistent (4x the hit rate for scores >= 4.0) without raising the ceiling.
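The "Improvement" column follows directly from the reported values, and the hit rates are simply the fraction of songs at or above a threshold. A small sketch, where `hit_rate` and the five sample scores are hypothetical and only the table values come from the results above:

```python
def hit_rate(scores: list[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Values reported in the table above.
random_mean, optimized_mean = 3.17, 3.48
mean_gain = optimized_mean / random_mean - 1   # relative change in mean score
ratio_35 = 0.60 / 0.20                         # hit-rate ratio at >= 3.5
ratio_40 = 0.20 / 0.05                         # hit-rate ratio at >= 4.0
print(f"mean gain: {mean_gain:+.1%}, hit-rate ratios: {ratio_35:.0f}x and {ratio_40:.0f}x")

# Hypothetical scores for a tiny batch, just to show hit_rate in use.
sample_scores = [2.9, 3.6, 4.1, 3.2, 3.8]
print(hit_rate(sample_scores, 3.5), hit_rate(sample_scores, 4.0))
```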

![Hit rate comparison by genre](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/hit_rate_comparison.png)

### Musical Analysis

![Top vs bottom songs comparison](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/top_vs_bottom.png)

**Key findings:**
- Minor keys outperformed major keys (3.26 vs 3.03 mean)
- BPM sweet spots vary by genre: EDM peaks at 126-138, Punjabi at 95-105, Pop at 124-128
- Slower BPMs (< 85) consistently underperformed across all genres

![Major vs minor key analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/major_vs_minor.png)

![Key analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/key_analysis.png)

![BPM vs score](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/bpm_vs_score_global.png)

![BPM/Key heatmap for banger-optimized run](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/per_genre/bangers_bpm_key_heatmap.png)

![Generated vs training score distributions](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/generated_vs_training.png)

## How to Use

```python
import librosa
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoFeatureExtractor

# 1. Load MERT encoder
mert_name = "m-a-p/MERT-v1-330M"
feature_extractor = AutoFeatureExtractor.from_pretrained(mert_name, trust_remote_code=True)
mert = AutoModel.from_pretrained(mert_name, trust_remote_code=True)
mert.eval()

# 2. Define and load the scorer MLP

class BangerScorer(nn.Module):
    def __init__(self, input_dim=1024, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout / 2),
            nn.Linear(128, 1),
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

model_path = hf_hub_download("treadon/banger-scorer", "scorer_model.pt")
scorer = BangerScorer()
# Load onto CPU regardless of the device the checkpoint was saved from
scorer.load_state_dict(torch.load(model_path, map_location="cpu", weights_only=True))
scorer.eval()

# 3. Score an audio file
waveform, _ = librosa.load("song.mp3", sr=24000, mono=True)
waveform = waveform[:24000 * 30]  # truncate to 30s

inputs = feature_extractor(waveform, sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    embedding = mert(**inputs).last_hidden_state.mean(dim=1)  # (1, 1024)
    score = scorer(embedding).item()
    score = max(0, min(10, score))  # clamp to 0-10

print(f"Banger score: {score:.2f}/10")
```

## Intended Use

**Do:** Use the scorer to rank songs relative to each other within a batch. Generate N candidates, score them all, keep the top K.

**Don't:** Treat the score as an absolute quality judgment. A song scoring 3.5 is not objectively "bad" -- it just means the model thinks it is less likely to be popular based on patterns learned from FMA play counts.

The scorer is most useful as a cheap, fast filter to surface promising candidates from a large batch of AI-generated music, reducing the amount of human listening needed.
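The "generate N, keep the top K" workflow reduces to a ranking step once every candidate has a score. A minimal sketch, assuming the scores have already been computed; the helper name, file names, and score values are all hypothetical:

```python
import heapq


def keep_top_k(scored_tracks: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring (name, score) pairs, best first."""
    return heapq.nlargest(k, scored_tracks.items(), key=lambda item: item[1])


# Hypothetical batch of generated tracks with their scorer outputs.
batch = {
    "take_01.mp3": 3.12,
    "take_02.mp3": 4.05,
    "take_03.mp3": 2.77,
    "take_04.mp3": 3.64,
    "take_05.mp3": 3.90,
}
shortlist = keep_top_k(batch, k=2)
for name, score in shortlist:
    print(f"{name}: {score:.2f}")
```

Only the ordering within the batch is used here, which matches the relative-ranking use the scorer is intended for.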

## Limitations

- **Genre bias:** The scorer strongly prefers high-energy, beat-driven music (EDM, Punjabi/Bhangra, Bollywood) over mellow genres (R&B, Acoustic/Folk). This reflects FMA's popularity distribution, not absolute musical quality.
- **FMA popularity != mainstream popularity.** FMA hosts indie and unsigned artists on a free archive. The model learned "music that people actively seek out on a niche platform," not Billboard chart hits.
- **30-second clips.** Songs are scored based on 30-second excerpts. Quality aspects involving full-track structure (build-ups, drops, bridges) are not fully captured.
- **Mean pooling loses temporal info.** By averaging MERT's per-frame outputs, temporal dynamics are collapsed. A song with a brilliant 10-second hook and 20 seconds of noise averages to the same embedding as a consistently mediocre song.
- **Popularity is not quality.** Some brilliant niche music has low play counts; some generic music has millions of plays. The model learns statistical tendencies, not absolute aesthetic truth.
- **Distribution shift.** MERT was trained on real music. AI-generated music may contain subtle artifacts that shift the embedding distribution in ways the scorer was not trained to handle.
- **Training data skew.** Only 45 tracks out of 8,000 score above 7 in the training data. The model learned the 2-5 range well but cannot confidently score anything higher.
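The mean-pooling limitation is easy to see on a toy example. The sketch below uses made-up 2-dimensional "frames" in place of real 1024-dimensional MERT outputs: one clip has a single strong frame followed by silence, the other is uniformly moderate, and both pool to the same vector.

```python
def mean_pool(frames: list[list[float]]) -> list[float]:
    """Average per-frame vectors over time, one value per feature dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


# Toy frames (hypothetical values), shape: time x features.
hook_then_nothing = [[3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
uniformly_moderate = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

pooled_a = mean_pool(hook_then_nothing)
pooled_b = mean_pool(uniformly_moderate)
print(pooled_a, pooled_b, pooled_a == pooled_b)

# Frame order is also invisible to the pooled embedding.
print(mean_pool(hook_then_nothing[::-1]) == pooled_a)
```

Since the scorer only ever sees the pooled vector, any two clips that pool to the same embedding receive the same score.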

![Caption style analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/caption_style_analysis.png)

## Citation

If you use this model, please cite the MERT paper and the FMA dataset:

```bibtex
@article{li2023mert,
  title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and others},
  journal={arXiv preprint arXiv:2306.00107},
  year={2023}
}

@inproceedings{defferrard2017fma,
  title={FMA: A Dataset For Music Analysis},
  author={Defferrard, Micha{\"e}l and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier},
  booktitle={ISMIR},
  year={2017}
}
```

## Model Card Contact

[treadon](https://huggingface.co/treadon) on Hugging Face

## More from me

For other projects and writeups, see [**riteshkhanna.com**](https://riteshkhanna.com), follow [**@treadon on X**](https://x.com/treadon), or [**treadon on Hugging Face**](https://huggingface.co/treadon).