---
license: apache-2.0
tags:
- music
- audio
- quality-estimation
- mert
- mlp
- regression
- music-information-retrieval
pipeline_tag: audio-classification
library_name: pytorch
datasets:
- treadon/fma-mert-embeddings
language:
- en
metrics:
- mae
- spearmanr
---

# Banger Scorer

> Follow [**@treadon on X**](https://x.com/treadon) and [**treadon on Hugging Face**](https://huggingface.co/treadon) for more AI experiments, evals, and projects.

**Rate how good a song sounds, 0-10.** A lightweight MLP trained on top of frozen [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) embeddings to predict music quality from audio. It is trained on FMA-Small play count data and designed for relative ranking of AI-generated songs within a batch.

![Global scatter: predicted vs actual banger scores](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/global_scatter.png)

## Model Description

The banger scorer takes a raw audio waveform, encodes it through MERT-v1-330M (frozen, 330M params) to produce a 1024-dimensional embedding, then passes that embedding through a small trained MLP to output a single scalar score from 0 to 10.

The core use case: generate a batch of songs with a model like [ACE-Step](https://huggingface.co/ACE-Step/Ace-Step1.5), score them all automatically, and keep only the top-scoring tracks. Bangers only.

```
Audio waveform (24kHz, 30s)
    --> MERT-v1-330M (frozen, 330M params)
    --> 1024-dim embedding (mean-pooled across ~2250 time frames)
    --> MLP Scorer (~691K trainable params)
    --> Score 0-10
```

## Architecture

**Encoder:** [m-a-p/MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) -- a 24-layer self-supervised music understanding model trained on 160K hours of audio. Completely frozen during training; used only to extract embeddings.

**Scorer head:** MLP with progressive bottleneck:

```
Input: 1024-dim MERT embedding
    --> Linear(1024, 512) + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(512, 256) + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(256, 128) + BatchNorm + ReLU + Dropout(0.15)
    --> Linear(128, 1)
Output: scalar score (0-10)
```

- **Total trainable parameters:** ~691K (~2.6 MB in float32)
- **Inference time:** < 1ms per prediction (MLP only, excludes MERT encoding)

## Training Data

Trained on [FMA-Small](https://github.com/mdeff/fma) (Free Music Archive), a Creative Commons dataset of 8,000 tracks across 8 balanced genres (1,000 each): Hip-Hop, Pop, Folk, Experimental, Rock, International, Electronic, Instrumental.

**Labels:** Play counts from FMA, log-normalized to a 0-10 scale:

```python
log_listens = np.log1p(df["listens"])
banger_score = (log_listens - log_listens.min()) / (log_listens.max() - log_listens.min()) * 10.0
```

After log-normalization, the training score distribution is heavily concentrated in the 1-5 range, with very few examples above 7:

![Training score distribution](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/training_distribution.png)

![Training genre distribution](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/training_genre_distribution.png)

**Pre-computed embeddings** are available as a separate dataset: [treadon/fma-mert-embeddings](https://huggingface.co/datasets/treadon/fma-mert-embeddings). MERT embedding extraction took 101 minutes on M4 Pro MPS across 7,997 successfully processed tracks (3 corrupt MP3s failed, 99.96% success rate).
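
To make the label mapping concrete, here is a minimal sketch of the same log-normalization applied to a handful of invented play counts. It uses only the standard library in place of NumPy and pandas; the function name `banger_scores`, the example counts, and the midpoint fallback for identical counts are assumptions for illustration, not part of the original training code.

```python
import math

def banger_scores(listens):
    """Map raw play counts to 0-10 scores via log1p followed by min-max scaling."""
    logs = [math.log1p(n) for n in listens]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        # Assumption: if every count is identical the scaling is undefined,
        # so fall back to the midpoint of the scale.
        return [5.0 for _ in logs]
    return [(x - lo) / (hi - lo) * 10.0 for x in logs]

# Hypothetical play counts, not taken from FMA.
example_listens = [0, 10, 100, 1_000, 10_000, 100_000]
example_scores = banger_scores(example_listens)

for n, s in zip(example_listens, example_scores):
    print(f"{n:>7} listens -> {s:.2f}")
```

Because the scale is logarithmic, each tenfold increase in plays adds roughly the same number of points, which is why most tracks end up in the lower-middle of the range and only the most-played tracks approach 10.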

## Training Procedure

- **Split:** 70% train (~5,600) / 15% validation (~1,200) / 15% test (~1,200)
- **Optimizer:** AdamW, lr=1e-3, weight_decay=1e-4
- **Scheduler:** CosineAnnealingLR over 200 max epochs
- **Loss:** MSE
- **Batch size:** 64
- **Early stopping:** Patience 20, best model at **epoch 9** (early stopped at epoch 30)
- **Training time:** ~30 seconds on Apple M4 Pro (MPS) with cached embeddings
- **Device:** Apple M4 Pro, Metal Performance Shaders (MPS)

## Evaluation Results

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| **Test MAE** | **0.858** | < 1.5 | Exceeded by 43% |
| **Test Spearman** | **0.468** | > 0.4 | Hit target |
| **Val MAE** | **0.822** | -- | -- |
| **Model size** | **~2.6 MB** | < 50 MB | -- |

MAE of 0.858 on a 0-10 scale means predictions are typically off by less than 1 point. For filtering purposes (pick the best 5 from 50), the model reliably distinguishes a 2/10 from a 6/10, even if it cannot tell a 7/10 from an 8/10.

## Test Results: 230 AI-Generated Songs

The scorer was tested on 230 songs generated with ACE-Step 1.5 across 10 genres (20 songs each) plus 1 banger-optimized run (30 songs). Languages included English, Spanish, Hindi, Punjabi, and Chinese.
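
The per-genre results reported in this section use three summary statistics: mean, best, and range. A minimal sketch of that aggregation follows, assuming the scores are available as `(genre, score)` pairs; the helper `summarize_by_genre` and the sample data are hypothetical and are not the code or scores behind the reported numbers.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_genre(scored):
    """Group (genre, score) pairs and return one summary row per genre,
    sorted by mean score from highest to lowest."""
    by_genre = defaultdict(list)
    for genre, score in scored:
        by_genre[genre].append(score)
    rows = [
        {"genre": g, "mean": mean(v), "best": max(v), "range": (min(v), max(v))}
        for g, v in by_genre.items()
    ]
    return sorted(rows, key=lambda r: r["mean"], reverse=True)

# Hypothetical scores for illustration only.
sample = [("Electronic", 4.0), ("Electronic", 3.0), ("Folk", 2.0), ("Folk", 3.0)]
summary = summarize_by_genre(sample)

for rank, row in enumerate(summary, start=1):
    lo, hi = row["range"]
    print(f"{rank}. {row['genre']}: mean {row['mean']:.2f}, best {row['best']:.2f}, range {lo:.2f}-{hi:.2f}")
```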

![Genre ranking by mean score](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/genre_ranking.png)

| Rank | Genre | Mean | Best | Range |
|------|-------|------|------|-------|
| 1 | Electronic/EDM | 3.71 | **5.29** | 2.80-5.29 |
| 2 | Punjabi/Bhangra | 3.77 | 4.26 | 2.79-4.26 |
| 3 | Bollywood | 3.53 | 4.38 | 2.70-4.38 |
| 4 | C-Pop | 3.20 | 4.47 | 2.13-4.47 |
| 5 | Latin/Reggaeton | 3.19 | 3.90 | 2.36-3.90 |
| 6 | Pop/Dance | 3.05 | 4.31 | 1.98-4.31 |
| 7 | Rock/Alternative | 3.03 | 3.66 | 2.02-3.66 |
| 8 | Hip Hop | 2.92 | 3.38 | 2.52-3.38 |
| 9 | Acoustic/Folk | 2.63 | 3.31 | 2.03-3.31 |
| 10 | R&B/Soul | 2.62 | 3.21 | 2.14-3.21 |

**Overall best:** Melodic techno, 130 BPM, Eb minor -- scored 5.29/10 (67th percentile of FMA).

![Score distribution histogram](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/score_histogram.png)

![Box plot of scores by genre](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/global_boxplot.png)

### Optimization Impact

After analyzing the 200 random-parameter songs, a banger-optimized run of 30 songs was generated using only the highest-scoring parameter combinations (dark electronic/industrial styles, 126-138 BPM, minor keys only). Results:

![Optimization impact comparison](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/optimization_impact.png)

| Metric | Random (200 songs) | Optimized (30 songs) | Improvement |
|--------|-------------------|---------------------|-------------|
| Mean score | 3.17 | **3.48** | +10% |
| Songs >= 3.5 | 20% | **60%** | 3x |
| Songs >= 4.0 | 5% | **20%** | 4x |
| Top score | 5.29 | 5.29 | Same ceiling |

The optimization raised the floor and consistency dramatically (4x hit rate for scores >= 4.0) without raising the ceiling.
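
The "hit rate" here is the fraction of songs in a run scoring at or above a threshold, and the improvement factor is the ratio of the two rates (for the reported figures, 20% / 5% = 4x at the 4.0 threshold). A small sketch of that calculation, assuming plain lists of scores; `hit_rate` and both score lists are made up for illustration and are not the actual generated-song scores.

```python
def hit_rate(scores, threshold):
    """Fraction of scores at or above the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical score lists; not the real random or optimized runs.
random_scores = [2.1, 2.8, 3.0, 3.2, 3.3, 3.4, 3.6, 3.7, 3.9, 4.2]
optimized_scores = [3.1, 3.4, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.3, 4.4]

for threshold in (3.5, 4.0):
    base = hit_rate(random_scores, threshold)
    tuned = hit_rate(optimized_scores, threshold)
    # Guard against dividing by zero when the baseline run has no hits.
    ratio = tuned / base if base > 0 else float("inf")
    print(f">= {threshold}: {base:.0%} -> {tuned:.0%} ({ratio:.1f}x)")
```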

![Hit rate comparison by genre](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/hit_rate_comparison.png)

### Musical Analysis

![Top vs bottom songs comparison](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/overview/top_vs_bottom.png)

**Key findings:**

- Minor keys outperformed major keys (3.26 vs 3.03 mean)
- BPM sweet spots vary by genre: EDM peaks at 126-138, Punjabi at 95-105, Pop at 124-128
- Slower BPMs (< 85) consistently underperformed across all genres

![Major vs minor key analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/major_vs_minor.png)

![Key analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/key_analysis.png)

![BPM vs score](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/bpm_vs_score_global.png)

![BPM/Key heatmap for banger-optimized run](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/per_genre/bangers_bpm_key_heatmap.png)

![Generated vs training score distributions](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/training/generated_vs_training.png)

## How to Use

```python
import librosa
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoFeatureExtractor

# 1. Load MERT encoder
mert_name = "m-a-p/MERT-v1-330M"
feature_extractor = AutoFeatureExtractor.from_pretrained(mert_name, trust_remote_code=True)
mert = AutoModel.from_pretrained(mert_name, trust_remote_code=True)
mert.eval()

# 2. Load the scorer MLP
class BangerScorer(nn.Module):
    def __init__(self, input_dim=1024, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout / 2),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model_path = hf_hub_download("treadon/banger-scorer", "scorer_model.pt")
scorer = BangerScorer()
# map_location="cpu" lets the checkpoint load on machines without the training device
scorer.load_state_dict(torch.load(model_path, map_location="cpu", weights_only=True))
scorer.eval()  # eval mode: BatchNorm uses running stats, Dropout is disabled

# 3. Score an audio file
waveform, _ = librosa.load("song.mp3", sr=24000, mono=True)
waveform = waveform[:24000 * 30]  # truncate to 30s
inputs = feature_extractor(waveform, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # Mean-pool the per-frame hidden states into a single embedding
    embedding = mert(**inputs).last_hidden_state.mean(dim=1)  # (1, 1024)
    score = scorer(embedding).item()

score = max(0, min(10, score))  # clamp to 0-10
print(f"Banger score: {score:.2f}/10")
```

## Intended Use

**Do:** Use the scorer to rank songs relative to each other within a batch. Generate N candidates, score them all, keep the top K.

**Don't:** Treat the score as an absolute quality judgment. A song scoring 3.5 is not objectively "bad" -- it just means the model thinks it is less likely to be popular based on patterns learned from FMA play counts.

The scorer is most useful as a cheap, fast filter to surface promising candidates from a large batch of AI-generated music, reducing the amount of human listening needed.

## Limitations

- **Genre bias:** The scorer strongly prefers high-energy, beat-driven music (EDM, Punjabi/Bhangra, Bollywood) over mellow genres (R&B, Acoustic/Folk). This reflects FMA's popularity distribution, not absolute musical quality.
- **FMA popularity != mainstream popularity.** FMA hosts indie and unsigned artists on a free archive. The model learned "music that people actively seek out on a niche platform," not Billboard chart hits.
- **30-second clips.** Songs are scored based on 30-second excerpts. Quality aspects involving full-track structure (build-ups, drops, bridges) are not fully captured.
- **Mean pooling loses temporal info.** By averaging MERT's per-frame outputs, temporal dynamics are collapsed. A song with a brilliant 10-second hook and 20 seconds of noise averages to the same embedding as a consistently mediocre song.
- **Popularity is not quality.** Some brilliant niche music has low play counts; some generic music has millions of plays. The model learns statistical tendencies, not absolute aesthetic truth.
- **Distribution shift.** MERT was trained on real music. AI-generated music may contain subtle artifacts that shift the embedding distribution in ways the scorer was not trained to handle.
- **Training data skew.** Only 45 tracks out of 8,000 score above 7 in the training data. The model learned the 2-5 range well but cannot confidently score anything higher.
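
The mean-pooling limitation above can be illustrated with a toy example. The sketch below averages tiny 2-dimensional "frames" (real MERT frames are 1024-dimensional) and shows that reordering frames, or replacing a standout frame plus filler with uniformly flat frames, can produce the same pooled vector. The `mean_pool` helper and all values are invented for illustration.

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]

# Toy frames, invented for illustration only.
hook_then_filler = [[9.0, 1.0], [0.0, 4.0], [0.0, 4.0]]  # one standout frame, then filler
filler_then_hook = [[0.0, 4.0], [0.0, 4.0], [9.0, 1.0]]  # same frames in a different order
uniformly_flat = [[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]]    # no standout frame at all

pooled = [mean_pool(f) for f in (hook_then_filler, filler_then_hook, uniformly_flat)]
print(pooled)

# All three sequences collapse to the same vector, so any scorer that only
# sees the pooled embedding must assign them the same score.
print(pooled[0] == pooled[1] == pooled[2])
```

Pooling schemes that retain some temporal structure (for example, pooling over several segments and concatenating) would distinguish these cases, at the cost of a larger input to the scorer; that is a possible direction rather than something this model does.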

![Caption style analysis](https://raw.githubusercontent.com/treadon/banger-scorer/main/plots/analysis/caption_style_analysis.png)

## Citation

If you use this model, please cite the MERT paper and the FMA dataset:

```bibtex
@article{li2023mert,
  title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and others},
  journal={arXiv preprint arXiv:2306.00107},
  year={2023}
}

@inproceedings{defferrard2017fma,
  title={FMA: A Dataset For Music Analysis},
  author={Defferrard, Micha{\"e}l and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier},
  booktitle={ISMIR},
  year={2017}
}
```

## Model Card Contact

[treadon](https://huggingface.co/treadon) on Hugging Face

## More from me

For other projects and writeups, see [**riteshkhanna.com**](https://riteshkhanna.com), follow [**@treadon on X**](https://x.com/treadon), or [**treadon on Hugging Face**](https://huggingface.co/treadon).