Expressive Jazz Piano Performance Modeling — ISPR 2025–2026

This repository holds four trained checkpoints for the ISPR 2025–2026 project on jazz piano performance modeling: two pretrained PerformanceRNN LSTM variants and two fine-tunes of the publicly released Aria 1B-parameter piano language model (Bradshaw & Colton 2025). All four are fine-tuned on PiJAMA (Edwards et al. 2024). Companion code lives in the submission directory napolitano_antonio_ispr_2025_2026_project/; the report is the accompanying napolitano_antonio_ispr_2025_2026_report.pdf.

Models in this repo

The model names below match the headline table of the report.

`aria-full-quality/` — Aria fine-tune, full-quality (offline)

Aria 1B-parameter LLaMA-3.2-style decoder fine-tuned on the PiJAMA hawthorne split with the default 17 727-id AbsTokenizer (no sustain pedal). Architecture: medium (d=1536, 16 layers, 24 heads, RoPE, GQA, max_seq_len=8192). Best swept mean OA on the test split: 0.911. FMD vs the kong test-pool reference (CLaMP-2 encoder): 272.6.

tested.safetensors — the checkpoint reported on in the paper (4-stage train/val pipeline: retrained on TRAIN+VAL for the patience-selected epoch count, evaluated once on TEST).
deployed.safetensors — full retrain on TRAIN+VAL+TEST for the same epoch count. For deployment / listening only; test metrics are not honestly reportable on this checkpoint because it has seen the test set.
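The relationship between the two checkpoints can be sketched as follows. This is a minimal illustration, not the pipeline script: `select_epochs` and `plan_retrains` are hypothetical names, and the early-stopping rule (stop after `patience` epochs without improvement) is an assumption about what "patience-selected" means.

```python
from typing import Optional, Sequence


def select_epochs(val_losses: Sequence[float], patience: int) -> int:
    """Return the 1-based epoch with the best validation loss.

    Assumed rule: training stops once `patience` consecutive epochs
    fail to improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch


def plan_retrains(train: list, val: list, test: list, n_epochs: int) -> dict:
    """Describe the data pool behind each released checkpoint."""
    evaluate_tested: Optional[list] = test   # evaluated once on TEST
    evaluate_deployed: Optional[list] = None  # has seen TEST, so no honest metric
    return {
        "tested": {"data": train + val, "epochs": n_epochs,
                   "evaluate_on": evaluate_tested},
        "deployed": {"data": train + val + test, "epochs": n_epochs,
                     "evaluate_on": evaluate_deployed},
    }


# Toy validation curve, purely illustrative (not from the report).
n_epochs = select_epochs([1.0, 0.8, 0.7, 0.75, 0.72, 0.9], patience=2)
plan = plan_retrains(["a.mid"], ["b.mid"], ["c.mid"], n_epochs)
```

The point of the split is that the epoch count is fixed before TEST is ever touched, so the deployed retrain reuses it without any further model selection.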

`aria-real-time/` — Aria fine-tune, real-time MLX-compatible

Same backbone, but initialized from the public model-demo.safetensors checkpoint with the residual-stream embedding-projection layer preserved (medium-emb architecture, +1536×512 emb_proj). Fine-tuned on the PiJAMA kong album-aware split with the 2 675-id demo tokenizer, which adds explicit sustain-pedal events. A drop-in jazz replacement for the weights used by the upstream aria/demo/demo_mlx.py iOS sampler. Best swept mean OA: 0.804. FMD: 233.6.

tested.safetensors, deployed.safetensors — same conventions as above.

`lstm-hawthorne/` — pretrained PerformanceRNN, hawthorne split

3-layer stacked LSTM (hidden 512, embed 512, tied I/O head, 6.46M params; paper-faithful PerformanceRNN, Oore et al. 2018). Pretrained on the 820 944-file Aria-MIDI corpus (30k steps) and fine-tuned on the PiJAMA hawthorne split with the 413-id no-pedal vocabulary. Best swept mean OA: 0.664. FMD: 427.8.

tested.pt — Stage-B equivalent (the one reported on).

`lstm-kong-pedal/` — pretrained PerformanceRNN, kong+pedal split

Same architecture but with the 314-id pedal-aware vocabulary (NOTE_ON×88 + NOTE_OFF×88 + TIME_SHIFT×100 + VELOCITY×32 + SUSTAIN_ON/OFF + 4 specials). Fine-tuned on the PiJAMA kong album-aware split. Best swept mean OA: 0.768 (only ≈0.04 below Aria real-time despite a ~150× parameter ratio). FMD: 438.6.
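The vocabulary breakdown and the parameter figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the layers are taken to follow the `torch.nn.LSTM` parametrization (two bias vectors per layer), the tied output head is assumed to add no parameters of its own, and the group names and function names are illustrative rather than taken from the companion code.

```python
# Token groups as listed for the pedal-aware vocabulary.
VOCAB_GROUPS = {
    "NOTE_ON": 88,
    "NOTE_OFF": 88,
    "TIME_SHIFT": 100,
    "VELOCITY": 32,
    "SUSTAIN": 2,   # SUSTAIN_ON and SUSTAIN_OFF
    "SPECIAL": 4,
}


def vocab_size(groups: dict) -> int:
    """Total number of token ids across all groups."""
    return sum(groups.values())


def lstm_param_count(vocab: int, embed: int = 512,
                     hidden: int = 512, layers: int = 3) -> int:
    """Count parameters of an embedding plus a stacked LSTM.

    Assumptions: torch.nn.LSTM-style layers (weight_ih, weight_hh,
    bias_ih, bias_hh) and an output head tied to the embedding matrix
    with no separate bias.
    """
    total = vocab * embed          # embedding, shared with the output head
    input_size = embed
    for _ in range(layers):
        total += 4 * hidden * (input_size + hidden)  # weight_ih + weight_hh
        total += 2 * 4 * hidden                      # bias_ih + bias_hh
        input_size = hidden
    return total


size = vocab_size(VOCAB_GROUPS)
params = lstm_param_count(size)
print(size)                 # 314
print(params)               # 6464512
print(round(1e9 / params))  # 155, using a nominal 1B parameters for Aria
```

Under these assumptions the count rounds to 6.46M, and the ratio to a nominal 1B-parameter model is consistent with the "~150×" quoted above.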

tested.pt (same convention as in lstm-hawthorne/).

Note: a full retrain on TRAIN+VAL+TEST was not performed for the LSTMs (their compute cost is small enough that the 4-stage generalisation-honest pipeline already gives a strong deployment baseline). If you need that variant, the training script in the companion submission directory reproduces it in ≈25 minutes on a single NVIDIA B200 (or comparable GPU).

MLX variants for macOS inference

Each Aria model also has mlx-tested/ and mlx-deployed/ directories containing:

model.safetensors — same weights as the top-level safetensors, laid out for loading via mlx.core.load() on Apple silicon.
config.json — the corresponding Aria model config (medium.json for full-quality, medium-emb.json for real-time).
For aria-real-time/mlx-* only: tokenizer-config.json, the same 2 675-id demo tokenizer the upstream aria/demo/demo_mlx.py uses.

Running on macOS

aria-real-time/mlx-tested/ is a drop-in replacement for the weights expected by the upstream aria/demo/demo_mlx.py (iOS / Apple silicon real-time sampler from EleutherAI/aria). Point that script at model.safetensors and use the bundled tokenizer-config.json:

python aria/demo/demo_mlx.py \
    --checkpoint-path /path/to/mlx-tested/model.safetensors \
    --tokenizer-config /path/to/mlx-tested/tokenizer-config.json

aria-full-quality/mlx-*/ ships the full-quality weights and the medium.json config. The upstream demo_mlx.py hardcodes the medium-emb architecture, so to run these checkpoints on MLX you can either:

Adapt aria.inference.model_mlx.TransformerLM to load medium instead of medium-emb (drop the emb_proj layer), or
Run inference via PyTorch with the MPS backend on macOS, using the top-level tested.safetensors / deployed.safetensors and the default AbsTokenizer (no demo tokenizer config needed).

The full-quality checkpoints are ≈2.5 GB in bf16 — they fit easily on ≥16 GB unified-memory Apple silicon for inference.

Loading from Python

Aria (any variant) on CUDA / ROCm / MPS

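from aria.config import load_model_config
from aria.model import ModelConfig, TransformerLM
from safetensors.torch import load_file

# "medium" for aria-full-quality, "medium-emb" for aria-real-time
model_config = ModelConfig(**load_model_config("medium"))
# The vocabulary size must match the tokenizer: 17727 (AbsTokenizer) or 2675 (demo tokenizer)
model_config.set_vocab_size(17727)
model = TransformerLM(model_config)
# strict=False silently skips missing or unexpected keys; check the value
# returned by load_state_dict if the outputs look wrong
model.load_state_dict(load_file("tested.safetensors"), strict=False)
model.eval()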

Aria on MLX (Apple silicon)

import mlx.core as mx
weights = mx.load("mlx-tested/model.safetensors")
# … then build the MLX TransformerLM as in aria.inference.model_mlx

LSTM

import torch
from src.models.performancernn_lstm import PerformanceRNNLSTM, PerformanceRNNLSTMConfig
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust
ckpt = torch.load("tested.pt", map_location="cpu", weights_only=False)
cfg  = PerformanceRNNLSTMConfig(**ckpt["config"])
model = PerformanceRNNLSTM(cfg)
model.load_state_dict(ckpt["model_state"], strict=True)
model.eval()

(The PerformanceRNNLSTM / PerformanceRNNLSTMConfig definitions live in the companion submission directory under src/models/performancernn_lstm.py.)

Recommended sampling settings

The Stage-C sampling sweep covered the 12 cells T ∈ {0.8, 1.0, 1.2} × top-k ∈ {0, 24} × min-p ∈ {0.035, 0.05} with 4 PiJAMA test prompts × 20 variations per cell. The same cell — temperature = 1.2, top-k = 0 (no truncation), min-p = 0.035 — wins on both Mean OA and FMD for every model in this repo.

| Model | best `(T, k, p)` | Mean OA ↑ | FMD ↓ (CLaMP-2) |
|---|---|---|---|
| `aria-full-quality` | (1.2, 0, 0.035) | 0.911 | 272.6 |
| `aria-real-time` | (1.2, 0, 0.035) | 0.804 | 233.6 |
| `lstm-kong-pedal` | (1.2, 0, 0.035) | 0.768 | 438.6 |
| `lstm-hawthorne` | (1.2, 0, 0.035) | 0.664 | 427.8 |
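For orientation, the sweep grid described above can be enumerated directly. This is only a sketch of the bookkeeping: the constant names are illustrative, and the per-model total assumes every model was swept over the full grid.

```python
from itertools import product

TEMPERATURES = (0.8, 1.0, 1.2)
TOP_KS = (0, 24)            # 0 means no top-k truncation
MIN_PS = (0.035, 0.05)
N_PROMPTS = 4               # PiJAMA test prompts per cell
N_VARIATIONS = 20           # sampled continuations per prompt

# Every (temperature, top_k, min_p) combination is one cell of the sweep.
cells = list(product(TEMPERATURES, TOP_KS, MIN_PS))
samples_per_cell = N_PROMPTS * N_VARIATIONS
samples_per_model = len(cells) * samples_per_cell

print(len(cells))          # 12
print(samples_per_cell)    # 80
print(samples_per_model)   # 960
```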

Three robust observations from the sweep:

Temperature dominates. Bumping T from 0.8 → 1.2 buys +0.18–0.30 absolute OA on Aria at every (k, p) cell and +0.28 on both LSTM splits.
Don't truncate. top-k = 0 (no truncation) beats top-k = 24 by 0.03–0.07 OA at every (T, p) cell — aggressive truncation hurts distributional fidelity on this corpus.
min-p is comparatively flat between 0.035 and 0.05; the smaller value wins by a small margin everywhere.

If you only want a single set of knobs that works across all four models, use temperature=1.2, top_k=0, min_p=0.035.
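To make the three knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and min-p filtering. It is not the sampler used by Aria or by the companion code: the function names are hypothetical, and the order of the filters (top-k, then min-p) is an assumption.

```python
import math
import random


def filtered_probs(logits, temperature=1.2, top_k=0, min_p=0.035):
    """Return renormalized next-token probabilities after filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    keep = [True] * len(probs)
    if top_k > 0:
        # Keep the k most likely tokens (ties at the cutoff are all kept).
        cutoff = sorted(probs, reverse=True)[min(top_k, len(probs)) - 1]
        keep = [p >= cutoff for p in probs]
    if min_p > 0:
        # min-p: drop tokens whose probability is below min_p * max probability.
        threshold = min_p * max(probs)
        keep = [k and p >= threshold for k, p in zip(keep, probs)]

    kept = [p if k else 0.0 for p, k in zip(probs, keep)]
    total = sum(kept)  # never zero: the most likely token always survives
    return [p / total for p in kept]


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Recommended settings from the sweep, applied to toy logits.
token = sample_token([2.0, 1.0, -5.0], temperature=1.2, top_k=0, min_p=0.035)
```

With top_k=0 the only truncation comes from min-p, which scales with the confidence of the top token: a peaked distribution prunes more of the tail than a flat one.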

Reproducibility

All four checkpoints were produced by the pipeline scripts in the companion submission directory (scripts/aria_pipeline_per_variant.sh for the Aria variants, scripts/train_performancernn_lstm_pipeline.sh for the LSTMs). The metrics in the report come from src/eval_aria_metrics.py (OA / KLD) and scripts/fmd_eval_sweeps.py (FMD with the CLaMP-2 music encoder).
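The FMD script itself is not reproduced here. Assuming it follows the usual Fréchet-distance form on CLaMP-2 embeddings, the reported quantity would be the expression below. The notation is introduced for this sketch only: $(\mu_r, \Sigma_r)$ are the mean and covariance of the reference embeddings, and $(\mu_g, \Sigma_g)$ those of the generated samples.

```latex
\mathrm{FMD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Under this form, identical embedding statistics give a distance of zero, and lower values mean the generated set sits closer to the reference pool.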

Citation

If you use these checkpoints, please cite the report and the original PiJAMA, Aria, and PerformanceRNN papers:

Edwards, Dixon and Benetos. PiJAMA: Piano Jazz with Automatic MIDI Annotations. ISMIR Transactions, 6(1):89–102, 2024.
Bradshaw and Colton. Scaling Self-Supervised Representation Learning for Symbolic Piano Performance. arXiv:2506.23869, 2025.
Oore, Simon, Dieleman, Eck, Simonyan. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications, 32:955–967, 2020.


Paper for napaalm/jazz-piano-ispr-2025-2026: Scaling Self-Supervised Representation Learning for Symbolic Piano Performance (paper 2506.23869, published Jun 30, 2025).