Expressive Jazz Piano Performance Modeling: ISPR 2025–2026

This repo holds four trained checkpoints for the ISPR 2025–2026 project on jazz piano performance modeling: two pretrained PerformanceRNN LSTM variants and two fine-tunes of the publicly released Aria 1B-parameter piano language model (Bradshaw & Colton 2025), all four fine-tuned on PiJAMA (Edwards et al. 2024). Companion code lives in the submission directory napolitano_antonio_ispr_2025_2026_project/; the report is the accompanying napolitano_antonio_ispr_2025_2026_report.pdf.

Models in this repo

The model names below match the headline table of the report.

aria-full-quality/: Aria fine-tune, full-quality (offline)

Aria 1B-parameter LLaMA-3.2-style decoder fine-tuned on the PiJAMA hawthorne split with the default 17 727-id AbsTokenizer (no sustain pedal). Architecture: medium (d=1536, 16 layers, 24 heads, RoPE, GQA, max_seq_len=8192). Best swept mean OA on the test split: 0.911. FMD vs the kong test-pool reference (CLaMP-2 encoder): 272.6.

  • tested.safetensors: the checkpoint whose metrics appear in the report (4-stage train/val pipeline: retrained on TRAIN+VAL for the patience-selected epoch count, evaluated once on TEST).
  • deployed.safetensors: full retrain on TRAIN+VAL+TEST for the same epoch count. For deployment / listening only; test metrics cannot be honestly reported for this checkpoint because it has seen the test set.
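
The "patience-selected epoch count" can be read as ordinary early stopping on a validation loss. The sketch below assumes that reading; the function name, the patience value and the loss curve are illustrative and are not taken from the companion code.

```python
def select_epoch_count(val_losses, patience):
    """Return the 1-based epoch with the lowest validation loss, scanning
    until `patience` consecutive epochs fail to improve on the best value."""
    best_epoch, best_loss, stale = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, stale = epoch, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop scanning: no improvement for `patience` epochs
    return best_epoch


# Hypothetical validation curve, one value per epoch.
val_losses = [2.10, 1.85, 1.72, 1.70, 1.71, 1.73, 1.69]
epochs = select_epoch_count(val_losses, patience=2)
print(epochs)
```

Under this reading, the selected count would then be reused unchanged for the TRAIN+VAL retrain and for the TRAIN+VAL+TEST deployment retrain.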

aria-real-time/: Aria fine-tune, real-time MLX-compatible

Same backbone but loaded from the public model-demo.safetensors checkpoint with the residual-stream embedding-projection layer preserved (medium-emb architecture, +1536×512 emb_proj). Trained on the PiJAMA kong album-aware split with the 2 675-id demo tokenizer, which adds explicit sustain-pedal events. Drop-in jazz replacement for the upstream aria/demo/demo_mlx.py iOS sampler. Best swept mean OA: 0.804. FMD: 233.6.

  • tested.safetensors, deployed.safetensors: same conventions as above.

lstm-hawthorne/: pretrained PerformanceRNN, hawthorne split

3-layer stacked LSTM (hidden 512, embed 512, tied I/O head, 6.46M params; paper-faithful PerformanceRNN, Oore et al. 2020). Pretrained on the 820 944-file Aria-MIDI corpus (30k steps) and fine-tuned on the PiJAMA hawthorne split with the 413-id no-pedal vocabulary. Best swept mean OA: 0.664. FMD: 427.8.

  • tested.pt: Stage-B equivalent (the one reported on).

lstm-kong-pedal/: pretrained PerformanceRNN, kong+pedal split

Same architecture but with the 314-id pedal-aware vocabulary (NOTE_ON×88 + NOTE_OFF×88 + TIME_SHIFT×100 + VELOCITY×32 + SUSTAIN_ON/OFF + 4 specials). Fine-tuned on the PiJAMA kong album-aware split. Best swept mean OA: 0.768 (only ≈0.04 below Aria real-time despite a ~150× parameter ratio). FMD: 438.6.

  • tested.pt
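
The vocabulary size and the comparison with Aria real-time quoted for lstm-kong-pedal can be checked with a few lines of arithmetic. The dictionary keys are illustrative labels, and the parameter ratio assumes a nominal 1e9 parameters for Aria against the 6.46M quoted for the LSTM.

```python
# Token classes of the pedal-aware vocabulary, as listed in the description.
pedal_vocab = {
    "NOTE_ON": 88,
    "NOTE_OFF": 88,
    "TIME_SHIFT": 100,
    "VELOCITY": 32,
    "SUSTAIN_ON/OFF": 2,
    "SPECIALS": 4,
}
vocab_size = sum(pedal_vocab.values())

# Mean OA gap between aria-real-time and lstm-kong-pedal.
oa_gap = round(0.804 - 0.768, 3)

# Nominal parameter ratio (assumes exactly 1e9 Aria parameters).
param_ratio = 1e9 / 6.46e6

print(vocab_size, oa_gap, round(param_ratio))
```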

Note: a full retrain on TRAIN+VAL+TEST was not performed for the LSTMs (their compute cost is small enough that the 4-stage generalisation-honest pipeline already gives a strong deployment baseline). If you need that variant, the training script in the companion submission directory reproduces it in ≈25 minutes on a single NVIDIA B200 (or comparable GPU).

MLX variants for macOS inference

Each Aria model also has mlx-tested/ and mlx-deployed/ directories containing:

  • model.safetensors: same weights as the top-level safetensors, laid out for loading via mlx.core.load() on Apple silicon.
  • config.json: the corresponding Aria model config (medium.json for full-quality, medium-emb.json for real-time).
  • For aria-real-time/mlx-* only: tokenizer-config.json, the same 2 675-id demo tokenizer the upstream aria/demo/demo_mlx.py uses.

Running on macOS

aria-real-time/mlx-tested/ is a drop-in replacement for the weights expected by the upstream aria/demo/demo_mlx.py (iOS / Apple silicon real-time sampler from EleutherAI/aria). Point that script at model.safetensors and use the bundled tokenizer-config.json:

python aria/demo/demo_mlx.py \
    --checkpoint-path /path/to/mlx-tested/model.safetensors \
    --tokenizer-config /path/to/mlx-tested/tokenizer-config.json

aria-full-quality/mlx-*/ ships the full-quality weights and the medium.json config. The upstream demo_mlx.py hardcodes the medium-emb arch, so to run these checkpoints on MLX you can either:

  1. Adapt aria.inference.model_mlx.TransformerLM to load medium instead of medium-emb (drop the emb_proj layer), or
  2. Run inference via PyTorch with the MPS backend on macOS, using the top-level tested.safetensors / deployed.safetensors and the default AbsTokenizer (no demo tokenizer config needed).

The full-quality checkpoints are ≈2.5 GB in bf16, so they fit easily on Apple silicon with ≥16 GB of unified memory for inference.

Loading from Python

Aria (any variant) on CUDA / ROCm / MPS

from aria.config import load_model_config
from aria.model import ModelConfig, TransformerLM
from safetensors.torch import load_file

model_config = ModelConfig(**load_model_config("medium"))      # or "medium-emb"
model_config.set_vocab_size(17727)                              # or 2675 for real-time
model = TransformerLM(model_config)
model.load_state_dict(load_file("tested.safetensors"), strict=False)
model.eval()
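
The snippet above leaves the model on the CPU. Below is a minimal sketch of choosing a device for the backends named in this section. pick_device is a hypothetical helper, not part of aria, and it simply returns "cpu" when PyTorch is not installed.

```python
import importlib.util


def pick_device():
    """Return "cuda", "mps" or "cpu" depending on what PyTorch reports."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed; nothing to select
    import torch

    if torch.cuda.is_available():  # ROCm builds also report through torch.cuda
        return "cuda"
    if torch.backends.mps.is_available():  # Apple silicon GPU backend
        return "mps"
    return "cpu"


device = pick_device()
print(device)
# The loaded model could then be moved with model.to(device).
```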

Aria on MLX (Apple silicon)

import mlx.core as mx
weights = mx.load("mlx-tested/model.safetensors")
# … then build the MLX TransformerLM as in aria.inference.model_mlx

LSTM

import torch
from src.models.performancernn_lstm import PerformanceRNNLSTM, PerformanceRNNLSTMConfig
# weights_only=False runs the full unpickler, so only load checkpoint files you trust.
ckpt = torch.load("tested.pt", map_location="cpu", weights_only=False)
cfg  = PerformanceRNNLSTMConfig(**ckpt["config"])
model = PerformanceRNNLSTM(cfg)
model.load_state_dict(ckpt["model_state"], strict=True)
model.eval()

(The PerformanceRNNLSTM / PerformanceRNNLSTMConfig definitions live in the companion submission directory under src/models/performancernn_lstm.py.)

Recommended sampling settings

The Stage-C sampling sweep covered the 12 cells T ∈ {0.8, 1.0, 1.2} × top-k ∈ {0, 24} × min-p ∈ {0.035, 0.05}, with 4 PiJAMA test prompts × 20 variations per cell. The same cell (temperature = 1.2, top-k = 0, i.e. no truncation, min-p = 0.035) wins on both Mean OA and FMD for every model in this repo.
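
The size of the sweep follows directly from the three value sets. A short sketch that enumerates the grid; the variable names are illustrative and the per-model total is derived here, not quoted from the report.

```python
from itertools import product

temperatures = (0.8, 1.0, 1.2)
top_ks = (0, 24)            # 0 means no top-k truncation
min_ps = (0.035, 0.05)

# Every (T, k, p) combination is one cell of the sweep.
cells = list(product(temperatures, top_ks, min_ps))

samples_per_cell = 4 * 20   # 4 prompts x 20 variations
total_samples = len(cells) * samples_per_cell

print(len(cells), samples_per_cell, total_samples)
```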

| Model | Best (T, k, p) | Mean OA ↑ | FMD ↓ (CLaMP-2) |
|---|---|---|---|
| aria-full-quality | (1.2, 0, 0.035) | 0.911 | 272.6 |
| aria-real-time | (1.2, 0, 0.035) | 0.804 | 233.6 |
| lstm-kong-pedal | (1.2, 0, 0.035) | 0.768 | 438.6 |
| lstm-hawthorne | (1.2, 0, 0.035) | 0.664 | 427.8 |

Three robust observations from the sweep:

  • Temperature dominates. Bumping T from 0.8 → 1.2 buys +0.18–0.30 absolute OA on Aria at every (k, p) cell and +0.28 on both LSTM splits.
  • Don't truncate. top-k = 0 (no truncation) beats top-k = 24 by 0.03–0.07 OA at every (T, p) cell; aggressive truncation hurts distributional fidelity on this corpus.
  • min-p is comparatively flat between 0.035 and 0.05; the smaller value wins by a small margin everywhere.

If you only want a single set of knobs that works across all four models, use temperature=1.2, top_k=0, min_p=0.035.
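
To make the three knobs concrete, here is a minimal pure-Python sketch of one sampling step. It assumes the common formulation in which temperature is applied first, top-k keeps the k most probable ids (0 disables it), and min-p drops ids whose probability is below min_p times the top probability. The samplers in aria and in the companion code may order or implement these steps differently; sample_token is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=1.2, top_k=0, min_p=0.035, rng=random):
    """Sample one token id with temperature, optional top-k and min-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    keep = set(range(len(probs)))
    if top_k > 0:
        # Keep only the top_k most probable ids; top_k = 0 skips this step.
        ranked = sorted(keep, key=lambda i: probs[i], reverse=True)
        keep = set(ranked[:top_k])

    # min-p: drop ids whose probability is below min_p times the top probability.
    threshold = min_p * max(probs)
    keep = {i for i in keep if probs[i] >= threshold}

    ids = sorted(keep)
    weights = [probs[i] for i in ids]
    # random.choices normalises the surviving weights internally.
    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [5.0, 4.0, 0.0, -10.0]
# With the recommended settings, ids 2 and 3 fall below the min-p threshold.
draws = [sample_token(logits, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

The most probable id always survives both filters, so the candidate set is never empty.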

Reproducibility

All four checkpoints were produced by the pipeline scripts in the companion submission directory (scripts/aria_pipeline_per_variant.sh for the Aria variants, scripts/train_performancernn_lstm_pipeline.sh for the LSTMs). Reported metrics in the report come from src/eval_aria_metrics.py (OA / KLD) and scripts/fmd_eval_sweeps.py (FMD with the CLaMP-2 music encoder).
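
For orientation, Fréchet-style distances such as FMD compare Gaussian fits of two sets of embeddings. Assuming scripts/fmd_eval_sweeps.py follows the standard definition (an assumption, not checked against the script), with $(\mu_g, \Sigma_g)$ the mean and covariance of the CLaMP-2 embeddings of the generated samples and $(\mu_r, \Sigma_r)$ those of the reference pool:

```latex
\mathrm{FMD} = \lVert \mu_g - \mu_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2} \right)
```

Lower values mean the generated embeddings sit closer to the reference distribution, which matches the ↓ direction used in the results table.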

Citation

If you use these checkpoints, please cite the report and the original PiJAMA + Aria papers:

  • Edwards, Dixon and Benetos. PiJAMA: Piano Jazz with Automatic MIDI Annotations. ISMIR Transactions, 6(1):89–102, 2024.
  • Bradshaw and Colton. Aria: A Generative Model for Music-Aware AI. arXiv:2506.23869, 2025.
  • Oore, Simon, Dieleman, Eck, Simonyan. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications, 32:955–967, 2020.