paion-tts-v1

Single-voice emotion-conditioned StyleTTS 2 model trained on Jaiden's recordings for the Paion AI companion project.

Architecture

  • Base: StyleTTS 2 (LibriTTS pretrained), fine-tuned on 537 voice clips
  • Decoder: iSTFTNet (mobile-optimized)
  • Speaker conditioning: 15 "speakers" covering Paion's 12 parent feel tags plus 3 delivery modes
  • Output: 24kHz mono speech, 8-bit quantized for mobile

Parent Feel Tags → Speaker IDs

The model conditions on integer speaker IDs. IDs 0-11 each represent one parent feel tag from the Paion Feel Tag Taxonomy; IDs 12-14 are delivery modes:

Speaker ID   Parent tag
0            tenderness
1            joy
2            sadness
3            fear
4            calm
5            curiosity
6            anger
7            surprise
8            pride
9            relief
10           desire
11           discomfort
12           whisper (delivery mode)
13           soft (delivery mode)
14           urgent (delivery mode)
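
As a minimal sketch, the table above can be written as a Python lookup. The names SPEAKER_MAP, DELIVERY_MODES, and speaker_id_for are hypothetical, and the actual layout of speaker_map.json is assumed rather than confirmed here.

```python
# Assumed shape of the tag -> speaker ID mapping, copied from the table above.
# The real speaker_map.json may be laid out differently.
SPEAKER_MAP = {
    "tenderness": 0,
    "joy": 1,
    "sadness": 2,
    "fear": 3,
    "calm": 4,
    "curiosity": 5,
    "anger": 6,
    "surprise": 7,
    "pride": 8,
    "relief": 9,
    "desire": 10,
    "discomfort": 11,
    "whisper": 12,
    "soft": 13,
    "urgent": 14,
}

# IDs 12-14 are delivery modes rather than parent feel tags.
DELIVERY_MODES = {"whisper", "soft", "urgent"}


def speaker_id_for(tag: str) -> int:
    """Return the integer speaker ID for a parent tag or delivery mode."""
    key = tag.strip().lower()
    if key not in SPEAKER_MAP:
        raise KeyError(f"unknown tag: {tag!r}")
    return SPEAKER_MAP[key]
```

For example, speaker_id_for("fear") returns 3, matching the runtime pipeline example.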

Runtime Pipeline

LLM produces  <feel>worried, curious</feel> Look I am worried about you...
  ↓
Watcher  →  parent tag = "fear"  →  speaker_id = 3  →  response text only
  ↓
StyleTTS 2 ONNX inference (this model)
  ↓
Pitch shift +4.5 semitones + formant shift 1.15× → female-low voice
  ↓
PCM out
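
The watcher step in the pipeline above could look roughly like the sketch below. This is an illustration, not the contents of feel_tag_map.py: the feel-word table, the first-match rule, the "calm" fallback, and the function name watch are all assumptions. Only the "worried" → "fear" → 3 path comes from the example above.

```python
import re

# Matches a leading <feel>...</feel> block and any whitespace after it.
FEEL_RE = re.compile(r"^\s*<feel>(.*?)</feel>\s*", re.DOTALL)

# Hypothetical feel-word -> parent-tag table; the real mapper is feel_tag_map.py.
FEEL_WORD_TO_PARENT = {"worried": "fear", "curious": "curiosity"}

# Subset of the speaker ID table needed for this sketch.
PARENT_TO_SPEAKER = {"fear": 3, "calm": 4, "curiosity": 5}

# Assumed fallback when no feel word is recognized.
DEFAULT_PARENT = "calm"


def watch(llm_output: str) -> tuple[str, int, str]:
    """Split LLM output into (parent tag, speaker ID, response text only)."""
    match = FEEL_RE.match(llm_output)
    if match is None:
        return DEFAULT_PARENT, PARENT_TO_SPEAKER[DEFAULT_PARENT], llm_output.strip()

    words = [w.strip().lower() for w in match.group(1).split(",")]
    # Assumption: the first recognized feel word decides the parent tag.
    parent = next(
        (FEEL_WORD_TO_PARENT[w] for w in words if w in FEEL_WORD_TO_PARENT),
        DEFAULT_PARENT,
    )
    text = llm_output[match.end():].strip()
    return parent, PARENT_TO_SPEAKER[parent], text
```

The returned speaker ID and text are what would be passed on to the ONNX model; the feel block itself never reaches the synthesizer.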

Files

  • paion_styletts2.onnx: quantized ONNX, ship to mobile
  • epoch_2nd_*.pth: original PyTorch checkpoint
  • config_paion.yml: training config
  • speaker_map.json: parent tag → speaker ID
  • feel_tag_map.py: feel-word → parent-tag mapper (runtime watcher source)

Training

  • Hardware: RTX PRO 6000 (Blackwell, 96GB)
  • Wall time: ~4 hours
  • Dataset: 537 clips × ~5 sec avg ≈ 45 min total audio
  • Stages: StyleTTS 2 standard 2-stage (acoustic + SLM adversarial)
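
The dataset size quoted above can be checked with a short calculation. The 5-second average is the approximate figure from the list, so the result is an estimate; the variable names are illustrative.

```python
# Rough check of the total audio duration from the figures in the list above.
CLIP_COUNT = 537
AVG_CLIP_SECONDS = 5.0  # approximate average clip length

total_seconds = CLIP_COUNT * AVG_CLIP_SECONDS
total_minutes = total_seconds / 60

print(f"{total_seconds:.0f} s ≈ {total_minutes:.2f} min")
```

This gives 2685 s, or about 44.75 minutes, consistent with the "≈ 45 min" figure.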

Credits

  • Voice: Jaiden (Jeff's son)
  • Project: Paion (AI companion expanding human cognition)
  • Architecture: StyleTTS 2 by Yinghao Aaron Li et al.