TTS Suitability Classifier

ONNX audio classifier that estimates whether a speech segment is suitable for TTS training.

The model is a binary classifier based on the 300M wav2vec2 encoder from facebook/omniASR. The ONNX file is self-contained and does not require fairseq2, PyTorch, or the original omnilingual-asr repository for inference.

Labels

| Class | Label | Meaning |
|---|---|---|
| 0 | `not_tts` | Audio is not suitable for TTS training |
| 1 | `tts` | Audio is suitable for TTS training |

p_tts is the softmax probability of class 1. The default decision threshold is 0.5. For dataset filtering, choose the threshold on a manually labeled validation set.
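
As a minimal sketch of how p_tts relates to the logits, the snippet below applies a numerically stable softmax to a two-element logit vector and compares the class-1 probability with a threshold. The helper names `softmax` and `decide` are illustrative and not part of inference.py, and treating a score exactly equal to the threshold as `tts` is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Return (label, p_tts) for one pair of logits (hypothetical helper)."""
    p_not_tts, p_tts = softmax(logits)
    # Assumption: a score equal to the threshold counts as "tts".
    label = "tts" if p_tts >= threshold else "not_tts"
    return label, p_tts


label, p_tts = decide([-2.2, 1.5])
print(label, round(p_tts, 2))  # prints: tts 0.98
```

Raising the threshold trades recall for precision: the same logits are rejected by `decide([-2.2, 1.5], threshold=0.99)`.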

Installation

pip install -r requirements.txt

For CUDA inference, replace onnxruntime with a compatible onnxruntime-gpu build.
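
After swapping packages, it can help to check which execution providers ONNX Runtime actually exposes. The sketch below uses `onnxruntime.get_available_providers()` and a hypothetical helper, `choose_providers`, that prefers CUDA and keeps CPU as a fallback. It is one plausible selection policy, not necessarily what `--provider auto` does in inference.py.

```python
def choose_providers(available, want_cuda=True):
    """Prefer CUDA when it is available, always keeping CPU as a fallback."""
    providers = []
    if want_cuda and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")
    return providers


def available_providers():
    """Ask the installed ONNX Runtime build which providers it supports."""
    import onnxruntime as ort  # imported lazily so the helper above has no dependencies

    return ort.get_available_providers()


# Example with a hard-coded list, as a CPU-only build would typically report:
print(choose_providers(["CPUExecutionProvider"]))
```

If `available_providers()` does not list `CUDAExecutionProvider` after installing onnxruntime-gpu, the build most likely does not match the local CUDA setup.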

Command-line inference

python inference.py sample.mp3
python inference.py /path/to/audio-directory --provider cpu
python inference.py sample.wav --provider cuda --cuda-device-id 0

Each result is printed as one JSON object:

{
  "label": "tts",
  "predicted_class": 1,
  "p_not_tts": 0.02,
  "p_tts": 0.98,
  "logits": [-2.2, 1.5]
}
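
To consume this output from another script, the objects can be decoded one after another. The sketch below assumes the captured stdout contains only JSON objects written back to back (single-line or pretty-printed); `iter_json_objects` is a hypothetical helper, and the second object in the sample text is made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Illustrative captured output: one compact object, one pretty-printed object.
captured = """
{"label": "tts", "predicted_class": 1, "p_not_tts": 0.02, "p_tts": 0.98, "logits": [-2.2, 1.5]}
{
  "label": "not_tts",
  "predicted_class": 0,
  "p_not_tts": 0.88,
  "p_tts": 0.12,
  "logits": [1.0, -1.0]
}
"""

for result in iter_json_objects(captured):
    print(result["label"], result["p_tts"])
```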

Python API

from inference import TTSSuitabilityClassifier

classifier = TTSSuitabilityClassifier(provider="auto")
result = classifier.predict("sample.mp3")

print(result["label"])
print(result["p_tts"])
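
For dataset filtering, the same API can be called in a loop. The following is a minimal sketch: `filter_tts_files` and `run_on_directory` are hypothetical helpers, and they rely only on what is shown above, namely that `predict` accepts a file path and returns a dict with a `p_tts` entry.

```python
from pathlib import Path

# Extensions matching the formats listed under "Input preprocessing".
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3", ".ogg", ".m4a"}


def filter_tts_files(classifier, paths, threshold=0.5):
    """Return (path, p_tts) pairs whose p_tts meets the threshold."""
    kept = []
    for path in paths:
        result = classifier.predict(str(path))
        if result["p_tts"] >= threshold:
            kept.append((str(path), result["p_tts"]))
    return kept


def run_on_directory(directory, threshold=0.5):
    """Classify every audio file directly inside `directory`."""
    from inference import TTSSuitabilityClassifier  # shipped with this repository

    classifier = TTSSuitabilityClassifier(provider="auto")
    paths = sorted(
        p for p in Path(directory).iterdir() if p.suffix.lower() in AUDIO_SUFFIXES
    )
    return filter_tts_files(classifier, paths, threshold)
```

Calling `run_on_directory("/path/to/audio-directory", threshold=0.8)` would then return only the files scored at or above 0.8.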

Input preprocessing

The included inference code applies the same preprocessing as the training and export recipe:

Decode WAV, FLAC, MP3, OGG, or M4A.
Mix channels to mono.
Resample to 16 kHz.
Apply waveform layer normalization.
Split long audio into 10-second chunks.
Average chunk logits and apply softmax.
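
The steps after decoding can be sketched with NumPy as follows. This is an illustration, not the code in inference.py: decoding and resampling are omitted (the input is assumed to be at 16 kHz already), normalization is assumed to run over the whole waveform before chunking, the `eps` value is assumed, and `logits_fn` is a hypothetical callable standing in for the model.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 10 * SAMPLE_RATE  # 10-second chunks


def to_mono(audio):
    """Average channels; expects shape [num_frames] or [num_channels, num_frames]."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=0)


def layer_norm(waveform, eps=1e-5):
    """Zero mean and unit variance over the waveform (eps is an assumed value)."""
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


def split_chunks(waveform, chunk_samples=CHUNK_SAMPLES):
    """Consecutive chunks; the final chunk may be shorter than the rest."""
    return [waveform[i:i + chunk_samples] for i in range(0, len(waveform), chunk_samples)]


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()


def predict_probs(audio, logits_fn):
    """Return [p_not_tts, p_tts]; logits_fn maps a 1-D chunk to two logits."""
    waveform = layer_norm(to_mono(audio))
    chunk_logits = np.stack([np.asarray(logits_fn(c)) for c in split_chunks(waveform)])
    # Average the logits across chunks first, then apply softmax once.
    return softmax(chunk_logits.mean(axis=0))
```

Averaging logits before the softmax, as the list above describes, differs from averaging per-chunk probabilities, so the order matters when reimplementing the pipeline.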

The ONNX input is a float32 tensor named waveforms with shape [batch_size, num_frames]. The output is a tensor named logits with shape [batch_size, 2]. Both input axes (batch_size and num_frames) are dynamic. The model was exported with ONNX opset 17.
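
For callers who want to bypass inference.py, a direct ONNX Runtime call might look like the sketch below. The tensor names `waveforms` and `logits` come from the description above; `run_logits` and `load_session` are hypothetical helpers, and the waveform is assumed to be already preprocessed (mono, 16 kHz, normalized).

```python
import numpy as np


def run_logits(session, waveform):
    """Run one preprocessed mono waveform through the model and return its 2 logits."""
    # Add the batch axis: [num_frames] -> [1, num_frames].
    batch = np.asarray(waveform, dtype=np.float32)[np.newaxis, :]
    (logits,) = session.run(["logits"], {"waveforms": batch})
    return logits[0]


def load_session(model_path="model.onnx"):
    """Create a CPU session for the exported model."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime to be installed

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```

A typical call would be `run_logits(load_session(), chunk)` for each 10-second chunk, followed by the averaging and softmax steps listed above.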

Files

model.onnx: self-contained FP32 ONNX model.
inference.py: standalone ONNX Runtime inference.
requirements.txt: CPU inference dependencies.

Upload to Hugging Face

Create an empty model repository, then run from this directory:

hf upload-large-folder <username>/<repo-name> . --repo-type model

model.onnx is configured for Git LFS in .gitattributes.

Training and export

The released model corresponds to training checkpoint step 94,000. It was exported using the repository recipes:

workflows/recipes/wav2vec2/binary_classification/export_onnx.py
workflows/recipes/wav2vec2/binary_classification/run_onnx.py

Architecture: wav2vec2_asr 300m
Sample rate: 16000 Hz
Training maximum audio length: 160000 samples (10 seconds at 16 kHz)
Classes: not_tts, tts

Limitations

The score measures similarity to the training definition of TTS-suitable audio; it is not a general-purpose MOS score.
Music, noise, clipping, overlapping speakers, and unusual recording conditions may affect predictions.
Probabilities are not guaranteed to be calibrated.
Validate the threshold on data from the intended domain before filtering a large dataset.
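
One way to validate a threshold is to sweep candidate values over a labeled validation set and inspect precision and recall for the `tts` class. The sketch below is plain Python with hypothetical helper names; the scores and labels are made-up illustration data, not results from this model.

```python
def precision_recall(p_tts_scores, labels, threshold):
    """Precision and recall for class 1 when keeping items with p_tts >= threshold.

    Precision is None when nothing is kept.
    """
    pairs = list(zip(p_tts_scores, labels))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision, candidates):
    """Smallest candidate threshold whose precision meets the target, or None."""
    for t in sorted(candidates):
        precision, _ = precision_recall(scores, labels, t)
        if precision is not None and precision >= target_precision:
            return t
    return None


# Made-up validation data: p_tts scores with manual labels (1 = suitable).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.5, 0.7, 0.85):
    print(t, precision_recall(scores, labels, t))
```

On this toy data, `lowest_threshold_for_precision(scores, labels, 0.9, [0.5, 0.7, 0.85])` returns 0.85, at the cost of dropping one of the three suitable clips.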

License

Apache 2.0. The base architecture and code originate from the omnilingual-asr project.
