TTS Suitability Classifier

ONNX audio classifier that estimates whether a speech segment is suitable for TTS training.

The model is a binary classifier based on the 300M wav2vec2 encoder from facebook/omniASR. The ONNX file is self-contained and does not require fairseq2, PyTorch, or the original omnilingual-asr repository for inference.

Labels

Class Label Meaning
0 not_tts Audio is not suitable for TTS training
1 tts Audio is suitable for TTS training

p_tts is the softmax probability of class 1. The default decision threshold is 0.5. For dataset filtering, choose the threshold on a manually labeled validation set.

Installation

pip install -r requirements.txt

For CUDA inference, replace onnxruntime with a compatible onnxruntime-gpu build.

Command-line inference

python inference.py sample.mp3
python inference.py /path/to/audio-directory --provider cpu
python inference.py sample.wav --provider cuda --cuda-device-id 0

Each result is printed as one JSON object:

{
  "label": "tts",
  "predicted_class": 1,
  "p_not_tts": 0.02,
  "p_tts": 0.98,
  "logits": [-2.2, 1.5]
}

Python API

from inference import TTSSuitabilityClassifier

classifier = TTSSuitabilityClassifier(provider="auto")
result = classifier.predict("sample.mp3")

print(result["label"])
print(result["p_tts"])

Input preprocessing

The included inference code applies the same preprocessing as the training and export recipe:

  1. Decode WAV, FLAC, MP3, OGG, or M4A.
  2. Mix channels to mono.
  3. Resample to 16 kHz.
  4. Apply waveform layer normalization.
  5. Split long audio into 10-second chunks.
  6. Average chunk logits and apply softmax.

The ONNX input is a float32 tensor named waveforms with shape [batch_size, num_frames]. The output is logits with shape [batch_size, 2]. Both input axes are dynamic; ONNX opset 17 is used.

Files

  • model.onnx: self-contained FP32 ONNX model.
  • inference.py: standalone ONNX Runtime inference.
  • requirements.txt: CPU inference dependencies.

Upload to Hugging Face

Create an empty model repository, then run from this directory:

hf upload-large-folder <username>/<repo-name> . --repo-type model

model.onnx is configured for Git LFS in .gitattributes.

Training and export

The released model corresponds to training checkpoint step 94,000. It was exported using the repository recipes:

  • workflows/recipes/wav2vec2/binary_classification/export_onnx.py
  • workflows/recipes/wav2vec2/binary_classification/run_onnx.py

Architecture: wav2vec2_asr 300m
Sample rate: 16000 Hz
Training maximum audio length: 160000 samples
Classes: not_tts, tts

Limitations

  • The score measures similarity to the training definition of TTS-suitable audio; it is not a general-purpose MOS score.
  • Music, noise, clipping, overlapping speakers, and unusual recording conditions may affect predictions.
  • Probabilities are not guaranteed to be calibrated.
  • Validate the threshold on data from the intended domain before filtering a large dataset.

License

Apache 2.0. The base architecture and code originate from the omnilingual-asr project.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support