| language: | |
| - en | |
| - fr | |
| - es | |
| - de | |
| license: apache-2.0 | |
| library_name: transformers | |
| pipeline_tag: audio-classification | |
| tags: | |
| - whisper | |
| - audio-classification | |
| - telephony | |
| - answering-machine-detection | |
| - amd | |
| - speech-processing | |
| - real-time | |
| - generated_from_trainer | |
| datasets: | |
| - AbijahKaj/telephony-amd-dataset | |
| - PolyAI/minds14 | |
| - pipecat-ai/human_5_all | |
| - pipecat-ai/human_convcollector_1 | |
| - pipecat-ai/smart-turn-data-v3.2-train | |
| base_model: openai/whisper-tiny | |
| metrics: | |
| - accuracy | |
| - f1 | |
| - precision | |
| - recall | |
| model-index: | |
| - name: whisper-telephony-amd | |
|   results: | |
|   - task: | |
|       type: audio-classification | |
|       name: Audio Classification | |
|     dataset: | |
|       name: telephony-amd-dataset | |
|       type: AbijahKaj/telephony-amd-dataset | |
|       split: test | |
|     metrics: | |
|     - type: accuracy | |
|       value: 0.9875 | |
|       name: Accuracy | |
|     - type: f1 | |
|       value: 0.99 | |
|       name: F1 (macro) | |
|     - type: precision | |
|       value: 0.99 | |
|       name: Precision (macro) | |
|     - type: recall | |
|       value: 0.99 | |
|       name: Recall (macro) | |
| # Whisper Telephony AMD (Answering Machine Detection) | |
| A real-time audio classifier that detects whether a telephony call is answered by a **human**, **voicemail**, **IVR system**, or **answering machine** — using Whisper's speech understanding to distinguish human-recorded voicemail greetings from live speech. | |
| ## Results | |
| **98.75% accuracy** on 400 test samples with only 5 misclassifications: | |
| ``` | |
| precision recall f1-score support | |
| human 1.00 0.99 1.00 114 | |
| voicemail 0.96 0.99 0.98 102 | |
| ivr 1.00 0.99 0.99 92 | |
| answering_machine 0.99 0.98 0.98 92 | |
| accuracy 0.99 400 | |
| macro avg 0.99 0.99 0.99 400 | |
| weighted avg 0.99 0.99 0.99 400 | |
| ``` | |
| **Confusion Matrix** (rows = actual, columns = predicted): | |
| | | Human | Voicemail | IVR | Answering Machine | | |
| |---|:---:|:---:|:---:|:---:| | |
| | **Human** | 113 | 1 | 0 | 0 | | |
| | **Voicemail** | 0 | 101 | 0 | 1 | | |
| | **IVR** | 0 | 1 | 91 | 0 | | |
| | **Answering Machine** | 0 | 2 | 0 | 90 | | |
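The per-class figures in the report can be recomputed from the confusion matrix. A minimal sketch in plain Python, using the counts from the table above (the variable and function names are illustrative and not part of the model repo):

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
LABELS = ["human", "voicemail", "ivr", "answering_machine"]
CONFUSION = [
    [113, 1, 0, 0],   # actual human
    [0, 101, 0, 1],   # actual voicemail
    [0, 1, 91, 0],    # actual ivr
    [0, 2, 0, 90],    # actual answering_machine
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                               # row sum
        precision = true_pos / predicted_k
        recall = true_pos / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (precision, recall, f1)
    return metrics


total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[k][k] for k in range(len(CONFUSION)))
accuracy = correct / total

for k, (p, r, f1) in per_class_metrics(CONFUSION).items():
    print(f"{LABELS[k]:<18} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} ({total - correct} errors out of {total})")
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total, and each class's precision and recall come from its column and row sums respectively.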
| ### Accuracy Per Epoch | |
| | Epoch | Accuracy | Eval Loss | Per-Class | | |
| |:-----:|:--------:|:---------:|-----------| | |
| | 1 | **98.75%** | 0.0785 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% | | |
| | 2 | 95.75% | 0.1473 | human=94.7%, vm=93.1%, ivr=97.8%, am=97.8% | | |
| | 3 | 98.25% | 0.0779 | human=97.4%, vm=100%, ivr=97.8%, am=97.8% | | |
| | 4 | **98.75%** | **0.0415** | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% | | |
| | 5 | 98.75% | 0.0569 | human=99.1%, vm=98.0%, ivr=98.9%, am=98.9% | | |
| | 6 | 98.00% | 0.0539 | human=97.4%, vm=99.0%, ivr=97.8%, am=97.8% | | |
| Early stopping triggered after epoch 6 (patience=5, best at epoch 4). Best model loaded from epoch 4 checkpoint. | |
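The card does not state which metric the early-stopping callback monitored. One reading that is consistent with the table, and purely an assumption here, is that patience was counted against accuracy (which never rose above the 98.75% of epoch 1), while "best at epoch 4" refers to the lowest eval loss. A small sketch of that reading, with the numbers copied from the table:

```python
# (epoch, accuracy %, eval loss) copied from the per-epoch table.
HISTORY = [
    (1, 98.75, 0.0785),
    (2, 95.75, 0.1473),
    (3, 98.25, 0.0779),
    (4, 98.75, 0.0415),
    (5, 98.75, 0.0569),
    (6, 98.00, 0.0539),
]
PATIENCE = 5


def stop_epoch(history, patience):
    """Epoch at which training stops if accuracy must strictly improve.

    Assumption: a tie with the best accuracy so far does not reset the counter.
    Returns None if patience is never exhausted.
    """
    best_acc = float("-inf")
    epochs_without_gain = 0
    for epoch, acc, _loss in history:
        if acc > best_acc:
            best_acc = acc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        if epochs_without_gain >= patience:
            return epoch
    return None


# Epoch with the lowest eval loss in the table.
best_loss_epoch = min(HISTORY, key=lambda row: row[2])[0]

print(f"stops after epoch {stop_epoch(HISTORY, PATIENCE)}")
print(f"lowest eval loss at epoch {best_loss_epoch}")
```

If patience had instead been counted against eval loss, the counter would have been reset at epoch 4 and training would not yet have stopped at epoch 6.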
| ## Model Details | |
| | | | | |
| |---|---| | |
| | **Architecture** | WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier) | | |
| | **Base model** | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | | |
| | **Parameters** | 8.3M total, 7.2M trainable (conv layers frozen) | | |
| | **Input** | 16kHz mono audio → 80-bin mel spectrogram (30s padded) | | |
| | **Output** | 4 classes: `human`, `voicemail`, `ivr`, `answering_machine` | | |
| | **Inference speed** | ~12ms CPU (ONNX int8), <5ms GPU | | |
| | **Model size** | 31.7 MB (safetensors) | | |
| | **Design reference** | Same architecture as [pipecat-ai/smart-turn-v3](https://hf.co/pipecat-ai/smart-turn-v3) | | |
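The input and size rows can be sanity-checked with a few lines of arithmetic. The sketch below assumes Whisper's standard front end (a 10 ms hop, i.e. 160 samples at 16 kHz, and a stride-2 convolution in the encoder) and float32 weights. None of these values are stated in the card itself.

```python
SAMPLE_RATE = 16_000      # Hz, from the Input row
WINDOW_SECONDS = 30       # clips are padded to 30 s, from the Input row
N_MELS = 80               # mel bins, from the Input row
HOP_LENGTH = 160          # samples per frame (assumed Whisper default, 10 ms)
ENCODER_STRIDE = 2        # the encoder's second conv halves the time axis (assumed)

mel_frames = SAMPLE_RATE * WINDOW_SECONDS // HOP_LENGTH
encoder_positions = mel_frames // ENCODER_STRIDE

TOTAL_PARAMS = 8.3e6      # from the Parameters row
BYTES_PER_PARAM = 4       # float32 storage (assumed)
size_mib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**20

print(f"feature shape: ({N_MELS}, {mel_frames})")
print(f"encoder sequence length: {encoder_positions}")
print(f"approx. weight size: {size_mib:.1f} MiB")
```

The size estimate is in line with the listed 31.7 MB if that figure is read in binary megabytes.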
| ## Quick Start | |
| ### Pipeline (simplest) | |
| ```python | |
| from transformers import pipeline | |
| classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd") | |
| result = classifier("phone_call.wav") | |
| print(result) | |
| # [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...] | |
| ``` | |
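Callers of the pipeline may want to gate on confidence before acting on the top label. A minimal sketch of such a gate over the output format shown above follows. The `decide` helper, the 0.8 threshold, and the `"uncertain"` fallback are illustrative choices, not part of the model or of Transformers.

```python
def decide(result, threshold=0.8):
    """Pick the top label from audio-classification pipeline output.

    `result` is a list of {"score": float, "label": str} dicts.
    Returns (label, score), or ("uncertain", score) when the best
    score is below `threshold` (hypothetical fallback label).
    """
    best = max(result, key=lambda item: item["score"])
    if best["score"] < threshold:
        return "uncertain", best["score"]
    return best["label"], best["score"]


# Made-up scores in the same shape as the pipeline output above.
example = [
    {"score": 0.98, "label": "human"},
    {"score": 0.01, "label": "voicemail"},
    {"score": 0.007, "label": "ivr"},
    {"score": 0.003, "label": "answering_machine"},
]
label, score = decide(example)
print(label, score)
```

A suitable threshold depends on the cost of each kind of mistake and would need to be tuned on real call data.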
| ### Manual Inference | |
| ```python | |
| from transformers import WhisperForAudioClassification, AutoFeatureExtractor | |
| import torch | |
| model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd") | |
| fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd") | |
| # audio_array: mono NumPy array sampled at 16 kHz | |
| inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt") | |
| with torch.no_grad(): | |
|     logits = model(**inputs).logits | |
| pred = torch.argmax(logits, dim=-1).item() | |
| # id2label is keyed by int once the config is loaded | |
| label = model.config.id2label[pred] | |
| print(f"Predicted: {label}") | |
| ``` | |
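The snippet above leaves loading `audio_array` to the caller. As one possible approach, the sketch below reads a 16-bit mono PCM WAV file with Python's standard library and scales it to floats in [-1, 1). The `load_wav_mono16` name is hypothetical. The function does not resample, so an 8 kHz telephony recording still has to be converted to 16 kHz (for example with a dedicated audio library) before it is passed to the feature extractor.

```python
import array
import sys
import wave


def load_wav_mono16(path):
    """Read a 16-bit mono PCM WAV file.

    Returns (samples, sample_rate), where samples is a list of
    floats in [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in pcm], sample_rate
```

Once the sample rate is 16 kHz, the returned list can be converted with `numpy.asarray(samples, dtype="float32")` to match the `audio_array` used above.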
| ### Streaming Real-Time Inference | |
| ```python | |
| from streaming_amd import StreamingAMDClassifier | |
| classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd") | |
| for pcm_chunk in audio_stream:  # 160ms chunks @ 8kHz | |
|     result = classifier.process_chunk(pcm_chunk) | |
|     if result: | |
|         label, confidence, elapsed_s = result | |
|         print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s") | |
|         break | |
| ``` | |
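The loop expects `audio_stream` to yield 160 ms chunks of 8 kHz audio. How `streaming_amd.py` represents a chunk is not shown here. Assuming raw 16-bit mono PCM bytes, each chunk is 1,280 samples, or 2,560 bytes. The hypothetical `iter_chunks` generator below sketches how a byte buffer could be split accordingly:

```python
SAMPLE_RATE = 8_000     # Hz, from the comment in the loop above
CHUNK_MS = 160          # chunk duration, from the comment in the loop above
BYTES_PER_SAMPLE = 2    # 16-bit PCM (assumption)

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000
CHUNK_BYTES = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE


def iter_chunks(pcm_bytes, chunk_bytes=CHUNK_BYTES):
    """Yield consecutive fixed-size chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(pcm_bytes) - chunk_bytes + 1, chunk_bytes):
        yield pcm_bytes[start:start + chunk_bytes]


one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)   # 1 s of digital silence
chunks = list(iter_chunks(one_second))
print(SAMPLES_PER_CHUNK, CHUNK_BYTES, len(chunks))
```

In a live call the bytes would arrive incrementally from the media stream, so a real integration would buffer incoming packets until a full chunk is available.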
| ## Why Whisper? | |
| Voicemail greetings are **recorded by real humans**, so they are acoustically very close to live speech. Traditional acoustic-only models (energy, pitch, spectral features) cannot reliably distinguish *"Hi, I'm not available, leave a message"* from *"Hello? Who's calling?"*. | |
| Whisper's encoder was pre-trained on 680K hours of speech and understands **what is being said**, not just how it sounds. This semantic understanding is critical for AMD. | |
| ## Training | |
| ### Dataset | |
| [AbijahKaj/telephony-amd-dataset](https://hf.co/datasets/AbijahKaj/telephony-amd-dataset) — **8,264 train / 400 test** samples, balanced across 4 classes (~2,000 each). | |
| **Data sources:** | |
| | Class | Train samples | Sources | |
| |-------|-------|---------| | |
| | Human | 2,151 | [PolyAI/minds14](https://hf.co/datasets/PolyAI/minds14) (real telephony callers, 6 languages), [pipecat-ai/human_5_all](https://hf.co/datasets/pipecat-ai/human_5_all), [pipecat-ai/human_convcollector_1](https://hf.co/datasets/pipecat-ai/human_convcollector_1), original edge-tts | | |
| | Voicemail | 2,078 | [pipecat-ai smart-turn rime_2](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (personal greeting style), original edge-tts | | |
| | IVR | 2,017 | [pipecat-ai smart-turn chirp3](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (automated system style), original edge-tts | | |
| | Answering Machine | 2,018 | [pipecat-ai smart-turn orpheus](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (machine greeting style), original edge-tts | | |
| ### Hyperparameters | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Learning rate | 1e-4 | | |
| | Scheduler | Cosine with 25 warmup steps | | |
| | Batch size | 32 | | |
| | Gradient accumulation | 1 | | |
| | Max epochs | 20 (early stopped at 6) | | |
| | Weight decay | 0.01 | | |
| | Precision | FP16 | | |
| | Gradient checkpointing | Enabled | | |
| | Freeze strategy | Conv layers frozen, transformer layers + head trainable | | |
| | Early stopping patience | 5 | | |
| | Max audio length | 10s (truncated, padded to 30s for Whisper) | | |
| | Hardware | Tesla T4 (16GB VRAM) | | |
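To make the schedule concrete, the sketch below derives the step counts from the dataset size and batch size and evaluates a cosine schedule with linear warmup. The formula is the common linear-warmup plus half-cosine decay. That the last partial batch is kept, that training ran on a single device, and that the scheduler horizon is the full 20 epochs are all assumptions.

```python
import math

TRAIN_SAMPLES = 8_264
BATCH_SIZE = 32
MAX_EPOCHS = 20
WARMUP_STEPS = 25
PEAK_LR = 1e-4

# Assumes the last partial batch is kept and a single device is used.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS


def lr_at(step):
    """Learning rate at optimizer step `step` (0-based)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps)
for step in (0, WARMUP_STEPS, 6 * steps_per_epoch, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, early stopping at epoch 6 ends training at step 1,554 of 5,180, when the learning rate is still roughly 80% of its peak.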
| ### Framework Versions | |
| - Transformers 5.7.0 | |
| - PyTorch 2.11.0+cu130 | |
| - Datasets 4.8.5 | |
| - Tokenizers 0.22.2 | |
| ## Classes | |
| | Label | ID | Description | Example | | |
| |-------|-----|------------|---------| | |
| | `human` | 0 | Live person on the phone | *"Hello? Yes, who is this?"* | | |
| | `voicemail` | 1 | Personal voicemail greeting | *"Hi, you've reached John. Leave a message after the beep."* | | |
| | `ivr` | 2 | IVR system / automated menu | *"Press 1 for sales, press 2 for support..."* | | |
| | `answering_machine` | 3 | Carrier/generic automated message | *"The number you have dialed is not available..."* | | |
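For downstream code that works with raw class indices, the table maps onto a pair of lookup dictionaries. The sketch below mirrors the IDs above. The `is_live_person` helper is a hypothetical convenience, not something shipped with the model.

```python
# Label IDs as listed in the Classes table.
ID2LABEL = {0: "human", 1: "voicemail", 2: "ivr", 3: "answering_machine"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def is_live_person(class_id):
    """True only for the `human` class; all other classes are automated."""
    return ID2LABEL[class_id] == "human"


print(LABEL2ID["answering_machine"], is_live_person(0), is_live_person(2))
```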
| ## Limitations | |
| - Trained primarily on English, French, Spanish, and German audio | |
| - TTS-generated non-human classes may not fully represent all real-world telephony systems | |
| - Best performance on the first 10 seconds of audio (training clips were truncated to 10s) | |
| - Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400Hz) | |
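| - The model may confuse voicemail greetings with answering machine messages in edge cases: 3 of the 5 test-set errors fall between these two classes (2 answering machine clips predicted as voicemail, 1 voicemail clip predicted as answering machine) | |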
| ## Files | |
| - `model.safetensors` — Model weights (31.7MB) | |
| - `config.json` — Model configuration | |
| - `preprocessor_config.json` — Feature extractor config | |
| - `streaming_amd.py` — Streaming real-time inference module | |
| - `train_local.py` — Training script (CLI args, RTX 5090 ready) | |
| ## Citation | |
| ```bibtex | |
| @misc{whisper-telephony-amd, | |
| author = {AbijahKaj}, | |
| title = {Whisper Telephony AMD: Real-Time Answering Machine Detection}, | |
| year = {2026}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd} | |
| } | |
| ``` | |