	---
	language:
	- en
	- fr
	- es
	- de
	license: apache-2.0
	library_name: transformers
	pipeline_tag: audio-classification
	tags:
	- whisper
	- audio-classification
	- telephony
	- answering-machine-detection
	- amd
	- speech-processing
	- real-time
	- generated_from_trainer
	datasets:
	- AbijahKaj/telephony-amd-dataset
	- PolyAI/minds14
	- pipecat-ai/human_5_all
	- pipecat-ai/human_convcollector_1
	- pipecat-ai/smart-turn-data-v3.2-train
	base_model: openai/whisper-tiny
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: whisper-telephony-amd
	  results:
	  - task:
	      type: audio-classification
	      name: Audio Classification
	    dataset:
	      name: telephony-amd-dataset
	      type: AbijahKaj/telephony-amd-dataset
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.9875
	      name: Accuracy
	    - type: f1
	      value: 0.99
	      name: F1 (macro)
	    - type: precision
	      value: 0.99
	      name: Precision (macro)
	    - type: recall
	      value: 0.99
	      name: Recall (macro)
	---

	# Whisper Telephony AMD (Answering Machine Detection)

	A real-time audio classifier that detects whether a phone call was answered by a live human, a personal voicemail greeting, an IVR system, or a generic answering-machine message. It fine-tunes Whisper's speech encoder so that it can separate human-recorded voicemail greetings from live speech.

	## Results

	98.75% accuracy on 400 test samples with only 5 misclassifications:

	```
	                   precision    recall  f1-score   support

	            human       1.00      0.99      1.00       114
	        voicemail       0.96      0.99      0.98       102
	              ivr       1.00      0.99      0.99        92
	answering_machine       0.99      0.98      0.98        92

	         accuracy                           0.99       400
	        macro avg       0.99      0.99      0.99       400
	     weighted avg       0.99      0.99      0.99       400
	```

	Confusion Matrix (rows = actual, columns = predicted):

	| | Human | Voicemail | IVR | Answering Machine |
	|---|:---:|:---:|:---:|:---:|
	| Human | 113 | 1 | 0 | 0 |
	| Voicemail | 0 | 101 | 0 | 1 |
	| IVR | 0 | 1 | 91 | 0 |
	| Answering Machine | 0 | 2 | 0 | 90 |
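
The headline accuracy and the per-class figures in the report can be recomputed directly from the confusion matrix. The following is a minimal sketch in plain Python; the names `LABELS`, `CONFUSION`, and `per_class_metrics` are illustrative and do not come from the repository.

```python
# Confusion matrix copied from the table above (rows = actual, columns = predicted).
LABELS = ["human", "voicemail", "ivr", "answering_machine"]
CONFUSION = [
    [113, 1, 0, 0],
    [0, 101, 0, 1],
    [0, 1, 91, 0],
    [0, 2, 0, 90],
]


def per_class_metrics(matrix):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(LABELS):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as this class
        actual = sum(matrix[i])                    # row sum: everything that truly is this class
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label, (p, r, f1) in per_class_metrics(CONFUSION).items():
    print(f"{label:<18} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The lower voicemail precision comes from the voicemail column: 4 of the 5 errors are clips from other classes predicted as voicemail.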

	### Accuracy Per Epoch

	| Epoch | Accuracy | Eval Loss | Per-Class |
	|:-----:|:--------:|:---------:|-----------|
	| 1 | 98.75% | 0.0785 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
	| 2 | 95.75% | 0.1473 | human=94.7%, vm=93.1%, ivr=97.8%, am=97.8% |
	| 3 | 98.25% | 0.0779 | human=97.4%, vm=100%, ivr=97.8%, am=97.8% |
	| 4 | 98.75% | 0.0415 | human=99.1%, vm=99.0%, ivr=98.9%, am=97.8% |
	| 5 | 98.75% | 0.0569 | human=99.1%, vm=98.0%, ivr=98.9%, am=98.9% |
	| 6 | 98.00% | 0.0539 | human=97.4%, vm=99.0%, ivr=97.8%, am=97.8% |

	Early stopping triggered after epoch 6 (patience=5, best at epoch 4). Best model loaded from epoch 4 checkpoint.
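
This card does not say which metric drove early stopping. As a rough illustration of how patience counting works, the sketch below replays the per-epoch values from the table under two assumed choices (accuracy, then eval loss). It assumes an epoch counts as an improvement only when it strictly beats the best value so far. The function `early_stop_epoch` is hypothetical and is not taken from the training script.

```python
# Per-epoch values copied from the table above.
ACCURACY = [98.75, 95.75, 98.25, 98.75, 98.75, 98.00]
EVAL_LOSS = [0.0785, 0.1473, 0.0779, 0.0415, 0.0569, 0.0539]
PATIENCE = 5


def early_stop_epoch(values, patience, greater_is_better):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if patience is never exhausted within the given values.
    A tie with the current best counts as "no improvement" (an assumption).
    """
    best_value, best_epoch, stale = None, None, 0
    for epoch, value in enumerate(values, start=1):
        if best_value is None:
            improved = True
        elif greater_is_better:
            improved = value > best_value
        else:
            improved = value < best_value
        if improved:
            best_value, best_epoch, stale = value, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch


by_accuracy = early_stop_epoch(ACCURACY, PATIENCE, greater_is_better=True)
by_loss = early_stop_epoch(EVAL_LOSS, PATIENCE, greater_is_better=False)
print("monitoring accuracy:", by_accuracy)
print("monitoring eval loss:", by_loss)
```

Under these assumptions, monitoring accuracy exhausts the patience of 5 at epoch 6, because epochs 4 and 5 only tie the 98.75% first reached at epoch 1. Monitoring eval loss puts the minimum at epoch 4 but would not yet have stopped by epoch 6.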

	## Model Details

	| | |
	|---|---|
	| Architecture | WhisperForAudioClassification (Whisper-Tiny encoder + linear classifier) |
	| Base model | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) |
	| Parameters | 8.3M total, 7.2M trainable (conv layers frozen) |
	| Input | 16kHz mono audio → 80-bin mel spectrogram (30s padded) |
	| Output | 4 classes: `human`, `voicemail`, `ivr`, `answering_machine` |
	| Inference speed | ~12ms CPU (ONNX int8), <5ms GPU |
	| Model size | 31.7 MB (safetensors) |
	| Design reference | Same architecture as [pipecat-ai/smart-turn-v3](https://hf.co/pipecat-ai/smart-turn-v3) |
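
The input row implies fixed tensor sizes regardless of clip length. The arithmetic below is a small sketch, assuming Whisper's standard feature-extractor hop of 160 samples and the stride-2 convolution in its encoder. `clip_audio` is a hypothetical helper that mirrors the 10 s cap listed under Hyperparameters.

```python
SAMPLE_RATE = 16_000     # Hz, from the Input row above
WINDOW_SECONDS = 30      # Whisper pads every clip to a 30 s window
HOP_LENGTH = 160         # samples per mel frame in Whisper's feature extractor
N_MELS = 80              # mel bins, from the Input row above
MAX_CLIP_SECONDS = 10    # training-time cap listed under Hyperparameters


def clip_audio(samples, max_seconds=MAX_CLIP_SECONDS, sample_rate=SAMPLE_RATE):
    """Keep only the first max_seconds of a 1-D sample sequence."""
    return samples[: max_seconds * sample_rate]


window_samples = SAMPLE_RATE * WINDOW_SECONDS   # 480_000 samples after padding
mel_frames = window_samples // HOP_LENGTH       # 3_000 spectrogram frames
encoder_positions = mel_frames // 2             # 1_500 positions after the stride-2 conv

print(f"input_features shape per clip: ({N_MELS}, {mel_frames})")
print(f"encoder sequence length: {encoder_positions}")
```

Because of the padding, a 2-second clip and a 10-second clip cost the same to encode.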

	## Quick Start

	### Pipeline (simplest)
	```python
	from transformers import pipeline

	classifier = pipeline("audio-classification", model="AbijahKaj/whisper-telephony-amd")
	result = classifier("phone_call.wav")
	print(result)
	# [{'score': 0.98, 'label': 'human'}, {'score': 0.01, 'label': 'voicemail'}, ...]
	```

	### Manual Inference
	```python
	from transformers import WhisperForAudioClassification, AutoFeatureExtractor
	import torch

	model = WhisperForAudioClassification.from_pretrained("AbijahKaj/whisper-telephony-amd")
	fe = AutoFeatureExtractor.from_pretrained("AbijahKaj/whisper-telephony-amd")

	# audio_array: 1-D float numpy array, mono, sampled at 16 kHz
	inputs = fe(audio_array, sampling_rate=16000, return_tensors="pt")
	with torch.no_grad():
	    logits = model(**inputs).logits
	pred = torch.argmax(logits, dim=-1).item()
	# id2label is keyed by int once the config has been loaded
	label = model.config.id2label[pred]
	print(f"Predicted: {label}")
	```
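
The manual path returns only the top class. When a per-class confidence is needed, for example to apply a threshold before acting on a prediction, the logits can be passed through a softmax. This is a minimal plain-Python sketch. The label order is taken from the Classes table in this card, and the example logits are made up for illustration.

```python
import math

# Label order taken from the Classes table in this card.
ID2LABEL = {0: "human", 1: "voicemail", 2: "ivr", 3: "answering_machine"}


def scores_from_logits(logits):
    """Softmax a list of raw logits and return (label, probability) pairs, highest first."""
    peak = max(logits)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = [(ID2LABEL[i], e / total) for i, e in enumerate(exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits; with the snippet above, pass logits[0].tolist() instead.
example = scores_from_logits([4.2, 0.3, -1.1, -0.8])
for label, prob in example:
    print(f"{label:<18} {prob:.3f}")
```

With PyTorch already imported, `torch.softmax(logits, dim=-1)` gives the same probabilities in one call.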

	### Streaming Real-Time Inference
	```python
	from streaming_amd import StreamingAMDClassifier

	classifier = StreamingAMDClassifier("AbijahKaj/whisper-telephony-amd")

	for pcm_chunk in audio_stream:  # 160ms chunks @ 8kHz
	    result = classifier.process_chunk(pcm_chunk)
	    if result:
	        label, confidence, elapsed_s = result
	        print(f"{label} ({confidence:.0%}) after {elapsed_s:.1f}s")
	        break
	```
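
The internals of `streaming_amd.py` are not shown in this card, so the following is not its implementation. It is a rough sketch of the preparation an 8 kHz telephony stream needs before it can feed a 16 kHz model. It assumes 16-bit little-endian mono PCM, and the helper names `pcm16_to_float` and `upsample_2x` are hypothetical.

```python
import struct

TELEPHONY_RATE = 8_000   # Hz, from the "160ms chunks @ 8kHz" comment above
MODEL_RATE = 16_000      # Hz, the model's documented input rate
CHUNK_MS = 160

SAMPLES_PER_CHUNK = TELEPHONY_RATE * CHUNK_MS // 1000   # 1280 samples
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * 2                 # 2560 bytes, ASSUMING 16-bit PCM


def pcm16_to_float(chunk: bytes) -> list[float]:
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(chunk) // 2
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return [s / 32768.0 for s in samples]


def upsample_2x(samples: list[float]) -> list[float]:
    """Double the sample rate by linear interpolation (crude; a real resampler also filters)."""
    out = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) / 2.0)
    return out


# Hypothetical chunk of silence, sized like one 160 ms frame.
chunk = bytes(BYTES_PER_CHUNK)
audio_16k = upsample_2x(pcm16_to_float(chunk))
print(len(audio_16k), "samples at", MODEL_RATE, "Hz")
```

In practice a proper polyphase resampler is the better choice. The linear interpolation here only keeps the sketch dependency-free.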

	## Why Whisper?

	Voicemail greetings are recorded by real humans, so they can be acoustically indistinguishable from live speech. Acoustic-only models built on energy, pitch, or spectral features cannot reliably tell "Hi, I'm not available, leave a message" from "Hello? Who's calling?".

	Whisper's encoder was pre-trained on 680K hours of speech, so its representations carry information about what is being said, not just how it sounds. That linguistic signal is what separates a recorded greeting from a live answer, which makes it critical for AMD.

	## Training

	### Dataset

	[AbijahKaj/telephony-amd-dataset](https://hf.co/datasets/AbijahKaj/telephony-amd-dataset) — 8,264 train / 400 test samples, balanced across 4 classes (~2,000 each).

	Data sources:

	| Class | Count | Sources |
	|-------|-------|---------|
	| Human | 2,151 | [PolyAI/minds14](https://hf.co/datasets/PolyAI/minds14) (real telephony callers, 6 languages), [pipecat-ai/human_5_all](https://hf.co/datasets/pipecat-ai/human_5_all), [pipecat-ai/human_convcollector_1](https://hf.co/datasets/pipecat-ai/human_convcollector_1), original edge-tts |
	| Voicemail | 2,078 | [pipecat-ai smart-turn rime_2](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (personal greeting style), original edge-tts |
	| IVR | 2,017 | [pipecat-ai smart-turn chirp3](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (automated system style), original edge-tts |
	| Answering Machine | 2,018 | [pipecat-ai smart-turn orpheus](https://hf.co/datasets/pipecat-ai/smart-turn-data-v3.2-train) TTS (machine greeting style), original edge-tts |

	### Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Learning rate | 1e-4 |
	| Scheduler | Cosine with 25 warmup steps |
	| Batch size | 32 |
	| Gradient accumulation | 1 |
	| Max epochs | 20 (early stopped at 6) |
	| Weight decay | 0.01 |
	| Precision | FP16 |
	| Gradient checkpointing | Enabled |
	| Freeze strategy | Conv layers frozen, transformer layers + head trainable |
	| Early stopping patience | 5 |
	| Max audio length | 10s (truncated, padded to 30s for Whisper) |
	| Hardware | Tesla T4 (16GB VRAM) |
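
To make the scheduler row concrete, the sketch below computes the step counts implied by the table and evaluates a standard linear-warmup, half-cosine-decay curve. It assumes a single device, no dropped final batch, and a schedule horizon of the full 20 epochs. The function `lr_at` is illustrative and is not taken from `train_local.py`.

```python
import math

TRAIN_SAMPLES = 8_264    # train split size from the Dataset section
BATCH_SIZE = 32
MAX_EPOCHS = 20
WARMUP_STEPS = 25
PEAK_LR = 1e-4

STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)   # 259
TOTAL_STEPS = STEPS_PER_EPOCH * MAX_EPOCHS                # 5180


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


stop_step = STEPS_PER_EPOCH * 6   # training stopped after epoch 6
for step in (0, WARMUP_STEPS, stop_step, TOTAL_STEPS):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Since training stopped after epoch 6 of a 20-epoch schedule, the learning rate was still well above zero at the final step under these assumptions.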

	### Framework Versions

	- Transformers 5.7.0
	- PyTorch 2.11.0+cu130
	- Datasets 4.8.5
	- Tokenizers 0.22.2

	## Classes

	| Label | ID | Description | Example |
	|-------|-----|------------|---------|
	| `human` | 0 | Live person on the phone | "Hello? Yes, who is this?" |
	| `voicemail` | 1 | Personal voicemail greeting | "Hi, you've reached John. Leave a message after the beep." |
	| `ivr` | 2 | IVR system / automated menu | "Press 1 for sales, press 2 for support..." |
	| `answering_machine` | 3 | Carrier/generic automated message | "The number you have dialed is not available..." |

	## Limitations

	- Trained primarily on English, French, Spanish, and German audio
	- TTS-generated non-human classes may not fully represent all real-world telephony systems
	- Best performance on the first 10 seconds of audio (training clips were truncated to 10 s)
	- Not tested on noisy cellular connections or VoIP codec artifacts beyond telephony bandpass (300-3400Hz)
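	- The model may confuse voicemail greetings with answering machine messages in edge cases: 3 of the 5 test-set errors fall between these two classes (2 answering machine clips predicted as voicemail, 1 voicemail clip predicted as answering machine)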

	## Files

	- `model.safetensors` — Model weights (31.7MB)
	- `config.json` — Model configuration
	- `preprocessor_config.json` — Feature extractor config
	- `streaming_amd.py` — Streaming real-time inference module
	- `train_local.py` — Training script (CLI args, RTX 5090 ready)

	## Citation

	```bibtex
	@misc{whisper-telephony-amd,
	author = {AbijahKaj},
	title = {Whisper Telephony AMD: Real-Time Answering Machine Detection},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/AbijahKaj/whisper-telephony-amd}
	}
	```