---
license: other
license_name: whissle-inference-only-1.0
license_link: LICENSE
language:
- multilingual
- en
- hi
- es
- fr
- de
- it
- gu
- mr
pipeline_tag: automatic-speech-recognition
tags:
- nemo
- asr
- onnx
- cpu
- emotion
- age
- gender
- intent
- entity_recognition
datasets:
- MLCommons/peoples_speech
- fsicoli/common_voice_17_0
- ai4bharat/IndicVoices
- facebook/multilingual_librispeech
- openslr/librispeech_asr
base_model:
- nvidia/parakeet-ctc-0.6b
library_name: nemo
extra_gated_heading: "Access Whissle STT-meta-1B on Hugging Face"
extra_gated_description: >
  This model is licensed for inference only — no training, fine-tuning, distillation,
  or reverse engineering permitted. Accept the license to access. Automatic approval.
extra_gated_button_content: "Agree and access repository"
extra_gated_fields:
  First Name: text
  Last Name: text
  Organization: text
  Country: country
  Date of birth: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - Commercial product
      - Personal project
      - label: Other
        value: other
  I accept the Whissle Inference-Only License Agreement: checkbox
extra_gated_prompt: >-
  By clicking "Agree", you accept the Whissle Inference-Only License Agreement.
  See the LICENSE file for full terms. Key restrictions: INFERENCE ONLY — no
  training, fine-tuning, distillation, model compression, or reverse engineering
  permitted. Free for inference use under 100M MAU. "Powered by Whissle"
  attribution required for redistribution.
---

# Whissle STT-meta-1B

Multilingual speech recognition model with a dual-head tag classifier for real-time speaker metadata extraction. It is built on a Conformer-CTC architecture with 18 encoder layers and predicts age, gender, emotion, and intent for each utterance. It supports the 8 languages listed under Model Details.

## Model Details

| | |
|---|---|
| **Architecture** | Conformer-CTC + dual-head tag classifier |
| **Encoder** | 512-dim, 18 layers, 4x subsampling |
| **Download size** | ~488 MB |
| **Format** | NeMo (.nemo) and ONNX (CPU and GPU compatible) |
| **Sample rate** | 16 kHz mono |
| **Languages** | English, Hindi, Spanish, French, German, Italian, Gujarati, Marathi |
| **Base model** | [nvidia/parakeet-ctc-0.6b](https://huggingface.co/nvidia/parakeet-ctc-0.6b) |
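
The table lists 16 kHz mono as the expected input format. As a minimal sketch, the check below uses Python's standard `wave` module to confirm that a WAV file matches before it is sent to the model. The helper name `is_model_ready_wav` is hypothetical and not part of any Whissle or NeMo API, and the sketch only inspects the header; it does not resample or downmix.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate from the Model Details table
EXPECTED_CHANNELS = 1      # mono


def is_model_ready_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz mono.

    Hypothetical helper for illustration; it only reads the WAV header.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice first.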

## Tag Classifier Outputs

| Category | Classes | Labels |
|----------|---------|--------|
| **Age** | 6 | 0-18, 18-30, 30-45, 45-60, 60+, NONE |
| **Emotion** | 8 | NEUTRAL, HAPPY, SAD, ANGRY, FEAR, SURPRISE, DISGUST, NONE |
| **Gender** | 4 | MALE, FEMALE, OTHER, NONE |
| **Intent** | 10 | COMMAND, DESCRIBE, EXCLAIM, EXPLAIN, INFORM, OPINION, QUESTION, REQUEST, STATEMENT, NONE |

## Quick Start

Use with the [Whissle STT Inference Server](https://github.com/WhissleAI/whissle_stt_inference) (ONNX, CPU):

```bash
git clone https://github.com/WhissleAI/whissle_stt_inference.git
cd whissle_stt_inference
./setup.sh --model en-meta
```

Or load directly with NeMo:

```python
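import nemo.collections.asr as nemo_asr

# The repository is gated: accept the license and log in to Hugging Face first.
asr_model = nemo_asr.models.ASRModel.from_pretrained("WhissleAI/STT-meta-1B")
# Expects 16 kHz mono audio; returns one result per input file.
transcriptions = asr_model.transcribe(["/path/to/your/audio.wav"])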
```

Also usable with [PromptingNemo](https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr).

## Performance

Tested on CPU (Apple M-series):

| Audio length | Inference time | RTF | Tags |
|-------------|---------------|-----|------|
| 25.9s | 3.6s | 0.14x | Female, 30-45, Neutral, Describe |
| 1.1s | 0.46s | 0.42x | Female, 18-30, Happy, Question |

## License

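[Whissle Inference-Only License](./LICENSE): inference only. Training, fine-tuning, distillation, model compression, and reverse engineering are not permitted. Free for inference use under 100M monthly active users (MAU). "Powered by Whissle" attribution is required for redistribution.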

## Citation

```bibtex
@misc{whissle2026sttmeta1b,
  title={Whissle STT-meta-1B: Multilingual ASR with Intent, Emotion, and Voice Biometrics},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/STT-meta-1B}
}
```