---
license: other
license_name: whissle-inference-only-1.0
license_link: LICENSE
language:
- multilingual
- en
- hi
- es
- fr
- de
- it
- gu
- mr
pipeline_tag: automatic-speech-recognition
tags:
- nemo
- asr
- onnx
- cpu
- emotion
- age
- gender
- intent
- entity_recognition
datasets:
- MLCommons/peoples_speech
- fsicoli/common_voice_17_0
- ai4bharat/IndicVoices
- facebook/multilingual_librispeech
- openslr/librispeech_asr
base_model:
- nvidia/parakeet-ctc-0.6b
library_name: nemo
extra_gated_heading: "Access Whissle STT-meta-1B on Hugging Face"
extra_gated_description: >
  This model is licensed for inference only: no training, fine-tuning,
  distillation, or reverse engineering permitted. Accept the license to access.
  Automatic approval.
extra_gated_button_content: "Agree and access repository"
extra_gated_fields:
  First Name: text
  Last Name: text
  Organization: text
  Country: country
  Date of birth: date_picker
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - Commercial product
    - Personal project
    - label: Other
      value: other
  I accept the Whissle Inference-Only License Agreement: checkbox
extra_gated_prompt: >-
  By clicking "Agree", you accept the Whissle Inference-Only License Agreement.
  See the LICENSE file for full terms. Key restrictions: INFERENCE ONLY: no
  training, fine-tuning, distillation, model compression, or reverse engineering
  permitted. Free for inference use under 100M MAU. "Powered by Whissle"
  attribution required for redistribution.
---

# Whissle STT-meta-1B

Multilingual speech recognition model with a dual-head tag classifier for real-time speaker metadata extraction. Built on a Conformer-CTC architecture with 18 encoder layers. Supports 8 languages and detects age, gender, emotion, and intent per utterance.

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Conformer-CTC + dual-head tag classifier |
| **Encoder** | 512-dim, 18 layers, 4x subsampling |
| **Download size** | ~488 MB |
| **Format** | NeMo (.nemo) and ONNX (CPU and GPU compatible) |
| **Sample rate** | 16 kHz mono |
| **Languages** | English, Hindi, Spanish, French, German, Italian, Gujarati, Marathi |
| **Base model** | [nvidia/parakeet-ctc-0.6b](https://huggingface.co/nvidia/parakeet-ctc-0.6b) |

## Tag Classifier Outputs

| Category | Classes | Labels |
|----------|---------|--------|
| **Age** | 6 | 0-18, 18-30, 30-45, 45-60, 60+, NONE |
| **Emotion** | 8 | NEUTRAL, HAPPY, SAD, ANGRY, FEAR, SURPRISE, DISGUST, NONE |
| **Gender** | 4 | MALE, FEMALE, OTHER, NONE |
| **Intent** | 10 | COMMAND, DESCRIBE, EXCLAIM, EXPLAIN, INFORM, OPINION, QUESTION, REQUEST, STATEMENT, NONE |

## Quick Start

Use with the [Whissle STT Inference Server](https://github.com/WhissleAI/whissle_stt_inference) (ONNX, CPU):

```bash
git clone https://github.com/WhissleAI/whissle_stt_inference.git
cd whissle_stt_inference
./setup.sh --model en-meta
```

Or load directly with NeMo:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("WhissleAI/STT-meta-1B")
transcriptions = asr_model.transcribe(["/path/to/your/audio.wav"])
```

Also usable with [PromptingNemo](https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr).

## Performance

Tested on CPU (Apple M-series):

| Audio length | Inference time | RTF | Tags |
|-------------|---------------|-----|------|
| 25.9s | 3.6s | 0.14x | Female, 30-45, Neutral, Describe |
| 1.1s | 0.46s | 0.42x | Female, 18-30, Happy, Question |

## License

[Whissle Inference-Only License](./LICENSE): inference only, no training, fine-tuning, distillation, or reverse engineering. Free under 100M MAU.

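
## Reproducing the RTF Column

The RTF values in the Performance table are consistent with the usual definition of real-time factor: inference time divided by audio duration, so values below 1.0 mean faster than real time. The sketch below assumes that definition, which the card does not state explicitly, and recomputes both rows from the table. The helper name `real_time_factor` is illustrative and not part of NeMo or the Whissle server.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return inference time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# (audio length, inference time) pairs copied from the Performance table
measurements = [(25.9, 3.6), (1.1, 0.46)]

for audio_s, infer_s in measurements:
    rtf = real_time_factor(infer_s, audio_s)
    print(f"{audio_s}s audio: RTF = {rtf:.2f}x")
# prints:
# 25.9s audio: RTF = 0.14x
# 1.1s audio: RTF = 0.42x
```

To compare against your own hardware, time a single `transcribe` call and pass the elapsed seconds together with the clip duration to the same function.
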
## Citation

```bibtex
@misc{whissle2026sttmeta1b,
  title={Whissle STT-meta-1B: Multilingual ASR with Intent, Emotion, and Voice Biometrics},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/STT-meta-1B}
}
```