llm-semantic-router/multi-modal-embed-small

@@ -298,11 +298,11 @@ fused_embedding = model.encode_multimodal(
 The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
-| Dataset | Size | Description |
-|---------|------|-------------|
-| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 400K | Largest audio-caption dataset |
-| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K | YouTube audio with human captions |
-| [Clotho](https://zenodo.org/record/3490684) | 6K | High-quality multi-annotator captions |
+| Dataset | Size | Source | Paper |
+|---------|------|--------|-------|
+| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
+| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
+| [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
 This will enable:
 - Audio-to-text retrieval