Sentence Similarity
Transformers
ONNX
Safetensors
sentence-transformers
English
multimodal
embeddings
image-text
audio-text
retrieval
2DMSE
matryoshka
Eval Results (legacy)
Instructions to use llm-semantic-router/multi-modal-embed-small with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use llm-semantic-router/multi-modal-embed-small with Transformers:
    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("llm-semantic-router/multi-modal-embed-small", dtype="auto")
- sentence-transformers
How to use llm-semantic-router/multi-modal-embed-small with sentence-transformers:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("llm-semantic-router/multi-modal-embed-small")

    sentences = [
        "That is a happy person",
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)
    # [4, 4]
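For retrieval, the similarity matrix from the sentence-transformers snippet is typically used to rank candidates against a query. The sketch below illustrates that step with plain NumPy, assuming cosine similarity (the sentence-transformers default unless a model's configuration overrides it). The toy vectors stand in for `model.encode(...)` output, and the helper names `cosine_similarity_matrix` and `top_k` are hypothetical, not part of the model or either library.

    import numpy as np

    def cosine_similarity_matrix(a, b):
        """Pairwise cosine similarity between the rows of a and the rows of b."""
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        # Normalize each row to unit length, then take dot products.
        a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
        b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a_unit @ b_unit.T

    def top_k(query_vec, corpus, k=2):
        """Return (index, score) pairs for the k corpus rows most similar to query_vec."""
        scores = cosine_similarity_matrix(query_vec[None, :], corpus)[0]
        order = np.argsort(-scores)[:k]  # highest score first
        return [(int(i), float(scores[i])) for i in order]

    # Toy 3-dimensional vectors standing in for real embeddings (assumption).
    corpus = np.array([
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ])
    query = np.array([1.0, 0.0, 0.0])

    similarities = cosine_similarity_matrix(corpus, corpus)
    ranked = top_k(query, corpus, k=2)
    print(similarities.shape)
    print(ranked)

With real embeddings, `corpus` would be the array returned by `model.encode(sentences)` and `query` a single encoded query; the ranking logic stays the same.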
Add verified dataset sources with paper citations
README.md
CHANGED
@@ -298,11 +298,11 @@ fused_embedding = model.encode_multimodal(
 
 The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
 
-| Dataset | Size |
-|---------|------|-------------|
-| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) |
-| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K |
-| [Clotho](https://zenodo.org/
+| Dataset | Size | Source | Paper |
+|---------|------|--------|-------|
+| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
+| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
+| [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
 
 This will enable:
 - Audio-to-text retrieval