Sentence Similarity
Transformers
PyTorch
ONNX
sentence-transformers
Arabic
bert
feature-extraction
miniDense
passage-retrieval
knowledge-distillation
middle-training
text-embeddings-inference
Instructions to use prithivida/miniDense_arabic_v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use prithivida/miniDense_arabic_v1 with Transformers:
    # Load model directly
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("prithivida/miniDense_arabic_v1")
    model = AutoModel.from_pretrained("prithivida/miniDense_arabic_v1")

- sentence-transformers
How to use prithivida/miniDense_arabic_v1 with sentence-transformers:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("prithivida/miniDense_arabic_v1")

    sentences = [
        "هذا شخص سعيد",
        "هذا كلب سعيد",
        "هذا شخص سعيد جدا",
        "اليوم هو يوم مشمس"
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)
    # [4, 4]

- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -5,16 +5,15 @@ language:
 datasets:
 - MIRACL
 tags:
-- miniMiracle
 - miniDense
 - passage-retrieval
 - knowledge-distillation
 - middle-training
 - sentence-transformers
 pretty_name: >-
-
 multilingual embedders / retrievers, primarily focussed on Indo-Aryan and
-Indo-
 library_name: transformers
 pipeline_tag: sentence-similarity
 ---
@@ -64,10 +63,10 @@ pipeline_tag: sentence-similarity

 # Request, Terms, Disclaimers

-https://github.com/sponsors/PrithivirajDamodaran

 <center>
 <img src="./ar_terms.png" width=250%/>
 </center>

@@ -178,20 +177,31 @@ The below numbers are with mDPR model, but miniDense_arabic_v1 should give a eve

 *Note: The MIRACL paper reports a different (higher) value for BM25 Arabic, so we take that value from the BGE-M3 paper; all other values are from the MIRACL paper.*

-####
-
-

-

 <center>
-
 </center>

 <br/>

 # Roadmap
-We will add

 - Spanish
 - Tamil
@@ -203,7 +213,7 @@ We will add miniMiracle series of models for all popular languages as we see fit

 We welcome anyone to reproduce our results. Here are some tips and observations:

-- Use CLS Pooling and Inner Product.
 - There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9.

 Here are our numbers for the full Hindi run on BGE-M3

 datasets:
 - MIRACL
 tags:
 - miniDense
 - passage-retrieval
 - knowledge-distillation
 - middle-training
 - sentence-transformers
 pretty_name: >-
+miniDense is a family of high-quality, lightweight, and easy-to-deploy
 multilingual embedders / retrievers, primarily focussed on Indo-Aryan and
+Indo-Dravidian languages.
 library_name: transformers
 pipeline_tag: sentence-similarity
 ---

 # Request, Terms, Disclaimers


 <center>
 <img src="./ar_terms.png" width=250%/>
+<b><p>[https://github.com/sponsors/PrithivirajDamodaran](https://github.com/sponsors/PrithivirajDamodaran)</p></b>
 </center>


 *Note: The MIRACL paper reports a different (higher) value for BM25 Arabic, so we take that value from the BGE-M3 paper; all other values are from the MIRACL paper.*

+#### MTEB numbers:
+MTEB is a general-purpose embedding evaluation benchmark covering a wide range of tasks, but miniDense models (like BGE-M3) are predominantly tuned for retrieval tasks aimed at search and IR use cases.
+So it makes sense to evaluate our models on the retrieval slice of the MTEB benchmark.

+##### Long Document Retrieval

 <center>
+<img src="./ar_metrics_4.png" width=100%/>
+<b><p>Table 3: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
+</center>
+
+
+##### X-lingual Retrieval
+
+Almost all models below are monolingual Arabic models, so they have no notion of any other language.
+
+<center>
+<img src="./ar_metrics_5.png" width=100%/>
+<b><p>Table 4: Detailed Arabic retrieval performance on the 3 X-lingual test sets (measured by nDCG@10)</p></b>
 </center>

 <br/>

 # Roadmap
+We will add the miniDense series of models for all popular languages, in phases, as we see fit or based on community requests. Some of the languages on our list are:

 - Spanish
 - Tamil

 We welcome anyone to reproduce our results. Here are some tips and observations:

+- Use CLS pooling (not mean pooling) and inner product (not cosine similarity).
 - There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9.

 Here are our numbers for the full Hindi run on BGE-M3
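
The reproduction tip above (CLS pooling with inner-product scoring) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's evaluation code: the helper names (`cls_pool`, `inner_product`, `cosine`, `rank_passages`, `encode_with_transformers`) are hypothetical, the toy vectors are made up, and `encode_with_transformers` assumes `torch` and `transformers` are installed and that position 0 of `last_hidden_state` holds the [CLS] vector.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cls_pool(token_embeddings: Sequence[Sequence[Vector]]) -> List[List[float]]:
    """CLS pooling: keep only the first token's vector of each sequence."""
    return [list(tokens[0]) for tokens in token_embeddings]


def inner_product(a: Vector, b: Vector) -> float:
    """Unnormalized dot product, so vector length affects the score."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, shown only for contrast with inner product."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def rank_passages(query: Vector, passages: Sequence[Vector], score=inner_product) -> List[int]:
    """Return passage indices ordered from highest to lowest score."""
    return sorted(range(len(passages)), key=lambda i: score(query, passages[i]), reverse=True)


def encode_with_transformers(
    texts: List[str], model_name: str = "prithivida/miniDense_arabic_v1"
) -> List[List[float]]:
    """Hypothetical helper: embed texts with CLS pooling (downloads the model)."""
    import torch  # imported lazily so the toy example below runs without it
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        last_hidden_state = model(**batch).last_hidden_state
    # Index 0 along the token axis is the [CLS] position for BERT-style tokenizers.
    return last_hidden_state[:, 0].tolist()


# Toy vectors (made up) on which the two scoring functions disagree.
query = [1.0, 0.0]
passages = [[3.0, 3.0], [2.0, 0.0]]
ip_ranking = rank_passages(query, passages)                 # inner product: 3.0 vs 2.0
cos_ranking = rank_passages(query, passages, score=cosine)  # cosine: about 0.707 vs 1.0
print(ip_ranking, cos_ranking)
```

Because the inner product is not length-normalized, swapping in cosine similarity can reorder the results, which is one plausible source of drift when reproduced numbers do not match.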