prithivida committed on
Commit cd1ad1f · verified · 1 Parent(s): 1dc6fd8

Update README.md

Files changed (1)
  1. README.md +21 -11
README.md CHANGED
@@ -5,16 +5,15 @@ language:
5
  datasets:
6
  - MIRACL
7
  tags:
8
- - miniMiracle
9
  - miniDense
10
  - passage-retrieval
11
  - knowledge-distillation
12
  - middle-training
13
  - sentence-transformers
14
  pretty_name: >-
15
- miniMiracle is a family of High-quality, Light Weight and Easy deploy
16
  multilingual embedders / retrievers, primarily focussed on Indo-Aryan and
17
- Indo-Dravidin Languages.
18
  library_name: transformers
19
  pipeline_tag: sentence-similarity
20
  ---
@@ -64,10 +63,10 @@ pipeline_tag: sentence-similarity
64
 
65
  # Request, Terms, Disclaimers
66
 
67
- https://github.com/sponsors/PrithivirajDamodaran
68
 
69
  <center>
70
  <img src="./ar_terms.png" width=250%/>
 
71
  </center>
72
 
73
 
@@ -178,20 +177,31 @@ The below numbers are with mDPR model, but miniDense_arabic_v1 should give a eve
178
 
179
  *Note: MIRACL paper shows a different (higher) value for BM25 Arabic, So we are taking that value from BGE-M3 paper, rest all are form the MIRACL paper.*
180
 
181
- #### cMTEB numbers:
182
- CMTEB is a general purpose embedding evaluation benchmark covering wide range of tasks, but like BGE-M3, miniMiracle models are predominantly tuned for retireval tasks aimed at search & IR based usecases.
183
- We ran the retrieval slice of the cMTEB.
184
 
185
- We compared the performance few top general purpose embedding models on the C-MTEB benchmark. please refer to the C-MTEB leaderboard. Almost all models below are bert-Arabic based so they have no notion of any other languages.
186
 
187
  <center>
188
- <img src="./ar_metrics_3.png" width=150%/>
189
  </center>
190
 
191
  <br/>
192
 
193
  # Roadmap
194
- We will add miniMiracle series of models for all popular languages as we see fit or based on community requests in phases. Some of the languages we have in our list are
195
 
196
  - Spanish
197
  - Tamil
@@ -203,7 +213,7 @@ We will add miniMiracle series of models for all popular languages as we see fit
203
 
204
  We welcome anyone to reproduce our results. Here are some tips and observations:
205
 
206
- - Use CLS Pooling and Inner Product.
207
  - There *may be* minor differences in the numbers when reproducing, for instance BGE-M3 reports a nDCG@10 of 59.3 for MIRACL hindi and we Observed only 58.9.
208
 
209
  Here are our numbers for the full hindi run on BGE-M3
 
5
  datasets:
6
  - MIRACL
7
  tags:
 
8
  - miniDense
9
  - passage-retrieval
10
  - knowledge-distillation
11
  - middle-training
12
  - sentence-transformers
13
  pretty_name: >-
14
+ miniDense is a family of High-quality, Light Weight and Easy deploy
15
  multilingual embedders / retrievers, primarily focussed on Indo-Aryan and
16
+ Indo-Dravidian Languages.
17
  library_name: transformers
18
  pipeline_tag: sentence-similarity
19
  ---
 
63
 
64
  # Request, Terms, Disclaimers
65
 
 
66
 
67
  <center>
68
  <img src="./ar_terms.png" width=250%/>
69
+ <b><p>[https://github.com/sponsors/PrithivirajDamodaran](https://github.com/sponsors/PrithivirajDamodaran)</p><b>
70
  </center>
71
 
72
 
 
177
 
178
  *Note: MIRACL paper shows a different (higher) value for BM25 Arabic, So we are taking that value from BGE-M3 paper, rest all are form the MIRACL paper.*
179
 
180
+ #### MTEB numbers:
181
+ MTEB is a general purpose embedding evaluation benchmark covering wide range of tasks, but miniDense models (like BGE-M3) are predominantly tuned for retireval tasks aimed at search & IR based usecases.
182
+ So it makes sense to evaluate our models in retrieval slice of the MTEB benchmark.
183
 
184
+ ##### Long Document Retrieval
185
 
186
  <center>
187
+ <img src="./ar_metrics_4.png" width=100%/>
188
+ <b><p>Table 3: Detailed Arabic retrieval performance on the MultiLongDoc dev set (measured by nDCG@10)</p></b>
189
+ </center>
190
+
191
+
192
+ ##### X-lingual Retrieval
193
+
194
+ Almost all models below are monolingual arabic models based so they have no notion of any other languages.
195
+
196
+ <center>
197
+ <img src="./ar_metrics_5.png" width=100%/>
198
+ <b><p>Table 4: Detailed Arabic retrieval performance on the 3 X-lingual test set (measured by nDCG@10)</p></b>
199
  </center>
200
 
201
  <br/>
202
 
203
  # Roadmap
204
+ We will add miniDense series of models for all popular languages as we see fit or based on community requests in phases. Some of the languages we have in our list are
205
 
206
  - Spanish
207
  - Tamil
 
213
 
214
  We welcome anyone to reproduce our results. Here are some tips and observations:
215
 
216
+ - Use CLS Pooling (not mean) and Inner Product (not cosine).
217
  - There *may be* minor differences in the numbers when reproducing, for instance BGE-M3 reports a nDCG@10 of 59.3 for MIRACL hindi and we Observed only 58.9.
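The pooling and scoring tip above can be illustrated with a minimal sketch. It assumes the encoder's last hidden state is available as an array of shape (batch, seq_len, hidden); the toy NumPy array and the helper names `cls_pool` and `inner_product_scores` are hypothetical stand-ins for illustration, not part of the model's published code.

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Take the first ([CLS]) token vector of each sequence.

    token_embeddings has shape (batch, seq_len, hidden), e.g. an
    encoder's last hidden state converted to a NumPy array.
    """
    return token_embeddings[:, 0, :]

def inner_product_scores(queries: np.ndarray, passages: np.ndarray) -> np.ndarray:
    """Score every query against every passage with a raw dot product.

    No L2 normalisation is applied, so this is inner product, not cosine.
    """
    return queries @ passages.T

# Toy stand-in for encoder output: 2 sequences, 3 tokens, 2 dimensions.
hidden = np.array([
    [[1.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
    [[3.0, 0.0], [9.0, 9.0], [9.0, 9.0]],
])

pooled = cls_pool(hidden)                       # only the first token of each row survives
scores = inner_product_scores(pooled, pooled)   # (2, 2) matrix of dot products
print(scores)
```

Mean pooling would average all three token vectors instead, and cosine similarity would divide each score by the vector norms, so either substitution changes the ranking scores and may explain mismatches when reproducing the reported numbers.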
218
 
219
  Here are our numbers for the full hindi run on BGE-M3