Instructions to use Alibaba-NLP/gte-base-en-v1.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Alibaba-NLP/gte-base-en-v1.5 with Transformers:
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True, dtype="auto"
)
```
- sentence-transformers
How to use Alibaba-NLP/gte-base-en-v1.5 with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```
- Transformers.js
How to use Alibaba-NLP/gte-base-en-v1.5 with Transformers.js:
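```javascript
// npm i @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Allocate pipeline. Transformers.js exposes embeddings through the
// 'feature-extraction' task; 'sentence-similarity' is a Hub tag, not a pipeline task.
const pipe = await pipeline('feature-extraction', 'Alibaba-NLP/gte-base-en-v1.5');
```
- Notebooks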
- Google Colab
- Kaggle
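The `[4, 4]` shape printed by the sentence-transformers snippet is one score per pair of input sentences. A minimal sketch of that pairwise computation in plain Python follows, assuming cosine similarity as the scoring function (the library's default unless a model configures another). The 3-dimensional vectors are invented stand-ins, not real model embeddings, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import math


def cosine_similarity_matrix(a, b):
    """Return a len(a) x len(b) matrix of cosine similarities."""

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
        for u in a
    ]


# Toy stand-ins for four sentence embeddings (values are invented).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

similarities = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(len(similarities), len(similarities[0]))  # 4 4
```

Each row compares one input against all four, so the diagonal holds each vector's similarity with itself.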
Update README.md
README.md
CHANGED
````diff
@@ -2622,7 +2622,8 @@ a SOTA instruction-tuned multi-lingual embedding model that ranked 2nd in MTEB a
 
 - **Developed by:** Institute for Intelligent Computing, Alibaba Group
 - **Model type:** Text Embeddings
-- **Paper:**
+- **Paper:** [mGTE: Generalized Long-Context Text Representation and Reranking
+Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669)
 
 <!-- - **Demo [optional]:** [More Information Needed] -->
 
@@ -2717,7 +2718,7 @@ console.log(similarities); // [34.504930869007296, 64.03973265120138, 19.5200426
 ### Training Data
 
 - Masked language modeling (MLM): `c4-en`
-- Weak-supervised contrastive
+- Weak-supervised contrastive pre-training (CPT): [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
 - Supervised contrastive fine-tuning: [GTE](https://arxiv.org/pdf/2308.03281.pdf) fine-tuning data
 
 ### Training Procedure
@@ -2728,8 +2729,8 @@ And then, we resample the data, reducing the proportion of short texts, and cont
 
 The entire training process is as follows:
 - MLM-2048: lr 5e-4, mlm_probability 0.3, batch_size 4096, num_steps 70000, rope_base 10000
-- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
--
+- [MLM-8192](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base): lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
+- CPT: max_len 512, lr 2e-4, batch_size 32768, num_steps 100000
 - Fine-tuning: TODO
 
 
@@ -2766,12 +2767,22 @@ The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=81
 If you find our paper or models helpful, please consider citing them as follows:
 
 ```
-@
-title={
-author={
-
-
+@misc{zhang2024mgte,
+title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
+year={2024},
+eprint={2407.19669},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2407.19669},
+}
+@misc{li2023gte,
+title={Towards General Text Embeddings with Multi-stage Contrastive Learning},
+author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
+year={2023},
+eprint={2308.03281},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2308.03281},
 }
 ```
-
-
````
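To put the stage settings in the training-procedure hunk in perspective, the sketch below multiplies batch size by step count for each stage. It assumes that `batch_size` counts examples per optimizer step and that the `2048` and `8192` in the MLM stage names are maximum sequence lengths. The README states neither, so the token figures are upper bounds under those assumptions, not reported numbers.

```python
# (name, batch_size, num_steps, max_len). batch_size and num_steps are copied
# from the training-procedure list in the diff. max_len for the two MLM stages
# is inferred from their names (an assumption); CPT's max_len is stated there.
STAGES = [
    ("MLM-2048", 4096, 70_000, 2048),
    ("MLM-8192", 1024, 20_000, 8192),
    ("CPT", 32_768, 100_000, 512),
]


def examples_seen(batch_size, num_steps):
    """Examples processed, assuming one batch per step."""
    return batch_size * num_steps


def token_upper_bound(batch_size, num_steps, max_len):
    """Upper bound on tokens, assuming every example is exactly max_len long."""
    return examples_seen(batch_size, num_steps) * max_len


for name, batch_size, num_steps, max_len in STAGES:
    n = examples_seen(batch_size, num_steps)
    t = token_upper_bound(batch_size, num_steps, max_len)
    # For CPT an example may be a text pair; the bound then applies per side.
    print(f"{name}: {n:,} examples, at most {t:,} tokens")
```

Under these assumptions the long-context MLM-8192 stage sees far fewer examples than MLM-2048, while CPT sees the most examples at the shortest length.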
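The jump in `rope_base` from 10000 to 500000 between the two MLM stages is easier to read with the rotary-embedding frequencies written out. The sketch assumes the standard RoPE formulation, in which dimension pair `i` rotates with frequency `base ** (-2 * i / head_dim)`. The head dimension of 64 is an illustrative assumption, not a value from the README, and `rope_wavelengths` is a hypothetical helper name.

```python
import math


def rope_wavelengths(base, head_dim=64):
    """Wavelength, in token positions, of each rotary dimension pair."""
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    return [2 * math.pi / f for f in inv_freq]


for base in (10_000, 500_000):
    longest = rope_wavelengths(base)[-1]
    print(f"rope_base={base}: longest wavelength ~ {longest:,.0f} positions")
```

A larger base stretches the slowest rotations over more positions, which is the usual motivation for raising it when extending the context length.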