---
pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-large
license: apache-2.0
model-index:
- name: sentence-camembert-large by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: Text Similarity fr
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Test Pearson correlation coefficient
      type: Pearson_correlation_coefficient
      value: xx.xx
library_name: sentence-transformers
---
## Description:
[**Sentence-CamemBERT-Large**](https://huggingface.co/dangvantuan/sentence-camembert-large) is an embedding model for French developed by [La Javaness](https://www.lajavaness.com/). It represents the content and semantics of a French sentence as a numerical vector, capturing the meaning of the text beyond the individual words in queries and documents, which makes it well suited to semantic search.
## Pre-trained sentence embedding models are the state of the art for French sentence embeddings.
The model is fine-tuned from the pre-trained [facebook/camembert-large](https://huggingface.co/camembert/camembert-large) using [Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset.
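
To make the siamese setup more concrete, here is a minimal sketch of the regression objective commonly used when fine-tuning Sentence-BERT models on STS data: both sentences pass through the same encoder, and the cosine similarity of the two embeddings is pushed toward the gold score rescaled to the range 0 to 1. This is an assumption about the training recipe, not something this card states, and the 2-d vectors and the function names `cosine` and `cosine_regression_loss` are hypothetical stand-ins chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(pairs, gold_scores):
    """Mean squared error between cosine(u, v) and the gold score in [0, 1]."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)


# Toy 2-d vectors standing in for encoder outputs (assumption, not real embeddings).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold_scores = [5.0 / 5.0, 0.0 / 5.0]  # STS labels rescaled from 0..5 to 0..1
loss = cosine_regression_loss(pairs, gold_scores)
print(loss)
```

In this toy case the predicted similarities already match the labels, so the loss is zero; during real fine-tuning the encoder weights would be updated to reduce it.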
## Usage
The model can be used directly (without a language model) as follows:
```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("dangvantuan/sentence-camembert-large")
sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]
# Encode the sentences into one embedding vector per sentence
embeddings = model.encode(sentences)
```
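
Once sentences are encoded, semantic search amounts to ranking candidate embeddings by their cosine similarity to a query embedding. The sketch below illustrates that step in plain Python so it runs without downloading the model; the 3-d vectors are hypothetical stand-ins for the output of `model.encode(...)`, and `rank_by_similarity` is an illustrative helper, not part of the library. In practice the library's own similarity utilities (for example `sentence_transformers.util.cos_sim`) would likely be a better fit for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-d vectors standing in for sentence embeddings.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
print([index for index, _ in ranking])
```

The first index in the ranking is the document whose embedding points in the direction closest to the query.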
## Evaluation
The model can be evaluated on the French dev and test splits of stsb as follows:
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

# Load the model to evaluate
model = SentenceTransformer("dangvantuan/sentence-camembert-large")


def convert_dataset(dataset):
    """Convert STS rows into InputExample objects with labels scaled to 0 ... 1."""
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples


# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation
# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
**Test results**:
The performance is measured using Pearson and Spearman correlation (reported ×100):
- On dev

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large) | 88.2 | 88.02 | 336M |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73 | 86.54 | 110M |
| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16 | 135M |
| [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 85 | NaN | 175B |
| [GPT (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.75 | 80.44 | NaN |

- On test

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large) | 85.9 | 85.8 | 336M |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 81.64 | 110M |
| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.62 | 77.48 | 135M |
| [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 82 | NaN | 175B |
| [GPT (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.05 | 77.56 | NaN |
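
For readers unfamiliar with the two metrics, the sketch below shows how Pearson and Spearman correlation can be computed between predicted similarities and gold scores. It is a plain-Python illustration under stated assumptions: the `predicted` and `gold` lists are made-up values, not results from this model, and the helper names are hypothetical. The evaluator in `sentence-transformers` performs the equivalent computation internally on cosine similarities.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up values for illustration only.
predicted = [0.10, 0.40, 0.35, 0.80]  # hypothetical cosine similarities
gold = [0.0, 0.5, 0.4, 1.0]           # hypothetical gold scores in 0 ... 1

pearson_score = pearson(predicted, gold)
spearman_score = spearman(predicted, gold)
print(pearson_score, spearman_score)
```

Pearson rewards a linear relationship between predictions and labels, while Spearman only checks that the two orderings agree, which is why the two columns in the tables can differ slightly.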
## Citation
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  url={https://arxiv.org/abs/1908.10084},
  year={2019}
}
@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}