How to use mjbommar/ogbert-2m-sentence with sentence-transformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mjbommar/ogbert-2m-sentence")

    sentences = [
        "That is a happy person",
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)
    # [4, 4]

How to use mjbommar/ogbert-2m-sentence with Transformers:

    # Load model directly
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-2m-sentence")
    model = AutoModel.from_pretrained("mjbommar/ogbert-2m-sentence")
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -57,7 +57,7 @@ A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary a
 ## Training
 
 - **Pretraining**: Masked Language Modeling on domain-specific glossary corpus
-- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
+- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
 - **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance
 
 ## Performance
@@ -159,7 +159,16 @@ Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) in
 
 ## Citation
 
-
+If you use this model, please cite the OpenGloss dataset:
+
+```bibtex
+@article{bommarito2025opengloss,
+title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
+author={Bommarito II, Michael J.},
+journal={arXiv preprint arXiv:2511.18622},
+year={2025}
+}
+```
 
 ## License
 
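The README's key finding, that L2 normalization of embeddings is critical for clustering/retrieval performance, can be illustrated with a minimal sketch. It uses hypothetical toy vectors rather than real model outputs, and the helper names (`l2_normalize`, `dot`, `cosine_similarity`) are made up for illustration. It shows how a raw dot product can favor a vector with a large norm, while scoring on unit-length vectors depends on direction only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy "embeddings" (assumptions, not outputs of the model).
query = [3.0, 4.0]
doc_same_direction = [0.3, 0.4]   # same direction as the query, small norm
doc_large_norm = [10.0, 0.0]      # different direction, large norm

# Raw dot products: the large-norm document scores higher.
raw_scores = [dot(query, doc_same_direction), dot(query, doc_large_norm)]

# After L2 normalization the dot product equals cosine similarity,
# so the document pointing in the same direction scores higher.
q = l2_normalize(query)
norm_scores = [
    dot(q, l2_normalize(doc_same_direction)),
    dot(q, l2_normalize(doc_large_norm)),
]

print("raw:", raw_scores)
print("normalized:", norm_scores)
```

With sentence-transformers, the same effect is usually obtained by passing `normalize_embeddings=True` to `model.encode(...)`. Whether this model's saved pipeline already normalizes its outputs is not stated here, so normalizing explicitly is a cautious default.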