---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- modernbert
- embeddings
pipeline_tag: sentence-similarity
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-sentence
  results:
  - task:
      type: STS
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman_cosine
      value: 0.453
  - task:
      type: STS
    dataset:
      name: MTEB STS12
      type: mteb/sts12-sts
    metrics:
    - type: spearman_cosine
      value: 0.396
---
# OGBert-2M-Sentence
A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.
**Related models:**
- [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) - Base MLM model for fill-mask tasks
## Model Details
| Property | Value |
|----------|-------|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |
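As a rough sanity check on the table, the size of the token embedding matrix can be derived from the vocabulary and hidden size. This is a sketch that assumes a standard `vocab x hidden` embedding table and float32 storage for output vectors; the breakdown of the remaining parameters is not stated in this card.

```python
# Values taken from the Model Details table.
VOCAB_SIZE = 8_192
HIDDEN_SIZE = 128
TOTAL_PARAMS = 2_100_000  # "2.1M", rounded as reported

# Assumption: one token embedding table of shape (vocab, hidden).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS

# Assumption: each output embedding is stored as float32 (4 bytes per value).
bytes_per_embedding = HIDDEN_SIZE * 4

print(f"token embedding parameters: {embedding_params:,}")
print(f"share of reported total:    {embedding_share:.0%}")
print(f"bytes per stored embedding: {bytes_per_embedding}")
```

Under these assumptions, roughly half of the reported parameter count sits in the token embeddings, which is typical for very small encoders.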
## Training
- **Pretraining**: Masked language modeling (MLM) on a domain-specific glossary corpus
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance
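One way to see why L2 normalization matters for clustering and retrieval: once vectors have unit length, the dot product equals cosine similarity, and squared Euclidean distance becomes `2 - 2 * cosine`, so distance-based and cosine-based rankings agree. The sketch below checks this on toy vectors; the values and helper names are illustrative and are not taken from the model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors pointing in similar directions but with very different magnitudes.
a = [3.0, 4.0, 0.0]
b = [30.0, 40.0, 5.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
cos_ab = cosine(a, b)

# Unnormalized distance is dominated by the magnitude gap.
print("raw squared distance: ", squared_euclidean(a, b))
# After normalization, distance depends only on the angle between the vectors.
print("unit squared distance:", squared_euclidean(a_unit, b_unit))
print("2 - 2 * cosine:       ", 2 - 2 * cos_ab)
print("unit dot product:     ", dot(a_unit, b_unit), "cosine:", cos_ab)
```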
## Performance
### Semantic Textual Similarity (MTEB STS)
Spearman correlation between model similarity scores and human judgments on sentence pairs.
| Task | OGBert-2M | BERT-base | RoBERTa-base |
|------|----------:|----------:|-------------:|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | **0.396** | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| **Average** | **0.451** | 0.521 | 0.528 |
OGBert-2M reaches **87% of BERT-base** average STS performance with **52x fewer parameters**, and it outperforms both baselines on STS12.
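For readers unfamiliar with the metric, Spearman correlation compares the rank order of model similarity scores with the rank order of human judgments. The following is a minimal pure-Python sketch, not the MTEB implementation; the score lists are made-up illustration data.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities and human ratings for five sentence pairs.
model_scores = [0.91, 0.35, 0.62, 0.80, 0.15]
human_scores = [4.8, 1.0, 3.5, 3.0, 0.5]

rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Because only ranks matter, any monotonic rescaling of the model's similarity scores leaves the result unchanged.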
### Document Clustering (ARI)
Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.
| Model | Params | ARI |
|-------|--------|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.797** |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
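The Adjusted Rand Index (ARI) reported above compares predicted cluster assignments with true categories while correcting for chance agreement. The sketch below is a pure-Python version of the standard pair-counting formula; it is not the evaluation script used for this card, and the labels are toy data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 items)")

    # Count pairs of items that share a cell, a true class, or a predicted cluster.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_true + same_pred) / 2          # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (same_cell - expected) / (maximum - expected)

# Toy example: the second class is split across two predicted clusters.
true_categories = [0, 0, 1, 1]
predicted_clusters = [0, 0, 1, 2]
ari = adjusted_rand_index(true_categories, predicted_clusters)
print(f"ARI: {ari:.3f}")
```

Cluster IDs are arbitrary, so a prediction that matches the true grouping under a different numbering still scores 1.0.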
### Document Retrieval (MRR)
Mean Reciprocal Rank for same-category document retrieval.
| Model | Params | MRR | P@1 |
|-------|--------|-----|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.973** | **0.963** |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
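The exact retrieval protocol is not spelled out in this card. The sketch below assumes one plausible reading: each document is used as a query against all other documents, candidates are ranked by dot product of L2-normalized embeddings, and the first same-category hit determines the reciprocal rank. The embeddings and categories are toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieval_metrics(embeddings, categories):
    """Return (MRR, P@1) with every document queried against all the others."""
    reciprocal_ranks = []
    hits_at_1 = 0
    for q, query in enumerate(embeddings):
        # Rank every other document by similarity to the query, best first.
        candidates = [i for i in range(len(embeddings)) if i != q]
        candidates.sort(key=lambda i: dot(query, embeddings[i]), reverse=True)

        # Reciprocal rank of the first candidate sharing the query's category.
        rr = 0.0
        for rank, i in enumerate(candidates, start=1):
            if categories[i] == categories[q]:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
        hits_at_1 += int(categories[candidates[0]] == categories[q])

    n = len(embeddings)
    return sum(reciprocal_ranks) / n, hits_at_1 / n

# Toy unit-length embeddings: two documents per category.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_categories = ["finance", "finance", "medical", "medical"]

mrr, p_at_1 = retrieval_metrics(toy_embeddings, toy_categories)
print(f"MRR: {mrr:.3f}  P@1: {p_at_1:.3f}")
```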
### Summary vs Baselines
At 1/50th the size, OGBert-2M-Sentence achieves:
- **87%** of BERT-base STS (with STS12 win)
- **89%** of BERT-base clustering (ARI)
- **98%** of BERT-base retrieval (MRR)
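The percentages above follow directly from the tables in this card. A short sketch of the arithmetic, with scores copied from those tables and the parameter counts as listed (2.1M vs 110M):

```python
# (OGBert-2M-Sentence score, BERT-base score), copied from the tables above.
scores = {
    "STS (average Spearman)": (0.451, 0.521),
    "Clustering (ARI)": (0.797, 0.896),
    "Retrieval (MRR)": (0.973, 0.994),
}

# Relative performance as a rounded percentage of BERT-base.
relative = {name: round(100 * ours / bert) for name, (ours, bert) in scores.items()}
for name, pct in relative.items():
    print(f"{name}: {pct}% of BERT-base")

# Parameter ratio, in millions of parameters.
param_ratio = 110 / 2.1
print(f"BERT-base is about {param_ratio:.0f}x larger")
```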
## Usage
### Sentence-Transformers (Recommended)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here']) # L2 normalized by default
```
**Example - Domain Similarity:**
```python
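# Reuses `model` from the snippet above.
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
# Pairwise similarity matrix, shape [3, 3]
similarities = model.similarity(embeddings, embeddings)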
```
| Pair | Similarity |
|------|------------|
| Financial [0] ↔ Financial [1] | **0.915** |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |
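The model assigns the highest similarity to the within-domain financial pair (0.915). The cross-domain scores are still high (0.808 to 0.874), so the relative ordering of scores is more informative than their absolute values.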
### Direct Transformers Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).float()  # [batch, seq, 1]
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
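To make the pooling step concrete, here is a small pure-Python sketch of masked mean pooling followed by L2 normalization on a toy sequence. It mirrors the tensor operations above on plain lists; the token vectors are invented and the helper names are illustrative.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for d in range(dim):
                total[d] += vec[d]
    count = sum(attention_mask)  # number of real (non-padding) tokens
    return [t / count for t in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three real tokens followed by one padding position with arbitrary values.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [99.0, -99.0]]
attention_mask = [1, 1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
embedding = l2_normalize(pooled)
print("pooled:   ", pooled)
print("embedding:", embedding)
```

Without the mask, the padding vector would be averaged in and distort the embedding, which is why the attention mask appears in both the numerator and the denominator.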
### For Fill-Mask Tasks
Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) instead.
## Citation
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
author={Bommarito II, Michael J.},
journal={arXiv preprint arXiv:2511.18622},
year={2025}
}
```
## License
Apache 2.0