---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- modernbert
- embeddings
pipeline_tag: sentence-similarity
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-sentence
  results:
  - task:
      type: STS
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman_cosine
      value: 0.453
  - task:
      type: STS
    dataset:
      name: MTEB STS12
      type: mteb/sts12-sts
    metrics:
    - type: spearman_cosine
      value: 0.396
---
# OGBert-2M-Sentence
A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.
**Related models:**
- [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) - Base MLM model for fill-mask tasks
## Model Details
| Property | Value |
|----------|-------|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |
## Training
- **Pretraining**: Masked Language Modeling on a domain-specific glossary corpus
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance
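One way to see why normalization matters for clustering and retrieval: once vectors have unit length, the dot product equals cosine similarity and squared Euclidean distance equals `2 - 2 * cosine`, so distance-based methods compare direction rather than magnitude. The following is a minimal NumPy sketch on made-up vectors, not real model outputs; `l2_normalize` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for illustration)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy vectors, not model outputs: b has the same direction as a but is 10x longer.
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[10.0, 20.0, 30.0]])
c = np.array([[3.0, 2.0, 1.0]])

# Before normalization, Euclidean distance puts a closer to c than to b,
# even though a and b point in exactly the same direction.
raw_ab = float(np.linalg.norm(a - b))
raw_ac = float(np.linalg.norm(a - c))

a_n, b_n, c_n = l2_normalize(a), l2_normalize(b), l2_normalize(c)

# After normalization, a and b coincide, and distance is a function of cosine only.
unit_ab = float(np.linalg.norm(a_n - b_n))
unit_ac = float(np.linalg.norm(a_n - c_n))
cos_ac = float(np.sum(a_n * c_n))  # dot product of unit vectors = cosine similarity

print(f"raw:  d(a,b)={raw_ab:.3f}  d(a,c)={raw_ac:.3f}")
print(f"unit: d(a,b)={unit_ab:.3f}  d(a,c)={unit_ac:.3f}  cos(a,c)={cos_ac:.3f}")
```

With unit-length embeddings, nearest-neighbor search by dot product, by cosine similarity, and by Euclidean distance all return the same ranking.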
## Performance
### Semantic Textual Similarity (MTEB STS)
Spearman correlation between model similarity scores and human judgments on sentence pairs.
| Task | OGBert-2M | BERT-base | RoBERTa-base |
|------|----------:|----------:|-------------:|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | **0.396** | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| **Average** | **0.451** | 0.521 | 0.528 |
OGBert-2M achieves **87% of BERT-base** average STS performance with **52x fewer parameters**, and it outperforms both baselines on STS12.
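For readers unfamiliar with the metric, here is a small sketch of how a Spearman score of this kind can be computed: rank the model's cosine similarities, rank the human scores, and take the Pearson correlation of the two rank lists. The similarity and score values below are invented for illustration, and the MTEB harness may differ in details such as tie handling.

```python
import numpy as np

def rankdata(values: np.ndarray) -> np.ndarray:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx = rankdata(np.asarray(x, dtype=float))
    ry = rankdata(np.asarray(y, dtype=float))
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical cosine similarities for five sentence pairs and
# hypothetical human similarity judgments for the same pairs.
model_sims = [0.91, 0.35, 0.62, 0.10, 0.77]
human_scores = [4.8, 2.5, 2.0, 0.5, 4.0]

rho = spearman(model_sims, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Because only the ordering matters, any monotonic rescaling of the similarity scores leaves the result unchanged.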
### Document Clustering (ARI)
Adjusted Rand Index (ARI) on 80 domain-specific documents across 10 categories, clustered with Spherical KMeans.
| Model | Params | ARI |
|-------|--------|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.797** |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
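The evaluation script is not part of this card, so the following is only a sketch of the general technique under stated assumptions: a simple spherical k-means (cosine assignment, centroids re-normalized each step) followed by an Adjusted Rand Index computed from pair counts. The function names, the deterministic farthest-point initialization, and the synthetic data are all illustrative choices, not the setup behind the numbers above.

```python
from collections import Counter
from math import comb

import numpy as np

def spherical_kmeans(x: np.ndarray, k: int, n_iter: int = 20) -> np.ndarray:
    """Cluster rows of x by cosine similarity, keeping centroids on the unit sphere."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [x[0]]
    for _ in range(1, k):
        best_sim = np.max(x @ np.array(centroids).T, axis=1)
        centroids.append(x[int(np.argmin(best_sim))])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.argmax(x @ centroids.T, axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):
                mean = members.sum(axis=0)
                centroids[c] = mean / np.linalg.norm(mean)
    return np.argmax(x @ centroids.T, axis=1)

def adjusted_rand_index(labels_true, labels_pred) -> float:
    """ARI from the contingency counts of two labelings."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Synthetic stand-in for document embeddings: three well-separated groups.
rng = np.random.default_rng(0)
true_labels = [0] * 5 + [1] * 5 + [2] * 5
points = np.eye(3)[true_labels] + 0.05 * rng.normal(size=(15, 3))

pred_labels = spherical_kmeans(points, k=3)
ari = adjusted_rand_index(true_labels, pred_labels.tolist())
print(f"ARI: {ari:.3f}")
```

ARI is invariant to how cluster ids are numbered, so the predicted labels do not need to be aligned with the category labels before scoring.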
### Document Retrieval (MRR)
Mean Reciprocal Rank (MRR) and precision at rank 1 (P@1) for same-category document retrieval.
| Model | Params | MRR | P@1 |
|-------|--------|-----|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.973** | **0.963** |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
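As a rough illustration of these two metrics, the sketch below assumes one plausible protocol: each document is used as a query, all other documents are ranked by cosine similarity, and a hit is any document from the same category. The exact protocol behind the table is not described in this card, and `mrr_and_p_at_1` and the toy vectors are hypothetical.

```python
import numpy as np

def mrr_and_p_at_1(embeddings, categories):
    """Return (MRR, P@1) for same-category retrieval with each document as a query."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # cosine via dot product
    cats = np.asarray(categories)
    sims = x @ x.T
    reciprocal_ranks, top1_hits = [], []
    for i in range(len(x)):
        order = np.argsort(-sims[i], kind="stable")
        order = order[order != i]  # never retrieve the query itself
        relevant = cats[order] == cats[i]
        if not relevant.any():
            continue  # no same-category document exists for this query
        first = int(np.argmax(relevant))  # 0-based position of the first hit
        reciprocal_ranks.append(1.0 / (first + 1))
        top1_hits.append(float(first == 0))
    return float(np.mean(reciprocal_ranks)), float(np.mean(top1_hits))

# Toy 2-D "embeddings" (not model outputs) for two categories.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
cats = ["A", "A", "B", "B"]

mrr, p_at_1 = mrr_and_p_at_1(docs, cats)
print(f"MRR={mrr:.3f}  P@1={p_at_1:.3f}")
```

In this toy set the last document sits closer to category A than to its own category, so its first same-category hit appears at rank 3 and pulls both metrics below 1.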
### Summary vs Baselines
At roughly 1/50th the size of BERT-base, OGBert-2M-Sentence achieves:
- **87%** of BERT-base average STS (and beats it on STS12)
- **89%** of BERT-base clustering (ARI)
- **98%** of BERT-base retrieval (MRR)
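The percentages above can be reproduced directly from the tables in this card. The short script below copies the reported values and recomputes the ratios; the variable names are illustrative only.

```python
# (OGBert-2M, BERT-base) Spearman scores copied from the STS table above.
sts = {
    "STSBenchmark": (0.453, 0.473),
    "BIOSSES": (0.489, 0.547),
    "STS12": (0.396, 0.309),
    "STS13": (0.460, 0.599),
    "STS14": (0.388, 0.477),
    "STS15": (0.500, 0.603),
    "STS16": (0.474, 0.637),
}

ogbert_avg = sum(ogbert for ogbert, _ in sts.values()) / len(sts)
bert_avg = sum(bert for _, bert in sts.values()) / len(sts)

sts_ratio = ogbert_avg / bert_avg  # average STS relative to BERT-base
ari_ratio = 0.797 / 0.896          # clustering ARI relative to BERT-base
mrr_ratio = 0.973 / 0.994          # retrieval MRR relative to BERT-base
param_ratio = 110 / 2.1            # BERT-base parameters (M) over OGBert-2M parameters (M)

print(f"STS averages: OGBert-2M={ogbert_avg:.3f}, BERT-base={bert_avg:.3f}")
print(f"STS {sts_ratio:.0%}, ARI {ari_ratio:.0%}, MRR {mrr_ratio:.0%}, {param_ratio:.0f}x fewer parameters")
```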
## Usage
### Sentence-Transformers (Recommended)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here']) # L2 normalized by default
```
### Direct Transformers Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')
model.eval()  # disable dropout for inference

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # no gradients needed when only computing embeddings
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embeddings = F.normalize(pooled, p=2, dim=1)
```
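The attention mask in the pooling step matters once inputs are batched, because shorter texts are padded and the padded positions should not contribute to the average. The NumPy sketch below mirrors the masked mean pooling and normalization on toy arrays (not real hidden states) to show that the result ignores whatever sits in padded positions; `masked_mean_pool` is a hypothetical helper.

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool (batch, seq, dim) states over positions where mask == 1, then L2-normalize."""
    m = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * m).sum(axis=1)             # padded positions contribute zero
    counts = np.clip(m.sum(axis=1), 1e-9, None)   # number of real tokens per row
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# One toy sequence with two real tokens and one padded position.
mask = np.array([[1, 1, 0]])
hidden = np.array([[[1.0, 0.0], [0.0, 1.0], [9.0, -3.0]]])
hidden_other_pad = np.array([[[1.0, 0.0], [0.0, 1.0], [-5.0, 7.0]]])

emb = masked_mean_pool(hidden, mask)
emb_other_pad = masked_mean_pool(hidden_other_pad, mask)

# An unmasked mean would mix the padded position into the embedding.
naive = hidden.mean(axis=1)
naive = naive / np.linalg.norm(naive, axis=1, keepdims=True)

print("masked:", emb)
print("naive: ", naive)
```

The masked result is the same for both padded inputs, while the naive mean changes direction depending on the padding content.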
### For Fill-Mask Tasks
Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) instead.
## Citation
Forthcoming research. Contact authors for details.
## License
Apache 2.0