---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- modernbert
- embeddings
pipeline_tag: sentence-similarity
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-sentence
  results:
  - task:
      type: STS
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
    metrics:
    - type: spearman_cosine
      value: 0.453
  - task:
      type: STS
    dataset:
      name: MTEB STS12
      type: mteb/sts12-sts
    metrics:
    - type: spearman_cosine
      value: 0.396
---

# OGBert-2M-Sentence

A tiny (2.1M-parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) - Base MLM model for fill-mask tasks

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence length | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |

## Training

- **Pretraining**: Masked Language Modeling on a domain-specific glossary corpus
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance
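
One plausible way to see why normalization could matter for distance-based clustering and retrieval: once vectors have unit length, the dot product equals cosine similarity and squared Euclidean distance equals `2 - 2 * cosine`, so vector length no longer influences comparisons. The sketch below illustrates this on made-up toy vectors. The helper names are hypothetical, and this is not the model's own code, nor necessarily the mechanism behind the finding above.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (made up): a and b point the same way but differ 10x in length.
a = [3.0, 4.0]
b = [30.0, 40.0]
c = [4.0, 3.0]

a_n, b_n, c_n = (l2_normalize(v) for v in (a, b, c))

# Before normalization, Euclidean distance is dominated by vector length:
# a looks far from b (same direction) and close to c (different direction).
raw_dist_ab = math.dist(a, b)
raw_dist_ac = math.dist(a, c)

# After normalization, only direction matters.
norm_dist_ab = math.dist(a_n, b_n)
norm_sq_dist_ac = math.dist(a_n, c_n) ** 2
```

With unit-length vectors, Euclidean KMeans and cosine-based ranking order neighbors consistently, which is the property Spherical KMeans relies on.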

## Performance

### Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.

| Task | OGBert-2M | BERT-base | RoBERTa-base |
|------|----------:|----------:|-------------:|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | **0.396** | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| **Average** | **0.451** | 0.521 | 0.528 |

OGBert-2M achieves **87% of BERT-base** average STS performance with **52x fewer parameters**, and it outperforms both baselines on STS12.
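
For readers unfamiliar with the metric in the table above, here is a minimal pure-Python sketch of Spearman rank correlation: rank both score lists (averaging ranks for ties), then take the Pearson correlation of the ranks. The score lists are invented for illustration and are not MTEB data; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities from a model and human ratings for 5 pairs.
model_scores = [0.91, 0.35, 0.62, 0.10, 0.75]
human_scores = [4.8, 2.5, 2.0, 0.5, 4.0]
rho = spearman(model_scores, human_scores)
```

Because only ranks are compared, the metric is insensitive to the scale of the similarity scores, which is why cosine values in [-1, 1] can be compared directly against human ratings on a different scale.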

### Document Clustering (ARI)

Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.

| Model | Params | ARI |
|-------|--------|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.797** |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
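
As a rough illustration of the ARI metric reported above, the sketch below computes the Adjusted Rand Index from two label lists using the standard pair-counting definition (the same quantity that `sklearn.metrics.adjusted_rand_score` returns). The labels are toy values, not the 80-document evaluation set, and the clustering step itself is not shown.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs that fall in the same cell of the contingency table.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together by each labeling on its own.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (same_both - expected) / (maximum - expected)

# Toy example: cluster IDs are arbitrary, so a relabeled perfect clustering scores 1.0.
categories = ["law", "law", "med", "med"]
perfect = [1, 1, 0, 0]
scrambled = [0, 1, 0, 1]

ari_perfect = adjusted_rand_index(categories, perfect)
ari_scrambled = adjusted_rand_index(categories, scrambled)
```

ARI is 1.0 for a perfect match, near 0 for random assignments, and can be negative when agreement is worse than chance.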

### Document Retrieval (MRR)

Mean Reciprocal Rank for same-category document retrieval.

| Model | Params | MRR | P@1 |
|-------|--------|-----|-----|
| **OGBert-2M-Sentence** | **2.1M** | **0.973** | **0.963** |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
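
The sketch below shows one way MRR and P@1 for same-category retrieval could be computed. It assumes a leave-one-out protocol (each document is a query against all others, ranked by dot product of unit-norm embeddings); the exact protocol behind the numbers above is not specified here, so treat this as an assumption. The embeddings and categories are toy values and the function name is hypothetical.

```python
def retrieval_metrics(embeddings, categories):
    """Return (MRR, P@1) using each document as a query against all others.

    Assumes embeddings are L2-normalized, so dot product equals cosine similarity.
    Queries with no other document in the same category are skipped.
    """
    reciprocal_rank_sum = 0.0
    top1_hits = 0
    num_queries = 0
    for q, q_vec in enumerate(embeddings):
        # Rank every other document by descending similarity to the query.
        candidates = [i for i in range(len(embeddings)) if i != q]
        ranked = sorted(
            candidates,
            key=lambda i: -sum(a * b for a, b in zip(q_vec, embeddings[i])),
        )
        # Rank (1-based) of the first document sharing the query's category.
        first_hit = next(
            (rank for rank, i in enumerate(ranked, start=1)
             if categories[i] == categories[q]),
            None,
        )
        if first_hit is None:
            continue
        num_queries += 1
        reciprocal_rank_sum += 1.0 / first_hit
        top1_hits += int(first_hit == 1)
    if num_queries == 0:
        return 0.0, 0.0
    return reciprocal_rank_sum / num_queries, top1_hits / num_queries

# Toy unit-norm embeddings in 2D with two categories.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_categories = ["a", "a", "b", "b"]
mrr, p_at_1 = retrieval_metrics(toy_embeddings, toy_categories)
```

P@1 only counts queries whose top result is relevant, while MRR also gives partial credit when the first relevant document appears lower in the ranking.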

### Summary vs Baselines

At roughly 1/50th the size of BERT-base, OGBert-2M-Sentence achieves:
- **87%** of BERT-base STS (with a win on STS12)
- **89%** of BERT-base clustering (ARI)
- **98%** of BERT-base retrieval (MRR)
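
The percentages above follow directly from the tables in this card. A short check of the arithmetic, using only values reported here (average STS, ARI, MRR, and the parameter counts):

```python
# Values copied from the tables in this model card.
ogbert = {"sts_avg": 0.451, "ari": 0.797, "mrr": 0.973}
bert_base = {"sts_avg": 0.521, "ari": 0.896, "mrr": 0.994}

# Fraction of BERT-base performance retained on each metric.
relative = {name: ogbert[name] / bert_base[name] for name in ogbert}

# Parameter ratio: 110M (BERT-base) vs 2.1M (OGBert-2M-Sentence).
param_ratio = 110 / 2.1

for name, value in relative.items():
    print(f"{name}: {value:.1%} of BERT-base")
print(f"parameter ratio: {param_ratio:.1f}x")
```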

## Usage

### Sentence-Transformers (Recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```

### Direct Transformers Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Mean pooling over non-padding tokens + L2 normalize (critical for performance)
hidden = outputs.last_hidden_state
mask = inputs['attention_mask'].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # guard against empty masks
embeddings = F.normalize(pooled, p=2, dim=1)
```
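
To make the pooling step concrete without any model download, here is a small pure-Python sketch of masked mean pooling followed by L2 normalization on made-up token vectors. It is meant to show why the attention mask matters: padded positions are excluded from the average. The function names are hypothetical and the numbers are illustrative only.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padded positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    count = max(len(kept), 1)  # avoid division by zero for an all-padding input
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

# Two real tokens followed by one padding token with arbitrary values.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
embedding = l2_normalize(pooled)
```

Without the mask, the padding vector would be averaged in and shift the result, which matters when batching texts of different lengths.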

### For Fill-Mask Tasks

Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) instead.

## Citation

Forthcoming research. Contact authors for details.

## License

Apache 2.0