metadata
language:
  - en
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - modernbert
  - embeddings
pipeline_tag: sentence-similarity
datasets:
  - mjbommar/ogbert-v1-mlm
model-index:
  - name: ogbert-2m-sentence
    results:
      - task:
          type: STS
        dataset:
          name: MTEB STSBenchmark
          type: mteb/stsbenchmark-sts
        metrics:
          - type: spearman_cosine
            value: 0.453
      - task:
          type: STS
        dataset:
          name: MTEB STS12
          type: mteb/sts12-sts
        metrics:
          - type: spearman_cosine
            value: 0.396

OGBert-2M-Sentence

A tiny (2.1M-parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

Related models: mjbommar/ogbert-2m-base (use it for fill-mask tasks)

Model Details

| Property | Value |
| --- | --- |
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |

Training

  • Pretraining: Masked Language Modeling on a domain-specific glossary corpus
  • Dataset: mjbommar/ogbert-v1-mlm
  • Key finding: L2 normalization of embeddings is critical for clustering/retrieval performance
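To illustrate why L2 normalization can matter for retrieval and clustering, here is a minimal sketch on toy vectors. The three vectors are hypothetical values, not outputs of this model. Without normalization, a dot product rewards large norms; after normalization it equals cosine similarity and depends only on direction.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" (hypothetical values chosen for illustration only).
query = [1.0, 0.0, 0.0]
same_direction = [0.5, 0.0, 0.0]    # points the same way as the query, small norm
other_direction = [3.0, 4.0, 0.0]   # points elsewhere, large norm

# Raw dot products: the large-norm vector scores higher despite its direction.
raw_scores = (dot(query, same_direction), dot(query, other_direction))

# After L2 normalization the dot product is the cosine similarity.
q = l2_normalize(query)
norm_scores = (dot(q, l2_normalize(same_direction)),
               dot(q, l2_normalize(other_direction)))

print(raw_scores, norm_scores)
```

In this toy case the ranking flips once the vectors are normalized, which is the effect the key finding refers to.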

Performance

Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.
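As a rough sketch of how this metric is computed, the snippet below implements Spearman correlation as the Pearson correlation of ranks, with ties given average ranks. The human scores and model similarities are made-up numbers, and the helper names are illustrative; the actual evaluation is assumed to use the MTEB tooling rather than this code.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold similarity ratings and model cosine similarities for 5 pairs.
human_scores = [4.8, 1.2, 3.5, 0.4, 2.9]
model_cosine = [0.91, 0.35, 0.62, 0.40, 0.70]
rho = spearman(human_scores, model_cosine)
print(round(rho, 3))
```

Because only ranks matter, any monotonic rescaling of the model's similarity scores leaves the result unchanged.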

| Task | OGBert-2M | BERT-base | RoBERTa-base |
| --- | --- | --- | --- |
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | 0.396 | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| Average | 0.451 | 0.521 | 0.528 |

OGBert-2M achieves 87% of BERT-base's average STS score (0.451 vs. 0.521) with about 52x fewer parameters (2.1M vs. 110M). It outperforms both baselines on STS12.
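The Average row and the STS12 result can be re-derived from the per-task scores in the table. The sketch below only copies those reported numbers into a dictionary; nothing here is measured independently.

```python
# Per-task Spearman scores copied from the table: (OGBert-2M, BERT-base, RoBERTa-base).
sts_scores = {
    "STSBenchmark": (0.453, 0.473, 0.545),
    "BIOSSES": (0.489, 0.547, 0.582),
    "STS12": (0.396, 0.309, 0.321),
    "STS13": (0.460, 0.599, 0.563),
    "STS14": (0.388, 0.477, 0.452),
    "STS15": (0.500, 0.603, 0.613),
    "STS16": (0.474, 0.637, 0.620),
}
models = ("OGBert-2M", "BERT-base", "RoBERTa-base")

# Unweighted mean over the seven tasks, rounded like the table.
averages = {
    name: round(sum(row[i] for row in sts_scores.values()) / len(sts_scores), 3)
    for i, name in enumerate(models)
}

# Tasks where OGBert-2M beats both baselines.
wins = [task for task, (og, bert, roberta) in sts_scores.items()
        if og > max(bert, roberta)]

print(averages, wins)
```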

Document Clustering (ARI)

Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.

| Model | Params | ARI |
| --- | --- | --- |
| OGBert-2M-Sentence | 2.1M | 0.797 |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
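For readers unfamiliar with this setup, here is a minimal sketch of spherical k-means followed by an Adjusted Rand Index (ARI) computation. It is a generic illustration, not the evaluation code behind the table: the toy 2-D points, the fixed initial centroids, and the function names are all assumptions.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def spherical_kmeans(points, init_centroids, n_iter=10):
    """K-means on the unit sphere: assign by cosine similarity, renormalize centroids."""
    unit = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: pick the centroid with the largest dot product.
        labels = [
            max(range(len(centroids)),
                key=lambda k: sum(a * b for a, b in zip(p, centroids[k])))
            for p in unit
        ]
        # Update step: mean of the members, projected back onto the unit sphere.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(unit, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    return labels

def adjusted_rand_index(true_labels, pred_labels):
    """ARI: pair-counting agreement between two labelings, corrected for chance."""
    total_pairs = math.comb(len(true_labels), 2)
    if total_pairs == 0:  # fewer than two items: nothing to compare
        return 1.0
    index = sum(math.comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(math.comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(math.comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)

# Toy 2-D "documents" in two categories (hypothetical values, norms vary on purpose).
points = [[1.0, 0.1], [2.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 3.0], [0.2, 0.9]]
categories = ["a", "a", "a", "b", "b", "b"]
labels = spherical_kmeans(points, init_centroids=[points[0], points[3]])
ari = adjusted_rand_index(categories, labels)
print(labels, ari)
```

ARI is invariant to how clusters are numbered, so the cluster IDs do not need to match the category names. A real run would also need a seeding strategy such as random restarts, which this sketch skips by fixing the initial centroids.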

Document Retrieval (MRR)

Mean Reciprocal Rank for same-category document retrieval.

| Model | Params | MRR | P@1 |
| --- | --- | --- | --- |
| OGBert-2M-Sentence | 2.1M | 0.973 | 0.963 |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
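A minimal sketch of how MRR and P@1 can be computed for same-category retrieval follows. It assumes each document is used as a query against all other documents and that results are ranked by dot product of unit-length embeddings; the exact protocol behind the table is not specified here, and the four toy vectors are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def reciprocal_rank(query_idx, embeddings, categories):
    """1 / rank of the first same-category document, excluding the query itself."""
    scored = [(dot(embeddings[query_idx], emb), i)
              for i, emb in enumerate(embeddings) if i != query_idx]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    for rank, (_, doc_idx) in enumerate(ranked, start=1):
        if categories[doc_idx] == categories[query_idx]:
            return 1.0 / rank
    return 0.0  # no same-category document exists

def mrr_and_p_at_1(embeddings, categories):
    """Mean reciprocal rank and fraction of queries with a correct top-1 hit."""
    rr = [reciprocal_rank(q, embeddings, categories) for q in range(len(embeddings))]
    mrr = sum(rr) / len(rr)
    p_at_1 = sum(1 for r in rr if r == 1.0) / len(rr)
    return mrr, p_at_1

# Toy unit-length embeddings with category labels (hypothetical values).
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
categories = ["A", "A", "B", "B"]
mrr, p_at_1 = mrr_and_p_at_1(embeddings, categories)
print(mrr, p_at_1)
```

In this toy set, two of the four queries retrieve a same-category document first and the other two retrieve it second, so MRR is 0.75 and P@1 is 0.5.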

Summary vs Baselines

At roughly 1/50th the size of BERT-base, OGBert-2M-Sentence achieves:

  • 87% of BERT-base STS (with STS12 win)
  • 89% of BERT-base clustering (ARI)
  • 98% of BERT-base retrieval (MRR)
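The percentages above follow directly from the reported scores. A short check, using only numbers already stated in this card:

```python
# (OGBert-2M-Sentence, BERT-base) scores as reported in the sections above.
reported = {
    "STS (average Spearman)": (0.451, 0.521),
    "Clustering (ARI)": (0.797, 0.896),
    "Retrieval (MRR)": (0.973, 0.994),
}

# Score relative to BERT-base, as a whole-number percentage.
relative = {name: round(100 * ours / base) for name, (ours, base) in reported.items()}

# Parameter ratio: BERT-base (110M) vs. OGBert-2M-Sentence (2.1M).
param_ratio = 110 / 2.1

print(relative, round(param_ratio, 1))
```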

Usage

Sentence-Transformers (Recommended)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default

Direct Transformers Usage

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')
model.eval()  # inference mode: disables dropout

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)

# Mean pooling over non-padding tokens, then L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embeddings = F.normalize(pooled, p=2, dim=1)
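To show what the attention mask does in the pooling step, here is a dependency-free sketch on toy values. The token vectors are made up, and the helper name `masked_mean_pool` is illustrative rather than part of any library.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]  # max() guards an all-padding input

# Two real tokens followed by one padding position (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # wrongly includes padding
print(pooled, naive)
```

The naive average is dominated by the padding position, which is why the mask is applied before the sum and used again as the divisor.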

For Fill-Mask Tasks

Use mjbommar/ogbert-2m-base instead.

Citation

Forthcoming research. Contact authors for details.

License

Apache 2.0