mjbommar/ogbert-2m-sentence


mjbommar committed on Dec 11, 2025

Commit d866226 (verified) · 1 Parent(s): 2121873

Upload README.md with huggingface_hub

Files changed (1)

README.md +11 -2


@@ -57,7 +57,7 @@ A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary a
 ## Training
 - **Pretraining**: Masked Language Modeling on domain-specific glossary corpus
-- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
+- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
 - **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance
 ## Performance
@@ -159,7 +159,16 @@ Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) in
 ## Citation
-Forthcoming research. Contact authors for details.
+If you use this model, please cite the OpenGloss dataset:
+```bibtex
+@article{bommarito2025opengloss,
+  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
+  author={Bommarito II, Michael J.},
+  journal={arXiv preprint arXiv:2511.18622},
+  year={2025}
+}
+```
 ## License
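
The README's "Key finding" line says that L2 normalization of embeddings is critical for clustering/retrieval performance. The sketch below illustrates why that can matter, using toy 3-dimensional vectors as stand-ins for real model output. The helper names `l2_normalize` and `dot` and all vector values are hypothetical and do not come from the model card. The sketch assumes similarity is scored with a dot product, which becomes cosine similarity once both vectors have unit length.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; eps guards against an all-zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for embeddings (hypothetical values, not model output).
query = [3.0, 4.0, 0.0]
same_direction = [30.0, 40.0, 0.0]   # same direction as query, 10x the magnitude
other_direction = [0.0, 5.0, 12.0]   # different direction

# Raw dot products mix direction with magnitude.
raw_same = dot(query, same_direction)
raw_other = dot(query, other_direction)

# After L2 normalization the dot product depends on direction only.
q = l2_normalize(query)
cos_same = dot(q, l2_normalize(same_direction))
cos_other = dot(q, l2_normalize(other_direction))

print(f"raw:        same={raw_same:.1f} other={raw_other:.1f}")
print(f"normalized: same={cos_same:.4f} other={cos_other:.4f}")
```

With the sentence-transformers library, passing `normalize_embeddings=True` to `SentenceTransformer.encode` should have the same effect on real embeddings. Whether this model's pipeline already normalizes its output is not stated in this diff.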