mjbommar committed on
Commit d866226 · verified · 1 Parent(s): 2121873

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +11 -2
README.md CHANGED
@@ -57,7 +57,7 @@ A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary a
  ## Training

  - **Pretraining**: Masked Language Modeling on domain-specific glossary corpus
- - **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
+ - **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
  - **Key finding**: L2 normalization of embeddings is critical for clustering/retrieval performance

  ## Performance
@@ -159,7 +159,16 @@ Use [mjbommar/ogbert-2m-base](https://huggingface.co/mjbommar/ogbert-2m-base) in

  ## Citation

- Forthcoming research. Contact authors for details.
+ If you use this model, please cite the OpenGloss dataset:
+
+ ```bibtex
+ @article{bommarito2025opengloss,
+ title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
+ author={Bommarito II, Michael J.},
+ journal={arXiv preprint arXiv:2511.18622},
+ year={2025}
+ }
+ ```

  ## License
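
The README's key finding is that L2 normalization of embeddings matters for clustering and retrieval. The diff does not show how the model applies it, so the following is only a minimal sketch of the general technique, using the standard library alone. The function names `l2_normalize` and `dot` and the toy vectors are illustrative assumptions, not part of the model card or real model output.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings standing in for model output (assumption).
# raw[0] and raw[1] point in the same direction but differ in magnitude.
raw = [[3.0, 4.0], [6.0, 8.0], [0.5, 0.0]]
unit = [l2_normalize(v) for v in raw]

# Before normalization, dot-product scores are dominated by vector length.
raw_scores = [dot(raw[0], v) for v in raw]

# After normalization, the dot product equals cosine similarity,
# so scores depend only on direction.
unit_scores = [dot(unit[0], v) for v in unit]
```

With unit-length vectors, a dot-product index and a cosine-similarity index rank neighbors identically, which is one plausible reason normalization helps retrieval and clustering here.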