la_vectors_floret_md
Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).
Part of the LatinCy project: pretrained NLP pipelines for Latin.
Overview
| Feature | Value |
|---|---|
| Type | Floret (hash-based subword embeddings) |
| Dimensions | 300 |
| Hash buckets | 50,000 |
| Algorithm | CBOW |
| Language | Latin (la) |
| spaCy version | >=3.8.11,<3.9.0 |
| License | MIT |
Floret vectors use hash-based subword embeddings, so every word gets a vector and there are no out-of-vocabulary words. This is especially important for a morphologically rich language like Latin, where a single lemma surfaces in many inflected forms.
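To make the idea concrete, here is a minimal sketch of how a hash-based subword table can produce a vector for any string. It is an illustration only: the `<`/`>` boundary markers follow the fastText convention and the 3 to 6 n-gram range and bucket count mirror the parameters on this card, but the MD5-based bucket function, the random table, the tiny dimension, and all helper names are assumptions, not Floret's actual hashing scheme or the trained weights.

```python
import hashlib
import numpy as np

N_BUCKETS = 50_000   # matches the bucket count on this card
DIM = 8              # tiny dimension for illustration (the real vectors use 300)

# Random stand-in for a trained embedding table (assumption, not real weights).
rng = np.random.default_rng(0)
table = rng.normal(size=(N_BUCKETS, DIM)).astype(np.float32)

def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def bucket(ngram):
    """Map an n-gram to a table row with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % N_BUCKETS

def subword_vector(word):
    """Average the rows of all n-grams; works for any non-empty string."""
    rows = [bucket(g) for g in char_ngrams(word)]
    return table[rows].mean(axis=0)

# Inflected forms share n-grams such as "<re" and "reg", so they share rows.
shared = set(char_ngrams("regit")) & set(char_ngrams("regunt"))
print(sorted(shared))
print(subword_vector("regebantur").shape)
```

Because the vector is assembled from hashed n-grams rather than looked up by whole word, an unseen inflected form still lands near its relatives through the rows they share.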
Installation
pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.10.0-py3-none-any.whl
Usage
import spacy
nlp = spacy.load("la_vectors_floret_md")
# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])
# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))
These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
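By default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, and a `Doc` vector defaults to the average of its token vectors. The sketch below reproduces that calculation with NumPy on made-up toy vectors; the numbers are illustrative and not taken from this model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

# Toy token vectors standing in for token.vector (illustrative values only).
doc1_tokens = np.array([[1.0, 0.0, 1.0]])
doc2_tokens = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# A document vector is the mean of its token vectors.
doc1_vec = doc1_tokens.mean(axis=0)
doc2_vec = doc2_tokens.mean(axis=0)

print(cosine(doc1_vec, doc2_vec))
```

Cosine similarity ignores vector length, so the one-token and two-token toy documents compare by direction only.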
Evaluation
All methods were trained on the same v3.10 corpus and scored on the same benchmark (v0.2: 1,545 analogy items, 1,285 odd-one-out items).
Note: v0.2 is a development benchmark, not a finalized release benchmark; these numbers are for cross-method comparison and may change.
| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|---|---|---|---|
| FastText CBOW-300-10 | 76.2% | 91.5% | 70.3% |
| Floret (lg) | 74.3% | 89.7% | 68.5% |
| Floret (md) | 73.4% | 88.2% | 65.7% |
| Word2Vec CBOW-300-10 | 42.3% | 71.0% | 77.3% |
| GloVe 300 | 16.3% | 34.1% | 66.4% |
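This card does not describe how benchmark items are scored, so the following is only a sketch of one common way to compute such metrics. It assumes analogies (a : b :: c : ?) are ranked with the 3CosAdd method, excluding the three query words, and that the odd one out is the word with the lowest mean cosine similarity to the rest of its group. The toy vectors and function names are invented for illustration.

```python
import numpy as np

def unit(m):
    """L2-normalize along the last axis so dot products are cosines."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def analogy_rank(vecs, a, b, c, expected):
    """Rank of `expected` for a:b :: c:? using 3CosAdd (assumed method)."""
    words = [w for w in vecs if w not in (a, b, c)]
    matrix = unit(np.array([vecs[w] for w in words]))
    target = unit(vecs[b] - vecs[a] + vecs[c])
    order = np.argsort(-(matrix @ target))
    ranked = [words[i] for i in order]
    return ranked.index(expected) + 1

def odd_one_out(vecs, group):
    """Pick the word with the lowest mean cosine to the rest of the group."""
    matrix = unit(np.array([vecs[w] for w in group]))
    sims = matrix @ matrix.T
    mean_to_others = (sims.sum(axis=1) - 1.0) / (len(group) - 1)
    return group[int(np.argmin(mean_to_others))]

# Toy 2-d vectors (made up): axis 0 ~ "royalty", axis 1 ~ "feminine".
toy = {
    "rex": np.array([1.0, 0.0]),
    "regina": np.array([1.0, 1.0]),
    "vir": np.array([0.0, 0.1]),
    "femina": np.array([0.0, 1.1]),
    "bellum": np.array([-1.0, -0.2]),
}

rank = analogy_rank(toy, "vir", "femina", "rex", "regina")
print("rank-1 hit:", rank <= 1, "| rank-5 hit:", rank <= 5)
print(odd_one_out(toy, ["rex", "regina", "bellum"]))
```

Under this reading, "Analogy Rank 1" and "Analogy Rank 5" would be the share of items whose expected answer has rank at most 1 or 5, and "Odd-One-Out" the share of groups where the predicted outlier matches the labeled one.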
The md vectors (50k buckets) trade a small amount of accuracy (0.9 points on analogy rank 1, 1.5 points on analogy rank 5, and 2.8 points on odd-one-out) for ~4x smaller size versus lg (200k buckets). Floret is competitive with FastText on analogies (within 1.9 points at rank 1 for lg and 2.8 points for md) while being much smaller and supporting arbitrary vocabulary, which is why we use Floret vectors for pipeline training.
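The ~4x figure follows directly from the bucket counts. A back-of-the-envelope sketch, assuming the table is stored as uncompressed 32-bit floats (an assumption: this card does not state the storage format or the packaged file sizes):

```python
DIM = 300
BYTES_PER_FLOAT32 = 4  # assumed storage type

def table_megabytes(n_buckets, dim=DIM):
    """Approximate size of an n_buckets x dim float32 table in MB (10^6 bytes)."""
    return n_buckets * dim * BYTES_PER_FLOAT32 / 1e6

md = table_megabytes(50_000)
lg = table_megabytes(200_000)
print(f"md: {md:.0f} MB, lg: {lg:.0f} MB, ratio: {lg / md:.0f}x")
```

Under that assumption the raw tables come to roughly 60 MB and 240 MB; the installed packages may differ because of compression and metadata.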
Training
Corpus
Trained on 14.3M sentences (278M tokens) from 15 sources:
| Source | Description |
|---|---|
| CC100-Latin | Web-crawled Latin text (deduplicated and filtered) |
| Latin Wikisource | General Latin texts |
| Latin Wikipedia | Latin Wikipedia articles |
| The Latin Library | General Latin texts |
| Patrologia Latina | Patristic texts |
| CAMENA Neo-Latin | Early modern Latin texts |
| CLTK-Tesserae Latin | Classical texts |
| Perseus Digital Library | Classical texts |
| CSEL | Corpus Scriptorum Ecclesiasticorum Latinorum (patristic) |
| digilibLT | Late-antique Latin texts |
| UD + LASLA treebanks | Annotated treebank sentences |
| Formulae | Early medieval legal formulae |
| EDH | Epigraphic Database Heidelberg (inscriptions) |
| Epistolae | Medieval Latin letters |
| PTA | Patristic Text Archive |
Parameters
| Parameter | Value |
|---|---|
| Algorithm | CBOW |
| Dimensions | 300 |
| Subword n-gram range | 3–6 |
| Hash buckets | 50,000 |
| Epochs | 15 |
| Negative sampling | 25 |
| Min count | 50 |
| Learning rate | 0.05 |
Training followed Sprugnoli et al. 2019 for epoch count and negative sampling parameters.
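For readers who want to try a similar setup, the parameters above map onto Floret's fastText-style command-line flags. The sketch below only assembles the argument list and does not run training. The corpus and output paths are hypothetical placeholders, the `floret` binary is assumed to be installed and on the PATH, and `-hashCount` is omitted (left at the tool's default) because this card does not state it.

```python
import shlex

# Values copied from the parameter table on this card.
PARAMS = {
    "dim": 300,
    "minn": 3,
    "maxn": 6,
    "bucket": 50_000,
    "epoch": 15,
    "neg": 25,
    "minCount": 50,
    "lr": 0.05,
}

def floret_command(corpus_path, output_prefix, params=PARAMS):
    """Build a `floret cbow` argument list in Floret's hashed-table mode."""
    args = ["floret", "cbow", "-mode", "floret"]
    for flag, value in params.items():
        args += [f"-{flag}", str(value)]
    args += ["-input", corpus_path, "-output", output_prefix]
    return args

# Hypothetical paths; pass the list to subprocess.run(...) to actually train.
cmd = floret_command("corpus.txt", "la_vectors_floret_md")
print(shlex.join(cmd))
```

Keeping the values in one dictionary makes it easy to derive the lg variant by changing only the bucket count.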
Citation
If you use these vectors, please cite this preprint:
@misc{burns2023latincy,
title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
author = "Burns, Patrick J.",
year = "2023",
eprint = "2305.04365",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
url = "https://arxiv.org/abs/2305.04365"
}
See also
- la_vectors_floret_lg: Large vectors (200k buckets)
- LatinCy pipelines: Latin NLP pipelines for spaCy using these vectors
References
- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. “Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin.” In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.