la_vectors_floret_md

Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).

Part of the LatinCy project: pretrained NLP pipelines for Latin.

Overview

Feature Value
Type Floret (hash-based subword embeddings)
Dimensions 300
Hash buckets 50,000
Algorithm CBOW
Language Latin (la)
spaCy version >=3.8.11,<3.9.0
License MIT

Floret vectors use hash-based subword embeddings, so every word gets a vector and there are no out-of-vocabulary words. This is especially important for morphologically rich languages like Latin, where a single lemma surfaces in many inflected forms.
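The idea can be illustrated with a minimal sketch, assuming fastText-style character n-grams with `<` and `>` as word-boundary markers and a single hash per piece. The helper names are hypothetical, and `zlib.crc32` is only a stand-in: it is not the hash function floret uses. What the sketch shows is that any string, attested or not, maps to valid rows of a fixed-size table.

```python
import zlib

BUCKETS = 50_000       # hash buckets, as in these md vectors
MIN_N, MAX_N = 3, 6    # subword n-gram range from the training parameters


def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    """Return the word's character n-grams, with < and > marking its edges."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket_ids(word, buckets=BUCKETS):
    """Map the whole word and each of its n-grams to a row index.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    pieces = [f"<{word}>"] + char_ngrams(word)
    return [zlib.crc32(p.encode("utf-8")) % buckets for p in pieces]


# Inflected forms share n-grams such as "<reg", so they share table rows.
for form in ("regit", "regunt", "regebantur"):
    ids = bucket_ids(form)
    print(form, len(ids), ids[:4])
```

In a real model the word vector is then built from the rows these indices select, which is why a rare or unseen form still receives a vector.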

Installation

pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.10.0-py3-none-any.whl

Usage

import spacy

nlp = spacy.load("la_vectors_floret_md")

# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])

# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))

These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
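For readers who want to see what the similarity score in the usage example measures: spaCy's default similarity is a cosine similarity between vectors. The sketch below computes it by hand on toy 3-dimensional vectors that are invented for illustration; the real vectors have 300 dimensions and different values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
bellum = [0.9, 0.1, 0.3]
pugna = [0.8, 0.2, 0.4]
rosa = [-0.2, 0.9, -0.1]

print(round(cosine(bellum, pugna), 3))  # near-synonyms point the same way
print(round(cosine(bellum, rosa), 3))   # unrelated words score lower
```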

Evaluation

All methods were trained on the same v3.10 corpus and scored on the same benchmark (v0.2: 1,545 analogy items, 1,285 odd-one-out items).

Note: v0.2 is a development benchmark, not a finalized release benchmark; these numbers are for cross-method comparison and may change.
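This card does not describe how benchmark items are scored. As a rough sketch of one common approach, an analogy item a : b :: c : ? can be scored by ranking candidates by cosine similarity to the offset vector b - a + c, and an odd-one-out item by picking the word with the lowest mean similarity to the rest. The function names and toy vectors below are hypothetical and are not taken from the benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy_rank(vectors, a, b, c, expected):
    """Rank of `expected` for a : b :: c : ? using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def odd_one_out(vectors, words):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)


# Toy vectors, invented for illustration only.
toy = {
    "vir": [0.2, 0.1, 0.1],
    "mulier": [0.2, 0.9, 0.1],
    "rex": [0.9, 0.1, 0.1],
    "regina": [0.9, 0.9, 0.1],
    "bellum": [0.1, 0.1, 0.9],
}

rank = analogy_rank(toy, "vir", "mulier", "rex", "regina")
print("rank:", rank, "| hit@1:", rank <= 1, "| hit@5:", rank <= 5)
print("odd one out:", odd_one_out(toy, ["rex", "regina", "bellum"]))
```

Under this reading, "Analogy Rank 1" and "Analogy Rank 5" would be the share of items whose expected answer lands at rank 1 or within the top 5.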

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
| --- | --- | --- | --- |
| FastText CBOW-300-10 | 76.2% | 91.5% | 70.3% |
| Floret (lg) | 74.3% | 89.7% | 68.5% |
| Floret (md) | 73.4% | 88.2% | 65.7% |
| Word2Vec CBOW-300-10 | 42.3% | 71.0% | 77.3% |
| GloVe 300 | 16.3% | 34.1% | 66.4% |

The md vectors (50k buckets) trade a small amount of accuracy for a ~4x smaller size than lg (200k buckets). Floret stays within about 3 points of FastText on rank-1 analogies (73.4% for md and 74.3% for lg, against 76.2%) while being much smaller and supporting arbitrary vocabulary, which is why we use Floret vectors for pipeline training.
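The ~4x figure follows directly from the bucket counts, since both tables have 300 dimensions. A back-of-the-envelope calculation, assuming 32-bit floats (the card does not state the stored dtype or the actual file sizes):

```python
DIM = 300
BYTES_PER_FLOAT32 = 4  # assumption: values are stored as 32-bit floats


def table_megabytes(buckets, dim=DIM, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a buckets x dim embedding table in megabytes (10**6 bytes)."""
    return buckets * dim * bytes_per_value / 1e6


md = table_megabytes(50_000)
lg = table_megabytes(200_000)
print(f"md: {md:.0f} MB, lg: {lg:.0f} MB, ratio: {lg / md:.0f}x")
# prints "md: 60 MB, lg: 240 MB, ratio: 4x"
```

These are raw table sizes only; packaged files on disk will differ somewhat.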

Training

Corpus

Trained on 14.3M sentences (278M tokens) from 15 sources:

Source Description
CC100-Latin Web-crawled Latin text (deduplicated and filtered)
Latin Wikisource General Latin texts
Latin Wikipedia Latin Wikipedia articles
The Latin Library General Latin texts
Patrologia Latina Patristic texts
CAMENA Neo-Latin Early modern Latin texts
CLTK-Tesserae Latin Classical texts
Perseus Digital Library Classical texts
CSEL Corpus Scriptorum Ecclesiasticorum Latinorum (patristic)
digilibLT Late-antique Latin texts
UD + LASLA treebanks Annotated treebank sentences
Formulae Early medieval legal formulae
EDH Epigraphic Database Heidelberg (inscriptions)
Epistolae Medieval Latin letters
PTA Patristic Text Archive

Parameters

Parameter Value
Algorithm CBOW
Dimensions 300
Subword n-gram range 3–6
Hash buckets 50,000
Epochs 15
Negative sampling 25
Min count 50
Learning rate 0.05

Training followed Sprugnoli et al. 2019 for epoch count and negative sampling parameters.
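To make the parameter table concrete, the sketch below assembles a floret command line from those values. The flag names are floret's standard options, but the input and output paths are hypothetical, and the number of hashes per entry (`-hashCount`) is not stated in this card, so it is left at floret's default. This is an approximation of how such a run could be launched, not the exact command used for these vectors.

```python
import shlex

# Values taken from the Parameters table.
params = {
    "dim": 300,
    "minn": 3,
    "maxn": 6,
    "bucket": 50_000,
    "epoch": 15,
    "neg": 25,
    "minCount": 50,
    "lr": 0.05,
}

# Hypothetical paths; the card does not name the corpus file or output prefix.
corpus_path = "corpus.txt"
output_prefix = "la_floret_md"

# "cbow" selects the algorithm; "-mode floret" selects the hashed table format.
args = ["floret", "cbow", "-mode", "floret"]
for flag, value in params.items():
    args += [f"-{flag}", str(value)]
args += ["-input", corpus_path, "-output", output_prefix]

print(shlex.join(args))
```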

Citation

If you use these vectors, please cite this preprint:

@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}

References

  • Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.