la_vectors_floret_md

Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).

Part of the LatinCy project — pretrained NLP pipelines for Latin.

Overview

Feature	Value
Type	Floret (hash-based subword embeddings)
Dimensions	300
Hash buckets	50,000
Algorithm	CBOW
Language	Latin (`la`)
spaCy version	`>=3.8.11,<3.9.0`
License	MIT

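Floret vectors use hash-based subword embeddings: a word's vector is built from its character n-grams, which are hashed into a fixed number of buckets. Every word therefore gets a vector, and there are no out-of-vocabulary words. This matters for a morphologically rich language like Latin, where individual inflected forms of a lemma may be rare or absent in the training corpus.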
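
To make the idea concrete, here is a minimal sketch of splitting a word into character n-grams and mapping each one to a bucket row. The n-gram range (3 to 6) and the bucket count (50,000) come from this card; the SHA-1 hash and the helper names are illustrative assumptions, since floret's actual hash function and number of hashes per n-gram are not described here, so real row ids will differ.

```python
import hashlib

NUM_BUCKETS = 50_000   # hash buckets for this model
MIN_N, MAX_N = 3, 6    # subword n-gram range from the training parameters


def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket_id(ngram, num_buckets=NUM_BUCKETS):
    """Map an n-gram to a bucket row.

    SHA-1 is a stand-in for illustration only, not floret's hash function.
    """
    digest = hashlib.sha1(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def subword_rows(word):
    """Bucket rows whose vectors would be combined into the word's vector."""
    return sorted({bucket_id(g) for g in char_ngrams(word)})


# Two inflected forms of "rex" share n-grams, so they share bucket rows.
shared = set(char_ngrams("regis")) & set(char_ngrams("regem"))
print(sorted(shared))
print(subword_rows("regibus")[:5])
```

Because any string can be split and hashed this way, a form that never occurred in training still resolves to rows in the table.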

Installation

pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.10.0-py3-none-any.whl

Usage

import spacy

nlp = spacy.load("la_vectors_floret_md")

# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])

# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))

These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
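
For readers who want to see what the similarity call in the usage example computes: spaCy's `Doc.similarity` returns the cosine similarity of the two document vectors, and a document vector defaults to the average of its token vectors. The sketch below reproduces that calculation in plain Python on made-up 3-dimensional vectors (hypothetical values, not model output).

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy token vectors (hypothetical values, not taken from the model).
doc_a = mean_vector([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])  # averages to [1.0, 1.0, 1.0]
doc_b = mean_vector([[2.0, 2.0, 2.0]])
print(cosine(doc_a, doc_b))
```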

Evaluation

All methods were trained on the same v3.10 corpus and scored on the same benchmark (v0.2: 1,545 analogy items, 1,285 odd-one-out items).

Note: v0.2 is a development benchmark, not a finalized release benchmark — these numbers are for cross-method comparison and may change.

Model	Analogy Rank 1	Analogy Rank 5	Odd-One-Out
FastText CBOW-300-10	76.2%	91.5%	70.3%
Floret (lg)	74.3%	89.7%	68.5%
Floret (md)	73.4%	88.2%	65.7%
Word2Vec CBOW-300-10	42.3%	71.0%	77.3%
GloVe 300	16.3%	34.1%	66.4%
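
The scoring procedure behind these columns is not spelled out in this card. As a sketch under that caveat, one common approach is: answer an analogy item a : b :: c : ? by ranking candidate words by cosine similarity to b - a + c (a hit at rank 1 if the expected word comes first, at rank 5 if it is among the top five), and answer an odd-one-out item by picking the word with the lowest mean similarity to the others. The 2-dimensional vectors below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors, a, b, c, expected):
    """1-based rank of `expected` among candidates for a : b :: c : ?"""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate list.
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked.index(expected) + 1


def odd_one_out(vectors, words):
    """Word with the lowest mean cosine similarity to the other words."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)


# Hypothetical toy vectors, not taken from any trained model.
toy = {
    "rex": [1.0, 0.0],
    "regina": [1.0, 1.0],
    "vir": [0.1, 0.0],
    "femina": [0.1, 1.0],
    "bellum": [-1.0, -0.5],
}
print(analogy_rank(toy, "vir", "femina", "rex", "regina"))
print(odd_one_out(toy, ["rex", "regina", "bellum"]))
```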

The md vectors (50k buckets) are ~4x smaller than lg (200k buckets) and give up a small amount of accuracy: 0.9 points on analogy rank 1, 1.5 points on analogy rank 5, and 2.8 points on odd-one-out. Floret is competitive with FastText on analogies (74.3% vs. 76.2% at rank 1 for lg) while being much smaller and supporting arbitrary vocabulary, which is why we use Floret vectors for pipeline training.
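
The size difference follows from the shape of the vector table. The arithmetic below assumes one 300-dimensional row per bucket stored as 32-bit floats; the storage type is an assumption, and the packaged model also carries some metadata, so treat the figures as rough estimates.

```python
DIM = 300
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32


def table_megabytes(buckets, dim=DIM, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a buckets x dim vector table in megabytes (10**6 bytes)."""
    return buckets * dim * bytes_per_value / 1_000_000


md_mb = table_megabytes(50_000)    # md: 50k buckets
lg_mb = table_megabytes(200_000)   # lg: 200k buckets
print(md_mb, lg_mb, lg_mb / md_mb)  # 60.0 240.0 4.0
```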

Training

Corpus

Trained on 14.3M sentences (278M tokens) from 15 sources:

Source	Description
CC100-Latin	Web-crawled Latin text (deduplicated and filtered)
Latin Wikisource	General Latin texts
Latin Wikipedia	Latin Wikipedia articles
The Latin Library	General Latin texts
Patrologia Latina	Patristic texts
CAMENA Neo-Latin	Early modern Latin texts
CLTK-Tesserae Latin	Classical texts
Perseus Digital Library	Classical texts
CSEL	Corpus Scriptorum Ecclesiasticorum Latinorum (patristic)
digilibLT	Late-antique Latin texts
UD + LASLA treebanks	Annotated treebank sentences
Formulae	Early medieval legal formulae
EDH	Epigraphic Database Heidelberg (inscriptions)
Epistolae	Medieval Latin letters
PTA	Patristic Text Archive

Parameters

Parameter	Value
Algorithm	CBOW
Dimensions	300
Subword n-gram range	3–6
Hash buckets	50,000
Epochs	15
Negative sampling	25
Min count	50
Learning rate	0.05

The epoch count (15) and the number of negative samples (25) follow Sprugnoli et al. 2019.
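
The parameters above map onto the fastText-style options that floret's command-line tool accepts. The sketch below only assembles such a command as a list of arguments and does not run it. The input and output names are hypothetical placeholders, and any option missing from the table (for example, the number of hashes per n-gram) is left out because it is not stated here.

```python
import shlex

# Values copied from the parameter table; keys are fastText-style flag names.
PARAMS = {
    "dim": 300,
    "minn": 3,
    "maxn": 6,
    "bucket": 50_000,
    "epoch": 15,
    "neg": 25,
    "minCount": 50,
    "lr": 0.05,
}


def build_command(input_path, output_prefix, params=PARAMS):
    """Assemble (but do not run) a floret CBOW training command."""
    cmd = ["floret", "cbow", "-mode", "floret",
           "-input", input_path, "-output", output_prefix]
    for name, value in params.items():
        cmd += [f"-{name}", str(value)]
    return cmd


# Hypothetical file names, for illustration only.
command = build_command("latin_corpus.txt", "la_floret_md")
print(shlex.join(command))
```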

Citation

If you use these vectors, please cite this preprint:

@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}

References

Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. “Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin.” In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.

