GreekBERT Fine-tuned — Ancient Text Location Classification

Fine-tuned version of nlpaueb/bert-base-greek-uncased-v1 for multi-class classification of ancient Greek texts by geographic provenance. Given an ancient inscription or scripture fragment, the model predicts which of 15 regions or locations it originated from.

Built as part of the Ancient Texts Provenance Challenge (Kaggle — nppe1).

Model Details

Model type: BERT-based sequence classifier
Base model: nlpaueb/bert-base-greek-uncased-v1
Task: Multi-class text classification (15 classes)
Language: Ancient/Classical Greek
Developed by: Anand Kumar
Training platform: Kaggle (GPU T4/P100)
Experiment tracking: Weights & Biases (W&B)

Training Details

Dataset

Source: Ancient Texts Provenance Challenge (Kaggle — nppe1)
Split: 80/20 train/test split, stratified by label (seed=42)
Classes: 15 geographic provenance labels
Note: The dataset has significant class imbalance, so Macro-F1 is used as the primary evaluation metric
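A stratified split keeps each class at roughly the same proportion in both partitions, which matters when some classes are rare. The card does not say which tool produced the split; scikit-learn's train_test_split with its stratify argument is a common choice. The sketch below shows the idea using only the standard library. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one example of every class in the training set.
        n_test = min(round(len(indices) * test_fraction), len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 100 / 20 / 5 examples for three classes.
labels = ["A"] * 100 + ["B"] * 20 + ["C"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because each class is shuffled and cut independently, even the smallest class contributes examples to both partitions, which a plain random split does not guarantee.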

Preprocessing

Tokenized using nlpaueb/bert-base-greek-uncased-v1 tokenizer
Truncated to max length of 512 tokens
Dynamic padding via DataCollatorWithPadding
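Dynamic padding pads each batch only to the length of its longest sequence, rather than padding every example to 512 tokens. The following is a simplified illustration of that behaviour, not the DataCollatorWithPadding implementation. The token ids and the pad id of 0 are assumptions, and a real tokenizer also preserves special tokens when truncating.

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate to max_length, then pad to the longest sequence in this batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Hypothetical token ids for three short fragments.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 2, 102]]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

For short inscriptions this keeps batches far narrower than 512 tokens, which reduces wasted computation on padding.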

Hyperparameters

Parameter	Value
Epochs	5
Per-device batch size	32
Learning rate	5e-5
LR scheduler	Linear with warmup
Warmup ratio	0.1
Precision	fp16 mixed precision
Evaluation strategy	Per epoch
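To make the "Linear with warmup" entry concrete, the sketch below computes the learning rate at a given optimizer step: it rises linearly from 0 to the peak of 5e-5 over the first 10% of steps, then decays linearly to 0. The total step count is hypothetical, since it depends on the dataset size, and the function is an illustration of the schedule shape rather than the trainer's own code.

```python
def linear_warmup_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for a linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from peak_lr to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps))
```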

Evaluation

Metrics

Macro-F1 was chosen as the primary metric because of the class imbalance in the dataset. It is the unweighted mean of the per-class F1 scores, so each of the 15 classes counts equally regardless of how many examples it has.
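A small worked example shows why accuracy and Macro-F1 can diverge on imbalanced data. The labels below are a toy case, not taken from the dataset: a classifier that always predicts the majority class scores well on accuracy but poorly on Macro-F1.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced case: 8 examples of class 0, 2 of class 1,
# and a classifier that always predicts class 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.8, while Macro-F1 is about 0.444 because the minority class contributes an F1 of 0 to the average.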

Results

Metric	Score
Macro-F1	0.51
Accuracy	0.66

Tracked and logged via Weights & Biases — project: nppe1, run: greek-bert

How to Use

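from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "anand095/greek-bert-5epoch-lr-5e-5-warmup"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "your ancient greek text here"
# Truncate to the same 512-token limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

# id2label falls back to generic names such as "LABEL_3" if the checkpoint stores no label names
predicted_label = model.config.id2label[predicted_class]
print(f"Predicted location class: {predicted_class} ({predicted_label})")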
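Since a single argmax hides how close the runner-up classes are, it can help to look at the top few classes with their probabilities. The sketch below applies a softmax to a list of logits and returns the k most likely class indices. The logits shown are hypothetical and use 4 classes for brevity; with the snippet above, `outputs.logits[0].tolist()` would supply the real 15 values.

```python
import math

def top_k_probs(logits, k=3):
    """Return the k most likely (class_index, probability) pairs."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical values, 4 classes for brevity
top = top_k_probs(logits, k=2)
for class_index, prob in top:
    print(f"class {class_index}: p = {prob:.3f}")
```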

Limitations

Trained specifically on the nppe1 Kaggle dataset — performance on other ancient text corpora may vary
Limited to 15 predefined geographic classes from the training data
Model handles ancient/classical Greek text only; not suitable for modern Greek

Model size: 0.1B params (Safetensors, tensor type F32)