GreekBERT Fine-tuned: Ancient Text Location Classification

Fine-tuned version of nlpaueb/bert-base-greek-uncased-v1 for multi-class classification of ancient Greek texts by geographic provenance. Given an ancient inscription or scripture fragment, the model predicts which of 15 regions or locations the text originated from.

Built as part of the Ancient Texts Provenance Challenge (Kaggle, nppe1).


Model Details

  • Model type: BERT-based sequence classifier
  • Base model: nlpaueb/bert-base-greek-uncased-v1
  • Task: Multi-class text classification (15 classes)
  • Language: Ancient/Classical Greek
  • Developed by: Anand Kumar
  • Training platform: Kaggle (GPU T4/P100)
  • Experiment tracking: Weights & Biases (W&B)

Training Details

Dataset

  • Source: Ancient Texts Provenance Challenge (Kaggle, nppe1)
  • Split: 80/20 train/test split, stratified by label (seed=42)
  • Classes: 15 geographic provenance labels
  • Note: The dataset has significant class imbalance, which is why Macro-F1 is used as the primary evaluation metric
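
The stratified 80/20 split described above can be sketched in plain Python. This is a minimal illustration, not the author's actual code: the function name `stratified_split` and the toy integer labels are hypothetical, and in practice scikit-learn's `train_test_split` with its `stratify` argument is the more common tool.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions similar in both."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        # Keep at least one example on each side when the class has more than one.
        if len(indices) > 1:
            n_test = min(max(n_test, 1), len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels (hypothetical; the real dataset has 15 classes).
labels = [0] * 50 + [1] * 10 + [2] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class means that even the rarest label appears in both the training and the held-out portion, which matters when per-class F1 is being averaged.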

Preprocessing

  • Tokenized using nlpaueb/bert-base-greek-uncased-v1 tokenizer
  • Truncated to max length of 512 tokens
  • Dynamic padding via DataCollatorWithPadding
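
To illustrate what truncation plus dynamic padding means, here is a small library-free sketch. It is not the implementation of DataCollatorWithPadding: the helper `pad_batch`, the pad id of 0, and the token ids are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate each sequence to max_length, then pad to the longest one in this batch."""
    # Plain slicing for illustration; a real tokenizer also preserves special tokens.
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; real ids come from the tokenizer.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding goes only as far as the longest sequence in each batch, batches of short fragments avoid being padded out to the full 512 tokens.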

Hyperparameters

Parameter               Value
Epochs                  5
Per-device batch size   32
Learning rate           5e-5
LR scheduler            Linear with warmup
Warmup ratio            0.1
Precision               fp16 mixed precision
Evaluation strategy     Per epoch
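
The shape of a linear schedule with warmup, using the learning rate and warmup ratio listed above, can be sketched as follows. The function `linear_warmup_lr` and the total of 1000 optimizer steps are assumptions for illustration; the actual step count depends on the dataset size, which is not stated here.

```python
def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at an optimizer step: linear ramp up, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: grow from 0 towards base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink from base_lr to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 1000  # assumed value, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps))
```

With these assumed numbers the rate peaks at 5e-5 after the first 10% of steps and then falls linearly to zero by the end of training.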

Evaluation

Metrics

Macro-F1 was chosen as the primary metric due to class imbalance in the dataset. It averages the per-class F1 scores, so each of the 15 classes counts equally regardless of its frequency.
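
A small worked example may help show why Macro-F1 and accuracy can diverge on imbalanced data. The sketch below uses toy labels (not the challenge data) and hypothetical helper names; scikit-learn's `f1_score` with `average="macro"` computes the same quantity.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy case: class 0 dominates and the classifier always predicts it.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(f"accuracy={accuracy(y_true, y_pred):.3f}  macro_f1={macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.8 while Macro-F1 is about 0.444, because the ignored minority class contributes an F1 of zero to the average.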

Results

Metric      Score
Macro-F1    0.51
Accuracy    0.66

Tracked and logged via Weights & Biases (project: nppe1, run: greek-bert).


How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "anand095/greek-bert-5epoch-lr-5e-5-warmup"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "your ancient greek text here"
# Truncate to the same 512-token limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring of the 15 classes
predicted_class = torch.argmax(outputs.logits, dim=1).item()

print(f"Predicted location class: {predicted_class}")

Limitations

  • Trained specifically on the nppe1 Kaggle dataset; performance on other ancient text corpora may vary
  • Limited to 15 predefined geographic classes from the training data
  • Model handles ancient/classical Greek text only; not suitable for modern Greek