---
library_name: transformers
tags:
- text-classification
- bert
- greek
- ancient-text
- multi-class-classification
language:
- el
base_model: nlpaueb/bert-base-greek-uncased-v1
---

# GreekBERT Fine-tuned: Ancient Text Location Classification

Fine-tuned version of [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1) for multi-class classification of ancient Greek texts by geographic provenance.

Given an ancient inscription or scripture fragment, the model predicts the region or location it originated from across 15 classes. Built as part of the Ancient Texts Provenance Challenge (Kaggle, nppe1).

---

## Model Details

- **Model type:** BERT-based sequence classifier
- **Base model:** nlpaueb/bert-base-greek-uncased-v1
- **Task:** Multi-class text classification (15 classes)
- **Language:** Ancient/Classical Greek
- **Developed by:** Anand Kumar
- **Training platform:** Kaggle (GPU T4/P100)
- **Experiment tracking:** Weights & Biases (W&B)

---

## Training Details

### Dataset

- **Source:** Ancient Texts Provenance Challenge (Kaggle, nppe1)
- **Split:** 80/20 stratified train/test split (seed=42, stratified by label)
- **Classes:** 15 geographic provenance labels
- **Note:** Dataset has significant class imbalance, addressed via Macro-F1 as the primary evaluation metric

### Preprocessing

- Tokenized using `nlpaueb/bert-base-greek-uncased-v1` tokenizer
- Truncated to max length of 512 tokens
- Dynamic padding via `DataCollatorWithPadding`

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 5 |
| Per-device batch size | 32 |
| Learning rate | 5e-5 |
| LR scheduler | Linear with warmup |
| Warmup ratio | 0.1 |
| Precision | fp16 mixed precision |
| Evaluation strategy | Per epoch |

---

## Evaluation

### Metrics

Macro-F1 was chosen as the primary metric due to class imbalance in the dataset. It evaluates performance equally across all 15 classes regardless of class frequency.

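
To illustrate why Macro-F1 and accuracy can diverge on imbalanced data, here is a minimal pure-Python sketch. It is not the evaluation code used for this model. The helper names `macro_f1` and `accuracy` and the toy labels are assumptions made up for illustration only.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): class 0 dominates and the rare class 2 is never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.700
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.489
```

In this toy case the missed rare class contributes an F1 of 0 to the macro average, which pulls Macro-F1 well below accuracy. The gap between the scores reported for this model is at least consistent with that pattern.
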

### Results

| Metric | Score |
|---|---|
| Macro-F1 | 0.51 |
| Accuracy | 0.66 |

*Tracked and logged via Weights & Biases (project: nppe1, run: greek-bert)*

---

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "anand095/greek-bert-5epoch-lr-5e-5-warmup"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "your ancient greek text here"
# Truncate to the same 512-token limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for prediction
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class among the 15 provenance labels
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print(f"Predicted location class: {predicted_class}")
```

---

## Limitations

- Trained specifically on the nppe1 Kaggle dataset; performance on other ancient text corpora may vary
- Limited to 15 predefined geographic classes from the training data
- Model handles ancient/classical Greek text only; not suitable for modern Greek
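
The usage snippet in How to Use returns only the single argmax class index. Given the reported Macro-F1 of 0.51, it may be useful to look at the runner-up classes as well. The following is a minimal dependency-free sketch, assuming the logits have already been pulled out as a flat Python list (for example via `outputs.logits[0].tolist()`). The helper names `softmax` and `top_k` and the example logits are hypothetical and not part of this model's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most probable (class_index, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for a 15-class head; in practice take them from the model output.
example_logits = [0.1, 2.3, -1.0, 0.5, 1.9, 0.0, -0.4, 0.2,
                  0.3, -2.0, 1.1, 0.7, -0.8, 0.4, 0.6]

for idx, prob in top_k(example_logits):
    print(f"class {idx}: {prob:.3f}")
```

The same ranking can be obtained directly on tensors with `torch.softmax` and `torch.topk`; the plain-Python version is shown only to keep the arithmetic visible.
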