---
language:
- is
- non
license: gpl-3.0
library_name: transformers
tags:
- named-entity-recognition
- ner
- token-classification
- old-icelandic
- medieval-icelandic
- historical-nlp
- icelandic
- old-norse
- medieval-nlp
- medieval-scandinavian
- old-scandinavian
datasets:
- custom
base_model: mideind/IceBERT
pipeline_tag: token-classification
metrics:
- f1
- precision
- recall
widget:
- text: Í þann tíma var hǫfðingi ágǽtr á Íslandi í Ísafirði, er Vermundr hét.
  example_title: Example 1
- text: Valgarðar hins grá var Ulfr aurgoði, er Oddaverjar eru frá komnir.
  example_title: Example 2
- text: Fór Þorkatla heim með Merði ok var fyrir búi.
  example_title: Example 3
model-index:
- name: oldbertur-normalised-old-icelandic-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: custom
      name: Medieval Icelandic NER (Normalised)
    metrics:
    - type: f1
      value: 0.93
      name: F1 (micro)
    - type: precision
      value: 0.9
      name: Precision
    - type: recall
      value: 0.95
      name: Recall
---

# OldBERTur: Named Entity Recognition for Normalised Old Icelandic

This model performs Named Entity Recognition (NER) on **normalised** Old Icelandic texts sourced from medieval manuscripts, identifying **Person** and **Location** entities.
The model is fully functional; this model card will be updated with supplementary information in the near future.

## Model Description

- **Model type:** Token classification (NER)
- **Base model:** [mideind/IceBERT](https://huggingface.co/mideind/IceBERT)
- **Language:** Old Icelandic (normalised transcription level)
- **Entity types:** Person, Location (BIO tagging scheme)
- **F1 Score:** 0.93

This model is fine-tuned from IceBERT for NER on Old Icelandic texts at the normalised transcription level, as defined by [Menota](https://www.menota.org/).

For **diplomatic** transcriptions of Old Icelandic texts, please use [oldbertur-diplomatic-old-icelandic-ner](https://huggingface.co/Riksarkivet/oldbertur-diplomatic-old-icelandic-ner) instead.

## Intended Uses

- Named entity recognition in normalised Old Icelandic texts
- Digital humanities research on Medieval Icelandic literature
- Semi-automatic annotation of historical Icelandic documents
- Information extraction from saga literature and historical texts

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")

# aggregation_strategy="first" merges subword tokens into words and labels each word by its first subword
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Í þann tíma var hǫfðingi ágǽtr á Íslandi í Ísafirði, er Vermundr hét"
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.3f})")
```

**Expected output:**
```
Íslandi: Location (1.000)
Ísafirði,: Location (1.000)
Vermundr: Person (1.000)
```
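
Punctuation attached to a word is merged into the entity span (note `Ísafirði,` in the output above), so a small post-processing step can be useful. The following is a minimal sketch and not part of the released code: `strip_punctuation` is a hypothetical helper, and the `results` list is hand-written to mimic the pipeline's output format (`entity_group`, `word`, `score`, `start`, `end`) rather than real model output.

```python
import string

# Punctuation to trim from entity edges; extend as needed for your edition.
PUNCT = string.punctuation + "«»„“”"

def strip_punctuation(entities):
    """Return copies of pipeline entities with edge punctuation removed."""
    cleaned = []
    for ent in entities:
        word = ent["word"]
        stripped = word.strip(PUNCT)
        if not stripped:  # the span was punctuation only; drop it
            continue
        new_ent = dict(ent, word=stripped)
        # Shift character offsets if the pipeline provided them.
        if ent.get("start") is not None and ent.get("end") is not None:
            lead = len(word) - len(word.lstrip(PUNCT))
            trail = len(word) - len(word.rstrip(PUNCT))
            new_ent["start"] = ent["start"] + lead
            new_ent["end"] = ent["end"] - trail
        cleaned.append(new_ent)
    return cleaned

# Hand-written sample in the pipeline's output format (not real model output).
results = [
    {"entity_group": "Location", "word": "Íslandi", "score": 1.0, "start": 33, "end": 40},
    {"entity_group": "Location", "word": "Ísafirði,", "score": 1.0, "start": 43, "end": 52},
    {"entity_group": "Person", "word": "Vermundr", "score": 1.0, "start": 56, "end": 64},
]

for ent in strip_punctuation(results):
    print(f"{ent['word']}: {ent['entity_group']}")
```

The helper returns new dictionaries and leaves the original pipeline output untouched, so both the raw and the cleaned spans remain available.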

## Training Data

This model uses the **(M + I)<sup>R</sup> + MIM** training configuration, combining:

| Source | Description |
|--------|-------------|
| **Menota (M)** | Normalised Old Icelandic texts from the Medieval Nordic Text Archive |
| **IcePaHC (I)** | Icelandic Parsed Historical Corpus (normalised Old Icelandic texts) |
| **MIM-GOLD-NER (MIM)** | Modern Icelandic NER data for data augmentation |

The superscript <sup>R</sup> means that sentence-level class resampling (sCR) was applied to both Menota and IcePaHC data. Since both sources use normalised orthography, both are resampled to address entity class imbalance.
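
To illustrate the general idea of sentence-level resampling, the sketch below repeats each training sentence according to how many entities it contains and how rare their classes are. It is a simplified illustration only: the weighting formula, the function names `entity_classes` and `resample`, and the toy corpus are assumptions made for this example, and the exact resampling factors used for this model are those in the project's code, not the ones shown here.

```python
import math
from collections import Counter

def entity_classes(tags):
    """Return the class of every entity (one per B- tag) in a BIO sequence."""
    return [t[2:] for t in tags if t.startswith("B-")]

def resample(sentences):
    """Repeat each sentence according to the rarity of the entities it holds.

    `sentences` is a list of (tokens, tags) pairs. Sentences without
    entities are kept once; others are repeated ceil(sum of rarity weights).
    """
    counts = Counter(c for _, tags in sentences for c in entity_classes(tags))
    total = sum(counts.values())
    # Illustrative rarity weight (an assumption): rarer classes weigh more.
    rarity = {c: -math.log2(n / total) + 1 for c, n in counts.items()}
    resampled = []
    for tokens, tags in sentences:
        weight = sum(rarity[c] for c in entity_classes(tags))
        repeats = max(1, math.ceil(weight))
        resampled.extend([(tokens, tags)] * repeats)
    return resampled

# Toy corpus with hand-assigned tags, for illustration only.
corpus = [
    (["Vermundr", "hét"], ["B-Person", "O"]),
    (["Fór", "Þorkatla", "heim", "með", "Merði"],
     ["O", "B-Person", "O", "O", "B-Person"]),
    (["á", "Íslandi"], ["O", "B-Location"]),
    (["ok", "var", "fyrir", "búi"], ["O", "O", "O", "O"]),
]

resampled = resample(corpus)
print(len(corpus), "->", len(resampled), "sentences")
```

Because whole sentences are duplicated, entities keep their original context, and sentences with rarer classes (here Location) are repeated more often than sentences with common ones.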

**Training set statistics:**

| Configuration | Person | Location | Total |
|---------------|-------:|---------:|------:|
| (M + I)<sup>R</sup> + MIM | 37,917 | 11,931 | 49,848 |

**Entity breakdown by source:**

| Source | Person | Location | Total |
|--------|-------:|---------:|------:|
| Menota (M) | 1,486 | 180 | 1,666 |
| IcePaHC (I) | 2,797 | 362 | 3,159 |
| (M + I)<sup>R</sup> after resampling | 22,330 | 2,929 | 25,259 |
| MIM-GOLD-NER | 15,587 | 9,002 | 24,589 |

**Evaluation sets:**
- Dev: 26,419 tokens, 1,301 entities (1,036 Person; 265 Location)
- Test: 25,893 tokens, 1,260 entities (997 Person; 263 Location)

The dev and test sets consist exclusively of Old Icelandic texts in order to reflect our target domain.

## Evaluation Results

| Metric | Score |
|--------|-------|
| **F1** | 0.93 |
| **Precision** | 0.90 |
| **Recall** | 0.95 |
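
For readers who want to compute comparable scores on their own annotations, the sketch below shows micro-averaged precision, recall and F1 over entity spans. It assumes exact-match, span-level scoring, which this card does not state explicitly; the function name `prf` and the toy spans are illustrative only.

```python
def prf(gold, pred):
    """Micro-averaged precision, recall and F1 over exact-match entity spans.

    `gold` and `pred` are sets of (sentence_id, start, end, label) tuples.
    A prediction counts as correct only if boundaries and label both match.
    """
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy spans: the second prediction has the right boundaries but the wrong label.
gold = {(0, 7, 8, "Location"), (0, 10, 11, "Location"), (0, 12, 13, "Person")}
pred = {(0, 7, 8, "Location"), (0, 10, 11, "Person"), (0, 12, 13, "Person")}

p, r, f = prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Micro-averaging pools all entities before computing the scores, so the more frequent Person class has a larger influence on the result than Location.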

## Labels

The model uses BIO tagging with the following labels:

| Label | Description |
|-------|-------------|
| `O` | Outside any entity |
| `B-Person` | Beginning of a person name |
| `I-Person` | Inside/continuation of a person name |
| `B-Location` | Beginning of a location name |
| `I-Location` | Inside/continuation of a location name |
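
When working with raw token-level predictions instead of the aggregated pipeline output, the BIO labels need to be grouped into entity spans. The following is a minimal sketch of that decoding step; `bio_to_spans` is a hypothetical helper, and the tags in the example are assigned by hand for illustration, not taken from the gold annotations.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a matching B- is treated as a new entity.
            start, label = i, tag[2:]
    return spans

tokens = ["Valgarðar", "hins", "grá", "var", "Ulfr", "aurgoði"]
tags = ["B-Person", "I-Person", "I-Person", "O", "B-Person", "I-Person"]

for start, end, label in bio_to_spans(tags):
    print(" ".join(tokens[start:end]), "->", label)
```

Treating a stray `I-` tag as the start of a new entity is one lenient convention; a stricter decoder could discard such tags instead.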

## Limitations

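- **Orthography:** This model is trained on **normalised** texts. For diplomatic transcriptions, use the diplomatic variant, [oldbertur-diplomatic-old-icelandic-ner](https://huggingface.co/Riksarkivet/oldbertur-diplomatic-old-icelandic-ner).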
- **Entity types:** Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
- **Time period:** Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
- **Domain:** Optimised for saga literature and historical texts. May perform differently on other text types.

## Training Procedure

NER is framed as a token classification task, with a classification head added on top of IceBERT.

**Hyperparameters:**
- **Base model:** mideind/IceBERT
- **Epochs:** 5
- **Learning rate:** 2e-5
- **Batch size:** 16
- **Max sequence length:** 256 tokens
- **Warm-up ratio:** 10%
- **Weight decay:** 0.01

**Class imbalance handling:**
- Weighted cross-entropy loss, with a class weight of 0.1 for the non-entity label (`O`) and 30.0 for each entity label, so that errors on the comparatively rare entity tokens dominate the loss.
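
The effect of these class weights can be seen in a small worked example. The sketch below implements the weighted mean of per-token cross-entropy in plain Python. It normalises by the sum of the target weights, which is how `torch.nn.CrossEntropyLoss(weight=...)` behaves with `reduction="mean"`; whether the training code uses exactly this reduction is an assumption, as are the label order in `LABELS` and the toy logits.

```python
import math

# Assumed label order, for illustration; check the model config for the real one.
LABELS = ["O", "B-Person", "I-Person", "B-Location", "I-Location"]
# Class weights from this card: 0.1 for O, 30.0 for every entity label.
WEIGHTS = [0.1 if label == "O" else 30.0 for label in LABELS]

def weighted_cross_entropy(logits, targets, weights=WEIGHTS):
    """Weighted mean of per-token cross-entropy losses.

    logits: one list of per-class scores per token; targets: gold class indices.
    """
    num, den = 0.0, 0.0
    for scores, gold in zip(logits, targets):
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
        nll = log_z - scores[gold]  # negative log-likelihood of the gold class
        num += weights[gold] * nll
        den += weights[gold]
    return num / den

# Two tokens, both scored as most likely "O"; the second is really a B-Person.
logits = [[2.0, 0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0, 0.0]]
targets = [0, 1]

weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0] * len(LABELS))
print(f"weighted={weighted:.3f} unweighted={unweighted:.3f}")
```

With these weights an entity token is weighted 300 times more heavily than an `O` token, so the missed `B-Person` in the example accounts for almost all of the weighted loss.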

## Citation

If you use this model, please cite:

```bibtex
@misc{coming_soon,
}
```

## Resources
For more information on our code and data, see our GitHub repository. For more information about the work in general, see our paper (coming soon). 

- **Code & Data:** [GitHub Repository](https://github.com/phenningsson/Medieval-Icelandic-NER)
- **Diplomatic NER Model:** [oldbertur-diplomatic-old-icelandic-ner](https://huggingface.co/Riksarkivet/oldbertur-diplomatic-old-icelandic-ner) 
- **Paper:** Coming soon. 
- **Base Model:** [mideind/IceBERT](https://huggingface.co/mideind/IceBERT)

## Acknowledgments

We are grateful to the projects listed below for their work and for making their data available, which allowed us to conduct this research and develop NER models for Medieval Icelandic. We thank the developers, annotators, scholars, project managers, and everyone else who has contributed to these projects. We also express our sincerest gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works _Codex Wormianus_ (AM 242 fol) and _Vǫluspá in Hauksbók_ (AM 544 4to).

- **Menota:** [The Menota project](https://www.menota.org/)
- **Icelandic Parsed Historical Corpus (IcePaHC):** [Wallenberg et al. 2024](http://hdl.handle.net/20.500.12537/325)
- **IceBERT base model:** [Snæbjarnarson et al. 2022](https://huggingface.co/mideind/IceBERT)
- **MIM-GOLD-NER:** [Ingólfsdóttir et al. 2020](http://hdl.handle.net/20.500.12537/140)

## License

This model is released under the [GNU General Public License v3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.en.html).

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/phenningsson/Medieval-Icelandic-NER) or contact: phenningsson@me.com