---

language:
- non
tags:
- text-normalization
- historical-text
- old-icelandic
- seq2seq
- character-level
- multi-task
- medieval
license: mit
datasets:
- custom
metrics:
- cer
- wer
---


# Old Icelandic facs2dipl2norm

This repository contains a character-level transformer model for Old Icelandic manuscript normalisation tasks, specifically facsimile transcription to diplomatic transcription (facs → dipl) and diplomatic transcription to normalised form (dipl → norm). 

The model was trained on all the available MENOTA texts edited by Andrea de Leeuw van Weenen (AM 132 fol., AM 519 a 4to, and AM 677 4to). Together these make up around 75% of the currently available MENOTA texts that are normalised, lemmatised, and (at least partially) POS-tagged.

The model handles two tasks:
 
- **facs → dipl**: facsimile transcription → diplomatic transcription (abbreviation expansion, character normalisation)
- **dipl → norm**: diplomatic transcription → normalised form (orthographic regularisation)
 
Task routing is controlled by a prefix token prepended to the source sequence, so the same architecture and weights serve both tasks.
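
To make the prefix-token idea concrete, here is a minimal sketch of how a task token might be prepended to a character sequence. It is an illustration only: `build_vocab`, `encode_with_task`, and the special symbols other than `<DIPL>` and `<NORM>` are hypothetical, and the real encoding is defined by `encode_text` in `model_def.py` together with `vocab.json`.

```python
# Hypothetical illustration of prefix-token task routing.
# The actual vocabulary and encoding live in vocab.json and model_def.py.
SPECIALS = ["<PAD>", "<BOS>", "<EOS>", "<UNK>", "<DIPL>", "<NORM>"]

def build_vocab(chars):
    """Map special tokens and single characters to integer ids."""
    symbols = SPECIALS + sorted(set(chars))
    return {s: i for i, s in enumerate(symbols)}

def encode_with_task(text, task_token, c2i, max_len):
    """Prepend the task token, append <EOS>, then pad to max_len."""
    # Reserve two positions for the task token and <EOS>.
    body = [c2i.get(ch, c2i["<UNK>"]) for ch in text][: max_len - 2]
    ids = [c2i[task_token]] + body + [c2i["<EOS>"]]
    return ids + [c2i["<PAD>"]] * (max_len - len(ids))

c2i = build_vocab("kappi þinu")
dipl_ids = encode_with_task("kappi", "<DIPL>", c2i, max_len=10)
norm_ids = encode_with_task("kappi", "<NORM>", c2i, max_len=10)
```

The two encodings differ only in their first id, which is the one signal the model has for choosing between facs → dipl and dipl → norm.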

## Model Details
 
| Property | Value |
|---|---|
| Architecture | Transformer encoder-decoder |
| Parameters | ~10M |
| Vocabulary | ~120 characters (data-derived) |
| Max sequence length | 128 characters |
| Model dimension | 256 |
| Attention heads | 4 |
| Encoder / decoder layers | 3 / 3 |
| Feed-forward dim | 512 |
| Task tokens | `<DIPL>` (facs→dipl), `<NORM>` (dipl→norm) |
| Training data | ~36k line-level triples |
| Language | Old Icelandic (`non`) |
 
## Training Data

- Corpus size: 36,240 text chunks of varying length, totalling around 400k word tokens.

- Training-validation-test split: 80-10-10.

- Sources: <a href="https://clarino.uib.no/menota/catalogue/menota">AM 132 fol., AM 519 a 4to, and AM 677 4to</a>, edited and annotated by Andrea de Leeuw van Weenen.
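
For orientation, an 80-10-10 split of 36,240 chunks gives 28,992 training, 3,624 validation, and 3,624 test items. The sketch below shows one way to produce such a split; the function name, the seed, and the use of a random shuffle are assumptions, since the card does not say how the chunks were assigned (for example, randomly or by manuscript).

```python
import random

def split_indices(n_items, train_pct=80, val_pct=10, seed=0):
    """Shuffle item indices and cut them into train/validation/test parts."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = split_indices(36240)
print(len(train_idx), len(val_idx), len(test_idx))  # 28992 3624 3624
```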

## Training
TODO


## Performance

| Task | CER | WER |
|---|---|---|
| facs → dipl | 0.0112 | 0.0270 |
| dipl → norm | 0.0350 | 0.1370 |
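
CER and WER are edit-distance metrics: the number of character (or word) insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the reference length. The sketch below computes corpus-level rates in plain Python. The helper names are hypothetical, and whether the figures above were aggregated over the whole test set or averaged per line is not stated, so treat this as one plausible reading rather than the evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(refs, hyps, unit="char"):
    """Corpus-level error rate: total edits / total reference units."""
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        edits += edit_distance(ref, hyp)
        total += len(ref)
    return edits / total if total else 0.0

refs = ["kappi þinu ok dirfð"]
hyps = ["kappi þinu ok dirfd"]  # one wrong character in the last word
cer = error_rate(refs, hyps, unit="char")  # 1 edit over 19 characters
wer = error_rate(refs, hyps, unit="word")  # 1 edit over 4 words
```

A single wrong character counts as one full word error, which is why WER is usually several times higher than CER on the same output.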
 

## Intended Use
 
This model is intended for researchers and digital humanists working with Old Icelandic manuscript material who need to automate or assist the production of diplomatic and normalised transcriptions from facsimile-level text (for example, HTR output from models such as OICEN-HTR).

## Usage

Try it out in <a href="https://colab.research.google.com/drive/13Rq2FZomqRjdG5DyHMNcuWSmv0rbq3qR?usp=sharing">Google Colab</a>!

```python
import json

import torch

from model_def import CharSeq2Seq, encode_text, decode_ids, greedy_decode, DIPL_IDX, NORM_IDX

# Load vocab
with open("vocab.json", encoding="utf-8") as f:
    v = json.load(f)
c2i = v["c2i"]
i2c = {int(k): val for k, val in v["i2c"].items()}

# Load model
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
ckpt   = torch.load("best_model.pt", map_location=DEVICE)
hp     = ckpt["hparams"]

model = CharSeq2Seq(
    vocab_size = hp["VOCAB_SIZE"],
    d_model    = hp["D_MODEL"],
    n_heads    = hp["N_HEADS"],
    n_enc      = hp["N_ENC"],
    n_dec      = hp["N_DEC"],
    d_ff       = hp["D_FF"],
    max_len    = hp["MAX_LEN"],
    dropout    = hp["DROPOUT"],
).to(DEVICE)
model.load_state_dict(ckpt["model"])
model.eval()
```
 
### facs → dipl
 
```python
MAX_LEN = hp["MAX_LEN"]

def predict_dipl(texts):
    if isinstance(texts, str):
        texts = [texts]
    src = torch.tensor(
        [encode_text(t, DIPL_IDX, c2i, MAX_LEN) for t in texts],
        dtype=torch.long
    )
    return greedy_decode(model, src, MAX_LEN, DEVICE, i2c)

predict_dipl("koma egƚ. kappı þınu ⁊ ꝺırꝼð . en ſkaplynꝺı") # random line from test set
# → "koma eg(il)l kappi þinu (ok) dirfð . en ſkaplyndi"
```
 
### dipl → norm
 
```python
def predict_norm(texts):
    if isinstance(texts, str):
        texts = [texts]
    src = torch.tensor(
        [encode_text(t, NORM_IDX, c2i, MAX_LEN) for t in texts],
        dtype=torch.long
    )
    return greedy_decode(model, src, MAX_LEN, DEVICE, i2c)

predict_norm("TODO")
# → TODO
```
 
### Full pipeline: facs → dipl → norm
 
```python
def predict_pipeline(texts):
    if isinstance(texts, str):
        texts = [texts]
    dipl = predict_dipl(texts)
    norm = predict_norm(dipl)
    return list(zip(dipl, norm))

predict_pipeline("TODO")
# TODO
```
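
The maximum sequence length is 128 characters, and how `encode_text` treats longer input is not documented here. If your lines may exceed that limit, one option is to split them at whitespace before prediction. The helper below is a hypothetical sketch, not part of `model_def.py`; the default of 120 characters is an assumed safety margin for the task token and any end-of-sequence marker.

```python
def chunk_line(text, max_chars=120):
    """Split text into whitespace-delimited chunks of at most max_chars."""
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single word that is longer than the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_line("a bb ccc dddd", max_chars=5)
print(chunks)  # ['a bb', 'ccc', 'dddd']
```

Each chunk can then be passed to `predict_pipeline` as a list. Note that `str.split()` collapses runs of whitespace, and that chunk boundaries remove context the model might otherwise use, so splitting at existing manuscript line breaks is preferable where they are available.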