Commit 9598146 · 1 parent: 8930941
Committed by bezzam (HF Staff)

initial commit

Dockerfile ADDED
@@ -0,0 +1,57 @@
+ FROM nvidia/cuda:12.9.0-runtime-ubuntu24.04
+
+ # Avoid interactive prompts during package installation
+ ENV DEBIAN_FRONTEND=noninteractive
+
+ # Install Python and system dependencies
+ # Ubuntu 24.04 ships FFmpeg 6.1 (torchcodec requires FFmpeg 5+, <8)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     python3 \
+     python3-pip \
+     python3-dev \
+     git \
+     libsndfile1 \
+     ffmpeg \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set Python alias (Ubuntu 24.04 ships Python 3.12)
+ RUN ln -sf /usr/bin/python3 /usr/bin/python
+
+ # Allow pip to install packages system-wide in the container (PEP 668)
+ ENV PIP_BREAK_SYSTEM_PACKAGES=1
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install PyTorch ecosystem (cu128 wheels for CUDA 12.8+/12.9 compat)
+ RUN pip install --no-cache-dir \
+     torch==2.8.0 \
+     torchaudio==2.8.0 \
+     torchcodec==0.6.0 \
+     --index-url https://download.pytorch.org/whl/cu128
+
+ # Copy and install transformers-specific requirements (torch already installed above, pip will skip it)
+ COPY transformers/requirements.txt /app/transformers/requirements.txt
+ RUN pip install --no-cache-dir -r transformers/requirements.txt
+
+ # Install additional dependencies for specific model families
+ # - Voxtral requires mistral-common[audio]
+ # - soundfile for audio I/O
+ RUN pip install --no-cache-dir \
+     "mistral-common[audio]>=1.9.0" \
+     soundfile
+
+ # Copy only the required files
+ COPY transformers/ /app/transformers/
+ COPY normalizer/ /app/normalizer/
+
+ # Default working directory for running transformers scripts
+ WORKDIR /app/transformers
+
+ # Default entrypoint
+ ENTRYPOINT ["bash"]
+
+ # Keep-alive CMD so the Space runtime stays healthy. HF Jobs and `docker run`
+ # override this with their own command (e.g. `run_cohere.sh`).
+ EXPOSE 7860
+ CMD ["-c", "python3 -m http.server 7860"]
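Reviewer note: the Dockerfile ends with an exec-form `ENTRYPOINT ["bash"]` plus a keep-alive `CMD`. The sketch below models how Docker combines the two, which is why `docker run <image> run_whisper.sh` ends up running `bash run_whisper.sh`. The helper `effective_argv` is a hypothetical name used only for illustration; it is not part of Docker or of this repository.

```python
def effective_argv(entrypoint, cmd, run_args=None):
    """Return the argv a container starts with (exec-form ENTRYPOINT and CMD).

    Arguments given after the image name in `docker run` replace CMD.
    ENTRYPOINT stays in place unless `--entrypoint` is used (not modeled here).
    """
    return list(entrypoint) + list(run_args if run_args else cmd)


# Values copied from the Dockerfile above.
ENTRYPOINT = ["bash"]
CMD = ["-c", "python3 -m http.server 7860"]

# No extra arguments: the keep-alive HTTP server is started.
print(effective_argv(ENTRYPOINT, CMD))
# A script name replaces CMD and is run by bash from the WORKDIR.
print(effective_argv(ENTRYPOINT, CMD, ["run_whisper.sh"]))
# `-i` gives an interactive shell, as used in the README.
print(effective_argv(ENTRYPOINT, CMD, ["-i"]))
```

Because the final `WORKDIR` is `/app/transformers`, a bare script name such as `run_whisper.sh` resolves relative to that directory.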
README.md CHANGED
@@ -1,11 +1,121 @@
  ---
- title: Open Asr Transformers
- emoji: 😻
+ title: Open ASR Leaderboard Transformers
+ emoji: 🎙️
  colorFrom: blue
- colorTo: purple
+ colorTo: green
  sdk: docker
+ hardware: a100-large
  pinned: false
- license: mit
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Transformers-library ASR Evaluation
+
+ This folder contains evaluation scripts for ASR models supported by the 🤗 Transformers library.
+
+ ## Supported Models
+
+ | Script | Models |
+ |--------|--------|
+ | `run_whisper.sh` | OpenAI Whisper, Distil-Whisper, CrisperWhisper |
+ | `run_wav2vec2.sh` | Wav2Vec2 |
+ | `run_wav2vec2_conformer.sh` | Wav2Vec2 Conformer |
+ | `run_hubert.sh` | HuBERT |
+ | `run_data2vec.sh` | Data2Vec |
+ | `run_mms.sh` | MMS |
+ | `run_moonshine.sh` | Moonshine, Moonshine Streaming |
+ | `run_voxtral.sh` | Voxtral Mini, Voxtral Small |
+ | `run_voxtral_realtime.sh` | Voxtral Realtime |
+ | `run_vibevoice.sh` | VibeVoice |
+ | `run_glm_asr.sh` | GLM-ASR |
+ | `run_granite.sh` | Granite Speech |
+
+ ### Multilingual
+
+ | Script | Models |
+ |--------|--------|
+ | `run_whisper_ml.sh` | OpenAI Whisper (multilingual) |
+ | `run_voxtral_ml.sh` | Voxtral Mini, Voxtral Small |
+ | `run_voxtral_realtime_ml.sh` | Voxtral Realtime |
+
+ Multilingual scripts evaluate on FLEURS, MCV (Mozilla Common Voice), and MLS (Multilingual LibriSpeech) for German, French, Italian, Spanish, and Portuguese. They use `run_eval_ml.py`, which applies language-specific normalization. By default, models auto-detect the language during inference, as per the leaderboard convention. Use the `--language` argument to force a specific language.
+
+ ## Docker usage (recommended)
+
+ From the **repository root**, build the Docker image:
+
+ ```bash
+ docker build -t open-asr-transformers -f transformers/Dockerfile .
+ ```
+
+ ### Run a specific script directly
+
+ From the **repository root**, you can run a script without entering the container. The command below uses `--gpus all` to expose all GPUs, mounts the local repo so that scripts reflect the latest changes, and mounts the Hugging Face cache for model downloads (this assumes `HF_HOME` is set on the host):
+
+ ```bash
+ docker run --gpus all \
+     -v $(pwd):/app \
+     -v $HF_HOME:/root/.cache/huggingface \
+     open-asr-transformers run_whisper.sh
+ ```
+
+ Results are written to `transformers/results/` and are automatically persisted on the host since the repo is mounted.
+
+ To select a specific GPU (e.g. GPU 1):
+
+ ```bash
+ docker run --gpus '"device=1"' \
+     -v $(pwd):/app \
+     -v $HF_HOME:/root/.cache/huggingface \
+     open-asr-transformers run_whisper.sh
+ ```
+
+ ### Run interactively
+
+ From the **repository root**, you can also enter the container to run interactively:
+
+ ```bash
+ docker run --gpus all -it \
+     -v $(pwd):/app \
+     -v $HF_HOME:/root/.cache/huggingface \
+     open-asr-transformers -i
+ ```
+
+ This drops you into a bash shell inside `/app/transformers`. From there, run any evaluation script:
+
+ ```bash
+ # Evaluate all Whisper models
+ bash run_whisper.sh
+
+ # Evaluate Granite models
+ bash run_granite.sh
+
+ # Evaluate a single model/dataset manually
+ python run_eval.py \
+     --model_id=openai/whisper-large-v3-turbo \
+     --dataset_path="hf-audio/open-asr-leaderboard" \
+     --dataset="librispeech" \
+     --split="test.clean" \
+     --device=0 \
+     --batch_size=64 \
+     --max_eval_samples=-1
+ ```
+
+ ### Docker cheat sheet
+
+ - Exit and stop a container: type `exit` or press `Ctrl+D`.
+ - Detach from a container (without stopping): `Ctrl+P` then `Ctrl+Q`.
+ - List all containers, including stopped ones: `docker ps -a`.
+ - Attach to a container: `docker attach <container_id>`.
+ - Delete a container: `docker rm <container_id>`.
+
+ ## Local Setup (without Docker)
+
+ From the repository root:
+
+ ```bash
+ pip install -r requirements/requirements.txt
+ pip install "mistral-common[audio]>=1.9.0" # only needed for Voxtral
+ pip install peft # for Granite
+ cd transformers
+ bash run_whisper.sh
+ ```
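Reviewer note: the `docker run` commands in the README mount `$HF_HOME`, which expands to an empty string when the variable is unset and leaves an invalid `-v` argument. The sketch below assembles the same command from Python and falls back to `~/.cache/huggingface` (the default Hugging Face cache location) in that case. The function name `docker_run_command` and its parameters are hypothetical and not part of the repository.

```python
import os
import shlex


def docker_run_command(script, image="open-asr-transformers", gpus="all",
                       repo_root=None, env=None):
    """Build the `docker run` argv shown in the README for one evaluation script."""
    env = os.environ if env is None else env
    repo_root = os.getcwd() if repo_root is None else repo_root
    # Fall back to the default cache directory when HF_HOME is unset or empty.
    hf_home = env.get("HF_HOME") or os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run", "--gpus", gpus,
        "-v", f"{repo_root}:/app",                    # local repo, so edits are picked up
        "-v", f"{hf_home}:/root/.cache/huggingface",  # shared model cache
        image, script,
    ]


cmd = docker_run_command("run_whisper.sh")
print(shlex.join(cmd))  # copy-pasteable shell form of the command
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the container; that call is left out here so the sketch runs without Docker installed.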
normalizer/__init__.py ADDED
@@ -0,0 +1 @@
+ from .normalizer import EnglishTextNormalizer, BasicMultilingualTextNormalizer
normalizer/data_utils.py ADDED
@@ -0,0 +1,142 @@
+ import re
+ import os
+
+ import num2words
+ from datasets import load_dataset, Audio, IterableDataset
+ from normalizer import EnglishTextNormalizer, BasicMultilingualTextNormalizer
+
+ from .eval_utils import read_manifest, write_manifest, normalize_compound_pairs
+
+
+ def is_target_text_in_range(ref):
+     if ref.strip() == "ignore time segment in scoring":
+         return False
+     else:
+         return ref.strip() != ""
+
+
+ class MultilingualNormalizer(BasicMultilingualTextNormalizer):
+     """BasicMultilingualTextNormalizer with optional number normalization.
+
+     Call with just text for standard normalization (backward-compatible).
+     Pass lang= to also convert digits to words via num2words.
+     """
+
+     def _normalize_numbers(self, text, lang):
+         # Join space-separated thousand groups (e.g. "10 000" -> "10000")
+         text = re.sub(r"(\d)\s+(\d{3})\b", r"\1\2", text)
+         # Convert remaining digit sequences to words
+         def _replace(m):
+             try:
+                 return num2words.num2words(int(m.group()), lang=lang)
+             except Exception:
+                 return m.group()
+         return re.sub(r"\d+", _replace, text)
+
+     def __call__(self, s, lang=None):
+         s = super().__call__(s)
+         if lang is not None:
+             s = self._normalize_numbers(s, lang)
+         return s
+
+
+ def get_text(sample):
+     if "text" in sample:
+         return sample["text"]
+     elif "sentence" in sample:
+         return sample["sentence"]
+     elif "normalized_text" in sample:
+         return sample["normalized_text"]
+     elif "transcript" in sample:
+         return sample["transcript"]
+     elif "transcription" in sample:
+         return sample["transcription"]
+     else:
+         raise ValueError(
+             f"Expected transcript column of either 'text', 'sentence', 'normalized_text', 'transcript' or 'transcription'. Got sample with columns: "
+             f"{', '.join(sample.keys())}. Ensure a text column name is present in the dataset."
+         )
+
+ normalizer = EnglishTextNormalizer()
+
+ ml_normalizer = MultilingualNormalizer(remove_diacritics=False)
+
+
+ def normalize(batch):
+     batch["original_text"] = get_text(batch)
+     batch["norm_text"] = normalizer(batch["original_text"])
+     return batch
+
+
+ def load_data(args):
+     dataset = load_dataset(
+         args.dataset_path,
+         args.dataset,
+         split=args.split,
+         streaming=args.streaming,
+         token=True,
+     )
+
+     return dataset
+
+ def prepare_data(dataset, sampling_rate=16000):
+     # Re-sample and normalize transcriptions
+     dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+     # NOTE (ebezzam) don't load from cache to account for potential changes in normalization logic
+     # IterableDataset (streaming) has no cache, so the kwarg is only needed for Dataset
+     map_kwargs = {} if isinstance(dataset, IterableDataset) else {"load_from_cache_file": False}
+     dataset = dataset.map(normalize, **map_kwargs)
+     dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])
+
+     return dataset
+
+
+ AUDIO_FILEPATH_METADATA_KEYS = [
+     "id",  # Main: https://huggingface.co/datasets/hf-audio/open-asr-leaderboard
+     "file_name",  # Multilingual: https://huggingface.co/datasets/nithinraok/asr-leaderboard-datasets
+     "file_name",  # Private
+ ]
+
+
+ def _basename_or_none(value):
+     if value is None:
+         return None
+     value = str(value).strip()
+     if value == "":
+         return None
+     return os.path.basename(value)
+
+
+ def extract_audio_filepath_from_sample(sample):
+     if sample is None:
+         return None
+
+     for key in AUDIO_FILEPATH_METADATA_KEYS:
+         try:
+             if key in sample:
+                 basename = _basename_or_none(sample[key])
+                 if basename is not None:
+                     return basename
+         except TypeError:
+             # AudioDecoder / other non-mapping sample types are not subscriptable.
+             return None
+     return None
+
+
+ def extract_audio_filepaths_from_batch(batch, batch_size=None):
+     if batch_size is None:
+         if "audio" in batch:
+             batch_size = len(batch["audio"])
+         elif len(batch) > 0:
+             first_value = next(iter(batch.values()))
+             if isinstance(first_value, list):
+                 batch_size = len(first_value)
+
+     if batch_size is None:
+         return []
+
+     for key in AUDIO_FILEPATH_METADATA_KEYS:
+         values = batch.get(key)
+         if isinstance(values, list) and len(values) == batch_size:
+             return [_basename_or_none(v) for v in values]
+     return [None] * batch_size
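Reviewer note: the two regex passes in `MultilingualNormalizer._normalize_numbers` can be tried in isolation. The sketch below copies both patterns from the diff but swaps `num2words.num2words` for a tiny lookup table (`toy_words`, a hypothetical stand-in) so that it runs without third-party packages. The function names `join_thousand_groups` and `digits_to_words` are illustrative only.

```python
import re


def join_thousand_groups(text):
    # Same pattern as in the diff: a digit, whitespace, then exactly three digits.
    return re.sub(r"(\d)\s+(\d{3})\b", r"\1\2", text)


def digits_to_words(text, to_words):
    # Replace each digit run with words; keep the digits if conversion fails.
    def _replace(match):
        try:
            return to_words(int(match.group()))
        except Exception:
            return match.group()
    return re.sub(r"\d+", _replace, text)


def toy_words(number):
    # Hypothetical stand-in for num2words.num2words(number, lang=...).
    return {7: "seven", 10000: "ten thousand"}[number]  # KeyError for other values


print(digits_to_words(join_thousand_groups("10 000 fans, 7 goals"), toy_words))
print(join_thousand_groups("1 000 000"))
```

One behaviour worth knowing when reading scores: because `re.sub` replaces non-overlapping matches in a single pass, an input with three groups such as `1 000 000` has only its first pair joined, and an unrelated number followed by a three-digit number (e.g. `2020 300`) is merged as well.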
normalizer/english_abbreviations.py ADDED
@@ -0,0 +1,1907 @@
+ english_spelling_normalizer = {
+     "ok": "okay",
+     "accessorise": "accessorize",
+     "accessorised": "accessorized",
+     "accessorises": "accessorizes",
+     "accessorising": "accessorizing",
+     "acclimatisation": "acclimatization",
+     "acclimatise": "acclimatize",
+     "acclimatised": "acclimatized",
+     "acclimatises": "acclimatizes",
+     "acclimatising": "acclimatizing",
+     "accoutrements": "accouterments",
+     "aeon": "eon",
+     "aeons": "eons",
+     "aerogramme": "aerogram",
+     "aerogrammes": "aerograms",
+     "aeroplane": "airplane",
+     "aeroplanes": "airplanes",
+     "aesthete": "esthete",
+     "aesthetes": "esthetes",
+     "aesthetic": "esthetic",
+     "aesthetically": "esthetically",
+     "aesthetics": "esthetics",
+     "aetiology": "etiology",
+     "ageing": "aging",
+     "aggrandisement": "aggrandizement",
+     "agonise": "agonize",
+     "agonised": "agonized",
+     "agonises": "agonizes",
+     "agonising": "agonizing",
+     "agonisingly": "agonizingly",
+     "almanack": "almanac",
+     "almanacks": "almanacs",
+     "aluminium": "aluminum",
+     "amortisable": "amortizable",
+     "amortisation": "amortization",
+     "amortisations": "amortizations",
+     "amortise": "amortize",
+     "amortised": "amortized",
+     "amortises": "amortizes",
+     "amortising": "amortizing",
+     "amphitheatre": "amphitheater",
+     "amphitheatres": "amphitheaters",
+     "anaemia": "anemia",
+     "anaemic": "anemic",
+     "anaesthesia": "anesthesia",
+     "anaesthetic": "anesthetic",
+     "anaesthetics": "anesthetics",
+     "anaesthetise": "anesthetize",
+     "anaesthetised": "anesthetized",
+     "anaesthetises": "anesthetizes",
+     "anaesthetising": "anesthetizing",
+     "anaesthetist": "anesthetist",
+     "anaesthetists": "anesthetists",
+     "anaesthetize": "anesthetize",
+     "anaesthetized": "anesthetized",
+     "anaesthetizes": "anesthetizes",
+     "anaesthetizing": "anesthetizing",
+     "analogue": "analog",
+     "analogues": "analogs",
+     "analyse": "analyze",
+     "analysed": "analyzed",
+     "analyses": "analyzes",
+     "analysing": "analyzing",
+     "anglicise": "anglicize",
+     "anglicised": "anglicized",
+     "anglicises": "anglicizes",
+     "anglicising": "anglicizing",
+     "annualised": "annualized",
+     "antagonise": "antagonize",
+     "antagonised": "antagonized",
+     "antagonises": "antagonizes",
+     "antagonising": "antagonizing",
+     "apologise": "apologize",
+     "apologised": "apologized",
+     "apologises": "apologizes",
+     "apologising": "apologizing",
+     "appal": "appall",
+     "appals": "appalls",
+     "appetiser": "appetizer",
+     "appetisers": "appetizers",
+     "appetising": "appetizing",
+     "appetisingly": "appetizingly",
+     "arbour": "arbor",
+     "arbours": "arbors",
+     "archaeologically": "archeologically",
+     "archaeologist": "archeologist",
+     "archaeologists": "archeologists",
+     "archaeology": "archeology",
+     "archaeological": "archeological",
+     "ardour": "ardor",
+     "armour": "armor",
+     "armoured": "armored",
+     "armourer": "armorer",
+     "armourers": "armorers",
+     "armouries": "armories",
+     "armoury": "armory",
+     "artefact": "artifact",
+     "artefacts": "artifacts",
+     "authorise": "authorize",
+     "authorised": "authorized",
+     "authorises": "authorizes",
+     "authorising": "authorizing",
+     "axe": "ax",
+     "backpedalled": "backpedaled",
+     "backpedalling": "backpedaling",
+     "bannister": "banister",
+     "bannisters": "banisters",
+     "baptise": "baptize",
+     "baptised": "baptized",
+     "baptises": "baptizes",
+     "baptising": "baptizing",
+     "bastardise": "bastardize",
+     "bastardised": "bastardized",
+     "bastardises": "bastardizes",
+     "bastardising": "bastardizing",
+     "battleax": "battleaxe",
+     "baulk": "balk",
+     "baulked": "balked",
+     "baulking": "balking",
+     "baulks": "balks",
+     "bedevilled": "bedeviled",
+     "bedevilling": "bedeviling",
+     "behaviour": "behavior",
+     "behavioural": "behavioral",
+     "behaviourism": "behaviorism",
+     "behaviourist": "behaviorist",
+     "behaviourists": "behaviorists",
+     "behaviours": "behaviors",
+     "behove": "behoove",
+     "behoved": "behooved",
+     "behoves": "behooves",
+     "bejewelled": "bejeweled",
+     "belabour": "belabor",
+     "belaboured": "belabored",
+     "belabouring": "belaboring",
+     "belabours": "belabors",
+     "bevelled": "beveled",
+     "bevvies": "bevies",
+     "bevvy": "bevy",
+     "biassed": "biased",
+     "biassing": "biasing",
+     "bingeing": "binging",
+     "bougainvillaea": "bougainvillea",
+     "bougainvillaeas": "bougainvilleas",
+     "bowdlerise": "bowdlerize",
+     "bowdlerised": "bowdlerized",
+     "bowdlerises": "bowdlerizes",
+     "bowdlerising": "bowdlerizing",
+     "breathalyse": "breathalyze",
+     "breathalysed": "breathalyzed",
+     "breathalyser": "breathalyzer",
+     "breathalysers": "breathalyzers",
+     "breathalyses": "breathalyzes",
+     "breathalysing": "breathalyzing",
+     "brutalise": "brutalize",
+     "brutalised": "brutalized",
+     "brutalises": "brutalizes",
+     "brutalising": "brutalizing",
+     "busses": "buses",
+     "bussing": "busing",
+     "caesarean": "cesarean",
+     "caesareans": "cesareans",
+     "calibre": "caliber",
+     "calibres": "calibers",
+     "calliper": "caliper",
+     "callipers": "calipers",
+     "callisthenics": "calisthenics",
+     "canalise": "canalize",
+     "canalised": "canalized",
+     "canalises": "canalizes",
+     "canalising": "canalizing",
+     "cancellation": "cancelation",
+     "cancellations": "cancelations",
+     "cancelled": "canceled",
+     "cancelling": "canceling",
+     "candour": "candor",
+     "cannibalise": "cannibalize",
+     "cannibalised": "cannibalized",
+     "cannibalises": "cannibalizes",
+     "cannibalising": "cannibalizing",
+     "canonise": "canonize",
+     "canonised": "canonized",
+     "canonises": "canonizes",
+     "canonising": "canonizing",
+     "capitalise": "capitalize",
+     "capitalised": "capitalized",
+     "capitalises": "capitalizes",
+     "capitalising": "capitalizing",
+     "caramelise": "caramelize",
+     "caramelised": "caramelized",
+     "caramelises": "caramelizes",
+     "caramelising": "caramelizing",
+     "carbonise": "carbonize",
+     "carbonised": "carbonized",
+     "carbonises": "carbonizes",
+     "carbonising": "carbonizing",
+     "carolled": "caroled",
+     "carolling": "caroling",
+     "catalogue": "catalog",
+     "catalogued": "cataloged",
+     "catalogues": "catalogs",
+     "cataloguing": "cataloging",
+     "catalyse": "catalyze",
+     "catalysed": "catalyzed",
+     "catalyses": "catalyzes",
+     "catalysing": "catalyzing",
+     "categorise": "categorize",
+     "categorised": "categorized",
+     "categorises": "categorizes",
+     "categorising": "categorizing",
+     "cauterise": "cauterize",
+     "cauterised": "cauterized",
+     "cauterises": "cauterizes",
+     "cauterising": "cauterizing",
+     "cavilled": "caviled",
+     "cavilling": "caviling",
+     "centigramme": "centigram",
+     "centigrammes": "centigrams",
+     "centilitre": "centiliter",
+     "centilitres": "centiliters",
+     "centimetre": "centimeter",
+     "centimetres": "centimeters",
+     "centralise": "centralize",
+     "centralised": "centralized",
+     "centralises": "centralizes",
+     "centralising": "centralizing",
+     "centre": "center",
+     "centred": "centered",
+     "centrefold": "centerfold",
+     "centrefolds": "centerfolds",
+     "centrepiece": "centerpiece",
+     "centrepieces": "centerpieces",
+     "centres": "centers",
+     "channelled": "channeled",
+     "channelling": "channeling",
+     "characterise": "characterize",
+     "characterised": "characterized",
+     "characterises": "characterizes",
+     "characterising": "characterizing",
+     "cheque": "check",
+     "chequebook": "checkbook",
+     "chequebooks": "checkbooks",
+     "chequered": "checkered",
+     "cheques": "checks",
+     "chilli": "chili",
+     "chimaera": "chimera",
+     "chimaeras": "chimeras",
+     "chiselled": "chiseled",
+     "chiselling": "chiseling",
+     "circularise": "circularize",
+     "circularised": "circularized",
+     "circularises": "circularizes",
+     "circularising": "circularizing",
+     "civilise": "civilize",
+     "civilised": "civilized",
+     "civilises": "civilizes",
+     "civilising": "civilizing",
+     "clamour": "clamor",
+     "clamoured": "clamored",
+     "clamouring": "clamoring",
+     "clamours": "clamors",
+     "clangour": "clangor",
+     "clarinettist": "clarinetist",
+     "clarinettists": "clarinetists",
+     "collectivise": "collectivize",
+     "collectivised": "collectivized",
+     "collectivises": "collectivizes",
+     "collectivising": "collectivizing",
+     "colonisation": "colonization",
+     "colonise": "colonize",
+     "colonised": "colonized",
+     "coloniser": "colonizer",
+     "colonisers": "colonizers",
+     "colonises": "colonizes",
+     "colonising": "colonizing",
+     "colour": "color",
+     "colourant": "colorant",
+     "colourants": "colorants",
+     "coloured": "colored",
+     "coloureds": "coloreds",
+     "colourful": "colorful",
+     "colourfully": "colorfully",
+     "colouring": "coloring",
+     "colourize": "colorize",
+     "colourized": "colorized",
+     "colourizes": "colorizes",
+     "colourizing": "colorizing",
+     "colourless": "colorless",
+     "colours": "colors",
+     "commercialise": "commercialize",
+     "commercialised": "commercialized",
+     "commercialises": "commercializes",
+     "commercialising": "commercializing",
+     "compartmentalise": "compartmentalize",
+     "compartmentalised": "compartmentalized",
+     "compartmentalises": "compartmentalizes",
+     "compartmentalising": "compartmentalizing",
+     "computerise": "computerize",
+     "computerised": "computerized",
+     "computerises": "computerizes",
+     "computerising": "computerizing",
+     "conceptualise": "conceptualize",
+     "conceptualised": "conceptualized",
+     "conceptualises": "conceptualizes",
+     "conceptualising": "conceptualizing",
+     "connexion": "connection",
+     "connexions": "connections",
+     "contextualise": "contextualize",
+     "contextualised": "contextualized",
+     "contextualises": "contextualizes",
+     "contextualising": "contextualizing",
+     "cosier": "cozier",
+     "cosies": "cozies",
+     "cosiest": "coziest",
+     "cosily": "cozily",
+     "cosiness": "coziness",
+     "cosy": "cozy",
+     "councillor": "councilor",
+     "councillors": "councilors",
+     "counselled": "counseled",
+     "counselling": "counseling",
+     "counsellor": "counselor",
+     "counsellors": "counselors",
+     "crenelated": "crenellated",
+     "criminalise": "criminalize",
+     "criminalised": "criminalized",
+     "criminalises": "criminalizes",
+     "criminalising": "criminalizing",
+     "criticise": "criticize",
+     "criticised": "criticized",
+     "criticises": "criticizes",
+     "criticising": "criticizing",
+     "crueller": "crueler",
+     "cruellest": "cruelest",
+     "crystallisation": "crystallization",
+     "crystallise": "crystallize",
+     "crystallised": "crystallized",
+     "crystallises": "crystallizes",
+     "crystallising": "crystallizing",
+     "cudgelled": "cudgeled",
+     "cudgelling": "cudgeling",
+     "customise": "customize",
+     "customised": "customized",
+     "customises": "customizes",
+     "customising": "customizing",
+     "cypher": "cipher",
+     "cyphers": "ciphers",
+     "decentralisation": "decentralization",
+     "decentralise": "decentralize",
+     "decentralised": "decentralized",
+     "decentralises": "decentralizes",
+     "decentralising": "decentralizing",
+     "decriminalisation": "decriminalization",
+     "decriminalise": "decriminalize",
+     "decriminalised": "decriminalized",
+     "decriminalises": "decriminalizes",
+     "decriminalising": "decriminalizing",
+     "defence": "defense",
+     "defenceless": "defenseless",
+     "defences": "defenses",
+     "dehumanisation": "dehumanization",
+     "dehumanise": "dehumanize",
+     "dehumanised": "dehumanized",
+     "dehumanises": "dehumanizes",
+     "dehumanising": "dehumanizing",
+     "demeanour": "demeanor",
+     "demilitarisation": "demilitarization",
+     "demilitarise": "demilitarize",
+     "demilitarised": "demilitarized",
+     "demilitarises": "demilitarizes",
+     "demilitarising": "demilitarizing",
+     "demobilisation": "demobilization",
+     "demobilise": "demobilize",
+     "demobilised": "demobilized",
+     "demobilises": "demobilizes",
+     "demobilising": "demobilizing",
+     "democratisation": "democratization",
+     "democratise": "democratize",
+     "democratised": "democratized",
+     "democratises": "democratizes",
+     "democratising": "democratizing",
+     "demonise": "demonize",
+     "demonised": "demonized",
+     "demonises": "demonizes",
+     "demonising": "demonizing",
+     "demoralisation": "demoralization",
+     "demoralise": "demoralize",
+     "demoralised": "demoralized",
+     "demoralises": "demoralizes",
+     "demoralising": "demoralizing",
+     "denationalisation": "denationalization",
+     "denationalise": "denationalize",
+     "denationalised": "denationalized",
+     "denationalises": "denationalizes",
+     "denationalising": "denationalizing",
+     "deodorise": "deodorize",
+     "deodorised": "deodorized",
+     "deodorises": "deodorizes",
+     "deodorising": "deodorizing",
+     "depersonalise": "depersonalize",
+     "depersonalised": "depersonalized",
+     "depersonalises": "depersonalizes",
+     "depersonalising": "depersonalizing",
+     "deputise": "deputize",
+     "deputised": "deputized",
+     "deputises": "deputizes",
+     "deputising": "deputizing",
+     "desensitisation": "desensitization",
+     "desensitise": "desensitize",
+     "desensitised": "desensitized",
+     "desensitises": "desensitizes",
+     "desensitising": "desensitizing",
+     "destabilisation": "destabilization",
+     "destabilise": "destabilize",
+     "destabilised": "destabilized",
+     "destabilises": "destabilizes",
+     "destabilising": "destabilizing",
+     "dialled": "dialed",
+     "dialling": "dialing",
+     "dialogue": "dialog",
+     "dialogues": "dialogs",
+     "diarrhoea": "diarrhea",
+     "digitise": "digitize",
+     "digitised": "digitized",
+     "digitises": "digitizes",
+     "digitising": "digitizing",
+     "disc": "disk",
+     "discolour": "discolor",
+     "discoloured": "discolored",
+     "discolouring": "discoloring",
+     "discolours": "discolors",
+     "discs": "disks",
+     "disembowelled": "disemboweled",
+     "disembowelling": "disemboweling",
+     "disfavour": "disfavor",
+     "dishevelled": "disheveled",
+     "dishonour": "dishonor",
+     "dishonourable": "dishonorable",
+     "dishonourably": "dishonorably",
+     "dishonoured": "dishonored",
+     "dishonouring": "dishonoring",
+     "dishonours": "dishonors",
+     "disorganisation": "disorganization",
+     "disorganised": "disorganized",
+     "distil": "distill",
+     "distils": "distills",
+     "dramatisation": "dramatization",
+     "dramatisations": "dramatizations",
+     "dramatise": "dramatize",
+     "dramatised": "dramatized",
+     "dramatises": "dramatizes",
+     "dramatising": "dramatizing",
+     "draught": "draft",
+     "draughtboard": "draftboard",
+     "draughtboards": "draftboards",
+     "draughtier": "draftier",
+     "draughtiest": "draftiest",
+     "draughts": "drafts",
+     "draughtsman": "draftsman",
+     "draughtsmanship": "draftsmanship",
+     "draughtsmen": "draftsmen",
+     "draughtswoman": "draftswoman",
+     "draughtswomen": "draftswomen",
+     "draughty": "drafty",
+     "drivelled": "driveled",
+     "drivelling": "driveling",
+     "duelled": "dueled",
+     "duelling": "dueling",
+     "economise": "economize",
+     "economised": "economized",
+     "economises": "economizes",
+     "economising": "economizing",
+     "editorialise": "editorialize",
+     "editorialised": "editorialized",
+     "editorialises": "editorializes",
+     "editorialising": "editorializing",
+     "edoema": "edema",
+     "empathise": "empathize",
+     "empathised": "empathized",
+     "empathises": "empathizes",
+     "empathising": "empathizing",
+     "emphasise": "emphasize",
+     "emphasised": "emphasized",
+     "emphasises": "emphasizes",
+     "emphasising": "emphasizing",
+     "enamelled": "enameled",
+     "enamelling": "enameling",
+     "enamoured": "enamored",
+     "encyclopaedia": "encyclopedia",
+     "encyclopaedias": "encyclopedias",
+     "encyclopaedic": "encyclopedic",
+     "endeavour": "endeavor",
+     "endeavoured": "endeavored",
+     "endeavouring": "endeavoring",
+     "endeavours": "endeavors",
+     "energise": "energize",
+     "energised": "energized",
+     "energises": "energizes",
+     "energising": "energizing",
+     "enrol": "enroll",
+     "enrols": "enrolls",
+     "enthral": "enthrall",
+     "enthrals": "enthralls",
+     "epaulette": "epaulet",
+     "epaulettes": "epaulets",
+     "epicentre": "epicenter",
+     "epicentres": "epicenters",
+     "epilogue": "epilog",
+     "epilogues": "epilogs",
+     "epitomise": "epitomize",
+     "epitomised": "epitomized",
+     "epitomises": "epitomizes",
+     "epitomising": "epitomizing",
+     "equalisation": "equalization",
+     "equalise": "equalize",
+     "equalised": "equalized",
+     "equaliser": "equalizer",
+     "equalisers": "equalizers",
+     "equalises": "equalizes",
+     "equalising": "equalizing",
+     "eulogise": "eulogize",
+     "eulogised": "eulogized",
+     "eulogises": "eulogizes",
+     "eulogising": "eulogizing",
+     "evangelise": "evangelize",
+     "evangelised": "evangelized",
+     "evangelises": "evangelizes",
+     "evangelising": "evangelizing",
+     "exorcise": "exorcize",
+     "exorcised": "exorcized",
+     "exorcises": "exorcizes",
533
+ "exorcising": "exorcizing",
534
+ "extemporisation": "extemporization",
535
+ "extemporise": "extemporize",
536
+ "extemporised": "extemporized",
537
+ "extemporises": "extemporizes",
538
+ "extemporising": "extemporizing",
539
+ "externalisation": "externalization",
540
+ "externalisations": "externalizations",
541
+ "externalise": "externalize",
542
+ "externalised": "externalized",
543
+ "externalises": "externalizes",
544
+ "externalising": "externalizing",
545
+ "factorise": "factorize",
546
+ "factorised": "factorized",
547
+ "factorises": "factorizes",
548
+ "factorising": "factorizing",
549
+ "faecal": "fecal",
550
+ "faeces": "feces",
551
+ "familiarisation": "familiarization",
552
+ "familiarise": "familiarize",
553
+ "familiarised": "familiarized",
554
+ "familiarises": "familiarizes",
555
+ "familiarising": "familiarizing",
556
+ "fantasise": "fantasize",
557
+ "fantasised": "fantasized",
558
+ "fantasises": "fantasizes",
559
+ "fantasising": "fantasizing",
560
+ "favour": "favor",
561
+ "favourable": "favorable",
562
+ "favourably": "favorably",
563
+ "favoured": "favored",
564
+ "favouring": "favoring",
565
+ "favourite": "favorite",
566
+ "favourites": "favorites",
567
+ "favouritism": "favoritism",
568
+ "favours": "favors",
569
+ "feminise": "feminize",
570
+ "feminised": "feminized",
571
+ "feminises": "feminizes",
572
+ "feminising": "feminizing",
573
+ "fertilisation": "fertilization",
574
+ "fertilise": "fertilize",
575
+ "fertilised": "fertilized",
576
+ "fertiliser": "fertilizer",
577
+ "fertilisers": "fertilizers",
578
+ "fertilises": "fertilizes",
579
+ "fertilising": "fertilizing",
580
+ "fervour": "fervor",
581
+ "fibre": "fiber",
582
+ "fibreglass": "fiberglass",
583
+ "fibres": "fibers",
584
+ "fictionalisation": "fictionalization",
585
+ "fictionalisations": "fictionalizations",
586
+ "fictionalise": "fictionalize",
587
+ "fictionalised": "fictionalized",
588
+ "fictionalises": "fictionalizes",
589
+ "fictionalising": "fictionalizing",
590
+ "fillet": "filet",
591
+ "filleted": "fileted",
592
+ "filleting": "fileting",
593
+ "fillets": "filets",
594
+ "finalisation": "finalization",
595
+ "finalise": "finalize",
596
+ "finalised": "finalized",
597
+ "finalises": "finalizes",
598
+ "finalising": "finalizing",
599
+ "flautist": "flutist",
600
+ "flautists": "flutists",
601
+ "flavour": "flavor",
602
+ "flavoured": "flavored",
603
+ "flavouring": "flavoring",
604
+ "flavourings": "flavorings",
605
+ "flavourless": "flavorless",
606
+ "flavours": "flavors",
607
+ "flavoursome": "flavorsome",
608
+ "flyer / flier": "flier / flyer",
609
+ "foetal": "fetal",
610
+ "foetid": "fetid",
611
+ "foetus": "fetus",
612
+ "foetuses": "fetuses",
613
+ "formalisation": "formalization",
614
+ "formalise": "formalize",
615
+ "formalised": "formalized",
616
+ "formalises": "formalizes",
617
+ "formalising": "formalizing",
618
+ "fossilisation": "fossilization",
619
+ "fossilise": "fossilize",
620
+ "fossilised": "fossilized",
621
+ "fossilises": "fossilizes",
622
+ "fossilising": "fossilizing",
623
+ "fraternisation": "fraternization",
624
+ "fraternise": "fraternize",
625
+ "fraternised": "fraternized",
626
+ "fraternises": "fraternizes",
627
+ "fraternising": "fraternizing",
628
+ "fulfil": "fulfill",
629
+ "fulfilment": "fulfillment",
630
+ "fulfils": "fulfills",
631
+ "funnelled": "funneled",
632
+ "funnelling": "funneling",
633
+ "gage": "gauge",
634
+ "gaged": "gauged",
635
+ "gages": "gauges",
636
+ "gaging": "gauging",
637
+ "galvanise": "galvanize",
638
+ "galvanised": "galvanized",
639
+ "galvanises": "galvanizes",
640
+ "galvanising": "galvanizing",
641
+ "gambolled": "gamboled",
642
+ "gambolling": "gamboling",
643
+ "gaol": "jail",
644
+ "gaolbird": "jailbird",
645
+ "gaolbirds": "jailbirds",
646
+ "gaolbreak": "jailbreak",
647
+ "gaolbreaks": "jailbreaks",
648
+ "gaoled": "jailed",
649
+ "gaoler": "jailer",
650
+ "gaolers": "jailers",
651
+ "gaoling": "jailing",
652
+ "gaols": "jails",
653
+ "gasses": "gases",
654
+ "generalisation": "generalization",
655
+ "generalisations": "generalizations",
656
+ "generalise": "generalize",
657
+ "generalised": "generalized",
658
+ "generalises": "generalizes",
659
+ "generalising": "generalizing",
660
+ "ghettoise": "ghettoize",
661
+ "ghettoised": "ghettoized",
662
+ "ghettoises": "ghettoizes",
663
+ "ghettoising": "ghettoizing",
664
+ "gipsies": "gypsies",
665
+ "glamor": "glamour",
666
+ "glamorise": "glamorize",
667
+ "glamorised": "glamorized",
668
+ "glamorises": "glamorizes",
669
+ "glamorising": "glamorizing",
670
+ "globalisation": "globalization",
671
+ "globalise": "globalize",
672
+ "globalised": "globalized",
673
+ "globalises": "globalizes",
674
+ "globalising": "globalizing",
675
+ "glueing": "gluing",
676
+ "goitre": "goiter",
677
+ "goitres": "goiters",
678
+ "gonorrhoea": "gonorrhea",
679
+ "gramme": "gram",
680
+ "grammes": "grams",
681
+ "gravelled": "graveled",
682
+ "grey": "gray",
683
+ "greyed": "grayed",
684
+ "greying": "graying",
685
+ "greyish": "grayish",
686
+ "greyness": "grayness",
687
+ "greys": "grays",
688
+ "grovelled": "groveled",
689
+ "grovelling": "groveling",
690
+ "groyne": "groin",
691
+ "groynes": "groins",
692
+ "gruelling": "grueling",
693
+ "gruellingly": "gruelingly",
694
+ "gryphon": "griffin",
695
+ "gryphons": "griffins",
696
+ "gynaecological": "gynecological",
697
+ "gynaecologist": "gynecologist",
698
+ "gynaecologists": "gynecologists",
699
+ "gynaecology": "gynecology",
700
+ "haematological": "hematological",
701
+ "haematologist": "hematologist",
702
+ "haematologists": "hematologists",
703
+ "haematology": "hematology",
704
+ "haemoglobin": "hemoglobin",
705
+ "haemophilia": "hemophilia",
706
+ "haemophiliac": "hemophiliac",
707
+ "haemophiliacs": "hemophiliacs",
708
+ "haemorrhage": "hemorrhage",
709
+ "haemorrhaged": "hemorrhaged",
710
+ "haemorrhages": "hemorrhages",
711
+ "haemorrhaging": "hemorrhaging",
712
+ "haemorrhoids": "hemorrhoids",
713
+ "harbour": "harbor",
714
+ "harboured": "harbored",
715
+ "harbouring": "harboring",
716
+ "harbours": "harbors",
717
+ "harmonisation": "harmonization",
718
+ "harmonise": "harmonize",
719
+ "harmonised": "harmonized",
720
+ "harmonises": "harmonizes",
721
+ "harmonising": "harmonizing",
722
+ "homoeopath": "homeopath",
723
+ "homoeopathic": "homeopathic",
724
+ "homoeopaths": "homeopaths",
725
+ "homoeopathy": "homeopathy",
726
+ "homogenise": "homogenize",
727
+ "homogenised": "homogenized",
728
+ "homogenises": "homogenizes",
729
+ "homogenising": "homogenizing",
730
+ "honour": "honor",
731
+ "honourable": "honorable",
732
+ "honourably": "honorably",
733
+ "honoured": "honored",
734
+ "honouring": "honoring",
735
+ "honours": "honors",
736
+ "hospitalisation": "hospitalization",
737
+ "hospitalise": "hospitalize",
738
+ "hospitalised": "hospitalized",
739
+ "hospitalises": "hospitalizes",
740
+ "hospitalising": "hospitalizing",
741
+ "humanise": "humanize",
742
+ "humanised": "humanized",
743
+ "humanises": "humanizes",
744
+ "humanising": "humanizing",
745
+ "humour": "humor",
746
+ "humoured": "humored",
747
+ "humouring": "humoring",
748
+ "humourless": "humorless",
749
+ "humours": "humors",
750
+ "hybridise": "hybridize",
751
+ "hybridised": "hybridized",
752
+ "hybridises": "hybridizes",
753
+ "hybridising": "hybridizing",
754
+ "hypnotise": "hypnotize",
755
+ "hypnotised": "hypnotized",
756
+ "hypnotises": "hypnotizes",
757
+ "hypnotising": "hypnotizing",
758
+ "hypothesise": "hypothesize",
759
+ "hypothesised": "hypothesized",
760
+ "hypothesises": "hypothesizes",
761
+ "hypothesising": "hypothesizing",
762
+ "idealisation": "idealization",
763
+ "idealise": "idealize",
764
+ "idealised": "idealized",
765
+ "idealises": "idealizes",
766
+ "idealising": "idealizing",
767
+ "idolise": "idolize",
768
+ "idolised": "idolized",
769
+ "idolises": "idolizes",
770
+ "idolising": "idolizing",
771
+ "immobilisation": "immobilization",
772
+ "immobilise": "immobilize",
773
+ "immobilised": "immobilized",
774
+ "immobiliser": "immobilizer",
775
+ "immobilisers": "immobilizers",
776
+ "immobilises": "immobilizes",
777
+ "immobilising": "immobilizing",
778
+ "immortalise": "immortalize",
779
+ "immortalised": "immortalized",
780
+ "immortalises": "immortalizes",
781
+ "immortalising": "immortalizing",
782
+ "immunisation": "immunization",
783
+ "immunise": "immunize",
784
+ "immunised": "immunized",
785
+ "immunises": "immunizes",
786
+ "immunising": "immunizing",
787
+ "impanelled": "impaneled",
788
+ "impanelling": "impaneling",
789
+ "imperilled": "imperiled",
790
+ "imperilling": "imperiling",
791
+ "individualise": "individualize",
792
+ "individualised": "individualized",
793
+ "individualises": "individualizes",
794
+ "individualising": "individualizing",
795
+ "industrialise": "industrialize",
796
+ "industrialised": "industrialized",
797
+ "industrialises": "industrializes",
798
+ "industrialising": "industrializing",
799
+ "inflexion": "inflection",
800
+ "inflexions": "inflections",
801
+ "initialise": "initialize",
802
+ "initialised": "initialized",
803
+ "initialises": "initializes",
804
+ "initialising": "initializing",
805
+ "initialled": "initialed",
806
+ "initialling": "initialing",
807
+ "instal": "install",
808
+ "instalment": "installment",
809
+ "instalments": "installments",
810
+ "instals": "installs",
811
+ "instil": "instill",
812
+ "instils": "instills",
813
+ "institutionalisation": "institutionalization",
814
+ "institutionalise": "institutionalize",
815
+ "institutionalised": "institutionalized",
816
+ "institutionalises": "institutionalizes",
817
+ "institutionalising": "institutionalizing",
818
+ "intellectualise": "intellectualize",
819
+ "intellectualised": "intellectualized",
820
+ "intellectualises": "intellectualizes",
821
+ "intellectualising": "intellectualizing",
822
+ "internalisation": "internalization",
823
+ "internalise": "internalize",
824
+ "internalised": "internalized",
825
+ "internalises": "internalizes",
826
+ "internalising": "internalizing",
827
+ "internationalisation": "internationalization",
828
+ "internationalise": "internationalize",
829
+ "internationalised": "internationalized",
830
+ "internationalises": "internationalizes",
831
+ "internationalising": "internationalizing",
832
+ "ionisation": "ionization",
833
+ "ionise": "ionize",
834
+ "ionised": "ionized",
835
+ "ioniser": "ionizer",
836
+ "ionisers": "ionizers",
837
+ "ionises": "ionizes",
838
+ "ionising": "ionizing",
839
+ "italicise": "italicize",
840
+ "italicised": "italicized",
841
+ "italicises": "italicizes",
842
+ "italicising": "italicizing",
843
+ "itemise": "itemize",
844
+ "itemised": "itemized",
845
+ "itemises": "itemizes",
846
+ "itemising": "itemizing",
847
+ "jeopardise": "jeopardize",
848
+ "jeopardised": "jeopardized",
849
+ "jeopardises": "jeopardizes",
850
+ "jeopardising": "jeopardizing",
851
+ "jewelled": "jeweled",
852
+ "jeweller": "jeweler",
853
+ "jewellers": "jewelers",
854
+ "jewellery": "jewelry",
855
+ "judgement": "judgment",
856
+ "kilogramme": "kilogram",
857
+ "kilogrammes": "kilograms",
858
+ "kilometre": "kilometer",
859
+ "kilometres": "kilometers",
860
+ "labelled": "labeled",
861
+ "labelling": "labeling",
862
+ "labour": "labor",
863
+ "laboured": "labored",
864
+ "labourer": "laborer",
865
+ "labourers": "laborers",
866
+ "labouring": "laboring",
867
+ "labours": "labors",
868
+ "lacklustre": "lackluster",
869
+ "legalisation": "legalization",
870
+ "legalise": "legalize",
871
+ "legalised": "legalized",
872
+ "legalises": "legalizes",
873
+ "legalising": "legalizing",
874
+ "legitimise": "legitimize",
875
+ "legitimised": "legitimized",
876
+ "legitimises": "legitimizes",
877
+ "legitimising": "legitimizing",
878
+ "leukaemia": "leukemia",
879
+ "levelled": "leveled",
880
+ "leveller": "leveler",
881
+ "levellers": "levelers",
882
+ "levelling": "leveling",
883
+ "libelled": "libeled",
884
+ "libelling": "libeling",
885
+ "libellous": "libelous",
886
+ "liberalisation": "liberalization",
887
+ "liberalise": "liberalize",
888
+ "liberalised": "liberalized",
889
+ "liberalises": "liberalizes",
890
+ "liberalising": "liberalizing",
891
+ "licence": "license",
892
+ "licenced": "licensed",
893
+ "licences": "licenses",
894
+ "licencing": "licensing",
895
+ "likeable": "likable",
896
+ "lionisation": "lionization",
897
+ "lionise": "lionize",
898
+ "lionised": "lionized",
899
+ "lionises": "lionizes",
900
+ "lionising": "lionizing",
901
+ "liquidise": "liquidize",
902
+ "liquidised": "liquidized",
903
+ "liquidiser": "liquidizer",
904
+ "liquidisers": "liquidizers",
905
+ "liquidises": "liquidizes",
906
+ "liquidising": "liquidizing",
907
+ "litre": "liter",
908
+ "litres": "liters",
909
+ "localise": "localize",
910
+ "localised": "localized",
911
+ "localises": "localizes",
912
+ "localising": "localizing",
913
+ "louvre": "louver",
914
+ "louvred": "louvered",
915
+ "louvres": "louvers",
916
+ "lustre": "luster",
917
+ "magnetise": "magnetize",
918
+ "magnetised": "magnetized",
919
+ "magnetises": "magnetizes",
920
+ "magnetising": "magnetizing",
921
+ "manoeuvrability": "maneuverability",
922
+ "manoeuvrable": "maneuverable",
923
+ "manoeuvre": "maneuver",
924
+ "manoeuvred": "maneuvered",
925
+ "manoeuvres": "maneuvers",
926
+ "manoeuvring": "maneuvering",
927
+ "manoeuvrings": "maneuverings",
928
+ "marginalisation": "marginalization",
929
+ "marginalise": "marginalize",
930
+ "marginalised": "marginalized",
931
+ "marginalises": "marginalizes",
932
+ "marginalising": "marginalizing",
933
+ "marshalled": "marshaled",
934
+ "marshalling": "marshaling",
935
+ "marvelled": "marveled",
936
+ "marvelling": "marveling",
937
+ "marvellous": "marvelous",
938
+ "marvellously": "marvelously",
939
+ "materialisation": "materialization",
940
+ "materialise": "materialize",
941
+ "materialised": "materialized",
942
+ "materialises": "materializes",
943
+ "materialising": "materializing",
944
+ "maximisation": "maximization",
945
+ "maximise": "maximize",
946
+ "maximised": "maximized",
947
+ "maximises": "maximizes",
948
+ "maximising": "maximizing",
949
+ "meagre": "meager",
950
+ "mechanisation": "mechanization",
951
+ "mechanise": "mechanize",
952
+ "mechanised": "mechanized",
953
+ "mechanises": "mechanizes",
954
+ "mechanising": "mechanizing",
955
+ "mediaeval": "medieval",
956
+ "memorialise": "memorialize",
957
+ "memorialised": "memorialized",
958
+ "memorialises": "memorializes",
959
+ "memorialising": "memorializing",
960
+ "memorise": "memorize",
961
+ "memorised": "memorized",
962
+ "memorises": "memorizes",
963
+ "memorising": "memorizing",
964
+ "mesmerise": "mesmerize",
965
+ "mesmerised": "mesmerized",
966
+ "mesmerises": "mesmerizes",
967
+ "mesmerising": "mesmerizing",
968
+ "metabolise": "metabolize",
969
+ "metabolised": "metabolized",
970
+ "metabolises": "metabolizes",
971
+ "metabolising": "metabolizing",
972
+ "metre": "meter",
973
+ "metres": "meters",
974
+ "mhm": "hmm",
975
+ "micrometre": "micrometer",
976
+ "micrometres": "micrometers",
977
+ "militarise": "militarize",
978
+ "militarised": "militarized",
979
+ "militarises": "militarizes",
980
+ "militarising": "militarizing",
981
+ "milligramme": "milligram",
982
+ "milligrammes": "milligrams",
983
+ "millilitre": "milliliter",
984
+ "millilitres": "milliliters",
985
+ "millimetre": "millimeter",
986
+ "millimetres": "millimeters",
987
+ "miniaturisation": "miniaturization",
988
+ "miniaturise": "miniaturize",
989
+ "miniaturised": "miniaturized",
990
+ "miniaturises": "miniaturizes",
991
+ "miniaturising": "miniaturizing",
992
+ "minibusses": "minibuses",
993
+ "minimise": "minimize",
994
+ "minimised": "minimized",
995
+ "minimises": "minimizes",
996
+ "minimising": "minimizing",
997
+ "misbehaviour": "misbehavior",
998
+ "misdemeanour": "misdemeanor",
999
+ "misdemeanours": "misdemeanors",
1000
+ "misspelt": "misspelled",
1001
+ "mitre": "miter",
1002
+ "mitres": "miters",
1003
+ "mm": "hmm",
1004
+ "mmm": "hmm",
1005
+ "mobilisation": "mobilization",
1006
+ "mobilise": "mobilize",
1007
+ "mobilised": "mobilized",
1008
+ "mobilises": "mobilizes",
1009
+ "mobilising": "mobilizing",
1010
+ "modelled": "modeled",
1011
+ "modeller": "modeler",
1012
+ "modellers": "modelers",
1013
+ "modelling": "modeling",
1014
+ "modernise": "modernize",
1015
+ "modernised": "modernized",
1016
+ "modernises": "modernizes",
1017
+ "modernising": "modernizing",
1018
+ "moisturise": "moisturize",
1019
+ "moisturised": "moisturized",
1020
+ "moisturiser": "moisturizer",
1021
+ "moisturisers": "moisturizers",
1022
+ "moisturises": "moisturizes",
1023
+ "moisturising": "moisturizing",
1024
+ "monologue": "monolog",
1025
+ "monologues": "monologs",
1026
+ "monopolisation": "monopolization",
1027
+ "monopolise": "monopolize",
1028
+ "monopolised": "monopolized",
1029
+ "monopolises": "monopolizes",
1030
+ "monopolising": "monopolizing",
1031
+ "moralise": "moralize",
1032
+ "moralised": "moralized",
1033
+ "moralises": "moralizes",
1034
+ "moralising": "moralizing",
1035
+ "motorised": "motorized",
1036
+ "mould": "mold",
1037
+ "moulded": "molded",
1038
+ "moulder": "molder",
1039
+ "mouldered": "moldered",
1040
+ "mouldering": "moldering",
1041
+ "moulders": "molders",
1042
+ "mouldier": "moldier",
1043
+ "mouldiest": "moldiest",
1044
+ "moulding": "molding",
1045
+ "mouldings": "moldings",
1046
+ "moulds": "molds",
1047
+ "mouldy": "moldy",
1048
+ "moult": "molt",
1049
+ "moulted": "molted",
1050
+ "moulting": "molting",
1051
+ "moults": "molts",
1052
+ "moustache": "mustache",
1053
+ "moustached": "mustached",
1054
+ "moustaches": "mustaches",
1055
+ "moustachioed": "mustachioed",
1056
+ "multicoloured": "multicolored",
1057
+ "nationalisation": "nationalization",
1058
+ "nationalisations": "nationalizations",
1059
+ "nationalise": "nationalize",
1060
+ "nationalised": "nationalized",
1061
+ "nationalises": "nationalizes",
1062
+ "nationalising": "nationalizing",
1063
+ "naturalisation": "naturalization",
1064
+ "naturalise": "naturalize",
1065
+ "naturalised": "naturalized",
1066
+ "naturalises": "naturalizes",
1067
+ "naturalising": "naturalizing",
1068
+ "neighbour": "neighbor",
1069
+ "neighbourhood": "neighborhood",
1070
+ "neighbourhoods": "neighborhoods",
1071
+ "neighbouring": "neighboring",
1072
+ "neighbourliness": "neighborliness",
1073
+ "neighbourly": "neighborly",
1074
+ "neighbours": "neighbors",
1075
+ "neutralisation": "neutralization",
1076
+ "neutralise": "neutralize",
1077
+ "neutralised": "neutralized",
1078
+ "neutralises": "neutralizes",
1079
+ "neutralising": "neutralizing",
1080
+ "normalisation": "normalization",
1081
+ "normalise": "normalize",
1082
+ "normalised": "normalized",
1083
+ "normalises": "normalizes",
1084
+ "normalising": "normalizing",
1085
+ "odour": "odor",
1086
+ "odourless": "odorless",
1087
+ "odours": "odors",
1088
+ "oesophagus": "esophagus",
1089
+ "oesophaguses": "esophaguses",
1090
+ "oestrogen": "estrogen",
1091
+ "offence": "offense",
1092
+ "offences": "offenses",
1093
+ "omelette": "omelet",
1094
+ "omelettes": "omelets",
1095
+ "optimise": "optimize",
1096
+ "optimised": "optimized",
1097
+ "optimises": "optimizes",
1098
+ "optimising": "optimizing",
1099
+ "organisation": "organization",
1100
+ "organisational": "organizational",
1101
+ "organisations": "organizations",
1102
+ "organise": "organize",
1103
+ "organised": "organized",
1104
+ "organiser": "organizer",
1105
+ "organisers": "organizers",
1106
+ "organises": "organizes",
1107
+ "organising": "organizing",
1108
+ "orthopaedic": "orthopedic",
1109
+ "orthopaedics": "orthopedics",
1110
+ "ostracise": "ostracize",
1111
+ "ostracised": "ostracized",
1112
+ "ostracises": "ostracizes",
1113
+ "ostracising": "ostracizing",
1114
+ "outmanoeuvre": "outmaneuver",
1115
+ "outmanoeuvred": "outmaneuvered",
1116
+ "outmanoeuvres": "outmaneuvers",
1117
+ "outmanoeuvring": "outmaneuvering",
1118
+ "overemphasise": "overemphasize",
1119
+ "overemphasised": "overemphasized",
1120
+ "overemphasises": "overemphasizes",
1121
+ "overemphasising": "overemphasizing",
1122
+ "oxidisation": "oxidization",
1123
+ "oxidise": "oxidize",
1124
+ "oxidised": "oxidized",
1125
+ "oxidises": "oxidizes",
1126
+ "oxidising": "oxidizing",
1127
+ "paederast": "pederast",
1128
+ "paederasts": "pederasts",
1129
+ "paediatric": "pediatric",
1130
+ "paediatrician": "pediatrician",
1131
+ "paediatricians": "pediatricians",
1132
+ "paediatrics": "pediatrics",
1133
+ "paedophile": "pedophile",
1134
+ "paedophiles": "pedophiles",
1135
+ "paedophilia": "pedophilia",
1136
+ "palaeolithic": "paleolithic",
1137
+ "palaeontologist": "paleontologist",
1138
+ "palaeontologists": "paleontologists",
1139
+ "palaeontology": "paleontology",
1140
+ "panelled": "paneled",
1141
+ "panelling": "paneling",
1142
+ "panellist": "panelist",
1143
+ "panellists": "panelists",
1144
+ "paralyse": "paralyze",
1145
+ "paralysed": "paralyzed",
1146
+ "paralyses": "paralyzes",
1147
+ "paralysing": "paralyzing",
1148
+ "parcelled": "parceled",
1149
+ "parcelling": "parceling",
1150
+ "parlour": "parlor",
1151
+ "parlours": "parlors",
1152
+ "particularise": "particularize",
1153
+ "particularised": "particularized",
1154
+ "particularises": "particularizes",
1155
+ "particularising": "particularizing",
1156
+ "passivisation": "passivization",
1157
+ "passivise": "passivize",
1158
+ "passivised": "passivized",
1159
+ "passivises": "passivizes",
1160
+ "passivising": "passivizing",
1161
+ "pasteurisation": "pasteurization",
1162
+ "pasteurise": "pasteurize",
1163
+ "pasteurised": "pasteurized",
1164
+ "pasteurises": "pasteurizes",
1165
+ "pasteurising": "pasteurizing",
1166
+ "patronise": "patronize",
1167
+ "patronised": "patronized",
1168
+ "patronises": "patronizes",
1169
+ "patronising": "patronizing",
1170
+ "patronisingly": "patronizingly",
1171
+ "pedalled": "pedaled",
1172
+ "pedalling": "pedaling",
1173
+ "pedestrianisation": "pedestrianization",
1174
+ "pedestrianise": "pedestrianize",
1175
+ "pedestrianised": "pedestrianized",
1176
+ "pedestrianises": "pedestrianizes",
1177
+ "pedestrianising": "pedestrianizing",
1178
+ "penalise": "penalize",
1179
+ "penalised": "penalized",
1180
+ "penalises": "penalizes",
1181
+ "penalising": "penalizing",
1182
+ "pencilled": "penciled",
1183
+ "pencilling": "penciling",
1184
+ "personalise": "personalize",
1185
+ "personalised": "personalized",
1186
+ "personalises": "personalizes",
1187
+ "personalising": "personalizing",
1188
+ "pharmacopoeia": "pharmacopeia",
1189
+ "pharmacopoeias": "pharmacopeias",
1190
+ "philosophise": "philosophize",
1191
+ "philosophised": "philosophized",
1192
+ "philosophises": "philosophizes",
1193
+ "philosophising": "philosophizing",
1194
+ "philtre": "filter",
1195
+ "philtres": "filters",
1196
+ "phoney": "phony",
1197
+ "plagiarise": "plagiarize",
1198
+ "plagiarised": "plagiarized",
1199
+ "plagiarises": "plagiarizes",
1200
+ "plagiarising": "plagiarizing",
1201
+ "plough": "plow",
1202
+ "ploughed": "plowed",
1203
+ "ploughing": "plowing",
1204
+ "ploughman": "plowman",
1205
+ "ploughmen": "plowmen",
1206
+ "ploughs": "plows",
1207
+ "ploughshare": "plowshare",
1208
+ "ploughshares": "plowshares",
1209
+ "polarisation": "polarization",
1210
+ "polarise": "polarize",
1211
+ "polarised": "polarized",
1212
+ "polarises": "polarizes",
1213
+ "polarising": "polarizing",
1214
+ "politicisation": "politicization",
1215
+ "politicise": "politicize",
1216
+ "politicised": "politicized",
1217
+ "politicises": "politicizes",
1218
+ "politicising": "politicizing",
1219
+ "popularisation": "popularization",
1220
+ "popularise": "popularize",
1221
+ "popularised": "popularized",
1222
+ "popularises": "popularizes",
1223
+ "popularising": "popularizing",
1224
+ "pouffe": "pouf",
1225
+ "pouffes": "poufs",
1226
+ "practise": "practice",
1227
+ "practised": "practiced",
1228
+ "practises": "practices",
1229
+ "practising": "practicing",
1230
+ "praesidium": "presidium",
1231
+ "praesidiums": "presidiums",
1232
+ "pressurisation": "pressurization",
1233
+ "pressurise": "pressurize",
1234
+ "pressurised": "pressurized",
1235
+ "pressurises": "pressurizes",
1236
+ "pressurising": "pressurizing",
1237
+ "pretence": "pretense",
1238
+ "pretences": "pretenses",
1239
+ "primaeval": "primeval",
1240
+ "prioritisation": "prioritization",
1241
+ "prioritise": "prioritize",
1242
+ "prioritised": "prioritized",
1243
+ "prioritises": "prioritizes",
1244
+ "prioritising": "prioritizing",
1245
+ "privatisation": "privatization",
1246
+ "privatisations": "privatizations",
1247
+ "privatise": "privatize",
1248
+ "privatised": "privatized",
1249
+ "privatises": "privatizes",
1250
+ "privatising": "privatizing",
1251
+ "professionalisation": "professionalization",
1252
+ "professionalise": "professionalize",
1253
+ "professionalised": "professionalized",
1254
+ "professionalises": "professionalizes",
1255
+ "professionalising": "professionalizing",
1256
+ "programme": "program",
1257
+ "programmes": "programs",
1258
+ "prologue": "prolog",
1259
+ "prologues": "prologs",
1260
+ "propagandise": "propagandize",
1261
+ "propagandised": "propagandized",
1262
+ "propagandises": "propagandizes",
1263
+ "propagandising": "propagandizing",
1264
+ "proselytise": "proselytize",
1265
+ "proselytised": "proselytized",
1266
+ "proselytiser": "proselytizer",
1267
+ "proselytisers": "proselytizers",
1268
+ "proselytises": "proselytizes",
1269
+ "proselytising": "proselytizing",
1270
+ "psychoanalyse": "psychoanalyze",
1271
+ "psychoanalysed": "psychoanalyzed",
1272
+ "psychoanalyses": "psychoanalyzes",
1273
+ "psychoanalysing": "psychoanalyzing",
1274
+ "publicise": "publicize",
1275
+ "publicised": "publicized",
1276
+ "publicises": "publicizes",
1277
+ "publicising": "publicizing",
1278
+ "pulverisation": "pulverization",
1279
+ "pulverise": "pulverize",
1280
+ "pulverised": "pulverized",
1281
+ "pulverises": "pulverizes",
1282
+ "pulverising": "pulverizing",
1283
+ "pummelled": "pummeled",
1284
+ "pummelling": "pummeling",
1285
+ "pyjama": "pajama",
1286
+ "pyjamas": "pajamas",
1287
+ "pzazz": "pizzazz",
1288
+ "quarrelled": "quarreled",
1289
+ "quarrelling": "quarreling",
1290
+ "radicalise": "radicalize",
1291
+ "radicalised": "radicalized",
1292
+ "radicalises": "radicalizes",
1293
+ "radicalising": "radicalizing",
1294
+ "rancour": "rancor",
1295
+ "randomise": "randomize",
1296
+ "randomised": "randomized",
1297
+ "randomises": "randomizes",
1298
+ "randomising": "randomizing",
1299
+ "rationalisation": "rationalization",
1300
+ "rationalisations": "rationalizations",
1301
+ "rationalise": "rationalize",
1302
+ "rationalised": "rationalized",
1303
+ "rationalises": "rationalizes",
1304
+ "rationalising": "rationalizing",
1305
+ "ravelled": "raveled",
1306
+ "ravelling": "raveling",
1307
+ "realisable": "realizable",
1308
+ "realisation": "realization",
1309
+ "realisations": "realizations",
1310
+ "realise": "realize",
1311
+ "realised": "realized",
1312
+ "realises": "realizes",
1313
+ "realising": "realizing",
1314
+ "recognisable": "recognizable",
1315
+ "recognisably": "recognizably",
1316
+ "recognisance": "recognizance",
1317
+ "recognise": "recognize",
1318
+ "recognised": "recognized",
1319
+ "recognises": "recognizes",
1320
+ "recognising": "recognizing",
1321
+ "reconnoitre": "reconnoiter",
1322
+ "reconnoitred": "reconnoitered",
1323
+ "reconnoitres": "reconnoiters",
1324
+ "reconnoitring": "reconnoitering",
1325
+ "refuelled": "refueled",
1326
+ "refuelling": "refueling",
1327
+ "regularisation": "regularization",
1328
+ "regularise": "regularize",
1329
+ "regularised": "regularized",
1330
+ "regularises": "regularizes",
1331
+ "regularising": "regularizing",
1332
+ "remodelled": "remodeled",
1333
+ "remodelling": "remodeling",
1334
+ "remould": "remold",
1335
+ "remoulded": "remolded",
1336
+ "remoulding": "remolding",
1337
+ "remoulds": "remolds",
1338
+ "reorganisation": "reorganization",
1339
+ "reorganisations": "reorganizations",
1340
+ "reorganise": "reorganize",
1341
+ "reorganised": "reorganized",
1342
+ "reorganises": "reorganizes",
1343
+ "reorganising": "reorganizing",
1344
+ "revelled": "reveled",
1345
+ "reveller": "reveler",
1346
+ "revellers": "revelers",
1347
+ "revelling": "reveling",
1348
+ "revitalise": "revitalize",
1349
+ "revitalised": "revitalized",
1350
+ "revitalises": "revitalizes",
1351
+ "revitalising": "revitalizing",
1352
+ "revolutionise": "revolutionize",
1353
+ "revolutionised": "revolutionized",
1354
+ "revolutionises": "revolutionizes",
1355
+ "revolutionising": "revolutionizing",
1356
+ "rhapsodise": "rhapsodize",
1357
+ "rhapsodised": "rhapsodized",
1358
+ "rhapsodises": "rhapsodizes",
1359
+ "rhapsodising": "rhapsodizing",
1360
+ "rigour": "rigor",
1361
+ "rigours": "rigors",
1362
+ "ritualised": "ritualized",
1363
+ "rivalled": "rivaled",
1364
+ "rivalling": "rivaling",
1365
+ "romanticise": "romanticize",
1366
+ "romanticised": "romanticized",
1367
+ "romanticises": "romanticizes",
1368
+ "romanticising": "romanticizing",
1369
+ "rumour": "rumor",
1370
+ "rumoured": "rumored",
1371
+ "rumours": "rumors",
1372
+ "sabre": "saber",
1373
+ "sabres": "sabers",
1374
+ "saltpetre": "saltpeter",
1375
+ "sanitise": "sanitize",
1376
+ "sanitised": "sanitized",
1377
+ "sanitises": "sanitizes",
1378
+ "sanitising": "sanitizing",
1379
+ "satirise": "satirize",
1380
+ "satirised": "satirized",
1381
+ "satirises": "satirizes",
1382
+ "satirising": "satirizing",
1383
+ "saviour": "savior",
1384
+ "saviours": "saviors",
1385
+ "savour": "savor",
1386
+ "savoured": "savored",
1387
+ "savouries": "savories",
1388
+ "savouring": "savoring",
1389
+ "savours": "savors",
1390
+ "savoury": "savory",
1391
+ "scandalise": "scandalize",
1392
+ "scandalised": "scandalized",
1393
+ "scandalises": "scandalizes",
1394
+ "scandalising": "scandalizing",
1395
+ "sceptic": "skeptic",
1396
+ "sceptical": "skeptical",
1397
+ "sceptically": "skeptically",
1398
+ "scepticism": "skepticism",
1399
+ "sceptics": "skeptics",
1400
+ "sceptre": "scepter",
1401
+ "sceptres": "scepters",
1402
+ "scrutinise": "scrutinize",
1403
+ "scrutinised": "scrutinized",
1404
+ "scrutinises": "scrutinizes",
1405
+ "scrutinising": "scrutinizing",
1406
+ "secularisation": "secularization",
1407
+ "secularise": "secularize",
1408
+ "secularised": "secularized",
1409
+ "secularises": "secularizes",
1410
+ "secularising": "secularizing",
1411
+ "sensationalise": "sensationalize",
1412
+ "sensationalised": "sensationalized",
1413
+ "sensationalises": "sensationalizes",
1414
+ "sensationalising": "sensationalizing",
1415
+ "sensitise": "sensitize",
1416
+ "sensitised": "sensitized",
1417
+ "sensitises": "sensitizes",
1418
+ "sensitising": "sensitizing",
1419
+ "sentimentalise": "sentimentalize",
1420
+ "sentimentalised": "sentimentalized",
1421
+ "sentimentalises": "sentimentalizes",
1422
+ "sentimentalising": "sentimentalizing",
1423
+ "sepulchre": "sepulcher",
1424
+ "sepulchres": "sepulchers",
1425
+ "serialisation": "serialization",
1426
+ "serialisations": "serializations",
1427
+ "serialise": "serialize",
1428
+ "serialised": "serialized",
1429
+ "serialises": "serializes",
1430
+ "serialising": "serializing",
1431
+ "sermonise": "sermonize",
1432
+ "sermonised": "sermonized",
1433
+ "sermonises": "sermonizes",
1434
+ "sermonising": "sermonizing",
1435
+ "sheikh": "sheik",
1436
+ "shovelled": "shoveled",
1437
+ "shovelling": "shoveling",
1438
+ "shrivelled": "shriveled",
1439
+ "shrivelling": "shriveling",
1440
+ "signalise": "signalize",
1441
+ "signalised": "signalized",
1442
+ "signalises": "signalizes",
1443
+ "signalising": "signalizing",
1444
+ "signalled": "signaled",
1445
+ "signalling": "signaling",
1446
+ "smoulder": "smolder",
1447
+ "smouldered": "smoldered",
1448
+ "smouldering": "smoldering",
1449
+ "smoulders": "smolders",
1450
+ "snivelled": "sniveled",
1451
+ "snivelling": "sniveling",
1452
+ "snorkelled": "snorkeled",
1453
+ "snorkelling": "snorkeling",
1454
+ "snowplough": "snowplow",
1455
+ "snowploughs": "snowplows",
1456
+ "socialisation": "socialization",
1457
+ "socialise": "socialize",
1458
+ "socialised": "socialized",
1459
+ "socialises": "socializes",
1460
+ "socialising": "socializing",
1461
+ "sodomise": "sodomize",
1462
+ "sodomised": "sodomized",
1463
+ "sodomises": "sodomizes",
1464
+ "sodomising": "sodomizing",
1465
+ "solemnise": "solemnize",
1466
+ "solemnised": "solemnized",
1467
+ "solemnises": "solemnizes",
1468
+ "solemnising": "solemnizing",
1469
+ "sombre": "somber",
1470
+ "specialisation": "specialization",
1471
+ "specialisations": "specializations",
1472
+ "specialise": "specialize",
1473
+ "specialised": "specialized",
1474
+ "specialises": "specializes",
1475
+ "specialising": "specializing",
1476
+ "spectre": "specter",
1477
+ "spectres": "specters",
1478
+ "spiralled": "spiraled",
1479
+ "spiralling": "spiraling",
1480
+ "splendour": "splendor",
1481
+ "splendours": "splendors",
1482
+ "squirrelled": "squirreled",
1483
+ "squirrelling": "squirreling",
1484
+ "stabilisation": "stabilization",
1485
+ "stabilise": "stabilize",
1486
+ "stabilised": "stabilized",
1487
+ "stabiliser": "stabilizer",
1488
+ "stabilisers": "stabilizers",
1489
+ "stabilises": "stabilizes",
1490
+ "stabilising": "stabilizing",
1491
+ "standardisation": "standardization",
1492
+ "standardise": "standardize",
1493
+ "standardised": "standardized",
1494
+ "standardises": "standardizes",
1495
+ "standardising": "standardizing",
1496
+ "stencilled": "stenciled",
1497
+ "stencilling": "stenciling",
1498
+ "sterilisation": "sterilization",
1499
+ "sterilisations": "sterilizations",
1500
+ "sterilise": "sterilize",
1501
+ "sterilised": "sterilized",
1502
+ "steriliser": "sterilizer",
1503
+ "sterilisers": "sterilizers",
1504
+ "sterilises": "sterilizes",
1505
+ "sterilising": "sterilizing",
1506
+ "stigmatisation": "stigmatization",
1507
+ "stigmatise": "stigmatize",
1508
+ "stigmatised": "stigmatized",
1509
+ "stigmatises": "stigmatizes",
1510
+ "stigmatising": "stigmatizing",
1511
+ "storey": "story",
1512
+ "storeys": "stories",
1513
+ "subsidisation": "subsidization",
1514
+ "subsidise": "subsidize",
1515
+ "subsidised": "subsidized",
1516
+ "subsidiser": "subsidizer",
1517
+ "subsidisers": "subsidizers",
1518
+ "subsidises": "subsidizes",
1519
+ "subsidising": "subsidizing",
1520
+ "succour": "succor",
1521
+ "succoured": "succored",
1522
+ "succouring": "succoring",
1523
+ "succours": "succors",
1524
+ "sulphate": "sulfate",
1525
+ "sulphates": "sulfates",
1526
+ "sulphide": "sulfide",
1527
+ "sulphides": "sulfides",
1528
+ "sulphur": "sulfur",
1529
+ "sulphurous": "sulfurous",
1530
+ "summarise": "summarize",
1531
+ "summarised": "summarized",
1532
+ "summarises": "summarizes",
1533
+ "summarising": "summarizing",
1534
+ "swivelled": "swiveled",
1535
+ "swivelling": "swiveling",
1536
+ "symbolise": "symbolize",
1537
+ "symbolised": "symbolized",
1538
+ "symbolises": "symbolizes",
1539
+ "symbolising": "symbolizing",
1540
+ "sympathise": "sympathize",
1541
+ "sympathised": "sympathized",
1542
+ "sympathiser": "sympathizer",
1543
+ "sympathisers": "sympathizers",
1544
+ "sympathises": "sympathizes",
1545
+ "sympathising": "sympathizing",
1546
+ "synchronisation": "synchronization",
1547
+ "synchronise": "synchronize",
1548
+ "synchronised": "synchronized",
1549
+ "synchronises": "synchronizes",
1550
+ "synchronising": "synchronizing",
1551
+ "synthesise": "synthesize",
1552
+ "synthesised": "synthesized",
1553
+ "synthesiser": "synthesizer",
1554
+ "synthesisers": "synthesizers",
1555
+ "synthesises": "synthesizes",
1556
+ "synthesising": "synthesizing",
1557
+ "syphon": "siphon",
1558
+ "syphoned": "siphoned",
1559
+ "syphoning": "siphoning",
1560
+ "syphons": "siphons",
1561
+ "systematisation": "systematization",
1562
+ "systematise": "systematize",
1563
+ "systematised": "systematized",
1564
+ "systematises": "systematizes",
1565
+ "systematising": "systematizing",
1566
+ "tantalise": "tantalize",
1567
+ "tantalised": "tantalized",
1568
+ "tantalises": "tantalizes",
1569
+ "tantalising": "tantalizing",
1570
+ "tantalisingly": "tantalizingly",
1571
+ "tasselled": "tasseled",
1572
+ "technicolour": "technicolor",
1573
+ "temporise": "temporize",
1574
+ "temporised": "temporized",
1575
+ "temporises": "temporizes",
1576
+ "temporising": "temporizing",
1577
+ "tenderise": "tenderize",
1578
+ "tenderised": "tenderized",
1579
+ "tenderises": "tenderizes",
1580
+ "tenderising": "tenderizing",
1581
+ "terrorise": "terrorize",
1582
+ "terrorised": "terrorized",
1583
+ "terrorises": "terrorizes",
1584
+ "terrorising": "terrorizing",
1585
+ "theatre": "theater",
1586
+ "theatregoer": "theatergoer",
1587
+ "theatregoers": "theatergoers",
1588
+ "theatres": "theaters",
1589
+ "theorise": "theorize",
1590
+ "theorised": "theorized",
1591
+ "theorises": "theorizes",
1592
+ "theorising": "theorizing",
1593
+ "tonne": "ton",
1594
+ "tonnes": "tons",
1595
+ "towelled": "toweled",
1596
+ "towelling": "toweling",
1597
+ "toxaemia": "toxemia",
1598
+ "tranquillise": "tranquilize",
1599
+ "tranquillised": "tranquilized",
1600
+ "tranquilliser": "tranquilizer",
1601
+ "tranquillisers": "tranquilizers",
1602
+ "tranquillises": "tranquilizes",
1603
+ "tranquillising": "tranquilizing",
1604
+ "tranquillity": "tranquility",
1605
+ "tranquillize": "tranquilize",
1606
+ "tranquillized": "tranquilized",
1607
+ "tranquillizer": "tranquilizer",
1608
+ "tranquillizers": "tranquilizers",
1609
+ "tranquillizes": "tranquilizes",
1610
+ "tranquillizing": "tranquilizing",
1611
+ "tranquilly": "tranquility",
1612
+ "transistorised": "transistorized",
1613
+ "traumatise": "traumatize",
1614
+ "traumatised": "traumatized",
1615
+ "traumatises": "traumatizes",
1616
+ "traumatising": "traumatizing",
1617
+ "travelled": "traveled",
1618
+ "traveller": "traveler",
1619
+ "travellers": "travelers",
1620
+ "travelling": "traveling",
1621
+ "travelogue": "travelog",
1622
+ "travelogues": "travelogs",
1623
+ "trialled": "trialed",
1624
+ "trialling": "trialing",
1625
+ "tricolour": "tricolor",
1626
+ "tricolours": "tricolors",
1627
+ "trivialise": "trivialize",
1628
+ "trivialised": "trivialized",
1629
+ "trivialises": "trivializes",
1630
+ "trivialising": "trivializing",
1631
+ "tumour": "tumor",
1632
+ "tumours": "tumors",
1633
+ "tunnelled": "tunneled",
1634
+ "tunnelling": "tunneling",
1635
+ "tyrannise": "tyrannize",
1636
+ "tyrannised": "tyrannized",
1637
+ "tyrannises": "tyrannizes",
1638
+ "tyrannising": "tyrannizing",
1639
+ "tyre": "tire",
1640
+ "tyres": "tires",
1641
+ "unauthorised": "unauthorized",
1642
+ "uncivilised": "uncivilized",
1643
+ "underutilised": "underutilized",
1644
+ "unequalled": "unequaled",
1645
+ "unfavourable": "unfavorable",
1646
+ "unfavourably": "unfavorably",
1647
+ "unionisation": "unionization",
1648
+ "unionise": "unionize",
1649
+ "unionised": "unionized",
1650
+ "unionises": "unionizes",
1651
+ "unionising": "unionizing",
1652
+ "unorganised": "unorganized",
1653
+ "unravelled": "unraveled",
1654
+ "unravelling": "unraveling",
1655
+ "unrecognisable": "unrecognizable",
1656
+ "unrecognised": "unrecognized",
1657
+ "unrivalled": "unrivaled",
1658
+ "unsavoury": "unsavory",
1659
+ "untrammelled": "untrammeled",
1660
+ "urbanisation": "urbanization",
1661
+ "urbanise": "urbanize",
1662
+ "urbanised": "urbanized",
1663
+ "urbanises": "urbanizes",
1664
+ "urbanising": "urbanizing",
1665
+ "utilisable": "utilizable",
1666
+ "utilisation": "utilization",
1667
+ "utilise": "utilize",
1668
+ "utilised": "utilized",
1669
+ "utilises": "utilizes",
1670
+ "utilising": "utilizing",
1671
+ "valour": "valor",
1672
+ "vandalise": "vandalize",
1673
+ "vandalised": "vandalized",
1674
+ "vandalises": "vandalizes",
1675
+ "vandalising": "vandalizing",
1676
+ "vaporisation": "vaporization",
1677
+ "vaporise": "vaporize",
1678
+ "vaporised": "vaporized",
1679
+ "vaporises": "vaporizes",
1680
+ "vaporising": "vaporizing",
1681
+ "vapour": "vapor",
1682
+ "vapours": "vapors",
1683
+ "verbalise": "verbalize",
1684
+ "verbalised": "verbalized",
1685
+ "verbalises": "verbalizes",
1686
+ "verbalising": "verbalizing",
1687
+ "victimisation": "victimization",
1688
+ "victimise": "victimize",
1689
+ "victimised": "victimized",
1690
+ "victimises": "victimizes",
1691
+ "victimising": "victimizing",
1692
+ "videodisc": "videodisk",
1693
+ "videodiscs": "videodisks",
1694
+ "vigour": "vigor",
1695
+ "visualisation": "visualization",
1696
+ "visualisations": "visualizations",
1697
+ "visualise": "visualize",
1698
+ "visualised": "visualized",
1699
+ "visualises": "visualizes",
1700
+ "visualising": "visualizing",
1701
+ "vocalisation": "vocalization",
1702
+ "vocalisations": "vocalizations",
1703
+ "vocalise": "vocalize",
1704
+ "vocalised": "vocalized",
1705
+ "vocalises": "vocalizes",
1706
+ "vocalising": "vocalizing",
1707
+ "vulcanised": "vulcanized",
1708
+ "vulgarisation": "vulgarization",
1709
+ "vulgarise": "vulgarize",
1710
+ "vulgarised": "vulgarized",
1711
+ "vulgarises": "vulgarizes",
1712
+ "vulgarising": "vulgarizing",
1713
+ "waggon": "wagon",
1714
+ "waggons": "wagons",
1715
+ "watercolour": "watercolor",
1716
+ "watercolours": "watercolors",
1717
+ "weaselled": "weaseled",
1718
+ "weaselling": "weaseling",
1719
+ "westernisation": "westernization",
1720
+ "westernise": "westernize",
1721
+ "westernised": "westernized",
1722
+ "westernises": "westernizes",
1723
+ "westernising": "westernizing",
1724
+ "womanise": "womanize",
1725
+ "womanised": "womanized",
1726
+ "womaniser": "womanizer",
1727
+ "womanisers": "womanizers",
1728
+ "womanises": "womanizes",
1729
+ "womanising": "womanizing",
1730
+ "woollen": "woolen",
1731
+ "woollens": "woolens",
1732
+ "woollies": "woolies",
1733
+ "woolly": "wooly",
1734
+ "worshipped": "worshiped",
1735
+ "worshipper": "worshiper",
1736
+ "worshipping": "worshiping",
1737
+ "yodelled": "yodeled",
1738
+ "yodelling": "yodeling",
1739
+ "yoghourt": "yogurt",
1740
+ "yoghourts": "yogurts",
1741
+ "yoghurt": "yogurt",
1742
+ "yoghurts": "yogurts"
1743
+ }
1744
+
1745
+
1746
+ english_name_normalizer = {
1747
+ # ── Double-letter variants ──────────────────────────────────────────────
1748
+ "alan": "allen",
1749
+ "allan": "allen",
1750
+ "bridgette": "bridget",
1751
+ "charly": "charlie",
1752
+ "charley": "charlie",
1753
+ "garry": "gary",
1754
+ "gregg": "greg",
1755
+ "jacky": "jackie",
1756
+ "joann": "joanne",
1757
+ "joane": "joanne",
1758
+ "kellye": "kelly",
1759
+ "kelli": "kelly",
1760
+ "kelley": "kelly",
1761
+ "lilly": "lily",
1762
+ "micheal": "michael",
1763
+ "michele": "michelle",
1764
+ "mollie": "molly",
1765
+ "phillip": "philip",
1766
+ "sallie": "sally",
1767
+ "stacey": "stacy",
1768
+ "stacie": "stacy",
1769
+ "tracey": "tracy",
1770
+ "tracie": "tracy",
1771
+ "bret": "brett",
1772
+ "carrol": "carol",
1773
+ "carole": "carol",
1774
+ "carroll": "carol",
1775
+ "allison": "alison",
1776
+ "alyson": "alison",
1777
+ "russel": "russell",
1778
+ "douglass": "douglas",
1779
+ "dominick": "dominic",
1780
+ "robb": "rob",
1781
+ # ── Chr/Kr variants ─────────────────────────────────────────────────────
1782
+ "kris": "chris",
1783
+ "kristopher": "christopher",
1784
+ "cristopher": "christopher",
1785
+ "kristina": "christina",
1786
+ "kristen": "kristin",
1787
+ # ── C/K variants ────────────────────────────────────────────────────────
1788
+ "karl": "carl",
1789
+ "kathy": "cathy",
1790
+ "katherine": "catherine",
1791
+ "kathryn": "catherine",
1792
+ "catharine": "catherine",
1793
+ "erik": "eric",
1794
+ "erick": "eric",
1795
+ "caren": "karen",
1796
+ "caryn": "karen",
1797
+ "karin": "karen",
1798
+ "katelyn": "caitlin",
1799
+ "kaitlyn": "caitlin",
1800
+ "kaitlin": "caitlin",
1801
+ "nikole": "nicole",
1802
+ "veronika": "veronica",
1803
+ "viktor": "victor",
1804
+ "viktoria": "victoria",
1805
+ "kevan": "kevin",
1806
+ "patrik": "patrick",
1807
+ "frederik": "frederick",
1808
+ "fredrick": "frederick",
1809
+ "lukas": "lucas",
1810
+ # ── Silent letters / alternate spellings ───────────────────────────────
1811
+ "ann": "anne",
1812
+ "jon": "john",
1813
+ "johnathan": "jonathan",
1814
+ "jonathon": "jonathan",
1815
+ "sara": "sarah",
1816
+ "mathew": "matthew",
1817
+ "nicolas": "nicholas",
1818
+ "rachael": "rachel",
1819
+ "rebekah": "rebecca",
1820
+ "devorah": "deborah",
1821
+ "theresa": "teresa",
1822
+ "suzanne": "susanne",
1823
+ "antony": "anthony",
1824
+ "martyn": "martin",
1825
+ "denis": "dennis",
1826
+ "laurence": "lawrence",
1827
+ "tomas": "thomas",
1828
+ "tobey": "toby",
1829
+ # ── Mac/Mc extensions ───────────────────────────────────────────────────
1830
+ "macarthur": "mcarthur",
1831
+ "macartney": "mccartney",
1832
+ "macarthy": "mccarthy",
1833
+ "maccarthy": "mccarthy",
1834
+ "macdonald": "mcdonald",
1835
+ "mackay": "mckay",
1836
+ "mackenzie": "mckenzie",
1837
+ "macleod": "mcleod",
1838
+ "maclean": "mclean",
1839
+ "macmillan": "mcmillan",
1840
+ "macintosh": "mcintosh",
1841
+ "macintyre": "mcintyre",
1842
+ "macnamara": "mcnamara",
1843
+ "macgowan": "mcgowan",
1844
+ # ── International ─────────────────────────────────
1845
+ "mohamad": "mohammed",
1846
+ "mohamed": "mohammed",
1847
+ "mohammad": "mohammed",
1848
+ "muhammad": "mohammed",
1849
+ "muhamad": "mohammed",
1850
+ "muhammed": "mohammed",
1851
+ "mouhamed": "mohammed",
1852
+ "mouhamad": "mohammed",
1853
+ "mahomet": "mohammed",
1854
+ "fatimah": "fatima",
1855
+ "yusuf": "yousef",
1856
+ "yusef": "yousef",
1857
+ "myriam": "miriam",
1858
+ "rajeev": "rajiv",
1859
+ # ── Miscellaneous homophones ────────────────────────────────────────────
1860
+ "alphonso": "alfonso",
1861
+ "bryan": "brian",
1862
+ "geoffrey": "jeffrey",
1863
+ "jeffery": "jeffrey",
1864
+ "geoff": "jeff",
1865
+ "neal": "neil",
1866
+ "shaun": "sean",
1867
+ "shawn": "sean",
1868
+ "shayne": "shane",
1869
+ "stephen": "steven",
1870
+ "toni": "tony",
1871
+ "leigh": "lee",
1872
+ "lewis": "louis",
1873
+ "marc": "mark",
1874
+ "meghan": "megan",
1875
+ "nathalie": "natalie",
1876
+ "robyn": "robin",
1877
+ "rodger": "roger",
1878
+ "linsey": "lindsay",
1879
+ "lindsey": "lindsay",
1880
+ "zackary": "zachary",
1881
+ "zachery": "zachary",
1882
+ "zak": "zach",
1883
+ "sheri": "sherry",
1884
+ "cheri": "sherry",
1885
+ "sherrie": "sherry",
1886
+ "terri": "terry",
1887
+ "lori": "laurie",
1888
+ "jaime": "jamie",
1889
+ "jayson": "jason",
1890
+ "lesley": "leslie",
1891
+ "lynda": "linda",
1892
+ "lynne": "lynn",
1893
+ "gayle": "gail",
1894
+ "rhonda": "ronda",
1895
+ "yvonne": "ivonne",
1896
+ "stewart": "stuart",
1897
+ "walther": "walter",
1898
+ "symon": "simon",
1899
+ "collin": "colin",
1900
+ "dillon": "dylan",
1901
+ "aron": "aaron",
1902
+ "artur": "arthur",
1903
+ "henri": "henry",
1904
+ "josef": "joseph",
1905
+ "pieter": "peter",
1906
+ }
1907
+
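Lookup tables like the two above are usually applied word by word after lowercasing. The sketch below illustrates that idea under stated assumptions: `apply_word_map` is a hypothetical helper, not part of this commit, and the two excerpt dictionaries contain only a handful of the entries listed above.

```python
# Tiny excerpts of the two tables above; the real dictionaries are much larger.
spelling_excerpt = {"theatre": "theater", "travelled": "traveled", "tyres": "tires"}
name_excerpt = {"shaun": "sean", "katherine": "catherine"}


def apply_word_map(text: str, *tables: dict) -> str:
    """Hypothetical helper: map each whitespace-separated word through the tables, in order."""
    words = text.lower().split()
    for table in tables:
        # Words missing from a table are left unchanged.
        words = [table.get(word, word) for word in words]
    return " ".join(words)


normalized = apply_word_map("Shaun travelled to the theatre", spelling_excerpt, name_excerpt)
print(normalized)  # sean traveled to the theater
```

Because both reference and prediction pass through the same tables, spelling variants such as `theatre`/`theater` should no longer count as word errors.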
normalizer/eval_utils.py ADDED
@@ -0,0 +1,279 @@
+ import glob
+ import json
+ import os
+ from collections import defaultdict
+ from difflib import SequenceMatcher
+
+ import evaluate
+
+
+ def normalize_compound_pairs(refs, preds):
+     """Align compound word boundaries between ref/pred pairs.
+
+     When a mismatch region has identical characters ignoring whitespace,
+     normalize both sides to the joined form.
+     """
+     new_refs, new_preds = [], []
+     for ref_text, pred_text in zip(refs, preds):
+         ref_words = ref_text.split()
+         pred_words = pred_text.split()
+
+         sm = SequenceMatcher(None, ref_words, pred_words)
+         new_rw, new_pw = [], []
+
+         for tag, i1, i2, j1, j2 in sm.get_opcodes():
+             if tag == "equal":
+                 new_rw.extend(ref_words[i1:i2])
+                 new_pw.extend(pred_words[j1:j2])
+             else:
+                 rc = "".join(ref_words[i1:i2])
+                 pc = "".join(pred_words[j1:j2])
+                 if rc == pc:
+                     new_rw.append(rc)
+                     new_pw.append(pc)
+                 else:
+                     new_rw.extend(ref_words[i1:i2])
+                     new_pw.extend(pred_words[j1:j2])
+
+         new_refs.append(" ".join(new_rw))
+         new_preds.append(" ".join(new_pw))
+     return new_refs, new_preds
+
+
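To make the compound-boundary rule concrete, here is a minimal single-pair sketch. `join_compounds` is a hypothetical, condensed restatement of the loop in `normalize_compound_pairs`, and the German sentence is an invented example rather than data from any benchmark.

```python
from difflib import SequenceMatcher


def join_compounds(ref: str, pred: str):
    """Hypothetical single-pair version of the compound-boundary alignment."""
    ref_words, pred_words = ref.split(), pred.split()
    out_ref, out_pred = [], []
    matcher = SequenceMatcher(None, ref_words, pred_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_chunk, pred_chunk = ref_words[i1:i2], pred_words[j1:j2]
        if tag != "equal" and "".join(ref_chunk) == "".join(pred_chunk):
            # Same characters, different word boundaries: use the joined form on both sides.
            ref_chunk = pred_chunk = ["".join(ref_chunk)]
        out_ref.extend(ref_chunk)
        out_pred.extend(pred_chunk)
    return " ".join(out_ref), " ".join(out_pred)


# A split compound is merged, so it no longer counts as a word error.
print(join_compounds("das ist ein kranken haus", "das ist ein krankenhaus"))
# A genuine substitution is left untouched.
print(join_compounds("hello world", "hello word"))
```

Only mismatch regions whose characters agree exactly are rewritten, so real recognition errors still reach the WER computation.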
+ def read_manifest(manifest_path: str):
+     """
+     Reads a manifest file (jsonl format) and returns a list of dictionaries containing samples.
+     """
+     data = []
+     with open(manifest_path, "r", encoding="utf-8") as f:
+         for line in f:
+             line = line.strip()
+             # Skip blank lines: a bare "\n" has non-zero length but is not valid JSON
+             if line:
+                 data.append(json.loads(line))
+     return data
+
+
+ def write_manifest(
+     references: list,
+     transcriptions: list,
+     model_id: str,
+     dataset_path: str,
+     dataset_name: str,
+     split: str,
+     audio_length: list = None,
+     transcription_time: list = None,
+     audio_filepaths: list = None,
+ ):
+     """
+     Writes a manifest file (jsonl format) and returns the path to the file.
+
+     Args:
+         references: Ground truth reference texts.
+         transcriptions: Model predicted transcriptions.
+         model_id: String identifier for the model.
+         dataset_path: Path to the dataset.
+         dataset_name: Name of the dataset.
+         split: Dataset split name.
+         audio_length: Length of each audio sample in seconds.
+         transcription_time: Transcription time of each sample in seconds.
+         audio_filepaths: List of file paths for each audio sample.
+
+     Returns:
+         Path to the manifest file.
+     """
+     model_id = model_id.replace("/", "-")
+     dataset_path = dataset_path.replace("/", "-")
+     dataset_name = dataset_name.replace("/", "-")
+
+     if len(references) != len(transcriptions):
+         raise ValueError(
+             f"The number of samples in `references` ({len(references)}) "
+             f"must match `transcriptions` ({len(transcriptions)})."
+         )
+
+     if audio_length is not None and len(audio_length) != len(references):
+         raise ValueError(
+             f"The number of samples in `audio_length` ({len(audio_length)}) "
+             f"must match `references` ({len(references)})."
+         )
+     if transcription_time is not None and len(transcription_time) != len(references):
+         raise ValueError(
+             f"The number of samples in `transcription_time` ({len(transcription_time)}) "
+             f"must match `references` ({len(references)})."
+         )
+     if audio_filepaths is not None and len(audio_filepaths) != len(references):
+         raise ValueError(
+             f"The number of samples in `audio_filepaths` ({len(audio_filepaths)}) "
+             f"must match `references` ({len(references)})."
+         )
+
+     # Filter out samples where the normalized reference is empty,
+     # e.g. all-filler words removed by normalization. Mutates the caller's
+     # lists in-place (via slice assignment) so downstream WER computation
+     # in caller scripts also sees the filtered data.
+     valid_indices = [
+         i for i, ref in enumerate(references) if isinstance(ref, str) and ref.strip()
+     ]
+     n_filtered = len(references) - len(valid_indices)
+     if n_filtered > 0:
+         print(f"Filtered {n_filtered} empty references")
+         references[:] = [references[i] for i in valid_indices]
+         transcriptions[:] = [transcriptions[i] for i in valid_indices]
+         if audio_length is not None:
+             audio_length[:] = [audio_length[i] for i in valid_indices]
+         if transcription_time is not None:
+             transcription_time[:] = [transcription_time[i] for i in valid_indices]
+         if audio_filepaths is not None:
+             audio_filepaths[:] = [audio_filepaths[i] for i in valid_indices]
+
+     audio_length = (
+         audio_length if audio_length is not None else len(references) * [None]
+     )
+     transcription_time = (
+         transcription_time
+         if transcription_time is not None
+         else len(references) * [None]
+     )
+     audio_filepaths = (
+         audio_filepaths if audio_filepaths is not None else len(references) * [None]
+     )
+
+     basedir = "./results/"
+     os.makedirs(basedir, exist_ok=True)
+
+     manifest_path = os.path.join(
+         basedir, f"MODEL_{model_id}_DATASET_{dataset_path}_{dataset_name}_{split}.jsonl"
+     )
+
+     with open(manifest_path, "w", encoding="utf-8") as f:
+         # Per-sample loop variables use their own names so they do not shadow the list arguments
+         for idx, (text, transcript, duration, time_taken, audio_filepath) in enumerate(
+             zip(references, transcriptions, audio_length, transcription_time, audio_filepaths)
+         ):
+             datum = {
+                 "audio_filepath": audio_filepath if audio_filepath else f"sample_{idx}",
+                 "duration": duration,
+                 "time": time_taken,
+                 "text": text,
+                 "pred_text": transcript,
+             }
+             f.write(f"{json.dumps(datum, ensure_ascii=False)}\n")
+     return manifest_path
+
+
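The comment in `write_manifest` about mutating the caller's lists relies on slice assignment. The following sketch shows that mechanism in isolation; `drop_empty_references` and the sample strings are hypothetical and exist only for illustration.

```python
def drop_empty_references(references: list, transcriptions: list) -> int:
    """Hypothetical helper: drop pairs whose reference is empty, editing both lists in place."""
    keep = [i for i, ref in enumerate(references) if isinstance(ref, str) and ref.strip()]
    removed = len(references) - len(keep)
    # Slice assignment rewrites the existing list objects, so the caller's
    # variables see the filtered data; a plain `references = [...]` would not.
    references[:] = [references[i] for i in keep]
    transcriptions[:] = [transcriptions[i] for i in keep]
    return removed


refs = ["hello world", "   ", "good morning"]
preds = ["hello word", "uh", "good morning"]
removed = drop_empty_references(refs, preds)
print(removed, refs, preds)
```

A caller that computes WER on `refs` and `preds` after this call works on the same filtered samples that were written to disk.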
+ def score_results(directory: str, model_id: str = None, multilingual: bool = False):
+     """
+     Scores all result files in a directory and returns a composite score over all evaluated datasets.
+
+     Args:
+         directory: Path to the result directory, containing one or more jsonl files.
+         model_id: Optional model name; if given, only result files whose path contains it are scored.
+         multilingual: If True, apply compound word boundary normalization before
+             WER computation. Should only be enabled for non-English benchmarks.
+
+     Returns:
+         A dictionary mapping each model id to the sum of its per-dataset WERs (the mean is
+         printed, not returned), and a dictionary of all per-dataset results.
+     """
+
+     # Strip trailing slash (os.sep is the path separator; os.pathsep separates PATH entries)
+     if directory.endswith(("/", os.sep)):
+         directory = directory[:-1]
+
+     # Find all result files in the directory
+     result_files = sorted(glob.glob(f"{directory}/**/*.jsonl", recursive=True))
+
+     # Filter files belonging to a specific model id
+     if model_id is not None and model_id != "":
+         print("Filtering models by id:", model_id)
+         model_id = model_id.replace("/", "-")
+         result_files = [fp for fp in result_files if model_id in fp]
+
+     # Check if any result files were found
+     if len(result_files) == 0:
+         raise ValueError(f"No result files found in {directory}")
+
+     # Utility function to parse the file path and extract the model id and the dataset id
+     # (dataset path, dataset name and split joined by underscores)
+     def parse_filepath(fp: str):
+         model_index = fp.find("MODEL_")
+         fp = fp[model_index:]
+         ds_index = fp.find("DATASET_")
+         model_id = fp[:ds_index].replace("MODEL_", "").rstrip("_")
+         # write_manifest replaced "/" with "-"; restore the first one as the author separator
+         author_index = model_id.find("-")
+         if author_index != -1:
+             model_id = model_id[:author_index] + "/" + model_id[author_index + 1 :]
+
+         ds_fp = fp[ds_index:]
+         # removesuffix (Python 3.9+) drops only the extension; rstrip(".jsonl") would also
+         # strip trailing dataset-name characters such as "n", "s" or "l"
+         dataset_id = ds_fp.replace("DATASET_", "").removesuffix(".jsonl")
+         return model_id, dataset_id
+
+     # Compute WER results per dataset, and RTFx over all datasets
+     results = {}
+     wer_metric = evaluate.load("wer")
+
+     for result_file in result_files:
+         manifest = read_manifest(result_file)
+         model_id_of_file, dataset_id = parse_filepath(result_file)
+
+         manifest = [datum for datum in manifest if datum["text"].strip()]
+
+         references = [datum["text"] for datum in manifest]
+         predictions = [datum["pred_text"] for datum in manifest]
+
+         times = [datum["time"] for datum in manifest]
+         durations = [datum["duration"] for datum in manifest]
+         # RTFx is only computed when every sample has a non-zero time and duration
+         compute_rtfx = all(times) and all(durations)
+
+         if multilingual:
+             references, predictions = normalize_compound_pairs(references, predictions)
+
+         wer = wer_metric.compute(references=references, predictions=predictions)
+         wer = round(100 * wer, 2)
+
+         if compute_rtfx:
+             audio_length = sum(durations)
+             inference_time = sum(times)
+             rtfx = round(audio_length / inference_time, 4)
+         else:
+             audio_length = inference_time = rtfx = None
+
+         result_key = f"{model_id_of_file} | {dataset_id}"
+         results[result_key] = {"wer": wer, "audio_length": audio_length, "inference_time": inference_time, "rtfx": rtfx}
+
+     print("*" * 80)
+     print("Results per dataset:")
+     print("*" * 80)
+
+     for k, v in results.items():
+         metrics = f"{k}: WER = {v['wer']:0.2f} %"
+         if v["rtfx"] is not None:
+             metrics += f", RTFx = {v['rtfx']:0.2f}"
+         print(metrics)
+
+     # composite WER should be computed over all datasets and with the same key
+     composite_wer = defaultdict(float)
+     composite_audio_length = defaultdict(float)
+     composite_inference_time = defaultdict(float)
+     count_entries = defaultdict(int)
+     # Models with at least one dataset lacking timing data get no composite RTFx.
+     # Tracking them in a set avoids adding to a None placeholder on a later dataset.
+     missing_timing = set()
+     for k, v in results.items():
+         key = k.split("|")[0].strip()
+         composite_wer[key] += v["wer"]
+         count_entries[key] += 1
+         if v["rtfx"] is not None:
+             composite_audio_length[key] += v["audio_length"]
+             composite_inference_time[key] += v["inference_time"]
+         else:
+             missing_timing.add(key)
+
+     # normalize scores & print
+     print()
+     print("*" * 80)
+     print("Composite Results:")
+     print("*" * 80)
+     for k, v in composite_wer.items():
+         wer = v / count_entries[k]
+         print(f"{k}: WER = {wer:0.2f} %")
+     for k in composite_audio_length:
+         if k not in missing_timing:
+             rtfx = composite_audio_length[k] / composite_inference_time[k]
+             print(f"{k}: RTFx = {rtfx:0.2f}")
+     print("*" * 80)
+     return composite_wer, results
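Result files follow the `MODEL_<model>_DATASET_<dataset>.jsonl` naming scheme produced by `write_manifest`. The sketch below parses such a name with suffix-aware string methods; `parse_result_name`, the `acme/tiny-asr` model id and the dataset name are all hypothetical, and `str.removesuffix`/`str.removeprefix` assume Python 3.9 or newer.

```python
import os


def parse_result_name(path: str):
    """Hypothetical parser for MODEL_<model>_DATASET_<dataset>.jsonl file names."""
    name = os.path.basename(path).removesuffix(".jsonl")
    model_part, dataset_id = name.split("_DATASET_", 1)
    model_id = model_part.removeprefix("MODEL_")
    # "/" was written as "-", so only the first "-" is restored as the author
    # separator (an assumption that breaks for author names containing "-").
    model_id = model_id.replace("-", "/", 1)
    return model_id, dataset_id


parsed = parse_result_name("./results/MODEL_acme-tiny-asr_DATASET_acme-corpus_en_test_clean.jsonl")
print(parsed)

# str.rstrip takes a set of characters, not a suffix, so it can eat into the name:
stripped = "test_clean.jsonl".rstrip(".jsonl")
print(stripped)  # test_clea
```

The last two lines show why a character-set strip is risky for names that end in letters such as `n`, `s` or `l`.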
normalizer/normalizer.py ADDED
@@ -0,0 +1,690 @@
+ # Copyright 2022 The OpenAI team and The HuggingFace Team. All rights reserved.
+ # Most of the code is copy pasted from the original whisper repository
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import re
+ import unicodedata
+ from fractions import Fraction
+ from typing import Iterator, List, Match, Optional, Union
+ from .english_abbreviations import english_name_normalizer, english_spelling_normalizer
+
+ import regex
+
+
+ # non-ASCII letters that are not separated by "NFKD" normalization
+ ADDITIONAL_DIACRITICS = {
+     "œ": "oe",
+     "Œ": "OE",
+     "ø": "o",
+     "Ø": "O",
+     "æ": "ae",
+     "Æ": "AE",
+     "ß": "ss",
+     "ẞ": "SS",
+     "đ": "d",
+     "Đ": "D",
+     "ð": "d",
+     "Ð": "D",
+     "þ": "th",
+     "Þ": "th",
+     "ł": "l",
+     "Ł": "L",
+ }
+
+
+ def remove_symbols_and_diacritics(s: str, keep=""):
+     """
+     Replace any other markers, symbols, and punctuations with a space, and drop any diacritics (category 'Mn' and some
+     manual mappings)
+     """
+
+     def replace_character(char):
+         if char in keep:
+             return char
+         elif char in ADDITIONAL_DIACRITICS:
+             return ADDITIONAL_DIACRITICS[char]
+
+         elif unicodedata.category(char) == "Mn":
+             return ""
+
+         elif unicodedata.category(char)[0] in "MSP":
+             return " "
+
+         return char
+
+     return "".join(replace_character(c) for c in unicodedata.normalize("NFKD", s))
+
+
+ def remove_symbols(s: str):
+     """
+     Replace any other markers, symbols, punctuations with a space, keeping diacritics
+     """
+     return "".join(" " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKC", s))
+
+
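The diacritic handling above rests on Unicode decomposition: NFKD splits an accented letter into a base letter plus combining marks (category `Mn`), which can then be dropped. A minimal sketch of just that step, with `strip_diacritics` as a hypothetical name and invented sample words:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Sketch: decompose with NFKD, then drop combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("café naïve"))  # cafe naive
# "ø" has no Unicode decomposition, so NFKD alone leaves it untouched.
print(strip_diacritics("smørrebrød"))  # smørrebrød
```

The second call illustrates why letters such as `ø`, `æ` or `ł` need a manual mapping table in addition to the decomposition step.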
76
+ class BasicTextNormalizer:
77
+ def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
78
+ self.clean = remove_symbols_and_diacritics if remove_diacritics else remove_symbols
79
+ self.split_letters = split_letters
80
+
81
+ def __call__(self, s: str):
82
+ s = s.lower()
83
+ s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
84
+ s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis
85
+ s = self.clean(s).lower()
86
+
87
+ if self.split_letters:
88
+ s = " ".join(regex.findall(r"\X", s, regex.U))
89
+
90
+ s = re.sub(r"\s+", " ", s) # replace any successive whitespace characters with a space
91
+
92
+ return s
93
+
94
+
95
+ class BasicMultilingualTextNormalizer:
96
+ def __init__(self, remove_diacritics: bool = True):
97
+ self.clean = remove_symbols_and_diacritics if remove_diacritics else remove_symbols
98
+
99
+ def __call__(self, s: str):
100
+ s = s.lower()
101
+ s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
102
+ s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parentheses
103
+ s = self.clean(s).lower()
104
+
105
+ # Remove punctuation and extra spaces
106
+ s = re.sub(r"[^\w\s]", "", s)
107
+ s = re.sub(r"\s+", " ", s).strip()
108
+
109
+ return s
110
+
111
+
112
+ class EnglishNumberNormalizer:
113
+ """
114
+ Convert any spelled-out numbers into arabic numbers, while handling:
115
+
116
+ - remove any commas
117
+ - keep the suffixes such as: `1960s`, `274th`, `32nd`, etc.
118
+ - spell out currency symbols after the number. e.g. `$20 million` -> `20000000 dollars`
119
+ - spell out `one` and `ones`
120
+ - interpret successive single-digit numbers as nominal: `one oh one` -> `101`
121
+ """
122
+
123
+ def __init__(self):
124
+ super().__init__()
125
+
126
+ self.zeros = {"o", "oh", "zero"}
127
+ # fmt: off
128
+ self.ones = {
129
+ name: i
130
+ for i, name in enumerate(
131
+ ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"],
132
+ start=1,
133
+ )
134
+ }
135
+ # fmt: on
136
+ self.ones_plural = {
137
+ "sixes" if name == "six" else name + "s": (value, "s") for name, value in self.ones.items()
138
+ }
139
+ self.ones_ordinal = {
140
+ "zeroth": (0, "th"),
141
+ "first": (1, "st"),
142
+ "second": (2, "nd"),
143
+ "third": (3, "rd"),
144
+ "fifth": (5, "th"),
145
+ "twelfth": (12, "th"),
146
+ **{
147
+ name + ("h" if name.endswith("t") else "th"): (value, "th")
148
+ for name, value in self.ones.items()
149
+ if value > 3 and value != 5 and value != 12
150
+ },
151
+ }
152
+ self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal}
153
+
154
+ self.tens = {
155
+ "twenty": 20,
156
+ "thirty": 30,
157
+ "forty": 40,
158
+ "fifty": 50,
159
+ "sixty": 60,
160
+ "seventy": 70,
161
+ "eighty": 80,
162
+ "ninety": 90,
163
+ }
164
+ self.tens_plural = {name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()}
165
+ self.tens_ordinal = {name.replace("y", "ieth"): (value, "th") for name, value in self.tens.items()}
166
+ self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal}
167
+
168
+ self.multipliers = {
169
+ "hundred": 100,
170
+ "thousand": 1_000,
171
+ "million": 1_000_000,
172
+ "billion": 1_000_000_000,
173
+ "trillion": 1_000_000_000_000,
174
+ "quadrillion": 1_000_000_000_000_000,
175
+ "quintillion": 1_000_000_000_000_000_000,
176
+ "sextillion": 1_000_000_000_000_000_000_000,
177
+ "septillion": 1_000_000_000_000_000_000_000_000,
178
+ "octillion": 1_000_000_000_000_000_000_000_000_000,
179
+ "nonillion": 1_000_000_000_000_000_000_000_000_000_000,
180
+ "decillion": 1_000_000_000_000_000_000_000_000_000_000_000,
181
+ }
182
+ self.multipliers_plural = {name + "s": (value, "s") for name, value in self.multipliers.items()}
183
+ self.multipliers_ordinal = {name + "th": (value, "th") for name, value in self.multipliers.items()}
184
+ self.multipliers_suffixed = {**self.multipliers_plural, **self.multipliers_ordinal}
185
+ self.decimals = {*self.ones, *self.tens, *self.zeros}
186
+
187
+ self.preceding_prefixers = {
188
+ "minus": "-",
189
+ "negative": "-",
190
+ "plus": "+",
191
+ "positive": "+",
192
+ }
193
+ self.following_prefixers = {
194
+ "pound": "£",
195
+ "pounds": "£",
196
+ "euro": "€",
197
+ "euros": "€",
198
+ "dollar": "$",
199
+ "dollars": "$",
200
+ "cent": "¢",
201
+ "cents": "¢",
202
+ }
203
+ self.prefixes = set(list(self.preceding_prefixers.values()) + list(self.following_prefixers.values()))
204
+ self.suffixers = {
205
+ "per": {"cent": "%"},
206
+ "percent": "%",
207
+ }
208
+ self.specials = {"and", "double", "triple", "point"}
209
+
210
+ self.words = {
211
+ key
212
+ for mapping in [
213
+ self.zeros,
214
+ self.ones,
215
+ self.ones_suffixed,
216
+ self.tens,
217
+ self.tens_suffixed,
218
+ self.multipliers,
219
+ self.multipliers_suffixed,
220
+ self.preceding_prefixers,
221
+ self.following_prefixers,
222
+ self.suffixers,
223
+ self.specials,
224
+ ]
225
+ for key in mapping
226
+ }
227
+ self.literal_words = {"one", "ones"}
228
+
229
+ def process_words(self, words: List[str]) -> Iterator[str]:
230
+ prefix: Optional[str] = None
231
+ value: Optional[Union[str, int]] = None
232
+ skip = False
233
+
234
+ def to_fraction(s: str):
235
+ try:
236
+ return Fraction(s)
237
+ except ValueError:
238
+ return None
239
+
240
+ def output(result: Union[str, int]):
241
+ nonlocal prefix, value
242
+ result = str(result)
243
+ if prefix is not None:
244
+ result = prefix + result
245
+ value = None
246
+ prefix = None
247
+ return result
248
+
249
+ if len(words) == 0:
250
+ return
251
+
252
+ for i, current in enumerate(words):
253
+ prev = words[i - 1] if i != 0 else None
254
+ next = words[i + 1] if i != len(words) - 1 else None
255
+ if skip:
256
+ skip = False
257
+ continue
258
+
259
+ next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next)
260
+ has_prefix = current[0] in self.prefixes
261
+ current_without_prefix = current[1:] if has_prefix else current
262
+ if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
263
+ # arabic numbers (potentially with signs and fractions)
264
+ f = to_fraction(current_without_prefix)
265
+ if f is None:
266
+ raise ValueError("Converting the fraction failed")
267
+
268
+ if value is not None:
269
+ if isinstance(value, str) and value.endswith("."):
270
+ # concatenate decimals / ip address components
271
+ value = str(value) + str(current)
272
+ continue
273
+ else:
274
+ yield output(value)
275
+
276
+ prefix = current[0] if has_prefix else prefix
277
+ if f.denominator == 1:
278
+ value = f.numerator # store integers as int
279
+ else:
280
+ value = current_without_prefix
281
+ elif current not in self.words:
282
+ # non-numeric words
283
+ if value is not None:
284
+ yield output(value)
285
+ yield output(current)
286
+ elif current in self.zeros:
287
+ value = str(value or "") + "0"
288
+ elif current in self.ones:
289
+ ones = self.ones[current]
290
+
291
+ if value is None:
292
+ value = ones
293
+ elif isinstance(value, str) or prev in self.ones:
294
+ if prev in self.tens and ones < 10: # replace the last zero with the digit
295
+ value = value[:-1] + str(ones)
296
+ else:
297
+ value = str(value) + str(ones)
298
+ elif ones < 10:
299
+ if value % 10 == 0:
300
+ value += ones
301
+ else:
302
+ value = str(value) + str(ones)
303
+ else: # eleven to nineteen
304
+ if value % 100 == 0:
305
+ value += ones
306
+ else:
307
+ value = str(value) + str(ones)
308
+ elif current in self.ones_suffixed:
309
+ # ordinal or cardinal; yield the number right away
310
+ ones, suffix = self.ones_suffixed[current]
311
+ if value is None:
312
+ yield output(str(ones) + suffix)
313
+ elif isinstance(value, str) or prev in self.ones:
314
+ if prev in self.tens and ones < 10:
315
+ yield output(value[:-1] + str(ones) + suffix)
316
+ else:
317
+ yield output(str(value) + str(ones) + suffix)
318
+ elif ones < 10:
319
+ if value % 10 == 0:
320
+ yield output(str(value + ones) + suffix)
321
+ else:
322
+ yield output(str(value) + str(ones) + suffix)
323
+ else: # eleven to nineteen
324
+ if value % 100 == 0:
325
+ yield output(str(value + ones) + suffix)
326
+ else:
327
+ yield output(str(value) + str(ones) + suffix)
328
+ value = None
329
+ elif current in self.tens:
330
+ tens = self.tens[current]
331
+ if value is None:
332
+ value = tens
333
+ elif isinstance(value, str):
334
+ value = str(value) + str(tens)
335
+ else:
336
+ if value % 100 == 0:
337
+ value += tens
338
+ else:
339
+ value = str(value) + str(tens)
340
+ elif current in self.tens_suffixed:
341
+ # ordinal or cardinal; yield the number right away
342
+ tens, suffix = self.tens_suffixed[current]
343
+ if value is None:
344
+ yield output(str(tens) + suffix)
345
+ elif isinstance(value, str):
346
+ yield output(str(value) + str(tens) + suffix)
347
+ else:
348
+ if value % 100 == 0:
349
+ yield output(str(value + tens) + suffix)
350
+ else:
351
+ yield output(str(value) + str(tens) + suffix)
352
+ elif current in self.multipliers:
353
+ multiplier = self.multipliers[current]
354
+ if value is None:
355
+ value = multiplier
356
+ elif isinstance(value, str) or value == 0:
357
+ f = to_fraction(value)
358
+ p = f * multiplier if f is not None else None
359
+ if f is not None and p.denominator == 1:
360
+ value = p.numerator
361
+ else:
362
+ yield output(value)
363
+ value = multiplier
364
+ else:
365
+ before = value // 1000 * 1000
366
+ residual = value % 1000
367
+ value = before + residual * multiplier
368
+ elif current in self.multipliers_suffixed:
369
+ multiplier, suffix = self.multipliers_suffixed[current]
370
+ if value is None:
371
+ yield output(str(multiplier) + suffix)
372
+ elif isinstance(value, str):
373
+ f = to_fraction(value)
374
+ p = f * multiplier if f is not None else None
375
+ if f is not None and p.denominator == 1:
376
+ yield output(str(p.numerator) + suffix)
377
+ else:
378
+ yield output(value)
379
+ yield output(str(multiplier) + suffix)
380
+ else: # int
381
+ before = value // 1000 * 1000
382
+ residual = value % 1000
383
+ value = before + residual * multiplier
384
+ yield output(str(value) + suffix)
385
+ value = None
386
+ elif current in self.preceding_prefixers:
387
+ # apply prefix (positive, minus, etc.) if it precedes a number
388
+ if value is not None:
389
+ yield output(value)
390
+
391
+ if next in self.words or next_is_numeric:
392
+ prefix = self.preceding_prefixers[current]
393
+ else:
394
+ yield output(current)
395
+ elif current in self.following_prefixers:
396
+ # apply prefix (dollars, cents, etc.) only after a number
397
+ if value is not None:
398
+ prefix = self.following_prefixers[current]
399
+ yield output(value)
400
+ else:
401
+ yield output(current)
402
+ elif current in self.suffixers:
403
+ # apply suffix symbols (percent -> '%')
404
+ if value is not None:
405
+ suffix = self.suffixers[current]
406
+ if isinstance(suffix, dict):
407
+ if next in suffix:
408
+ yield output(str(value) + suffix[next])
409
+ skip = True
410
+ else:
411
+ yield output(value)
412
+ yield output(current)
413
+ else:
414
+ yield output(str(value) + suffix)
415
+ else:
416
+ yield output(current)
417
+ elif current in self.specials:
418
+ if next not in self.words and not next_is_numeric:
419
+ # apply special handling only if the next word can be numeric
420
+ if value is not None:
421
+ yield output(value)
422
+ yield output(current)
423
+ elif current == "and":
424
+ # ignore "and" after hundreds, thousands, etc.
425
+ if prev not in self.multipliers:
426
+ if value is not None:
427
+ yield output(value)
428
+ yield output(current)
429
+ elif current == "double" or current == "triple":
430
+ if next in self.ones or next in self.zeros:
431
+ repeats = 2 if current == "double" else 3
432
+ ones = self.ones.get(next, 0)
433
+ value = str(value or "") + str(ones) * repeats
434
+ skip = True
435
+ else:
436
+ if value is not None:
437
+ yield output(value)
438
+ yield output(current)
439
+ elif current == "point":
440
+ if next in self.decimals or next_is_numeric:
441
+ value = str(value or "") + "."
442
+ else:
443
+ # should all have been covered at this point
444
+ raise ValueError(f"Unexpected token: {current}")
445
+ else:
446
+ # all should have been covered at this point
447
+ raise ValueError(f"Unexpected token: {current}")
448
+
449
+ if value is not None:
450
+ yield output(value)
451
+
452
+ def preprocess(self, s: str):
453
+ # replace "<number> and a half" with "<number> point five"
454
+ results = []
455
+
456
+ segments = re.split(r"\band\s+a\s+half\b", s)
457
+ for i, segment in enumerate(segments):
458
+ if len(segment.strip()) == 0:
459
+ continue
460
+ if i == len(segments) - 1:
461
+ results.append(segment)
462
+ else:
463
+ results.append(segment)
464
+ last_word = segment.rsplit(maxsplit=2)[-1]
465
+ if last_word in self.decimals or last_word in self.multipliers:
466
+ results.append("point five")
467
+ else:
468
+ results.append("and a half")
469
+
470
+ s = " ".join(results)
471
+
472
+ # put a space at number/letter boundary
473
+ s = re.sub(r"([a-z])([0-9])", r"\1 \2", s)
474
+ s = re.sub(r"([0-9])([a-z])", r"\1 \2", s)
475
+
476
+ # but remove spaces which could be a suffix
477
+ s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s)
478
+
479
+ return s
480
+
481
+ def postprocess(self, s: str):
482
+ def combine_cents(m: Match):
483
+ try:
484
+ currency = m.group(1)
485
+ integer = m.group(2)
486
+ cents = int(m.group(3))
487
+ return f"{currency}{integer}.{cents:02d}"
488
+ except ValueError:
489
+ return m.group(0)
490
+
491
+ def extract_cents(m: Match):
492
+ try:
493
+ return f"¢{int(m.group(1))}"
494
+ except ValueError:
495
+ return m.group(0)
496
+
497
+ # apply currency postprocessing; "$2 and ¢7" -> "$2.07"
498
+ s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
499
+ s = re.sub(r"[€£$]0\.([0-9]{1,2})\b", extract_cents, s)
500
+
501
+ # write "one(s)" instead of "1(s)", just for readability
502
+ s = re.sub(r"\b1(s?)\b", r"one\1", s)
503
+
504
+ return s
505
+
506
+ def __call__(self, s: str):
507
+ s = self.preprocess(s)
508
+ s = " ".join(word for word in self.process_words(s.split()) if word is not None)
509
+ s = self.postprocess(s)
510
+
511
+ return s
512
+
513
+
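The cents-merging step in `postprocess` can be tried in isolation. The sketch below is a standalone re-creation for illustration, not code from the commit; `merge_cents` and `CENTS_PATTERN` are hypothetical names. It assumes the earlier word-processing pass has already produced strings like "$2 and ¢7", as the comment in `postprocess` describes.

```python
import re

# Same pattern as the first substitution in `postprocess`:
# a currency symbol, an integer amount, an optional "and ", then "¢" plus 1-2 digits.
CENTS_PATTERN = re.compile(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b")


# Hypothetical helper name, used only in this sketch.
def merge_cents(text: str) -> str:
    def combine(m: re.Match) -> str:
        currency, integer, cents = m.group(1), m.group(2), int(m.group(3))
        # Zero-pad the cents so that 7 becomes "07".
        return f"{currency}{integer}.{cents:02d}"

    return CENTS_PATTERN.sub(combine, text)


print(merge_cents("it cost $2 and ¢7"))  # it cost $2.07
print(merge_cents("£10 ¢50 each"))       # £10.50 each
```

Zero-padding matters here: without `:02d`, seven cents would render as "$2.7", which reads as seventy cents.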
514
+ class EnglishSpellingNormalizer:
515
+ """
516
+ Applies British-American spelling mappings as listed in [1].
517
+
518
+ [1] https://www.tysto.com/uk-us-spelling-list.html
519
+ """
520
+
521
+ def __init__(self, english_spelling_mapping):
522
+ self.mapping = english_spelling_mapping
523
+
524
+ def __call__(self, s: str):
525
+ return " ".join(self.mapping.get(word, word) for word in s.split())
526
+
527
+
528
+ class EnglishAcronymNormalizer:
529
+ """
530
+ Collapse sequences of single-character tokens (letters or digits) into single words.
531
+
532
+ This normalizes acronym spacing so that both spaced-out and joined forms match:
533
+ - "b b c" -> "bbc"
534
+ - "5 g" -> "5g"
535
+
536
+ Lone single-character words surrounded by multi-character words are left untouched
537
+ (e.g. "a big cat" stays "a big cat").
538
+ """
539
+
540
+ def __call__(self, s: str) -> str:
541
+ words = s.split()
542
+ result = []
543
+ i = 0
544
+ while i < len(words):
545
+ if len(words[i]) == 1 and words[i].isalnum():
546
+ # Start of a potential acronym run
547
+ run = [words[i]]
548
+ j = i + 1
549
+ while j < len(words) and len(words[j]) == 1 and words[j].isalnum():
550
+ run.append(words[j])
551
+ j += 1
552
+ # Require 3+ tokens if the run contains common words "a" or "i",
553
+ # otherwise 2+ is enough (e.g. "5 g" -> "5g")
554
+ has_common_word = any(c in ("a", "i") for c in run)
555
+ min_run = 3 if has_common_word else 2
556
+ if len(run) >= min_run:
557
+ result.append("".join(run))
558
+ else:
559
+ result.extend(run)
560
+ i = j
561
+ else:
562
+ result.append(words[i])
563
+ i += 1
564
+ return " ".join(result)
565
+
566
+
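The run-collapsing rule described in the `EnglishAcronymNormalizer` docstring can also be expressed with `itertools.groupby`. This is an alternative sketch under the same rules, not the commit's implementation; `collapse_single_char_runs` is a hypothetical name.

```python
from itertools import groupby


# Hypothetical helper for illustration; the commit uses an explicit while loop instead.
def collapse_single_char_runs(text: str) -> str:
    out = []
    # Group consecutive tokens by whether they are single alphanumeric characters.
    for is_single, group in groupby(text.split(), key=lambda w: len(w) == 1 and w.isalnum()):
        run = list(group)
        if is_single:
            # Runs containing "a" or "i" need 3+ tokens, otherwise 2+ is enough.
            min_run = 3 if any(w in ("a", "i") for w in run) else 2
            if len(run) >= min_run:
                out.append("".join(run))
                continue
        out.extend(run)
    return " ".join(out)


print(collapse_single_char_runs("the b b c said"))  # the bbc said
print(collapse_single_char_runs("5 g network"))     # 5g network
print(collapse_single_char_runs("a big cat"))       # a big cat
```

The stricter threshold for "a" and "i" keeps ordinary two-word sequences such as "i a" from being glued together.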
567
+ class EnglishNameNormalizer:
568
+ """
569
+ Collapse common name spelling variants to a single canonical form.
570
+
571
+ This is intentionally conservative and token-based so it can be extended
572
+ with project-specific aliases when needed.
573
+ """
574
+
575
+ def __init__(self, english_name_mapping=english_name_normalizer):
576
+ self.mapping = english_name_mapping
577
+
578
+ def __call__(self, s: str):
579
+ return " ".join(self.mapping.get(word, word) for word in s.split())
580
+
581
+
582
+ class EnglishTextNormalizer:
583
+ def __init__(self, english_spelling_mapping=english_spelling_normalizer):
584
+ self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um|ah|aha|ahh|ahm|eh|ehehe|em|hm|huh|hum|mhum|uhm|umm|uhuh)\b"
585
+ self.replacers = {
586
+ # common contractions
587
+ r"\bwon't\b": "will not",
588
+ r"\bcan't\b": "can not",
589
+ r"\blet's\b": "let us",
590
+ r"\bain't\b": "aint",
591
+ r"\by'all\b": "you all",
592
+ r"\bwanna\b": "want to",
593
+ r"\bgotta\b": "got to",
594
+ r"\bgonna\b": "going to",
595
+ r"\bi'ma\b": "i am going to",
596
+ r"\bimma\b": "i am going to",
597
+ r"\bwoulda\b": "would have",
598
+ r"\bcoulda\b": "could have",
599
+ r"\bshoulda\b": "should have",
600
+ r"\bma'am\b": "madam",
601
+ # contractions in titles/prefixes
602
+ r"\bmr\b": "mister ",
603
+ r"\bmrs\b": "missus ",
604
+ r"\bst\b": "saint ",
605
+ r"\bdr\b": "doctor ",
606
+ r"\bprof\b": "professor ",
607
+ r"\bcapt\b": "captain ",
608
+ r"\bgov\b": "governor ",
609
+ r"\bald\b": "alderman ",
610
+ r"\bgen\b": "general ",
611
+ r"\bsen\b": "senator ",
612
+ r"\brep\b": "representative ",
613
+ r"\bpres\b": "president ",
614
+ r"\brev\b": "reverend ",
615
+ r"\bhon\b": "honorable ",
616
+ r"\basst\b": "assistant ",
617
+ r"\bassoc\b": "associate ",
618
+ r"\blt\b": "lieutenant ",
619
+ r"\bcol\b": "colonel ",
620
+ r"\bjr\b": "junior ",
621
+ r"\bsr\b": "senior ",
622
+ r"\besq\b": "esquire ",
623
+ # perfect tenses; ideally this would cover any past participle, but that is harder
624
+ r"'d been\b": " had been",
625
+ r"'s been\b": " has been",
626
+ r"'d gone\b": " had gone",
627
+ r"'s gone\b": " has gone",
628
+ r"'d done\b": " had done", # "'s done" is ambiguous
629
+ r"'s got\b": " has got",
630
+ # general contractions
631
+ r"n't\b": " not",
632
+ r"'re\b": " are",
633
+ r"\b(it|he|she|what|that|who|here|there|how|when|where|why|this)'s\b": r"\1 is",
634
+ r"'d\b": " would",
635
+ r"'ll\b": " will",
636
+ r"'t\b": " not",
637
+ r"'ve\b": " have",
638
+ r"'m\b": " am",
639
+ }
640
+ self.standardize_numbers = EnglishNumberNormalizer()
641
+ self.standardize_spellings = EnglishSpellingNormalizer(english_spelling_mapping)
642
+ self.standardize_names = EnglishNameNormalizer()
643
+ self.standardize_acronyms = EnglishAcronymNormalizer()
644
+ # Hardcoded compound words that become two tokens after hyphen/symbol removal
645
+ self.compound_words = {
646
+ r"\bwi\s+fi\b": "wifi",
647
+ r"\bhi\s+fi\b": "hifi",
648
+ r"\blo\s+fi\b": "lofi",
649
+ r"\bsci\s+fi\b": "scifi",
650
+ r"\be\s+mail\b": "email",
651
+ r"\be\s+book\b": "ebook",
652
+ r"\be\s+commerce\b": "ecommerce",
653
+ r"\bx\s+ray\b": "xray",
654
+ r"\bt\s+shirt\b": "tshirt",
655
+ r"\ba\s+m\b": "am",
656
+ r"\bp\s+m\b": "pm",
657
+ r"\bo\s+k\b": "okay",
658
+ }
659
+
660
+ def __call__(self, s: str):
661
+ s = s.lower()
662
+
663
+ s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
664
+ s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parentheses
665
+ s = re.sub(self.ignore_patterns, "", s)
666
+ s = re.sub(r"\s+'", "'", s) # standardize when there's a space before an apostrophe
667
+
668
+ for pattern, replacement in self.replacers.items():
669
+ s = re.sub(pattern, replacement, s)
670
+
671
+ s = re.sub(r"(\d),(\d)", r"\1\2", s) # remove commas between digits
672
+ s = re.sub(r"\.([^0-9]|$)", r" \1", s) # remove periods not followed by numbers
673
+ s = remove_symbols_and_diacritics(s, keep=".%$¢€£") # keep some symbols for numerics
674
+
675
+ # Normalize hardcoded compound words (e.g. "wi fi" -> "wifi" after hyphen removal)
676
+ for pattern, replacement in self.compound_words.items():
677
+ s = re.sub(pattern, replacement, s)
678
+
679
+ s = self.standardize_numbers(s)
680
+ s = self.standardize_spellings(s)
681
+ s = self.standardize_names(s)
682
+ s = self.standardize_acronyms(s)
683
+
684
+ # now remove prefix/suffix symbols that are not preceded/followed by numbers
685
+ s = re.sub(r"[.$¢€£]([^0-9])", r" \1", s)
686
+ s = re.sub(r"([^0-9])%", r"\1 ", s)
687
+
688
+ s = re.sub(r"\s+", " ", s) # replace any successive whitespace characters with a space
689
+
690
+ return s
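The order of entries in `self.replacers` matters, because the loop in `EnglishTextNormalizer.__call__` applies them one after another. The sketch below uses two patterns copied from that dictionary to show the effect; the function name `apply_replacers` and the reversed dictionary are assumptions made for this illustration only.

```python
import re

# Two patterns from `self.replacers`, in the order the commit lists them:
# the specific "won't" rule comes before the generic "n't" rule.
ordered = {r"\bwon't\b": "will not", r"n't\b": " not"}
# Same rules with the generic one first, to show what would go wrong.
generic_first = dict(reversed(list(ordered.items())))


# Hypothetical helper mirroring the replacement loop in __call__.
def apply_replacers(replacers: dict, text: str) -> str:
    for pattern, replacement in replacers.items():
        text = re.sub(pattern, replacement, text)
    return text


print(apply_replacers(ordered, "i won't go"))        # i will not go
print(apply_replacers(generic_first, "i won't go"))  # i wo not go
```

This relies on Python dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.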
transformers/requirements.txt ADDED
@@ -0,0 +1,8 @@
1
+ torch
2
+ transformers
3
+ evaluate
4
+ datasets
5
+ librosa
6
+ jiwer
7
+ num2words
8
+ peft
transformers/run_eval.py ADDED
@@ -0,0 +1,448 @@
1
+ import argparse
2
+ import os
3
+ import re
4
+ import torch
5
+ from torch.nn.attention import sdpa_kernel, SDPBackend
6
+ from transformers import AutoConfig, AutoModelForSpeechSeq2Seq, AutoModelForMultimodalLM, AutoModelForCTC, AutoProcessor, MODEL_FOR_MULTIMODAL_LM_MAPPING, MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, MODEL_FOR_CTC_MAPPING, CompileConfig
7
+ import evaluate
8
+ from normalizer import data_utils
9
+ from tqdm import tqdm
10
+ import random
11
+ import numpy as np
12
+
13
+ wer_metric = evaluate.load("wer")
14
+ torch.set_float32_matmul_precision('high')
15
+
16
+
17
+ def remove_brackets(text):
18
+ """
19
+ Remove parentheses from text, replacing them with spaces.
20
+
21
+ Some models (e.g. Cohere ASR) output parentheses that would cause the
22
+ normalizer to delete the enclosed text entirely, leading to false
23
+ deletion errors in the predictions.
24
+ """
25
+ text = text.replace("(", " ").replace(")", " ")
26
+ text = re.sub(r'\s+', ' ', text)
27
+ return text
28
+
29
+
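The interaction that `remove_brackets` guards against can be reproduced in a few lines. In this sketch, `remove_brackets` follows the helper above, while `drop_parenthesized` is a hypothetical stand-in for the single normalizer rule that deletes parenthesized spans; the sample sentence is invented.

```python
import re


def remove_brackets(text):
    # Same logic as the helper in run_eval.py: parentheses become spaces.
    text = text.replace("(", " ").replace(")", " ")
    return re.sub(r"\s+", " ", text)


# Hypothetical stand-in for the normalizer rule that deletes "(...)" with its content.
def drop_parenthesized(text):
    return re.sub(r"\(([^)]+?)\)", "", text)


raw = "turn left (at the light) now"
# Without the helper, the enclosed words vanish and count as deletions in WER.
print(drop_parenthesized(raw))
# With the helper applied first, the words survive normalization.
print(drop_parenthesized(remove_brackets(raw)))  # turn left at the light now
```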
30
+ def main(args):
31
+
32
+ # Set seed due to randomness in some models (e.g. VibeVoice's acoustic tokenizer sampling)
33
+ seed = 42
34
+ random.seed(seed)
35
+ np.random.seed(seed)
36
+ torch.manual_seed(seed)
37
+ torch.cuda.manual_seed_all(seed)
38
+ torch.backends.cudnn.deterministic = True
39
+
40
+ torch_dtype = getattr(torch, args.dtype)
41
+
42
+ config = AutoConfig.from_pretrained(args.model_id, revision=args.revision)
43
+ if type(config) in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING:
44
+ cls_model = AutoModelForSpeechSeq2Seq
45
+ elif type(config) in MODEL_FOR_MULTIMODAL_LM_MAPPING:
46
+ cls_model = AutoModelForMultimodalLM
47
+ elif type(config) in MODEL_FOR_CTC_MAPPING:
48
+ cls_model = AutoModelForCTC
49
+ else:
50
+ raise ValueError(f"Model config of type {type(config)} not recognized in Transformers mappings.")
51
+ is_ctc = cls_model == AutoModelForCTC
52
+
53
+ if "vibevoice" in args.model_id.lower():
54
+ model = cls_model.from_pretrained(
55
+ args.model_id,
56
+ dtype=torch_dtype,
57
+ attn_implementation={
58
+ "acoustic_tokenizer_encoder_config": "eager",
59
+ "semantic_tokenizer_encoder_config": "eager",
60
+ "text_config": "sdpa",
61
+ }
62
+ )
63
+ else:
64
+ model = cls_model.from_pretrained(
65
+ args.model_id,
66
+ dtype=torch_dtype,
67
+ revision=args.revision,
68
+ attn_implementation=args.attn_implementation,
69
+ )
70
+ model.to(args.device)
71
+ model.eval()
72
+ print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
73
+ processor = AutoProcessor.from_pretrained(args.model_id, revision=args.revision)
74
+ has_transcription_processor = hasattr(processor, "apply_transcription_request")
75
+ is_cohere = "cohere" in args.model_id.lower() and "transcribe" in args.model_id.lower()
76
+
77
+ # Optional prompt for audio language models; newer models should use `apply_transcription_request`
78
+ text = None
79
+ if "granite-speech-3.3" in args.model_id.lower():
80
+ # create text prompt
81
+ chat = [
82
+ {
83
+ "role": "system",
84
+ "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
85
+ },
86
+ {
87
+ "role": "user",
88
+ "content": "<|audio|>can you transcribe the speech into a written format?",
89
+ }
90
+ ]
91
+
92
+ text = processor.apply_chat_template(
93
+ chat, tokenize=False, add_generation_prompt=True
94
+ )
95
+
96
+ # Extract sampling rate
97
+ if hasattr(processor, "feature_extractor") and processor.feature_extractor is not None:
98
+ sampling_rate = processor.feature_extractor.sampling_rate
99
+ elif hasattr(processor, "audio_processor") and processor.audio_processor is not None:
100
+ sampling_rate = processor.audio_processor.sampling_rate
101
+ else:
102
+ sampling_rate = 16_000
103
+
104
+ # Set generate arguments
105
+ if model.can_generate():
106
+ gen_kwargs = {"max_new_tokens": args.max_new_tokens}
107
+ if getattr(model.generation_config, "is_multilingual", False):
108
+ gen_kwargs["language"] = "en"
109
+ gen_kwargs["task"] = "transcribe"
110
+ # Clear deprecated Whisper generation config fields to suppress warnings
111
+ if hasattr(model.generation_config, "forced_decoder_ids"):
112
+ model.generation_config.forced_decoder_ids = None
113
+ if hasattr(model.generation_config, "suppress_tokens"):
114
+ model.generation_config.suppress_tokens = []
115
+ if hasattr(model.generation_config, "begin_suppress_tokens"):
116
+ model.generation_config.begin_suppress_tokens = []
117
+ if "granite-speech-3.3" in args.model_id.lower():
118
+ gen_kwargs["repetition_penalty"] = 1.0
119
+ elif args.max_new_tokens:
120
+ raise ValueError("`max_new_tokens` should only be set for auto-regressive models, but got a CTC model.")
121
+
122
+ if args.torch_compile is not None:
123
+ if model.can_generate():
124
+ gen_kwargs["compile_config"] = CompileConfig(mode=args.torch_compile, fullgraph=args.compile_fullgraph)
125
+ # enable static k/v cache for autoregressive models
126
+ model.generation_config.cache_implementation = "static"
127
+ else:
128
+ model = torch.compile(model, mode=args.torch_compile, fullgraph=args.compile_fullgraph)
129
+
130
+ # Ensure warm-up runs when using torch.compile
131
+ if args.warmup_steps is None or args.warmup_steps < 1:
132
+ print("`--torch_compile` is enabled; forcing `--warmup_steps=10` to trigger compilation before timed runs.")
133
+ args.warmup_steps = 10
134
+
135
+ def benchmark(batch, min_new_tokens=None):
136
+ # Load audio inputs
137
+ audios = [audio["array"] for audio in batch["audio"]]
138
+ minibatch_size = len(audios)
139
+ sampling_rate = batch["audio"][0]["sampling_rate"]
140
+ batch["audio_length_s"] = [len(audio) / sampling_rate for audio in audios]
141
+ batch["audio_filepath"] = data_utils.extract_audio_filepaths_from_batch(batch, minibatch_size)
142
+ if text is not None:
143
+ texts = [text] * minibatch_size
144
+ else:
145
+ texts = None
146
+
147
+ # START TIMING
148
+ torch.cuda.synchronize(device=args.device)
149
+ start_event = torch.cuda.Event(enable_timing=True)
150
+ end_event = torch.cuda.Event(enable_timing=True)
151
+ start_event.record()
152
+
153
+ # 1. Pre-Processing
154
+ # 1.1 Pad audios to max batch size if using torch compile to prevent re-compilations
155
+ padding_size = None
156
+ if minibatch_size != args.batch_size and args.torch_compile is not None:
157
+ padding_size = args.batch_size - minibatch_size
158
+ padding_audios = [audios[-1] for _ in range(padding_size)]
159
+ audios.extend(padding_audios)
160
+
161
+ if is_cohere:
162
+ inputs = processor(
163
+ audios,
164
+ sampling_rate=sampling_rate,
165
+ return_tensors="pt",
166
+ language="en",
167
+ punctuation=False,
168
+ )
169
+ elif has_transcription_processor:
170
+ if "voxtral" in args.model_id.lower():
171
+ inputs = processor.apply_transcription_request(
172
+ language="en", # English for benchmark consistency
173
+ audio=audios,
174
+ model_id=args.model_id,
175
+ sampling_rate=sampling_rate,
176
+ format=["wav"] * len(audios),
177
+ )
178
+ else:
179
+ inputs = processor.apply_transcription_request(audios)
180
+ prompt_len = inputs["input_ids"].shape[1]
181
+ elif texts is not None:
182
+ inputs = processor(
183
+ texts,
184
+ audios,
185
+ device=args.device, # Computation device; returned tensors are put on CPU
186
+ return_tensors="pt",
187
+ ).to(args.device)
188
+ prompt_len = inputs["input_ids"].shape[1]
189
+ elif not model.can_generate(): #or len(audios[0]) > processor.feature_extractor.n_samples:
190
+ # 1.2 Either CTC pre-processing (normalize to mean 0, std 1), or long-form Whisper processing
191
+ inputs = processor(
192
+ audios,
193
+ sampling_rate=sampling_rate,
194
+ truncation=False,
195
+ padding="longest",
196
+ return_tensors="pt",
197
+ return_attention_mask=True,
198
+ )
199
+ else:
200
+ # 1.3 Standard Whisper processing: pad audios to 30 seconds and convert to log-mel
201
+ if args.longform:
202
+ inputs = processor(
203
+ audios,
204
+ sampling_rate=sampling_rate,
205
+ return_tensors="pt",
206
+ truncation=False,
207
+ padding="longest",
208
+ return_attention_mask=True,
209
+ )
210
+ else:
211
+ inputs = processor(audios, sampling_rate=sampling_rate, return_tensors="pt", padding="longest", return_attention_mask=True, device=args.device)
212
+
213
+ inputs = inputs.to(args.device, dtype=torch_dtype)
214
+
215
+ # 2. Model Inference
216
+ if args.torch_compile is not None:
217
+ sdpa_backends = [SDPBackend.MATH]
218
+ else:
219
+ sdpa_backends = [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
220
+ with sdpa_kernel(sdpa_backends):
221
+ if model.can_generate():
222
+ # 2.1 Auto-regressive generation for LM-based models
223
+ if args.longform:
224
+ pred_ids = model.generate(**inputs, **gen_kwargs, return_timestamps=True)
225
+ else:
226
+ pred_ids = model.generate(**inputs, **gen_kwargs, min_new_tokens=min_new_tokens)
227
+ else:
228
+ # 2.2. Single forward pass for CTC
229
+ with torch.no_grad():
230
+ logits = model(**inputs).logits
231
+ pred_ids = logits.argmax(-1)
232
+
233
+ # 3. Post-processing
234
+ # 3.1 Strip padded ids from predictions
235
+ if padding_size is not None:
236
+ pred_ids = pred_ids[:-padding_size, ...]
237
+
238
+ # 3.2 Convert token ids to text transcription
239
+ if is_cohere:
240
+ audio_chunk_index = inputs.get("audio_chunk_index")
241
+ pred_text = processor.decode(
242
+ pred_ids, skip_special_tokens=True,
243
+ audio_chunk_index=audio_chunk_index, language="en",
244
+ )
245
+ pred_text = [remove_brackets(t) for t in pred_text]
246
+ elif "vibevoice" in args.model_id.lower():
247
+ # VibeVoice: strip the input prompt tokens then use the model's own decode API
248
+ generated_ids = pred_ids[:, prompt_len:]
249
+ try:
250
+ pred_text = processor.decode(generated_ids, return_format="transcription_only")
251
+ except Exception as e:
252
+ print(f"Batch decoding failed with error: {e}. Falling back to individual sample decoding.")
253
+ pred_text = []
254
+ for i, sample_ids in enumerate(generated_ids):
255
+ try:
256
+ decoded = processor.decode(sample_ids.unsqueeze(0), return_format="transcription_only")
257
+ pred_text.append(decoded[0] if isinstance(decoded, list) else decoded)
258
+ except Exception as sample_error:
259
+ print(f"Sample {i} decoding failed with error: {sample_error}. Setting to empty transcript.")
260
+ pred_text.append("")
261
+ elif has_transcription_processor or texts is not None:
262
+ # Strip input prompt tokens
263
+ pred_text = processor.decode(pred_ids[:, prompt_len:], skip_special_tokens=True)
264
+ elif is_ctc:
265
+ # don't use skip_special_tokens as it collapses double letters
266
+ pred_text = processor.batch_decode(pred_ids)
267
+ else:
268
+ pred_text = processor.decode(pred_ids, skip_special_tokens=True)
269
+
270
+ # END TIMING
271
+ end_event.record()
272
+ torch.cuda.synchronize(device=args.device)
273
+ runtime = start_event.elapsed_time(end_event) / 1000.0
274
+
275
+ # normalize by minibatch size since we want the per-sample time
276
+ batch["transcription_time_s"] = minibatch_size * [runtime / minibatch_size]
277
+
278
+ # normalize transcriptions with English normalizer
279
+ batch["predictions"] = [data_utils.normalizer(pred) for pred in pred_text]
280
+ batch["references"] = batch["norm_text"]
281
+ return batch
282
+
283
+ if args.warmup_steps is not None:
284
+ dataset = data_utils.load_data(args)
285
+ dataset = data_utils.prepare_data(dataset, sampling_rate=sampling_rate)
286
+
287
+ num_warmup_samples = args.warmup_steps * args.batch_size
288
+ if args.streaming:
289
+ warmup_dataset = dataset.take(num_warmup_samples)
290
+ else:
291
+ warmup_dataset = dataset.select(range(min(num_warmup_samples, len(dataset))))
292
+ warmup_dataset = iter(warmup_dataset.map(benchmark, batch_size=args.batch_size, batched=True, fn_kwargs={"min_new_tokens": args.max_new_tokens}))
293
+
294
+ for _ in tqdm(warmup_dataset, desc="Warming up..."):
295
+ continue
296
+
297
+ dataset = data_utils.load_data(args)
298
+ if args.max_eval_samples is not None and args.max_eval_samples > 0:
299
+ print(f"Subsampling dataset to first {args.max_eval_samples} samples!")
300
+ if args.streaming:
301
+ dataset = dataset.take(args.max_eval_samples)
302
+ else:
303
+ dataset = dataset.select(range(min(args.max_eval_samples, len(dataset))))
304
+ dataset = data_utils.prepare_data(dataset, sampling_rate=sampling_rate)
305
+
306
+ dataset = dataset.map(
307
+ benchmark, batch_size=args.batch_size, batched=True, remove_columns=["audio"],
308
+ )
309
+
310
+ all_results = {
311
+ "audio_length_s": [],
312
+ "transcription_time_s": [],
313
+ "predictions": [],
314
+ "references": [],
315
+ "audio_filepath": [],
316
+ }
317
+ result_iter = iter(dataset)
318
+ for result in tqdm(result_iter, desc="Samples..."):
319
+ for key in all_results:
320
+ all_results[key].append(result[key])
321
+
322
+ # Write manifest results (WER and RTFx)
323
+ # Filtering of empty references is handled inside write_manifest.
324
+ manifest_path = data_utils.write_manifest(
325
+ all_results["references"],
326
+ all_results["predictions"],
327
+ args.model_id,
328
+ args.dataset_path,
329
+ args.dataset,
330
+ args.split,
331
+ audio_length=all_results["audio_length_s"],
332
+ transcription_time=all_results["transcription_time_s"],
333
+ audio_filepaths=all_results["audio_filepath"],
334
+ )
335
+ print("Results saved at path:", os.path.abspath(manifest_path))
336
+
337
+ wer = wer_metric.compute(
338
+ references=all_results["references"], predictions=all_results["predictions"]
339
+ )
340
+ wer = round(100 * wer, 2)
341
+ rtfx = round(sum(all_results["audio_length_s"]) / sum(all_results["transcription_time_s"]), 2)
342
+ print("WER:", wer, "%", "RTFx:", rtfx)
343
+
344
+
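The timing and RTFx arithmetic used in `main` above can be sketched in isolation. This is a minimal illustration, not part of the commit: the helper names `per_sample_times` and `rtfx` are hypothetical, and the numbers are made up. It only mirrors the two expressions in the script: the batch runtime is split evenly across the samples in the batch, and RTFx is total audio seconds divided by total transcription seconds.

```python
def per_sample_times(batch_runtime_s, minibatch_size):
    """Spread one batch's measured runtime evenly across its samples."""
    return minibatch_size * [batch_runtime_s / minibatch_size]


def rtfx(audio_lengths_s, transcription_times_s):
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return round(sum(audio_lengths_s) / sum(transcription_times_s), 2)


# Two batches of two samples each (illustrative values only).
audio_lengths = [10.0, 20.0, 30.0, 40.0]
times = per_sample_times(2.0, 2) + per_sample_times(3.0, 2)
score = rtfx(audio_lengths, times)
print("RTFx:", score)
```

Because RTFx is a ratio of sums, the even split per sample does not change the final score; it only gives each manifest row a usable time value.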
345
+ if __name__ == "__main__":
346
+ parser = argparse.ArgumentParser()
347
+
348
+ parser.add_argument(
349
+ "--model_id",
350
+ type=str,
351
+ required=True,
352
+ help="Model identifier. Should be loadable with 🤗 Transformers",
353
+ )
354
+ parser.add_argument(
355
+ "--dataset_path",
356
+ type=str,
357
+ default="hf-audio/open-asr-leaderboard",
358
+ help="Dataset path. By default, it is `hf-audio/open-asr-leaderboard`",
359
+ )
360
+ parser.add_argument(
361
+ "--dataset",
362
+ type=str,
363
+ required=True,
364
+ help="Dataset name. *E.g.* `'librispeech_asr'` for the LibriSpeech ASR dataset, or `'common_voice'` for Common Voice. The full list of dataset names "
365
+ "can be found at `https://huggingface.co/datasets/hf-audio/open-asr-leaderboard`",
366
+ )
367
+ parser.add_argument(
368
+ "--split",
369
+ type=str,
370
+ default="test",
371
+ help="Split of the dataset. *E.g.* `'validation'` for the dev split, or `'test'` for the test split.",
372
+ )
373
+ parser.add_argument(
374
+ "--device",
375
+ type=int,
376
+ default=-1,
377
+ help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
378
+ )
379
+ parser.add_argument(
380
+ "--batch_size",
381
+ type=int,
382
+ default=16,
383
+ help="Number of samples per batch.",
384
+ )
385
+ parser.add_argument(
386
+ "--max_eval_samples",
387
+ type=int,
388
+ default=None,
389
+ help="Number of samples to be evaluated. Use a small number, e.g. 64, to test this script.",
390
+ )
391
+ parser.add_argument(
392
+ "--streaming",
393
+ action="store_true",
394
+ help="Stream the dataset lazily over the network instead of downloading it in full before the evaluation. Off by default for reproducible benchmark timings.",
395
+ )
396
+ parser.add_argument(
397
+ "--max_new_tokens",
398
+ type=int,
399
+ default=None,
400
+ help="Maximum number of tokens to generate (for auto-regressive models).",
401
+ )
402
+ parser.add_argument(
403
+ "--longform",
404
+ action="store_true",
405
+ help="Whether to use long-form mode, which passes `return_timestamps=True` to `generate`.",
406
+ )
407
+ parser.add_argument(
408
+ "--torch_compile",
409
+ type=str,
410
+ default=None,
411
+ help="Mode for torch compiling model forward pass. Can be either 'default', 'reduce-overhead', 'max-autotune' or 'max-autotune-no-cudagraphs'.",
412
+ )
413
+ parser.add_argument(
414
+ "--compile_fullgraph",
415
+ action="store_true",
416
+ help="Whether to do full graph compilation.",
417
+ )
418
+ parser.add_argument(
419
+ "--dtype",
420
+ type=str,
421
+ default="bfloat16",
422
+ help="The dtype to use for model loading and inference. E.g. 'bfloat16', 'float16', 'float32'.",
423
+ )
424
+ parser.add_argument(
425
+ "--attn_implementation",
426
+ type=str,
427
+ default="sdpa",
428
+ help="Attention implementation to use for model loading (e.g. 'sdpa', 'eager', 'flash_attention_2').",
429
+ )
430
+ parser.add_argument(
431
+ "--warmup_steps",
432
+ type=int,
433
+ default=10,
434
+ help="Number of warm-up steps to run before launching the timed runs.",
435
+ )
436
+ parser.add_argument(
437
+ "--revision",
438
+ type=str,
439
+ default=None,
440
+ help="Model revision to use (e.g. 'refs/pr/11' for a PR branch). Defaults to the main branch.",
441
+ )
442
+ args = parser.parse_args()
443
+
444
+ print("*" * 100)
445
+ print(f"Evaluating {args.model_id} on {args.dataset_path} / {args.dataset} / {args.split}")
446
+ print("*" * 100)
447
+
448
+ main(args)
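The batch-padding step in `benchmark` (repeat the last audio until the batch is full, then drop the extra predictions) can be sketched without any model. This is a simplified illustration under stated assumptions: `pad_to_batch_size` and `strip_padding` are hypothetical helpers, plain lists stand in for audio arrays and prediction tensors, and the script itself only pads when `--torch_compile` is set.

```python
def pad_to_batch_size(items, batch_size):
    """Repeat the last item until the batch is full.

    Returns (padded_items, padding_size); padding_size is None when the
    batch is already full, matching the `padding_size = None` default above.
    """
    padding_size = batch_size - len(items)
    if padding_size <= 0:
        return list(items), None
    return list(items) + [items[-1]] * padding_size, padding_size


def strip_padding(predictions, padding_size):
    """Drop the trailing predictions that correspond to padded inputs."""
    if padding_size is None:
        return predictions
    return predictions[:-padding_size]


# Last, incomplete batch of a dataset: 3 samples with batch_size=4.
padded, n_pad = pad_to_batch_size(["a", "b", "c"], 4)
preds = strip_padding(["A", "B", "C", "C"], n_pad)
print(padded, n_pad, preds)
```

Keeping every batch at the same size presumably avoids a recompilation for the final, shorter batch, which is the reason the script's own comment gives for padding.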
transformers/run_eval_ml.py ADDED
@@ -0,0 +1,389 @@
1
+ import argparse
2
+ import os
3
+ import torch
4
+ from torch.nn.attention import sdpa_kernel, SDPBackend
5
+ from transformers import AutoConfig, AutoModelForSpeechSeq2Seq, AutoModelForMultimodalLM, AutoModelForCTC, AutoProcessor, MODEL_FOR_MULTIMODAL_LM_MAPPING, MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, MODEL_FOR_CTC_MAPPING, CompileConfig
6
+ import evaluate
7
+ from normalizer import data_utils
8
+ from normalizer.eval_utils import normalize_compound_pairs
9
+ from tqdm import tqdm
10
+ from datasets import load_dataset, Audio
11
+ import random
12
+ import numpy as np
13
+
14
+ wer_metric = evaluate.load("wer")
15
+ torch.set_float32_matmul_precision('high')
16
+
17
+
18
+ def main(args):
19
+
20
+ # Set seed for reproducibility
21
+ seed = 42
22
+ random.seed(seed)
23
+ np.random.seed(seed)
24
+ torch.manual_seed(seed)
25
+ torch.cuda.manual_seed_all(seed)
26
+ torch.backends.cudnn.deterministic = True
27
+
28
+ torch_dtype = getattr(torch, args.dtype)
29
+
30
+ config = AutoConfig.from_pretrained(args.model_id)
31
+ if type(config) in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING:
32
+ cls_model = AutoModelForSpeechSeq2Seq
33
+ elif type(config) in MODEL_FOR_MULTIMODAL_LM_MAPPING:
34
+ cls_model = AutoModelForMultimodalLM
35
+ elif type(config) in MODEL_FOR_CTC_MAPPING:
36
+ cls_model = AutoModelForCTC
37
+ else:
38
+ raise ValueError(f"Model config of type {type(config)} not recognized in Transformers mappings.")
39
+ is_ctc = cls_model == AutoModelForCTC
40
+
41
+ model = cls_model.from_pretrained(
42
+ args.model_id,
43
+ dtype=torch_dtype,
44
+ attn_implementation=args.attn_implementation,
45
+ )
46
+ model.to(args.device)
47
+ model.eval()
48
+ print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
49
+ processor = AutoProcessor.from_pretrained(args.model_id)
50
+ has_transcription_processor = hasattr(processor, "apply_transcription_request")
51
+
52
+ # Extract sampling rate from processor
53
+ if hasattr(processor, "feature_extractor") and processor.feature_extractor is not None:
54
+ sampling_rate = processor.feature_extractor.sampling_rate
55
+ elif hasattr(processor, "audio_processor") and processor.audio_processor is not None:
56
+ sampling_rate = processor.audio_processor.sampling_rate
57
+ else:
58
+ sampling_rate = 16_000
59
+
60
+ # Set generate arguments (only for auto-regressive models)
61
+ if model.can_generate():
62
+ gen_kwargs = {}
63
+ if args.max_new_tokens is not None:
64
+ gen_kwargs["max_new_tokens"] = args.max_new_tokens
65
+
66
+ # For multilingual models, set task to transcribe and pass language (None = auto-detect)
67
+ if getattr(model.generation_config, "is_multilingual", False):
68
+ gen_kwargs["task"] = "transcribe"
69
+ if args.language is not None:
70
+ gen_kwargs["language"] = args.language
71
+ elif args.max_new_tokens:
72
+ raise ValueError("`max_new_tokens` should only be set for auto-regressive models, but got a CTC model.")
73
+
74
+ CONFIG_NAME = args.config_name
75
+ SPLIT_NAME = args.split
76
+
77
+ # Determine language for normalization: use --language if provided, otherwise extract from config_name
78
+ if args.language is not None:
79
+ norm_language = args.language
80
+ else:
81
+ try:
82
+ norm_language = CONFIG_NAME.split("_", 1)[1]
83
+ except IndexError:
84
+ norm_language = "en"
85
+ print(f"Language not specified, extracted '{norm_language}' from config_name '{CONFIG_NAME}'")
86
+
87
+ if args.torch_compile is not None:
88
+ if model.can_generate():
89
+ gen_kwargs["compile_config"] = CompileConfig(mode=args.torch_compile, fullgraph=args.compile_fullgraph)
90
+ model.generation_config.cache_implementation = "static"
91
+ else:
92
+ model = torch.compile(model, mode=args.torch_compile, fullgraph=args.compile_fullgraph)
93
+
94
+ # Ensure warm-up runs when using torch.compile
95
+ if args.warmup_steps is None or args.warmup_steps < 1:
96
+ print("`--torch_compile` is enabled; forcing `--warmup_steps=10` to trigger compilation before timed runs.")
97
+ args.warmup_steps = 10
98
+
99
+ # Load dataset
100
+ print(f"Loading dataset: {args.dataset} with config: {CONFIG_NAME}")
101
+ dataset = load_dataset(
102
+ args.dataset,
103
+ CONFIG_NAME,
104
+ split=SPLIT_NAME,
105
+ streaming=args.streaming,
106
+ token=True,
107
+ )
108
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
109
+
110
+ if args.max_eval_samples is not None and args.max_eval_samples > 0:
111
+ print(f"Subsampling dataset to first {args.max_eval_samples} samples!")
112
+ if args.streaming:
113
+ dataset = dataset.take(args.max_eval_samples)
114
+ else:
115
+ dataset = dataset.select(range(min(args.max_eval_samples, len(dataset))))
116
+
117
+ def benchmark(batch, min_new_tokens=None):
118
+ audios = [audio["array"] for audio in batch["audio"]]
119
+ minibatch_size = len(audios)
120
+ sampling_rate = batch["audio"][0]["sampling_rate"]
121
+ batch["audio_length_s"] = [len(audio) / sampling_rate for audio in audios]
122
+ batch["audio_filepath"] = data_utils.extract_audio_filepaths_from_batch(batch, minibatch_size)
123
+
124
+ # START TIMING
125
+ torch.cuda.synchronize(device=args.device)
126
+ start_event = torch.cuda.Event(enable_timing=True)
127
+ end_event = torch.cuda.Event(enable_timing=True)
128
+ start_event.record()
129
+
130
+ # 1. Pre-Processing
131
+ # Pad audios to max batch size if using torch compile to prevent re-compilations
132
+ padding_size = None
133
+ if minibatch_size != args.batch_size and args.torch_compile is not None:
134
+ padding_size = args.batch_size - minibatch_size
135
+ padding_audios = [audios[-1] for _ in range(padding_size)]
136
+ audios.extend(padding_audios)
137
+
138
+ if has_transcription_processor:
139
+ if "voxtral" in args.model_id.lower():
140
+ inputs = processor.apply_transcription_request(
141
+ language=args.language, # None = auto-detect
142
+ audio=audios,
143
+ model_id=args.model_id,
144
+ sampling_rate=sampling_rate,
145
+ format=["wav"] * len(audios),
146
+ )
147
+ else:
148
+ inputs = processor.apply_transcription_request(audios)
149
+ prompt_len = inputs["input_ids"].shape[1]
150
+ elif not model.can_generate():
151
+ # CTC pre-processing: normalize to mean 0, std 1
152
+ inputs = processor(
153
+ audios,
154
+ sampling_rate=sampling_rate,
155
+ truncation=False,
156
+ padding="longest",
157
+ return_tensors="pt",
158
+ return_attention_mask=True,
159
+ )
160
+ else:
161
+ # Standard Whisper processing: pad audios to 30-seconds and convert to log-mel
162
+ inputs = processor(
163
+ audios,
164
+ sampling_rate=sampling_rate,
165
+ return_tensors="pt",
166
+ padding="longest",
167
+ return_attention_mask=True,
168
+ device=args.device,
169
+ )
170
+
171
+ inputs = inputs.to(args.device, dtype=torch_dtype)
172
+
173
+ # 2. Model Inference
174
+ if args.torch_compile is not None:
175
+ sdpa_backends = [SDPBackend.MATH]
176
+ else:
177
+ sdpa_backends = [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
178
+ with sdpa_kernel(sdpa_backends):
179
+ if model.can_generate():
180
+ pred_ids = model.generate(**inputs, **gen_kwargs, min_new_tokens=min_new_tokens)
181
+ else:
182
+ # Single forward pass for CTC
183
+ with torch.no_grad():
184
+ logits = model(**inputs).logits
185
+ pred_ids = logits.argmax(-1)
186
+
187
+ # 3. Post-processing
188
+ # Strip padded ids from predictions
189
+ if padding_size is not None:
190
+ pred_ids = pred_ids[:-padding_size, ...]
191
+
192
+ # Convert token ids to text transcription
193
+ if has_transcription_processor:
194
+ pred_text = processor.batch_decode(pred_ids[:, prompt_len:], skip_special_tokens=True)
195
+ elif is_ctc:
196
+ # don't use skip_special_tokens as it collapses double letters
197
+ pred_text = processor.batch_decode(pred_ids)
198
+ else:
199
+ pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
200
+
201
+ # END TIMING
202
+ end_event.record()
203
+ torch.cuda.synchronize(device=args.device)
204
+ runtime = start_event.elapsed_time(end_event) / 1000.0
205
+
206
+ batch["transcription_time_s"] = minibatch_size * [runtime / minibatch_size]
207
+
208
+ # Normalize with multilingual normalizer
209
+ batch["predictions"] = [data_utils.ml_normalizer(pred, lang=norm_language) for pred in pred_text]
210
+ batch["references"] = [data_utils.ml_normalizer(ref, lang=norm_language) for ref in batch["text"]]
211
+
212
+ return batch
213
+
214
+ if args.warmup_steps is not None and args.warmup_steps > 0:
215
+ print(f"Running {args.warmup_steps} warmup steps...")
216
+ num_warmup_samples = args.warmup_steps * args.batch_size
217
+ if args.streaming:
218
+ warmup_dataset = dataset.take(num_warmup_samples)
219
+ else:
220
+ warmup_dataset = dataset.select(range(min(num_warmup_samples, len(dataset))))
221
+ warmup_dataset = iter(warmup_dataset.map(
222
+ benchmark, batch_size=args.batch_size, batched=True,
223
+ fn_kwargs={"min_new_tokens": args.max_new_tokens}
224
+ ))
225
+ for _ in tqdm(warmup_dataset, desc="Warming up..."):
226
+ continue
227
+
228
+ # Reload dataset for actual evaluation (reset streaming pointer)
229
+ dataset = load_dataset(
230
+ args.dataset,
231
+ CONFIG_NAME,
232
+ split=SPLIT_NAME,
233
+ streaming=args.streaming,
234
+ token=True,
235
+ )
236
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
237
+
238
+ if args.max_eval_samples is not None and args.max_eval_samples > 0:
239
+ if args.streaming:
240
+ dataset = dataset.take(args.max_eval_samples)
241
+ else:
242
+ dataset = dataset.select(range(min(args.max_eval_samples, len(dataset))))
243
+
244
+ dataset = dataset.map(
245
+ benchmark, batch_size=args.batch_size, batched=True, remove_columns=["audio"],
246
+ )
247
+
248
+ all_results = {
249
+ "audio_length_s": [],
250
+ "transcription_time_s": [],
251
+ "predictions": [],
252
+ "references": [],
253
+ "audio_filepath": [],
254
+ }
255
+
256
+ result_iter = iter(dataset)
257
+ for result in tqdm(result_iter, desc="Samples..."):
258
+ for key in all_results:
259
+ all_results[key].append(result[key])
260
+
261
+ # Filter empty references (consistent with English pipeline)
262
+ filtered = [
263
+ (ref, pred, dur, time_s, fpath)
264
+ for ref, pred, dur, time_s, fpath in zip(
265
+ all_results["references"], all_results["predictions"],
266
+ all_results["audio_length_s"], all_results["transcription_time_s"],
267
+ all_results["audio_filepath"]
268
+ )
269
+ if data_utils.is_target_text_in_range(ref)
270
+ ]
271
+ if filtered:
272
+ all_results["references"], all_results["predictions"], all_results["audio_length_s"], all_results["transcription_time_s"], all_results["audio_filepath"] = zip(*filtered)
273
+ all_results = {k: list(v) for k, v in all_results.items()}
274
+
275
+ # Write manifest results (WER and RTFx)
276
+ manifest_path = data_utils.write_manifest(
277
+ all_results["references"],
278
+ all_results["predictions"],
279
+ args.model_id,
280
+ args.dataset,
281
+ CONFIG_NAME,
282
+ args.split,
283
+ audio_length=all_results["audio_length_s"],
284
+ transcription_time=all_results["transcription_time_s"],
285
+ audio_filepaths=all_results["audio_filepath"],
286
+ )
287
+ print("Results saved at path:", os.path.abspath(manifest_path))
288
+
289
+ wer_refs, wer_preds = normalize_compound_pairs(all_results["references"], all_results["predictions"])
290
+ wer = wer_metric.compute(references=wer_refs, predictions=wer_preds)
291
+ wer = round(100 * wer, 2)
292
+ rtfx = round(sum(all_results["audio_length_s"]) / sum(all_results["transcription_time_s"]), 2)
293
+ print("WER:", wer, "%", "RTFx:", rtfx)
294
+
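Two small pieces of logic in `main` above are easy to check in isolation: deriving the normalization language from the config name, and dropping samples with empty references before scoring. The sketch below is illustrative only. `resolve_norm_language` and `drop_empty_references` are hypothetical names, and the `ref.strip()` check is an assumed stand-in for `data_utils.is_target_text_in_range`, whose implementation is not shown here. Unlike the script, this sketch returns empty lists when no sample survives the filter.

```python
def resolve_norm_language(config_name, language=None):
    """Prefer an explicit language; else take the text after the first '_'."""
    if language is not None:
        return language
    try:
        return config_name.split("_", 1)[1]
    except IndexError:
        # No underscore in the config name: fall back to English.
        return "en"


def drop_empty_references(references, predictions):
    """Keep only (reference, prediction) pairs whose reference is non-empty."""
    kept = [(r, p) for r, p in zip(references, predictions) if r.strip()]
    if not kept:
        return [], []
    refs, preds = zip(*kept)
    return list(refs), list(preds)


lang = resolve_norm_language("fleurs_de")
refs, preds = drop_empty_references(["hallo welt", "  ", "ja"], ["hallo welt", "x", "ja"])
print(lang, refs, preds)
```

Note that splitting on the first underscore only means a config such as `mls_pt_br` would yield `pt_br`, so passing `--language` explicitly is the safer choice for multi-part names.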
295
+
296
+ if __name__ == "__main__":
297
+ parser = argparse.ArgumentParser()
298
+
299
+ parser.add_argument(
300
+ "--model_id",
301
+ type=str,
302
+ required=True,
303
+ help="Model identifier. Should be loadable with Transformers",
304
+ )
305
+ parser.add_argument(
306
+ "--dataset",
307
+ type=str,
308
+ required=True,
309
+ help="Dataset name. E.g. 'nithinraok/asr-leaderboard-datasets'",
310
+ )
311
+ parser.add_argument(
312
+ "--config_name",
313
+ type=str,
314
+ required=True,
315
+ help="Config name for the dataset. E.g. 'fleurs_de' for German FLEURS.",
316
+ )
317
+ parser.add_argument(
318
+ "--language",
319
+ type=str,
320
+ default=None,
321
+ help="Language code, e.g. 'de' for German. If not set, the model auto-detects the language and the normalizer language is taken from the suffix of --config_name.",
322
+ )
323
+ parser.add_argument(
324
+ "--split",
325
+ type=str,
326
+ default="test",
327
+ help="Split of the dataset.",
328
+ )
329
+ parser.add_argument(
330
+ "--device",
331
+ type=int,
332
+ default=-1,
333
+ help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
334
+ )
335
+ parser.add_argument(
336
+ "--batch_size",
337
+ type=int,
338
+ default=64,
339
+ help="Number of samples per batch.",
340
+ )
341
+ parser.add_argument(
342
+ "--max_eval_samples",
343
+ type=int,
344
+ default=None,
345
+ help="Number of samples to be evaluated. Use a small number, e.g. 64, to test this script.",
346
+ )
347
+ parser.add_argument(
348
+ "--streaming",
349
+ action="store_true",
350
+ help="Stream the dataset lazily over the network instead of downloading it in full before the evaluation. Off by default for reproducible benchmark timings.",
351
+ )
352
+ parser.add_argument(
353
+ "--max_new_tokens",
354
+ type=int,
355
+ default=None,
356
+ help="Maximum number of tokens to generate.",
357
+ )
358
+ parser.add_argument(
359
+ "--torch_compile",
360
+ type=str,
361
+ default=None,
362
+ help="Mode for torch compiling model forward pass. Can be either 'default', 'reduce-overhead', 'max-autotune' or 'max-autotune-no-cudagraphs'.",
363
+ )
364
+ parser.add_argument(
365
+ "--compile_fullgraph",
366
+ action="store_true",
367
+ help="Whether to do full graph compilation.",
368
+ )
369
+ parser.add_argument(
370
+ "--dtype",
371
+ type=str,
372
+ default="bfloat16",
373
+ help="The dtype to use for model loading and inference. E.g. 'bfloat16', 'float16', 'float32'.",
374
+ )
375
+ parser.add_argument(
376
+ "--attn_implementation",
377
+ type=str,
378
+ default="sdpa",
379
+ help="Attention implementation to use for model loading (e.g. 'sdpa', 'eager', 'flash_attention_2').",
380
+ )
381
+ parser.add_argument(
382
+ "--warmup_steps",
383
+ type=int,
384
+ default=10,
385
+ help="Number of warm-up steps to run before launching the timed runs.",
386
+ )
387
+ args = parser.parse_args()
388
+
389
+ main(args)