---
language:
- en
license: other
license_name: cc-by-nc-4.0-derived
base_model: google-bert/bert-base-cased
library_name: transformers
pipeline_tag: token-classification
tags:
- finance
- terminology
- term-extraction
- token-classification
- bert
- english
- ner
datasets:
- wmt-2025-terminology
---

# BERT Finance Term Extractor (English)

A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.

---

## 🧠 Model Description

This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.

It performs token-level classification (NER-style) to identify financial terms in text. It is designed primarily for translation workflows, terminology mining, and domain-specific NLP pipelines.

---

## 🏗️ Training Pipeline

The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.

### Data Processing

- Input format: **CoNLL-style token-tag sequences**
- Sentences are split by blank lines
- Labels are converted into integer IDs (`label2id`, `id2label`)
- Automatic **train/dev split** using a configurable ratio (`dev_ratio=0.1`)
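
The steps above can be sketched in plain Python. This is only an illustration, not the training script itself: the function names, the `B-TERM`/`I-TERM`/`O` tag set, the whitespace-separated column layout, and the fixed shuffle seed are all assumptions, since the card does not specify them.

```python
import random

def read_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])           # assumed: first column is the token
        tags.append(parts[-1])            # assumed: last column is the tag
    if tokens:                            # flush a final sentence with no trailing blank line
        sentences.append((tokens, tags))
    return sentences

def build_label_maps(sentences):
    """Build label2id / id2label from the tags seen in the data."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle sentences and hold out a dev_ratio fraction (at least one sentence)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical two-sentence sample; the tag names are illustrative only.
sample = """The O
firm O
issued O
convertible B-TERM
bonds I-TERM
. O

Derivatives B-TERM
fell O
. O
"""
sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
```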

### Tokenization & Label Alignment

- Tokenizer: `BertTokenizerFast`
- Tokenization uses `is_split_into_words=True`
- Word-piece alignment handled via `word_ids()`
- Special tokens are assigned the label `-100` (ignored in the loss)
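
A minimal sketch of the alignment step, working directly on the list that `word_ids()` returns. The card does not say how continuation word pieces are labeled, so the `label_all_subwords` switch and its default (label only the first piece of each word) are assumptions, as are the function name and the sample word-piece split.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level label IDs to word-piece level using word_ids()."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS], [SEP], and padding map to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First word piece of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation piece: either repeat the label or ignore it (assumed policy).
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned

# Hypothetical split of ["convertible", "bonds"] into
# [CLS] conv ##ert ##ible bonds [SEP]; real word pieces may differ.
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [0, 1]  # e.g. B-TERM, I-TERM under an assumed label2id
print(align_labels(word_ids, word_labels))  # [-100, 0, -100, -100, 1, -100]
```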

---

## ⚙️ Training Details

- Base model: `google-bert/bert-base-cased`
- Task: Token Classification (NER-style)
- Framework: Hugging Face `Trainer`

### Training Arguments

- learning_rate: 2e-5  
- batch_size: 16  
- num_train_epochs: 5  
- max_seq_length: 256  
- weight_decay: 0.01  

### Training Strategy

- Evaluation: **per epoch**
- Checkpoint saving: **per epoch**
- Best model selection:
  - metric: F1 score
  - `load_best_model_at_end=True`
- Logging:
  - TensorBoard enabled
  - logging every 10 steps

### Hardware Optimization

- Optional **fp16 mixed precision**
- Multi-worker dataloading
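
The settings listed in this section map roughly onto keyword arguments of `transformers.TrainingArguments`. The dictionary below is a sketch of that mapping, not the author's actual configuration: the output path, the evaluation batch size, the worker count, and the `"f1"` metric key are assumptions. `max_seq_length` is not a `TrainingArguments` field, so it is shown separately as a tokenization-time limit.

```python
# Keyword arguments one might pass as TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "outputs/finance-term-extractor",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,   # assumed equal to the training batch size
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",           # named `evaluation_strategy` in older transformers releases
    "save_strategy": "epoch",           # must match the evaluation strategy for best-model loading
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",      # assumes compute_metrics returns a key named "f1"
    "greater_is_better": True,
    "logging_steps": 10,
    "report_to": ["tensorboard"],
    "fp16": False,                      # optional; set True on hardware that supports it
    "dataloader_num_workers": 2,        # assumed value; the card only says "multi-worker"
}

# Applied when tokenizing (truncation length), not inside TrainingArguments.
MAX_SEQ_LENGTH = 256
```

Keeping evaluation and checkpointing on the same per-epoch schedule is what allows `load_best_model_at_end=True` to restore the checkpoint with the best F1.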

---

## 📊 Evaluation

Evaluation is performed using the `seqeval` library.

Metrics:

- F1 score (primary metric)
- Full classification report (printed during training)
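
As a rough illustration of how predictions could be prepared for `seqeval`, the sketch below drops positions labeled `-100` and maps integer IDs back to tag strings before scoring. The helper names and the tag set are assumptions, and the predictions are taken to be already arg-maxed class IDs rather than raw logits.

```python
def to_tag_sequences(pred_ids, label_ids, id2label, ignore_index=-100):
    """Drop ignored positions and map integer IDs back to tag strings."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        pred_tags.append([id2label[p] for p, _ in kept])
        true_tags.append([id2label[l] for _, l in kept])
    return true_tags, pred_tags

def score(true_tags, pred_tags):
    """Compute entity-level F1 and print the report (requires `pip install seqeval`)."""
    from seqeval.metrics import classification_report, f1_score
    print(classification_report(true_tags, pred_tags))
    return {"f1": f1_score(true_tags, pred_tags)}

# Hypothetical labels and predictions for one five-piece sequence.
id2label = {0: "B-TERM", 1: "I-TERM", 2: "O"}
label_ids = [[-100, 2, 0, 1, -100]]
pred_ids = [[2, 2, 0, 2, 2]]
true_tags, pred_tags = to_tag_sequences(pred_ids, label_ids, id2label)
```

Because `seqeval` scores whole entity spans rather than individual tokens, a prediction that gets only part of a multi-word term right counts as a miss.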

Example:

```text
precision    recall  f1-score   support
...
```

---

## 🎯 Intended Use

This model is suitable for:

- Financial terminology extraction
- Terminology preprocessing for translation systems
- Supporting CAT tools
- Domain-specific NLP pipelines

---

## 🚫 Out-of-Scope Use

This model is not intended for:

- General-purpose NER tasks
- Legal or compliance decision-making
- Fully automated terminology validation without human review

---

## 🚀 Usage

```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple"
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))
```

---

## 🧾 Example

Input:

`"The company issued convertible bonds and derivatives."`

Output:

`["convertible bonds", "derivatives"]`
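
The pipeline returns a list of dictionaries rather than a plain list of strings. One possible way to reduce that output to a term list like the one above is sketched here; `extract_terms`, the `min_score` threshold, and the hand-written `entities` list (including its `TERM` label and scores) are illustrative assumptions, not real model output.

```python
def extract_terms(text, entities, min_score=0.0):
    """Return unique term strings in order of first appearance."""
    terms, seen = [], set()
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # Slice the source text by character offsets to avoid word-piece artifacts.
        term = text[ent["start"]:ent["end"]]
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

text = "The company issued convertible bonds and derivatives."

# Hand-written stand-in for pipeline output with aggregation_strategy="simple";
# the label name and scores are made up for illustration.
entities = [
    {"entity_group": "TERM", "score": 0.98, "word": "convertible bonds", "start": 19, "end": 36},
    {"entity_group": "TERM", "score": 0.95, "word": "derivatives", "start": 41, "end": 52},
]
print(extract_terms(text, entities))  # ['convertible bonds', 'derivatives']
```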

---

## ⚠️ Limitations

- Domain-specific: performance outside finance may degrade
- Rare or unseen terms may not be recognized
- Tokenization may split multi-word terms
- Human validation is recommended

---

## 📜 License

This model is derived from data released under CC BY-NC 4.0.

- ✅ Non-commercial use allowed
- ❌ Commercial use prohibited without permission
- ✅ Attribution required

The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.

---

## 🙏 Acknowledgements

- Base model: `google-bert/bert-base-cased`
- Dataset: WMT 2025 terminology resources
- Framework: Hugging Face Transformers & Datasets
- Metrics: `seqeval`