owen4512 committed
Commit 4f006bd · verified · 1 Parent(s): c3358bb

Update README.md

  1. README.md +155 -9
README.md CHANGED
@@ -1,9 +1,155 @@
- ---
- language:
- - en
- license: other
- license_name: cc-by-nc-4.0-derived
- base_model: google-bert/bert-base-cased
- library_name: transformers
- pipeline_tag: token-classification
- ---
+ ---
+ language:
+ - en
+ license: other
+ license_name: cc-by-nc-4.0-derived
+ base_model: google-bert/bert-base-cased
+ library_name: transformers
+ pipeline_tag: token-classification
+ tags:
+ - finance
+ - terminology
+ - term-extraction
+ - token-classification
+ - bert
+ - english
+ - ner
+ datasets:
+ - wmt-2025-terminology
+ ---
+
+ # BERT Finance Term Extractor (English)
+
+ A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
+
+ ---
+
+ ## 🧠 Model Description
+
+ This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.
+
+ It performs token-level classification (NER-style) to identify financial terms in text. The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.
+
+ ---
+
+ ## 🏗️ Training Pipeline
+
+ The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
+
+ ### Data Processing
+
+ - Input format: **CoNLL-style token-tag sequences**
+ - Sentences are split by blank lines
+ - Labels are converted into integer IDs (`label2id`, `id2label`)
+ - Automatic **train/dev split** using configurable ratio (`dev_ratio=0.1`)
+
+ ### Tokenization & Label Alignment
+
+ - Tokenizer: `BertTokenizerFast`
+ - Tokenization uses `is_split_into_words=True`
+ - Word-piece alignment handled via `word_ids()`
+ - Special tokens assigned label `-100` (ignored in loss)
+
+ ---
+
+ ## ⚙️ Training Details
+
+ - Base model: `google-bert/bert-base-cased`
+ - Task: Token Classification (NER-style)
+ - Framework: Hugging Face `Trainer`
+
+ ### Training Arguments
+
+ - learning_rate: 2e-5
+ - batch_size: 16
+ - num_train_epochs: 5
+ - max_seq_length: 256
+ - weight_decay: 0.01
+
+ ### Training Strategy
+
+ - Evaluation: **per epoch**
+ - Checkpoint saving: **per epoch**
+ - Best model selection:
+ - metric: F1 score
+ - `load_best_model_at_end=True`
+ - Logging:
+ - TensorBoard enabled
+ - logging every 10 steps
+
+ ### Hardware Optimization
+
+ - Optional **fp16 mixed precision**
+ - Multi-worker dataloading
+
+ ---
+
+ ## 📊 Evaluation
+
+ Evaluation is performed using the `seqeval` library.
+
+ Metrics:
+
+ - F1 score (primary metric)
+ - Full classification report (printed during training)
+
+ Example:
+
+ ```text
+ precision recall f1-score support
+ ...
+
+ 🎯 Intended Use
+
+ This model is suitable for:
+
+ Financial terminology extraction
+ Terminology preprocessing for translation systems
+ Supporting CAT tools
+ Domain-specific NLP pipelines
+ 🚫 Out-of-Scope Use
+
+ This model is not intended for:
+
+ General-purpose NER tasks
+ Legal or compliance decision-making
+ Fully automated terminology validation without human review
+ 🚀 Usage
+ from transformers import pipeline
+
+ pipe = pipeline(
+ "token-classification",
+ model="owen4512/bert-base-cased-finance-term-extractor",
+ aggregation_strategy="simple"
+ )
+
+ text = "The firm increased exposure to derivatives and sovereign bonds."
+ print(pipe(text))
+ 🧾 Example
+
+ Input:
+ "The company issued convertible bonds and derivatives."
+
+ Output:
+ ["convertible bonds", "derivatives"]
+
+ ⚠️ Limitations
+ Domain-specific: performance outside finance may degrade
+ Rare or unseen terms may not be recognized
+ Tokenization may split multi-word terms
+ Human validation is recommended
+ 📜 License
+
+ This model is derived from data released under CC BY-NC 4.0.
+
+ ✅ Non-commercial use allowed
+ ❌ Commercial use prohibited without permission
+ ✅ Attribution required
+
+ The base model google-bert/bert-base-cased is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.
+
+ 🙏 Acknowledgements
+ Base model: google-bert/bert-base-cased
+ Dataset: WMT 2025 terminology resources
+ Framework: Hugging Face Transformers & Datasets
+ Metrics: seqeval
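
The data-processing and label-alignment steps listed in the added README (CoNLL-style input, `label2id`/`id2label`, `dev_ratio=0.1`, `word_ids()` alignment, `-100` for special tokens) could look roughly like the sketch below. It is not the author's training script. The helper names, the `B-TERM`/`I-TERM`/`O` tag set, the shuffle seed, and the choice to label only the first sub-token of each word are assumptions. The `word_ids` list is written by hand in the shape a fast tokenizer's `word_ids()` returns, so the example runs without downloading a model.

```python
import random
from typing import Dict, List, Optional, Tuple

IGNORE_INDEX = -100  # label value that the loss function skips

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_conll(text: str) -> List[Sentence]:
    """Parse 'token tag' lines; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(sentences: List[Sentence]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct tag to an integer ID and back."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def split_train_dev(sentences: List[Sentence], dev_ratio: float = 0.1, seed: int = 42):
    """Shuffle and hold out a dev set (seed and shuffling are assumptions)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


def align_labels(word_ids: List[Optional[int]], tag_ids: List[int]) -> List[int]:
    """Spread word-level tag IDs over sub-tokens using word_ids()."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)        # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(tag_ids[word_id])    # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)        # assumption: later sub-tokens are ignored
        previous = word_id
    return aligned


# Hypothetical two-sentence sample with an assumed B-TERM/I-TERM/O tag set.
sample = """The O
firm O
issued O
convertible B-TERM
bonds I-TERM
. O

Yields O
rose O
"""

sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = split_train_dev(sentences, dev_ratio=0.1)

tokens, tags = sentences[0]
tag_ids = [label2id[t] for t in tags]
# Hand-written word_ids: "convertible" (word 3) is assumed to split into two pieces.
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, None]
print(align_labels(word_ids, tag_ids))
```

With a real tokenizer, `word_ids` would come from `encoding.word_ids()` after calling the tokenizer with `is_split_into_words=True`, as the README describes.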
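
The README's Usage snippet prints the raw pipeline result, while its Example shows a plain list of term strings. One possible way to get from the first to the second is sketched below. It assumes the list-of-dicts shape that the token-classification pipeline returns with `aggregation_strategy="simple"` (keys `entity_group`, `score`, `word`, `start`, `end`). The `extract_terms` helper, the `min_score` threshold, the `TERM` group name, and the mock entries are all hypothetical: the entries are hand-written stand-ins, not output from this model.

```python
from typing import Dict, List


def extract_terms(text: str, entities: List[Dict], min_score: float = 0.5) -> List[str]:
    """Turn aggregated pipeline entities into a de-duplicated list of term strings."""
    terms: List[str] = []
    seen = set()
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < min_score:
            continue                               # drop low-confidence spans
        term = text[ent["start"]:ent["end"]]       # slice the original text by offsets
        key = term.lower()
        if key not in seen:                        # keep the first occurrence only
            seen.add(key)
            terms.append(term)
    return terms


def mock_entity(text: str, phrase: str, score: float) -> Dict:
    """Build a hand-written entry in the shape the pipeline returns."""
    start = text.index(phrase)
    return {
        "entity_group": "TERM",  # assumed label name
        "score": score,          # made-up value for illustration
        "word": phrase,
        "start": start,
        "end": start + len(phrase),
    }


text = "The company issued convertible bonds and derivatives."
mock_entities = [
    mock_entity(text, "derivatives", 0.97),
    mock_entity(text, "convertible bonds", 0.99),
    mock_entity(text, "company", 0.20),  # falls below the threshold
]

terms = extract_terms(text, mock_entities)
print(terms)
```

Slicing the input by `start` and `end` rather than reading `word` keeps the original casing and spacing of each term, which matters when the list is handed to a translator or a CAT tool.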