owen4512 committed on
Commit f77b195 · verified · 1 Parent(s): b034f80

Update README.md

  1. README.md +166 -154
README.md CHANGED
@@ -1,154 +1,166 @@
1
- ---
2
- language:
3
- - en
4
- license: other
5
- license_name: cc-by-nc-4.0-derived
6
- base_model: google-bert/bert-base-cased
7
- library_name: transformers
8
- pipeline_tag: token-classification
9
- tags:
10
- - finance
11
- - terminology
12
- - term-extraction
13
- - token-classification
14
- - bert
15
- - english
16
- - ner
17
- datasets:
18
- - wmt-2025-terminology
19
- ---
20
-
21
- # BERT Finance Term Extractor (English)
22
-
23
- A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
24
-
25
- ---
26
-
27
- ## 🧠 Model Description
28
-
29
- This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.
30
-
31
- It performs token-level classification (NER-style) to identify financial terms in text. The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.
32
-
33
- ---
34
-
35
- ## 🏗️ Training Pipeline
36
-
37
- The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
38
-
39
- ### Data Processing
40
-
41
- - Input format: **CoNLL-style token-tag sequences**
42
- - Sentences are split by blank lines
43
- - Labels are converted into integer IDs (`label2id`, `id2label`)
44
- - Automatic **train/dev split** using configurable ratio (`dev_ratio=0.1`)
45
-
46
- ### Tokenization & Label Alignment
47
-
48
- - Tokenizer: `BertTokenizerFast`
49
- - Tokenization uses `is_split_into_words=True`
50
- - Word-piece alignment handled via `word_ids()`
51
- - Special tokens assigned label `-100` (ignored in loss)
52
-
53
- ---
54
-
55
- ## ⚙️ Training Details
56
-
57
- - Base model: `google-bert/bert-base-cased`
58
- - Task: Token Classification (NER-style)
59
- - Framework: Hugging Face `Trainer`
60
-
61
- ### Training Arguments
62
-
63
- - learning_rate: 2e-5
64
- - batch_size: 16
65
- - num_train_epochs: 5
66
- - max_seq_length: 256
67
- - weight_decay: 0.01
68
-
69
- ### Training Strategy
70
-
71
- - Evaluation: **per epoch**
72
- - Checkpoint saving: **per epoch**
73
- - Best model selection:
74
-   - metric: F1 score
75
-   - `load_best_model_at_end=True`
76
- - Logging:
77
-   - TensorBoard enabled
78
-   - logging every 10 steps
79
-
80
- ### Hardware Optimization
81
-
82
- - Optional **fp16 mixed precision**
83
- - Multi-worker dataloading
84
-
85
- ---
86
-
87
- ## 📊 Evaluation
88
-
89
- Evaluation is performed using the `seqeval` library.
90
-
91
- Metrics:
92
-
93
- - F1 score (primary metric)
94
- - Full classification report (printed during training)
95
-
96
- Example:
97
-
98
- ```text
99
- precision recall f1-score support
100
- ...
101
- ```
102
- ## 🎯 Intended Use
103
-
104
- This model is suitable for:
105
-
106
- - Financial terminology extraction
107
- - Terminology preprocessing for translation systems
108
- - Supporting CAT tools
109
- - Domain-specific NLP pipelines
110
- ## 🚫 Out-of-Scope Use
111
-
112
- This model is not intended for:
113
-
114
- - General-purpose NER tasks
115
- - Legal or compliance decision-making
116
- - Fully automated terminology validation without human review
117
- ## 🚀 Usage
118
- ```python
119
- from transformers import pipeline
120
-
121
- pipe = pipeline(
122
-     "token-classification",
123
-     model="owen4512/bert-base-cased-finance-term-extractor",
124
-     aggregation_strategy="simple"
125
- )
126
-
127
- text = "The firm increased exposure to derivatives and sovereign bonds."
128
- print(pipe(text))
129
- ```
130
- ## 🧾 Example
131
-
132
- - Input: `"The company issued convertible bonds and derivatives."`
133
- - Output: `["convertible bonds", "derivatives"]`
134
-
135
- ## ⚠️ Limitations
136
- - Domain-specific: performance outside finance may degrade
137
- - Rare or unseen terms may not be recognized
138
- - Tokenization may split multi-word terms
139
- - Human validation is recommended
140
- ## 📜 License
141
-
142
- This model is derived from data released under CC BY-NC 4.0.
143
-
144
- - ✅ Non-commercial use allowed
145
- - ❌ Commercial use prohibited without permission
146
- - ✅ Attribution required
147
-
148
- The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.
149
-
150
- ## 🙏 Acknowledgements
151
- - Base model: google-bert/bert-base-cased
152
- - Dataset: WMT 2025 terminology resources
153
- - Framework: Hugging Face Transformers & Datasets
154
- - Metrics: seqeval
 
1
+ ---
2
+ language:
3
+ - zh
4
+ license: other
5
+ license_name: cc-by-nc-4.0-derived
6
+ base_model: bert-base-chinese
7
+ library_name: transformers
8
+ pipeline_tag: token-classification
9
+ tags:
10
+ - chinese
11
+ - finance
12
+ - terminology
13
+ - term-extraction
14
+ - token-classification
15
+ - bert
16
+ - ner
17
+ datasets:
18
+ - wmt-2025-terminology
19
+ ---
20
+
21
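+ # Chinese Financial Term Extraction Model (BERT)
22
+
23
+ A BERT-based model for Chinese financial term extraction that identifies domain-relevant terms in Chinese text.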
24
+
25
+ ---
26
+
27
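+ ## 🧠 Model Description
28
+
29
+ This model is fine-tuned from `bert-base-chinese` and performs **token-level classification (NER-style)** to identify financial terms in text.
30
+
31
+ It is suited to translation assistance, term extraction, financial text analysis, and similar scenarios.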
32
+
33
+ ---
34
+
35
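+ ## 🏗️ Training Pipeline
36
+
37
+ The model is trained with a complete pipeline built on Hugging Face Transformers and Datasets.
38
+
39
+ ### Data Processing
40
+
41
+ - Input format: **CoNLL format (token + label)**
42
+ - Sentences are separated by blank lines
43
+ - Built automatically:
44
+   - `label2id`
45
+   - `id2label`
46
+ - Automatic train/dev split:
47
+   - `dev_ratio = 0.1`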
48
+
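The data-processing steps listed above could be sketched roughly as follows. This is an illustration only, not the author's training script: the function names (`read_conll`, `build_label_maps`, `train_dev_split`), the `B-TERM`/`I-TERM`/`O` tag names, the shuffling seed, and the choice of first and last column are all assumptions. Only `label2id`, `id2label`, and `dev_ratio=0.1` come from the README.

```python
import random


def read_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # assumed: first column is the token
        tags.append(parts[-1])    # assumed: last column is the label
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(sentences):
    """Build label2id / id2label from the labels that occur in the data."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle sentences and hold out a dev_ratio fraction (at least one)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Tiny hand-written sample; the tag names are hypothetical.
sample = "发 O\n行 O\n债 B-TERM\n券 I-TERM\n\n利 B-TERM\n率 I-TERM\n上 O\n升 O\n"
sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
print(label2id)
```

Sorting the label set before assigning IDs keeps the mapping stable between runs, which matters when a saved checkpoint is reloaded with a freshly built `id2label`.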
49
+ ---
50
+
51
+ ## 🔤 Tokenization & Label Alignment
52
+
53
+ - Tokenizer: `BertTokenizerFast`
54
+ - Settings:
55
+   - `is_split_into_words=True`
56
+ - `word_ids()` aligns tokens with labels
57
+ - Special tokens (CLS/SEP/PAD) are labeled `-100` (ignored by the loss)
58
+
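To make the alignment step concrete, here is a minimal sketch that works on the list a fast tokenizer returns from `word_ids()`. The helper name `align_labels` and its `label_all_subtokens` flag are hypothetical, and the README does not say how continuation sub-tokens are labeled; masking them with `-100` is one common convention and is assumed here.

```python
def align_labels(word_ids, word_labels, label_all_subtokens=False):
    """Map word-level label IDs onto token positions.

    word_ids: per-token word indices, with None for special tokens
              (the shape returned by BatchEncoding.word_ids()).
    word_labels: one integer label ID per input word.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                   # CLS/SEP/PAD: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first sub-token of a word
        elif label_all_subtokens:
            aligned.append(word_labels[word_id])   # repeat the label on every piece
        else:
            aligned.append(-100)                   # assumed: mask continuation pieces
        previous = word_id
    return aligned


# Hand-written word_ids for three words where the second splits into two pieces.
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [2, 0, 1]
print(align_labels(word_ids, word_labels))
```

With `bert-base-chinese` most Chinese characters map to a single token, so the continuation branch mainly matters for Latin-script words and numbers embedded in the text.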
59
+ ---
60
+
61
+ ## ⚙️ Training Configuration
62
+
63
+ - Base model: `bert-base-chinese`
64
+ - Task: Token Classification (NER)
65
+ - Framework: Hugging Face `Trainer`
66
+
67
+ ### Hyperparameters
68
+
69
+ - learning_rate: 2e-5
70
+ - batch_size: 16
71
+ - num_train_epochs: 5
72
+ - max_seq_length: 256
73
+ - weight_decay: 0.01
74
+
75
+ ---
76
+
77
+ ## 🧪 Training Strategy
78
+
79
+ - Evaluation strategy: every epoch
80
+ - Save strategy: every epoch
81
+ - Best model selection:
82
+   - Metric: F1
83
+   - `load_best_model_at_end=True`
84
+
85
+ ### Logging
86
+
87
+ - TensorBoard logging
88
+ - Logged every 50 steps
89
+
90
+ ---
91
+
92
+ ## ⚡ Hardware Optimization
93
+
94
+ - fp16 supported (GPU detected automatically)
95
+ - Improves training efficiency
96
+
97
+ ---
98
+
99
+ ## 📊 Evaluation
100
+
101
+ Sequence labeling is evaluated with `seqeval`:
102
+
103
+ - F1 score (primary metric)
104
+ - Classification report (printed during training)
105
+
106
+ Example output:
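For readers unfamiliar with entity-level scoring, the sketch below shows the idea behind the F1 metric named above: a predicted term counts only if its full span and type match a gold term. It is a hand-rolled illustration, not `seqeval` itself. The BIO tag scheme and the `TERM` type are assumptions, and stray `I-` tags without a preceding `B-` are simply ignored here, which is a simplification.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) spans in a BIO sequence; end is exclusive."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-written tags: the prediction finds one of the two gold terms.
gold = [["O", "B-TERM", "I-TERM", "O", "B-TERM"]]
pred = [["O", "B-TERM", "I-TERM", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

Because matching is on whole spans, a prediction that covers only part of a multi-character term earns no credit, which is why boundary errors on long terms weigh heavily on this metric.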
107
+
108
+ ```text
109
+ precision recall f1-score support
110
+ ...
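111
+ ```
112
+ ## 🎯 Intended Use
113
+
114
+ This model is suitable for:
115
+
116
+ - Chinese financial term extraction
117
+ - Term identification in translation workflows
118
+ - CAT tool support
119
+ - Financial-domain NLP tasks
120
+ ## 🚫 Out-of-Scope Use
121
+
122
+ Not recommended for:
123
+
124
+ - General-purpose NER tasks
125
+ - High-risk domains such as medicine or law
126
+ - Automated decision-making without human review
127
+ ## 🚀 Usage
128
+
129
+ ```python
130
+ from transformers import pipeline
131
+
132
+ pipe = pipeline(
133
+     "token-classification",
134
+     model="your-username/bert-base-chinese-finance-term-extractor",
135
+     aggregation_strategy="simple"
136
+ )
137
+
138
+ text = "公司发行了可转换债券和金融衍生品。"
139
+ print(pipe(text))
140
+ ```
141
+
142
+ ## 🧾 Example
143
+
144
+ - Input: `"公司发行了可转换债券和金融衍生品。"`
145
+ - Output: `["可转换债券", "金融衍生品"]`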
146
+
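The usage snippet prints the raw pipeline output, a list of dictionaries, while the example shows a plain list of terms. One way to bridge the two is sketched below. The helper `extract_terms` is hypothetical, and `mock_entities` is a hand-written stand-in for `pipe(text)` (the `TERM` group name and the scores are made up, not real model output). The sketch slices the original text by the `start`/`end` character offsets because, with character-level Chinese tokenization, the aggregated `word` field may contain spaces between characters.

```python
def extract_terms(text, entities):
    """Slice term strings out of `text` using pipeline character offsets."""
    terms = []
    for ent in entities:
        term = text[ent["start"]:ent["end"]]
        if term and term not in terms:   # drop empty slices and duplicates, keep order
            terms.append(term)
    return terms


text = "公司发行了可转换债券和金融衍生品。"

# Hand-written stand-in for pipe(text); not real model output.
mock_entities = [
    {"entity_group": "TERM", "score": 0.99, "word": "可 转 换 债 券", "start": 5, "end": 10},
    {"entity_group": "TERM", "score": 0.98, "word": "金 融 衍 生 品", "start": 11, "end": 16},
]

terms = extract_terms(text, mock_entities)
print(terms)
```

In a real run, `mock_entities` would be replaced by `pipe(text)`; filtering on `score` before slicing is a natural extension if low-confidence spans need to be dropped.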
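147
+ ## ⚠️ Limitations
148
+ - The model targets the finance domain; cross-domain generalization is limited
149
+ - Recognition of unseen terms is limited
150
+ - Tokenization may affect recognition of long terms
151
+ - Human verification is recommended
152
+ ## 📜 License
153
+
154
+ This model is trained on data released under CC BY-NC 4.0:
155
+
156
+ - ✅ Non-commercial use allowed
157
+ - ❌ Commercial use prohibited (unless authorized)
158
+ - ✅ Attribution required
159
+
160
+ The base model `bert-base-chinese` is licensed under Apache 2.0, but the fine-tuned model is bound by the dataset's restrictions.
161
+
162
+ ## 🙏 Acknowledgements
163
+ - Base model: bert-base-chinese
164
+ - Dataset: WMT 2025 terminology resources
165
+ - Framework: Hugging Face Transformers & Datasets
166
+ - Evaluation: seqeval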