owen4512/bert-base-chinese-finance-term-extractor
@@ -1,154 +1,166 @@
----
-language:
-- en
-license: other
-license_name: cc-by-nc-4.0-derived
-base_model: google-bert/bert-base-cased
-library_name: transformers
-pipeline_tag: token-classification
-tags:
-- finance
-- terminology
-- term-extraction
-- token-classification
-- bert
-- english
-- ner
-datasets:
-- wmt-2025-terminology
----
-# BERT Finance Term Extractor (English)
-A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
----
-## 🧠 Model Description
-This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.
-It performs token-level classification (NER-style) to identify financial terms in text. The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.
----
-## 🏗️ Training Pipeline
-The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
-### Data Processing
-- Input format: **CoNLL-style token-tag sequences**
-- Sentences are split by blank lines
-- Labels are converted into integer IDs (`label2id`, `id2label`)
-- Automatic **train/dev split** using configurable ratio (`dev_ratio=0.1`)
-### Tokenization & Label Alignment
-- Tokenizer: `BertTokenizerFast`
-- Tokenization uses `is_split_into_words=True`
-- Word-piece alignment handled via `word_ids()`
-- Special tokens assigned label `-100` (ignored in loss)
----
-## ⚙️ Training Details
-- Base model: `google-bert/bert-base-cased`
-- Task: Token Classification (NER-style)
-- Framework: Hugging Face `Trainer`
-### Training Arguments
-- learning_rate: 2e-5
-- batch_size: 16
-- num_train_epochs: 5
-- max_seq_length: 256
-- weight_decay: 0.01
-### Training Strategy
-- Evaluation: **per epoch**
-- Checkpoint saving: **per epoch**
-- Best model selection:
-  - metric: F1 score
-  - `load_best_model_at_end=True`
-- Logging:
-  - TensorBoard enabled
-  - logging every 10 steps
-### Hardware Optimization
-- Optional **fp16 mixed precision**
-- Multi-worker dataloading
----
-## 📊 Evaluation
-Evaluation is performed using the `seqeval` library.
-Metrics:
-- F1 score (primary metric)
-- Full classification report (printed during training)
-Example:
-```text
-precision    recall  f1-score   support
-...
-🎯 Intended Use
-This model is suitable for:
-Financial terminology extraction
-Terminology preprocessing for translation systems
-Supporting CAT tools
-Domain-specific NLP pipelines
-🚫 Out-of-Scope Use
-This model is not intended for:
-General-purpose NER tasks
-Legal or compliance decision-making
-Fully automated terminology validation without human review
-🚀 Usage
-from transformers import pipeline
-pipe = pipeline(
-    "token-classification",
-    model="owen4512/bert-base-cased-finance-term-extractor",
-    aggregation_strategy="simple"
-)
-text = "The firm increased exposure to derivatives and sovereign bonds."
-print(pipe(text))
-🧾 Example
-Input:
-"The company issued convertible bonds and derivatives."
-Output:
-["convertible bonds", "derivatives"]
-⚠️ Limitations
-Domain-specific: performance outside finance may degrade
-Rare or unseen terms may not be recognized
-Tokenization may split multi-word terms
-Human validation is recommended
-📜 License
-This model is derived from data released under CC BY-NC 4.0.
-✅ Non-commercial use allowed
-❌ Commercial use prohibited without permission
-✅ Attribution required
-The base model google-bert/bert-base-cased is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.
-🙏 Acknowledgements
-Base model: google-bert/bert-base-cased
-Dataset: WMT 2025 terminology resources
-Framework: Hugging Face Transformers & Datasets
-Metrics: seqeval

+---
+language:
+- zh
+license: other
+license_name: cc-by-nc-4.0-derived
+base_model: bert-base-chinese
+library_name: transformers
+pipeline_tag: token-classification
+tags:
+- chinese
+- finance
+- terminology
+- term-extraction
+- token-classification
+- bert
+- ner
+datasets:
+- wmt-2025-terminology
+---
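+# Chinese Financial Term Extraction Model (BERT)
+A BERT-based model for extracting financial terminology from Chinese text by identifying domain-specific terms.
+---
+## 🧠 Model Description
+This model is fine-tuned from `bert-base-chinese` and performs **token-level classification (NER-style)** to identify financial terms in text.
+It is suited to translation assistance, terminology extraction, and financial text analysis.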
+---
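+## 🏗️ Training Pipeline
+The full training pipeline is built with Hugging Face Transformers and Datasets.
+### Data Processing
+- Input format: **CoNLL format (token + label)**
+- Sentences are separated by blank lines
+- Built automatically from the data:
+  - `label2id`
+  - `id2label`
+- Automatic train/dev split:
+  - `dev_ratio = 0.1`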
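The data-processing steps listed above can be sketched in plain Python. This is a minimal illustration, not the author's script: the function names, the `B-TERM`/`I-TERM`/`O` tag set, the character-level sample, and the fixed shuffle seed are all assumptions, since the model card does not show its loader or label scheme.

```python
import random


def read_conll(text):
    """Parse CoNLL-style text into (tokens, labels) pairs, one per sentence."""
    sentences, tokens, labels = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])   # first column: token
        labels.append(fields[-1])  # last column: tag
    if tokens:  # the final sentence may lack a trailing blank line
        sentences.append((tokens, labels))
    return sentences


def build_label_maps(sentences):
    """Build label2id / id2label from every tag seen in the data."""
    tags = sorted({tag for _, labs in sentences for tag in labs})
    label2id = {tag: i for i, tag in enumerate(tags)}
    id2label = {i: tag for tag, i in label2id.items()}
    return label2id, id2label


def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle, then hold out a dev_ratio share (at least one sentence)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical two-sentence sample with an assumed BIO tag set.
sample = """公 O
司 O
发 O
行 O
债 B-TERM
券 I-TERM

利 B-TERM
率 I-TERM
上 O
升 O
"""

sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
```

Sorting the tag set before assigning IDs keeps `label2id` stable across runs, which matters when a saved checkpoint is reloaded with the same mapping.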
+---
+## 🔤 Tokenization & Label Alignment
+- Tokenizer: `BertTokenizerFast`
+- Setting:
+  - `is_split_into_words=True`
+- `word_ids()` is used to align tokens with labels
+- Special tokens (CLS/SEP/PAD) are labeled `-100` (ignored by the loss)
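The alignment step above might look like the sketch below. The `word_ids` list is written by hand to mimic what `word_ids()` returns from a fast tokenizer called with `is_split_into_words=True`, so the example runs without downloading the model. The function name and tag set are hypothetical, and masking continuation pieces with `-100` is an assumption: the model card only states that special tokens receive `-100`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def align_labels(word_ids, word_labels, label2id):
    """Map word-level tags onto word-piece positions.

    word_ids: one entry per word piece; None marks a special token
              (CLS/SEP/PAD), otherwise the index of the source word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # assumption: continuation pieces are masked
        previous = word_id
    return aligned


label2id = {"B-TERM": 0, "I-TERM": 1, "O": 2}  # assumed tag set
word_labels = ["O", "B-TERM", "I-TERM"]
# CLS, word 0, word 1 (two pieces), word 2, SEP, PAD
word_ids = [None, 0, 1, 1, 2, None, None]

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)  # [-100, 2, 0, -100, 1, -100, -100]
```

With `bert-base-chinese`, a Chinese character usually maps to a single word piece, so the continuation branch mostly matters for Latin-script words and digits embedded in the text.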
+---
+## ⚙️ Training Configuration
+- Base model: `bert-base-chinese`
+- Task: Token Classification (NER)
+- Framework: Hugging Face `Trainer`
+### Hyperparameters
+- learning_rate: 2e-5
+- batch_size: 16
+- num_train_epochs: 5
+- max_seq_length: 256
+- weight_decay: 0.01
+---
+## 🧪 Training Strategy
+- Evaluation strategy: every epoch
+- Save strategy: every epoch
+- Best model selection:
+  - Metric: F1
+  - `load_best_model_at_end=True`
+### Logging
+- TensorBoard logging
+- Logged every 50 steps
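The hyperparameters and strategy settings above can be collected as keyword arguments for `transformers.TrainingArguments`. The dict below is a sketch under stated assumptions: "batch_size: 16" is read as a per-device value, the metric key `"f1"` assumes the metrics function returns a key with that name, and the evaluation-strategy argument is named `eval_strategy` in recent Transformers releases but `evaluation_strategy` in older ones.

```python
# Keyword arguments mirroring the settings listed in the model card.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumption: batch_size is per device
    "per_device_eval_batch_size": 16,   # assumption: same size for evaluation
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "save_strategy": "epoch",           # checkpoint every epoch
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",      # assumption: metrics dict exposes "f1"
    "logging_steps": 50,
    "report_to": ["tensorboard"],
}

# Argument name differs by library version (see the note above).
EVAL_STRATEGY_KEY = "eval_strategy"
training_kwargs[EVAL_STRATEGY_KEY] = "epoch"  # evaluate every epoch
```

These would typically be passed as `TrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is left to the user. `max_seq_length: 256` is not a `TrainingArguments` field; it would normally be applied at tokenization time, for example through the tokenizer's `max_length` and `truncation` arguments.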
+---
+## ⚡ Hardware Optimization
+- Supports fp16 (enabled when a GPU is detected)
+- Speeds up training
+---
+## 📊 Evaluation
+Sequence labeling is evaluated with `seqeval`:
+- F1 score (primary metric)
+- Classification report (printed during training)
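+Example output:
+```text
+precision    recall  f1-score   support
+...
+```
+---
+## 🎯 Intended Use
+This model is suitable for:
+- Chinese financial term extraction
+- Term identification in translation workflows
+- CAT tool support
+- Financial-domain NLP tasks
+---
+## 🚫 Out-of-Scope Use
+Not recommended for:
+- General-purpose NER tasks
+- High-risk domains such as medicine or law
+- Automated decision-making without human review
+---
+## 🚀 Usage
+```python
+from transformers import pipeline
+# Replace "your-username" with the account that hosts the model.
+pipe = pipeline(
+    "token-classification",
+    model="your-username/bert-base-chinese-finance-term-extractor",
+    aggregation_strategy="simple"
+)
+text = "公司发行了可转换债券和金融衍生品。"
+print(pipe(text))
+```
+---
+## 🧾 Example
+Input ("The company issued convertible bonds and financial derivatives."):
+"公司发行了可转换债券和金融衍生品。"
+Output (extracted terms: "convertible bonds", "financial derivatives"):
+["可转换债券", "金融衍生品"]
+---
+## ⚠️ Limitations
+- The model targets the financial domain; generalization to other domains is limited
+- Terms not seen during training may not be recognized
+- Tokenization may affect recognition of long terms
+- Human validation is recommended
+---
+## 📜 License
+This model is trained on data released under CC BY-NC 4.0:
+- ✅ Non-commercial use allowed
+- ❌ Commercial use prohibited (unless authorized)
+- ✅ Attribution required
+
+The base model `bert-base-chinese` is licensed under Apache 2.0, but the fine-tuned model is subject to the dataset's restrictions.
+---
+## 🙏 Acknowledgements
+- Base model: bert-base-chinese
+- Dataset: WMT 2025 terminology resources
+- Framework: Hugging Face Transformers & Datasets
+- Evaluation: seqeval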