owen4512/bert-base-cased-finance-term-extractor

@@ -1,9 +1,155 @@
----
-language:
-- en
-license: other
-license_name: cc-by-nc-4.0-derived
-base_model: google-bert/bert-base-cased
-library_name: transformers
-pipeline_tag: token-classification
----

+---
+language:
+- en
+license: other
+license_name: cc-by-nc-4.0-derived
+base_model: google-bert/bert-base-cased
+library_name: transformers
+pipeline_tag: token-classification
+tags:
+- finance
+- terminology
+- term-extraction
+- token-classification
+- bert
+- english
+- ner
+datasets:
+- wmt-2025-terminology
+---
+# BERT Finance Term Extractor (English)
+A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
+
+---
+## 🧠 Model Description
+This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.
+It performs token-level classification (NER-style) to identify financial terms in text and is designed primarily for translation workflows, terminology mining, and domain-specific NLP pipelines.
+
+---
+## 🏗️ Training Pipeline
+The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
+### Data Processing
+- Input format: **CoNLL-style token-tag sequences**
+- Sentences are split by blank lines
+- Labels are converted into integer IDs (`label2id`, `id2label`)
+- Automatic **train/dev split** using a configurable ratio (`dev_ratio=0.1`)
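The data-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names, the `B-TERM`/`I-TERM`/`O` tag set, the whitespace-separated column layout, and the shuffling seed are all assumptions, since the card does not specify them.

```python
import random


def read_conll(text):
    """Parse CoNLL-style text: one 'token tag' pair per line, blank line between sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # assumption: token is the first column
        tags.append(parts[-1])    # assumption: tag is the last column
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(sentences):
    """Map every tag seen in the data to an integer ID and back."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle sentences and hold out a dev_ratio share (at least one sentence)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical two-sentence sample; the tag names are illustrative only.
sample = """The O
firm O
issued O
convertible B-TERM
bonds I-TERM
. O

Derivatives B-TERM
are O
risky O
. O
"""

sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
```

Sorting the tag set before assigning IDs keeps `label2id` stable across runs, which matters when the mapping is stored in the model config.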
+### Tokenization & Label Alignment
+- Tokenizer: `BertTokenizerFast`
+- Tokenization uses `is_split_into_words=True`
+- Word-piece alignment handled via `word_ids()`
+- Special tokens assigned label `-100` (ignored in loss)
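A small sketch of the alignment step may help here. In a real run, `word_ids` would come from the encoding returned by `BertTokenizerFast` when called with `is_split_into_words=True`; below, a hand-written list stands in for it so the logic can be read in isolation. The card only states that special tokens receive `-100`. How continuation word pieces are labeled is not stated, so the `label_all_pieces` flag and its default are assumptions.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_label_ids, label_all_pieces=False):
    """Expand word-level label IDs to word-piece level.

    word_ids: one entry per word piece; None marks special tokens such as
              [CLS] and [SEP], integers index into the original word list.
    word_label_ids: one label ID per original word.
    label_all_pieces: assumption, not from the model card. If False, only the
              first piece of each word keeps its label and the rest get -100.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])
        else:
            aligned.append(word_label_ids[word_id] if label_all_pieces else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two word pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [2, 0, 1]

print(align_labels(word_ids, word_label_ids))
# → [-100, 2, 0, -100, 1, -100]
```

Labeling only the first piece keeps one prediction per word at evaluation time. Labeling every piece is an equally common choice, so check the training script to see which one was used.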
+---
+## ⚙️ Training Details
+- Base model: `google-bert/bert-base-cased`
+- Task: Token Classification (NER-style)
+- Framework: Hugging Face `Trainer`
+### Training Arguments
+- learning_rate: 2e-5
+- batch_size: 16
+- num_train_epochs: 5
+- max_seq_length: 256
+- weight_decay: 0.01
+### Training Strategy
+- Evaluation: **per epoch**
+- Checkpoint saving: **per epoch**
+- Best model selection:
+  - metric: F1 score
+  - `load_best_model_at_end=True`
+- Logging:
+  - TensorBoard enabled
+  - logging every 10 steps
+### Hardware Optimization
+- Optional **fp16 mixed precision**
+- Multi-worker dataloading
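For reference, the settings listed in this section could be collected as keyword arguments for `transformers.TrainingArguments`. This is a sketch under several assumptions: the output path and worker count are hypothetical, `batch_size` is assumed to be per device, and the evaluation key is named `eval_strategy` in recent Transformers releases but `evaluation_strategy` in older ones. `max_seq_length` is not a `TrainingArguments` field. It would normally be applied at tokenization time via `max_length=256` with truncation.

```python
# Keyword arguments mirroring the card's listed settings, intended to be
# passed as transformers.TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "outputs/finance-term-extractor",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumption: batch_size is per device
    "per_device_eval_batch_size": 16,   # assumption: same size for evaluation
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",           # "evaluation_strategy" in older releases
    "save_strategy": "epoch",           # must match the evaluation strategy
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",
    "greater_is_better": True,
    "logging_steps": 10,
    "report_to": ["tensorboard"],
    "fp16": False,                      # optional; enable on supported GPUs
    "dataloader_num_workers": 2,        # hypothetical worker count
}
```

`load_best_model_at_end=True` only works when checkpoints are saved on the same schedule as evaluation, which is why both strategies are set to `"epoch"`.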
+---
+## 📊 Evaluation
+Evaluation is performed using the `seqeval` library.
+Metrics:
+- F1 score (primary metric)
+- Full classification report (printed during training)
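`seqeval` scores lists of tag strings, so predicted and gold label IDs have to be converted back to tags first, dropping every position labeled `-100`. The helper below is a minimal sketch of that conversion. Its name, the tag set, and the sample IDs are hypothetical, and in a `Trainer` setup the predicted IDs would typically come from an argmax over the logits.

```python
def decode_for_seqeval(pred_ids, label_ids, id2label, ignore_index=-100):
    """Turn batches of label IDs into tag-string sequences, skipping ignored positions."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        # Keep only positions that carried a real label during training.
        keep = [i for i, label in enumerate(label_row) if label != ignore_index]
        true_tags.append([id2label[label_row[i]] for i in keep])
        pred_tags.append([id2label[pred_row[i]] for i in keep])
    return true_tags, pred_tags


# Hypothetical one-sentence batch.
id2label = {0: "B-TERM", 1: "I-TERM", 2: "O"}
label_ids = [[-100, 2, 0, -100, 1, -100]]
pred_ids = [[2, 2, 0, 1, 2, 2]]

true_tags, pred_tags = decode_for_seqeval(pred_ids, label_ids, id2label)
print(true_tags)  # → [['O', 'B-TERM', 'I-TERM']]
print(pred_tags)  # → [['O', 'B-TERM', 'O']]
```

The two returned lists can then be passed to `seqeval.metrics.f1_score(true_tags, pred_tags)` and `seqeval.metrics.classification_report(true_tags, pred_tags)`.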
+Example classification report layout:
+```text
+precision    recall  f1-score   support
+...
+```
+---
+## 🎯 Intended Use
+This model is suitable for:
+- Financial terminology extraction
+- Terminology preprocessing for translation systems
+- Supporting CAT tools
+- Domain-specific NLP pipelines
+---
+## 🚫 Out-of-Scope Use
+This model is not intended for:
+- General-purpose NER tasks
+- Legal or compliance decision-making
+- Fully automated terminology validation without human review
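+---
+## 🚀 Usage
+```python
+from transformers import pipeline
+
+# aggregation_strategy="simple" merges word pieces into whole-entity spans
+pipe = pipeline(
+    "token-classification",
+    model="owen4512/bert-base-cased-finance-term-extractor",
+    aggregation_strategy="simple",
+)
+
+text = "The firm increased exposure to derivatives and sovereign bonds."
+print(pipe(text))  # list of dicts: entity_group, score, word, start, end
+```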
+---
+## 🧾 Example
+Input: `"The company issued convertible bonds and derivatives."`
+
+Output: `["convertible bonds", "derivatives"]`
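The pipeline call in the Usage section returns a list of dictionaries rather than plain strings, so a small post-processing step is needed to get a term list like the one in this example. The sketch below shows one way to do it. The `extract_terms` helper, the `min_score` threshold, the `TERM` label name, and the hand-built entity dictionaries are all hypothetical stand-ins for real pipeline output.

```python
def extract_terms(text, entities, min_score=0.0):
    """Collect unique term strings, in order of appearance, from aggregated pipeline output."""
    terms = []
    for entity in entities:
        if entity["score"] < min_score:
            continue
        # Slice the original text instead of using entity["word"], which is
        # rebuilt from word pieces and may differ in spacing.
        term = text[entity["start"]:entity["end"]]
        if term and term not in terms:
            terms.append(term)
    return terms


text = "The company issued convertible bonds and derivatives."


def fake_entity(surface, score):
    """Build a dict shaped like aggregated token-classification output (illustrative only)."""
    start = text.index(surface)
    return {
        "entity_group": "TERM",  # hypothetical label name
        "score": score,
        "word": surface,
        "start": start,
        "end": start + len(surface),
    }


entities = [fake_entity("convertible bonds", 0.98), fake_entity("derivatives", 0.95)]
print(extract_terms(text, entities))
# → ['convertible bonds', 'derivatives']
```

Raising `min_score` trades recall for precision, which can be useful when the extracted terms feed a terminology database that is reviewed by hand.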
+---
+## ⚠️ Limitations
+- Domain-specific: performance outside finance may degrade
+- Rare or unseen terms may not be recognized
+- Tokenization may split multi-word terms
+- Human validation is recommended
+---
+## 📜 License
+This model is derived from data released under CC BY-NC 4.0.
+- ✅ Non-commercial use allowed
+- ❌ Commercial use prohibited without permission
+- ✅ Attribution required
+
+The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.
+
+---
+## 🙏 Acknowledgements
+- Base model: `google-bert/bert-base-cased`
+- Dataset: WMT 2025 terminology resources
+- Framework: Hugging Face Transformers & Datasets
+- Metrics: `seqeval`