---
language:
- en
license: other
license_name: cc-by-nc-4.0-derived
base_model: google-bert/bert-base-cased
library_name: transformers
pipeline_tag: token-classification
tags:
- finance
- terminology
- term-extraction
- token-classification
- bert
- english
- ner
datasets:
- wmt-2025-terminology
---

# BERT Finance Term Extractor (English)

A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.

---

## 🧠 Model Description

This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.

It performs token-level classification (NER-style) to identify financial terms in text. It is designed primarily for translation workflows, terminology mining, and domain-specific NLP pipelines.

---

## 🏗️ Training Pipeline

The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.

### Data Processing

- Input format: **CoNLL-style token-tag sequences**
- Sentences are split by blank lines
- Labels are converted into integer IDs (`label2id`, `id2label`)
- Automatic **train/dev split** using a configurable ratio (`dev_ratio=0.1`)
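
The data-processing steps above can be sketched in plain Python. This is a minimal illustration, not the training script itself: the function names (`read_conll`, `build_label_maps`, `train_dev_split`), the `B-TERM`/`I-TERM`/`O` tag names, the shuffling seed, and the assumption that the tag sits in the last whitespace-separated column are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text: one 'token tag' pair per line, blank line between sentences."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # assumption: token is the first column
        tags.append(parts[-1])    # assumption: tag is the last column
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(sentences: List[Sentence]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create label2id / id2label from the tags seen in the data (sorted for determinism)."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def train_dev_split(sentences: List[Sentence], dev_ratio: float = 0.1, seed: int = 42):
    """Shuffle a copy of the data and hold out a dev_ratio share (at least one sentence)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Tiny made-up sample in the assumed format.
sample = """The O
firm O
issued O
convertible B-TERM
bonds I-TERM
. O

Derivatives B-TERM
rose O
. O
"""

sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
print(len(sentences), label2id, len(train), len(dev))
```

Sorting the label set keeps the ID mapping stable across runs, which matters when `id2label` is stored in the model config.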

### Tokenization & Label Alignment

- Tokenizer: `BertTokenizerFast`
- Tokenization uses `is_split_into_words=True`
- Word-piece alignment handled via `word_ids()`
- Special tokens assigned label `-100` (ignored in loss)
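
The alignment step can be illustrated without loading a tokenizer. The sketch below takes a `word_ids` list of the kind returned by `BatchEncoding.word_ids()` on a fast tokenizer and maps word-level label IDs onto word pieces. The helper name `align_labels` is hypothetical, and the choice to label only the first piece of each word (with an option to label every piece) is an assumption: the card states only that special tokens receive `-100`.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_pieces: bool = False,
) -> List[int]:
    """Map word-level label IDs onto word-piece positions."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS], [SEP] and padding have no source word.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First piece of a word carries the word's label.
            aligned.append(word_labels[wid])
        else:
            # Continuation piece: ignored by default (assumption), or given the word's label.
            aligned.append(word_labels[wid] if label_all_pieces else IGNORE_INDEX)
        previous = wid
    return aligned


# Hand-written word_ids for illustration: [CLS] The deriva ##tives [SEP]
example_word_ids = [None, 0, 1, 1, None]
example_labels = [0, 1]  # one label ID per original word
aligned = align_labels(example_word_ids, example_labels)
print(aligned)  # [-100, 0, 1, -100, -100]
```

In a real pipeline the `word_ids` list would come from the tokenizer output produced with `is_split_into_words=True`, one sentence at a time.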

---

## ⚙️ Training Details

- Base model: `google-bert/bert-base-cased`
- Task: Token Classification (NER-style)
- Framework: Hugging Face `Trainer`

### Training Arguments

- `learning_rate`: 2e-5
- `batch_size`: 16
- `num_train_epochs`: 5
- `max_seq_length`: 256
- `weight_decay`: 0.01

### Training Strategy

- Evaluation: **per epoch**
- Checkpoint saving: **per epoch**
- Best model selection:
  - metric: F1 score
  - `load_best_model_at_end=True`
- Logging:
  - TensorBoard enabled
  - logging every 10 steps

### Hardware Optimization

- Optional **fp16 mixed precision**
- Multi-worker data loading
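
As a rough sketch, the settings listed in this section could be collected as keyword arguments for `transformers.TrainingArguments`. Several details here are assumptions rather than facts from the training script: the helper name `build_training_kwargs`, the output directory, the worker count, and the mapping of `batch_size` to the per-device batch size arguments. The evaluation key is named `eval_strategy` in recent Transformers releases and `evaluation_strategy` in older ones, so it may need adjusting. `max_seq_length` is not a `TrainingArguments` field; it would normally be applied at tokenization time.

```python
from typing import Any, Dict


def build_training_kwargs(
    output_dir: str = "outputs/finance-term-extractor",  # hypothetical path
    use_fp16: bool = False,   # enable only on hardware that supports fp16
    num_workers: int = 2,     # hypothetical value; the card does not state one
) -> Dict[str, Any]:
    """Collect the hyperparameters from this card as TrainingArguments keyword arguments."""
    return {
        "output_dir": output_dir,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,  # assumption: batch_size is per device
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 5,
        "weight_decay": 0.01,
        "eval_strategy": "epoch",           # "evaluation_strategy" in older releases
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
        "metric_for_best_model": "f1",
        "logging_steps": 10,
        "report_to": ["tensorboard"],
        "fp16": use_fp16,
        "dataloader_num_workers": num_workers,
    }


training_kwargs = build_training_kwargs()
print(training_kwargs)

# With transformers installed, these would be passed on as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

Keeping the save and evaluation strategies identical is required for `load_best_model_at_end=True`, since the best checkpoint can only be chosen among checkpoints that were evaluated.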

---

## 📊 Evaluation

Evaluation is performed using the `seqeval` library.

Metrics:

- F1 score (primary metric)
- Full classification report (printed during training)
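
Before `seqeval` can score predictions, label IDs have to be turned back into tag strings and the `-100` positions dropped. The sketch below shows that conversion in plain Python. The function name `decode_for_seqeval` and the `B-TERM`/`I-TERM`/`O` tag set are hypothetical, and logits are given as nested lists to keep the example dependency-free.

```python
from typing import Dict, List, Sequence, Tuple

IGNORE_INDEX = -100


def decode_for_seqeval(
    logits: Sequence[Sequence[Sequence[float]]],
    label_ids: Sequence[Sequence[int]],
    id2label: Dict[int, str],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Convert per-token logits and gold IDs into tag-string lists, skipping ignored positions."""
    true_tags: List[List[str]] = []
    pred_tags: List[List[str]] = []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_true: List[str] = []
        sent_pred: List[str] = []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == IGNORE_INDEX:
                continue  # special tokens and ignored word pieces are not scored
            best = max(range(len(token_logits)), key=token_logits.__getitem__)  # argmax
            sent_true.append(id2label[gold])
            sent_pred.append(id2label[best])
        true_tags.append(sent_true)
        pred_tags.append(sent_pred)
    return true_tags, pred_tags


# Made-up example: one sentence, four positions, two of them ignored.
example_id2label = {0: "O", 1: "B-TERM", 2: "I-TERM"}
example_logits = [[[0.1, 0.2, 0.3], [2.0, 0.1, 0.1], [0.1, 3.0, 0.2], [0.0, 0.0, 0.0]]]
example_gold = [[-100, 0, 1, -100]]
y_true, y_pred = decode_for_seqeval(example_logits, example_gold, example_id2label)
print(y_true, y_pred)

# With seqeval installed, the two lists can be scored with:
#   from seqeval.metrics import f1_score, classification_report
#   f1_score(y_true, y_pred); print(classification_report(y_true, y_pred))
```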

Example:

```text
precision    recall  f1-score   support
...
```

---

## 🎯 Intended Use

This model is suitable for:

- Financial terminology extraction
- Terminology preprocessing for translation systems
- Support for computer-assisted translation (CAT) tools
- Domain-specific NLP pipelines

---

## 🚫 Out-of-Scope Use

This model is not intended for:

- General-purpose NER tasks
- Legal or compliance decision-making
- Fully automated terminology validation without human review

---

## 🚀 Usage

```python
from transformers import pipeline

# "simple" aggregation groups adjacent tokens with the same label into spans
pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple",
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))
```

---

## 🧾 Example

Input:

`"The company issued convertible bonds and derivatives."`

Output:

`["convertible bonds", "derivatives"]`

---

## ⚠️ Limitations

- Domain-specific: performance may degrade on text outside finance
- Rare or unseen terms may not be recognized
- Tokenization may split multi-word terms
- Human validation is recommended

---

## 📜 License

This model is derived from data released under CC BY-NC 4.0.

- ✅ Non-commercial use allowed
- ❌ Commercial use prohibited without permission
- ✅ Attribution required

The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.

---

## 🙏 Acknowledgements

- Base model: `google-bert/bert-base-cased`
- Dataset: WMT 2025 terminology resources
- Framework: Hugging Face Transformers & Datasets
- Metrics: `seqeval`