---
language:
- en
license: other
license_name: cc-by-nc-4.0-derived
base_model: google-bert/bert-base-cased
library_name: transformers
pipeline_tag: token-classification
tags:
- finance
- terminology
- term-extraction
- token-classification
- bert
- english
- ner
datasets:
- wmt-2025-terminology
---
# BERT Finance Term Extractor (English)
A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.
---
## Model Description
This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.
It performs NER-style token-level classification to identify financial terms in text, and is designed for translation workflows, terminology mining, and domain-specific NLP pipelines.
---
## Training Pipeline
The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.
### Data Processing
- Input format: **CoNLL-style token-tag sequences**
- Sentences are split by blank lines
- Labels are converted into integer IDs (`label2id`, `id2label`)
- Automatic **train/dev split** using configurable ratio (`dev_ratio=0.1`)
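
The data-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the helper names (`read_conll`, `build_label_maps`, `train_dev_split`), the assumption that the tag sits in the last whitespace-separated column, the fixed random seed, and the sample `B-TERM`/`I-TERM`/`O` tags are all hypothetical.

```python
import random


def read_conll(lines):
    """Parse CoNLL-style lines into (tokens, tags) pairs, one pair per sentence.

    Assumes one token per line with the tag in the last whitespace-separated
    column; a blank line ends the current sentence.
    """
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # flush the last sentence when there is no trailing blank line
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(sentences):
    """Build label2id / id2label from the tags seen in the data (sorted for stability)."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def train_dev_split(sentences, dev_ratio=0.1, seed=42):
    """Shuffle a copy of the sentences and hold out a dev_ratio share (at least one)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical sample data; the real tag set is not documented in this card.
sample = """The O
firm O
issued O
convertible B-TERM
bonds I-TERM
. O

Derivatives B-TERM
rose O
. O
""".splitlines()

sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
train, dev = train_dev_split(sentences, dev_ratio=0.1)
```

Sorting the labels before assigning IDs keeps the mapping reproducible between runs, which matters because `id2label` is stored in the model config.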
### Tokenization & Label Alignment
- Tokenizer: `BertTokenizerFast`
- Tokenization uses `is_split_into_words=True`
- Word-piece alignment handled via `word_ids()`
- Special tokens are assigned the label `-100`, which the loss function ignores
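
A minimal sketch of the alignment step is shown below. It works on a plain list of word indices so it can run without downloading a tokenizer; in practice that list would come from `encoding.word_ids()` after calling the tokenizer with `is_split_into_words=True`. The function name `align_labels` and the `label_all_subwords` switch are hypothetical, and the handling of continuation word pieces is an assumption: the card only states that special tokens receive `-100`.

```python
IGNORE_INDEX = -100  # positions carrying this label are skipped by the loss


def align_labels(word_ids, word_label_ids, label_all_subwords=False):
    """Expand word-level label IDs to word-piece level.

    word_ids: one entry per word piece, None for special tokens.
    word_label_ids: one label ID per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS], [SEP], and padding
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First word piece of a word keeps the word's label
            aligned.append(word_label_ids[word_id])
        else:
            # Continuation piece: ignored by default (assumption, see lead-in)
            aligned.append(word_label_ids[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical word_ids for: [CLS] + three pieces of word 0 + one piece of word 1 + [SEP]
word_ids = [None, 0, 0, 0, 1, None]
word_label_ids = [0, 1]
print(align_labels(word_ids, word_label_ids))
# [-100, 0, -100, -100, 1, -100]
```

Labeling only the first word piece keeps evaluation at word level; setting `label_all_subwords=True` is the other common convention.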
---
## Training Details
- Base model: `google-bert/bert-base-cased`
- Task: Token Classification (NER-style)
- Framework: Hugging Face `Trainer`
### Training Arguments
- learning_rate: 2e-5
- batch_size: 16
- num_train_epochs: 5
- max_seq_length: 256
- weight_decay: 0.01
### Training Strategy
- Evaluation: **per epoch**
- Checkpoint saving: **per epoch**
- Best model selection:
  - metric: F1 score
  - `load_best_model_at_end=True`
- Logging:
  - TensorBoard enabled
  - logging every 10 steps
### Hardware Optimization
- Optional **fp16 mixed precision**
- Multi-worker dataloading
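
For readers reproducing a similar run, the settings listed above could map onto `TrainingArguments` keyword arguments roughly as follows. This is a sketch under assumptions, not the author's configuration file: the output path, the eval batch size, the worker count, and the mapping of `batch_size` to the per-device arguments are all guesses.

```python
# Keyword arguments mirroring the settings in this card.
training_kwargs = {
    "output_dir": "outputs/finance-term-extractor",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumed mapping of "batch_size: 16"
    "per_device_eval_batch_size": 16,   # assumption; not stated in the card
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # named "evaluation_strategy" in older transformers releases
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",  # expects compute_metrics to return an "f1" key
    "logging_steps": 10,
    "report_to": "tensorboard",
    "fp16": False,  # optional; enable on GPUs that support mixed precision
    "dataloader_num_workers": 2,  # assumed value; the card only says "multi-worker"
}

# max_seq_length is applied at tokenization time (truncation), not by the Trainer.
MAX_SEQ_LENGTH = 256

# With transformers installed, the dictionary can be unpacked directly:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```

`load_best_model_at_end=True` requires the evaluation and save strategies to match, which is why both are set to `"epoch"` here.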
---
## Evaluation
Evaluation is performed using the `seqeval` library.
Metrics:
- F1 score (primary metric)
- Full classification report (printed during training)
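
`seqeval` scores lists of tag strings, so predicted IDs first have to be converted back to tags with the `-100` positions dropped. The helper below is a hedged sketch of that step: the name `to_tag_sequences` and the `id2label` mapping are hypothetical, and inside a `Trainer` `compute_metrics` function the predicted IDs would typically come from an argmax over the logits.

```python
def to_tag_sequences(pred_ids, label_ids, id2label, ignore_index=-100):
    """Convert ID matrices to tag-string lists, dropping ignored positions."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        # Keep only positions that carry a real label (not special tokens etc.)
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        pred_tags.append([id2label[p] for p, _ in kept])
        true_tags.append([id2label[l] for _, l in kept])
    return true_tags, pred_tags


id2label = {0: "B-TERM", 1: "I-TERM", 2: "O"}  # hypothetical tag set
label_ids = [[-100, 2, 0, -100, 1, 2, -100]]
pred_ids = [[2, 2, 0, 1, 1, 0, 2]]

true_tags, pred_tags = to_tag_sequences(pred_ids, label_ids, id2label)
# true_tags == [["O", "B-TERM", "I-TERM", "O"]]
# pred_tags == [["O", "B-TERM", "I-TERM", "B-TERM"]]

# With seqeval installed, the lists can be scored directly:
# from seqeval.metrics import f1_score, classification_report
# f1 = f1_score(true_tags, pred_tags)
# print(classification_report(true_tags, pred_tags))
```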
Example classification report:
```text
precision recall f1-score support
...
```
## Intended Use
This model is suitable for:
- Financial terminology extraction
- Terminology preprocessing for translation systems
- Supporting CAT tools
- Domain-specific NLP pipelines
## Out-of-Scope Use
This model is not intended for:
- General-purpose NER tasks
- Legal or compliance decision-making
- Fully automated terminology validation without human review
## Usage
```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple",
)
text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))
```
## Example
Input: `"The company issued convertible bonds and derivatives."`

Output: `["convertible bonds", "derivatives"]`
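
The list of strings above is not the raw pipeline output: with `aggregation_strategy="simple"`, the token-classification pipeline returns dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. One way to reduce them to a term list is sketched below. The helper `extract_terms`, the `min_score` threshold, the case-insensitive de-duplication, and the mocked entities (including the `TERM` label and the scores) are illustrative assumptions, not actual model output.

```python
def extract_terms(text, entities, min_score=0.0):
    """Return unique term strings in order of appearance.

    Terms are sliced from the original text via character offsets, since the
    pipeline's reconstructed "word" field can differ from the source spelling.
    """
    terms, seen = [], set()
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < min_score:
            continue
        term = text[ent["start"]:ent["end"]]
        key = term.lower()  # de-duplicate case-insensitively
        if key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


text = "The company issued convertible bonds and derivatives."


def mock_entity(surface, score):
    """Build a pipeline-shaped dict for a substring of `text` (illustration only)."""
    start = text.index(surface)
    return {
        "entity_group": "TERM",  # hypothetical label name
        "score": score,
        "word": surface,
        "start": start,
        "end": start + len(surface),
    }


entities = [mock_entity("convertible bonds", 0.98), mock_entity("derivatives", 0.95)]
print(extract_terms(text, entities))
# ['convertible bonds', 'derivatives']
```

In real use, `entities` would be the result of `pipe(text)`; raising `min_score` trades recall for precision before human review.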
## Limitations
- Domain-specific: performance outside finance may degrade
- Rare or unseen terms may not be recognized
- Tokenization may split multi-word terms
- Human validation is recommended
## License
This model is derived from data released under CC BY-NC 4.0.
- Non-commercial use is allowed
- Commercial use is prohibited without permission
- Attribution is required

The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.
## Acknowledgements
- Base model: `google-bert/bert-base-cased`
- Dataset: WMT 2025 terminology resources
- Framework: Hugging Face Transformers & Datasets
- Metrics: `seqeval`