---
language:
- en
license: other
license_name: cc-by-nc-4.0-derived
base_model: google-bert/bert-base-cased
library_name: transformers
pipeline_tag: token-classification
tags:
- finance
- terminology
- term-extraction
- token-classification
- bert
- english
- ner
datasets:
- wmt-2025-terminology
---

# BERT Finance Term Extractor (English)

A BERT-based token classification model fine-tuned for extracting finance-related terminology from English text.

---

## 🧠 Model Description

This model is fine-tuned from `google-bert/bert-base-cased` for **domain-specific terminology extraction**.

It performs token-level classification (NER-style) to identify financial terms in text.

The model is particularly designed for applications in translation workflows, terminology mining, and domain-specific NLP pipelines.

---

## 🏗️ Training Pipeline

The model is trained using a custom pipeline built on Hugging Face Transformers and Datasets.

### Data Processing

- Input format: **CoNLL-style token-tag sequences**
- Sentences are split by blank lines
- Labels are converted into integer IDs (`label2id`, `id2label`)
- Automatic **train/dev split** using configurable ratio (`dev_ratio=0.1`)

### Tokenization & Label Alignment

- Tokenizer: `BertTokenizerFast`
- Tokenization uses `is_split_into_words=True`
- Word-piece alignment handled via `word_ids()`
- Special tokens assigned label `-100` (ignored in loss)

---

## ⚙️ Training Details

- Base model: `google-bert/bert-base-cased`
- Task: Token Classification (NER-style)
- Framework: Hugging Face `Trainer`

### Training Arguments

- learning_rate: 2e-5
- batch_size: 16
- num_train_epochs: 5
- max_seq_length: 256
- weight_decay: 0.01

### Training Strategy

- Evaluation: **per epoch**
- Checkpoint saving: **per epoch**
- Best model selection:
  - metric: F1 score
  - `load_best_model_at_end=True`
- Logging:
  - TensorBoard enabled
  - logging every 10 steps

### Hardware Optimization

- Optional **fp16 mixed precision**
- Multi-worker dataloading

---
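The data processing and label alignment steps listed under Training Pipeline can be sketched in plain Python. This is a minimal illustration, not the actual training script: the function names (`read_conll`, `build_label_maps`, `align_labels`), the `B-TERM`/`I-TERM`/`O` tag names, and the hand-written `word_ids` list are all assumptions. The sketch also assumes that only the first word-piece of each word keeps its label and that continuation pieces receive `-100`, which is a common convention but is not confirmed by this card.

```python
from typing import Dict, List, Optional, Tuple

IGNORE_INDEX = -100  # label value that the loss function skips


def read_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Parse CoNLL-style text: one 'token tag' pair per line, blank line between sentences."""
    sentences: List[Tuple[List[str], List[str]]] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def build_label_maps(
    sentences: List[Tuple[List[str], List[str]]]
) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build label2id / id2label from the tags seen in the data (sorted for determinism)."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def align_labels(word_ids: List[Optional[int]], word_label_ids: List[int]) -> List[int]:
    """Map word-level label IDs onto word-piece positions.

    `word_ids` has the shape returned by a fast tokenizer's `word_ids()`:
    None for special tokens, otherwise the index of the source word.
    Assumption: continuation word-pieces are ignored in the loss.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)          # special token ([CLS], [SEP], padding)
        elif wid != previous:
            aligned.append(word_label_ids[wid])   # first word-piece of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation word-piece
        previous = wid
    return aligned


# Hypothetical sample data; the tag names are illustrative only.
sample = (
    "The O\nfirm O\nissued O\nconvertible B-TERM\nbonds I-TERM\n. O\n"
    "\n"
    "Derivatives B-TERM\nrose O\n"
)
sentences = read_conll(sample)
label2id, id2label = build_label_maps(sentences)
tokens, tags = sentences[0]
word_label_ids = [label2id[tag] for tag in tags]

# Hand-written stand-in for tokenizer output, pretending "convertible"
# was split into two word-pieces (both map to word index 3).
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, None]
print(align_labels(word_ids, word_label_ids))
# prints [-100, 2, 2, 2, 0, -100, 1, 2, -100]
```

In a real run, `word_ids` would come from calling `word_ids()` on the encoding produced by `BertTokenizerFast` with `is_split_into_words=True`, rather than being written by hand.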
## 📊 Evaluation

Evaluation is performed using the `seqeval` library.

Metrics:

- F1 score (primary metric)
- Full classification report (printed during training)

Example:

```text
precision    recall  f1-score   support
...
```

---

## 🎯 Intended Use

This model is suitable for:

- Financial terminology extraction
- Terminology preprocessing for translation systems
- Supporting CAT tools
- Domain-specific NLP pipelines

---

## 🚫 Out-of-Scope Use

This model is not intended for:

- General-purpose NER tasks
- Legal or compliance decision-making
- Fully automated terminology validation without human review

---

## 🚀 Usage

```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="owen4512/bert-base-cased-finance-term-extractor",
    aggregation_strategy="simple"
)

text = "The firm increased exposure to derivatives and sovereign bonds."
print(pipe(text))
```

---

## 🧾 Example

Input:

`"The company issued convertible bonds and derivatives."`

Output:

`["convertible bonds", "derivatives"]`

---

## ⚠️ Limitations

- Domain-specific: performance outside finance may degrade
- Rare or unseen terms may not be recognized
- Tokenization may split multi-word terms
- Human validation is recommended

---

## 📜 License

This model is derived from data released under CC BY-NC 4.0.

- ✅ Non-commercial use allowed
- ❌ Commercial use prohibited without permission
- ✅ Attribution required

The base model `google-bert/bert-base-cased` is licensed under Apache 2.0, but this fine-tuned model inherits restrictions from the training data.

---

## 🙏 Acknowledgements

- Base model: google-bert/bert-base-cased
- Dataset: WMT 2025 terminology resources
- Framework: Hugging Face Transformers & Datasets
- Metrics: seqeval
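The Example section shows the output as a plain list of terms, whereas a `token-classification` pipeline with `aggregation_strategy="simple"` returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. A small post-processing step along the following lines could bridge the two. This is a sketch under stated assumptions: the `extract_terms` and `mock_entity` helpers, the `TERM` group name, the `min_score` threshold, and the scores in the mock data are all hypothetical and are not output from this model.

```python
from typing import Any, Dict, List


def extract_terms(
    text: str, entities: List[Dict[str, Any]], min_score: float = 0.5
) -> List[str]:
    """Turn aggregated pipeline entities into a de-duplicated list of term strings.

    Terms are sliced from the original text using character offsets, so the
    original casing and spacing are preserved.
    """
    terms: List[str] = []
    seen = set()
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop low-confidence spans (threshold is an assumption)
        term = text[ent["start"]:ent["end"]].strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


def mock_entity(text: str, term: str, score: float) -> Dict[str, Any]:
    """Build a dictionary shaped like one aggregated pipeline result (illustrative only)."""
    start = text.index(term)
    return {
        "entity_group": "TERM",  # hypothetical label name
        "score": score,          # made-up score, not real model output
        "word": term,
        "start": start,
        "end": start + len(term),
    }


text = "The company issued convertible bonds and derivatives."
mock_results = [
    mock_entity(text, "convertible bonds", 0.97),
    mock_entity(text, "and", 0.12),
    mock_entity(text, "derivatives", 0.93),
]
print(extract_terms(text, mock_results))
# prints ['convertible bonds', 'derivatives']
```

With the real model, `mock_results` would be replaced by `pipe(text)` from the Usage section, and the extracted terms should still go through human review as noted under Limitations.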