---
license: mit
language:
- en
metrics:
- perplexity
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
- fill-mask
- transformers
- finance
- central-bank
- financial-nlp
- economic-policy
- monetary-policy
- BIS
- speeches
- BIS-Speeches
- pretraining
- domain-adaptation
- financial-domain-adaptation
---

# Central Bank-BERT: Domain-Adaptive Masked Language Model for Central Bank Communication

**Central Bank-BERT** is a **domain-adapted masked language model** based on `bert-base-uncased`, further pretrained on more than **66 million tokens** across **over 2 million sentences** extracted from **central bank speeches published by the Bank for International Settlements (1996–2024)**.

The model is optimized for **masked token prediction** in the domains of **monetary policy**, **financial regulation**, and **macroeconomic communication**, which makes it a better fit than the general-purpose base model for central banking discourse and financial narratives.

## Dataset Summary

- **Source**: BIS Central Bank Speeches (1996–2024)
- **Total Speeches**: 19,609
- **MLM Sentences**: 2,087,615 (_~2.09M_)
- **Total Tokens**: 66,359,113 (_~66.36M_)
- **Avg. Tokens per Sentence**: 31.79

## Model & Training Details

| **Category** | **Details** |
|------------------------|-----------------------------------------------------------------------------|
| **Tokenizer** | `BertTokenizerFast` (base: `bert-base-uncased`)<br>Vocab Size: **30,522**<br>Max Seq Length: **128** |
| **Model** | `BertForMaskedLM` (initialized from `bert-base-uncased`)<br>Total Params: **109,514,298 (~109.5M)**<br>Trainable Params: **109,514,298** |
| **Training Setup** | Epochs: **1**<br>Batch Size (per device): **16**<br>Gradient Accumulation: **2**<br>Effective Batch Size: **32**<br>MLM Probability: **15%** |
| **Hardware** | Device: **NVIDIA Tesla P100 (Kaggle)**<br>Mixed Precision: **fp16** |
| **Training Duration** | **~8 hrs 18 mins**<br>Start: *2025-07-19 17:17*<br>End: *2025-07-20 01:35* |
| **Evaluation Results** | **Perplexity**<br>bert-base: *13.06*<br>[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT): **4.66** |

> Lower perplexity indicates a better fit to domain-specific central bank language.

**Notebook: Training, Evaluation & Results**

The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook [`cb-bert-mlm.ipynb`](./cb-bert-mlm.ipynb). The notebook includes the actual outputs from the training run, perplexity comparisons, manual masked sentence evaluations, and Top-K accuracy analysis, so the model development process can be inspected and reproduced.

**Model Files**

- `model.safetensors`: Trained model weights
- `config.json`: Model architecture and hyperparameters
- `tokenizer.json`: Serialized tokenizer
- `vocab.txt`: Vocabulary file
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `training_args.bin`: Training arguments used during pretraining

This repository includes all files required to load, evaluate, or fine-tune the [`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) model with Hugging Face's `transformers` library. Together they preserve compatibility with the original training environment and support deployment or transfer learning.

---

## Downstream Models

In addition to the domain-adapted masked language model (**[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT)**), a suite of fine-tuned downstream models has been released to support research and policy analysis on central bank digital currency (CBDC). These models share the same encoder backbone and cover different classification and information extraction tasks on central bank communication.
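
Both the masked language model and the classifiers in this section can, in principle, be loaded through the `transformers` pipeline API. The following is a minimal sketch rather than the author's own code: the helper names (`top_tokens`, `predict_masked`, `classify`) are hypothetical, and it assumes that the downstream classifier repositories expose a standard sequence-classification head that works with the `text-classification` pipeline.

```python
from typing import Dict, List, Tuple

MLM_ID = "bilalzafar/CentralBank-BERT"  # model ID from this model card
CLASSIFIER_ID = "bilalzafar/CBDC-BERT"  # one of the downstream models in this section


def top_tokens(predictions: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring candidates from fill-mask pipeline output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


def predict_masked(text: str, model_id: str = MLM_ID, k: int = 3) -> List[Tuple[str, float]]:
    """Fill the single [MASK] token in `text`; downloads the model on first call."""
    from transformers import pipeline  # imported lazily because it is a heavy dependency

    fill_mask = pipeline("fill-mask", model=model_id)
    return top_tokens(fill_mask(text, top_k=k), k)


def classify(text: str, model_id: str = CLASSIFIER_ID) -> List[Dict]:
    """Run a downstream classifier (assumption: sequence-classification head)."""
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(text)
```

Calling `predict_masked("The central bank raised the policy [MASK] to curb inflation.")` would return a list of `(token, score)` pairs, and `classify(...)` a list of dictionaries with `label` and `score` keys. The label names depend on each model's configuration and are not assumed here.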

| **Model** | **Purpose** | **Intended Use** | **Link** |
| ------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **bilalzafar/CBDC-BERT** | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | [CBDC-BERT](https://huggingface.co/bilalzafar/CBDC-BERT) |
| **bilalzafar/CBDC-Stance** | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | [CBDC-Stance](https://huggingface.co/bilalzafar/CBDC-Stance) |
| **bilalzafar/CBDC-Sentiment** | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | [CBDC-Sentiment](https://huggingface.co/bilalzafar/CBDC-Sentiment) |
| **bilalzafar/CBDC-Type** | 3-class type classifier (Retail, Wholesale, General CBDC mentions). | Distinguishing policy focus (retail vs. wholesale). | [CBDC-Type](https://huggingface.co/bilalzafar/CBDC-Type) |
| **bilalzafar/CBDC-Discourse** | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | [CBDC-Discourse](https://huggingface.co/bilalzafar/CBDC-Discourse) |
| **bilalzafar/CentralBank-NER** | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | [CentralBank-NER](https://huggingface.co/bilalzafar/CentralBank-NER) |

---

## Repository and Replication Package

All **training pipelines, preprocessing scripts, evaluation notebooks, and result outputs** are available in the companion GitHub repository:

🔗 [https://github.com/bilalezafar/CentralBank-BERT](https://github.com/bilalezafar/CentralBank-BERT)

The repository includes:

* End-to-end notebooks for **CentralBank-BERT pretraining** and all **downstream classifiers** (CBDC-BERT, Stance, Sentiment, Type, Discourse, NER).
* Preprocessed **BIS speech dataset subsets** (CBDC-related sentences, annotated splits).
* **Reproducible code** to generate the figures, tables, and evaluation metrics reported in the manuscript.
* Deployment-ready scripts for applying the models to new corpora.

Together these materials support the **transparency, reproducibility, and extension** of the CentralBank-BERT family of models.

---

## Citation

If you use this model, please cite as:

**Zafar, M. B. (2025). CentralBank-BERT: Machine learning evidence on central bank digital currency discourse. *Journal of Economics and Business.* [https://doi.org/10.1016/j.jeconbus.2026.106300](https://doi.org/10.1016/j.jeconbus.2026.106300)**

```bibtex
@article{zafar2025centralbankbert,
  title={CentralBank-BERT: Machine learning evidence on central bank digital currency discourse},
  author={Zafar, Muhammad Bilal},
  year={2026},
  journal={Journal of Economics and Business},
  url={https://doi.org/10.1016/j.jeconbus.2026.106300}
}
```