---
license: mit
language:
- en
metrics:
- perplexity
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
library_name: transformers
tags:
- bert
- masked-language-modeling
- mlm
- fill-mask
- transformers
- finance
- central-bank
- financial-nlp
- economic-policy
- monetary-policy
- BIS
- speeches
- BIS-Speeches
- pretraining
- domain-adaptation
- financial-domain-adaptation
---
# CentralBank-BERT: Domain-Adaptive Masked Language Model for Central Bank Communication
**CentralBank-BERT** is a **domain-adapted masked language model** based on `bert-base-uncased`, further pretrained on more than **66 million tokens** across **over 2 million sentences** extracted from **central bank speeches published by the Bank for International Settlements (1996–2024)**.
The model is adapted for **masked token prediction** on the language of **monetary policy**, **financial regulation**, and **macroeconomic communication**, and it can serve as an encoder backbone for downstream tasks on central bank text.
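
A minimal usage sketch for masked token prediction, assuming the model is loaded from the Hub under the repository id used in this card and that `transformers` with a PyTorch backend is installed. The helper names `predict_masked` and `format_predictions` are illustrative and not part of the released code.

```python
from typing import Dict, List

MODEL_ID = "bilalzafar/CentralBank-BERT"  # repository id as given in this model card


def format_predictions(results: List[Dict]) -> List[str]:
    """Turn fill-mask pipeline output into 'token  score' lines, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [f"{r['token_str']:<12} {r['score']:.4f}" for r in ranked]


def predict_masked(text: str, top_k: int = 5) -> List[Dict]:
    """Run the fill-mask pipeline on text that contains exactly one [MASK] token."""
    # Imported lazily so format_predictions can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)  # downloads weights on first call
    return fill_mask(text, top_k=top_k)
```

Calling `format_predictions(predict_masked("The central bank raised its policy [MASK] to contain inflation."))` returns one line per candidate token, ordered by score. The example sentence is made up for illustration.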
## Dataset Summary
- **Source**: BIS Central Bank Speeches (1996–2024)
- **Total Speeches**: 19,609
- **MLM Sentences**: 2,087,615 (_~2.09M_)
- **Total Tokens**: 66,359,113 (_~66.36M_)
- **Avg. Tokens per Sentence**: 31.79
## Model & Training Details
| **Category** | **Details** |
|------------------------|-----------------------------------------------------------------------------|
| **Tokenizer** | `BertTokenizerFast` (base: `bert-base-uncased`)<br>Vocab Size: **30,522**<br>Max Seq Length: **128** |
| **Model** | `BertForMaskedLM` (initialized from `bert-base-uncased`)<br>Total Params: **109,514,298 (~109.5M)**<br>Trainable Params: **109,514,298** |
| **Training Setup** | Epochs: **1**<br>Batch Size (per device): **16**<br>Gradient Accumulation: **2**<br>Effective Batch Size: **32**<br>MLM Probability: **15%** |
| **Hardware** | Device: **NVIDIA Tesla P100 (Kaggle)**<br>Mixed Precision: **fp16** |
| **Training Duration** | **~8 hrs 18 mins**<br>Start: *2025-07-19 17:17*<br>End: *2025-07-20 01:35* |
| **Evaluation Results** | **Perplexity**<br>bert-base: *13.06*<br>[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT): **4.66** |
> Lower perplexity indicates a better fit to domain-specific central bank language: domain adaptation reduces perplexity from 13.06 (bert-base) to 4.66.
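
To relate the reported perplexities to the training objective, the sketch below assumes the usual convention that perplexity is the exponential of the mean cross-entropy loss (in nats) over masked tokens. The card does not state the exact evaluation procedure, so treat this as an interpretation aid rather than the author's evaluation code.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity is exp of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: recover the mean loss implied by a perplexity."""
    return math.log(ppl)


# Perplexities reported in the evaluation row of the table above.
reported = {"bert-base": 13.06, "CentralBank-BERT": 4.66}

for name, ppl in reported.items():
    print(f"{name}: perplexity {ppl:.2f} -> implied mean loss {loss_from_perplexity(ppl):.3f} nats")

# Relative reduction in perplexity after domain adaptation.
reduction = 1 - reported["CentralBank-BERT"] / reported["bert-base"]
print(f"relative perplexity reduction: {reduction:.1%}")
```

Under this convention the reported figures correspond to a relative perplexity reduction of roughly 64%.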
**Notebook: Training, Evaluation & Results**
The full training pipeline (data preprocessing, tokenizer setup, model training, evaluation, and result visualizations) is documented in the notebook [`cb-bert-mlm.ipynb`](./cb-bert-mlm.ipynb). The notebook retains the outputs of the actual training run, including perplexity comparisons, manual masked-sentence evaluations, and Top-K accuracy analysis, so the results can be inspected and reproduced.
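
As an illustration of what a Top-K accuracy analysis measures, the sketch below counts how often the true masked token appears among a model's `k` highest-ranked candidates. It is a generic definition under that assumption, not the notebook's code, and the toy predictions are invented for the example.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    gold_tokens: Sequence[str],
    k: int,
) -> float:
    """Share of masked positions whose gold token is among the top-k candidates."""
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("need one ranked candidate list per gold token")
    if not gold_tokens:
        return 0.0
    hits = sum(gold in preds[:k] for preds, gold in zip(ranked_predictions, gold_tokens))
    return hits / len(gold_tokens)


# Toy data (hypothetical): candidates are ordered best first.
preds = [
    ["rate", "stance", "target"],
    ["inflation", "growth", "stability"],
    ["policy", "rate", "bank"],
]
gold = ["rate", "stability", "liquidity"]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(preds, gold, k):.2f}")
```

On this toy data the gold token is ranked first once and within the top three twice, out of three masked positions.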
**Model Files**
- `model.safetensors`: Trained model weights
- `config.json`: Model architecture and hyperparameters
- `tokenizer.json`: Serialized tokenizer
- `vocab.txt`: Vocabulary file
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special tokens mapping
- `training_args.bin`: Training arguments used during pretraining
This repository includes all files needed to load, evaluate, or fine-tune the [`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) model with Hugging Face's `transformers` library.
Together they preserve the tokenizer and model configuration used during training, which supports deployment and transfer learning.
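
When working from a local copy of the repository, a quick check that the listed files are present can catch an incomplete download before loading. The sketch below uses only the standard library; the helper name `missing_files` is hypothetical. Note that `training_args.bin` records the training arguments and is not needed for inference.

```python
import tempfile
from pathlib import Path
from typing import List

# File names as listed under "Model Files" in this card.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "training_args.bin",
]


def missing_files(model_dir) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that lacks only training_args.bin.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:-1]:
        (Path(tmp) / name).touch()
    print(missing_files(tmp))
```

If the check returns an empty list, the same directory path can be passed to `AutoTokenizer.from_pretrained` and `AutoModelForMaskedLM.from_pretrained`.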
---
## Downstream Models
In addition to the domain-adapted masked language model (**[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT)**), a suite of fine-tuned downstream models has been released to support research and policy analysis on central bank digital currency (CBDC). These models share the same encoder backbone and cover different classification and information extraction tasks on central bank communication.
| **Model** | **Purpose** | **Intended Use** | **Link** |
| ------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **bilalzafar/CBDC-BERT** | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | [CBDC-BERT](https://huggingface.co/bilalzafar/CBDC-BERT) |
| **bilalzafar/CBDC-Stance** | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | [CBDC-Stance](https://huggingface.co/bilalzafar/CBDC-Stance) |
| **bilalzafar/CBDC-Sentiment** | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | [CBDC-Sentiment](https://huggingface.co/bilalzafar/CBDC-Sentiment) |
| **bilalzafar/CBDC-Type** | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | [CBDC-Type](https://huggingface.co/bilalzafar/CBDC-Type) |
| **bilalzafar/CBDC-Discourse** | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | [CBDC-Discourse](https://huggingface.co/bilalzafar/CBDC-Discourse) |
| **bilalzafar/CentralBank-NER** | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | [CentralBank-NER](https://huggingface.co/bilalzafar/CentralBank-NER) |
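
A hedged sketch of applying one of the classifiers above to a batch of sentences with the `text-classification` pipeline. The repository id comes from the table; the label strings each model emits are not documented here, so the code makes no assumption about them and simply tallies whatever labels are returned. The helper names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List

CLASSIFIER_ID = "bilalzafar/CBDC-BERT"  # repository id from the table above


def label_counts(results: Iterable[Dict]) -> Counter:
    """Tally predicted labels from text-classification pipeline output."""
    return Counter(r["label"] for r in results)


def classify(sentences: List[str], model_id: str = CLASSIFIER_ID) -> List[Dict]:
    """Return one {'label', 'score'} dict per input sentence."""
    # Imported lazily so label_counts can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)  # downloads on first call
    return classifier(sentences, truncation=True)
```

For example, `label_counts(classify(sentences))` summarizes how many sentences in a corpus fall under each predicted label; consult the individual model cards for the meaning of the labels.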
---
## Repository and Replication Package
All **training pipelines, preprocessing scripts, evaluation notebooks, and result outputs** are available in the companion GitHub repository:
🔗 [https://github.com/bilalezafar/CentralBank-BERT](https://github.com/bilalezafar/CentralBank-BERT)
The repository includes:
* End-to-end notebooks for **CentralBank-BERT pretraining** and all **downstream classifiers** (CBDC-BERT, Stance, Sentiment, Type, Discourse, NER).
* Preprocessed **BIS speech dataset subsets** (CBDC-related sentences, annotated splits).
* **Reproducible code** to generate figures, tables, and evaluation metrics reported in the manuscript.
* Deployment-ready scripts for applying models to new corpora.
This ensures full **transparency, reproducibility, and extension** of the CentralBank-BERT family of models.
---
## Citation
If you use this model, please cite as:
**Zafar, M. B. (2026). CentralBank-BERT: Machine learning evidence on central bank digital currency discourse. *Journal of Economics and Business.* [https://doi.org/10.1016/j.jeconbus.2026.106300](https://doi.org/10.1016/j.jeconbus.2026.106300)**
```bibtex
@article{zafar2025centralbankbert,
title={CentralBank-BERT: Machine learning evidence on central bank digital currency discourse},
author={Zafar, Muhammad Bilal},
year={2026},
journal={Journal of Economics and Business},
url={https://doi.org/10.1016/j.jeconbus.2026.106300}
}
```