---
license: apache-2.0
language:
- en
tags:
- clinical-nlp
- antimicrobial-stewardship
- bert
- multilabel-classification
- hospital
- medical
pipeline_tag: text-classification
library_name: pytorch
base_model: emilyalsentzer/Bio_ClinicalBERT
---

# NCAS Hospital Indication Classifier

A **BioClinicalBERT**-based multilabel classifier for categorising antimicrobial prescription
indication text from hospital electronic medical records (EMR). It was developed as part of a
research project at RMIT University / The Royal Melbourne Hospital (RMH) investigating automated
antimicrobial stewardship support.

## Model description

| Attribute | Value |
|-----------|-------|
| Base encoder | [emilyalsentzer/Bio_ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) |
| Pooling | Mean pooling over token embeddings |
| Classification head | Linear + Sigmoid |
| Task | Multilabel classification (8 categories) |
| Training data | ~2,000 manually annotated hospital prescription records (RMH 2021) |
| Held-out evaluation | 600 records from RMH 2022, 2023, 2024 |
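
The pooling and classification head listed above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names are hypothetical, and it assumes that padding tokens are excluded from the mean via the attention mask, which the table does not state.

```python
import math
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the embeddings of non-padding tokens (mask value 1)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def linear_sigmoid(
    pooled: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Apply one linear layer (one weight row per label), then a sigmoid."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy example: three tokens of dimension 2, the last one is padding.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
probs = linear_sigmoid(pooled, weight=[[0.0, 0.0], [1.0, -1.0]], bias=[0.0, 1.0])
```

Because each label gets its own sigmoid, the outputs are independent probabilities rather than a distribution that sums to one, which is what allows several labels per sample.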

## Label schema (8catb)

| Label | Description |
|-------|-------------|
| `respiratory - ioi` | Respiratory infection of indication |
| `skin and soft tissue - ioi` | Skin/soft-tissue infection of indication |
| `urinary tract - ioi` | Urinary tract infection of indication |
| `other` | Other or unspecified indication |
| `sepsis` | Sepsis or bacteraemia |
| `undifferentiated infection` | Infection without identified source |
| `organism only` | Organism identified but no clinical syndrome specified |
| `no indication documented` | No clinical indication present in the text |

A sample can receive one or more labels simultaneously (multilabel).
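
As a rough illustration of how multilabel output can be decoded, the sketch below compares each label's probability against its own threshold. The function name is hypothetical, the `>=` comparison is an assumption, and the label order shown simply follows the table above; the order stored in the checkpoint's `label_columns` should be treated as authoritative.

```python
from typing import List, Sequence

# Label names copied from the schema table; the order is an assumption.
LABELS = [
    "respiratory - ioi",
    "skin and soft tissue - ioi",
    "urinary tract - ioi",
    "other",
    "sepsis",
    "undifferentiated infection",
    "organism only",
    "no indication documented",
]


def decode_labels(
    probs: Sequence[float],
    thresholds: Sequence[float],
    labels: Sequence[str] = LABELS,
) -> List[str]:
    """Return every label whose probability reaches its threshold."""
    if not (len(probs) == len(thresholds) == len(labels)):
        raise ValueError("probs, thresholds and labels must have equal length")
    return [name for name, p, t in zip(labels, probs, thresholds) if p >= t]


# Made-up probabilities for one sample, with a flat 0.5 threshold.
example = decode_labels(
    probs=[0.9, 0.1, 0.7, 0.05, 0.2, 0.1, 0.0, 0.0],
    thresholds=[0.5] * 8,
)
```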

## Post-processing rule

After model prediction, `sepsis` is removed from any sample that also receives
`respiratory - ioi` or `skin and soft tissue - ioi`. If removal would leave the sample with
no labels, `sepsis` is kept (fallback guarantee).
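
The rule can be sketched as a small function over the predicted label names. This is an illustration with a hypothetical function name, not the code from the repository.

```python
from typing import Iterable, List

# Labels whose presence triggers suppression of "sepsis".
SUPPRESSING_LABELS = {"respiratory - ioi", "skin and soft tissue - ioi"}


def suppress_sepsis(labels: Iterable[str]) -> List[str]:
    """Drop "sepsis" when a respiratory or skin/soft-tissue label is present."""
    predicted = list(labels)
    if "sepsis" in predicted and SUPPRESSING_LABELS.intersection(predicted):
        filtered = [name for name in predicted if name != "sepsis"]
        # Fallback guarantee: never return an empty label set.
        if filtered:
            return filtered
    return predicted


both = suppress_sepsis(["sepsis", "respiratory - ioi"])
alone = suppress_sepsis(["sepsis"])
other = suppress_sepsis(["sepsis", "urinary tract - ioi"])
```

Under the rule as stated, suppression only happens when another label is present, so the fallback branch should not trigger in practice; it is kept here as a defensive check that mirrors the description.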

## Usage

### Quick start

```python
from huggingface_hub import hf_hub_download
from ncas_indication.model import ClinicalBERTClassifier
from transformers import AutoTokenizer

# Download checkpoint
model_path = hf_hub_download(
    repo_id="jibmaird/NCAS-hospital-indication-classifier",
    filename="indication_classifier_model.pt",
)

# Load model (label names and thresholds are embedded in the checkpoint)
model, label_columns, thresholds = ClinicalBERTClassifier.from_checkpoint(model_path)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
```

Alternatively, use the inference script from the [GitHub repository](https://github.com/jibmaird/NCAS-hospital-indication-classifier):

```bash
# Single text
python inference/predict.py --text "UTI prophylaxis post-renal transplant"

# CSV file
python inference/predict.py --input your_file.csv --output predictions.csv
```

### Desktop application

A cross-platform desktop GUI is available in the `app/` folder of the repository.
See [app/README.md](https://github.com/jibmaird/NCAS-hospital-indication-classifier/blob/main/app/README.md).

## Training

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 1e-5 |
| Batch size | 8 |
| Epochs | 20 |
| Optimizer | AdamW |
| Loss function | Weighted BCE (inverse-frequency weights) |
| Validation split | 20% of training data |
| Threshold selection | Per-label F1 maximisation on validation set |
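
The table does not give the exact weighting formula, so the following is only one plausible reading of "inverse-frequency weights": each label's weight is `n_samples / (n_labels * positives)`, and the weight scales that label's binary cross-entropy term. The function names and the formula are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def inverse_frequency_weights(targets: Sequence[Sequence[int]]) -> List[float]:
    """One weight per label, inversely proportional to its positive count."""
    n_samples = len(targets)
    n_labels = len(targets[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in targets)
        # max(..., 1) avoids division by zero for labels that never occur.
        weights.append(n_samples / (n_labels * max(positives, 1)))
    return weights


def weighted_bce(
    probs: Sequence[float],
    target: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted binary cross-entropy for one sample, averaged over labels."""
    eps = 1e-7  # clamp probabilities so log() stays finite
    total = 0.0
    for p, y, w in zip(probs, target, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy targets: label 0 is frequent (3 of 4 samples), label 1 is rare (1 of 4).
toy_weights = inverse_frequency_weights([[1, 0], [1, 0], [1, 1], [0, 0]])
toy_loss = weighted_bce([0.5, 0.5], [1, 0], [1.0, 1.0])
```

With this formula the rare label receives the larger weight, so errors on it contribute more to the loss than errors on the frequent label.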

### Training procedure

1. The combined dataset of ~2,000 labelled records was split 80/20 for training and validation.
2. Inverse-frequency class weights were applied to the BCE loss to address label imbalance.
3. Per-label decision thresholds were optimised on the validation set by grid search over
   [0.1, 0.2, …, 0.8] to maximise label-specific F1.
4. The model with the best weighted-macro F1 across epochs was retained.
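
Step 3 can be sketched for a single label as follows. The grid is assumed to advance in steps of 0.1, ties are resolved in favour of the lowest threshold, and a probability equal to the threshold counts as positive; none of these details are confirmed by the description above, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Candidate thresholds 0.1, 0.2, ..., 0.8 (step size is an assumption).
GRID: List[float] = [round(0.1 * k, 1) for k in range(1, 9)]


def best_threshold(
    probs: Sequence[float],
    y_true: Sequence[int],
    grid: Sequence[float] = GRID,
) -> Tuple[float, float]:
    """Return (threshold, F1) with the highest F1 for one label."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:  # strict '>' keeps the lowest tied threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Validation probabilities and gold labels for one label (made-up values).
threshold, score = best_threshold([0.15, 0.35, 0.65, 0.9], [0, 0, 1, 1])
```

Running the same search once per label column would yield a list of eight thresholds, matching the per-label thresholds described in the hyperparameter table.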

## Checkpoint format

The `.pt` file is a standard PyTorch checkpoint dict with keys:

```python
{
    "model_state_dict":   ...,   # nn.Module weights
    "label_columns":      [...], # ordered label names
    "optimal_thresholds": [...], # per-label decision thresholds
    "n_labels":           8,
    "base_model":         "emilyalsentzer/Bio_ClinicalBERT",
}
```
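
A loaded checkpoint dict can be sanity-checked before use. The helper below is a hypothetical sketch that assumes `label_columns` and `optimal_thresholds` are parallel sequences in the same order; it runs on a plain dict so that it can be tried without PyTorch.

```python
from typing import Any, Dict

REQUIRED_KEYS = {
    "model_state_dict",
    "label_columns",
    "optimal_thresholds",
    "n_labels",
    "base_model",
}


def thresholds_by_label(checkpoint: Dict[str, Any]) -> Dict[str, float]:
    """Validate the checkpoint layout and pair each label with its threshold."""
    missing = REQUIRED_KEYS - set(checkpoint)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    labels = list(checkpoint["label_columns"])
    thresholds = [float(t) for t in checkpoint["optimal_thresholds"]]
    if not (len(labels) == len(thresholds) == checkpoint["n_labels"]):
        raise ValueError("label/threshold counts do not match n_labels")
    return dict(zip(labels, thresholds))


# Stand-in for the real file, which would typically be read with
# torch.load(model_path, map_location="cpu"). Values here are made up.
fake_checkpoint = {
    "model_state_dict": {},
    "label_columns": ["sepsis", "other"],
    "optimal_thresholds": [0.3, 0.5],
    "n_labels": 2,
    "base_model": "emilyalsentzer/Bio_ClinicalBERT",
}
mapping = thresholds_by_label(fake_checkpoint)
```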

## Limitations and intended use

- The model was trained and evaluated on de-identified records from a single Australian
  tertiary hospital (RMH).  Performance may differ on records from other hospitals,
  health systems, or clinical workflows.
- This model is intended for **research purposes** and is not a validated clinical decision
  support tool.  Clinical decisions must remain with qualified healthcare professionals.
- The training data cannot be shared due to privacy restrictions; the annotation schema
  and data format are documented in the companion GitHub repository.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{ncas_indication_classifier_2025,
  title   = {Automated Classification of Antimicrobial Prescription Indications
             Using BioClinicalBERT},
  author  = {...},
  journal = {...},
  year    = {2025},
  note    = {Under review}
}
```

## Repository

Source code, training scripts, and the desktop application are available at:  
[https://github.com/jibmaird/NCAS-hospital-indication-classifier](https://github.com/jibmaird/NCAS-hospital-indication-classifier)

## License

Apache 2.0. See [LICENSE](https://github.com/jibmaird/NCAS-hospital-indication-classifier/blob/main/LICENSE).