---
license: apache-2.0
language:
- en
tags:
- clinical-nlp
- antimicrobial-stewardship
- bert
- multilabel-classification
- hospital
- medical
pipeline_tag: text-classification
library_name: pytorch
base_model: emilyalsentzer/Bio_ClinicalBERT
---

# NCAS Hospital Indication Classifier

A **BioClinicalBERT**-based multilabel classifier for categorising antimicrobial prescription indication text from hospital electronic medical records (EMR). Developed as part of a research project at RMIT University / The Royal Melbourne Hospital (RMH) investigating automated antimicrobial stewardship support.

## Model description

| Attribute | Value |
|-----------|-------|
| Base encoder | [emilyalsentzer/Bio_ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) |
| Pooling | Mean pooling over token embeddings |
| Classification head | Linear + Sigmoid |
| Task | Multilabel classification (8 categories) |
| Training data | ~2,000 manually annotated hospital prescription records (RMH 2021) |
| Held-out evaluation | 600 records from RMH 2022, 2023, 2024 |

## Label schema (8catb)

| Label | Description |
|-------|-------------|
| `respiratory - ioi` | Respiratory infection of indication |
| `skin and soft tissue - ioi` | Skin/soft-tissue infection of indication |
| `urinary tract - ioi` | Urinary tract infection of indication |
| `other` | Other or unspecified indication |
| `sepsis` | Sepsis or bacteraemia |
| `undifferentiated infection` | Infection without identified source |
| `organism only` | Organism identified but no clinical syndrome specified |
| `no indication documented` | No clinical indication present in the text |

A sample can receive one or more labels simultaneously (multilabel).

## Post-processing rule

After model prediction, `sepsis` is suppressed from any sample that also receives `respiratory - ioi` OR `skin and soft tissue - ioi`. If suppression would leave zero labels, the removal is reverted (fallback guarantee).

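
The rule above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the repository's implementation: the function name `suppress_sepsis` and the constants are hypothetical, and predictions are assumed to arrive as a list of label strings.

```python
from typing import List

# Label names taken from the schema table; constant names are hypothetical.
SEPSIS = "sepsis"
SUPPRESSING_LABELS = {"respiratory - ioi", "skin and soft tissue - ioi"}


def suppress_sepsis(labels: List[str]) -> List[str]:
    """Drop `sepsis` when a respiratory or skin/soft-tissue label is also present."""
    if SEPSIS in labels and SUPPRESSING_LABELS.intersection(labels):
        filtered = [label for label in labels if label != SEPSIS]
        # Fallback guarantee: revert the removal if it would leave no labels.
        # In this sketch the co-occurring label always survives, so the
        # fallback is only a safeguard mirroring the documented rule.
        return filtered if filtered else list(labels)
    return list(labels)


if __name__ == "__main__":
    print(suppress_sepsis(["sepsis", "respiratory - ioi"]))
    print(suppress_sepsis(["sepsis", "urinary tract - ioi"]))
```

Labels other than the two listed in the rule leave `sepsis` untouched, so a sample labelled `sepsis` and `urinary tract - ioi` keeps both.
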
## Usage

### Quick start

```python
from huggingface_hub import hf_hub_download
from ncas_indication.model import ClinicalBERTClassifier
from transformers import AutoTokenizer

# Download checkpoint
model_path = hf_hub_download(
    repo_id="jibmaird/NCAS-hospital-indication-classifier",
    filename="indication_classifier_model.pt",
)

# Load model (label names and thresholds are embedded in the checkpoint)
model, label_columns, thresholds = ClinicalBERTClassifier.from_checkpoint(model_path)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
```

Or using the inference script from the [GitHub repository](https://github.com/jibmaird/NCAS-hospital-indication-classifier):

```bash
# Single text
python inference/predict.py --text "UTI prophylaxis post-renal transplant"

# CSV file
python inference/predict.py --input your_file.csv --output predictions.csv
```

### Desktop application

A cross-platform desktop GUI is available in the `app/` folder of the repository. See [app/README.md](https://github.com/jibmaird/NCAS-hospital-indication-classifier/blob/main/app/README.md).

## Training

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 1e-5 |
| Batch size | 8 |
| Epochs | 20 |
| Optimizer | AdamW |
| Loss function | Weighted BCE (inverse-frequency weights) |
| Validation split | 20% of training data |
| Threshold selection | Per-label F1 maximisation on validation set |

### Training procedure

1. The combined dataset of ~2,000 labelled records was split 80/20 for training and validation.
2. Inverse-frequency class weights were applied to the BCE loss to address label imbalance.
3. Per-label decision thresholds were optimised on the validation set by grid search over [0.1, 0.2, …, 0.8] to maximise label-specific F1.
4. The model with the best weighted-macro F1 across epochs was retained.

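
The threshold search in step 3 might look roughly like the sketch below. It is written in dependency-free Python for clarity and is not the project's training code: the helper names (`binary_f1`, `select_thresholds`) are hypothetical, validation outputs are assumed to be per-sample lists of sigmoid probabilities and 0/1 targets, and ties are assumed to resolve to the lowest threshold.

```python
from typing import List, Sequence

# Grid from the training procedure: 0.1, 0.2, ..., 0.8
GRID = [round(0.1 * k, 1) for k in range(1, 9)]


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_thresholds(
    probs: Sequence[Sequence[float]],
    targets: Sequence[Sequence[int]],
) -> List[float]:
    """Pick, for each label, the grid threshold with the highest validation F1."""
    n_labels = len(targets[0])
    thresholds = []
    for j in range(n_labels):
        y_true = [row[j] for row in targets]
        best_t, best_f1 = GRID[0], -1.0
        for t in GRID:
            y_pred = [1 if row[j] >= t else 0 for row in probs]
            score = binary_f1(y_true, y_pred)
            if score > best_f1:  # strict '>' keeps the lowest threshold on ties
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds


if __name__ == "__main__":
    # Toy validation set with two labels (made-up numbers, for illustration only).
    val_probs = [[0.9, 0.15], [0.4, 0.6], [0.05, 0.7], [0.35, 0.05]]
    val_targets = [[1, 0], [1, 1], [0, 1], [0, 0]]
    print(select_thresholds(val_probs, val_targets))
```

Searching each label independently lets rare labels settle on a different cut-off from common ones, which a single global threshold would not allow.
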
## Checkpoint format

The `.pt` file is a standard PyTorch checkpoint dict with keys:

```python
{
    "model_state_dict": ...,      # nn.Module weights
    "label_columns": [...],       # ordered label names
    "optimal_thresholds": [...],  # per-label decision thresholds
    "n_labels": 8,
    "base_model": "emilyalsentzer/Bio_ClinicalBERT",
}
```

## Limitations and intended use

- The model was trained and evaluated on de-identified records from a single Australian tertiary hospital (RMH). Performance may differ on records from other hospitals, health systems, or clinical workflows.
- This model is intended for **research purposes** and is not a validated clinical decision support tool. Clinical decisions must remain with qualified healthcare professionals.
- The training data cannot be shared due to privacy restrictions; the annotation schema and data format are documented in the companion GitHub repository.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{ncas_indication_classifier_2025,
  title   = {Automated Classification of Antimicrobial Prescription Indications Using BioClinicalBERT},
  author  = {...},
  journal = {...},
  year    = {2025},
  note    = {Under review}
}
```

## Repository

Source code, training scripts, and the desktop application are available at:
[https://github.com/jibmaird/NCAS-hospital-indication-classifier](https://github.com/jibmaird/NCAS-hospital-indication-classifier)

## License

Apache 2.0; see [LICENSE](https://github.com/jibmaird/NCAS-hospital-indication-classifier/blob/main/LICENSE).