---
language:
- en
license: mit
tags:
- pharmacovigilance
- adverse-drug-reaction
- biomedical-nlp
- named-entity-recognition
- text-classification
- ghana
- pubmedbert
- dapt
datasets:
- custom
metrics:
- f1
base_model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
model-index:
- name: Ghana ADR Detection — CLF (Phase 6 / Phase 2b)
  results:
  - task:
      type: text-classification
      name: ADR Binary Classification
    metrics:
    - type: f1
      value: 0.724
      name: Macro F1 (LOSO)
- name: Ghana ADR Detection — NER (Phase 7)
  results:
  - task:
      type: token-classification
      name: ADR Named Entity Recognition
    metrics:
    - type: f1
      value: 0.655
      name: Macro F1 (LOSO)
---

# Ghana ADR Detection System

Adverse drug reaction (ADR) detection from free-text clinical narratives, built on Ghanaian pharmacovigilance data. The models are fine-tuned from PubMedBERT after domain-adaptive pretraining (DAPT) on 128k Ghanaian biomedical sentences.

## Model Description

This repository contains two production heads trained on the same DAPT backbone:

| Component | Path | Task |
|---|---|---|
| DAPT backbone | `dapt-backbone/` | PubMedBERT adapted with masked language modelling (MLM) on a Ghanaian biomedical corpus (perplexity 6.11 → 4.55) |
| CLF head (Phase 2b) | `checkpoints/clf_phase2b_{fold}/clf_best/` | Binary classification: `contains_adr` 0/1 |
| NER head (Phase 7) | `checkpoints/ner_phase7_{fold}/ner_best/` | Token labels: `DRUG`, `ADR`, `SEVERITY`, `PATIENT_DEMO` |

**Production config (Phase 7 Hybrid):** `clf_phase2b_cohort_study` + `ner_phase7_cohort_study`, with a CLF decision threshold of 0.55.

## Performance

Evaluated with Leave-One-Source-Out (LOSO) cross-validation: each of the four source domains is held out in turn as the test set while the model trains on the remaining three. This measures cross-domain generalisation across Ghanaian clinical writing styles.
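
The LOSO protocol described above can be sketched in a few lines. This is a minimal illustration, not the project's actual training code: the record layout (dicts with `source` and `text` keys), the toy sentences, and the helper name `loso_folds` are all assumptions made for the example.

```python
from collections import defaultdict


def loso_folds(records):
    """Yield (held_out_source, train, test) once per distinct source."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)

    for held_out in sorted(by_source):
        test = by_source[held_out]
        # Train on every record whose source is not the held-out one.
        train = [
            record
            for source, rows in by_source.items()
            if source != held_out
            for record in rows
        ]
        yield held_out, train, test


# Toy records; the `source` and `text` field names are hypothetical.
records = [
    {"source": "case_report", "text": "toy sentence A"},
    {"source": "cohort_study", "text": "toy sentence B"},
    {"source": "cohort_study", "text": "toy sentence C"},
    {"source": "fda_newsletter", "text": "toy sentence D"},
    {"source": "qualitative_interview", "text": "toy sentence E"},
]

for held_out, train, test in loso_folds(records):
    print(f"held out: {held_out:<22} train={len(train)} test={len(test)}")
```

Each fold's test set contains only the held-out source, so a score on that fold reflects how well the model transfers to a writing style it did not see during fine-tuning.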

### Classification (CLF)

| Held-out source | N | F1 |
|---|---|---|
| case_report | 44 | 0.787 |
| cohort_study | 123 | 0.776 |
| fda_newsletter | 99 | 0.667 |
| qualitative_interview | 78 | 0.667 |
| **macro-avg** | — | **0.724** |

### Named Entity Recognition (NER)

| Held-out source | N | F1 | DRUG F1 | ADR F1 |
|---|---|---|---|---|
| case_report | 44 | 0.598 | 0.862 | 0.545 |
| cohort_study | 123 | 0.785 | 0.823 | 0.884 |
| fda_newsletter | 99 | 0.587 | 0.626 | 0.634 |
| qualitative_interview | 78 | 0.650 | 0.560 | 0.842 |
| **macro-avg** | — | **0.655** | 0.718 | 0.727 |

Batch regression against 95 curated hard cases (Pidgin idioms, dialect, regulatory register, clinical shorthand, minimal pairs): **85/95 pass (89.5%)**.
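
The macro-avg F1 rows in the CLF and NER tables appear to be unweighted means over the four held-out folds, so a small fold such as `case_report` counts as much as a large one. The sketch below recomputes the two headline figures from the rounded per-fold values in the tables; the variable and function names are illustrative only.

```python
from statistics import mean

# Per-fold F1 scores copied from the tables above (rounded to 3 decimals).
clf_f1 = {
    "case_report": 0.787,
    "cohort_study": 0.776,
    "fda_newsletter": 0.667,
    "qualitative_interview": 0.667,
}
ner_f1 = {
    "case_report": 0.598,
    "cohort_study": 0.785,
    "fda_newsletter": 0.587,
    "qualitative_interview": 0.650,
}


def macro_average(per_fold):
    """Unweighted mean across folds: every held-out source counts equally."""
    return mean(per_fold.values())


clf_macro = macro_average(clf_f1)
ner_macro = macro_average(ner_f1)
print(f"CLF macro-avg F1: {clf_macro:.3f}")
print(f"NER macro-avg F1: {ner_macro:.3f}")
```

Both results match the reported 0.724 and 0.655 to three decimals. Weighting each fold by its N would give different numbers.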

## Training Data

Built from four Ghanaian pharmacovigilance source domains:

| Source | Type |
|---|---|
| Ghana FDA DrugLens newsletters (5 issues) | PDF — regulatory |
| Ghana FDA Annual Report 2023 + ADR Guide | PDF — regulatory |
| PMC open-access case reports & cohort studies (9 articles) | JATS XML — clinical |
| Patient ADR interview transcripts | Qualitative — community |

- **Gold dataset:** 2,870 annotated sentences
- **Silver dataset:** 2,105+ records (DailyMed weak supervision, DrugLens NER, OpenFDA ICSR, synthetic curriculum)

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AutoModelForTokenClassification
import torch

# Load DAPT backbone tokenizer
tokenizer = AutoTokenizer.from_pretrained("iamjamaal/ghana-adr-detection", subfolder="dapt-backbone")

# Load CLF head (production fold: cohort_study)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "iamjamaal/ghana-adr-detection",
    subfolder="checkpoints/clf_phase2b_cohort_study/clf_best"
)

# Load NER head (production fold: cohort_study)
ner_model = AutoModelForTokenClassification.from_pretrained(
    "iamjamaal/ghana-adr-detection",
    subfolder="checkpoints/ner_phase7_cohort_study/ner_best"
)

text = "Patient developed severe oculogyric crisis after starting haloperidol."

# CLF inference
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = clf_model(**inputs).logits
# Class index 1 is read as the positive class (contains_adr = 1)
prob_adr = torch.softmax(logits, dim=-1)[0][1].item()
# Apply the production decision threshold
contains_adr = prob_adr >= 0.55
print(f"ADR: {contains_adr} (p={prob_adr:.3f})")
```
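
The snippet above loads `ner_model` but only runs the classifier. The following sketch shows one way the two heads might be combined. It rests on assumptions that this model card does not confirm: that the NER head emits BIO-style tags such as `B-DRUG` / `I-DRUG`, and that the "hybrid" configuration only reports entities when the CLF probability clears the 0.55 threshold. The helper names `decode_bio` and `hybrid_predict` and the hand-written tags are hypothetical, so the example runs without downloading any weights.

```python
def decode_bio(tokens, tags):
    """Merge token-level BIO tags into (label, text) spans."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # Start a new span (a stray I- tag is treated as a span start).
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # Continue the span that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open span.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


def hybrid_predict(prob_adr, tokens, tags, threshold=0.55):
    """Assumed hybrid behaviour: report entities only if CLF fires."""
    contains_adr = prob_adr >= threshold
    entities = decode_bio(tokens, tags) if contains_adr else []
    return {"contains_adr": contains_adr, "entities": entities}


# Hand-written tags for illustration; these are not real model output.
tokens = ["Patient", "developed", "severe", "oculogyric", "crisis",
          "after", "starting", "haloperidol", "."]
tags = ["O", "O", "B-SEVERITY", "B-ADR", "I-ADR",
        "O", "O", "B-DRUG", "O"]

result = hybrid_predict(prob_adr=0.91, tokens=tokens, tags=tags)
print(result)
```

With the real model, the tags could be obtained by taking the argmax over `ner_model(**inputs).logits` and mapping the ids through `ner_model.config.id2label`, provided the checkpoint's config carries label names in this form. Subword pieces would also need to be merged back into words before decoding.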

## Repo Structure

```
dapt-backbone/                                     # DAPT backbone (config + safetensors)
checkpoints/
  clf_phase2b_case_report/clf_best/                # CLF checkpoint — case_report fold
  clf_phase2b_cohort_study/clf_best/               # CLF checkpoint — cohort_study fold ← production
  clf_phase2b_fda_newsletter/clf_best/
  clf_phase2b_qualitative_interview/clf_best/
  ner_phase7_case_report/ner_best/                 # NER checkpoint — case_report fold
  ner_phase7_cohort_study/ner_best/                # NER checkpoint — cohort_study fold ← production
  ner_phase7_fda_newsletter/ner_best/
  ner_phase7_qualitative_interview/ner_best/
  ner_phase7_qualitative_interview_seed/ner_best/
```

## Code & Demo

- **Pipeline code:** [github.com/iamjamaal/ghana-pharmacovigilance-ai](https://github.com/iamjamaal/ghana-pharmacovigilance-ai)
- **Live demo:** Flask app with single-sentence analysis, batch upload, and Yellow Card–style reporting

## Limitations

- Trained on Ghanaian pharmacovigilance sources; performance may degrade on clinical text from other regions.
- NER F1 on `SEVERITY` and `PATIENT_DEMO` is lower than on `DRUG`/`ADR` because annotation density for those labels is limited.
- Coverage of Ghanaian Pidgin and dialect constructions improves batch regression scores, but these gains may not generalise to other West African Pidgin variants.

## Citation

```bibtex
@misc{ghana-adr-2026,
  title  = {Ghana ADR Detection System},
  author = {Nabila, Noah Jamal},
  year   = {2026},
  url    = {https://huggingface.co/iamjamaal/ghana-adr-detection}
}
```

## License

Code: MIT | Model weights: MIT | Dataset annotations: CC-BY-4.0