metadata
language:
  - en
license: mit
tags:
  - pharmacovigilance
  - adverse-drug-reaction
  - biomedical-nlp
  - named-entity-recognition
  - text-classification
  - ghana
  - pubmedbert
  - dapt
datasets:
  - custom
metrics:
  - f1
base_model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
model-index:
  - name: Ghana ADR Detection - CLF (Phase 6 / Phase 2b)
    results:
      - task:
          type: text-classification
          name: ADR Binary Classification
        metrics:
          - type: f1
            value: 0.724
            name: Macro F1 (LOSO)
  - name: Ghana ADR Detection - NER (Phase 7)
    results:
      - task:
          type: token-classification
          name: ADR Named Entity Recognition
        metrics:
          - type: f1
            value: 0.655
            name: Macro F1 (LOSO)

Ghana ADR Detection System

Adverse drug reaction (ADR) detection from free-text clinical narratives, built on Ghanaian pharmacovigilance data. Fine-tuned from PubMedBERT with domain-adaptive pretraining (DAPT) on 128k Ghanaian biomedical sentences.

Model Description

This repository contains two production heads trained on the same DAPT backbone:

| Component | Path | Task |
| --- | --- | --- |
| DAPT backbone | `dapt-backbone/` | PubMedBERT MLM-adapted on Ghanaian biomedical corpus (PPL 6.11 → 4.55) |
| CLF head (Phase 2b) | `checkpoints/clf_phase2b_{fold}/clf_best/` | Binary: `contains_adr` 0/1 |
| NER head (Phase 7) | `checkpoints/ner_phase7_{fold}/ner_best/` | Token labels: DRUG, ADR, SEVERITY, PATIENT_DEMO |

Production config (Phase 7 Hybrid): clf_phase2b_cohort_study + ner_phase7_cohort_study, with a CLF decision threshold of 0.55 on the ADR-class probability.

Performance

Evaluated with leave-one-source-out (LOSO) cross-validation: each of the four source domains is held out in turn as the test set while the model trains on the remaining three. This measures cross-domain generalisation across Ghanaian clinical writing styles.
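The LOSO protocol described above can be sketched as a small split generator. This is an illustration only, not the project's training code: the record layout (a dict with a `source` key) and the `loso_splits` helper are assumptions; only the four source names come from the results tables.

```python
from collections import defaultdict

# Source domain names as they appear in the results tables.
SOURCES = ["case_report", "cohort_study", "fda_newsletter", "qualitative_interview"]


def loso_splits(records, source_key="source"):
    """Yield (held_out_source, train_records, test_records), once per source.

    `records` is assumed to be a list of dicts carrying a source label;
    this layout is hypothetical and may differ from the real dataset.
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record[source_key]].append(record)

    for held_out in sorted(by_source):
        test = by_source[held_out]
        # Train on every record whose source is not the held-out one.
        train = [r for src, group in by_source.items() if src != held_out for r in group]
        yield held_out, train, test


# Toy records, invented for illustration: two per source.
toy = [{"text": f"toy sentence {i}", "source": s} for i, s in enumerate(SOURCES * 2)]

for held_out, train, test in loso_splits(toy):
    print(f"held out: {held_out:<22} train={len(train)} test={len(test)}")
```

Because no sentence from the held-out source is seen during training, each fold's score reflects transfer to an unseen writing style rather than in-domain fit.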

Classification (CLF)

| Held-out source | N | F1 |
| --- | --- | --- |
| case_report | 44 | 0.787 |
| cohort_study | 123 | 0.776 |
| fda_newsletter | 99 | 0.667 |
| qualitative_interview | 78 | 0.667 |
| macro-avg | | 0.724 |
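The macro-average is consistent with an unweighted mean over the four folds, so each held-out source counts equally regardless of its N. A quick check using the figures from the table (the variable names are illustrative):

```python
# Per-fold CLF F1 scores copied from the table above.
clf_f1 = {
    "case_report": 0.787,
    "cohort_study": 0.776,
    "fda_newsletter": 0.667,
    "qualitative_interview": 0.667,
}

# Unweighted mean: every fold contributes equally, whatever its size.
macro_f1 = sum(clf_f1.values()) / len(clf_f1)
print(f"macro-avg F1 = {macro_f1:.3f}")  # macro-avg F1 = 0.724
```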

Named Entity Recognition (NER)

| Held-out source | N | F1 | DRUG F1 | ADR F1 |
| --- | --- | --- | --- | --- |
| case_report | 44 | 0.598 | 0.862 | 0.545 |
| cohort_study | 123 | 0.785 | 0.823 | 0.884 |
| fda_newsletter | 99 | 0.587 | 0.626 | 0.634 |
| qualitative_interview | 78 | 0.650 | 0.560 | 0.842 |
| macro-avg | | 0.655 | 0.718 | 0.727 |

Batch regression against 95 curated hard cases (Pidgin idioms, dialect, regulatory register, clinical shorthand, minimal pairs): 85/95 pass (89.5%).
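The README does not describe the regression harness itself. As a rough sketch of how such a pass rate could be tallied, the snippet below scores a predictor against labelled cases. Everything in it is hypothetical: the `pass_rate` helper, the stand-in keyword predictor, and the three example sentences are invented for illustration and are not part of the curated suite.

```python
def pass_rate(cases, predict):
    """Count how many (text, expected_label) cases a predictor gets right.

    Returns (passed, total, failures) so that failing cases can be inspected.
    """
    failures = [(text, expected) for text, expected in cases if predict(text) != expected]
    passed = len(cases) - len(failures)
    return passed, len(cases), failures


def keyword_stub(text):
    """Stand-in predictor for the sketch; a real run would call the CLF head."""
    return "rash" in text.lower()


# Invented examples, not taken from the 95 curated hard cases.
cases = [
    ("Patient developed a rash after amoxicillin.", True),
    ("No reaction was reported.", False),
    ("Skin eruption followed the first dose.", True),
]

passed, total, failures = pass_rate(cases, keyword_stub)
print(f"{passed}/{total} pass ({100 * passed / total:.1f}%)")
for text, expected in failures:
    print(f"FAILED (expected {expected}): {text}")
```

Keeping the failing cases alongside the count makes it easier to see which category (idiom, shorthand, minimal pair) a regression comes from.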

Training Data

Built from four Ghanaian pharmacovigilance source domains:

| Source | Type |
| --- | --- |
| Ghana FDA DrugLens newsletters (5 issues) | PDF (regulatory) |
| Ghana FDA Annual Report 2023 + ADR Guide | PDF (regulatory) |
| PMC open-access case reports & cohort studies (9 articles) | JATS XML (clinical) |
| Patient ADR interview transcripts | Qualitative (community) |

Gold dataset: 2,870 annotated sentences
Silver dataset: 2,105+ records (DailyMed weak supervision, DrugLens NER, OpenFDA ICSR, synthetic curriculum)

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AutoModelForTokenClassification
import torch

# Load DAPT backbone tokenizer
tokenizer = AutoTokenizer.from_pretrained("iamjamaal/ghana-adr-detection", subfolder="dapt-backbone")

# Load CLF head (production fold: cohort_study)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "iamjamaal/ghana-adr-detection",
    subfolder="checkpoints/clf_phase2b_cohort_study/clf_best"
)

# Load NER head (production fold: cohort_study)
ner_model = AutoModelForTokenClassification.from_pretrained(
    "iamjamaal/ghana-adr-detection",
    subfolder="checkpoints/ner_phase7_cohort_study/ner_best"
)

text = "Patient developed severe oculogyric crisis after starting haloperidol."

# CLF inference
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = clf_model(**inputs).logits
prob_adr = torch.softmax(logits, dim=-1)[0][1].item()
contains_adr = prob_adr >= 0.55
print(f"ADR: {contains_adr} (p={prob_adr:.3f})")
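The snippet above loads `ner_model` but does not run it. A minimal sketch of NER inference follows, under two assumptions that the README does not confirm: that the checkpoint's `config.id2label` holds either bare labels (`DRUG`) or BIO-prefixed ones (`B-DRUG`, `I-DRUG`) with `O` for non-entities, and that the tokenizer emits WordPiece `##` continuations. The helper names and the hand-written tokens in the demo are illustrative, not model output.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_labelled_tokens(tokens, labels):
    """Group consecutive tokens with the same entity type into (type, text) spans."""
    spans = []
    open_type = None  # entity type of the span currently being built
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            open_type = None
            continue
        if token.startswith("##"):
            # WordPiece continuation: glue it onto the open span, if there is one.
            if open_type is not None:
                etype, text = spans[-1]
                spans[-1] = (etype, text + token[2:])
            continue
        prefix, _, etype = label.rpartition("-")  # "B-DRUG" -> "B", "DRUG"
        if etype == "O":
            open_type = None
        elif etype == open_type and prefix != "B":
            spans[-1] = (etype, spans[-1][1] + " " + token)
        else:
            spans.append((etype, token))
            open_type = etype
    return spans


def extract_entities(text, tokenizer, model):
    """Run a token-classification model and merge its per-token predictions."""
    import torch  # imported here so the pure helper above works without torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [model.config.id2label[i] for i in pred_ids]
    return merge_labelled_tokens(tokens, labels)


# Demo of the merging step on hand-written tokens and labels (not model output).
demo_tokens = ["[CLS]", "severe", "ocu", "##logy", "##ric", "crisis",
               "after", "halo", "##peri", "##dol", "[SEP]"]
demo_labels = ["O", "B-SEVERITY", "B-ADR", "I-ADR", "I-ADR", "I-ADR",
               "O", "B-DRUG", "I-DRUG", "I-DRUG", "O"]
print(merge_labelled_tokens(demo_tokens, demo_labels))
# [('SEVERITY', 'severe'), ('ADR', 'oculogyric crisis'), ('DRUG', 'haloperidol')]
```

With the objects loaded earlier, the call would be `extract_entities(text, tokenizer, ner_model)`. Checking `ner_model.config.id2label` first is advisable, since the actual label scheme may differ from the one assumed here.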

Repo Structure

dapt-backbone/                          # DAPT backbone (config + safetensors)
checkpoints/
  clf_phase2b_case_report/clf_best/     # CLF checkpoint — case_report fold
  clf_phase2b_cohort_study/clf_best/    # CLF checkpoint — cohort_study fold ← production
  clf_phase2b_fda_newsletter/clf_best/
  clf_phase2b_qualitative_interview/clf_best/
  ner_phase7_case_report/ner_best/      # NER checkpoint — case_report fold
  ner_phase7_cohort_study/ner_best/     # NER checkpoint — cohort_study fold ← production
  ner_phase7_fda_newsletter/ner_best/
  ner_phase7_qualitative_interview/ner_best/
  ner_phase7_qualitative_interview_seed/ner_best/
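Since every checkpoint follows the same `checkpoints/{head}_{fold}/{best}` layout, the `subfolder` argument can be built rather than typed by hand. A small sketch, assuming the layout listed above; the `checkpoint_subfolder` helper is hypothetical and does not cover the `_seed` variant.

```python
# Fold names and directory prefixes taken from the repo listing above.
FOLDS = ("case_report", "cohort_study", "fda_newsletter", "qualitative_interview")
HEADS = {
    "clf": ("clf_phase2b", "clf_best"),
    "ner": ("ner_phase7", "ner_best"),
}


def checkpoint_subfolder(head, fold):
    """Return the repo-relative subfolder for a given head ('clf' or 'ner') and fold."""
    if head not in HEADS:
        raise ValueError(f"unknown head {head!r}; expected one of {sorted(HEADS)}")
    if fold not in FOLDS:
        raise ValueError(f"unknown fold {fold!r}; expected one of {FOLDS}")
    prefix, leaf = HEADS[head]
    return f"checkpoints/{prefix}_{fold}/{leaf}"


print(checkpoint_subfolder("clf", "cohort_study"))
# checkpoints/clf_phase2b_cohort_study/clf_best
```

The returned string can be passed as `subfolder=` to `from_pretrained`, as in the usage example.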

Code & Demo

Limitations

  • Trained on Ghanaian pharmacovigilance sources; performance may degrade on clinical text from other regions.
  • NER F1 on SEVERITY and PATIENT_DEMO is lower than on DRUG and ADR due to limited annotation density.
  • Training on Ghanaian Pidgin and dialect constructions improves batch regression scores, but the gains may not generalise to other West African Pidgin variants.

Citation

@misc{ghana-adr-2026,
  title  = {Ghana ADR Detection System},
  author = {Nabila, Noah Jamal},
  year   = {2026},
  url    = {https://huggingface.co/iamjamaal/ghana-adr-detection}
}

License

Code: MIT | Model weights: MIT | Dataset annotations: CC-BY-4.0