---
license: apache-2.0
language:
- en
tags:
- medical
- biomedical
- drug-safety
- adverse-drug-reactions
- pharmacovigilance
- relation-extraction
- dual-encoder
- clinical-nlp
- pubmedbert
datasets:
- ade-benchmark-corpus/ade_corpus_v2
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
model-index:
- name: CRAG-dual-encoder-base
  results:
  - task:
      type: text-classification
      name: Drug-ADR Relation Extraction
    dataset:
      name: ADE Corpus V2
      type: ade-benchmark-corpus/ade_corpus_v2
      config: Ade_corpus_v2_drug_ade_relation
    metrics:
    - type: f1
      value: 0.883
      name: F1 Score
---

# CRAG-dual-encoder-base

**CRAG: Causal Reasoning for Adversomics Graphs**

This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.

## Model Description

CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.
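
The pair-scoring setup described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: `ToyEncoder` and `DualEncoderSketch` are hypothetical names, the toy encoder is a small embedding layer standing in for PubMedBERT so that the example runs without downloading weights, and the use of `nn.Bilinear` followed by a ReLU and a linear layer is an assumed reading of the fusion and head components named in the Architecture section.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a BERT-style encoder (one vector per sequence)."""

    def __init__(self, vocab_size: int = 1000, hidden_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Mimic [CLS] pooling by returning the first token's embedding.
        return self.embed(input_ids)[:, 0]


class DualEncoderSketch(nn.Module):
    """Two separately weighted encoders, a bilinear fusion, and a scoring head."""

    def __init__(self, drug_encoder: nn.Module, adr_encoder: nn.Module,
                 hidden_dim: int = 32, fusion_dim: int = 8):
        super().__init__()
        self.drug_encoder = drug_encoder  # separate weights from adr_encoder
        self.adr_encoder = adr_encoder
        # Assumption: fusion is a plain bilinear map of the two pooled vectors.
        self.fusion = nn.Bilinear(hidden_dim, hidden_dim, fusion_dim)
        # Assumption: the head is a nonlinearity plus a fusion_dim -> 1 projection.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(fusion_dim, 1))

    def forward(self, drug_ids: torch.Tensor, adr_ids: torch.Tensor) -> torch.Tensor:
        drug_repr = self.drug_encoder(drug_ids)
        adr_repr = self.adr_encoder(adr_ids)
        fused = self.fusion(drug_repr, adr_repr)
        logit = self.head(fused).squeeze(-1)
        return torch.sigmoid(logit)  # P(causal) for each pair in the batch


# Tiny dimensions keep the demo light; they are not the model's real sizes.
torch.manual_seed(0)
model = DualEncoderSketch(ToyEncoder(), ToyEncoder())
drug_ids = torch.randint(0, 1000, (4, 16))  # 4 drug contexts, 16 token ids each
adr_ids = torch.randint(0, 1000, (4, 16))   # 4 ADR contexts, 16 token ids each
with torch.no_grad():
    probs = model(drug_ids, adr_ids)
print(probs.shape)  # torch.Size([4])
```

In practice each stand-in encoder would be replaced by a separately initialised PubMedBERT model whose `[CLS]` output serves as the pooled representation.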

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   CRAG Dual-Encoder Base                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Drug Context          ADR Context                         │
│        │                     │                              │
│        ▼                     ▼                              │
│   ┌──────────┐          ┌──────────┐                        │
│   │PubMedBERT│          │PubMedBERT│  (separate weights)    │
│   │   Drug   │          │   ADR    │                        │
│   │ Encoder  │          │ Encoder  │                        │
│   └────┬─────┘          └────┬─────┘                        │
│        │                     │                              │
│        ▼                     ▼                              │
│   [CLS] Pool            [CLS] Pool                          │
│        │                     │                              │
│        └────────┬────────────┘                              │
│                 │                                           │
│                 ▼                                           │
│          ┌──────────────┐                                   │
│          │   Bilinear   │                                   │
│          │    Fusion    │                                   │
│          └──────┬───────┘                                   │
│                 │                                           │
│                 ▼                                           │
│          ┌──────────────┐                                   │
│          │   MLP Head   │                                   │
│          │   (256→1)    │                                   │
│          └──────┬───────┘                                   │
│                 │                                           │
│                 ▼                                           │
│             P(causal)                                       │
└─────────────────────────────────────────────────────────────┘
```

- **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
- **Hidden Dimension:** 768
- **Fusion Dimension:** 256
- **Parameters:** ~220M (two separate BERT encoders)

### Training Procedure

The model was trained in two phases:

**Phase 1: Contrastive Pre-training (3 epochs)**
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close together in embedding space
- Random negative sampling (mismatched pairs)

**Phase 2: Classification Fine-tuning (5 epochs)**
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup

### Training Data

- **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
- **Configuration:** `Ade_corpus_v2_drug_ade_relation`
- **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
- **Validation Examples:** ~850 pairs

## Performance

| Metric | Value |
|--------|-------|
| **F1 Score** | 88.3% |

### Comparison with CRAG Family

| Model | F1 | AUC | Key Features |
|-------|-----|-----|--------------|
| **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives |
| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer. The model itself uses a custom architecture, so a
# DualEncoderModel class must be defined before the weights can be loaded;
# see the training script for the architecture definition.
tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize each context separately, one input per encoder
drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128,
                        truncation=True, padding="max_length")
adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128,
                       truncation=True, padding="max_length")

# Forward pass (pseudo-code - requires loading the custom model)
# drug_repr = model.encode_drug(**drug_inputs)
# adr_repr = model.encode_adr(**adr_inputs)
# score = model.classify(drug_repr, adr_repr)
```

## Intended Uses

### Primary Use Cases

- **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
- **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
- **Literature Mining:** Screening biomedical publications for adverse event reports
- **Clinical Decision Support:** Identifying potential drug safety signals

### Out-of-Scope Uses

- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)

## Limitations

1. **English Only:** Trained exclusively on English biomedical text
2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
3. **Context Dependency:** Requires both the drug and the ADR to be mentioned in related context
4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use

## Ethical Considerations

- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)

## Citation

```bibtex
@misc{crag-dual-encoder-2024,
  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
  author={von Csefalvay, Chris},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
```

## Model Card Authors

Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))

## Model Card Contact

For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.