---
license: apache-2.0
language:
- en
tags:
- medical
- biomedical
- drug-safety
- adverse-drug-reactions
- pharmacovigilance
- relation-extraction
- dual-encoder
- clinical-nlp
- pubmedbert
datasets:
- ade-benchmark-corpus/ade_corpus_v2
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
model-index:
- name: CRAG-dual-encoder-base
  results:
  - task:
      type: text-classification
      name: Drug-ADR Relation Extraction
    dataset:
      name: ADE Corpus V2
      type: ade-benchmark-corpus/ade_corpus_v2
      config: Ade_corpus_v2_drug_ade_relation
    metrics:
    - type: f1
      value: 0.883
      name: F1 Score
---
# CRAG-dual-encoder-base
**CRAG: Causal Reasoning for Adversomics Graphs**
This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.
## Model Description
CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.
### Architecture
```
┌────────────────────────────────────────────────────────────┐
│                   CRAG Dual-Encoder Base                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│    Drug Context          ADR Context                       │
│         │                     │                            │
│         ▼                     ▼                            │
│    ┌──────────┐          ┌──────────┐                      │
│    │PubMedBERT│          │PubMedBERT│ (separate weights)   │
│    │   Drug   │          │   ADR    │                      │
│    │ Encoder  │          │ Encoder  │                      │
│    └────┬─────┘          └────┬─────┘                      │
│         │                     │                            │
│         ▼                     ▼                            │
│    [CLS] Pool            [CLS] Pool                        │
│         │                     │                            │
│         └──────────┬──────────┘                            │
│                    │                                       │
│                    ▼                                       │
│             ┌──────────────┐                               │
│             │   Bilinear   │                               │
│             │    Fusion    │                               │
│             └──────┬───────┘                               │
│                    │                                       │
│                    ▼                                       │
│             ┌──────────────┐                               │
│             │   MLP Head   │                               │
│             │   (256→1)    │                               │
│             └──────┬───────┘                               │
│                    │                                       │
│                    ▼                                       │
│                P(causal)                                   │
└────────────────────────────────────────────────────────────┘
```
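The diagram can be read as a small PyTorch module. The following is a minimal sketch under stated assumptions, not the released implementation: `DualEncoderSketch` and `ToyEncoder` are hypothetical names, the toy encoder stands in for PubMedBERT so the example runs without downloading weights, and the fusion is assumed to be a low-rank bilinear form (project each side to the fusion dimension, then multiply elementwise), since the card does not specify how the bilinear fusion is parameterized.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a PubMedBERT encoder: returns [batch, seq, hidden]."""

    def __init__(self, vocab_size: int = 100, hidden_dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids)


class DualEncoderSketch(nn.Module):
    """Two separate encoders, [CLS] pooling, assumed low-rank bilinear fusion, MLP head."""

    def __init__(self, drug_encoder, adr_encoder, hidden_dim=768, fusion_dim=256):
        super().__init__()
        self.drug_encoder = drug_encoder
        self.adr_encoder = adr_encoder
        # Assumption: project each side to fusion_dim, then multiply elementwise.
        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
        self.head = nn.Sequential(
            nn.Linear(fusion_dim, fusion_dim), nn.ReLU(), nn.Linear(fusion_dim, 1)
        )

    def forward(self, drug_ids, adr_ids):
        drug_cls = self.drug_encoder(drug_ids)[:, 0]  # first-token ([CLS]) vector
        adr_cls = self.adr_encoder(adr_ids)[:, 0]
        fused = self.drug_proj(drug_cls) * self.adr_proj(adr_cls)
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # P(causal) per pair


torch.manual_seed(0)
model = DualEncoderSketch(ToyEncoder(), ToyEncoder())
drug_ids = torch.randint(0, 100, (4, 16))  # 4 drug contexts, 16 tokens each
adr_ids = torch.randint(0, 100, (4, 16))   # 4 ADR contexts, 16 tokens each
probs = model(drug_ids, adr_ids)
print(probs.shape)
```

In a real setup each `ToyEncoder` would be replaced by a separately loaded copy of the base model, with the `[CLS]` vector taken from its last hidden state.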
- **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
- **Hidden Dimension:** 768
- **Fusion Dimension:** 256
- **Parameters:** ~220M (two separate BERT encoders)
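The ~220M figure is consistent with two BERT-base sized encoders. The arithmetic below is a rough check, assuming the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, 512 positions) and a 30,522-token vocabulary; these values are assumptions not stated in this card, and the fusion and MLP head parameters are left out.

```python
def bert_base_param_count(vocab_size=30522, hidden=768, layers=12,
                          intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of a BERT-base style encoder, including the pooler."""
    layer_norm = 2 * hidden  # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


per_encoder = bert_base_param_count()
total = 2 * per_encoder  # separate drug and ADR encoders
print(per_encoder, total)  # 109482240 218964480
```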
### Training Procedure
The model was trained in two phases:
**Phase 1: Contrastive Pre-training (3 epochs)**
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close in embedding space
- Random negative sampling (mismatched pairs)
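As an illustration of the Phase 1 objective, here is a minimal pure-Python sketch of InfoNCE over a small batch. It assumes cosine similarity and treats every mismatched pair within the batch as a negative; the card does not say which similarity function or negative pool the training script uses, so both are assumptions, and `info_nce` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(drug_vecs, adr_vecs, tau=0.07):
    """Mean InfoNCE loss; row i of each list is assumed to be a true pair."""
    losses = []
    for i, drug in enumerate(drug_vecs):
        logits = [cosine(drug, adr) / tau for adr in adr_vecs]
        # Numerically stable log-sum-exp over the true pair and the mismatched ones.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


drugs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # true pairs point the same way
swapped = [[0.1, 0.9], [0.9, 0.1]]   # true pairs point different ways
print(info_nce(drugs, aligned) < info_nce(drugs, swapped))  # True
```

The loss falls as true pairs become more similar than mismatched ones, which is the "bring true pairs close" behavior described above. A small temperature such as 0.07 sharpens the softmax, so the hardest mismatches dominate the loss.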
**Phase 2: Classification Fine-tuning (5 epochs)**
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup
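The Phase 2 ingredients can be sketched in a few lines. The warmup length and the behavior after warmup are not stated in this card; the example assumes a hypothetical 100 warmup steps followed by a constant rate, and shows the per-example binary cross-entropy alongside it.

```python
import math


def learning_rate(step, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup from 0 to peak_lr, then constant (the hold is an assumption)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def binary_cross_entropy(prob, label):
    """BCE for one example; prob is the predicted P(causal), label is 0 or 1."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


for step in (0, 50, 100, 500):
    print(step, learning_rate(step))

# A confident correct prediction costs less than an uncertain one.
print(binary_cross_entropy(0.9, 1) < binary_cross_entropy(0.5, 1))  # True
```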
### Training Data
- **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
- **Configuration:** `Ade_corpus_v2_drug_ade_relation`
- **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
- **Validation Examples:** ~850 pairs
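One way to arrive at a balanced set like the one above is to pair each drug with an effect it is not annotated with. The sketch below is an assumption about how such negatives could be built, not the procedure used for this model: `make_balanced_pairs` is a hypothetical helper, the `drug` and `effect` field names are assumed, and the three records are toy examples rather than rows from the corpus.

```python
import random


def make_balanced_pairs(positives, seed=0):
    """Return (drug, effect, label) triples with one mismatched negative per positive."""
    rng = random.Random(seed)
    known = {(p["drug"], p["effect"]) for p in positives}
    effects = sorted({p["effect"] for p in positives})
    pairs = [(p["drug"], p["effect"], 1) for p in positives]
    for p in positives:
        candidates = [e for e in effects if (p["drug"], e) not in known]
        if candidates:  # skip drugs already paired with every known effect
            pairs.append((p["drug"], rng.choice(candidates), 0))
    return pairs


# Toy records for illustration only.
toy = [
    {"drug": "aspirin", "effect": "gastrointestinal bleeding"},
    {"drug": "methotrexate", "effect": "hepatotoxicity"},
    {"drug": "lithium", "effect": "tremor"},
]
pairs = make_balanced_pairs(toy)
for drug, effect, label in pairs:
    print(label, drug, "->", effect)
```

Random mismatches can include true but unannotated relations, so some negatives produced this way may be mislabeled.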
## Performance
| Metric | Value |
|--------|-------|
| **F1 Score** | 88.3% |
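For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. The sketch below shows the computation on made-up toy scores; the 0.5 decision threshold is an assumption, as the card does not state the threshold used for the reported figure.

```python
def f1_score(labels, probs, threshold=0.5):
    """F1 on the positive class after thresholding predicted probabilities."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy values for illustration: 2 true positives, 1 false positive, 1 false negative.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
score = f1_score(labels, probs)
print(round(score, 3))  # 0.667
```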
### Comparison with CRAG Family
| Model | F1 | AUC | Key Features |
|-------|-----|-----|--------------|
| **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives |
| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

# The model uses a custom architecture: define the DualEncoderModel class first
# (see the training script for the architecture definition), then load its weights.
tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize each context separately, one input per encoder
tokenizer_kwargs = dict(return_tensors="pt", max_length=128, truncation=True, padding="max_length")
drug_inputs = tokenizer(drug_context, **tokenizer_kwargs)
adr_inputs = tokenizer(adr_context, **tokenizer_kwargs)

# Forward pass (pseudo-code: requires `model`, an instance of the custom class)
# with torch.no_grad():
#     drug_repr = model.encode_drug(**drug_inputs)
#     adr_repr = model.encode_adr(**adr_inputs)
#     score = model.classify(drug_repr, adr_repr)
```
## Intended Uses
### Primary Use Cases
- **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
- **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
- **Literature Mining:** Screening biomedical publications for adverse event reports
- **Clinical Decision Support:** Identifying potential drug safety signals
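For the graph construction use case, scored pairs can be turned into a simple adjacency structure. This is a minimal sketch: `build_graph` is a hypothetical helper, the scores are made-up values standing in for model output, and the 0.5 cutoff is an assumption that would need tuning against the cost of missed signals versus unnecessary reviews.

```python
from collections import defaultdict


def build_graph(scored_pairs, threshold=0.5):
    """Map each drug to its ADRs whose causal score meets the threshold."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the highest score if the same pair is seen in several texts.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Hypothetical scores for illustration only.
scored = [
    ("aspirin", "gastrointestinal bleeding", 0.93),
    ("aspirin", "gastrointestinal bleeding", 0.71),
    ("aspirin", "tremor", 0.12),
]
graph = build_graph(scored)
print(graph)  # {'aspirin': {'gastrointestinal bleeding': 0.93}}
```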
### Out-of-Scope Uses
- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)
## Limitations
1. **English Only:** Trained exclusively on English biomedical text
2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
3. **Context Dependency:** Requires both drug and ADR to be mentioned in related context
4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use
## Ethical Considerations
- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)
## Citation
```bibtex
@misc{crag-dual-encoder-2024,
title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
author={von Csefalvay, Chris},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
```
## Model Card Authors
Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))
## Model Card Contact
For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.