---
license: apache-2.0
language:
- en
tags:
- medical
- biomedical
- drug-safety
- adverse-drug-reactions
- pharmacovigilance
- relation-extraction
- dual-encoder
- clinical-nlp
- pubmedbert
datasets:
- ade-benchmark-corpus/ade_corpus_v2
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
model-index:
- name: CRAG-dual-encoder-base
  results:
  - task:
      type: text-classification
      name: Drug-ADR Relation Extraction
    dataset:
      name: ADE Corpus V2
      type: ade-benchmark-corpus/ade_corpus_v2
      config: Ade_corpus_v2_drug_ade_relation
    metrics:
    - type: f1
      value: 0.883
      name: F1 Score
---

# CRAG-dual-encoder-base

**CRAG: Causal Reasoning for Adversomics Graphs**

This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.

## Model Description

CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    CRAG Dual-Encoder Base                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Drug Context          ADR Context                         │
│        │                     │                              │
│        ▼                     ▼                              │
│  ┌──────────┐          ┌──────────┐                         │
│  │PubMedBERT│          │PubMedBERT│    (separate weights)   │
│  │  Drug    │          │   ADR    │                         │
│  │ Encoder  │          │ Encoder  │                         │
│  └────┬─────┘          └────┬─────┘                         │
│       │                     │                               │
│       ▼                     ▼                               │
│  [CLS] Pool            [CLS] Pool                           │
│       │                     │                               │
│       └────────┬────────────┘                               │
│                │                                            │
│                ▼                                            │
│        ┌──────────────┐                                     │
│        │   Bilinear   │                                     │
│        │   Fusion     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│        ┌──────────────┐                                     │
│        │  MLP Head    │                                     │
│        │  (256→1)     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│           P(causal)                                         │
└─────────────────────────────────────────────────────────────┘
```

- **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
- **Hidden Dimension:** 768
- **Fusion Dimension:** 256
- **Parameters:** ~220M (two separate BERT encoders)
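
The scoring path after the two encoders can be sketched without any dependencies. This is a minimal illustration, not the released implementation: the card does not spell out the exact form of the fusion layer or the MLP head, so the sketch assumes a standard bilinear map followed by a ReLU, one linear unit and a sigmoid. All function names are hypothetical, the weights are random, and toy dimensions (4 and 3) replace the real 768 and 256.

```python
import math
import random


def bilinear_fusion(drug_vec, adr_vec, weight, bias):
    """Assumed fusion: f_k = sum_ij drug_i * weight[k][i][j] * adr_j + bias_k."""
    return [
        sum(
            drug_vec[i] * weight[k][i][j] * adr_vec[j]
            for i in range(len(drug_vec))
            for j in range(len(adr_vec))
        )
        + bias[k]
        for k in range(len(weight))
    ]


def mlp_head(fused, out_weight, out_bias):
    """Assumed head: ReLU, one linear unit (fusion_dim -> 1), then a sigmoid."""
    activated = [max(0.0, x) for x in fused]
    logit = sum(a * w for a, w in zip(activated, out_weight)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))


def init_params(hidden_dim, fusion_dim, seed=0):
    """Small random weights; a trained model would load these from a checkpoint."""
    rng = random.Random(seed)
    return {
        "weight": [
            [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
            for _ in range(fusion_dim)
        ],
        "bias": [0.0] * fusion_dim,
        "out_weight": [rng.uniform(-0.1, 0.1) for _ in range(fusion_dim)],
        "out_bias": 0.0,
    }


def score_pair(drug_cls, adr_cls, params):
    """Map two pooled [CLS] vectors to P(causal)."""
    fused = bilinear_fusion(drug_cls, adr_cls, params["weight"], params["bias"])
    return mlp_head(fused, params["out_weight"], params["out_bias"])


# Toy dimensions (4 and 3) stand in for the card's 768 and 256.
params = init_params(hidden_dim=4, fusion_dim=3)
drug_cls = [0.2, -0.1, 0.4, 0.0]  # stand-in for the drug encoder's [CLS] output
adr_cls = [0.1, 0.3, -0.2, 0.5]   # stand-in for the ADR encoder's [CLS] output
probability = score_pair(drug_cls, adr_cls, params)
print(f"P(causal) = {probability:.3f}")
```

In a PyTorch implementation the same fusion step would typically be a `torch.nn.Bilinear(768, 768, 256)` layer applied to the two pooled vectors.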

### Training Procedure

The model was trained in two phases:

**Phase 1: Contrastive Pre-training (3 epochs)**
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close in embedding space
- Random negative sampling (mismatched pairs)
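
The contrastive objective can be illustrated with a small dependency-free sketch. It assumes that the other ADR embeddings in a batch serve as the mismatched negatives and that similarity is cosine similarity; the card states neither, so treat both as assumptions. The function names are hypothetical and the two-dimensional vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(drug_embs, adr_embs, temperature=0.07):
    """Mean of -log softmax(similarity / temperature), read at the matching index.

    drug_embs[i] and adr_embs[i] form a true pair; every adr_embs[j] with
    j != i is treated as a negative for drug i (assumed in-batch negatives).
    """
    losses = []
    for i, drug in enumerate(drug_embs):
        logits = [cosine(drug, adr) / temperature for adr in adr_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


drugs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(drugs, [[1.0, 0.0], [0.0, 1.0]])  # true pairs point the same way
swapped = info_nce(drugs, [[0.0, 1.0], [1.0, 0.0]])  # true pairs are orthogonal
print(f"aligned loss: {aligned:.6f}, swapped loss: {swapped:.3f}")
```

The loss is close to zero when each drug embedding is most similar to its own ADR embedding and grows when a mismatched ADR is closer. A low temperature such as 0.07 sharpens that contrast.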

**Phase 2: Classification Fine-tuning (5 epochs)**
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup
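
The two ingredients of this phase can be sketched as follows. The peak learning rate of 2e-5 comes from the card; the number of warmup steps (100 here) and the choice to hold the rate constant after warmup are assumptions, since the card specifies neither. Function names are hypothetical.

```python
import math


def binary_cross_entropy(probability, label, eps=1e-7):
    """BCE for one example; the probability is clipped to avoid log(0)."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def warmup_lr(step, peak_lr=2e-5, warmup_steps=100):
    """Linear ramp to peak_lr over warmup_steps, then constant (assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 49, 99, 500):
    print(f"step {step:>3}: lr = {warmup_lr(step):.2e}")

# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy(0.9, 1), binary_cross_entropy(0.1, 1))
```

Many fine-tuning recipes follow the warmup with a linear decay to zero; if the training script does so, the constant branch above would need to change accordingly.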

### Training Data

- **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
- **Configuration:** `Ade_corpus_v2_drug_ade_relation`
- **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
- **Validation Examples:** ~850 pairs
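
One way to obtain an equal number of negatives is to re-pair each drug with an ADR it is not annotated with. The sketch below shows that idea under stated assumptions: the actual sampling code is not part of this card, the function name is hypothetical, and the three pairs are toy data rather than corpus records. Real examples would also carry the sentence context.

```python
import random


def build_balanced_pairs(positive_pairs, seed=0):
    """Return (drug, adr, label) triples with one mismatched negative per positive."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    all_adrs = sorted({adr for _, adr in positive_pairs})

    examples = [(drug, adr, 1) for drug, adr in positive_pairs]
    for drug, _ in positive_pairs:
        # Only ADRs never annotated for this drug are eligible negatives.
        candidates = [adr for adr in all_adrs if (drug, adr) not in known]
        if candidates:  # a drug paired with every ADR yields no negative
            examples.append((drug, rng.choice(candidates), 0))

    rng.shuffle(examples)
    return examples


positives = [
    ("aspirin", "gastrointestinal bleeding"),
    ("warfarin", "haemorrhage"),
    ("methotrexate", "hepatotoxicity"),
]
examples = build_balanced_pairs(positives)
for drug, adr, label in examples:
    print(label, drug, "->", adr)
```

Random mismatches of this kind are easy negatives, which is consistent with the other models in the family listing hard negatives as a distinguishing feature.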

## Performance

| Metric | Value |
|--------|-------|
| **F1 Score** | 88.3% |
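
For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. The sketch below shows the computation on made-up labels; the 0.5 decision threshold is an assumption, as the card does not state the threshold or the split used for the reported figure.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def threshold(probabilities, cutoff=0.5):
    """Turn P(causal) scores into binary predictions (cutoff is an assumption)."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Made-up labels and scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.91, 0.74, 0.32, 0.66, 0.12, 0.05]
f1 = f1_score(y_true, threshold(scores))
print(f"F1 = {f1:.3f}")
```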

### Comparison with CRAG Family

| Model | F1 | AUC | Key Features |
|-------|-----|-----|--------------|
| **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives |

| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |

## Usage

```python
import torch
from transformers import AutoTokenizer

# The checkpoint uses a custom architecture: define the DualEncoderModel class
# before loading the weights (see the training script for its definition).

tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize each context separately, since each encoder receives its own input
tokenizer_kwargs = dict(return_tensors="pt", max_length=128, truncation=True, padding="max_length")
drug_inputs = tokenizer(drug_context, **tokenizer_kwargs)
adr_inputs = tokenizer(adr_context, **tokenizer_kwargs)

# Forward pass (pseudo-code: requires an instance of the custom model)
# with torch.no_grad():
#     drug_repr = model.encode_drug(**drug_inputs)
#     adr_repr = model.encode_adr(**adr_inputs)
#     score = model.classify(drug_repr, adr_repr)
```
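
Once pair scores are available, they can be collected into a simple drug-to-ADR graph. The following is a minimal sketch under stated assumptions: `build_adr_graph` is a hypothetical helper, the 0.5 threshold is arbitrary, and the scores are made up for illustration rather than produced by the model.

```python
from collections import defaultdict


def build_adr_graph(scored_pairs, threshold=0.5):
    """Keep pairs scoring at or above the threshold as weighted drug -> ADR edges."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # If a pair occurs several times, keep its highest score.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Made-up scores, for illustration only.
scored_pairs = [
    ("aspirin", "gastrointestinal bleeding", 0.93),
    ("aspirin", "gastrointestinal bleeding", 0.81),
    ("aspirin", "hair loss", 0.08),
    ("warfarin", "haemorrhage", 0.88),
]
graph = build_adr_graph(scored_pairs)
for drug, edges in graph.items():
    for adr, score in edges.items():
        print(f"{drug} -> {adr} ({score:.2f})")
```

Edges produced this way are candidate signals only and still need expert review before they inform any safety analysis.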

## Intended Uses

### Primary Use Cases
- **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
- **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
- **Literature Mining:** Screening biomedical publications for adverse event reports
- **Clinical Decision Support:** Identifying potential drug safety signals

### Out-of-Scope Uses
- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)

## Limitations

1. **English Only:** Trained exclusively on English biomedical text
2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
3. **Context Dependency:** Requires the drug and the ADR to both be mentioned in related contexts
4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use

## Ethical Considerations

- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)

## Citation

```bibtex
@misc{crag-dual-encoder-2024,
  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
  author={von Csefalvay, Chris},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
```

## Model Card Authors

Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))

## Model Card Contact

For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.