---
license: apache-2.0
language:
- en
tags:
- medical
- biomedical
- drug-safety
- adverse-drug-reactions
- pharmacovigilance
- relation-extraction
- dual-encoder
- clinical-nlp
- pubmedbert
datasets:
- ade-benchmark-corpus/ade_corpus_v2
metrics:
- f1
- roc_auc
pipeline_tag: text-classification
model-index:
- name: CRAG-dual-encoder-base
  results:
  - task:
      type: text-classification
      name: Drug-ADR Relation Extraction
    dataset:
      name: ADE Corpus V2
      type: ade-benchmark-corpus/ade_corpus_v2
      config: Ade_corpus_v2_drug_ade_relation
    metrics:
    - type: f1
      value: 0.883
      name: F1 Score
---

# CRAG-dual-encoder-base

**CRAG: Causal Reasoning for Adversomics Graphs**

This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.

## Model Description

CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    CRAG Dual-Encoder Base                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Drug Context          ADR Context                         │
│        │                     │                              │
│        ▼                     ▼                              │
│  ┌──────────┐          ┌──────────┐                         │
│  │PubMedBERT│          │PubMedBERT│    (separate weights)   │
│  │  Drug    │          │   ADR    │                         │
│  │ Encoder  │          │ Encoder  │                         │
│  └────┬─────┘          └────┬─────┘                         │
│       │                     │                               │
│       ▼                     ▼                               │
│  [CLS] Pool            [CLS] Pool                           │
│       │                     │                               │
│       └────────┬────────────┘                               │
│                │                                            │
│                ▼                                            │
│        ┌──────────────┐                                     │
│        │   Bilinear   │                                     │
│        │   Fusion     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│        ┌──────────────┐                                     │
│        │  MLP Head    │                                     │
│        │  (256→1)     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│           P(causal)                                         │
└─────────────────────────────────────────────────────────────┘
```

- **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
- **Hidden Dimension:** 768
- **Fusion Dimension:** 256
- **Parameters:** ~220M (two separate BERT encoders)
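
The fusion and scoring path in the diagram can be sketched in NumPy. This is an illustration under assumptions, not the released implementation: the card does not specify the exact form of the bilinear fusion, so the sketch uses a factorised (low-rank) bilinear form, a ReLU, and a single linear layer for the 256→1 head. The names `init_params` and `score_pairs`, the weight initialisation, and the random `[CLS]` vectors are hypothetical stand-ins for the real encoders and trained weights.

```python
import numpy as np

HIDDEN_DIM = 768   # [CLS] vector size of each encoder (from the card)
FUSION_DIM = 256   # fusion dimension (from the card)


def init_params(rng):
    """Random placeholder weights; hypothetical, not the trained values."""
    scale = 0.02
    return {
        "U": rng.normal(0.0, scale, size=(FUSION_DIM, HIDDEN_DIM)),  # drug projection
        "V": rng.normal(0.0, scale, size=(FUSION_DIM, HIDDEN_DIM)),  # ADR projection
        "w_out": rng.normal(0.0, scale, size=FUSION_DIM),            # 256 -> 1 head
        "b_out": 0.0,
    }


def score_pairs(drug_cls, adr_cls, params):
    """Return P(causal) for a batch of ([CLS]_drug, [CLS]_adr) vectors."""
    # Factorised bilinear fusion (assumed form): each fused unit is the product
    # of one linear function of the drug vector and one of the ADR vector.
    fused = (drug_cls @ params["U"].T) * (adr_cls @ params["V"].T)
    hidden = np.maximum(fused, 0.0)                   # ReLU non-linearity
    logits = hidden @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid -> probability


rng = np.random.default_rng(0)
params = init_params(rng)
drug_cls = rng.normal(size=(4, HIDDEN_DIM))  # stand-ins for drug encoder output
adr_cls = rng.normal(size=(4, HIDDEN_DIM))   # stand-ins for ADR encoder output
probs = score_pairs(drug_cls, adr_cls, params)
print(probs.shape)  # (4,)
```

A factorised form keeps the fusion layer small next to the two encoders, which is in line with the ~220M total above; a full bilinear tensor over two 768-dimensional inputs would add far more parameters.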

### Training Procedure

The model was trained in two phases:

**Phase 1: Contrastive Pre-training (3 epochs)**
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close in embedding space
- Random negative sampling (mismatched pairs)
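
The Phase 1 objective can be sketched as an in-batch InfoNCE loss. The sketch assumes that the other ADR embeddings in the same batch serve as the mismatched negatives and that similarity is cosine similarity; neither detail is stated in this card. The function name `info_nce_loss` and the toy embeddings are hypothetical.

```python
import numpy as np


def info_nce_loss(drug_emb, adr_emb, temperature=0.07):
    """In-batch InfoNCE: row i of drug_emb is paired with row i of adr_emb."""
    # L2-normalise so the dot product is cosine similarity (an assumption).
    drug = drug_emb / np.linalg.norm(drug_emb, axis=1, keepdims=True)
    adr = adr_emb / np.linalg.norm(adr_emb, axis=1, keepdims=True)
    logits = drug @ adr.T / temperature            # (batch, batch) similarities
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true pair sits on the diagonal; every other column is a negative.
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
adr = rng.normal(size=(8, 16))
aligned = adr + 0.01 * rng.normal(size=(8, 16))    # drugs close to their ADRs
shuffled = aligned[::-1]                           # deliberately mismatched rows
print(info_nce_loss(aligned, adr) < info_nce_loss(shuffled, adr))  # True
```

With τ=0.07 the similarities are scaled by roughly 14 before the softmax, so even small differences in cosine similarity between the true pair and the negatives produce a sharp loss signal.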

**Phase 2: Classification Fine-tuning (5 epochs)**
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup
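
The learning-rate schedule can be illustrated with a small function. Only the peak rate (2e-5) and the linear warmup come from this card; the number of warmup steps, the total step count, and the linear decay to zero after warmup are assumptions, chosen to mirror the shape of `get_linear_schedule_with_warmup` in `transformers`. The name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical schedule: 1,000 optimiser steps with 10% warmup.
total_steps, warmup_steps = 1000, 100
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps, warmup_steps))
```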

### Training Data

- **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
- **Configuration:** `Ade_corpus_v2_drug_ade_relation`
- **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
- **Validation Examples:** ~850 pairs
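
A balanced set like the one above can be built by pairing each drug with an ADR it is not linked to in the positive set. The following is a minimal sketch of that idea, not the actual preprocessing code; the function name `add_random_negatives` and the toy pairs are hypothetical.

```python
import random


def add_random_negatives(positive_pairs, seed=0):
    """Return shuffled (drug, adr, label) triples with one negative per positive."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    adrs = sorted({adr for _, adr in positive_pairs})
    examples = [(drug, adr, 1) for drug, adr in positive_pairs]
    for drug, _ in positive_pairs:
        # Sample only from ADRs that are not already linked to this drug.
        candidates = [adr for adr in adrs if (drug, adr) not in known]
        if not candidates:
            continue  # drug is linked to every ADR; no negative available
        examples.append((drug, rng.choice(candidates), 0))
    rng.shuffle(examples)
    return examples


# Toy positive pairs for illustration only (not taken from the corpus).
positives = [
    ("aspirin", "gastrointestinal bleeding"),
    ("warfarin", "haemorrhage"),
    ("methotrexate", "hepatotoxicity"),
    ("amiodarone", "pulmonary fibrosis"),
]
examples = add_random_negatives(positives)
print(sum(label for _, _, label in examples), "positives of", len(examples))
```

Random mismatching can label a genuine but unannotated drug-ADR association as negative, which is one reason randomly sampled negatives are noisier than curated ones.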

## Performance

| Metric | Value |
|--------|-------|
| **F1 Score** | 88.3% |
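
For reference, binary F1 is the harmonic mean of precision and recall on the positive (causal) class. The sketch below shows the computation on made-up labels and scores; the 0.5 decision threshold is an assumption, since this card does not state the threshold or split used for the reported figure.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels and scores (made up for illustration, not model output).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.92, 0.81, 0.35, 0.10, 0.64, 0.77, 0.05, 0.20]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption
print(round(f1_score(y_true, y_pred), 3))  # 0.75
```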

### Comparison with CRAG Family

| Model | F1 | AUC | Key Features |
|-------|-----|-----|--------------|
| **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives |
| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The checkpoint uses a custom architecture: define the DualEncoderModel class
# (see the training script for its definition) and load the weights into it
# before running the forward pass at the bottom of this example.

tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize each context separately; each encoder receives its own input
drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")

# Forward pass (pseudo-code: requires the custom model to be loaded as `model`)
# with torch.no_grad():
#     drug_repr = model.encode_drug(**drug_inputs)
#     adr_repr = model.encode_adr(**adr_inputs)
#     score = model.classify(drug_repr, adr_repr)
```
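
Once pairs have been scored, the scores can be turned into graph edges. The following is a minimal sketch of that step, assuming a 0.5 threshold and a plain dictionary as the graph representation; neither is prescribed by this card, and `build_adr_graph` and the example scores are hypothetical.

```python
from collections import defaultdict


def build_adr_graph(scored_pairs, threshold=0.5):
    """Map each drug to its ADRs whose score meets the threshold."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the highest score seen for a repeated (drug, ADR) edge.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Made-up scores standing in for model output on three candidate pairs.
scored_pairs = [
    ("aspirin", "gastrointestinal bleeding", 0.94),
    ("aspirin", "headache", 0.12),
    ("aspirin", "gastrointestinal bleeding", 0.88),
]
print(build_adr_graph(scored_pairs))
# {'aspirin': {'gastrointestinal bleeding': 0.94}}
```

Raising the threshold trades missed safety signals for fewer spurious edges, so the value is best chosen on validation data for the intended review workflow.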

## Intended Uses

### Primary Use Cases
- **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
- **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
- **Literature Mining:** Screening biomedical publications for adverse event reports
- **Clinical Decision Support:** Identifying potential drug safety signals

### Out-of-Scope Uses
- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)

## Limitations

1. **English Only:** Trained exclusively on English biomedical text
2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
3. **Context Dependency:** Requires both the drug and the ADR to be mentioned in related contexts
4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use

## Ethical Considerations

- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)

## Citation

```bibtex
@misc{crag-dual-encoder-2024,
  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
  author={von Csefalvay, Chris},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
```

## Model Card Authors

Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))

## Model Card Contact

For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.