---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
pipeline_tag: text-generation
language:
- en
tags:
- medical
- biomedical
- adverse-drug-events
- ade
- pharmacovigilance
- distillation
- lora
- peft
- llama-3.1
datasets:
- ade-benchmark-corpus/ade_corpus_v2
model-index:
- name: llama31-8b-ade-sft-v2
  results:
  - task:
      type: text-generation
      name: ADE Binary QA + span extraction
    dataset:
      type: ade-benchmark-corpus/ade_corpus_v2
      name: ade_corpus_v2 (200 held-out)
    metrics:
    - type: exact_match
      value: 0.715
      name: exact_match (answer ∈ {yes,no,abstain})
    - type: f1
      value: 0.860
      name: positive_f1 (answer=yes)
    - type: precision
      value: 0.785
      name: positive_precision
    - type: recall
      value: 0.950
      name: positive_recall
    - type: f1
      value: 0.883
      name: span_drug_token_f1 (positives only)
    - type: f1
      value: 0.866
      name: span_event_token_f1 (positives only)
---

# llama31-8b-ade-sft-v2

A LoRA adapter for [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) that answers adverse drug event (ADE) questions on single-sentence clinical text and extracts the implicated drug and event as structured JSON. Distilled from a Vertex-hosted Llama 3.3 70B teacher; trained with QLoRA on ~3k teacher-labeled sentences from `ade_corpus_v2`.

**⚠️ Not clinical grade.** This is a research / educational artifact. Do not use for patient-care decisions.

## Intended use

Given a short clinical vignette (one or a few sentences), produce a JSON object:

```json
{
  "answer": "yes | no | abstain",
  "drug": "<drug name or empty>",
  "event": "<adverse event or empty>",
  "evidence": "<quoted or closely paraphrased text>",
  "short_justification": "<one short sentence>",
  "confidence": 0.0
}
```

- `answer` is `yes` only when the text supports a causally plausible drug-event relationship.
- `abstain` is reserved for cases where the text names no plausible drug or no plausible event. Temporal co-occurrence with a clear external cause (e.g., "on metformin, slipped and fractured ankle") should be `no`, not `abstain`.
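
Downstream code will usually want to validate the reply before relying on it. The following is a minimal sketch, not part of the released pipeline: `AdeAnswer` and `parse_ade_response` are hypothetical names, and both the tolerance for stray text around the JSON object and the `[0, 1]` range check on `confidence` are assumptions rather than documented guarantees.

```python
import json
from dataclasses import dataclass

ALLOWED_ANSWERS = {"yes", "no", "abstain"}


@dataclass
class AdeAnswer:
    answer: str
    drug: str
    event: str
    evidence: str
    short_justification: str
    confidence: float


def parse_ade_response(text: str) -> AdeAnswer:
    """Parse and validate one model reply (hypothetical helper)."""
    # Assumption: the reply may carry stray text around the JSON object,
    # so keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(text[start : end + 1])

    answer = str(obj.get("answer", "")).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer value: {answer!r}")

    # Assumption: confidence is a probability-like score in [0, 1].
    confidence = float(obj.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return AdeAnswer(
        answer=answer,
        drug=str(obj.get("drug", "")).strip(),
        event=str(obj.get("event", "")).strip(),
        evidence=str(obj.get("evidence", "")).strip(),
        short_justification=str(obj.get("short_justification", "")).strip(),
        confidence=confidence,
    )
```

Raising on malformed replies, rather than silently mapping them to `abstain`, keeps parsing failures distinguishable from genuine abstentions when computing metrics.
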
## Evaluation

The held-out split has 200 rows (balanced, 100 positive / 100 negative), sampled from `ade_corpus_v2` and never seen during training. Results are compared against a v1 baseline that did not use few-shots or hard negatives.

| Metric | v1 | **v2 (this model)** |
|---|---|---|
| exact_match (yes/no/abstain) | 0.555 | **0.715** |
| abstain_rate | 0.315 | **0.135** |
| positive_f1 | 0.884 | 0.860 |
| positive_precision | 0.798 | 0.785 |
| positive_recall | 0.990 | 0.950 |
| span_drug_exact_match (pos) | 0.940 | 0.840 |
| span_drug_token_f1 (pos) | 0.952 | 0.883 |
| span_event_exact_match (pos) | 0.660 | 0.710 |
| span_event_token_f1 (pos) | 0.816 | 0.866 |

**Tradeoff to know.** v2 adds 600 "hard negatives" (drug mentioned, answer=no) to teach calibrated abstention. This more than halved the abstain rate (0.315 to 0.135) and added 16 points of exact_match, but cost 10 points of drug-span exact match relative to v1 (0.940 to 0.840): the model learned to be more cautious about emitting a drug name. If your use case prioritizes drug extraction on positives above all else, the earlier v1 checkpoint may be preferable.
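
For readers who want to score their own predictions the same way, the sketch below shows one plausible reading of the metrics in the table. It is an illustration under assumptions, not the repository's evaluation script: the function names are hypothetical, gold labels are assumed to be only `yes`/`no` (so an `abstain` prediction never counts as an exact match), and token F1 is assumed to use lowercased whitespace tokens.

```python
from collections import Counter


def classification_metrics(gold, pred):
    """gold: list of "yes"/"no"; pred: list of "yes"/"no"/"abstain"."""
    n = len(gold)
    exact_match = sum(g == p for g, p in zip(gold, pred)) / n
    abstain_rate = sum(p == "abstain" for p in pred) / n

    # Precision/recall/F1 treat "yes" as the positive class.
    tp = sum(g == "yes" and p == "yes" for g, p in zip(gold, pred))
    pred_yes = sum(p == "yes" for p in pred)
    gold_yes = sum(g == "yes" for g in gold)
    precision = tp / pred_yes if pred_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "exact_match": exact_match,
        "abstain_rate": abstain_rate,
        "positive_precision": precision,
        "positive_recall": recall,
        "positive_f1": f1,
    }


def token_f1(gold_span, pred_span):
    """Bag-of-tokens F1 between a gold span and a predicted span."""
    gold_tokens = gold_span.lower().split()
    pred_tokens = pred_span.lower().split()
    if not gold_tokens or not pred_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

As a consistency check on the table itself, the reported v2 precision and recall give 2 × 0.785 × 0.950 / (0.785 + 0.950) ≈ 0.860, which matches the reported positive_f1.
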
## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "Ventali/llama31-8b-ade-sft-v2"

# Load the base model in 4-bit NF4, matching the quantization used for training.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")

# Attach the LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "system", "content": "You are a careful biomedical assistant. For each case, return a compact JSON answer grounded in the provided evidence. If the evidence is insufficient, abstain."},
    {"role": "user", "content": "Case: The patient developed diffuse urticaria three days after starting amoxicillin.\n\nIs this consistent with a possible adverse drug event? Identify the drug and event if so, or abstain if the evidence is insufficient."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; pad with EOS because the tokenizer is loaded without a pad token here.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

On Apple Silicon, you can fuse the adapter into the base model and run it via `mlx-lm`:

```bash
pip install mlx-lm
mlx_lm.fuse --model meta-llama/Llama-3.1-8B-Instruct \
  --adapter-path <local-adapter-dir> \
  --save-path ~/models/llama31-ade-mlx
mlx_lm.generate --model ~/models/llama31-ade-mlx --prompt "..."
```

## Training

- Base: `meta-llama/Llama-3.1-8B-Instruct`, loaded in 4-bit (NF4, double-quant, bf16 compute).
- LoRA: r=32, alpha=64, dropout=0.05, target modules `{q,k,v,o,gate,up,down}_proj`. 41.9M trainable params (0.52% of base).
- Data: 2,999 (prompt, teacher JSON) pairs. Prompts drawn from `ade_corpus_v2` as 1,200 positive (from `drug_ade_relation`) + 1,200 easy-negative + 600 hard-negative (classification label=0 rows whose text mentions a drug from the positive-split vocabulary). Teacher: Vertex AI managed `llama-3.3-70b-instruct-maas` (temperature 0.2), seeded with 3 yes/no/abstain few-shots and prompted to reserve abstention for cases with no plausible drug or no plausible event.
- Filter: required non-empty `answer` and `evidence`, `confidence ≥ 0.65`, evidence-source word overlap ≥ 0.6. 2,999/3,000 retained.
- Optimizer: AdamW, lr=2e-4, warmup_ratio=0.03, weight_decay=0.01, bf16, gradient_checkpointing on.
- 3 epochs with `load_best_model_at_end=True` on `eval_loss`; the epoch-1 checkpoint (eval_loss 0.506) was restored because epochs 2–3 overfit (eval_loss 0.547 and 0.676).
- Hardware: single A100 40GB on GCP `a2-highgpu-1g`. Training wall time ~94 min.
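
The hard-negative selection and the teacher-output filter described in the list above can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, the tokenization is a simple lowercase alphanumeric split, and "evidence-source word overlap" is assumed to mean the fraction of evidence words that also appear in the source sentence. The actual definitions live in the linked repository and may differ.

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def evidence_overlap(evidence, source):
    """Fraction of evidence words that also occur in the source sentence."""
    ev = _words(evidence)
    if not ev:
        return 0.0
    src = set(_words(source))
    return sum(w in src for w in ev) / len(ev)


def keep_example(source, teacher, min_confidence=0.65, min_overlap=0.6):
    """Apply the three filter conditions to one teacher JSON object."""
    if not teacher.get("answer") or not teacher.get("evidence"):
        return False
    if float(teacher.get("confidence", 0.0)) < min_confidence:
        return False
    return evidence_overlap(teacher["evidence"], source) >= min_overlap


def is_hard_negative(text, label, drug_vocab):
    """True for a label=0 sentence that still mentions a known drug.

    Assumption: drug_vocab holds single-token drug names collected from
    the positive split; multi-word names would need phrase matching.
    """
    if label != 0:
        return False
    tokens = set(_words(text))
    return any(drug.lower() in tokens for drug in drug_vocab)
```

Under these assumptions, a teacher reply whose evidence is a close paraphrase of the input passes the overlap check, while evidence that introduces mostly new wording is dropped.
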
## Limitations

- Trained on single-sentence, literature-style clinical text. Longer narratives (discharge summaries, EHR free-text) are out of distribution and will likely perform worse.
- Teacher labels are synthetic. No clinician-reviewed eval set was used, so agreement with human judgment has not been measured.
- The model occasionally produces an empty `drug` or `event` field on positive cases, which is a regression from v1 on drug-span extraction. See the tradeoff note above.
- English only.

## Reproducibility

Full pipeline (seed building, teacher generation config, filter, SFT prep, training, evaluation) lives at https://github.com/ventali/medical-distill. Commit [`547629f`](https://github.com/ventali/medical-distill/commit/547629f) records this adapter's metrics.

## License

Inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/) from the base model.