---
license: apache-2.0
language:
- en
base_model:
- google/flan-t5-base
pipeline_tag: summarization
library_name: transformers
tags:
- summarization
- biomedical
- alzheimer
- dementia
- neuroscience
- domain-specific-llm
- scientific-literature
---

# flan-t5-base – Alzheimer Ultra-Safe Summarizer

## Model summary

This repository contains a fine-tuned version of **[`google/flan-t5-base`](https://huggingface.co/google/flan-t5-base)** for **results- and conclusions-focused summarization of Alzheimer’s disease–related scientific abstracts**.

- **Base model:** `google/flan-t5-base` (≈250M parameters, encoder–decoder, Apache-2.0)
- **Task:** Text-to-text summarization of biomedical abstracts
- **Domain:** Alzheimer’s disease, dementia, and related neurodegenerative / neuroimmunology literature
- **Input:** Full abstract (usually from PubMed or similar sources)
- **Output:** 1–3 sentence summary, biased towards the *main results and conclusions*

> ⚠️ **Important:** This model is intended **only for research, education, and literature exploration**.  
> It must **not** be used as a standalone tool for diagnosis, treatment decisions, or any clinical workflow.

---

## Intended use

### Primary use case

- **Summarizing Alzheimer’s-related scientific abstracts** into short, results-oriented summaries that are easier to scan.
- Supporting:
  - literature review,
  - dataset curation,
  - building search / indexing tools,
  - rapid exploration of Alzheimer’s disease research.

The model tends to emphasize:

- key findings (e.g., “X polymorphism is associated with AD risk”),
- high-level conclusions,
- sometimes sample characteristics (N, cohort description) when present in the abstract.

### Supported languages

- **English only.**
- The base model is multilingual, but this fine-tuning was performed **only on English biomedical abstracts**.
- Using it on other languages is *out of distribution* and may produce poor or incorrect summaries.

### Non-goals / out-of-scope

This model is **not** designed or validated for:

- Patient-level clinical decision support
- Prognosis estimation or risk scoring
- Generating treatment recommendations
- Legal, regulatory, or billing decisions
- Summarizing layperson health information for patients

---

## How it was trained

### Base model

- `google/flan-t5-base` (Apache-2.0 licensed, instruction-tuned T5-base).

### Training data (high-level)

> The underlying dataset itself is **not included** in this repository. This section only documents how the data was used.

- ~**9.6k** abstracts related to:
  - Alzheimer’s disease (AD),
  - dementia,
  - neurodegeneration,
  - neuroinflammation / neuroimmunology,
  - related biomarkers and imaging studies.
- Abstracts were retrieved programmatically from **PubMed-like sources** using Alzheimer’s-related queries.
- Each abstract is paired with a **“teacher summary”**, constructed heuristically by selecting sentences that:
  - appear under section labels such as `RESULTS:` and/or `CONCLUSIONS:` (if present),
  - or otherwise state the core result of the study.

In other words, training labels are **extractive, results-focused summaries** derived from the abstracts themselves, not human-written abstractive summaries.
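
The heuristic above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the label regex, the `per_section` and `max_sentences` limits, and the fallback of taking the last two sentences of an unlabeled abstract are all assumptions made for the example.

```python
import re

# Upper-case section labels such as "RESULTS:" or "CONCLUSIONS:" (assumed format).
LABEL_RE = re.compile(r"\b([A-Z][A-Z ]{2,}):")
# Naive sentence splitter: break after ., ! or ? followed by whitespace.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
TARGET_LABELS = ("RESULT", "CONCLUSION")


def split_sentences(text):
    """Split text into sentences with a simple punctuation rule."""
    return [s.strip() for s in SENT_SPLIT_RE.split(text.strip()) if s.strip()]


def teacher_summary(abstract, per_section=2, max_sentences=3):
    """Build an extractive, results-focused summary (hypothetical helper)."""
    # re.split with one capture group yields [preamble, label, body, label, body, ...]
    parts = LABEL_RE.split(abstract)
    picked = []
    for label, body in zip(parts[1::2], parts[2::2]):
        if any(target in label for target in TARGET_LABELS):
            picked.extend(split_sentences(body)[:per_section])
    if not picked:
        # Assumed fallback: unlabeled abstracts usually end with the main result.
        picked = split_sentences(abstract)[-2:]
    return " ".join(picked[:max_sentences])


structured = (
    "BACKGROUND: AD is common. METHODS: We studied 100 patients. "
    "RESULTS: Periodontitis was associated with higher risk. "
    "Effect sizes were modest. "
    "CONCLUSIONS: Periodontal disease may be linked to AD."
)
print(teacher_summary(structured))
```

Because every output sentence is copied verbatim from the abstract, labels built this way cannot introduce facts that are absent from the source text.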

### Objective

- Text-to-text supervised fine-tuning:
  - **Input:** the full abstract (often with a task prefix like `summarize:` or a short instruction).
  - **Target:** the corresponding `teacher_summary` (1–3 sentences, mostly extractive).

This encourages the model to:

- focus on the *result/conclusion* region of the abstract,
- avoid over-emphasizing background and methods,
- stay within the factual space of the original text.
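
As a rough sketch of how one training pair could be prepared for this objective, the snippet below builds the prefixed source string and pads a target to a fixed length while masking the padding. The helper names are hypothetical and the token ids are made up. The snippet assumes the target ids already come from a tokenizer and that the label value `-100` (the default `ignore_index` of PyTorch's cross-entropy loss) marks padded positions.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_source(abstract, prefix="summarize: "):
    """Prepend the task prefix and collapse whitespace (hypothetical helper)."""
    return prefix + " ".join(abstract.split())


def pad_and_mask_labels(target_ids, max_target_len=128):
    """Truncate target ids to max_target_len and mask the padded tail."""
    ids = list(target_ids)[:max_target_len]
    # Padded positions get IGNORE_INDEX so they do not contribute to the loss.
    return ids + [IGNORE_INDEX] * (max_target_len - len(ids))


source = build_source("  RESULTS: Risk was higher\n in the exposed group. ")
labels = pad_and_mask_labels([71, 305, 9, 1], max_target_len=6)  # made-up ids
print(source)
print(labels)
```

In a real pipeline the tokenizer handles truncation and the data collator usually handles the masking; the sketch only makes the two steps explicit.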

### Training setup (approximate)

- Framework: **PyTorch** + `transformers`
- Model class: `AutoModelForSeq2SeqLM`
- Tokenizer: `AutoTokenizer` for `google/flan-t5-base`
- Train/validation split: ~90% / 10% on the Alzheimer abstracts
- Hyperparameters (typical configuration used in this project):
  - Epochs: **5**
  - Optimizer: `AdamW`
  - Learning rate: ~**1e-4**
  - Weight decay: ~**0.01**
  - LR schedule: linear decay with ~10% warmup
  - Batch size: effective batch size increased via gradient accumulation
  - Max input length: **512 tokens**
  - Max target length: **≈128 tokens**
  - Loss: standard cross-entropy on decoder outputs with padding tokens masked
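
To make the learning-rate schedule concrete, here is a small standalone sketch of linear warmup followed by linear decay, the same shape that `get_linear_schedule_with_warmup` in `transformers` produces. The effective batch size of 32 is a hypothetical value chosen only to get a step count; the card does not state the real one.

```python
import math


def linear_warmup_decay_lr(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (illustrative helper)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_train = 8640       # roughly 90% of ~9.6k abstracts
effective_batch = 32   # hypothetical: per-device batch x accumulation steps
epochs = 5
total_steps = math.ceil(num_train / effective_batch) * epochs

for step in (0, 135, 700, total_steps):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

Under these assumptions the run has 1,350 optimizer steps, the peak rate of 1e-4 is reached after 135 of them, and the rate falls to zero at the end of epoch 5.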

### Training dynamics (example)

Observed loss over 5 epochs (representative run):

- `Epoch 1` – Train loss ≈ **0.32** | Val loss ≈ **0.18**
- `Epoch 5` – Train loss ≈ **0.16** | Val loss ≈ **0.16**

Combined with qualitative inspection, this suggests:

- Stable training (no divergence or NaN losses)
- Convergence without strong overfitting: validation loss stays at or below training loss in both reported epochs
- Outputs that track the teacher summaries closely.

---

## How to use the model

> 🔎 **Note:** The raw model is a standard seq2seq model.  
> For **extra safety**, you may want to wrap it with an overlap-based filter that removes sentences not grounded in the abstract (described later under “Safety & hallucination”).
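
One possible shape for such an overlap-based filter is sketched below. It is an illustration only and not necessarily the filter this card describes: the function name, the word-level tokenization, and the `min_overlap=0.8` threshold are assumptions chosen for the example.

```python
import re


def _tokens(text):
    """Lower-cased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_grounded(summary, abstract, min_overlap=0.8):
    """Keep only summary sentences whose words mostly occur in the abstract."""
    source_vocab = _tokens(abstract)
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Fraction of the sentence's distinct words that appear in the abstract.
        overlap = len(words & source_vocab) / len(words)
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)


abstract_text = (
    "Patients with moderate-severe periodontitis had a higher risk of dementia. "
    "Our findings suggest that periodontal disease may be associated with "
    "cognitive decline."
)
raw_summary = (
    "Patients with periodontitis had a higher risk of dementia. "
    "Aspirin cures Alzheimer's disease in all patients."
)
print(filter_grounded(raw_summary, abstract_text))
```

A word-overlap check is cheap and conservative, but it cannot detect a sentence that reuses the abstract's vocabulary while changing its meaning (for example, by dropping a negation), so it complements manual review rather than replacing it.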

### Basic usage (raw summarization)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ffurkandemir/flan-t5-base-alzheimer-ultra-safe"  # or your actual repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

abstract = """
Alzheimer's disease (AD) is a neurodegenerative disorder...
RESULTS: Patients with moderate-severe periodontitis had a higher risk...
CONCLUSIONS: Our findings suggest that periodontal disease may be associated with...
"""

prompt = (
    "Summarize the following abstract in 2-3 sentences, focusing on the main "
    "results and conclusions:\n\n" + abstract
)

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # generous cap (training targets were ~128 tokens) so summaries are not cut off
    num_beams=4,
    no_repeat_ngram_size=3,
    early_stopping=True,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)