---
license: apache-2.0
language:
- en
base_model:
- google/flan-t5-base
pipeline_tag: summarization
library_name: transformers
tags:
- summarization
- biomedical
- alzheimer
- dementia
- neuroscience
- domain-specific-llm
- scientific-literature
---

# flan-t5-base – Alzheimer Ultra-Safe Summarizer

## Model summary

This repository contains a fine-tuned version of **[`google/flan-t5-base`](https://huggingface.co/google/flan-t5-base)** for **results- and conclusions-focused summarization of Alzheimer’s disease–related scientific abstracts**.

- **Base model:** `google/flan-t5-base` (≈250M parameters, encoder–decoder, Apache-2.0)
- **Task:** Text-to-text summarization of biomedical abstracts
- **Domain:** Alzheimer’s disease, dementia, and related neurodegenerative / neuroimmunology literature
- **Input:** Full abstract (usually from PubMed or similar sources)
- **Output:** 1–3 sentence summary, biased towards the *main results and conclusions*

> ⚠️ **Important:** This model is intended **only for research, education, and literature exploration**.  
> It must **not** be used as a standalone tool for diagnosis, treatment decisions, or any clinical workflow.

---

## Intended use

### Primary use case

- **Summarizing Alzheimer’s-related scientific abstracts** into short, results-oriented summaries that are easier to scan.
- Supporting:
  - literature review,
  - dataset curation,
  - building search / indexing tools,
  - rapid exploration of Alzheimer’s disease research.

The model tends to emphasize:

- key findings (e.g., “X polymorphism is associated with AD risk”),
- high-level conclusions,
- sometimes sample characteristics (N, cohort description) when present in the abstract.

### Supported languages

- **English only.**
- The base model is multilingual, but this fine-tuning was performed **only on English biomedical abstracts**.
- Using it on other languages is *out of distribution* and may produce poor or incorrect summaries.

### Non-goals / out-of-scope

This model is **not** designed or validated for:

- Patient-level clinical decision support
- Prognosis estimation or risk scoring
- Generating treatment recommendations
- Legal, regulatory, or billing decisions
- Summarizing layperson health information for patients

---

## How it was trained

### Base model

- `google/flan-t5-base` (Apache-2.0 licensed, instruction-tuned T5-base).

### Training data (high-level)

> The underlying dataset itself is **not included** in this repository. This section only documents how the data was used.

- ~**9.6k** abstracts related to:
  - Alzheimer’s disease (AD),
  - dementia,
  - neurodegeneration,
  - neuroinflammation / neuroimmunology,
  - related biomarkers and imaging studies.
- Abstracts were retrieved programmatically from **PubMed-like sources** using Alzheimer’s-related queries.
- Each abstract is paired with a **“teacher summary”**, constructed heuristically by selecting sentences that:
  - carry section labels such as `RESULTS:` and/or `CONCLUSIONS:` (if present),
  - or otherwise capture the core result statement of the study.

In other words, training labels are **extractive, results-focused summaries** derived from the abstracts themselves, not human-written abstractive summaries.
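
The exact selection heuristic is not published here. As a rough illustration only, a labeled-section extractor along these lines would produce summaries of the kind described above. The function name `build_teacher_summary`, the label set, the naive sentence splitter, and the last-sentence fallback are all assumptions made for this sketch, not the project's actual code.

```python
import re

# Assumed label set; the source only mentions RESULTS: and CONCLUSIONS:.
TARGET_LABELS = {"RESULT", "RESULTS", "CONCLUSION", "CONCLUSIONS"}
# Matches an upper-case section label at the start of a sentence, e.g. "RESULTS:".
LABEL_RE = re.compile(r"^([A-Z][A-Z ]+):\s*")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def build_teacher_summary(abstract, max_sentences=3):
    """Hypothetical sketch: keep sentences under RESULTS/CONCLUSIONS labels."""
    sentences = split_sentences(abstract)
    selected = []
    current_label = None
    for sentence in sentences:
        match = LABEL_RE.match(sentence)
        if match:
            # A new labeled section starts with this sentence.
            current_label = match.group(1).strip()
        if current_label in TARGET_LABELS:
            selected.append(sentence)
    if not selected:
        # Placeholder fallback (assumption): use the final sentence
        # when the abstract has no labeled sections.
        selected = sentences[-1:]
    # Keeps the first max_sentences selected sentences, in original order.
    return " ".join(selected[:max_sentences])


example = (
    "BACKGROUND: AD is common. METHODS: We studied 100 patients. "
    "RESULTS: Risk was higher. CONCLUSIONS: Periodontitis may be associated with AD."
)
print(build_teacher_summary(example))
```

Because the output is copied verbatim from the abstract, labels built this way are extractive by construction.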

### Objective

- Text-to-text supervised fine-tuning:
  - **Input:** the full abstract (often with a task prefix like `summarize:` or a short instruction).
  - **Target:** the corresponding `teacher_summary` (1–3 sentences, mostly extractive).

This encourages the model to:

- focus on the *result/conclusion* region of the abstract,
- avoid over-emphasizing background and methods,
- stay within the factual space of the original text.

### Training setup (approximate)

- Framework: **PyTorch** + `transformers`
- Model class: `AutoModelForSeq2SeqLM`
- Tokenizer: `AutoTokenizer` for `google/flan-t5-base`
- Train/validation split: ~90% / 10% on the Alzheimer abstracts
- Hyperparameters (typical configuration used in this project):
  - Epochs: **5**
  - Optimizer: `AdamW`
  - Learning rate: ~**1e-4**
  - Weight decay: ~**0.01**
  - LR schedule: linear decay with ~10% warmup
  - Batch size: gradient accumulation used to raise the effective batch size
  - Max input length: **512 tokens**
  - Max target length: **≈128 tokens**
  - Loss: standard cross-entropy on decoder outputs with padding tokens masked
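
Two of the settings above are easy to misread, so here is a small dependency-free sketch of what they mean. It assumes the schedule has the same shape as the `get_linear_schedule_with_warmup` helper in `transformers`, and that padding is masked by replacing pad token IDs in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The function names and the step counts are hypothetical; the actual training script is not part of this repository.

```python
def linear_warmup_decay(step, total_steps, warmup_fraction=0.1):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace pad token IDs so the loss skips them (T5 uses pad ID 0)."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]


base_lr = 1e-4
total_steps = 1000  # hypothetical; depends on dataset size and batch size

for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_decay(step, total_steps))

print(mask_padding([71, 1712, 1, 0, 0]))  # -> [71, 1712, 1, -100, -100]
```

With ~10% warmup, the learning rate reaches its peak one tenth of the way through training and then falls linearly to zero at the final step.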

### Training dynamics (example)

Observed loss over 5 epochs (representative run):

- `Epoch 1` – Train loss ≈ **0.32** | Val loss ≈ **0.18**
- `Epoch 5` – Train loss ≈ **0.16** | Val loss ≈ **0.16**

Combined with qualitative inspection, this suggests:

- Stable training (no divergence / NaNs)
- Reasonable convergence without strong overfitting
- Good alignment with the teacher summaries

---

## How to use the model

> 🔎 **Note:** The raw model is a standard seq2seq model.  
> For **extra safety**, you may want to wrap it with an overlap-based filter that removes sentences not grounded in the abstract (described later under “Safety & hallucination”).
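
To make the idea of an overlap-based filter concrete, the sketch below drops any generated sentence whose words are mostly absent from the abstract. It is an illustration only: the function name `filter_ungrounded`, the tokenization, and the `min_overlap=0.8` threshold are assumptions chosen for this example, and the filter described under “Safety & hallucination” may work differently.

```python
import re


def content_tokens(text):
    """Lower-cased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_ungrounded(summary, abstract, min_overlap=0.8):
    """Keep only summary sentences whose tokens mostly occur in the abstract."""
    source_tokens = content_tokens(abstract)
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        tokens = content_tokens(sentence)
        if not tokens:
            continue
        # Fraction of the sentence's distinct tokens that appear in the abstract.
        overlap = len(tokens & source_tokens) / len(tokens)
        if overlap >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)


abstract = "Patients with periodontitis had a higher risk of dementia."
summary = (
    "Patients with periodontitis had a higher risk of dementia. "
    "Aspirin cures Alzheimer disease quickly."
)
print(filter_ungrounded(summary, abstract))
```

A token-overlap check like this is cheap, but it only tests vocabulary, not meaning: a sentence that reuses the abstract's words while reversing its claim would still pass.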

### Basic usage (raw summarization)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "ffurkandemir/flan-t5-base-alzheimer-ultra-safe"  # or your actual repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

abstract = """
Alzheimer's disease (AD) is a neurodegenerative disorder...
RESULTS: Patients with moderate-severe periodontitis had a higher risk...
CONCLUSIONS: Our findings suggest that periodontal disease may be associated with...
"""

prompt = (
    "Summarize the following abstract in 2-3 sentences, focusing on the main "
    "results and conclusions:\n\n" + abstract
)

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # above the ≈128-token training target, so summaries are not cut off
    num_beams=4,
    no_repeat_ngram_size=3,
    early_stopping=True,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)