---
language:
- en
license: mit
tags:
- healthcare
- privacy
- phi
- de-identification
- clinical-nlp
- text-generation
- rewriting
- stateful
- multimodal
- streaming
- pytorch
- t5
- hipaa
- ehr
- re-identification
pipeline_tag: text-generation
base_model: t5-base
datasets:
- vkatg/streaming-phi-deidentification-benchmark
- vkatg/multimodal-phi-masking-benchmark
model-index:
- name: ExposureGuard-SynthRewrite-T5
  results:
  - task:
      type: text-generation
    metrics:
    - type: val_loss
      value: 0.1183
      name: Validation Loss (cross-entropy)
    - type: format_accuracy_normalized
      value: 1.0
      name: Format Accuracy (post-normalization)
    - type: format_accuracy_raw
      value: 1.0
      name: Format Accuracy (raw model output)
---

# ExposureGuard-SynthRewrite-T5

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)

Redacting PHI preserves privacy but destroys the clinical note. A note full of `[REDACTED]` tokens is of little use for downstream processing, audit review, or secondary analysis. This model takes a different approach: it rewrites the note into safe, readable text in which patient names are replaced with consistent pseudonyms and other identifiers with version-stamped placeholders, rather than blanks.

The version suffix on each placeholder (for example, `-V1` in `[DOB-V1]`) is the key behavior. When an upstream trigger such as cross-modal linkage forces a pseudonym rotation, the version counter increments. Any system trying to join records across version boundaries hits a deliberate discontinuity: records from before and after the rotation cannot be linked through the pseudonym chain.

---

## What it does

Input: a clinical note plus exposure context (risk score, modality, pseudonym version, trigger reason).
Output: a fully rewritten note with real PHI replaced by version-stamped pseudonyms.

```
Input:  rewrite: [risk=0.72] [modality=asr] [version=1] [trigger=cross_modal_linkage]
        note: James Smith is a 67-year-old patient presenting with chest pain.
              DOB 03/22/1955. Contact: 555-0142. MRN: MRN482910.

Output: Robert Taylor is a 67-year-old patient presenting with chest pain.
        DOB [DOB-V1]. Contact: [PHONE-V1]. MRN: [MRN-V1].
```

Version 0 means no trigger has fired. Version 1 means pseudonyms rotated once. The version suffix in every placeholder makes the rotation point explicit and auditable.
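
Because the version is embedded in every placeholder, a downstream audit step can recover it with plain string matching. The following is a minimal sketch of such a check, not part of this repository: the helper names `placeholder_versions` and `is_consistent` are hypothetical, and the regular expression assumes only the four placeholder types listed in this card.

```python
import re

# Matches version-stamped placeholders such as [DOB-V1] or [PHONE-V2].
# Assumption: only the four placeholder types named in this card occur.
PLACEHOLDER_RE = re.compile(r"\[(DOB|MRN|PHONE|ADDR)-V(\d+)\]")


def placeholder_versions(text: str) -> set[int]:
    """Return the set of version numbers found in the text's placeholders."""
    return {int(m.group(2)) for m in PLACEHOLDER_RE.finditer(text)}


def is_consistent(text: str, expected_version: int) -> bool:
    """True if no placeholder carries a version other than the expected one.

    A text with no placeholders at all passes this check.
    """
    return placeholder_versions(text) <= {expected_version}


rewritten = (
    "Robert Taylor is a 67-year-old patient presenting with chest pain. "
    "DOB [DOB-V1]. Contact: [PHONE-V1]. MRN: [MRN-V1]."
)
print(placeholder_versions(rewritten))  # {1}
print(is_consistent(rewritten, 1))      # True
```

A note that mixes versions (for example `[DOB-V0]` next to `[MRN-V1]`) would fail `is_consistent`, which is one way to flag a rewrite that straddles a rotation point.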

---

## Usage

```python
from inference import load, rewrite

model, tok = load(".")

result = rewrite(
    note="James Smith, MRN 482910, admitted with chest pain. DOB 03/22/1955.",
    risk=0.72,
    modality="asr",
    version=1,
    trigger_reason="cross_modal_linkage",
    model=model,
    tokenizer=tok,
)
print(result)
```

Or directly with transformers:

```python
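from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# T5Tokenizer is SentencePiece-based, so the sentencepiece package must be installed.
tok   = T5Tokenizer.from_pretrained("vkatg/exposureguard-synthrewrite-t5")
model = T5ForConditionalGeneration.from_pretrained("vkatg/exposureguard-synthrewrite-t5")
model.eval()  # inference mode: disables dropout

# Conditioning tokens come first, followed by the raw note.
note   = "James Smith, MRN482910, admitted with chest pain. DOB 03/22/1955."
prompt = (
    "rewrite: [risk=0.72] [modality=asr] [version=1] "
    f"[trigger=cross_modal_linkage] note: {note}"
)

# Prompts longer than 256 tokens are truncated, which can cut off the end of a long note.
ids = tok(prompt, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    out = model.generate(
        ids["input_ids"],
        attention_mask=ids["attention_mask"],
        max_new_tokens=200,      # upper bound on the length of the rewrite
        num_beams=4,             # beam search with 4 beams
        no_repeat_ngram_size=3,  # block repeated trigrams
        early_stopping=True,     # stop once num_beams finished candidates exist
    )
print(tok.decode(out[0], skip_special_tokens=True))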
```

---

## Input format

```
rewrite: [risk=FLOAT] [modality=MODALITY] [version=INT] [trigger=REASON] note: NOTE_TEXT
```

| Token | Values |
|---|---|
| `risk` | 0.0 to 1.0, cumulative re-identification risk from upstream scorer |
| `modality` | text, asr, image_proxy, waveform_proxy, audio_proxy |
| `version` | 0 = no trigger fired, 1+ = number of pseudonym rotations |
| `trigger` | none, cross_modal_linkage, exposure_accumulation |
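
The table above can be turned into a small prompt builder that rejects values outside the documented ranges. This is a sketch under stated assumptions: `build_prompt` is a hypothetical helper (the repository's own `inference.rewrite` may assemble the prompt differently), and formatting the risk with two decimals is inferred from the `[risk=0.72]` example rather than documented.

```python
# Allowed values copied from the input format table in this card.
MODALITIES = {"text", "asr", "image_proxy", "waveform_proxy", "audio_proxy"}
TRIGGERS = {"none", "cross_modal_linkage", "exposure_accumulation"}


def build_prompt(note: str, risk: float, modality: str, version: int, trigger: str) -> str:
    """Assemble the conditioning prefix and the note into one prompt string."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError(f"risk must be in [0.0, 1.0], got {risk}")
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    if version < 0:
        raise ValueError(f"version must be non-negative, got {version}")
    if trigger not in TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    # Assumption: two-decimal risk formatting, matching the card's example.
    return (
        f"rewrite: [risk={risk:.2f}] [modality={modality}] "
        f"[version={version}] [trigger={trigger}] note: {note}"
    )


prompt = build_prompt(
    note="James Smith, MRN482910, admitted with chest pain. DOB 03/22/1955.",
    risk=0.72,
    modality="asr",
    version=1,
    trigger="cross_modal_linkage",
)
print(prompt)
```

Validating before generation keeps malformed conditioning tokens out of the model input, where they would otherwise fail silently as ordinary text.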

---

## PHI handling

| PHI type | Replacement |
|---|---|
| Patient name | Consistent pseudonym, version-aware |
| DOB | [DOB-V{version}] |
| MRN | [MRN-V{version}] |
| Phone | [PHONE-V{version}] |
| Address | [ADDR-V{version}] |
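
When the original identifiers are known (as they are for synthetic test notes), a simple post-generation check can confirm that none of them survive verbatim in the rewrite. The sketch below is illustrative only: `leaked_values` is a hypothetical helper, and a case-insensitive substring match will not catch reformatted or partial identifiers.

```python
def leaked_values(rewritten: str, known_phi: dict[str, str]) -> list[str]:
    """Return the PHI types whose original value still appears verbatim in the rewrite."""
    lowered = rewritten.lower()
    return [kind for kind, value in known_phi.items() if value.lower() in lowered]


# Original identifiers from the example note in this card.
known_phi = {
    "name": "James Smith",
    "dob": "03/22/1955",
    "phone": "555-0142",
    "mrn": "MRN482910",
}

good = (
    "Robert Taylor is a 67-year-old patient presenting with chest pain. "
    "DOB [DOB-V1]. Contact: [PHONE-V1]. MRN: [MRN-V1]."
)
bad = "James Smith is a 67-year-old patient. DOB [DOB-V1]. MRN: MRN482910."

print(leaked_values(good, known_phi))  # []
print(leaked_values(bad, known_phi))   # ['name', 'mrn']
```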

---

## Training

Fine-tuned from `t5-base` (223M parameters) on 12,000 synthetic clinical note pairs, with the bracket placeholders added to the tokenizer as special tokens. Each pair consists of a note with injected fake PHI and its matching rewrite. Trigger scenarios (cross_modal_linkage, exposure_accumulation) were oversampled so the model learns version-bump behavior reliably. Training ran on an Apple M4 using the PyTorch MPS backend.

| Metric | Value |
|---|---|
| Base model | t5-base |
| Parameters | 222,894,336 |
| Training pairs | 12,000 |
| Epochs | 10 |
| Best val loss | 0.1183 |
| Format accuracy (raw) | 100% |
| Format accuracy (normalized) | 100% |
| Special tokens added | 16 (`[DOB-V0]`–`[ADDR-V3]`) |
| Trigger distribution | none: 49%, cross_modal_linkage: 38%, exposure_accumulation: 13% |

Validation loss by epoch: 0.2121 → 0.1521 → 0.1480 → 0.1438 → 0.1392 → 0.1330 → 0.1283 → 0.1226 → 0.1195 → 0.1183. The loss improved at every epoch, so the best value (0.1183) comes from epoch 10.

Architecture: 12-layer encoder-decoder, 768 hidden dim, 12 attention heads, 3072 FFN dim.
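
The 16 special tokens in the training table correspond to four placeholder types across versions 0 to 3. A minimal sketch that enumerates them follows; the ordering (type first, then version) is an assumption chosen to match the `[DOB-V0]`–`[ADDR-V3]` range shown in the table, and `placeholder_tokens` is a hypothetical helper.

```python
# Placeholder types from the PHI handling table; versions 0..3 from the training table.
PHI_TYPES = ("DOB", "MRN", "PHONE", "ADDR")
MAX_VERSION = 3


def placeholder_tokens() -> list[str]:
    """Enumerate every version-stamped placeholder token, grouped by PHI type."""
    return [f"[{kind}-V{v}]" for kind in PHI_TYPES for v in range(MAX_VERSION + 1)]


tokens = placeholder_tokens()
print(len(tokens))            # 16
print(tokens[0], tokens[-1])  # [DOB-V0] [ADDR-V3]
```

One common way to register such tokens with a Hugging Face tokenizer is `tokenizer.add_tokens(tokens, special_tokens=True)` followed by `model.resize_token_embeddings(len(tokenizer))`; this card does not state which call was used in training. The token range also implies that only versions 0 through 3 have dedicated tokens, and behavior for higher versions is not described here.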

---

## Where it fits

```
DCPG Risk Scorer
      |
  Controller (policy decision)
      |
  DAGPlanner (remediation plan)
      |
  SynthRewrite-T5    <- this model
      |
  Rewritten safe clinical note
```

The risk score and trigger reason that condition this model come directly from the upstream controller and DCPG scorer. The version integer increments when FedCRDT-Distill or the controller flags a pseudonym rotation event.
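
The version handoff described above can be pictured as a small piece of per-patient state that the caller maintains and passes into each rewrite. The sketch below is a simplified illustration, not the controller's or FedCRDT-Distill's actual logic: `PseudonymState` and `observe` are hypothetical names, and treating both non-`none` triggers as rotation events is an assumption based on the trigger list in this card.

```python
from dataclasses import dataclass

# Assumption: both documented non-"none" triggers cause a pseudonym rotation.
ROTATION_TRIGGERS = {"cross_modal_linkage", "exposure_accumulation"}


@dataclass
class PseudonymState:
    """Tracks the current pseudonym version for one patient."""

    version: int = 0

    def observe(self, trigger: str) -> int:
        """Increment the version on a rotation trigger and return the version to condition on."""
        if trigger in ROTATION_TRIGGERS:
            self.version += 1
        return self.version


state = PseudonymState()
print(state.observe("none"))                 # 0
print(state.observe("cross_modal_linkage"))  # 1
print(state.observe("none"))                 # 1
```

The returned integer is what would be placed in the `[version=INT]` token of the prompt for the next rewrite.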

---

## Related

- [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): full system
- [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): produces the risk score input
- [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): policy enforcement layer upstream
- [exposureguard-dcpg-encoder](https://huggingface.co/vkatg/exposureguard-dcpg-encoder): graph encoder upstream
- [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): federated risk and retokenization trigger
- [exposureguard-dagplanner](https://huggingface.co/vkatg/exposureguard-dagplanner): remediation planner upstream
- [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): benchmark dataset
- [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): PHI masking dataset with span annotations

---

## Citation

```bibtex
@software{exposureguard_synthrewrite,
  title  = {ExposureGuard-SynthRewrite-T5: Stateful PHI-Aware Clinical Note Rewriter},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-synthrewrite-t5},
  note   = {US Provisional Patent filed 2025-07-05}
}
```

Trained on fully synthetic data. Not validated for clinical use.