---
language:
- en
license: mit
tags:
- healthcare
- privacy
- phi
- de-identification
- clinical-nlp
- text-generation
- rewriting
- stateful
- multimodal
- streaming
- pytorch
- t5
- hipaa
- ehr
- re-identification
pipeline_tag: text-generation
base_model: t5-base
datasets:
- vkatg/streaming-phi-deidentification-benchmark
- vkatg/multimodal-phi-masking-benchmark
model-index:
- name: ExposureGuard-SynthRewrite-T5
  results:
  - task:
      type: text-generation
    metrics:
    - type: val_loss
      value: 0.1183
      name: Validation Loss (cross-entropy)
    - type: format_accuracy_normalized
      value: 1.0
      name: Format Accuracy (post-normalization)
    - type: format_accuracy_raw
      value: 1.0
      name: Format Accuracy (raw model output)
---

# ExposureGuard-SynthRewrite-T5

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)

Redacting PHI preserves privacy but destroys the clinical note. A note full of `[REDACTED]` tokens is useless for downstream processing, audit review, or secondary analysis. This model takes a different approach: it rewrites the note into safe, readable text where real PHI is replaced with consistent pseudonyms rather than blanks.

The version suffix on each placeholder is the key behavior. When cross-modal linkage triggers a pseudonym rotation, the version counter increments. Any system trying to join records across version boundaries hits a deliberate discontinuity: prior and future records cannot be linked through the pseudonym chain.

---

## What it does

Input: a clinical note plus exposure context (risk score, modality, pseudonym version, trigger reason).

Output: a fully rewritten note with real PHI replaced by version-stamped pseudonyms.

```
Input:
rewrite: [risk=0.72] [modality=asr] [version=1] [trigger=cross_modal_linkage] note: James Smith is a 67-year-old patient presenting with chest pain. DOB 03/22/1955. Contact: 555-0142. MRN: MRN482910.

Output:
Robert Taylor is a 67-year-old patient presenting with chest pain. DOB [DOB-V1]. Contact: [PHONE-V1]. MRN: [MRN-V1].
```

Version 0 means no trigger has fired. Version 1 means pseudonyms rotated once. The version suffix in every placeholder makes the rotation point explicit and auditable.

---

## Usage

```python
from inference import load, rewrite

# Load the model and tokenizer from the current directory
model, tok = load(".")

result = rewrite(
    note="James Smith, MRN 482910, admitted with chest pain. DOB 03/22/1955.",
    risk=0.72,
    modality="asr",
    version=1,
    trigger_reason="cross_modal_linkage",
    model=model,
    tokenizer=tok,
)
print(result)
```

Or directly with transformers:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

tok = T5Tokenizer.from_pretrained("vkatg/exposureguard-synthrewrite-t5")
model = T5ForConditionalGeneration.from_pretrained("vkatg/exposureguard-synthrewrite-t5")
model.eval()

note = "James Smith, MRN482910, admitted with chest pain. DOB 03/22/1955."

# Conditioning prefix followed by the note (see "Input format")
prompt = (
    f"rewrite: [risk=0.72] [modality=asr] [version=1] "
    f"[trigger=cross_modal_linkage] note: {note}"
)

ids = tok(prompt, return_tensors="pt", max_length=256, truncation=True)

# Beam search decoding without gradient tracking
with torch.no_grad():
    out = model.generate(
        ids["input_ids"],
        attention_mask=ids["attention_mask"],
        max_new_tokens=200,
        num_beams=4,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Input format

```
rewrite: [risk=FLOAT] [modality=MODALITY] [version=INT] [trigger=REASON] note: NOTE_TEXT
```

| Token | Values |
|---|---|
| `risk` | 0.0 to 1.0, cumulative re-identification risk from upstream scorer |
| `modality` | text, asr, image_proxy, waveform_proxy, audio_proxy |
| `version` | 0 = no trigger fired, 1+ = number of pseudonym rotations |
| `trigger` | none, cross_modal_linkage, exposure_accumulation |

---

## PHI handling

| PHI type | Replacement |
|---|---|
| Patient name | Consistent pseudonym, version-aware |
| DOB | `[DOB-V{version}]` |
| MRN | `[MRN-V{version}]` |
| Phone | `[PHONE-V{version}]` |
| Address | `[ADDR-V{version}]` |

---

## Training

Fine-tuned from `t5-base`
(223M parameters) on 12,000 synthetic clinical note pairs, with the bracket tokens added as special tokens. Each pair was generated with injected fake PHI and a matching rewrite. Trigger scenarios (`cross_modal_linkage`, `exposure_accumulation`) were oversampled so the model learns version-bump behavior reliably. Trained on an Apple M4 (MPS backend).

| Metric | Value |
|---|---|
| Base model | t5-base |
| Parameters | 222,894,336 |
| Training pairs | 12,000 |
| Epochs | 10 |
| Best val loss | 0.1183 |
| Format accuracy (raw) | 100% |
| Format accuracy (normalized) | 100% |
| Special tokens added | 16 (`[DOB-V0]`–`[ADDR-V3]`) |
| Trigger distribution | none: 49%, cross_modal_linkage: 38%, exposure_accumulation: 13% |

Validation loss by epoch: 0.2121 → 0.1521 → 0.1480 → 0.1438 → 0.1392 → 0.1330 → 0.1283 → 0.1226 → 0.1195 → 0.1183. The loss improved at every epoch, so the best checkpoint is the final one.

Architecture: encoder-decoder with 12 encoder and 12 decoder layers, 768 hidden dim, 12 attention heads, 3072 FFN dim.

---

## Where it fits

```
DCPG Risk Scorer
        |
Controller (policy decision)
        |
DAGPlanner (remediation plan)
        |
SynthRewrite-T5   <- this model
        |
Rewritten safe clinical note
```

The risk score and trigger reason that condition this model come directly from the upstream controller and DCPG scorer. The version integer increments when FedCRDT-Distill or the controller flags a pseudonym rotation event.
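
The hand-off from the upstream stages to this model can be sketched as follows. This is a minimal illustration, not the released code: `ExposureContext`, `bump_version`, `build_prompt`, and `placeholder_versions` are hypothetical names, the two-decimal risk formatting is an assumption based on the `0.72` example, and the allowed values are copied from the input format table.

```python
import re
from dataclasses import dataclass
from typing import Set

# Allowed values as listed in the input format table.
MODALITIES = {"text", "asr", "image_proxy", "waveform_proxy", "audio_proxy"}
TRIGGERS = {"none", "cross_modal_linkage", "exposure_accumulation"}

# Matches version-stamped placeholders such as [DOB-V1] or [ADDR-V0].
PLACEHOLDER = re.compile(r"\[(?:DOB|MRN|PHONE|ADDR)-V(\d+)\]")


@dataclass
class ExposureContext:
    """Hypothetical container for the signals produced upstream."""
    risk: float
    modality: str
    version: int = 0
    trigger: str = "none"


def bump_version(ctx: ExposureContext, rotation_flagged: bool) -> int:
    """Return the next pseudonym version; it changes only on a rotation event."""
    return ctx.version + 1 if rotation_flagged else ctx.version


def build_prompt(ctx: ExposureContext, note: str) -> str:
    """Validate the context and format it in the documented input format."""
    if not 0.0 <= ctx.risk <= 1.0:
        raise ValueError(f"risk must be in [0.0, 1.0], got {ctx.risk}")
    if ctx.modality not in MODALITIES:
        raise ValueError(f"unknown modality: {ctx.modality}")
    if ctx.trigger not in TRIGGERS:
        raise ValueError(f"unknown trigger: {ctx.trigger}")
    if ctx.version < 0:
        raise ValueError(f"version must be >= 0, got {ctx.version}")
    # Assumption: risk is rendered with two decimals, as in the 0.72 example.
    return (
        f"rewrite: [risk={ctx.risk:.2f}] [modality={ctx.modality}] "
        f"[version={ctx.version}] [trigger={ctx.trigger}] note: {note}"
    )


def placeholder_versions(text: str) -> Set[int]:
    """Collect the version numbers found in the placeholders of a rewritten note."""
    return {int(v) for v in PLACEHOLDER.findall(text)}


if __name__ == "__main__":
    ctx = ExposureContext(risk=0.72, modality="asr", trigger="cross_modal_linkage")
    ctx.version = bump_version(ctx, rotation_flagged=True)
    print(build_prompt(ctx, "James Smith, MRN482910, admitted with chest pain."))
    # Audit check on a rewritten note: every placeholder should carry ctx.version.
    rewritten = "DOB [DOB-V1]. Contact: [PHONE-V1]. MRN: [MRN-V1]."
    print(placeholder_versions(rewritten) == {ctx.version})
```

A check like `placeholder_versions` is one way to audit that a rewritten note carries a single, expected version. Only the `V0` to `V3` placeholders are listed as added special tokens, so behavior for higher version numbers is not documented here.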

---

## Related

- [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): full system
- [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): produces the risk score input
- [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): policy enforcement layer upstream
- [exposureguard-dcpg-encoder](https://huggingface.co/vkatg/exposureguard-dcpg-encoder): graph encoder upstream
- [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): federated risk and retokenization trigger
- [exposureguard-dagplanner](https://huggingface.co/vkatg/exposureguard-dagplanner): remediation planner upstream
- [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): benchmark dataset
- [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): PHI masking dataset with span annotations

---

## Citation

```bibtex
@software{exposureguard_synthrewrite,
  title  = {ExposureGuard-SynthRewrite-T5: Stateful PHI-Aware Clinical Note Rewriter},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-synthrewrite-t5},
  note   = {US Provisional Patent filed 2025-07-05}
}
```

Trained on fully synthetic data. Not validated for clinical use.