Healthcare PII Detection Model
Model Description
This is a domain-specific Personally Identifiable Information (PII) detection model fine-tuned for healthcare applications. The model identifies and classifies sensitive information in healthcare text using token-level classification.
Base Model: bert-base-uncased (109M parameters)
Training Method: LoRA (Low-Rank Adaptation)
Task: Token Classification (Named Entity Recognition)
Entity Types: 51 types
Model Performance
| Metric | Score |
|---|---|
| Accuracy | 94.45% |
| F1 Score | 92.07% |
Training Dataset: NVIDIA Nemotron-PII (healthcare domain)
Training Samples: 1477
Training Epochs: 3
LoRA Rank: 8
Trainable Parameters: ~521K (0.48% of base model)
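The trainable-parameter figure above can be roughly reproduced by hand. The sketch below assumes the standard bert-base-uncased dimensions (12 layers, hidden size 768), LoRA adapters on the query, key, and value projections as listed under Training Configuration, and a fully trained classification head over 103 labels. That the head is trained in full is an assumption; this card does not state it.

```python
# Back-of-the-envelope count of trainable parameters (illustrative only).
HIDDEN = 768             # bert-base hidden size
LAYERS = 12              # bert-base encoder layers
RANK = 8                 # LoRA rank from this card
TARGETS_PER_LAYER = 3    # query, key, value
NUM_LABELS = 103         # BIO labels from this card
BASE_PARAMS = 109_000_000

# Each adapted 768x768 projection gets two low-rank matrices:
# A (rank x in_features) and B (out_features x rank).
lora_per_module = RANK * (HIDDEN + HIDDEN)
lora_total = lora_per_module * TARGETS_PER_LAYER * LAYERS

# Assumption: the linear classification head (weights + bias) is trained in full.
classifier = HIDDEN * NUM_LABELS + NUM_LABELS

trainable = lora_total + classifier
share = 100 * trainable / BASE_PARAMS

print(f"LoRA: {lora_total:,}  head: {classifier:,}  total: {trainable:,}")
print(f"Share of base model: {share:.2f}%")
```

Under these assumptions the total comes to 521,575 parameters, which matches the reported ~521K and 0.48%.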
Detected Entity Types
Common Healthcare PII Entities
- Names: first_name, last_name
- Identifiers: medical_record_number, patient_id, insurance_id
- Contact: email, phone_number, street_address, city, state, zipcode
- Personal: date_of_birth, age, gender, race_ethnicity
- Sensitive: blood_type, biometric_identifier, health_plan_beneficiary_number
- Location: coordinate, company_name, url
Total: 103 BIO labels across 51 entity types
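The label count follows from the BIO scheme: each entity type contributes a B- and an I- label, plus one shared O label, so 51 types give 2 × 51 + 1 = 103. A minimal sketch of that construction is shown below; the helper name and the label ordering are illustrative, and the model's actual `id2label` mapping may be ordered differently.

```python
def build_bio_labels(entity_types):
    """Return the BIO label list for the given entity types (hypothetical helper)."""
    labels = ["O"]  # single shared "outside" label
    for name in entity_types:
        labels.append(f"B-{name}")  # first token of an entity
        labels.append(f"I-{name}")  # continuation tokens
    return labels


# A small subset of the entity types listed above.
sample_types = ["first_name", "last_name", "medical_record_number"]
labels = build_bio_labels(sample_types)
label2id = {label: idx for idx, label in enumerate(labels)}

print(labels)
print(len(build_bio_labels([f"type_{i}" for i in range(51)])))  # 2 * 51 + 1
```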
Intended Use
Primary Use Cases
- Data Anonymization: Identify PII in healthcare documents for redaction
- Privacy Compliance: Detect sensitive data to support HIPAA/GDPR compliance workflows
- Data Classification: Classify and tag healthcare records by sensitivity
- Access Control: Implement role-based data masking
Out-of-Scope Use
- Not designed for general domain PII detection (optimized for healthcare)
- Not a complete anonymization solution (requires post-processing)
- May have lower accuracy on non-healthcare text
How to Use
Installation
```bash
pip install transformers torch
```
Basic Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "your-org/mom-pii-healthcare"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input (anything beyond 512 tokens is truncated)
text = "Patient John Smith, DOB: 03/15/1985, SSN: 123-45-6789"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Get predictions: one label id per token
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)[0]

# Extract entities by walking the BIO labels
entities = []
current_entity = None
for token, pred_id in zip(tokens, predictions):
    if token in ['[CLS]', '[SEP]', '[PAD]']:
        continue
    label = model.config.id2label[pred_id.item()]
    if label.startswith('B-'):
        # A new entity starts; close any open one first
        if current_entity:
            entities.append(current_entity)
        entity_type = label[2:]
        current_entity = {'type': entity_type, 'tokens': [token]}
    elif label.startswith('I-') and current_entity:
        # Continuation of the open entity
        current_entity['tokens'].append(token)
    else:
        # "O" label (or a stray I- tag): close any open entity
        if current_entity:
            entities.append(current_entity)
        current_entity = None
if current_entity:
    entities.append(current_entity)

# Clean up tokens: merge WordPiece pieces back into readable text
for entity in entities:
    entity['text'] = tokenizer.convert_tokens_to_string(entity['tokens'])
    print(f"{entity['type']}: {entity['text']}")
```
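Because the example truncates input at 512 tokens, longer clinical notes would lose everything past that limit. One possible workaround, sketched below, is to split the text into overlapping character windows and run the model on each window, shifting entity offsets by the window start. The helper name and the window sizes are hypothetical, and character counts are only a rough proxy for token counts.

```python
def split_into_windows(text, window_chars=1000, overlap_chars=100):
    """Split text into overlapping (start_offset, chunk) windows.

    Hypothetical helper: sizes are in characters, not tokens, so choose
    window_chars conservatively to stay under the 512-token limit.
    """
    if overlap_chars >= window_chars:
        raise ValueError("overlap_chars must be smaller than window_chars")
    windows = []
    step = window_chars - overlap_chars
    start = 0
    while start < len(text):
        windows.append((start, text[start:start + window_chars]))
        if start + window_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


long_note = "a" * 2500  # stand-in for a long clinical note
windows = split_into_windows(long_note)
for offset, chunk in windows:
    print(offset, len(chunk))
```

The overlap reduces the chance that an entity is cut in half at a window boundary; entities detected twice in the overlap region would still need de-duplication.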
Advanced Usage with Pipeline
```python
import torch
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/mom-pii-healthcare",
    aggregation_strategy="simple",  # merge B-/I- tokens into entity groups
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

results = ner("Patient John Smith, DOB: 03/15/1985, SSN: 123-45-6789")
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2f})")
```
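With an aggregation strategy set, each pipeline result also carries `start` and `end` character offsets, which makes redaction straightforward. The sketch below is one possible approach; `redact` is a hypothetical helper, and the sample entities are hand-written for illustration rather than actual model output.

```python
def redact(text, entities):
    """Replace each detected span with a [TYPE] placeholder (hypothetical helper)."""
    redacted = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


text = "Patient John Smith, DOB: 03/15/1985"
# Hand-written spans in the shape returned by the pipeline.
sample_entities = [
    {"entity_group": "first_name", "start": 8, "end": 12},
    {"entity_group": "last_name", "start": 13, "end": 18},
    {"entity_group": "date_of_birth", "start": 25, "end": 35},
]

redacted_text = redact(text, sample_entities)
print(redacted_text)
```

This assumes the spans do not overlap; overlapping spans would need to be merged before redaction.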
Training Details
Training Data
- Source: NVIDIA Nemotron-PII dataset
- Domain: Healthcare
- Total Dataset Size: 100,000+ samples
- Filtered for Domain: 1477 samples
- Entity Types: 51 types
- Labeling Scheme: BIO (Begin, Inside, Outside)
Training Configuration
- Base Model: bert-base-uncased
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 8
  - Alpha: 16
  - Dropout: 0.1
  - Target Modules: query, key, value
- Optimizer: AdamW
- Learning Rate: 3e-4
- Batch Size: 2 (per device)
- Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01
- FP16: Disabled (FP32 for stability)
- Training Time: ~65 seconds on single GPU
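A few derived quantities help put these hyperparameters in context. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of these are stated in this card. The LoRA scaling factor uses the standard alpha / rank definition.

```python
import math

# Values taken from this card.
TRAIN_SAMPLES = 1477
BATCH_SIZE = 2
EPOCHS = 3
WARMUP_STEPS = 100
LORA_RANK = 8
LORA_ALPHA = 16

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_share = WARMUP_STEPS / total_steps

# Standard LoRA scaling applied to the low-rank update.
lora_scaling = LORA_ALPHA / LORA_RANK

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share: {warmup_share:.1%}")
print(f"LoRA scaling (alpha / rank): {lora_scaling}")
```

Under these assumptions the run takes about 2,217 optimizer steps, so the 100 warmup steps cover roughly 4.5% of training.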
Framework Versions
- Transformers: 5.1.0
- PyTorch: 2.10.0+cu128
- PEFT: 0.18.1
- Datasets: 4.5.0
Limitations and Biases
Limitations
- Entity Coverage: Strong on names; weaker on structured identifiers (IDs, phone numbers)
- Format Sensitivity: May miss PII in unusual formats or with special characters
- Context Dependency: Performance varies with entity context and surrounding text
- Domain Specificity: Optimized for healthcare; may underperform on other domains
Known Performance Characteristics
- Strong: Person names (first_name, last_name)
- Moderate: Dates, locations, occupations
- Weak: Formatted numbers (phone, SSN), complex identifiers
Recommendations for Production
For structured PII (medical record numbers, credit card numbers, phone numbers), consider:
- Combining with regex-based detection
- Post-processing with rule-based validators
- Ensemble approach with multiple models
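As an illustration of the first recommendation, the sketch below adds regex matches for two fixed-format identifiers to a list of model detections. The patterns cover only simple US-style formats, the `ssn` and `phone_number` label names are illustrative, and the helper names are hypothetical; production patterns would need to be broader and validated.

```python
import re

# Illustrative patterns only; real-world formats vary widely.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def regex_entities(text):
    """Find fixed-format identifiers with regular expressions."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"entity_group": label, "start": match.start(),
                          "end": match.end(), "source": "regex"})
    return found


def merge_entities(model_entities, rule_entities):
    """Keep all model spans; add rule spans that overlap none of them."""
    merged = list(model_entities)
    for rule in rule_entities:
        overlaps = any(rule["start"] < ent["end"] and ent["start"] < rule["end"]
                       for ent in model_entities)
        if not overlaps:
            merged.append(rule)
    return sorted(merged, key=lambda e: e["start"])


text = "SSN: 123-45-6789, phone 555-123-4567"
# Hand-written stand-in for a model detection of the phone number.
phone_start = text.index("555")
model_entities = [{"entity_group": "phone_number", "start": phone_start,
                   "end": phone_start + 12, "source": "model"}]

merged = merge_entities(model_entities, regex_entities(text))
for ent in merged:
    print(ent["entity_group"], text[ent["start"]:ent["end"]], ent["source"])
```

Here the model's spans take precedence and the regex only fills gaps; the opposite priority may suit identifiers where the model is known to be weak.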
Ethical Considerations
Responsible Use
- This model is designed for privacy protection, not surveillance
- Should be used to protect individuals' sensitive information
- Recommended for data anonymization pipelines with human review
- Regular audits are recommended to detect and correct biased behavior
Privacy and Security
- Model itself does not store or transmit PII
- Detected entities should be handled according to data protection regulations
- Consider encryption and access controls for PII processing workflows
Citation
```bibtex
@misc{mom-pii-healthcare-2026,
  title={Healthcare PII Detection Model},
  author={Your Organization},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/your-org/mom-pii-healthcare}},
}
```
Model Card Authors
- Organization: Your Organization
- Contact: your-email@example.com
- Date: February 2026
License
Apache 2.0
Acknowledgements
- Base model: google-bert/bert-base-uncased
- Training data: NVIDIA Nemotron-PII dataset
- Training framework: HuggingFace Transformers, PEFT