---
language:
- en
license: mit
tags:
- text-classification
- phishing-detection
- email-security
- deberta-v3
- transformers
datasets:
- zefang-liu/phishing-email-dataset
metrics:
- accuracy
- f1
- precision
- recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
---

# Phishing Email Detector (DeBERTa-v3-large)

A DeBERTa-v3-large model fine-tuned for phishing email detection.

## Model Description

This model is based on `microsoft/deberta-v3-large` and fine-tuned to classify emails as either phishing or safe.

### 🔒 100% Recall Achieved

Setting the decision threshold to 0.0007 **detects 100% of phishing emails** in the evaluation results reported below, at the cost of more false positives.

## Performance

### Default Setting (Threshold 0.5)

| Metric | Value |
|--------|-------|
| Accuracy | 97.59% |
| F1-score | 96.99% |
| Precision | 95.01% |
| Recall | 99.04% |

### Maximum Security Setting (Threshold 0.0007): **100% Recall**

| Metric | Value |
|--------|-------|
| Accuracy | 95.23% |
| F1-score | 94.26% |
| Precision | 89.15% |
| **Recall** | **100.00%** |

## Usage

### Basic Usage (Default Threshold)

```python
from transformers import pipeline

# The pipeline returns the top-scoring label, which for two classes
# is equivalent to a 0.5 threshold.
classifier = pipeline(
    "text-classification",
    model="takumi123xxx/phishing-email-detector-deberta-v3",
)

result = classifier("Your email text here")
print(result)
```

### Maximum Security (100% Recall)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "takumi123xxx/phishing-email-detector-deberta-v3"
THRESHOLD = 0.0007  # For 100% Recall

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def detect_phishing(text):
    # Emails longer than 512 tokens are truncated to the training max length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    phishing_prob = probs[0][1].item()  # index 1 = Phishing Email
    is_phishing = phishing_prob >= THRESHOLD
    return {
        "is_phishing": is_phishing,
        "phishing_probability": phishing_prob,
        "label": "Phishing Email" if is_phishing else "Safe Email",
    }


# Example
result = detect_phishing(
    "Congratulations! You've won $1,000,000. Click here to claim your prize!"
)
print(result)
```

## Training Details

- **Base Model**: microsoft/deberta-v3-large
- **Dataset**: [zefang-liu/phishing-email-dataset](https://huggingface.co/datasets/zefang-liu/phishing-email-dataset)
- **Training Samples**: 14,904
- **Validation Samples**: 1,863
- **Test Samples**: 1,864
- **Epochs**: 2.15 (Early Stopping)
- **Batch Size**: 16
- **Learning Rate**: 2e-5
- **Max Length**: 512

## Labels

- `0`: Safe Email
- `1`: Phishing Email

## Threshold Recommendation

| Use Case | Threshold | Recall | False Positives |
|----------|-----------|--------|-----------------|
| Balanced | 0.5 | 99.04% | 38 |
| High Security | 0.0007 | 100.00% | 89 |

## Limitations

- Trained on English emails only
- May not detect novel phishing techniques not present in training data
- False positives increase when using lower thresholds

## License

MIT License
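
## Appendix: Checking a Threshold on Your Own Data

The trade-off in the Threshold Recommendation table (lower threshold, higher recall, more false positives) can be checked against your own labeled emails. The sketch below is not part of the model repository: `metrics_at_threshold` and `full_recall_threshold` are hypothetical helper names, and the probabilities in the example are invented toy values, not model outputs. It assumes you have already collected a phishing probability (class `1`) for each email, for instance with a function like `detect_phishing`.

```python
from typing import Sequence


def metrics_at_threshold(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> dict:
    """Count TP/FP/FN/TN when every email with prob >= threshold is flagged."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    tp = fp = fn = tn = 0
    for prob, label in zip(probs, labels):
        flagged = prob >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
    }


def full_recall_threshold(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Largest threshold that still flags every phishing email (label 1)."""
    positives = [prob for prob, label in zip(probs, labels) if label == 1]
    if not positives:
        raise ValueError("no phishing examples in labels")
    return min(positives)


if __name__ == "__main__":
    # Toy values for illustration only (hypothetical, not real model scores).
    toy_probs = [0.98, 0.91, 0.002, 0.40, 0.03, 0.0001]
    toy_labels = [1, 1, 1, 0, 0, 0]

    print(metrics_at_threshold(toy_probs, toy_labels, 0.5))
    best = full_recall_threshold(toy_probs, toy_labels)
    print(best, metrics_at_threshold(toy_probs, toy_labels, best))
```

A threshold found this way fits the data it was computed on, so it is generally safer to pick it on a validation split and then confirm recall on a separate held-out set.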