Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.
What if we could expose deepfakes the way chemistry exposes impurities?
AB + C → AC + B
AB = video frame (fake face B hidden inside context A)
C = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B = fake face displaced/exposed → HIGH energy = FAKE
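One way to make the analogy concrete is to read "energy" as the cosine distance between an anchor embedding and the embeddings of later frames. The sketch below is an illustration under assumptions, not the VIPER implementation: `displacement_energy`, the 128-dim random embeddings, and the mid-clip identity change in the toy data are all hypothetical. A real pipeline would feed per-frame face-recognition embeddings instead.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def displacement_energy(frame_embeddings, anchor_frames=8):
    """Cosine distance of each later frame from an anchor built on the first frames.

    Hypothetical helper: the anchor is the normalized mean of the first
    `anchor_frames` embeddings (8 here, matching the text above).
    """
    emb = l2_normalize(np.asarray(frame_embeddings, dtype=np.float64))
    anchor = l2_normalize(emb[:anchor_frames].mean(axis=0))
    later = emb[anchor_frames:]
    return 1.0 - later @ anchor

# Toy data only: 16 frames of 128-dim embeddings.
rng = np.random.default_rng(0)
identity = rng.normal(size=128)
consistent = identity + 0.05 * rng.normal(size=(16, 128))

# Same clip, but the last 8 frames carry a different identity.
swapped = consistent.copy()
swapped[8:] = rng.normal(size=128) + 0.05 * rng.normal(size=(8, 128))

low_energy = displacement_energy(consistent).mean()   # anchor "bonds" with the frames
high_energy = displacement_energy(swapped).mean()     # mismatching face is "displaced"
print(f"consistent: {low_energy:.4f}  swapped: {high_energy:.4f}")
```

In this toy setup the consistent clip stays near zero energy while the swapped clip lands far from the anchor, which is the LOW/HIGH split the reaction analogy describes.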
| Metric | Value |
|---|---|
| AUC-ROC | 0.9909 |
| Accuracy | 95.2% |
| Fake Recall | 96.5% |
| False Positive Rate | 6.3% |
| Face-swap AUC | 0.9931 |
| Expression-swap AUC | 0.9847 |
| Inference speed | ~4s/video (GPU) |
| Training time | 25 min (T4) |
| Training data | 530 videos |
| Attack Type | AUC | Accuracy | N (test) |
|---|---|---|---|
| Face swap (inswapper) | 0.9931 | 95.6% | 42 |
| Expression transfer (NeuralTextures) | 0.9847 | 93.7% | 15 |
| All combined | 0.9909 | 95.2% | 105 |
| Version | Backbone | Trainable Params | Test AUC |
|---|---|---|---|
| v1 | EfficientNet-B4 (frozen) | ~500K | 0.9072 |
| v2 | EfficientNet-B4 (unfrozen) | ~2.3M | 0.9309 |
| v3 | CLIP ViT-L/14 (frozen) | ~500K | 0.9909 |
Video → InsightFace → 16 face crops (224×224)
  │
  ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
  │
  └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
  │
  ▼
Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE
Key design: the CLIP backbone is entirely frozen and only the ~500K-parameter fusion MLP is trained. This is what enables a 0.99 AUC from just 530 videos.
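The ~500K figure can be checked by hand. The sketch below counts learnable parameters for the head as laid out in the loading snippet further down (Linear and BatchNorm1d layers, 784 → 512 → 128 → 1); the helper names are illustrative, and the count assumes the released checkpoint uses exactly that layout.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n):
    """Learnable scale and shift of a 1-D batch norm layer."""
    return 2 * n

clip_dim, hand_dim = 768, 16
fused_dim = clip_dim + hand_dim  # 784 inputs to the fusion MLP

layers = [
    ("Linear 784->512", linear_params(fused_dim, 512)),
    ("BatchNorm1d 512", batchnorm_params(512)),
    ("Linear 512->128", linear_params(512, 128)),
    ("BatchNorm1d 128", batchnorm_params(128)),
    ("Linear 128->1", linear_params(128, 1)),
]

for name, count in layers:
    print(f"{name:<18}{count:>8,}")

total = sum(count for _, count in layers)
print(f"{'Total':<18}{total:>8,}")
```

The total comes to 468,993, consistent with the rounded ~500K quoted for the frozen-backbone versions.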
| Signal | Method | Captures |
|---|---|---|
| GIR | ArcFace cosine distance | Skull geometry, eye spacing |
| TFR | DCT KL divergence | Skin micro-texture |
| BCR | dlib landmark coupling | Facial muscle dynamics |
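To give a feel for what a "DCT KL divergence" texture signal could look like, here is a minimal NumPy sketch. It is an assumption-laden illustration rather than the VIPER feature extractor: `dct_matrix`, `dct_spectrum`, and `kl_divergence` are hypothetical helpers, the patches are random noise instead of skin crops, and the real TFR feature may bin or weight frequencies differently.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def dct_spectrum(patch, eps=1e-8):
    """Normalized 2-D DCT energy of a grayscale patch, as a probability vector."""
    h, w = patch.shape
    coeffs = dct_matrix(h) @ patch @ dct_matrix(w).T
    energy = coeffs ** 2
    energy[0, 0] = 0.0             # drop the DC term so brightness does not dominate
    energy = energy.ravel() + eps  # avoid log(0) in the divergence
    return energy / energy.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return float(np.sum(p * np.log(p / q)))

# Toy patches: a noisy "texture" and a smoothed copy with less high-frequency detail.
rng = np.random.default_rng(0)
textured = rng.normal(size=(32, 32))
smooth = (textured + np.roll(textured, 1, axis=0) + np.roll(textured, 1, axis=1)) / 3.0

p, q = dct_spectrum(textured), dct_spectrum(smooth)
tfr_same = kl_divergence(p, p)   # identical texture
tfr_diff = kl_divergence(p, q)   # texture lost to smoothing
print(f"same: {tfr_same:.4f}  smoothed: {tfr_diff:.4f}")
```

The divergence is zero for identical spectra and grows as high-frequency detail is lost, which is the kind of change a texture signal for skin micro-texture would need to pick up.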
| | Predicted Real | Predicted Fake |
|---|---|---|
| Actual Real | 45 | 3 |
| Actual Fake | 2 | 55 |
Only 5 errors out of 105 test videos.
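The headline metrics follow directly from these four counts. The short check below treats "fake" as the positive class, which is an assumption consistent with the Fake Recall and False Positive Rate rows of the results table.

```python
# Confusion matrix counts from the test set (fake = positive class).
tn, fp = 45, 3    # actual real: predicted real, predicted fake
fn, tp = 2, 55    # actual fake: predicted real, predicted fake

total = tn + fp + fn + tp
errors = fp + fn

accuracy = (tp + tn) / total       # share of all videos classified correctly
fake_recall = tp / (tp + fn)       # share of fakes that were caught
false_positive_rate = fp / (fp + tn)  # share of real videos flagged as fake

print(f"total={total} errors={errors}")
print(f"accuracy={accuracy:.4f}")
print(f"fake_recall={fake_recall:.4f}")
print(f"false_positive_rate={false_positive_rate:.4f}")
```

The false positive rate is exactly 3/48 = 6.25%, which the results table reports rounded as 6.3%.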
import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        # Freeze the CLIP backbone; only the head is trainable
        for p in self.clip.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
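The snippet above defines the head but does not show a forward pass, so the following is a sketch of how the stated shapes could fit together. Several things here are assumptions rather than documented behavior: per-frame embeddings are mean-pooled into the 768-dim video embedding, the flip TTA averages logits (it could equally average probabilities), and `DummyEncoder`, `video_logit`, and `predict_fake_probability` are hypothetical names. The dummy encoder only stands in for `clip_model.visual` so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Hypothetical stand-in for clip_model.visual: (N, 3, 224, 224) -> (N, 768)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, x):
        return self.proj(x.mean(dim=(2, 3)))  # average over height and width

def video_logit(encoder, head, crops, hand_feats):
    """crops: (B, T, 3, H, W), hand_feats: (B, 16) -> logits of shape (B,)."""
    b, t = crops.shape[:2]
    frame_emb = encoder(crops.flatten(0, 1))            # (B*T, 768)
    video_emb = frame_emb.reshape(b, t, -1).mean(dim=1)  # (B, 768), assumed mean pooling
    fused = torch.cat([video_emb, hand_feats], dim=1)    # (B, 784)
    return head(fused).squeeze(1)

@torch.no_grad()
def predict_fake_probability(encoder, head, crops, hand_feats):
    """Horizontal-flip TTA: average the logits of original and mirrored crops."""
    logits = video_logit(encoder, head, crops, hand_feats)
    flipped = video_logit(encoder, head, crops.flip(-1), hand_feats)
    return torch.sigmoid((logits + flipped) / 2)

# Same head layout as the model definition above, randomly initialized here.
head = nn.Sequential(
    nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 1)).eval()
encoder = DummyEncoder().eval()

crops = torch.rand(1, 16, 3, 224, 224)  # placeholder face crops
hand_feats = torch.rand(1, 16)          # placeholder GIR/TFR/BCR features
p_fake = predict_fake_probability(encoder, head, crops, hand_feats)
print(p_fake.shape)
```

With the real model, `encoder` would be `model.clip` and `head` would be `model.head`, provided the checkpoint was trained with the same pooling and fusion order.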
| Category | Count | Source | License |
|---|---|---|---|
| Real | 250 | RTFS-10K | CC-BY-SA-4.0 |
| Face swap | 220 | RTFS-10K (inswapper) | CC-BY-SA-4.0 |
| Expression swap | 60 | FaceForensics++ | Academic |
| Full-body GAN | 50 | FakeParts | CC0-1.0 |
| Total | 580 | | |
| Usable | 530 | 91.4% success | |
| Parameter | Value |
|---|---|
| Backbone | CLIP ViT-L/14 (OpenAI, frozen) |
| Classifier | MLP 784 → 512 → 128 → 1 |
| Optimizer | AdamW (lr=3e-4, wd=1e-3) |
| Scheduler | Cosine annealing, 15 epochs |
| Batch size | 8 |
| Loss | BCE with pos_weight=0.758 |
| TTA | Horizontal flip average |
| Hardware | NVIDIA T4 (16GB) |
| Training time | ~25 minutes |
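The table maps fairly directly onto standard PyTorch components. The sketch below wires them together on random data. It assumes that "BCE" means `BCEWithLogitsLoss` (the model outputs a logit), that the cosine schedule is stepped once per epoch with `T_max=15`, and that the 784-dim fused features are precomputed. The actual training script may differ on all three points.

```python
import torch
import torch.nn as nn

# Fusion head with the layout used in the model definition (only trainable part).
head = nn.Sequential(
    nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 1))

epochs, batch_size = 15, 8
optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.758]))

# Placeholder batch: random fused features and labels (1 = fake, 0 = real).
features = torch.randn(batch_size, 784)
labels = torch.randint(0, 2, (batch_size,)).float()

head.train()
for epoch in range(epochs):
    optimizer.zero_grad()
    loss = criterion(head(features).squeeze(1), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # assumed: one scheduler step per epoch

final_lr = scheduler.get_last_lr()[0]
print(f"final loss={loss.item():.4f}  final lr={final_lr:.2e}")
```

A `pos_weight` below 1 down-weights the positive (fake) class in the loss, which is the usual choice when positives outnumber negatives in the training split.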
@misc{singh2025viper,
title = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
author = {Singh, Robin},
year = {2025},
url = {https://github.com/rxbinsingh/VIPER}
}
Robin Singh · Bennett University, India