VIPER: Video Identity Perturbation and Extraction Residual

Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.

Core Idea

What if we could expose deepfakes the way chemistry exposes impurities?

Displacement Reaction

AB + C → AC + B

AB = video frame (fake face B hidden inside context A)
C  = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B  = fake face displaced/exposed   → HIGH energy = FAKE
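The displacement analogy can be made concrete with a small "energy" score. This is a minimal sketch under stated assumptions: each frame already has an identity embedding (for example from ArcFace, which the GIR signal below uses), the anchor C is the mean normalized embedding of the first 8 frames, and energy is the mean cosine distance of the remaining frames to that anchor. The function name `anchor_energy` and this aggregation are hypothetical illustrations, not VIPER's published formula.

```python
import numpy as np

def anchor_energy(frame_embeddings: np.ndarray, n_anchor: int = 8) -> float:
    """Hypothetical identity-anchor energy: mean cosine distance of later frames
    to an anchor built from the first `n_anchor` frames (higher = more suspicious)."""
    emb = np.asarray(frame_embeddings, dtype=np.float64)
    if emb.ndim != 2 or emb.shape[0] <= n_anchor:
        raise ValueError("need a (num_frames, dim) array with more than n_anchor frames")
    # L2-normalize each frame embedding so dot products are cosine similarities
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Anchor C: mean of the first n_anchor embeddings, re-normalized
    anchor = emb[:n_anchor].mean(axis=0)
    anchor /= np.linalg.norm(anchor)
    # Cosine distance of each remaining frame to the anchor
    distances = 1.0 - emb[n_anchor:] @ anchor
    return float(distances.mean())

rng = np.random.default_rng(0)
identity = rng.normal(size=512)
stable = identity + 0.05 * rng.normal(size=(16, 512))  # same identity in all 16 frames
swapped = stable.copy()
swapped[8:] = rng.normal(size=(8, 512))                # identity changes after the anchor frames
print(anchor_energy(stable) < anchor_energy(swapped))  # True
```

A consistent identity keeps the energy near zero, while frames whose identity drifts away from the anchor push it toward 1.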

Results

Metric                Value
AUC-ROC               0.9909
Accuracy              95.2%
Fake recall           96.5%
False positive rate   6.3%
Face-swap AUC         0.9931
Expression-swap AUC   0.9847
Inference speed       ~4 s/video (GPU)
Training time         25 min (T4)
Training data         530 videos

Per-Manipulation-Type Detection

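Attack type                            AUC      Accuracy   N (test)
Face swap (inswapper)                  0.9931   95.6%      42
Expression transfer (NeuralTextures)   0.9847   93.7%      15
All combined                           0.9909   95.2%      105

For the per-type rows, N is the number of fake test videos of that type (42 + 15 = 57 fakes); the combined row also includes the 48 real test videos.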

Model Progression

Version   Backbone                      Trainable params   Test AUC
v1        EfficientNet-B4 (frozen)      ~500K              0.9072
v2        EfficientNet-B4 (unfrozen)    ~2.3M              0.9309
v3        CLIP ViT-L/14 (frozen)        ~500K              0.9909

Architecture

Video → InsightFace → 16 face crops (224×224)
         │
         ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
         │
         └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
                   │
                   ▼
         Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE

Key design: the CLIP backbone is entirely frozen, so only the ~500K-parameter fusion MLP is trained. Keeping the trainable part this small is what lets the model reach 0.99 test AUC from just 530 videos.
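The ~500K figure can be checked by rebuilding the fusion head with the same layer stack as `VIPERv3.head` in the Usage section (784 = 768 CLIP dimensions + 16 biometric features) and counting its parameters; the frozen CLIP tower is excluded. The helper name `build_head` exists only for this illustration.

```python
import torch.nn as nn

def build_head(dropout: float = 0.4) -> nn.Sequential:
    # Same layers as VIPERv3.head in the Usage section
    return nn.Sequential(
        nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
        nn.Linear(128, 1),
    )

head = build_head()
n_trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(n_trainable)  # 468993
```

That is about 0.47M parameters. BatchNorm running statistics are buffers rather than parameters, so they are not part of the count.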

Three Biometric Signals

Signal   Method                     Captures
GIR      ArcFace cosine distance    Skull geometry, eye spacing
TFR      DCT KL divergence          Skin micro-texture
BCR      dlib landmark coupling     Facial muscle dynamics
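As one possible reading of the TFR row, the sketch below compares the distribution of DCT coefficient magnitudes of a face crop against that of an anchor crop using KL divergence. The log-magnitude histogram, the bin count and range, the [0, 1] input scaling, and the function names `dct_histogram` and `texture_kl` are assumptions made for illustration; the table does not specify VIPER's exact binning or frequency band.

```python
import numpy as np
from scipy.fft import dctn
from scipy.stats import entropy

def dct_histogram(gray: np.ndarray, bins: int = 32, eps: float = 1e-8) -> np.ndarray:
    """Normalized histogram of log DCT-II magnitudes of a 2-D grayscale crop in [0, 1]."""
    coeffs = dctn(gray.astype(np.float64), type=2, norm="ortho")
    log_mag = np.log(np.abs(coeffs) + eps)
    # Fixed range so histograms of different crops share the same bins
    hist, _ = np.histogram(log_mag, bins=bins, range=(-12.0, 12.0))
    hist = hist.astype(np.float64) + eps  # avoid empty bins, which would make KL infinite
    return hist / hist.sum()

def texture_kl(anchor_crop: np.ndarray, frame_crop: np.ndarray) -> float:
    """Hypothetical TFR-style score: KL(anchor || frame) between DCT magnitude histograms."""
    return float(entropy(dct_histogram(anchor_crop), dct_histogram(frame_crop)))

rng = np.random.default_rng(0)
smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))         # texture-free gradient
noisy = np.clip(smooth + 0.2 * rng.normal(size=smooth.shape), 0.0, 1.0)  # added high-frequency texture
print(texture_kl(smooth, smooth))  # identical crops: zero divergence
print(texture_kl(smooth, noisy))   # different texture statistics: positive divergence
```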

Confusion Matrix

                 Predicted Real    Predicted Fake
Actual Real           45                3
Actual Fake            2               55

Only 5 errors out of 105 test videos.
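The headline metrics in the Results table follow directly from these four counts, treating "fake" as the positive class. The helper `binary_metrics` below is a plain illustration of that arithmetic, not part of the VIPER code.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Standard metrics from a 2x2 confusion matrix, with 'fake' as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "fake_recall": tp / (tp + fn),          # share of fake videos caught
        "false_positive_rate": fp / (fp + tn),  # share of real videos flagged as fake
        "errors": fp + fn,
    }

# Counts from the confusion matrix above (real = negative, fake = positive)
m = binary_metrics(tn=45, fp=3, fn=2, tp=55)
print(m)
```

This gives 100/105 ≈ 95.2% accuracy, 55/57 ≈ 96.5% fake recall and 3/48 = 6.25% false-positive rate, matching the Results table after rounding.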


Usage

import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the v3 checkpoint from the Hugging Face Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP ViT-L/14 with OpenAI weights; `preprocess` is the matching eval transform for face crops
clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual tower + trainable fusion MLP head
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        # Freeze the backbone; only the head below is trained
        for p in self.clip.parameters():
            p.requires_grad = False
        # 784 = 768-dim CLIP video embedding + 16 biometric features (GIR/TFR/BCR)
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout * 0.5),
            nn.Linear(128, 1))
    # Note: the forward() method is not shown in this snippet.

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)
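The snippet above stops short of a forward pass. Below is a minimal sketch of how the pieces in the Architecture diagram could be wired together at inference time. It assumes, without confirmation from this card, that the 16 per-frame CLIP embeddings are mean-pooled into the 768-dim video embedding, that TTA averages the fake probabilities of the original and horizontally flipped crops, and that the crops are already CLIP-preprocessed. `predict_fake_prob` is a hypothetical helper name, and the stand-in modules at the end exist only to show the tensor shapes.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_fake_prob(clip_visual: nn.Module, head: nn.Module,
                      crops: torch.Tensor, hand_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical inference: crops (B, T, 3, H, W), hand_feats (B, 16) -> P(fake), shape (B,)."""
    clip_visual.eval()
    head.eval()
    b, t = crops.shape[:2]

    def score(frames: torch.Tensor) -> torch.Tensor:
        # Encode every frame with the frozen CLIP tower: (B*T, 3, H, W) -> (B*T, 768)
        emb = clip_visual(frames.flatten(0, 1))
        # Assumed pooling: average the T frame embeddings into one video embedding
        video_emb = emb.reshape(b, t, -1).mean(dim=1)
        # Fuse with the 16 biometric features (768 + 16 = 784) and classify
        logits = head(torch.cat([video_emb, hand_feats], dim=1)).squeeze(1)
        return torch.sigmoid(logits)

    # Assumed TTA: average probabilities on original and horizontally flipped crops (width axis)
    return 0.5 * (score(crops) + score(torch.flip(crops, dims=[-1])))

# Smoke test with stand-in modules (NOT the real CLIP tower or the trained head)
class _StandInVisual(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3, 768)

    def forward(self, x):
        return self.proj(x.mean(dim=(2, 3)))  # (N, 3, H, W) -> (N, 768)

probs = predict_fake_prob(_StandInVisual(), nn.Linear(784, 1),
                          torch.randn(2, 16, 3, 32, 32), torch.randn(2, 16))
print(probs.shape)  # torch.Size([2])
```

With the objects from the Usage snippet, the call would be `predict_fake_prob(model.clip, model.head, crops, hand_feats)`, with each crop passed through the CLIP eval transform first.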

Training Dataset

Category          Count   Source                 License
Real              250     RTFS-10K               CC-BY-SA-4.0
Face swap         220     RTFS-10K (inswapper)   CC-BY-SA-4.0
Expression swap   60      FaceForensics++        Academic
Full-body GAN     50      FakeParts              CC0-1.0
Total             580
Usable            530     (91.4% success rate)

Training Configuration

Parameter       Value
Backbone        CLIP ViT-L/14 (OpenAI, frozen)
Classifier      MLP 784 → 512 → 128 → 1
Optimizer       AdamW (lr=3e-4, weight decay=1e-3)
Scheduler       Cosine annealing, 15 epochs
Batch size      8
Loss            BCE with pos_weight=0.758
TTA             Horizontal-flip averaging
Hardware        NVIDIA T4 (16 GB)
Training time   ~25 minutes
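The loss weight appears to match the raw class counts in the dataset table: 250 real videos against 220 + 60 + 50 = 330 fake videos gives pos_weight = 250 / 330 ≈ 0.758, which slightly down-weights the more numerous fake (positive) class. A minimal PyTorch sketch of the listed configuration follows; treating fake as label 1, stepping the cosine schedule once per epoch (T_max=15), and the `make_training_setup` name are assumptions.

```python
import torch
import torch.nn as nn

def make_training_setup(head: nn.Module, n_real: int = 250, n_fake: int = 330, epochs: int = 15):
    # pos_weight scales the loss on positive (fake) samples: n_negative / n_positive
    pos_weight = torch.tensor([n_real / n_fake])
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    # Only the fusion head is optimized; the CLIP backbone stays frozen
    optimizer = torch.optim.AdamW(head.parameters(), lr=3e-4, weight_decay=1e-3)
    # Cosine annealing over the 15 training epochs
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler

criterion, optimizer, scheduler = make_training_setup(nn.Linear(784, 1))
print(round(criterion.pos_weight.item(), 3))  # 0.758
```

In a training loop, `scheduler.step()` would be called once at the end of each epoch, after the per-batch `optimizer.step()` calls.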

Limitations

  • Full-body GAN videos are not detectable, because face detection fails on them
  • The analytical signals (GIR/TFR/BCR) are individually weak on modern fakes
  • Evaluated on only 105 test videos; larger benchmarks are pending
  • Not tested against adversarial attacks targeting CLIP

Citation

@misc{singh2025viper,
  title   = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author  = {Singh, Robin},
  year    = {2025},
  url     = {https://github.com/rxbinsingh/VIPER}
}

Author

Robin Singh · Bennett University, India
