VIPER: Video Identity Perturbation and Extraction Residual

Deepfake detection inspired by displacement reactions in chemistry.
A stronger identity signal displaces and exposes synthetic faces.

Core Idea

What if we could expose deepfakes the way chemistry exposes impurities?

AB + C → AC + B

AB = video frame (fake face B hidden inside context A)
C  = identity anchor (biometric fingerprint from first 8 frames)
AC = anchor bonds with real context → LOW energy = REAL
B  = fake face displaced/exposed   → HIGH energy = FAKE

Results

Metric	Value
AUC-ROC	0.9909
Accuracy	95.2%
Fake Recall	96.5%
False Positive Rate	6.3%
Face-swap AUC	0.9931
Expression-swap AUC	0.9847
Inference speed	~4s/video (GPU)
Training time	25 min (T4)
Training data	530 videos

Per-Manipulation-Type Detection

Attack Type	AUC	Accuracy	N (test)
Face swap (inswapper)	0.9931	95.6%	42
Expression transfer (NeuralTextures)	0.9847	93.7%	15
All combined	0.9909	95.2%	105

Model Progression

Version	Backbone	Trainable Params	Test AUC
v1	EfficientNet-B4 (frozen)	~500K	0.9072
v2	EfficientNet-B4 (unfrozen)	~2.3M	0.9309
v3	CLIP ViT-L/14 (frozen)	~500K	0.9909

Architecture

Video → InsightFace → 16 face crops (224×224)
         │
         ├── Identity Anchor → GIR + TFR + BCR → 16-dim features
         │
         └── CLIP ViT-L/14 (frozen) → 768-dim video embedding
                   │
                   ▼
         Fusion MLP [784 → 512 → 128 → 1] + TTA → REAL / FAKE

Key design: the CLIP backbone is entirely frozen and only the ~500K-parameter fusion MLP is trained, which makes 0.99 AUC reachable from just 530 videos.
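The fusion width and the size of the trainable head can be checked with a little arithmetic. The sketch below assumes each hidden layer is a Linear layer followed by a BatchNorm1d with learnable scale and shift, as in the Usage code; the helper names are illustrative and not part of VIPER.

```python
# Count trainable parameters of a Linear -> BatchNorm1d MLP head.
# Helper names are illustrative, not part of the VIPER codebase.

CLIP_DIM = 768         # CLIP ViT-L/14 video embedding size (from the diagram)
HANDCRAFTED_DIM = 16   # GIR + TFR + BCR feature vector size (from the diagram)


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * n_features


def head_params(layer_sizes):
    """Total trainable parameters; BatchNorm follows every layer but the last."""
    total = 0
    last = len(layer_sizes) - 2
    for i, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
        total += linear_params(n_in, n_out)
        if i < last:
            total += batchnorm_params(n_out)
    return total


fusion_dim = CLIP_DIM + HANDCRAFTED_DIM
total = head_params([fusion_dim, 512, 128, 1])
print(fusion_dim, total)  # 784 468993
```

Under these assumptions the head has roughly 469K trainable parameters, in line with the ~500K figure quoted above.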

Three Biometric Signals

Signal	Method	Captures
GIR	ArcFace cosine distance	Skull geometry, eye spacing
TFR	DCT KL divergence	Skin micro-texture
BCR	dlib landmark coupling	Facial muscle dynamics
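As a rough illustration of the first two signals, the sketch below computes a cosine distance between an anchor embedding and a later frame's embedding (GIR-style) and a KL divergence between two coefficient histograms (TFR-style). It assumes the anchor is the mean of the first 8 frame embeddings, which is only one plausible reading of "first 8 frames". The toy vectors stand in for real ArcFace embeddings and DCT histograms, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (assumed anchor construction)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two histograms, smoothed and normalized first."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p, q)
    )


# Toy data: 8 anchor-frame embeddings, then two later frames to compare.
anchor = mean_vector([[1.0, 0.0, 0.0]] * 8)
same_identity = [0.9, 0.1, 0.0]
other_identity = [0.0, 1.0, 0.0]

gir_same = cosine_distance(anchor, same_identity)
gir_other = cosine_distance(anchor, other_identity)
tfr_shift = kl_divergence([4, 1, 1], [1, 1, 4])
print(gir_same < gir_other, tfr_shift > 0)
```

A frame that drifts away from the anchor identity yields a larger distance, which matches the "high energy = fake" intuition from the Core Idea section.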

Confusion Matrix

                 Predicted Real    Predicted Fake
Actual Real           45                3
Actual Fake            2               55

Only 5 errors out of 105 test videos.
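The headline metrics follow directly from these four counts. The snippet below is a small check using the standard definitions, with "fake" treated as the positive class; the variable names are illustrative.

```python
# Confusion-matrix counts from the table above ("fake" is the positive class).
tn, fp = 45, 3   # actual real: predicted real, predicted fake
fn, tp = 2, 55   # actual fake: predicted real, predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)            # fake recall
fpr = fp / (fp + tn)               # false positive rate on real videos
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} fpr={fpr:.4f}")
print(f"precision={precision:.3f} f1={f1:.3f}")
```

These counts reproduce the reported accuracy (95.2%), fake recall (96.5%), and false positive rate (6.25%, listed as 6.3%); precision comes to about 0.948 and F1 to about 0.957.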

Usage

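import torch
import torch.nn as nn
import open_clip
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub
ckpt = hf_hub_download(repo_id="rxbinsingh/VIPER", filename="viper_best_v3_clip.pt")

# Load CLIP ViT-L/14 with OpenAI weights; `preprocess` is CLIP's eval transform
clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_model.eval()

# Model: frozen CLIP visual encoder plus a trainable fusion MLP
class VIPERv3(nn.Module):
    def __init__(self, clip_visual, dropout=0.4):
        super().__init__()
        self.clip = clip_visual
        for p in self.clip.parameters():
            p.requires_grad = False
        # 784 = 768 (CLIP embedding) + 16 (hand-crafted biometric features)
        self.head = nn.Sequential(
            nn.Linear(784, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout*0.5),
            nn.Linear(128, 1))

    def forward(self, crops, hand_feats):
        # ASSUMPTION: the original snippet omits forward(). This version mean-pools
        # the per-frame CLIP embeddings and concatenates the hand-crafted features;
        # the author's exact pooling may differ.
        b, t = crops.shape[:2]
        emb = self.clip(crops.flatten(0, 1))      # (B*T, 768)
        emb = emb.view(b, t, -1).mean(dim=1)      # (B, 768)
        return self.head(torch.cat([emb, hand_feats], dim=1)).squeeze(-1)  # (B,)

model = VIPERv3(clip_model.visual)
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

# Input: crops (1, 16, 3, 224, 224), hand_feats (1, 16)
# Output: logit → sigmoid → P(fake)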
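To turn the logit into a decision, the output is passed through a sigmoid, and the horizontal-flip TTA listed under Training Configuration combines two forward passes. The sketch below assumes TTA averages the two probabilities (averaging logits is another common choice) and that 0.5 is the decision threshold; neither detail is confirmed here, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tta_fake_probability(logit_original, logit_flipped):
    """Average P(fake) over the original and horizontally flipped clip (assumed)."""
    return 0.5 * (sigmoid(logit_original) + sigmoid(logit_flipped))


def label(p_fake, threshold=0.5):
    """Map a probability to a label; the 0.5 threshold is an assumption."""
    return "FAKE" if p_fake >= threshold else "REAL"


# Example logits, as would come from model(crops, hand_feats) on each view.
p = tta_fake_probability(2.0, 1.0)
print(round(p, 3), label(p))
```

In practice the two logits would come from running the model once on the crops and once on the same crops flipped along the width axis.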

Training Dataset

Category	Count	Source	License
Real	250	RTFS-10K	CC-BY-SA-4.0
Face swap	220	RTFS-10K (inswapper)	CC-BY-SA-4.0
Expression swap	60	FaceForensics++	Academic
Full-body GAN	50	FakeParts	CC0-1.0
Total	580
Usable	530	91.4% success
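The totals in this table can be reproduced with a few lines of arithmetic. The same counts also give a real-to-fake ratio of about 0.758, which equals the pos_weight listed under Training Configuration. That suggests, but does not confirm, that the weight was set as negatives over positives on the collected set; the dictionary keys below are illustrative names.

```python
# Clip counts per category, copied from the table above.
counts = {
    "real": 250,
    "face_swap": 220,
    "expression_swap": 60,
    "full_body_gan": 50,
}
usable = 530

total = sum(counts.values())
usable_rate = usable / total

# Ratio of negatives (real) to positives (all fake categories).
n_fake = total - counts["real"]
real_to_fake = counts["real"] / n_fake

print(total, f"{usable_rate:.1%}", round(real_to_fake, 3))  # 580 91.4% 0.758
```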

Training Configuration

Parameter	Value
Backbone	CLIP ViT-L/14 (OpenAI, frozen)
Classifier	MLP 784→512→128→1
Optimizer	AdamW (lr=3e-4, wd=1e-3)
Scheduler	Cosine annealing, 15 epochs
Batch size	8
Loss	BCE with pos_weight=0.758
TTA	Horizontal flip average
Hardware	NVIDIA T4 (16GB)
Training time	~25 minutes
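For reference, the learning-rate curve implied by this configuration can be sketched in closed form. The snippet assumes the schedule is stepped once per epoch, decays to zero, and has no warmup; none of these details are stated in the table, and `cosine_lr` is a hypothetical helper.

```python
import math

BASE_LR = 3e-4   # AdamW learning rate from the table
EPOCHS = 15      # cosine annealing horizon from the table


def cosine_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (min_lr=0 is assumed)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


schedule = [cosine_lr(epoch) for epoch in range(EPOCHS)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

The rate starts at 3e-4, falls slowly at first, fastest around the midpoint, and approaches zero by the final epoch.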

Limitations

Full-body GAN videos cannot be detected, because face detection fails on them.
The analytical signals (GIR/TFR/BCR) are weak on their own against modern fakes.
Evaluation covers only 105 test videos; larger benchmarks are pending.
Robustness to adversarial attacks on CLIP has not been tested.

Citation

@misc{singh2025viper,
  title   = {VIPER: Deepfake Detection Through Identity-Anchored Visual Representation Analysis},
  author  = {Singh, Robin},
  year    = {2025},
  url     = {https://github.com/rxbinsingh/VIPER}
}

Author

Robin Singh · Bennett University, India

Evaluation Results (self-reported)

Metric	Value
AUC-ROC	0.991
Accuracy	0.952
F1 (Fake)	0.960
Precision (Fake)	0.948
Recall (Fake)	0.965