---
language: en
license: apache-2.0
tags:
- deepfake-detection
- video-classification
- clip
- vit
- pytorch
- transformers
- spatiotemporal-adapters
- bf16
- reproducibility
metrics:
- accuracy
- f1
- recall
base_model:
- openai/clip-vit-large-patch14
---
# SPDX-License-Identifier: Apache-2.0
"""
SOTA Deepfake Detector - DFD
Model card + inference utilities for Hugging Face repository: Arko007/deepfake-detector-dfd-sota

This file contains:
- A model card in Hugging Face model README format (YAML frontmatter + sections).
- Inference helper functions to:
  - load frames or an MP4 video,
  - extract 12 evenly-spaced frames,
  - preprocess using the CLIP image processor,
  - run a forward pass through the provided checkpoint and return logits,
  - return softmax probabilities and predicted class.

Notes:
- Architecture described: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input: 12 frames, 3x384x384, dtype=torch.float32 / BF16 where supported
- Labels: 0 = real, 1 = fake
- License: Apache-2.0
- This file is intended as a convenience reference for model consumers and deployers.
"""

# SOTA Deepfake Detector - DFD

Model: SOTA Deepfake Detector - DFD
Repository: https://huggingface.co/Arko007/deepfake-detector-dfd-sota

## Model Description

The "SOTA Deepfake Detector - DFD" adapts a frozen CLIP-ViT-Large backbone to video by inserting lightweight spatiotemporal adapters that learn temporal relationships across frames. The backbone parameters stay frozen; only the adapters and the final classification head are trainable.

- Architecture: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input resolution: 384×384 pixels
- Temporal frames: 12 frames per video
- Trainable parameters: 5,255,938 (≈1.7% of total 308M)
- Precision used for training: BF16 (mixed precision)
- Framework: PyTorch + Transformers (Hugging Face)
- HF Hub repo: Arko007/deepfake-detector-dfd-sota
- Model files:
  - pytorch_model.bin (1.28GB)
  - config.json

## Intended Use

This model is intended to classify short video clips (or frame sequences) as either:
- Label 0: Real videos
- Label 1: Deepfake/manipulated videos

Primary use-cases:
- Research on deepfake detection
- Benchmarking against other detectors
- Integration into content-moderation pipelines with caution

Not intended:
- Medical, legal, or other high-stakes decisions without human review.
- Use on domains/styles substantially different from the DFD dataset without further fine-tuning.

## Training Data

Dataset: DFD (Deep Fake Detection)
- Total videos: 3,431
- Real videos: 363 (10.6%)
- Fake videos: 3,068 (89.4%)
- Preprocessing:
  - 12 evenly-spaced frames extracted per video at 384×384 resolution (total frames: 41,172)
- Train/validation split: 90/10 stratified
- Class balancing during training: WeightedRandomSampler (inverse frequency weights)

## Training Configuration

- Batch size: 16 (effective 32 with gradient accumulation)
- Optimizer: AdamW (weight_decay=0.05)
- Learning rate: 5e-6 with cosine decay, 10% warmup
- Epochs: 12
- Loss: Cross-entropy with label smoothing = 0.1
- Gradient clipping: max_grad_norm = 1.0
- Sampling: WeightedRandomSampler to mitigate class imbalance
- Augmentation (training only): random horizontal flip, brightness jitter
- Hardware: NVIDIA L4 GPU (24GB VRAM)
- Random seed: 42 (all RNGs fixed for reproducibility)

Training speed: ~1.4 seconds per batch (batch_size=16) on the reported hardware.

## Evaluation / Metrics

- Best validation accuracy: 84.88%
- Validation detection (approx.
ranges due to small real class):
  - Real class detection: ~47–55% (low number of real samples)
  - Fake class detection: ~64–93%
- Training loss convergence (cross-entropy w/ label smoothing): 0.7097 → 0.6921 (12 epochs)

## Known Limitations

- Validation set is highly imbalanced (89% fake), which affects stability of metrics.
- Small number of real videos (363) limits generalization to unseen real samples.
- Model optimized for the DFD dataset; transfer to other deepfake types may require fine-tuning.
- Temporal context limited to 12 frames (approx. 0.4–1s depending on FPS), so long-term artifacts may be missed.

## Usage

Quick inference instructions:
1. Load checkpoint: checkpoint = torch.load("best_model.pt", map_location="cpu")
2. Extract model state: model_state = checkpoint["model_state_dict"]
3. Initialize model: model = DeepfakeDetector(config) (use the model class from training; the stub in this file takes clip_model_name and num_frames instead of a config)
4. Load state dict: model.load_state_dict(model_state)
5. Set to eval: model.eval()
6. Inference: Pass a tensor of shape (batch_size, 12, 3, 384, 384) with dtype float32 (or bf16 where supported).
7. Output:
   - logits: (batch_size, 2)
   - probabilities: softmax(logits)
   - prediction: argmax(logits) -> 0=real, 1=fake

## How to Cite

If you use this model in your work, please cite the repository and include details about the DFD dataset and this specific model configuration.
"""

# ----------------------------
# Inference code
# ----------------------------
# The following is a compact, self-contained inference utility. It assumes:
# - checkpoint is saved as 'best_model.pt'
# - a config object or config.json for the model is available
# - CLIP image processor is available via transformers
#
# Important: This implementation is a minimal example to run inference. For
# production, wrap in a robust server, add batching, async IO, error handling,
# and pre-warming for BF16 on supported accelerators.
import os
from typing import List, Tuple, Union, Dict

import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.io import read_video  # requires a torchvision build with video decoding support

from transformers import CLIPImageProcessor, CLIPModel

# ----------------------------------------------------------------------
# Model definition stub
# ----------------------------------------------------------------------
# The real model used for training is a CLIP-ViT-Large backbone (frozen) with
# spatiotemporal adapters and a small classification head. For inference the
# exact architecture must match the checkpoint. Below is a minimal class
# to demonstrate expected load / inference semantics. Replace this with the
# model class used during training (and the one saved to the checkpoint).
class DeepfakeDetector(torch.nn.Module):
    """
    Minimal wrapper around CLIP backbone + temporal adapters + classification head.
    NOTE: This is a lightweight placeholder for demonstration. Replace with the exact
    model definition used during training to successfully load the provided checkpoint
    (pytorch_model.bin / best_model.pt).
    """

    def __init__(self, clip_model_name: str = "openai/clip-vit-large-patch14", num_frames: int = 12):
        super().__init__()
        self.num_frames = num_frames
        # Load CLIP and freeze it (backbone frozen)
        self.clip = CLIPModel.from_pretrained(clip_model_name, torch_dtype=torch.float32)
        for p in self.clip.parameters():
            p.requires_grad = False

        # Spatiotemporal adapters & head (trainable)
        # NOTE: The real training used custom adapters; here we provide a representative head.
        # pooler_output of the vision tower has the vision hidden size (1024 for
        # ViT-L/14), not the projection dim, so size the head from vision_config.
        embed_dim = self.clip.config.vision_config.hidden_size
        self.adapter_pool = torch.nn.AdaptiveAvgPool1d(1)  # placeholder, unused in forward()
        # Small placeholder head. It does not reproduce the 5,255,938 trainable
        # parameters of the trained adapters + head.
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, 1024),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.1),
            torch.nn.Linear(1024, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        x: Tensor of shape (batch, num_frames, 3, H, W)
        Returns logits: (batch, 2)
        """
        b, t, c, h, w = x.shape
        # Fold time into the batch axis so each frame passes through CLIP's
        # vision encoder independently: (b*t, c, h, w).
        xt = x.reshape(b * t, c, h, w)
        # pixel_values must already be resized and normalized by the CLIP image processor.
        outputs = self.clip.vision_model(pixel_values=xt)
        pooled = outputs.pooler_output  # (b*t, embed_dim)
        # Restore the time axis: (b, t, embed_dim)
        pooled = pooled.view(b, t, -1)
        # Simple temporal aggregation (the real model uses adapters): mean over frames.
        video_repr = pooled.mean(dim=1)  # (b, embed_dim)
        logits = self.classifier(video_repr)  # (b, 2)
        return logits


# ----------------------------------------------------------------------
# Helper utilities for frame extraction, preprocessing and inference
# ----------------------------------------------------------------------
def extract_evenly_spaced_frames_from_video(
    video_path: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Extract `num_frames` evenly spaced frames from a video file.
    Returns a list of PIL Image objects resized to target_size.
    Requires torchvision.io.read_video; no fallback decoder is implemented.
    """
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")

    try:
        # torchvision's read_video returns (frames, audio, info)
        frames, _, _ = read_video(video_path, pts_unit="sec")
    except Exception as e:
        raise RuntimeError(
            "Video reading failed. Ensure torchvision is installed and supports read_video."
        ) from e

    # frames: (num_total_frames, H, W, C) uint8 tensor
    total = frames.shape[0]
    if total == 0:
        raise RuntimeError("No frames extracted from video.")
    # Indices repeat when the video has fewer than num_frames frames.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        frame = frames[int(i)].numpy()
        img = Image.fromarray(frame)
        img = img.convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames


def load_frames_from_folder(
    folder: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Load frames (PNG/JPG) from a folder. Picks num_frames evenly across available images.
    """
    files = sorted(
        [
            os.path.join(folder, f)
            for f in os.listdir(folder)
            if f.lower().endswith((".png", ".jpg", ".jpeg"))
        ]
    )
    if not files:
        raise FileNotFoundError(f"No image files found in folder: {folder}")
    total = len(files)
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        img = Image.open(files[int(i)]).convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames


def preprocess_frames(
    pil_frames: List[Image.Image],
    processor: CLIPImageProcessor,
    device: Union[str, torch.device] = "cpu",
) -> torch.Tensor:
    """
    Given a list of PIL images (typically 12), apply the CLIP image processor and
    return a tensor shaped (1, num_frames, 3, H, W) suitable for the model.
    H and W are set by the processor: the stock openai/clip-vit-large-patch14
    processor resizes and center-crops to 224x224, so the 384x384 input described
    in the model card requires a processor configured for that size.
    """
    # The processor accepts a list of images and returns a dict whose
    # pixel_values entry has shape (num_images, 3, H, W).
    proc = processor(images=pil_frames, return_tensors="pt")
    pixel_values = proc["pixel_values"].to(device)
    # Add the leading batch dimension: (1, num_frames, 3, H, W)
    return pixel_values.unsqueeze(0)


def apply_augmentations_train(pil_frames: List[Image.Image]) -> List[Image.Image]:
    """
    Apply training augmentations: horizontal flip (random) and brightness jitter.
    Called only during training data pipeline; included here for completeness.
    Note: as written, flip and jitter are sampled independently for each frame.
    """
    aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2),
    ])
    return [aug(img) for img in pil_frames]


# ----------------------------------------------------------------------
# High level inference API
# ----------------------------------------------------------------------
def load_model_from_checkpoint(
    checkpoint_path: str,
    clip_model_name: str = "openai/clip-vit-large-patch14",
    device: Union[str, torch.device] = "cpu",
) -> Tuple[torch.nn.Module, CLIPImageProcessor]:
    """
    Load model and CLIP image processor.
    checkpoint should be a dict with 'model_state_dict' key (per model card instructions).
    """
    device = torch.device(device)
    # Instantiate model skeleton
    model = DeepfakeDetector(clip_model_name=clip_model_name, num_frames=12)
    # Load checkpoint
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    if "model_state_dict" in ckpt:
        state_dict = ckpt["model_state_dict"]
    else:
        # assume full state dict saved directly
        state_dict = ckpt
    # Strict load: any missing or unexpected key means this class does not
    # match the checkpoint and must be replaced by the training-time model.
    try:
        model.load_state_dict(state_dict)
    except Exception as e:
        raise RuntimeError(f"Failed to load state dict into DeepfakeDetector: {e}") from e
    model.to(device)
    model.eval()
    # Load CLIP image processor (matches model's backbone preprocessing)
    processor = CLIPImageProcessor.from_pretrained(clip_model_name)
    return model, processor


def predict_from_frames(
    model: torch.nn.Module,
    processor: CLIPImageProcessor,
    pil_frames: List[Image.Image],
    device: Union[str, torch.device] = "cpu",
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Perform inference on a single sample (list of 12 PIL frames).
    Returns a dict with:
    - logits (torch.Tensor shape (1,2))
    - probabilities (list of 2 floats)
    - prediction (int, 0 or 1)
    - confidence (float, probability of predicted class)
    """
    device = torch.device(device)
    x = preprocess_frames(pil_frames, processor, device=device)  # (1, T, 3, H, W)
    x = x.to(dtype=torch.float32)
    model = model.to(device)
    with torch.no_grad():
        if use_bf16 and device.type != "cpu":
            # Autocast runs eligible ops in BF16 while the weights stay float32,
            # which avoids an input/weight dtype mismatch.
            with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
                logits = model(x)  # (1,2)
        else:
            logits = model(x)  # (1,2)
    logits = logits.float()
    probs = F.softmax(logits, dim=-1)
    probs_list = probs.squeeze(0).cpu().tolist()
    pred = int(torch.argmax(logits, dim=-1).squeeze().item())
    confidence = float(probs.squeeze(0)[pred].cpu().item())
    return {
        "logits": logits.cpu(),
        "probabilities": probs_list,
        "prediction": pred,
        "confidence": confidence,
    }


# ----------------------------------------------------------------------
# Convenience function: video file -> prediction
# ----------------------------------------------------------------------
def predict_from_video_file(
    checkpoint_path: str,
    video_path: str,
    device: Union[str, torch.device] = "cpu",
    clip_model_name: str = "openai/clip-vit-large-patch14",
    num_frames: int = 12,
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Load model from checkpoint, extract frames from video, and return prediction.
    """
    model, processor = load_model_from_checkpoint(checkpoint_path, clip_model_name=clip_model_name, device=device)
    pil_frames = extract_evenly_spaced_frames_from_video(video_path, num_frames=num_frames, target_size=(384, 384))
    return predict_from_frames(model, processor, pil_frames, device=device, use_bf16=use_bf16)


# ----------------------------------------------------------------------
# Example usage (commented)
# ----------------------------------------------------------------------
# if __name__ == "__main__":
#     # Example: run inference on a single MP4
#     checkpoint = "best_model.pt"
#     video = "example.mp4"
#     result = predict_from_video_file(checkpoint, video, device="cuda", use_bf16=True)
#     print("Logits:", result["logits"])
#     print("Probabilities:", result["probabilities"])
#     print("Prediction (0=real,1=fake):", result["prediction"], "confidence:", result["confidence"])

# ----------------------------------------------------------------------
# End of file
# ----------------------------------------------------------------------