---
language: en
license: apache-2.0
tags:
- deepfake-detection
- video-classification
- clip
- vit
- pytorch
- transformers
- spatiotemporal-adapters
- bf16
- reproducibility
metrics:
- accuracy
- f1
- recall
base_model:
- openai/clip-vit-large-patch14
---

# SPDX-License-Identifier: Apache-2.0
"""
SOTA Deepfake Detector - DFD
Model card + inference utilities for Hugging Face repository:
  Arko007/deepfake-detector-dfd-sota

This file contains:
- A model card in Hugging Face model README format (YAML frontmatter + sections).
- Inference helper functions to:
  - load frames or an MP4 video,
  - extract 12 evenly-spaced frames,
  - preprocess using the CLIP image processor,
  - run a forward pass through the provided checkpoint and return logits,
  - return softmax probabilities and predicted class.

Notes:
- Architecture described: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input: 12 frames, 3x384x384, dtype=torch.float32 / BF16 where supported
- Labels: 0 = real, 1 = fake
- License: Apache-2.0
- This file is intended as a convenience reference for model consumers and deployers.
"""

# SOTA Deepfake Detector - DFD

Model: SOTA Deepfake Detector - DFD  
Repository: https://huggingface.co/Arko007/deepfake-detector-dfd-sota

## Model Description

The "SOTA Deepfake Detector - DFD" is a video classifier built on a frozen CLIP-ViT-Large backbone. Lightweight spatiotemporal adapters are inserted to learn temporal relationships across frames; only the adapters and the final classification head are trained.

- Architecture: CLIP-ViT-Large (frozen backbone) + Spatiotemporal Adapters
- Input resolution: 384×384 pixels
- Temporal frames: 12 frames per video
- Trainable parameters: 5,255,938 (≈1.7% of total 308M)
- Precision used for training: BF16 (mixed precision)
- Framework: PyTorch + Transformers (Hugging Face)
- HF Hub repo: Arko007/deepfake-detector-dfd-sota
- Model files:
  - pytorch_model.bin (1.28GB)
  - config.json
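
The trainable-parameter share quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `FakeParam` is a hypothetical stand-in for a PyTorch parameter (it exposes only `numel()` and `requires_grad`), and the frozen count is derived from the rounded "308M" total rather than from the actual checkpoint.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter (only the fields used here)."""
    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size


def count_parameters(parameters: Iterable) -> Tuple[int, int]:
    """Return (trainable, total) element counts for parameter-like objects."""
    trainable = 0
    total = 0
    for p in parameters:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total


# Figures from the list above; 308M is a rounded total, so the result is approximate.
params = [
    FakeParam(308_000_000 - 5_255_938, requires_grad=False),  # frozen backbone
    FakeParam(5_255_938, requires_grad=True),                  # adapters + head
]
trainable, total = count_parameters(params)
fraction = trainable / total
print(f"{trainable:,} / {total:,} = {fraction:.2%}")  # 5,255,938 / 308,000,000 = 1.71%
```

Because `count_parameters` relies only on `numel()` and `requires_grad`, the same function should also accept `model.parameters()` from a real PyTorch module.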

## Intended Use

This model is intended to classify short video clips (or frame sequences) as either:
- Label 0: Real videos
- Label 1: Deepfake/manipulated videos

Primary use-cases:
- Research on deepfake detection
- Benchmarking against other detectors
- Integration into content-moderation pipelines with caution

Not intended:
- Medical, legal, or other high-stakes decisions without human review.
- Use on domains/styles substantially different from the DFD dataset without further fine-tuning.

## Training Data

Dataset: DFD (Deep Fake Detection)  
- Total videos: 3,431
  - Real videos: 363 (10.6%)
  - Fake videos: 3,068 (89.4%)
- Preprocessing:
  - 12 evenly-spaced frames extracted per video at 384×384 resolution (total frames: 41,172)
- Train/validation split: 90/10 stratified
- Class balancing during training: WeightedRandomSampler (inverse frequency weights)
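
As a rough illustration of inverse-frequency sampling, the sketch below computes one weight per video from its class count. It uses the full-dataset counts listed above for simplicity (the actual training split would have about 90% of each class), and the helper name `inverse_frequency_weights` is an assumption, not a function from the training code.

```python
from collections import Counter
from typing import List


def inverse_frequency_weights(labels: List[int]) -> List[float]:
    """Give each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Class counts from the dataset summary above: 0 = real, 1 = fake.
labels = [0] * 363 + [1] * 3068
weights = inverse_frequency_weights(labels)

# Probability that a single weighted draw lands on a real video.
total_weight = sum(weights)
p_real = sum(w for w, y in zip(weights, labels) if y == 0) / total_weight
p_fake = 1.0 - p_real
print(round(p_real, 3), round(p_fake, 3))  # 0.5 0.5

# In PyTorch these weights would typically be passed to
# torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True).
```

With these weights each class is drawn about half the time, so the 363 real videos are repeated far more often than any individual fake video.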

## Training Configuration

- Batch size: 16 (effective 32 with gradient accumulation)
- Optimizer: AdamW (weight_decay=0.05)
- Learning rate: 5e-6 with cosine decay, 10% warmup
- Epochs: 12
- Loss: Cross-entropy with label smoothing = 0.1
- Gradient clipping: max_grad_norm = 1.0
- Sampling: WeightedRandomSampler to mitigate class imbalance
- Augmentation (training only): random horizontal flip, brightness jitter
- Hardware: NVIDIA L4 GPU (24GB VRAM)
- Random seed: 42 (all RNGs fixed for reproducibility)
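
The learning-rate entry above (5e-6, cosine decay, 10% warmup) can be sketched as a function of the optimizer step. This is a minimal sketch under assumptions the card does not state: linear warmup, decay to zero, and a schedule stepped once per optimizer update. The `total_steps` value in the example is hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 5e-6,
               warmup_fraction: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the card does not report the step count
for step in (0, 99, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

Libraries such as Transformers ship comparable schedulers; the function above is only meant to show the shape of the curve.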

Training speed: ~1.4 seconds per batch (batch_size=16) on the reported hardware.
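
For a rough sense of scale, the per-batch time can be turned into an epoch estimate. The sketch assumes the training split is 90% of the 3,431 videos, that the sampler draws one training set's worth of samples per epoch, and that two accumulation steps produce the effective batch size of 32; validation time is ignored. None of these totals are reported in the card.

```python
import math

TOTAL_VIDEOS = 3431
TRAIN_FRACTION = 0.9
BATCH_SIZE = 16
ACCUMULATION_STEPS = 2      # assumed: effective batch 32 / batch 16
SECONDS_PER_BATCH = 1.4
EPOCHS = 12

train_videos = round(TOTAL_VIDEOS * TRAIN_FRACTION)
batches_per_epoch = math.ceil(train_videos / BATCH_SIZE)
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / ACCUMULATION_STEPS)
epoch_minutes = batches_per_epoch * SECONDS_PER_BATCH / 60
total_minutes = epoch_minutes * EPOCHS

print(f"train videos:        {train_videos}")
print(f"batches per epoch:   {batches_per_epoch}")
print(f"optimizer steps:     {optimizer_steps_per_epoch} per epoch")
print(f"estimated time:      {epoch_minutes:.1f} min/epoch, {total_minutes:.0f} min total")
```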

## Evaluation / Metrics

- Best validation accuracy: 84.88%
- Per-class detection rates on the validation set (approximate ranges; the real class has few samples):
  - Real class: ~47–55%
  - Fake class: ~64–93%
- Training loss convergence (cross-entropy w/ label smoothing): 0.7097 → 0.6921 (12 epochs)
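
To give the loss values above a scale, the sketch below computes two reference points for two-class cross-entropy with label smoothing 0.1. It assumes the uniform-mixture form of label smoothing (the form used by `torch.nn.CrossEntropyLoss`); whether the training code used exactly this form is not stated in the card.

```python
import math
from typing import List


def smoothed_targets(true_class: int, num_classes: int, smoothing: float) -> List[float]:
    """Mix the one-hot target with a uniform distribution over classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing) * (1.0 if k == true_class else 0.0) + uniform
        for k in range(num_classes)
    ]


def cross_entropy(targets: List[float], probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


targets = smoothed_targets(true_class=1, num_classes=2, smoothing=0.1)  # ~[0.05, 0.95]

# Loss of a predictor that assigns 0.5 to each class: ln 2.
chance_loss = cross_entropy(targets, [0.5, 0.5])
# Lowest reachable loss: predictions equal to the smoothed targets.
floor_loss = cross_entropy(targets, targets)

print(f"equal-probability loss: {chance_loss:.4f}")  # 0.6931
print(f"minimum possible loss:  {floor_loss:.4f}")   # 0.1985
```

Under this assumption, the reported values of 0.7097 and 0.6921 can be read against a range that runs from about 0.1985 (best case) to 0.6931 (equal probabilities for both classes).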

## Known Limitations

- Validation set is highly imbalanced (89% fake), which affects stability of metrics.
- Small number of real videos (363) limits generalization to unseen real samples.
- Model optimized for the DFD dataset; transfer to other deepfake types may require fine-tuning.
- Temporal context is limited to 12 frames per video (about 0.4–1 s of footage, depending on FPS), so artifacts that fall between sampled frames or unfold over longer spans may be missed.
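
The imbalance limitation listed first above can be made concrete: overall accuracy is a class-frequency-weighted average of per-class recall, so the fake class dominates it. The sketch below assumes the stratified validation split keeps the dataset's class proportions; the recall values in the second example are hypothetical inputs chosen for illustration, not reported results.

```python
def accuracy_from_recalls(recall_real: float, recall_fake: float, frac_real: float) -> float:
    """Overall accuracy as a class-frequency-weighted average of per-class recall."""
    return recall_real * frac_real + recall_fake * (1.0 - frac_real)


def balanced_accuracy(recall_real: float, recall_fake: float) -> float:
    """Unweighted mean of per-class recall; insensitive to class frequencies."""
    return (recall_real + recall_fake) / 2.0


FRAC_REAL = 363 / 3431  # class proportion from the dataset summary

# A trivial detector that labels every video as fake.
always_fake = accuracy_from_recalls(0.0, 1.0, FRAC_REAL)
print(f"always-fake accuracy:          {always_fake:.3f}")            # 0.894
print(f"always-fake balanced accuracy: {balanced_accuracy(0.0, 1.0):.3f}")  # 0.500

# Hypothetical recalls, for illustration only.
example_acc = accuracy_from_recalls(0.50, 0.90, FRAC_REAL)
print(f"example accuracy:              {example_acc:.3f}")
print(f"example balanced accuracy:     {balanced_accuracy(0.50, 0.90):.3f}")
```

Reporting balanced accuracy or per-class recall alongside plain accuracy makes results on a split like this easier to interpret.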

## Usage

Quick inference instructions:

1. Load the checkpoint: `checkpoint = torch.load("best_model.pt", map_location="cpu")`
2. Extract the model state: `model_state = checkpoint["model_state_dict"]`
3. Initialize the model: `model = DeepfakeDetector(config)`
4. Load the state dict: `model.load_state_dict(model_state)`
5. Set eval mode: `model.eval()`
6. Inference: pass a tensor of shape `(batch_size, 12, 3, 384, 384)` with dtype float32 (or bf16 where supported).
7. Output:
   - logits: `(batch_size, 2)`
   - probabilities: `softmax(logits)`
   - prediction: `argmax(logits)` -> 0=real, 1=fake
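
The steps above can be sketched end to end as follows. To stay runnable without the released weights, the example uses `TinyStandInDetector`, a hypothetical stand-in that is NOT the released architecture, and writes its own checkpoint in the documented `model_state_dict` layout to a temporary path. With the real checkpoint, the stand-in would be replaced by the model class used for training.

```python
import os
import tempfile

import torch
import torch.nn.functional as F


class TinyStandInDetector(torch.nn.Module):
    """Hypothetical stand-in with the documented input/output shapes only."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.head = torch.nn.Linear(3, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, 3, H, W) -> average over frames and pixels -> (batch, 3)
        return self.head(x.mean(dim=(1, 3, 4)))


# Create a checkpoint in the documented layout (stand-in weights, for illustration).
ckpt_path = os.path.join(tempfile.mkdtemp(), "best_model.pt")
torch.save({"model_state_dict": TinyStandInDetector().state_dict()}, ckpt_path)

# Steps 1-5: load the checkpoint, build the model, load weights, switch to eval mode.
checkpoint = torch.load(ckpt_path, map_location="cpu")
model = TinyStandInDetector()
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Step 6: one clip of 12 frames at 384x384 (random values stand in for real frames).
frames = torch.rand(1, 12, 3, 384, 384, dtype=torch.float32)
with torch.no_grad():
    logits = model(frames)  # (1, 2)

# Step 7: probabilities and predicted label (0 = real, 1 = fake).
probs = F.softmax(logits, dim=-1)
prediction = int(probs.argmax(dim=-1).item())
print(logits.shape, probs.squeeze(0).tolist(), prediction)
```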

## How to Cite

If you use this model in your work, please cite the repository and include details about the DFD dataset and this specific model configuration.

"""

# ----------------------------
# Inference code
# ----------------------------
# The following is a compact, self-contained inference utility. It assumes:
# - checkpoint is saved as 'best_model.pt'
# - a config object or config.json for the model is available
# - CLIP image processor is available via transformers
#
# Important: This implementation is a minimal example to run inference. For
# production, wrap in a robust server, add batching, async IO, error handling,
# and pre-warming for BF16 on supported accelerators.

import os
import math
from typing import List, Tuple, Union, Dict

import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.io import read_video  # optional, requires torchvision installation
from transformers import CLIPImageProcessor, CLIPModel

# ----------------------------------------------------------------------
# Model definition stub
# ----------------------------------------------------------------------
# The real model used for training is a CLIP-ViT-Large backbone (frozen) with
# spatiotemporal adapters and a small classification head. For inference the
# exact architecture must match the checkpoint. Below is a minimal class
# to demonstrate expected load / inference semantics. Replace this with the
# model class used during training (and the one saved to the checkpoint).
class DeepfakeDetector(torch.nn.Module):
    """
    Minimal wrapper around CLIP backbone + temporal adapters + classification head.

    NOTE: This is a lightweight placeholder for demonstration. Replace with the
    exact model definition used during training to successfully load the
    provided checkpoint (pytorch_model.bin / best_model.pt).
    """

    def __init__(self, clip_model_name: str = "openai/clip-vit-large-patch14", num_frames: int = 12):
        super().__init__()
        self.num_frames = num_frames
        # Load CLIP and freeze it (backbone frozen)
        self.clip = CLIPModel.from_pretrained(clip_model_name, torch_dtype=torch.float32)
        for p in self.clip.parameters():
            p.requires_grad = False

        # Spatiotemporal adapters & head (trainable)
        # NOTE: The real training used custom adapters; here we provide a representative head.
        # forward() consumes vision_model.pooler_output, whose width is the vision
        # encoder's hidden size (1024 for ViT-L), not the projection dimension.
        embed_dim = self.clip.config.vision_config.hidden_size
        self.adapter_pool = torch.nn.AdaptiveAvgPool1d(1)  # placeholder
        # Placeholder head (~1.05M parameters at embed_dim=1024); the trained model
        # reports 5,255,938 trainable parameters across adapters and head.
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, 1024),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.1),
            torch.nn.Linear(1024, 2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        x: Tensor of shape (batch, num_frames, 3, H, W)
        Returns logits: (batch, 2)
        """
        b, t, c, h, w = x.shape
        # reshape to (b*t, c, h, w) to run through CLIP's visual encoder
        xt = x.view(b * t, c, h, w)

        # CLIP's image encoder expects pixel values and image processing externally.
        # Use clip.vision_model to obtain image embeddings
        outputs = self.clip.vision_model(pixel_values=xt)
        pooled = outputs.pooler_output  # (b*t, embed_dim) or adjust per CLIP impl

        # reshape back to (b, t, embed_dim) and aggregate temporally
        embed_dim = pooled.shape[-1]
        pooled = pooled.view(b, t, embed_dim)  # (b, t, embed_dim)

        # Simple temporal aggregation (real model uses adapters). For inference placeholder:
        # mean across temporal dimension
        video_repr = pooled.mean(dim=1)  # (b, embed_dim)

        logits = self.classifier(video_repr)  # (b, 2)
        return logits

# ----------------------------------------------------------------------
# Helper utilities for frame extraction, preprocessing and inference
# ----------------------------------------------------------------------
def extract_evenly_spaced_frames_from_video(
    video_path: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Extract `num_frames` evenly spaced frames from a video file.

    Returns a list of PIL Image objects resized to target_size.
    Requires torchvision.io.read_video; no fallback decoder is implemented.
    """
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")

    try:
        # torchvision's read_video returns (frames, audio, info)
        frames, _, _ = read_video(video_path, pts_unit="sec")
    except Exception as e:
        raise RuntimeError(
            "Video reading failed. Ensure torchvision is installed and supports read_video."
        ) from e

    # frames: (num_total_frames, H, W, C) uint8 tensor
    total = frames.shape[0]
    if total == 0:
        # Raised outside the try block so this message is not masked by the handler above
        raise RuntimeError("No frames extracted from video.")
    # Evenly spaced indices; frames repeat if the video has fewer than num_frames
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        frame = frames[int(i)].numpy()
        img = Image.fromarray(frame)
        img = img.convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames

def load_frames_from_folder(
    folder: str,
    num_frames: int = 12,
    target_size: Tuple[int, int] = (384, 384),
) -> List[Image.Image]:
    """
    Load frames (PNG/JPG) from a folder. Picks num_frames evenly across available images.
    """
    files = sorted(
        [
            os.path.join(folder, f)
            for f in os.listdir(folder)
            if f.lower().endswith((".png", ".jpg", ".jpeg"))
        ]
    )
    if not files:
        raise FileNotFoundError(f"No image files found in folder: {folder}")
    total = len(files)
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    pil_frames = []
    for i in indices:
        img = Image.open(files[i]).convert("RGB").resize(target_size, resample=Image.BILINEAR)
        pil_frames.append(img)
    return pil_frames

def preprocess_frames(
    pil_frames: List[Image.Image],
    processor: CLIPImageProcessor,
    device: Union[str, torch.device] = "cpu",
) -> torch.Tensor:
    """
    Given a list of PIL images (typically 12), apply the CLIP image processor and
    return a tensor shaped (1, num_frames, 3, H, W) suitable for the model.

    H and W are set by the processor configuration, not by the size of the
    input images: the stock openai/clip-vit-large-patch14 processor resizes and
    center-crops to 224x224 unless its size/crop_size settings are overridden.
    """
    # The processor batches a list of images and returns a dict with pixel_values
    # of shape (num_images, 3, H, W)
    proc = processor(images=pil_frames, return_tensors="pt")
    pixel_values = proc["pixel_values"].to(device)
    # Add the batch dimension: (1, num_frames, 3, H, W)
    return pixel_values.unsqueeze(0)

def apply_augmentations_train(pil_frames: List[Image.Image]) -> List[Image.Image]:
    """
    Apply training augmentations: horizontal flip (random) and brightness jitter.
    Called only during training data pipeline; included here for completeness.
    """
    aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2),
    ])
    return [aug(img) for img in pil_frames]

# ----------------------------------------------------------------------
# High level inference API
# ----------------------------------------------------------------------
def load_model_from_checkpoint(
    checkpoint_path: str,
    clip_model_name: str = "openai/clip-vit-large-patch14",
    device: Union[str, torch.device] = "cpu",
) -> Tuple[torch.nn.Module, CLIPImageProcessor]:
    """
    Load model and CLIP image processor.

    checkpoint should be a dict with 'model_state_dict' key (per model card instructions).
    """
    device = torch.device(device)
    # Instantiate model skeleton
    model = DeepfakeDetector(clip_model_name=clip_model_name, num_frames=12)

    # Load checkpoint
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    if "model_state_dict" in ckpt:
        state_dict = ckpt["model_state_dict"]
    else:
        # assume full state dict saved directly
        state_dict = ckpt

    # Strict load: every key must match, so the model class must be the exact
    # implementation used for training
    try:
        model.load_state_dict(state_dict)
    except Exception as e:
        # Provide a clearer error for mismatch: user must use exact model implementation
        raise RuntimeError(f"Failed to load state dict into DeepfakeDetector: {e}") from e

    model.to(device)
    model.eval()

    # Load CLIP image processor (matches model's backbone preprocessing)
    processor = CLIPImageProcessor.from_pretrained(clip_model_name)
    return model, processor

def predict_from_frames(
    model: torch.nn.Module,
    processor: CLIPImageProcessor,
    pil_frames: List[Image.Image],
    device: Union[str, torch.device] = "cpu",
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Perform inference on a single sample (list of 12 PIL frames).
    Returns a dict with:
      - logits (torch.Tensor shape (1,2))
      - probabilities (list of 2 floats)
      - prediction (int, 0 or 1)
      - confidence (float, probability of predicted class)
    """
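    device = torch.device(device)
    x = preprocess_frames(pil_frames, processor, device=device)  # (1, 12, 3, H, W)
    x = x.to(dtype=torch.float32)
    model = model.to(device)
    # The weights stay in float32, so casting only the input to bf16 would raise a
    # dtype mismatch. Autocast runs supported ops in bf16 instead (CUDA only here).
    autocast_enabled = use_bf16 and device.type == "cuda"
    autocast_device = "cuda" if device.type == "cuda" else "cpu"

    with torch.no_grad():
        with torch.autocast(device_type=autocast_device, dtype=torch.bfloat16, enabled=autocast_enabled):
            logits = model(x)  # (1,2)

    # Post-process in float32 so probabilities are not rounded to bf16 precision
    logits = logits.float()
    probs = F.softmax(logits, dim=-1)
    probs_list = probs.squeeze(0).cpu().tolist()
    pred = int(torch.argmax(logits, dim=-1).squeeze().item())
    confidence = float(probs.squeeze(0)[pred].cpu().item())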

    return {
        "logits": logits.cpu(),
        "probabilities": probs_list,
        "prediction": pred,
        "confidence": confidence,
    }

# ----------------------------------------------------------------------
# Convenience function: video file -> prediction
# ----------------------------------------------------------------------
def predict_from_video_file(
    checkpoint_path: str,
    video_path: str,
    device: Union[str, torch.device] = "cpu",
    clip_model_name: str = "openai/clip-vit-large-patch14",
    num_frames: int = 12,
    use_bf16: bool = False,
) -> Dict[str, Union[int, float, List[float], torch.Tensor]]:
    """
    Load model from checkpoint, extract frames from video, and return prediction.
    """
    model, processor = load_model_from_checkpoint(checkpoint_path, clip_model_name=clip_model_name, device=device)
    pil_frames = extract_evenly_spaced_frames_from_video(video_path, num_frames=num_frames, target_size=(384, 384))
    return predict_from_frames(model, processor, pil_frames, device=device, use_bf16=use_bf16)

# ----------------------------------------------------------------------
# Example usage (commented)
# ----------------------------------------------------------------------
# if __name__ == "__main__":
#     # Example: run inference on a single MP4
#     checkpoint = "best_model.pt"
#     video = "example.mp4"
#     result = predict_from_video_file(checkpoint, video, device="cuda", use_bf16=True)
#     print("Logits:", result["logits"])
#     print("Probabilities:", result["probabilities"])
#     print("Prediction (0=real,1=fake):", result["prediction"], "confidence:", result["confidence"])

# ----------------------------------------------------------------------
# End of file
# ----------------------------------------------------------------------