---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: video-classification
tags:
- cvit
- deepfake-detection
- video-classification
- computer-vision
- vision-transformer
- binary-classification
---

# 🔍 Convolutional Vision Transformer (CViT) for Deepfake Detection

The **Convolutional Vision Transformer (CViT)** is a hybrid architecture that pairs a CNN for spatial feature extraction with a Vision Transformer (ViT) for long-range dependency modeling. It is built for deepfake video detection and is trained on the DeepFake Detection Challenge (DFDC) dataset.

---

## Model Architecture

### 1. Feature Learning (FL) Module - CNN Backbone
- Composed of **17 convolutional operations**.
- Unlike a standard VGG network, the **FL module is used purely for feature extraction**, not classification.
- Accepts input of size **224 × 224 × 3 (RGB image)**.
- Outputs a **512 × 7 × 7** feature map.
- Contains **10.8 million learnable parameters**.
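
The input and output sizes above can be sanity-checked with a little arithmetic. The sketch below assumes a VGG-style layout in which padded convolutions keep the spatial size and five stride-2 pooling stages halve it. The number of pooling stages and the helper name `fl_output_shape` are assumptions for illustration, not details stated in this card.

```python
def fl_output_shape(height=224, width=224, out_channels=512, num_pools=5):
    """Return (C, H, W) after `num_pools` stride-2 pooling stages.

    Assumes the convolutions are padded so that only pooling changes
    the spatial size (hypothetical layout, see the note above).
    """
    for _ in range(num_pools):
        height //= 2
        width //= 2
    return (out_channels, height, width)


# 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(fl_output_shape())  # (512, 7, 7)
```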

### 2. Vision Transformer (ViT) Module
- Receives CNN output (**512 × 7 × 7**) as its input.
- Converts the 7×7 patches into a **1 × 1024** sequence using linear embedding.
- Adds **positional embeddings** of shape **(2 × 1024)**.
- ViT Encoder uses:
  - **Multi-Head Self Attention (MSA)** with **8 attention heads**.
  - **MLP blocks** with:
    - First linear layer of **2048** units.
    - Final linear layer of **2 units** (binary classification: Fake / Real).
    - **ReLU activation** and **Softmax** for final probabilities.
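
As a rough illustration of the numbers above, the sketch below computes the per-head width when a 1024-dimensional embedding is split evenly across 8 heads, and applies a softmax to two output logits. The even split, the example logits, and the helper name `softmax` are assumptions for illustration. The order of the two output classes is not specified here, so the sketch does not label them.

```python
import math

EMBED_DIM = 1024   # embedding width from the ViT module description
NUM_HEADS = 8      # attention heads from the ViT module description

# Assumption: the embedding is split evenly across heads.
head_dim = EMBED_DIM // NUM_HEADS


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from the final 2-unit linear layer.
probs = softmax([0.3, 1.7])
print(head_dim, probs)
```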

---

## 🧪 Experimental Results

The CViT model was evaluated on several deepfake datasets:

### 📊 FaceForensics++ Accuracy
| Dataset                               | Accuracy |
|--------------------------------------|----------|
| FaceForensics++ FaceSwap             | 69%      |
| FaceForensics++ DeepFakeDetection    | 91%      |
| FaceForensics++ Deepfake             | 93%      |
| FaceForensics++ FaceShifter          | 46%      |
| FaceForensics++ NeuralTextures       | 60%      |

> **Note**: Poor performance on the FaceShifter dataset is attributed to the model's difficulty in learning subtle visual artifacts.

---

### 🧪 DFDC Evaluation

| Model               | Validation | Test   |
|---------------------|------------|--------|
| **CViT**            | 87.25%     | **91.5%** |

- **Unseen DFDC test videos**: 400
- **Accuracy**: 91.5%
- **AUC Score**: 0.91

---

### 🧪 UADFV AUC Comparison

| Model         | Validation | FaceSwap | Face2Face |
|---------------|------------|----------|-----------|
| **CViT**      | **93.75%** | 69.69%   | 69.39%    |

---

## ⚙️ Training Configuration

- **Loss Function**: Binary Cross Entropy (BCE)
- **Optimizer**: Adam
- **Learning Rate**: 1e-4  
- **Weight Decay**: 1e-6  
- **Batch Size**: 32
- **Epochs**: 50
- **Learning Rate Scheduler**: Multiplies the LR by 0.1 every 15 epochs
- **Normalization**:
  - Mean: `[0.485, 0.456, 0.406]`
  - Std: `[0.229, 0.224, 0.225]`
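
To make the schedule and normalization concrete, here is a minimal pure-Python sketch. It assumes the scheduler is a plain step decay (the behaviour of PyTorch's `StepLR` with `step_size=15` and `gamma=0.1`) and that pixel values are scaled to `[0, 1]` before normalization. Both are assumptions, and the helper names `step_lr` and `normalize_pixel` are hypothetical.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def step_lr(epoch, base_lr=1e-4, step_size=15, gamma=0.1):
    """Learning rate at a given 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, MEAN, STD))


# Over 50 epochs the rate takes four values:
# epochs 0-14, 15-29, 30-44, and 45-49.
for epoch in (0, 15, 30, 45):
    print(epoch, step_lr(epoch))
```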

---

## 🧪 Inference Setup

- **Input**: 30 normalized facial images (per video)
- **Classification**:
  - Output is a probability `y ∈ [0, 1]` that the video is fake:
    - `0 ≤ y < 0.5`: Real
    - `0.5 ≤ y ≤ 1`: Fake
  - The **log loss function** is used to score the confidence of these predictions.
- Log loss penalizes:
  - Random guesses
  - Confident but incorrect predictions
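
The decision rule and the log loss can be sketched as follows. This example assumes the per-frame probabilities of a video are averaged into a single score and that label `1` means fake. Neither detail is stated above, and the helper names `classify_video` and `log_loss` are hypothetical.

```python
import math


def classify_video(frame_probs, threshold=0.5):
    """Average per-frame fake probabilities (assumed aggregation) and threshold."""
    y = sum(frame_probs) / len(frame_probs)
    return y, ("Fake" if y >= threshold else "Real")


def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss; labels are 0 (real) or 1 (fake)."""
    total = 0.0
    for t, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(labels)


# For a real video (label 0): a confident wrong score costs far more
# than a coin-flip score, which in turn costs more than a confident
# correct score.
print(log_loss([0], [0.99]), log_loss([0], [0.5]), log_loss([0], [0.01]))
```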

---

## 🛠 Inference Example

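```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="mhamza-007/cvit_deepfake_detection",
    filename="cvit2_deepfake_detection_ep_50.pth"
)

# Load the checkpoint on CPU (example)
checkpoint = torch.load(model_path, map_location='cpu')

if isinstance(checkpoint, torch.nn.Module):
    # The file stores a full pickled model: use it directly
    model = checkpoint
    model.eval()
else:
    # Otherwise the file holds weights only: build the CViT architecture
    # from the training code first, then load these weights into it
    raise TypeError("Checkpoint is not a full model; load it into a CViT instance.")
```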