---
license: mit
tags:
  - video-classification
  - I3D
  - action-recognition
  - anomaly-detection
datasets:
  - kinetics-400
  - ucf-crime
model-index:
  - name: i3d_ucf_finetuned
    results:
      - task:
          type: video-classification
        dataset:
          name: UCF-Crime
          type: ucf-crime
        metrics:
          - name: Validation Accuracy
            type: accuracy
            value: 0.6667
---

# I3D UCF Finetuned

## Model Description
This is a finetuned I3D (Inflated 3D ConvNet) model for video classification, based on the `i3d_r50` architecture from [PyTorchVideo](https://pytorchvideo.org/). I3D takes a ResNet-50 backbone and inflates its 2D convolutions to 3D so that the network captures both spatial and temporal features from videos. The base model was pretrained on the **Kinetics-400** dataset, which contains ~306,245 short video clips across 400 human action classes (e.g., running, dancing, cooking).
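
To make "inflation" concrete, here is a minimal sketch of the idea as described in the I3D paper: a 2D kernel is repeated along a new time axis and rescaled by `1/T`. The helper name `inflate_conv2d` is hypothetical and is not part of PyTorchVideo, and this is not necessarily how the `i3d_r50` weights were initialized. For brevity the sketch ignores `groups` and `dilation`.

```python
import torch
import torch.nn as nn


def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Build a Conv3d whose kernel repeats the 2D kernel `time_dim` times.

    Hypothetical helper for illustration; assumes groups=1 and dilation=1.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), rescaled by 1/T so that a
        # video of identical frames gives the same response as the 2D filter.
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


torch.manual_seed(0)
conv2d = nn.Conv2d(3, 4, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d, time_dim=3)

image = torch.rand(1, 3, 8, 8)
# A "boring" video: the same frame repeated 3 times, shape (N, C, T, H, W).
video = image.unsqueeze(2).repeat(1, 1, 3, 1, 1)

with torch.no_grad():
    out2d = conv2d(image)              # (1, 4, 8, 8)
    out3d = conv3d(video).squeeze(2)   # (1, 4, 1, 8, 8) -> (1, 4, 8, 8)

# The inflated filter should reproduce the 2D response on a static video.
print(torch.allclose(out2d, out3d, atol=1e-5))
```

The check on a static video is the property that lets an inflated network start from the behavior of its 2D counterpart before it learns temporal patterns.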

The model was finetuned on the **UCF-Crime** dataset to classify videos into 8 specific categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`. During finetuning, the final fully connected layer was modified to output 8 classes, and a Dropout layer (p=0.3) was added to reduce overfitting. The finetuned weights are stored in `i3d_ucf_finetuned.pth` (109 MB) and can be downloaded from this repository.

## Dataset
### Pretraining Dataset
- **Kinetics-400**: A large-scale dataset with ~306,245 videos covering 400 human action classes. Pretraining on it provides general-purpose spatiotemporal features, which makes it a common starting point for finetuning on other video tasks.

### Finetuning Dataset
- **UCF-Crime**: A dataset for anomaly detection in videos, containing ~1,900 videos (~1,610 for training, 290 for testing). The model was finetuned on a subset of UCF-Crime to classify videos into 8 categories: `arrest`, `Explosion`, `Fight`, `normal`, `roadaccidents`, `shooting`, `Stealing`, `vandalism`.
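
As an illustration of how the 8 model outputs map onto these categories, here is a small dependency-free sketch that turns a vector of logits into ranked labels. The class order is copied from the usage snippet in this card; that the output indices follow this order is an assumption. The helper names `softmax` and `top_k` and the example logits are made up for illustration.

```python
import math

# Class order copied from the usage snippet in this card; that the model's
# output indices follow this order is an assumption.
LABELS = ["arrest", "Explosion", "Fight", "normal",
          "roadaccidents", "shooting", "Stealing", "vandalism"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits, for illustration only (not real model output).
example_logits = [0.1, 2.0, 0.3, 1.5, -0.5, 0.0, -1.0, 0.2]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

Looking at the top few labels rather than only the argmax can be useful given that validation accuracy is well below 100%.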

## Performance
The model was finetuned for 30 epochs. Below are the training and validation performance plots:

### Training and Validation Accuracy
![Training and Validation Accuracy](train_val_accuracy.jpg)

- **Best Validation Accuracy**: ~66.67%.
- **Training Accuracy**: Reached ~81.03%.

### Training and Validation Loss
![Training and Validation Loss](train_val_loss.jpg)

- The training loss decreases steadily, while the validation loss fluctuates. Together with the gap between training accuracy (~81.03%) and validation accuracy (~66.67%), this suggests some overfitting and room to improve generalization.

## Usage
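To classify a video, load the weights from this repository with the code below. It needs `torch`, `opencv-python`, `numpy`, and `huggingface_hub`, and the `torch.hub.load` call fetches the PyTorchVideo code from GitHub on first use: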

```python
import torch
import cv2
import numpy as np
import torch.nn as nn
from huggingface_hub import hf_hub_download

# Define the model
class I3DClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # The Kinetics-400 weights loaded here are overwritten by the finetuned checkpoint below
        self.i3d = torch.hub.load('facebookresearch/pytorchvideo', 'i3d_r50', pretrained=True)
        self.dropout = nn.Dropout(0.3)
        # Replace the 400-class head with one sized for the finetuned classes
        self.i3d.blocks[6].proj = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.i3d(x)
        x = self.dropout(x)  # no-op in eval mode
        return x

def load_i3d_ucf_finetuned(repo_id="Ahmeddawood0001/i3d_ucf_finetuned", filename="i3d_ucf_finetuned.pth"):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = I3DClassifier(num_classes=8)
    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.to(device)
    model.eval()
    return model

# Define frame extraction function
def extract_frames(video_path, max_frames=32, frame_size=(224, 224)):
    """Read the first `max_frames` frames and return a (C, T, H, W) float tensor in [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    try:
        while len(frames) < max_frames:
            ret, frame = cap.read()
            if not ret:
                break
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, frame_size)
            frames.append(frame)
    finally:
        cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")
    # Pad short videos by repeating the last frame
    while len(frames) < max_frames:
        frames.append(frames[-1])
    frames = np.stack(frames)  # (T, H, W, C)
    # (T, H, W, C) -> (C, T, H, W), scaled to [0, 1]
    frames = torch.from_numpy(frames).permute(3, 0, 1, 2).float() / 255.0
    return frames

# Define classification function
def classify_video(video_path, model, labels):
    # Run on whichever device the model lives on instead of relying on a global
    device = next(model.parameters()).device
    frames = extract_frames(video_path)
    frames = frames.unsqueeze(0).to(device)  # add batch dim: (1, C, T, H, W)
    with torch.no_grad():
        outputs = model(frames)
        probabilities = torch.softmax(outputs, dim=1)
        predicted_idx = torch.argmax(probabilities, dim=1).item()
    predicted_label = labels[predicted_idx]
    confidence = probabilities[0, predicted_idx].item()
    return predicted_label, confidence

# Example usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
labels = ["arrest", "Explosion", "Fight", "normal", "roadaccidents", "shooting", "Stealing", "vandalism"]
model = load_i3d_ucf_finetuned()
video_path = "path/to/your/video.mp4"  # Replace with your video path
predicted_label, confidence = classify_video(video_path, model, labels)
print(f"Video: {video_path}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```