---
language:
- ko
- en
tags:
- video-understanding
- v-jepa
- multimodal
- projection-layer
- lora
license: apache-2.0
base_model:
- facebook/vjepa2-vitl-fpc64-256
- Qwen/Qwen3.5-27B
---

# V-JEPA 2 + Qwen3.5-27B Video Understanding

An event-based video understanding pipeline that aligns a V-JEPA 2 vision encoder with the Qwen3.5-27B LLM.

**Training data**: ~600 YouTube Shorts with Gemini 2.0 Flash auto-generated summaries (custom dataset, not publicly released).

## Key Results

- **~250x token efficiency** vs frame-based approaches (8-15 tokens per video)
- **80% domain accuracy** on video summarization (Experiment 5)
- **47.9% text recognition** accuracy with V-JEPA LoRA (vs 1.2% baseline)
- **~22GB VRAM** for inference with GGUF quantization

## Checkpoints

| File | Description | Use Case |
|---|---|---|
| `exp5_projection/proj_epoch5.pt` | Projection Layer (3-layer MLP, ~215M) | Video summarization |
| `exp6_projection/proj_lora_epoch5.pt` | Projection Layer trained with LoRA | Summarization + text recognition |
| `exp6_vjepa_lora/` | V-JEPA 2 LoRA adapter (r=16, alpha=32) | Text recognition in videos |

## Architecture

```
Video → V-JEPA 2 ViT-L (frozen/LoRA) → frame mean pool → [N_frames, 1024]
  → event segmentation (cosine distance peak detection)
  → event mean pool → [N_events, 1024]
  → Projection Layer (3-layer MLP) → [N_events, 5120]
  → Qwen3.5-27B (frozen) → text generation
```
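
The event segmentation and event mean pool steps can be sketched as follows. This is a minimal, dependency-free illustration and not the released implementation: the function names, the `threshold` value, and the exact peak rule (a local maximum of the cosine distance between consecutive frame features) are assumptions. In practice the same logic would likely be vectorized with PyTorch.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # assumption: treat a zero vector as "no change"
    return 1.0 - dot / (na * nb)


def segment_events(frames, threshold=0.3):
    """Split frames into events; returns (start, end) pairs, end exclusive.

    A boundary is placed where the distance between consecutive frames is
    a local peak that also reaches `threshold` (hypothetical default).
    """
    if not frames:
        return []
    dists = [cosine_distance(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    boundaries = [0]
    for i, d in enumerate(dists):
        left = dists[i - 1] if i > 0 else -math.inf
        right = dists[i + 1] if i + 1 < len(dists) else -math.inf
        if d >= threshold and d > left and d >= right:
            boundaries.append(i + 1)  # a new event starts at frame i + 1
    boundaries.append(len(frames))
    return list(zip(boundaries[:-1], boundaries[1:]))


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_events(frames, threshold=0.3):
    """[N_frames, D] frame features -> [N_events, D] event features."""
    return [mean_pool(frames[s:e]) for s, e in segment_events(frames, threshold)]


# Toy example with 2-D features: the content changes after the second frame.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(segment_events(toy))  # [(0, 2), (2, 5)]
```

Each pooled event vector becomes one input token for the LLM after projection, which is where the small per-video token count comes from.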

## Projection Layer Architecture

```python
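import torch.nn as nn


class ProjectionV2(nn.Module):
    def __init__(self, vjepa_dim=1024, llm_dim=5120):
        super().__init__()
        hidden = llm_dim * 2  # 10240
        self.proj = nn.Sequential(
            nn.Linear(vjepa_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x):
        # Assumed forward pass (not shown in the original snippet):
        # [N_events, vjepa_dim] -> [N_events, llm_dim]
        return self.proj(x)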
```
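
As a sanity check on the size of this module, the parameter count can be worked out from the layer shapes alone. The sketch below uses hypothetical helper names and plain arithmetic (weights plus biases for each `nn.Linear`), so it runs without PyTorch.

```python
def linear_params(n_in, n_out):
    """Parameters in one fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def projection_params(vjepa_dim=1024, llm_dim=5120):
    """Total parameters for the three Linear layers shown above."""
    hidden = llm_dim * 2
    return (
        linear_params(vjepa_dim, hidden)
        + linear_params(hidden, hidden)
        + linear_params(hidden, llm_dim)
    )


total = projection_params()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")  # 167,797,760 parameters (167.8M)
```

For the layer sizes shown, this comes to about 167.8M, which is below the ~215M quoted elsewhere in this card. The released checkpoints may therefore use a different hidden width or include parameters not shown in the snippet.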

## Usage

```python
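import torch
from peft import PeftModel
from transformers import AutoModel

# ProjectionV2 is the class defined under "Projection Layer Architecture".

# Load the V-JEPA 2 vision encoder
vjepa = AutoModel.from_pretrained("facebook/vjepa2-vitl-fpc64-256")

# Load the Experiment 5 projection layer (video summarization)
proj = ProjectionV2(1024, 5120)
proj.load_state_dict(torch.load("exp5_projection/proj_epoch5.pt", map_location="cpu"))
proj.eval()

# For text recognition, also load the V-JEPA 2 LoRA adapter and the
# projection layer trained alongside it (Experiment 6).
# Note: PEFT injects the adapter layers into `vjepa` in place, so reload the
# base model if the unadapted encoder is also needed.
vjepa_lora = PeftModel.from_pretrained(vjepa, "exp6_vjepa_lora/")
proj_lora = ProjectionV2(1024, 5120)
proj_lora.load_state_dict(torch.load("exp6_projection/proj_lora_epoch5.pt", map_location="cpu"))
proj_lora.eval()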
```

## Training Details

- **Vision Encoder**: V-JEPA 2 ViT-L (326M params, frozen or LoRA r=16)
- **LLM**: Qwen3.5-27B (frozen, bf16)
- **Projection**: 3-layer MLP (~215M params, trainable)
- **Data**: ~600 YouTube Shorts with Gemini 2.0 Flash auto-summaries
- **Training**: 5 epochs, AdamW lr=1e-4, A100 80GB
- **Loss**: next-token prediction (causal LM)
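
The card does not spell out how the next-token loss is applied to the mixed visual and text sequence. One common arrangement, shown here purely as an assumption, is to prepend the projected event embeddings to the prompt and summary tokens and to mask every position except the summary out of the loss. The helper name `build_labels` and the sequence layout are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(n_events, prompt_ids, target_ids):
    """Labels aligned with [event embeddings | prompt tokens | target tokens].

    Event and prompt positions are masked so that only the summary tokens
    contribute to the causal LM loss (assumed layout, not confirmed by the card).
    """
    masked = [IGNORE_INDEX] * (n_events + len(prompt_ids))
    return masked + list(target_ids)


# Example: 2 event tokens, a 2-token prompt, and a 3-token summary.
labels = build_labels(2, [5, 6], [7, 8, 9])
print(labels)  # [-100, -100, -100, -100, 7, 8, 9]
```

Hugging Face causal LM classes shift the labels internally, so a list like this would be passed unshifted alongside the concatenated `inputs_embeds`.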

## Citation

If you use this work, please cite:

```bibtex
@misc{raen2026vjepa_video_understanding,
  title={Event-Based Video Understanding via V-JEPA--LLM Alignment: From Event Segmentation to Visual-Semantic Mapping},
  author={Raen2264},
  year={2026},
  doi={10.5281/zenodo.19143611},
  url={https://doi.org/10.5281/zenodo.19143611},
  note={Model checkpoints: https://huggingface.co/2264K/vjepa2-qwen3.5-video-understanding}
}
```