---
language:
- ko
- en
tags:
- video-understanding
- v-jepa
- multimodal
- projection-layer
- lora
license: apache-2.0
base_model:
- facebook/vjepa2-vitl-fpc64-256
- Qwen/Qwen3.5-27B
---

# V-JEPA 2 + Qwen3.5-27B Video Understanding

Event-based video understanding pipeline that aligns a V-JEPA 2 vision encoder with the Qwen3.5-27B LLM.

**Training data**: ~600 YouTube Shorts with Gemini 2.0 Flash auto-generated summaries (custom dataset, not publicly released).

## Key Results

- **~250x token efficiency** vs frame-based approaches (8-15 tokens per video)
- **80% domain accuracy** on video summarization (Experiment 5)
- **47.9% text recognition** accuracy with V-JEPA LoRA (vs 1.2% baseline)
- **~22GB VRAM** for inference with GGUF quantization

## Checkpoints

| File | Description | Use Case |
|---|---|---|
| `exp5_projection/proj_epoch5.pt` | Projection Layer (3-layer MLP, ~215M) | Video summarization |
| `exp6_projection/proj_lora_epoch5.pt` | Projection Layer trained with LoRA | Summarization + text recognition |
| `exp6_vjepa_lora/` | V-JEPA 2 LoRA adapter (r=16, alpha=32) | Text recognition in videos |

## Architecture

```
Video
  → V-JEPA 2 ViT-L (frozen/LoRA)
  → frame mean pool → [N_frames, 1024]
  → event segmentation (cosine distance peak detection)
  → event mean pool → [N_events, 1024]
  → Projection Layer (3-layer MLP) → [N_events, 5120]
  → Qwen3.5-27B (frozen) → text generation
```

## Projection Layer Architecture

```python
import torch
import torch.nn as nn


class ProjectionV2(nn.Module):
    def __init__(self, vjepa_dim=1024, llm_dim=5120):
        super().__init__()
        hidden = llm_dim * 2  # 10240
        self.proj = nn.Sequential(
            nn.Linear(vjepa_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed forward pass: map [N_events, vjepa_dim] to [N_events, llm_dim].
        return self.proj(x)
```

## Usage

```python
import torch
from transformers import AutoModel

# ProjectionV2 is the class defined in the section above.

# Load V-JEPA 2
vjepa = AutoModel.from_pretrained("facebook/vjepa2-vitl-fpc64-256")

# Load Projection (map_location="cpu" avoids requiring the GPU the checkpoint was saved on)
proj = ProjectionV2(1024, 5120)
proj.load_state_dict(torch.load("exp5_projection/proj_epoch5.pt", map_location="cpu"))

# For text recognition, also load LoRA
from peft import PeftModel

vjepa_lora = PeftModel.from_pretrained(vjepa, "exp6_vjepa_lora/")
proj_lora = ProjectionV2(1024, 5120)
proj_lora.load_state_dict(
    torch.load("exp6_projection/proj_lora_epoch5.pt", map_location="cpu")
)
```

## Training Details

- **Vision Encoder**: V-JEPA 2 ViT-L (326M params, frozen or LoRA r=16)
- **LLM**: Qwen3.5-27B (frozen, bf16)
- **Projection**: 3-layer MLP (~215M params, trainable)
- **Data**: ~600 YouTube Shorts with Gemini 2.0 Flash auto-summaries
- **Training**: 5 epochs, AdamW lr=1e-4, A100 80GB
- **Loss**: next-token prediction (causal LM)

## Citation

If you use this work, please cite:

```bibtex
@misc{raen2026vjepa_video_understanding,
  title={Event-Based Video Understanding via V-JEPA--LLM Alignment: From Event Segmentation to Visual-Semantic Mapping},
  author={Raen2264},
  year={2026},
  doi={10.5281/zenodo.19143611},
  url={https://doi.org/10.5281/zenodo.19143611},
  note={Model checkpoints: https://huggingface.co/2264K/vjepa2-qwen3.5-video-understanding}
}
```
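
## Example: Event Segmentation Sketch

The Architecture section describes event segmentation as cosine distance peak detection followed by event mean pooling, but does not show code for it. The following is a minimal sketch of one way this could work, not the released implementation. The function names, the `threshold` value of 0.3, and the exact peak rule (a local maximum of consecutive-frame distance above the threshold) are all assumptions made for illustration. It uses plain Python lists so it runs without dependencies; in practice the same logic would operate on `[N_frames, 1024]` tensors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 0.0


def segment_events(frames: Sequence[Vector], threshold: float = 0.3) -> List[List[float]]:
    """Split frames at peaks of consecutive-frame cosine distance,
    then mean-pool each segment into one event embedding.

    `threshold` is a hypothetical hyperparameter, not a value from the model card.
    """
    if not frames:
        return []

    # Distance between each frame and the next one.
    dists = [cosine_distance(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]

    # A boundary is a local maximum of the distance curve above the threshold.
    boundaries = []
    for i, d in enumerate(dists):
        left = dists[i - 1] if i > 0 else float("-inf")
        right = dists[i + 1] if i + 1 < len(dists) else float("-inf")
        if d >= threshold and d > left and d >= right:
            boundaries.append(i + 1)  # the new event starts at frame i + 1

    # Mean-pool the frames of each segment.
    starts = [0] + boundaries
    ends = boundaries + [len(frames)]
    events = []
    for s, e in zip(starts, ends):
        segment = frames[s:e]
        dim = len(segment[0])
        events.append([sum(f[k] for f in segment) / len(segment) for k in range(dim)])
    return events


# Toy input: two frames pointing one way, then two pointing another way.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_events = segment_events(toy_frames)
print(len(toy_events))  # 2
```

Each returned event embedding corresponds to one row of the `[N_events, 1024]` input to the Projection Layer, which is where the 8-15 tokens per video come from.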
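
## Example: Projection Parameter Count

As a sanity check on the projection size, the sketch below counts weights and biases for the three `nn.Linear` layers as laid out in `ProjectionV2` (1024 → 10240 → 10240 → 5120). The helper names are hypothetical. For those dimensions the count comes to about 167.8M, which is lower than the ~215M listed in this card, so the released checkpoints may contain parameters that the snippet does not show. Inspecting the loaded state dict is the reliable way to confirm.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameter count of one nn.Linear layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features


def projection_params(vjepa_dim: int = 1024, llm_dim: int = 5120) -> int:
    """Parameter count for the 3-layer MLP with hidden = 2 * llm_dim."""
    hidden = llm_dim * 2
    return (
        linear_params(vjepa_dim, hidden)
        + linear_params(hidden, hidden)
        + linear_params(hidden, llm_dim)
    )


total = projection_params()
print(f"{total / 1e6:.1f}M")  # 167.8M
```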