---
license: apache-2.0
datasets:
- HuggingFaceFV/finevideo
- APRIL-AIGC/UltraVideo
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- text-generation-inference
- Document
- Video
- Reasoning
- VIREX
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

![VIREX-062225-exp .png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/U7tipNhzPydO-wws_Gqs1.png)

# **VIREX-062225-exp**

> The **VIREX-062225-exp** (Video Information Retrieval and Extraction eXpert - experimental) model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **advanced video understanding**, **image comprehension**, **reasoning**, and **natural language decision-making through long chain-of-thought (CoT) reasoning**. Built on the Qwen2.5-VL architecture, this experimental model is designed to extract meaningful insights from visual content through video-image frame sampling and multimodal reasoning.

> [!note]
> VIREX: Video Information Retrieval and Extraction eXpert [ experimental ]

# Key Enhancements

* **Advanced Video Information Retrieval**: Understands complex video sequences, extracts key information, and provides detailed analysis of visual narratives across extended durations.

* **Enhanced Image Understanding with Physical Common Sense**: Designed to comprehend real-world physics, spatial relationships, and context in both static images and dynamic video content.

* **Long Chain-of-Thought Reasoning**: Follows multi-step reasoning pathways to produce detailed, logical explanations and decisions in natural language.

* **Custom Video-Image Frame Sampling**: Uses a redesigned dataset methodology with frame sampling techniques tailored to training on video understanding tasks.
* **Multimodal Decision Making**: Supports complex decision-making by combining visual information with natural language processing and contextual understanding.

* **State-of-the-Art Video Comprehension**: Trained on a modular combination of the FineVideo and UltraVideo datasets to improve performance on video understanding benchmarks.

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; dtype and device placement are chosen automatically.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/VIREX-062225-exp", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/VIREX-062225-exp")

# A single user turn containing one video and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/your/video.mp4",
            },
            {"type": "text", "text": "Analyze this video and explain the physical interactions you observe using chain-of-thought reasoning."},
        ],
    }
]

# Render the chat template to a prompt string (not yet tokenized).
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Extract the image and video inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on.
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

# Intended Use

This model is intended for:

* **Video Content Analysis**: Deep understanding of video sequences, temporal relationships, and narrative comprehension.

* **Physical Common Sense Reasoning**: Analysis of real-world physics, object interactions, and spatial relationships in visual content.
* **Chain-of-Thought Video Q&A**: Detailed reasoning and explanation for complex video-based questions, with step-by-step logical analysis.

* **Temporal Information Extraction**: Retrieval of time-sensitive information and sequential understanding from video content.

* **Multimodal Decision Support**: Integration of visual understanding with natural language reasoning for decision-making applications.

* **Educational and Research Applications**: Analysis of instructional videos, research content, and educational material with detailed explanations.

* **Content Summarization**: Summarization of video content with contextual understanding and key insight extraction.

# Limitations

* **Experimental Status**: As an experimental model, its performance may vary across use cases, and it requires further validation.

* **Computational Requirements**: Video understanding tasks demand substantial memory and compute; the model is not optimized for real-time applications.

* **Video Length Constraints**: Performance may degrade on extremely long videos due to context window limitations.

* **Domain Specificity**: Optimized primarily for general video understanding; specialized domains may require additional fine-tuning.

* **Frame Sampling Dependency**: Performance depends on the quality and relevance of the frames sampled during inference.

* **Reasoning Complexity**: Although the model is capable of chain-of-thought reasoning, extremely complex logical chains may still present challenges.
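The frame sampling and video length limitations above interact: a long clip has to be reduced to a bounded number of frames before it fits in the context window. As an illustration only, the sketch below picks evenly spaced frame indices from a clip. It is not the sampling method used to train this model, and the helper name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across a clip.

    Illustrative helper (hypothetical name); not the model's own sampler.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    # Split the clip into equal-width segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Example: choose 4 frames from a 100-frame clip.
print(uniform_frame_indices(100, 4))
```

Midpoint sampling avoids always selecting the first frame, which is often a title card or fade-in. For clips with uneven pacing, a content-aware strategy may work better.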
# Model Capabilities

## Video Understanding
- Temporal sequence analysis
- Object tracking and identification
- Scene transition recognition
- Action and activity recognition

## Physical Common Sense
- Physics-based reasoning
- Spatial relationship understanding
- Cause-and-effect analysis
- Real-world interaction comprehension

## Chain-of-Thought Reasoning
- Step-by-step logical analysis
- Detailed explanation generation
- Multi-step problem solving
- Contextual reasoning pathways

## Training Details

| Parameter              | Value                                                            |
|------------------------|------------------------------------------------------------------|
| **Dataset Size**       | 11,750 samples (modular combination of FineVideo and UltraVideo) |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration`                             |
| **Hardware**           | 3 × NVIDIA A40 (27 vCPUs)                                        |
| **Total Disk**         | 250,000 MB                                                       |
| **Training Time**      | 4,489 seconds (~1.25 hours)                                      |
| **Learning Rate**      | 1e-5                                                             |
| **Scheduler**          | Linear Decay                                                     |
| **Warmup Steps**       | 500                                                              |
| **Precision**          | bfloat16                                                         |
| **Training Method**    | Custom dataset with redesigned video-to-image frame sampling     |

## References

- **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

- **Chain-of-Thought Prompting Elicits Reasoning in Large Language Models**
  [https://arxiv.org/pdf/2201.11903](https://arxiv.org/pdf/2201.11903)

- **Video Understanding with Large Language Models: A Survey**
  [https://arxiv.org/pdf/2312.17432](https://arxiv.org/pdf/2312.17432)