---
license: apache-2.0
library_name: transformers
tags:
- vision-language
- fine-grained-recognition
- visual-reasoning
- reinforcement-learning
- qwen
---

# DiVE-k QWEN2.5-7B (CUB)

## Overview

**DiVE-k QWEN2.5-7B-CUB** is a vision-language model fine-tuned using **DiVE-k (Differential Visual Reasoning using Top-k Generations)** on a fine-grained visual recognition domain (e.g., CUB).

DiVE-k reformulates fine-grained image classification as a *differential reasoning* problem. Instead of training the model to predict a single label, it leverages the model's own **top-k predictions** to construct a multiple-choice reasoning task. The model is then trained using reinforcement learning to select the correct answer among visually similar candidates, encouraging deeper visual discrimination and reasoning.

This approach improves zero-shot and base-to-novel generalization performance by teaching the model to compare subtle visual differences between competing hypotheses.

The training framework, data construction, and evaluation pipeline are described in detail in the DiVE-k repository.

👉 **Source code:** https://github.com/raja-kumar/DiVE-k

---

## Example Usage

Please refer to the official **DiVE-k GitHub repository** for:

- Model loading
- Inference pipelines
- Fine-grained classification setup
- Training and evaluation scripts

👉 https://github.com/raja-kumar/DiVE-k

---

## Citation

If you use this model or the DiVE-k framework in your research, please cite:

```bibtex
@misc{kumar2025divekdifferentialvisualreasoning,
  title={DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition},
  author={Raja Kumar and Arka Sadhu and Ram Nevatia},
  year={2025},
  eprint={2511.18305},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```