---
license: apache-2.0
library_name: transformers
tags:
- vision-language
- fine-grained-recognition
- visual-reasoning
- reinforcement-learning
- qwen
---

# DiVE-k QWEN2.5-7B (CUB)

## Overview

**DiVE-k QWEN2.5-7B-CUB** is a vision-language model fine-tuned with **DiVE-k (Differential Visual Reasoning using Top-k Generations)** on the CUB fine-grained visual recognition dataset.

DiVE-k reformulates fine-grained image classification as a *differential reasoning* problem. Instead of training the model to predict a single label, it leverages the model’s own **top-k predictions** to construct a multiple-choice reasoning task. The model is then trained using reinforcement learning to select the correct answer among visually similar candidates, encouraging deeper visual discrimination and reasoning.
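The construction described above can be illustrated with a minimal sketch. It is not the repository's implementation: the helper names (`build_mcq`, `choice_reward`), the prompt wording, the rule for inserting a ground-truth label that is missing from the top-k list, and the binary reward are all assumptions made for illustration.

```python
import random
import re
from dataclasses import dataclass

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class MultipleChoiceItem:
    prompt: str          # question text with lettered options
    answer_letter: str   # letter of the ground-truth option


def build_mcq(top_k, ground_truth, seed=0):
    """Turn the model's top-k predicted labels into a multiple-choice question.

    Hypothetical helper: the real prompt format and option handling may differ.
    """
    # Drop duplicate predictions while keeping their original ranking.
    options = list(dict.fromkeys(top_k))
    if ground_truth not in options:
        # Assumption: swap the lowest-ranked candidate for the true label
        # so that every question has a correct option.
        options = options[:-1] + [ground_truth]
    # Shuffle so the correct answer is not tied to a fixed position.
    random.Random(seed).shuffle(options)
    lines = [f"{LETTERS[i]}. {name}" for i, name in enumerate(options)]
    prompt = "Which category does the image show?\n" + "\n".join(lines)
    return MultipleChoiceItem(prompt, LETTERS[options.index(ground_truth)])


def choice_reward(response, answer_letter):
    """Binary reward (assumed): 1.0 if the response starts with the correct letter."""
    # Accept forms such as "B", "b.", "(B)" or "B: ..." but not a bare class name.
    match = re.match(r"\s*\(?([A-Za-z])\)?(?:[.:\s]|$)", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == answer_letter else 0.0
```

In a reinforcement-learning loop, a scalar such as the one returned by `choice_reward` would score each sampled response to the generated prompt. The DiVE-k repository remains the reference for the actual reward design and prompt templates.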

This approach improves zero-shot and base-to-novel generalization performance by teaching the model to compare subtle visual differences between competing hypotheses.

The training framework, data construction, and evaluation pipeline are described in detail in the DiVE-k repository.

👉 **Source code:** https://github.com/raja-kumar/DiVE-k

---

## Example Usage

Please refer to the official **DiVE-k GitHub repository** for:
- Model loading
- Inference pipelines
- Fine-grained classification setup
- Training and evaluation scripts

👉 https://github.com/raja-kumar/DiVE-k

---

## Citation

If you use this model or the DiVE-k framework in your research, please cite:

```bibtex
@misc{kumar2025divekdifferentialvisualreasoning,
  title={DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition},
  author={Raja Kumar and Arka Sadhu and Ram Nevatia},
  year={2025},
  eprint={2511.18305},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```