---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternViT-300M-v2.5
  - internlm/Qwen2.5-1.5B
base_model_relation: merge
language:
  - multilingual
---

# QTuneVL1.5-2B

Developed by the [Reconova AI Lab](https://www.reconova.com/) and [BDAA-Lab](https://dm.ustc.edu.cn/index.html).

# Introduction

We’re excited to introduce QTuneVL1.5-2B, the latest in [Reconova AI Lab’s](https://www.reconova.com/) series of multimodal large language models. Building on [QTuneVL1-2B](https://huggingface.co/hanchaow/QTuneVL1-2B), it incorporates key features from both [InternVL](https://huggingface.co/OpenGVLab/InternVL2_5-2B) and [Mini-Monkey](https://huggingface.co/mx262/MiniMonkey) to improve performance.

Like QTuneVL1-2B, QTuneVL1.5-2B is a lightweight MLLM that incorporates cropping and padding strategies from [Mini-Monkey](https://huggingface.co/mx262/MiniMonkey)/[Ureader](https://arxiv.org/abs/2310.05126)/[InternVL](https://github.com/OpenGVLab/InternVL), and has been fine-tuned from [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B).

# Evaluation

We evaluated the model on eight benchmarks from the [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal) leaderboard using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). It outperforms its predecessor (QTuneVL1-2B) on average score, with the largest gains on the MMStar, MMMU_DEV_VAL, and OCRBench benchmarks. The eight benchmarks and detailed results are as follows:

**Eight benchmarks:** `'MMBench_DEV_EN_V11', 'MMStar', 'MMMU_DEV_VAL', 'MathVista_MINI', 'HallusionBench', 'AI2D_TEST', 'OCRBench', 'MMVet'`.
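
The AVG column in the results table below is consistent with a plain mean over the eight benchmarks once OCRBench, which is scored on a 0-1000 scale, is divided by 10. This rescaling is inferred from the reported numbers rather than stated by the authors, so treat the following as a minimal sketch under that assumption. The helper name `average_score` is hypothetical.

```python
# Scores copied from the results table. OCRBench is reported on a 0-1000 scale.
SCORES = {
    "QTuneVL1-2B": {
        "MMBench_DEV_EN_V11": 74.9, "MMStar": 53.9, "MMMU_DEV_VAL": 41.5,
        "MathVista_MINI": 48.8, "HallusionBench": 43.0, "AI2D_TEST": 75.2,
        "OCRBench": 806, "MMVet": 59.6,
    },
    "QTuneVL1.5-2B": {
        "MMBench_DEV_EN_V11": 79.6, "MMStar": 61.4, "MMMU_DEV_VAL": 51.1,
        "MathVista_MINI": 51.8, "HallusionBench": 43.0, "AI2D_TEST": 78.8,
        "OCRBench": 858, "MMVet": 62.1,
    },
}


def average_score(scores):
    """Mean over all benchmarks, with OCRBench rescaled to 0-100 (assumption)."""
    normalized = [
        value / 10 if name == "OCRBench" else value
        for name, value in scores.items()
    ]
    return sum(normalized) / len(normalized)


for model, scores in SCORES.items():
    print(model, round(average_score(scores), 1))
```

Under this assumption the sketch gives 59.7 for QTuneVL1-2B and 64.2 for QTuneVL1.5-2B, matching the AVG column.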

| Index | Model | AVG | MMBench_DEV_EN_V11 | MMStar | MMMU_DEV_VAL | MathVista_MINI | HallusionBench | AI2D_TEST | OCRBench | MMVet |
|:------:|------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | Mini-Monkey | 54.3 | 71.4 | 50.3 | 35.6 | 46.3 | 38.6 | 74.8 | 802 | 37.2 |
| 2 | InternVL2-2B | 54.2 | 71.4 | 50.3 | 34.6 | 47.2 | 38.2 | 74.2 | 783 | 39.8 |
| 3 | InternVL2_5-2B | 59.4 | 74.6 | 53.7 | 40.1 | 49.7 | 42.2 | 74.9 | 802 | 59.5 |
| 4 | InternVL3-2B | 63.5 | 79.6 | 61.1 | 48.6 | 51.1 | 42 | 78.4 | 835 | 64.08 |
| 5 | QTuneVL1-2B | 59.7 | 74.9 | 53.9 | 41.5 | 48.8 | 43.0 | 75.2 | 806 | 59.6 |
| 6 | QTuneVL1.5-2B | **64.2(+4.5)** | **79.6(+4.7)** | **61.4(+7.5)** | **51.1(+9.6)** | **51.8(+3)** | **43.0** | **78.8(+3.6)** | **858(+52)** | **62.1(+2.5)** |

Values in parentheses are gains over QTuneVL1-2B.

Note that when evaluating with **VLMEvalKit**, the GPT judge models being called differ slightly from the official ones. In the code (`vlmeval/dataset/utils/judge_util.py`), the mapping uses:

- `'gpt-4o-mini': 'gpt-4o-mini'` instead of `'gpt-4o-mini': 'gpt-4o-mini-2024-07-18'`
- `'gpt-4-turbo': 'gpt-4-turbo'` instead of `'gpt-4-turbo': 'gpt-4-1106-preview'`

As a result, the scores reported here may differ slightly from the official ones.

# Copyright

We welcome suggestions to help us improve QTuneVL. For any queries, please contact HanChao Wang: wanghanchao@reconova.com. If you find something interesting, feel free to share it with us by email or open an issue.