---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternViT-300M-v2.5
  - internlm/Qwen2.5-1.5B
base_model_relation: merge
language:
  - multilingual
---
# QTuneVL1.5-2B developed by the [Reconova AI Lab](https://www.reconova.com/) and [BDAA-Lab](https://dm.ustc.edu.cn/index.html)

# Introduction


We’re excited to introduce QTuneVL1.5-2B, the latest in [Reconova AI Lab’s](https://www.reconova.com/) series of multimodal large language models. Building on [QTuneVL1-2B](https://huggingface.co/hanchaow/QTuneVL1-2B), it incorporates key features from both [InternVL](https://huggingface.co/OpenGVLab/InternVL2_5-2B) and [Mini-Monkey](https://huggingface.co/mx262/MiniMonkey) to improve benchmark performance.

Like QTuneVL1-2B, QTuneVL1.5-2B is a lightweight MLLM that adopts the cropping and padding strategies of [Mini-Monkey](https://huggingface.co/mx262/MiniMonkey), [UReader](https://arxiv.org/abs/2310.05126), and [InternVL](https://github.com/OpenGVLab/InternVL). It is fine-tuned from [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B).
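To give a rough idea of what a cropping strategy in this family involves, here is a simplified sketch of aspect-ratio-based tile-grid selection, the general idea behind InternVL-style dynamic cropping. It is not the preprocessing code of QTuneVL1.5-2B: the function names, the `max_tiles=12` budget, and the tie-breaking rule are assumptions made for illustration only.

```python
def candidate_grids(min_tiles=1, max_tiles=12):
    """Enumerate (cols, rows) tile grids whose tile count fits the budget.

    Grids are sorted by tile count (then by cols) so that, on a tie,
    the grid with fewer tiles is preferred.
    """
    grids = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def closest_grid(width, height, grids):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(grids, key=lambda g: abs(aspect - g[0] / g[1]))


grids = candidate_grids()
print(closest_grid(800, 400, grids))   # wide image
print(closest_grid(448, 1344, grids))  # tall image
```

Under these assumptions, the image would then be resized to fit the chosen grid and split into equally sized tiles before being passed to the vision encoder.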

# Evaluation

We evaluated the model on eight benchmarks from the [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal) leaderboard using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). It outperforms its predecessor (QTuneVL1-2B) on average score, particularly on the MMStar, MMMU_DEV_VAL, and OCRBench benchmarks. The eight benchmarks and detailed results are listed below; values in parentheses in the last row are gains over QTuneVL1-2B.

**Eight benchmarks:** `'MMBench_DEV_EN_V11', 'MMStar', 'MMMU_DEV_VAL', 'MathVista_MINI', 'HallusionBench', 'AI2D_TEST', 'OCRBench', 'MMVet'`.

| Index | Model | AVG | MMBench_DEV_EN_V11 | MMStar | MMMU_DEV_VAL | MathVista_MINI | HallusionBench | AI2D_TEST | OCRBench | MMVet |
|:------:|------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | Minimonkey | 54.3 | 71.4 | 50.3 | 35.6 | 46.3 | 38.6 | 74.8 | 802 | 37.2 |
| 2 | InternVL2-2B | 54.2 | 71.4 | 50.3 | 34.6 | 47.2 | 38.2 | 74.2 | 783 | 39.8 |
| 3 | InternVL2_5-2B | 59.4 | 74.6 | 53.7 | 40.1 | 49.7 | 42.2 | 74.9 | 802 | 59.5 |
| 4 | InternVL3-2B | 63.5 | 79.6 | 61.1 | 48.6 | 51.1 | 42 | 78.4 | 835 | 64.08 |
| 5 | QTuneVL1-2B | 59.7 | 74.9 | 53.9 | 41.5 | 48.8 | 43.0 | 75.2 | 806 | 59.6 |
| 6 | QTuneVL1.5-2B | **64.2(+4.5)** | **79.6(+4.7)** | **61.4(+7.5)** | **51.1(+9.6)** | **51.8(+3.0)** | **43.0** | **78.8(+3.6)** | **858(+52)** | **62.1(+2.5)** |
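The AVG column is consistent with a plain mean over the eight benchmarks after dividing the OCRBench score (reported on a 0 to 1000 scale) by 10. That normalization is inferred from the numbers in the table rather than stated explicitly, so treat the following as a sanity-check sketch; the function name `average_score` is hypothetical.

```python
def average_score(scores, ocr_key="OCRBench"):
    """Mean of benchmark scores, with OCRBench rescaled from 0-1000 to 0-100.

    The rescaling is an assumption inferred from the table's AVG column.
    """
    normalized = [
        value / 10 if name == ocr_key else value
        for name, value in scores.items()
    ]
    return sum(normalized) / len(normalized)


# Scores copied from rows 5 and 6 of the table above.
qtunevl1 = {
    "MMBench_DEV_EN_V11": 74.9, "MMStar": 53.9, "MMMU_DEV_VAL": 41.5,
    "MathVista_MINI": 48.8, "HallusionBench": 43.0, "AI2D_TEST": 75.2,
    "OCRBench": 806, "MMVet": 59.6,
}
qtunevl1_5 = {
    "MMBench_DEV_EN_V11": 79.6, "MMStar": 61.4, "MMMU_DEV_VAL": 51.1,
    "MathVista_MINI": 51.8, "HallusionBench": 43.0, "AI2D_TEST": 78.8,
    "OCRBench": 858, "MMVet": 62.1,
}

avg_old = round(average_score(qtunevl1), 1)
avg_new = round(average_score(qtunevl1_5), 1)
print(avg_old, avg_new, round(avg_new - avg_old, 1))  # 59.7 64.2 4.5
```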

Note that the GPT judge models called in our **VLMEvalKit** evaluation differ slightly from the official ones. In `vlmeval/dataset/utils/judge_util.py`, the mapping we used is:

- `'gpt-4o-mini': 'gpt-4o-mini'` instead of `'gpt-4o-mini': 'gpt-4o-mini-2024-07-18'`
- `'gpt-4-turbo': 'gpt-4-turbo'` instead of `'gpt-4-turbo': 'gpt-4-1106-preview'`

As a result, scores on the GPT-judged benchmarks may differ slightly from the official ones.
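The difference can be written out as two small mappings. This is an illustrative sketch only: the dictionary names and the `changed_judges` helper are hypothetical and are not part of VLMEvalKit; only the key and value strings come from the list above.

```python
# Judge-name mapping with pinned model versions (the "official" entries above).
OFFICIAL_JUDGES = {
    "gpt-4o-mini": "gpt-4o-mini-2024-07-18",
    "gpt-4-turbo": "gpt-4-1106-preview",
}

# Mapping used for the results in this card: unpinned model aliases.
USED_JUDGES = {
    "gpt-4o-mini": "gpt-4o-mini",
    "gpt-4-turbo": "gpt-4-turbo",
}


def changed_judges(official, used):
    """Return {judge: (official_model, used_model)} for entries that differ."""
    return {
        name: (official[name], used.get(name))
        for name in official
        if used.get(name) != official[name]
    }


for name, (pinned, alias) in changed_judges(OFFICIAL_JUDGES, USED_JUDGES).items():
    print(f"{name}: official={pinned}, used={alias}")
```

To reproduce the numbers in the table, the same judge mapping would need to be used.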

# Contact
We welcome suggestions to help us improve QTuneVL. For any questions, please contact HanChao Wang: wanghanchao@reconova.com. If you find something interesting, feel free to share it with us by email or by opening an issue.