--- license: apache-2.0 language: - en library_name: transformers pipeline_tag: image-text-to-text base_model: Qwen/Qwen3-VL-8B-Instruct tags: - vision-language-model - vlm - reasoning - perception - rlvr - grpo - icml-2026 --- # VLM-CapCurriculum-Qwen3-VL-8B-Staged A vision-language model post-trained from **Qwen/Qwen3-VL-8B-Instruct** with the staged, capability-dimension curriculum from *"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models"* (ICML 2026). > **TL;DR.** Visual perception — not reasoning length — is the dominant bottleneck for visual reasoning in VLMs. We fix this by post-training along a **capability axis** (perception → textual reasoning → visual reasoning) rather than mixing all data together. | Resource | Link | |---|---| | 📄 Paper | https://arxiv.org/abs/2605.20177 | | 💻 Code | https://github.com/UCSC-VLAA/VLM-CapCurriculum | | 🌐 Project page | https://ucsc-vlaa.github.io/VLM-CapCurriculum | | 🤗 Collection (model + data + eval) | https://huggingface.co/collections/UCSC-VLAA/vlm-capcurriculum-from-seeing-to-thinking-icml-2026-6a07691f944148ccb2b183b8 | ## Headline numbers | Setting | Visual Math AVG | Perception AVG | Overall AVG | |---|:---:|:---:|:---:| | Qwen3-VL-8B (base) | 45.17 | 79.21 | 62.19 | | Qwen3-VL-8B + Merged training | 49.64 | 79.71 | 64.67 | | **Qwen3-VL-8B + Staged (this model)** | **51.10** | **80.44** | **65.77** | | OneThinker-8B (concurrent baseline) | 51.10 | 78.64 | 64.87 | Visual math = MathVista / MathVision / MathVerse(VI) / WeMath. Perception = A-OKVQA / RealWorldQA / MMStar / POPE. Compared with the merged baseline on the same backbone, this model also produces **20.8% shorter** reasoning traces — better perception lets the model think less. ## How it was trained Three RLVR stages with GRPO (on top of [EasyR1](https://github.com/hiyouga/EasyR1)): 1. **Stage 1 — visual perception** on `UCSC-VLAA/VLM-CapCurriculum-Perception` (synthesised + filtered DOCCI MCQs). 2. **Stage 2 — textual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-TextReasoning` (ORZ-Math-13k). 3. **Stage 3 — visual reasoning** on `UCSC-VLAA/VLM-CapCurriculum-VisualReasoning` (CLEVR-Math + GeoQA170K + Math PUMA + DocVQA + ArxivQA mix). All three stages share **one** system / format prompt — see [Inference](#inference) below. Detailed launch scripts: [`training/examples/qwen3_vl_8b/`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/tree/main/training/examples/qwen3_vl_8b) in the code repo. ## Inference The model expects the unified system prompt that it was trained against: ``` You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. The final answer MUST BE put in \boxed{}. i.e. reasoning here \boxed{final answer here} ``` Quick start with vLLM: ```bash vllm serve UCSC-VLAA/VLM-CapCurriculum-Qwen3-VL-8B-Staged \ --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --port 23341 ``` Then send chat completions including the system prompt above. For VLMEvalKit-style benchmark eval, plug it in via the `Qwen3_VL_8B_Staged` alias defined in [`evaluation/configs/models.py`](https://github.com/UCSC-VLAA/VLM-CapCurriculum/blob/main/evaluation/configs/models.py). ## Intended use & limitations Intended for research on vision-language reasoning, post-training methodology, and capability-dimension curriculum learning. Inherits the safety / bias profile of the underlying Qwen3-VL-8B-Instruct backbone; we have not added additional alignment fine-tuning. Not recommended for high-stakes deployments without further evaluation. The model was trained at the 8B parameter scale with 2048-token max prompt length and a fixed group size of 5. Behaviour at much longer contexts or substantially different prompt formats has not been characterised. ## License & citation Released under **Apache-2.0**, matching the upstream backbone. If you use this model, please cite: ```bibtex @inproceedings{vlmcapcurriculum2026, title = {From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models}, author = {Juncheng Wu and Hardy Chen and Haoqin Tu and Xianfeng Tang and Freda Shi and Hui Liu and Hanqing Lu and Cihang Xie and Yuyin Zhou}, booktitle = {Proceedings of the International Conference on Machine Learning (ICML)}, year = {2026}, eprint = {2605.20177}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, url = {https://arxiv.org/abs/2605.20177} } ``` ## Acknowledgements Built on top of [EasyR1](https://github.com/hiyouga/EasyR1), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), and the [Qwen3-VL](https://huggingface.co/Qwen) family.