---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-2B
---

# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).

**MLLMSeg** aims to segment image regions specified by referring expressions. While Multimodal Large Language Models (MLLMs) are proficient in semantic understanding, their token-generation approach often struggles with pixel-level dense prediction tasks like segmentation. To address this, MLLMSeg proposes a novel framework that fully leverages the inherent visual detail features encoded in the MLLM's vision encoder, eliminating the need for an extra visual encoder. It further introduces a detail-enhanced and semantic-consistent feature fusion module (DSFF) to integrate visual details with semantic features from the Large Language Model (LLM). Finally, a lightweight mask decoder (with only 34M parameters) is established to optimize the use of these features for precise mask prediction. This approach strikes a better balance between performance and computational cost compared to existing SAM-based and SAM-free methods.

The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)

## Model Architecture

## Quick Start / How to Use

This section explains how to run inference with our pre-trained model. Our models accept images of any size as input. The model outputs are normalized to relative coordinates in the 0-1000 range (e.g., a bounding box defined by its top-left and bottom-right corners). For visualization, you will need to convert these relative coordinates back to the original image dimensions.

### Installation

First, install the `transformers` library and the other necessary dependencies. Note that `flash-attn` requires a GPU for installation.

```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # Adjust for your CUDA version
pip install -r requirements.txt  # Assuming requirements.txt from the cloned repo
pip install flash-attn==2.3.6 --no-build-isolation  # Note: requires GPU to install
```

## Usage

Refer to the [GitHub README](https://github.com/jcwang0602/MLLMSeg) for usage details. The model response will contain the segmentation mask coordinates in a specific format (normalized to 0-1000). You will need to parse these coordinates and visualize the mask following the paper's methodology or the example scripts.

## Performance Metrics

### Referring Expression Segmentation

### Referring Expression Comprehension

### Generalized Referring Expression Segmentation

## Visualization

### Referring Expression Segmentation

### Referring Expression Comprehension

### Generalized Referring Expression Segmentation

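### Converting Normalized Coordinates to Pixels

As noted in the Quick Start, model outputs are normalized to a 0-1000 range and must be mapped back to the original image dimensions before visualization. The snippet below is a minimal sketch of that rescaling for a bounding box. The function name `denormalize_box` and the `(x1, y1, x2, y2)` ordering are assumptions made for illustration, not part of the official code, and parsing the raw model response is left out because its exact format is defined in the repository scripts.

```python
def denormalize_box(box, image_width, image_height, scale=1000):
    """Map a box from the normalized 0-`scale` range to pixel coordinates.

    `box` is assumed to be (x1, y1, x2, y2): top-left corner followed by
    bottom-right corner. This ordering is an assumption for illustration.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * image_width / scale),
        round(y1 * image_height / scale),
        round(x2 * image_width / scale),
        round(y2 * image_height / scale),
    )


# Example: a normalized box on a 640x480 image.
print(denormalize_box((250, 100, 750, 900), image_width=640, image_height=480))
# -> (160, 48, 480, 432)
```

The x values are scaled by the image width and the y values by the image height, so the same normalized box maps correctly onto images of any size or aspect ratio.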
## Citation

If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107},
}
```