---
license: mit
tags:
- vision-language
- composed-image-retrieval
- zero-shot
- patch-masking
- contrastive-learning
datasets:
- fashion-iq
- cirr
- circo
- genecis
pipeline_tag: feature-extraction
library_name: pytorch
---

# Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

This repository contains the official pre-trained and tuned model weights (CLIP-ViT-L/14 backbone) for **PLI (Pretrain like Your Inference)**, accepted at **ICME 2025**.

[![GitHub](https://img.shields.io/badge/GitHub-Code-blue?logo=github)](https://github.com/Chen-Junyang-cn/PLI)
[![arXiv](https://img.shields.io/badge/arXiv-2311.07622-b31b1b.svg?logo=arxiv)](https://arxiv.org/abs/2311.07622)

---

## 📌 Introduction

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image from a query image and a modifying text description, without using labeled triplet data during training.

Existing methods typically rely on vision-language models (e.g., CLIP) that are pre-trained on standard image-text pairs. This creates a gap between pre-training (matching static image-text pairs) and inference (matching modified image-text compositions).

**PLI (Pretrain like Your Inference)** bridges this gap by reformulating contrastive learning as a CIR task using a self-supervised **Masked Tuning** approach. By randomly masking patches of the input image, we generate triplets of $\langle \text{masked image}, \text{modifying text}, \text{original image} \rangle$, forcing the model to learn fine-grained text-guided modifications during pre-training.

---

## 🚀 Quick Start & Usage

You can download and load the model weights directly using the `huggingface_hub` SDK.

### 1. Installation

Ensure you have the necessary libraries installed. The `clip` package on PyPI is unrelated to OpenAI CLIP, so install CLIP from its GitHub repository:

```bash
pip install torch torchvision huggingface_hub
pip install git+https://github.com/openai/CLIP.git
```

### 2. Loading Weights in Python

Here is an example of how to programmatically download the weight file and load it into your model:

```python
import torch
import clip
from huggingface_hub import hf_hub_download

# 1. Download the weights from Hugging Face
checkpoint_path = hf_hub_download(
    repo_id="jayong/PLI-CLIP-VIT-L-14",
    filename="best.pth"
)

# 2. Initialize the base CLIP model (ViT-L/14)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# 3. Load the tuned weights
# (assumes best.pth stores a plain state dict whose keys match the CLIP model)
state_dict = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode

print("PLI model weights loaded successfully!")
```

---

## 🛠️ Method Overview

```
[ Original Image ] ──( Patch Masking )──> [ Masked Image ]
         │                                       │
         │                         ( Text-guided Modification )
         ▼                                       ▼
[ Target Representation ] <──( Contrastive )── [ Predicted Representation ]
```

1. **Patch Masking**: Randomly mask patches of the source image.
2. **Text Query**: Treat the text description of the image as the "modifying text".
3. **Contrastive Objective**: Align the composition of `(masked image + text)` with the representation of the `original image`.

---

## ✍️ Citation

If you find our work or weights useful in your research, please consider citing our paper:

```bibtex
@inproceedings{chen2025pretrain,
  title={Pretrain like your inference: Masked tuning improves zero-shot composed image retrieval},
  author={Chen, Junyang and Lai, Hanjiang},
  booktitle={2025 IEEE International Conference on Multimedia and Expo (ICME)},
  pages={1--6},
  year={2025},
  organization={IEEE}
}
```

---

## 📭 Contact / Feedback

For questions or feedback, please raise an issue on our [GitHub Repository](https://github.com/Chen-Junyang-cn/PLI).
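
---

## 🧪 Illustrative Sketch of Masked Tuning

The Method Overview steps (patch masking, then a contrastive objective that pulls the composed query toward the original image) can be illustrated with a small dependency-free Python sketch. This is not the released training code. The function names `mask_patches` and `info_nce`, the zero-fill masking, the patch size, the mask ratio, and the InfoNCE loss with temperature 0.07 are all assumptions made for illustration. How the masked image and text are composed into a single query embedding is not modeled here; the loss simply takes already-composed query vectors.

```python
import math
import random


def mask_patches(image, patch_size, mask_ratio, rng):
    """Zero out randomly chosen square patches of an H x W nested-list image.

    Returns the masked copy and the sorted (row, col) indices of masked patches.
    Zero-fill and the ratio-based patch count are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(round(mask_ratio * len(patches)))
    masked = sorted(rng.sample(patches, num_masked))
    out = [row[:] for row in image]  # copy so the original stays intact
    for r, c in masked:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                out[y][x] = 0.0
    return out, masked


def _normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def info_nce(queries, targets, temperature=0.07):
    """Mean InfoNCE loss where queries[i] is the positive match of targets[i].

    `queries` stand in for composed (masked image + text) embeddings and
    `targets` for original-image embeddings; both are hypothetical inputs.
    """
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(q)


# Toy usage: an 8x8 image of ones, 2x2 patches, half of the patches masked.
rng = random.Random(0)
image = [[1.0] * 8 for _ in range(8)]
masked_image, masked = mask_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)

# Toy embeddings: aligned pairs should yield a lower loss than mismatched pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(len(masked), aligned_loss < shuffled_loss)
```

In an actual training loop the image would be a tensor, the embeddings would come from the CLIP encoders, and the loss would be computed over a batch on the GPU; the sketch only shows how the self-supervised triplet and its objective fit together.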