---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-sa-4.0
tags:
- clip
- vision-language
- commonsense-reasoning
---

# ReasonCLIP

ReasonCLIP is a CLIP-style training framework that uses reasoning-aware supervision to improve visual representation learning. It leaves the underlying model architecture unchanged.

More details can be found in the paper [ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP](https://huggingface.co/papers/2606.26794).

- **Repository:** https://github.com/RISys-Lab/ReasonCLIP
- **License:** [CC-BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)

## How to Get Started with the Model

Because ReasonCLIP does not modify the model architecture, use the official Hugging Face `transformers` code path for loading and inference.

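```python
from PIL import Image
import requests
import torch
from transformers import AutoModel, AutoProcessor

# Load the checkpoint and its matching processor from the Hub.
model_id = "fesvhtr/RC-B32-S1"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Download a sample image from COCO val2017.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image in one call.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores: one row per image, one column per caption.
logits_per_image = outputs.logits_per_image
# Softmax over the captions turns the scores into per-caption probabilities.
probs = logits_per_image.softmax(dim=1)
```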
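To read the result as a zero-shot prediction, each probability has to be paired with the caption it belongs to. The following is a minimal sketch in plain Python of that last step: a softmax over one image's scores, followed by a ranking of the captions. The helper names `softmax` and `rank_labels` are hypothetical, and the numbers in `example_logits` are made-up placeholders, not outputs of this model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, logits):
    """Pair each caption with its probability, most likely first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["a photo of a cat", "a photo of a dog"]

# Placeholder scores for illustration only (assumption). With the real model,
# one image's scores would come from outputs.logits_per_image[0].tolist().
example_logits = [24.5, 19.0]

ranking = rank_labels(labels, example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.4f}")
```

The caption order passed to the processor is the column order of `logits_per_image`, so `labels` must list the captions in that same order.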

## Citation

```bibtex
@article{reasonclip2026,
  title={ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP},
  author={TBD},
  journal={arXiv preprint arXiv:2606.26794},
  year={2026}
}
```