---
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: cc-by-nc-sa-4.0
tags:
- clip
- vision-language
- commonsense-reasoning
---

# ReasonCLIP

ReasonCLIP is a CLIP-style training framework designed to improve visual representation learning with reasoning-aware supervision, without modifying the underlying model architecture.

More details can be found in the paper [ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP](https://huggingface.co/papers/2606.26794).

- **Repository:** https://github.com/RISys-Lab/ReasonCLIP
- **License:** [CC-BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)

## How to Get Started with the Model

ReasonCLIP does not modify any model architecture. To load the model and run inference, use the official Hugging Face `transformers` code path.

```python
from PIL import Image
import requests
from transformers import AutoModel, AutoProcessor

# Load the checkpoint and its matching processor from the Hub
model_id = "fesvhtr/RC-B32-S1"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Download a sample image from the COCO val2017 set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image in one call
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# Image-text similarity scores, one row per image and one column per caption
logits_per_image = outputs.logits_per_image
# Softmax over the captions turns the scores into per-caption probabilities
probs = logits_per_image.softmax(dim=1)
```

## Citation

```bibtex
@article{reasonclip2026,
  title={ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP},
  author={TBD},
  journal={arXiv preprint arXiv:2606.26794},
  year={2026}
}
```
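
## Interpreting the Output

The getting-started snippet ends with `probs`, a softmax over the candidate captions. As a minimal sketch of what that step does and how the result might be mapped back to the captions, the helper below applies the same softmax in plain Python and sorts the captions by probability. The function name `rank_labels` is hypothetical and not part of `transformers` or the ReasonCLIP repository, and the logit values are made up for illustration rather than taken from a real model run.

```python
import math


def rank_labels(labels, logits):
    """Apply a softmax to one image's logits and return (label, probability)
    pairs sorted from most to least likely.

    Hypothetical helper for illustration; not part of any library.
    """
    if len(labels) != len(logits):
        raise ValueError("labels and logits must have the same length")
    # Subtract the largest logit before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration only, not real model output
ranked = rank_labels(["a photo of a cat", "a photo of a dog"], [2.0, 0.0])
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

With the real model, the logits for the first image could be passed in as `logits_per_image[0].tolist()`, alongside the same caption list that was given to the processor.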