---
license: apache-2.0
tags:
- clip
- feature-extraction
- remote-sensing
base_model:
- chendelong/RemoteCLIP
---

# Remote-CLIP-ViT-L-14

This model is a mirror/redistribution of the original [RemoteCLIP](https://huggingface.co/chendelong/RemoteCLIP) model.

## Original Repository and Links

- **Original Hugging Face Model**: [chendelong/RemoteCLIP](https://huggingface.co/chendelong/RemoteCLIP)
- **Official GitHub Repository**: [ChenDelong1999/RemoteCLIP](https://github.com/ChenDelong1999/RemoteCLIP)

## Description

RemoteCLIP is a vision-language foundation model for remote sensing, trained on a large-scale dataset of remote sensing image-text pairs. It is based on the CLIP architecture and is designed to handle the unique characteristics of remote sensing imagery.

## How to use

### With `transformers`

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load model and processor
model = CLIPModel.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")

# Load and process image
image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=["a photo of a building", "a photo of vegetation", "a photo of water"],
    images=image,
    return_tensors="pt",
    padding=True
)

# Compute image-text similarity logits and convert them to
# probabilities over the text prompts
with torch.inference_mode():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

print(f"Prompt probabilities: {probs}")
```

**Zero-shot image classification:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")

# Define candidate labels
candidate_labels = [
    "a satellite image of urban area",
    "a satellite image of forest",
    "a satellite image of agricultural land",
    "a satellite image of water body"
]

image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.inference_mode():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)

# Get the predicted label for the single input image (row 0)
predicted_idx = probs[0].argmax().item()
print(f"Predicted label: {candidate_labels[predicted_idx]}")
print(f"Confidence: {probs[0][predicted_idx]:.4f}")
```

**Extracting individual features:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/Remote-CLIP-ViT-L-14")

# Get image features only
image = Image.open("path/to/your/image.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)

# Get text features only
text_inputs = processor(
    text=["a satellite image of urban area"],
    return_tensors="pt",
    padding=True,
    truncation=True
)
with torch.inference_mode():
    text_features = model.get_text_features(**text_inputs)

print(f"Image features shape: {image_features.shape}")
print(f"Text features shape: {text_features.shape}")
```

### With `diffusers`

This model's text encoder can be used with Stable Diffusion and other diffusion models:

```python
from transformers import CLIPTextModel, CLIPTokenizer
import torch

# Load the text encoder and tokenizer.
# A Hub repo id has the form "namespace/name", so the nested folder goes in
# `subfolder` (assumes the weights live under diffusers/text_encoder in this repo).
# float16 is intended for GPU use; switch to torch.float32 when running on CPU.
text_encoder = CLIPTextModel.from_pretrained(
    "BiliSakura/Remote-CLIP-ViT-L-14",
    subfolder="diffusers/text_encoder",
    torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained(
    "BiliSakura/Remote-CLIP-ViT-L-14"
)

# Encode text prompt
prompt = "a satellite image of a city with buildings and roads"
text_inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.inference_mode():
    text_outputs = text_encoder(text_inputs.input_ids)
    text_embeddings = text_outputs.last_hidden_state

print(f"Text embeddings shape: {text_embeddings.shape}")
```

**Using with Stable Diffusion:**

```python
from diffusers import StableDiffusionPipeline
import torch

# Load pipeline with custom text encoder
# (`text_encoder` and `tokenizer` come from the previous snippet)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate image
prompt = "a high-resolution satellite image of urban area"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_image.png")
```

## Citation

If you use this model in your research, please cite the original work:

```bibtex
@article{remoteclip,
  author  = {Fan Liu and Delong Chen and Zhangqingyun Guan and Xiaocong Zhou and Jiale Zhu and Qiaolin Ye and Liyong Fu and Jun Zhou},
  title   = {RemoteCLIP: {A} Vision Language Foundation Model for Remote Sensing},
  journal = {{IEEE} Transactions on Geoscience and Remote Sensing},
  volume  = {62},
  pages   = {1--16},
  year    = {2024},
  url     = {https://doi.org/10.1109/TGRS.2024.3390838},
  doi     = {10.1109/TGRS.2024.3390838},
}
```
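
**How the image-text scores are formed (illustrative sketch):**

The `transformers` examples in this card apply a softmax to `logits_per_image`. In CLIP-style models those logits are cosine similarities between the L2-normalized image and text embeddings, multiplied by a learned temperature. The standard-library sketch below reproduces that arithmetic on toy 3-dimensional vectors. The function names, the toy embeddings, and the value `logit_scale=100.0` are assumptions made for illustration; the real scale is stored in the checkpoint (exposed as `model.logit_scale.exp()`), and real embeddings have far more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    logit_scale is a hypothetical stand-in for the model's learned temperature.
    """
    img = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        txt = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax: subtract the largest logit before exponentiating
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (not produced by the model)
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # closest in direction to the image embedding
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

probs = clip_style_probs(image_embedding, text_embeddings)
best_idx = probs.index(max(probs))
print(f"Probabilities: {probs}")
print(f"Best-matching text index: {best_idx}")
```

Because the temperature is large, small differences in cosine similarity become sharply peaked probabilities, so the softmax output should be read as a ranking over the supplied prompts rather than as a calibrated confidence.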