---
tags:
- fashion
- image-retrieval
- image-to-image
- siglip
- lookbench
- embedding
- vision-only
- fp16
- compressed
library_name: open_clip
pipeline_tag: image-feature-extraction
license: mit
language:
- en
metrics:
- recall
- ndcg
datasets:
- srpone/look-bench
- DeepFashion2
---

# MODA-Fashion-Vision-FP16

**Compressed vision-only encoder for fast fashion image-to-image retrieval: 4.2x smaller, near-identical quality.**

This is the vision tower extracted from MODA-Fashion-Distilled and converted to FP16 (half precision).
It drops the text encoder, which image-to-image retrieval never uses, and halves the weight precision,
reducing model size from 775 MB to **186 MB** at a cost of only **0.21 pp** Fine Recall@1 on LookBench (67.63% to 67.42%).

## Key Numbers

| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (vision tower only) |
| Parameters | 92.9M (vs 203M full model) |
| Precision | float16 |
| Model Size | **186 MB** (vs 775 MB full model) |
| Embedding Dim | 768 |
| Input Resolution | 224 x 224 |
| LookBench Fine R@1 | **67.42%** (full model: 67.63%) |

## LookBench Results (Fine Recall@1)

| Variant | Params | Size | RealStudio | AIGenStudio | RealStreet | AIGenStreet | **Overall** |
|---|---|---|---|---|---|---|---|
| MODA-Distilled (full model) | 203M | 775 MB | 70.23 | 80.31 | 60.24 | 81.25 | **67.63** |
| **MODA-Vision-FP16 (this)** | 92.9M | 186 MB | 70.13 | 80.83 | 59.73 | 81.25 | **67.42** |
| FashionSigLIP baseline | 203M | 775 MB | 66.96 | 76.68 | 56.37 | 74.38 | 63.84 |
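
For readers unfamiliar with the metric, the sketch below shows the common hit-rate definition of Recall@k: the fraction of queries with at least one relevant item among their top-k results. It is illustrative only. LookBench's exact "Fine" matching criteria are not described in this card, and the function name `recall_at_k` and the toy IDs are made up for the example.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries with at least one relevant item in their top-k results."""
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any of its first k results is relevant.
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data: one ranked result list and one set of relevant IDs per query.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"a"}, {"e"}, {"g", "x"}, {"z"}]

r1 = recall_at_k(ranked, relevant, k=1)  # queries 1 and 3 hit at rank 1
r3 = recall_at_k(ranked, relevant, k=3)  # query 2 also hits within the top 3
print(f"R@1 = {r1:.2%}, R@3 = {r3:.2%}")
```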

## Inference — Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU (keeps FP16 precision for speed)
python inference.py --image query.jpg --device cuda
```

### Python API

```python
import torch
import open_clip
import torch.nn.functional as F
from safetensors.torch import load_file
from PIL import Image

# Build the ViT-B-16-SigLIP architecture without pretrained weights. This skips the
# ~775 MB checkpoint download that hf-hub:Marqo/marqo-fashionSigLIP would trigger.
# The text tower stays randomly initialized (it is never used); only the visual
# tower is overwritten with MODA's fine-tuned weights below.
base_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

# Load MODA's vision-only FP16 weights (186 MB), cast them to FP32 to match the
# freshly built model, and overlay them onto the visual tower.
vision_sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
full_sd = base_model.state_dict()
full_sd.update({k: v.float() for k, v in vision_sd.items()})
base_model.load_state_dict(full_sd, strict=True)

encoder = base_model.visual
encoder.eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```
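
Once gallery images are embedded the same way, image-to-image retrieval reduces to a nearest-neighbor search over the normalized vectors. The snippet below is a minimal brute-force sketch of that step, not part of the bundled `inference.py`: the helper name `top_k_matches` is hypothetical, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def top_k_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Return (scores, indices) of the k gallery rows most similar to each query row."""
    query = F.normalize(query.float(), p=2, dim=-1)
    gallery = F.normalize(gallery.float(), p=2, dim=-1)
    sims = query @ gallery.T          # cosine similarity, shape [num_queries, num_gallery]
    k = min(k, gallery.shape[0])      # never ask for more results than gallery items
    return sims.topk(k, dim=-1)


# Stand-in data: random 768-d vectors instead of real embeddings.
torch.manual_seed(0)
gallery = torch.randn(100, 768)
query = gallery[42:43] + 0.01 * torch.randn(1, 768)  # near-duplicate of gallery item 42

scores, indices = top_k_matches(query, gallery, k=5)
print(indices[0].tolist(), scores[0].tolist())
```

For galleries too large to score with a single matrix product, an approximate nearest-neighbor index is the usual replacement for this brute-force step.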

### Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```

## Why Vision-Only?

For **image-to-image retrieval** (the primary use case), the text encoder is never used.
Dropping it and storing the remaining weights in FP16 gives:
- **54% fewer parameters** (92.9M vs 203M): less to load and hold in memory
- **4.2x smaller on disk** (186 MB vs 775 MB): cheaper to deploy
- **Near-zero quality loss** (-0.21 pp Fine R@1), within noise margin
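
The headline figures above follow directly from the numbers reported in this card. The short check below reproduces them. It uses only values quoted here and assumes MB means 10^6 bytes for the FP16 size estimate.

```python
# Figures quoted in this model card.
full_params_m, vision_params_m = 203.0, 92.9   # millions of parameters
full_size_mb, vision_size_mb = 775.0, 186.0    # reported file sizes
full_r1, vision_r1 = 67.63, 67.42              # LookBench Fine R@1 (%)

param_reduction = 1 - vision_params_m / full_params_m
size_ratio = full_size_mb / vision_size_mb
fp16_size_mb = vision_params_m * 2             # 2 bytes per float16 weight
r1_delta_pp = vision_r1 - full_r1

print(f"{param_reduction:.0%} fewer parameters")
print(f"{size_ratio:.1f}x smaller on disk")
print(f"~{fp16_size_mb:.1f} MB expected for FP16 weights")
print(f"{r1_delta_pp:+.2f} pp Fine R@1")
```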

## Training

This model inherits its weights from MODA-Fashion-Distilled. The first two steps below trained that model; the last two produced this checkpoint:
1. **Knowledge distillation** from a 10-model ensemble (SigLIP, CLIP, EVA-CLIP, MetaCLIP, DFN variants)
2. **Stratified contrastive learning** on Google Shopping data (10K image pairs, category-balanced)
3. **Vision-only export** — text tower removed
4. **FP16 conversion** — weights cast from float32 to float16
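
Steps 3 and 4 can be sketched as a filter-and-cast over a state dict. This is an illustration under stated assumptions rather than the authors' export script: `TinyTwoTower` is a hypothetical stand-in for a two-tower model, and the `visual.` key prefix is assumed to mark vision-tower weights.

```python
import torch
from torch import nn


class TinyTwoTower(nn.Module):
    """Hypothetical stand-in for a two-tower model with `visual` and `text` submodules."""

    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(8, 4)
        self.text = nn.Linear(8, 4)


def export_vision_fp16(model: nn.Module, prefix: str = "visual."):
    """Keep only tensors whose key starts with `prefix`; cast floating-point ones to float16."""
    out = {}
    for name, tensor in model.state_dict().items():
        if not name.startswith(prefix):
            continue  # drop text-tower (and any other) weights
        out[name] = tensor.half() if tensor.is_floating_point() else tensor.clone()
    return out


vision_sd = export_vision_fp16(TinyTwoTower())
print({name: t.dtype for name, t in vision_sd.items()})
# safetensors.torch.save_file(vision_sd, "vision_encoder.safetensors") would write the result to disk.
```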

## Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---:|---:|---|
| [MODA-Fashion-Distilled](https://huggingface.co/HopitAI/moda-fashion-distilled) | 768 | 67.63 | Best overall quality |
| [MODA-Fashion-Matryoshka](https://huggingface.co/HopitAI/moda-fashion-matryoshka) | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| **MODA-Fashion-Vision-FP16 (this model)** | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| [MODA-Fashion-Distilled-512d](https://huggingface.co/HopitAI/moda-fashion-distilled-512d) | 512 | 67.63 | Compact index, highest nDCG@5 |
| [MODA-Fashion-DeepFashion2](https://huggingface.co/HopitAI/moda-fashion-deepfashion2) | 768 | 66.52 | Simplest recipe, no distillation |

## License

MIT

## Citation

If you use this model, please cite:
```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```