---
tags:
- fashion
- image-retrieval
- image-to-image
- siglip
- lookbench
- embedding
- vision-only
- fp16
- compressed
library_name: open_clip
pipeline_tag: image-feature-extraction
license: mit
language:
- en
metrics:
- recall
- ndcg
datasets:
- srpone/look-bench
- DeepFashion2
---

# MODA-Fashion-Vision-FP16

**Compressed vision-only encoder for fast fashion image-to-image retrieval: 4.2x smaller, same quality.**

This is the vision tower extracted from MODA-Fashion-Distilled, converted to FP16 half-precision. It strips the unused text encoder (image-to-image tasks never use it) and halves the weight precision, reducing model size from 775 MB to **186 MB** with only **-0.21 pp** quality loss on LookBench.

## Key Numbers

| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (vision tower only) |
| Parameters | 92.9M (vs 203M full CLIP) |
| Precision | float16 |
| Model Size | **186 MB** (vs 775 MB full CLIP) |
| Embedding Dim | 768 |
| Input Resolution | 224 x 224 |
| LookBench Fine R@1 | **67.42%** (full model: 67.63%) |

## LookBench Results (Fine Recall@1)

| Variant | Params | Size | RealStudio | AIGenStudio | RealStreet | AIGenStreet | **Overall** |
|---|---|---|---|---|---|---|---|
| MODA-Distilled (full CLIP) | 203M | 775 MB | 70.23 | 80.31 | 60.24 | 81.25 | **67.63** |
| **MODA-Vision-FP16 (this)** | 92.9M | 186 MB | 70.13 | 80.83 | 59.73 | 81.25 | **67.42** |
| FashionSigLIP baseline | 203M | 775 MB | 66.96 | 76.68 | 56.37 | 74.38 | 63.84 |

## Inference: Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU (keeps FP16 precision for speed)
python inference.py --image query.jpg --device cuda
```

### Python API

```python
import torch
import open_clip
import torch.nn.functional as F
from safetensors.torch import load_file
from PIL import Image

# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# The text tower is randomly initialized (we never use it). Only the visual tower
# is overwritten with MODA's fine-tuned weights below. This avoids the ~775 MB
# pretrained-checkpoint download that would otherwise come with
# hf-hub:Marqo/marqo-fashionSigLIP.
base_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

# Load MODA's vision-only fp16 weights (186 MB) and overlay onto the visual tower.
# The weights are upcast to float32 so that they match the freshly built model.
vision_sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
vision_sd_fp32 = {k: v.float() for k, v in vision_sd.items()}
full_sd = base_model.state_dict()
for k, v in vision_sd_fp32.items():
    full_sd[k] = v
base_model.load_state_dict(full_sd, strict=True)

encoder = base_model.visual
encoder.eval()

image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```

### Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```

## Why Vision-Only?

For **image-to-image retrieval** (the primary use case), the text encoder is never used. Stripping it provides:

- **54% fewer parameters** (92.9M vs 203M): faster forward pass
- **4.2x smaller on disk** (186 MB vs 775 MB): cheaper to deploy
- **Near-zero quality loss** (-0.21 pp Fine R@1): within noise margin

## Training

This model inherits its weights from MODA-Fashion-Distilled, which was trained and then exported in four steps:

1. **Knowledge distillation** from a 10-model ensemble (SigLIP, CLIP, EVA-CLIP, MetaCLIP, DFN variants)
2. **Stratified contrastive learning** on Google Shopping data (10K image pairs, category-balanced)
3. **Vision-only export**: text tower removed
4. **FP16 conversion**: weights cast from float32 to float16

## Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---:|---:|---|
| [MODA-Fashion-Distilled](https://huggingface.co/HopitAI/moda-fashion-distilled) | 768 | 67.63 | Best overall quality |
| [MODA-Fashion-Matryoshka](https://huggingface.co/HopitAI/moda-fashion-matryoshka) | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| **MODA-Fashion-Vision-FP16 (this model)** | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| [MODA-Fashion-Distilled-512d](https://huggingface.co/HopitAI/moda-fashion-distilled-512d) | 512 | 67.63 | Compact index, highest nDCG@5 |
| [MODA-Fashion-DeepFashion2](https://huggingface.co/HopitAI/moda-fashion-deepfashion2) | 768 | 66.52 | Simplest recipe, no distillation |

## License

MIT

## Citation

If you use this model, please cite:

```
@software{moda2026,
  title = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year = {2026},
  url = {https://github.com/hopit-ai/Moda}
}
```
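
## Example: Ranking a Gallery by Cosine Similarity

The Python API section produces L2-normalized embeddings, and the intended use is image-to-image retrieval. As a minimal, dependency-free sketch of that retrieval step (not the logic of the bundled `inference.py`, which is not shown here), the snippet below ranks a small gallery against a query by cosine similarity. The helper names `l2_normalize` and `rank_gallery`, the item ids, and the toy 4-d vectors are all hypothetical stand-ins for real 768-d encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (same idea as F.normalize with p=2)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_gallery(query, gallery, top_k=3):
    """Return (item_id, cosine_similarity) pairs, best match first.

    `gallery` maps hypothetical item ids to raw embedding vectors.
    """
    q = l2_normalize(query)
    scored = []
    for item_id, emb in gallery.items():
        g = l2_normalize(emb)
        # For unit vectors, the dot product equals the cosine similarity.
        scored.append((item_id, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 4-d vectors standing in for real 768-d embeddings (illustrative only).
gallery = {
    "dress_a": [0.9, 0.1, 0.0, 0.2],
    "dress_b": [0.1, 0.8, 0.3, 0.0],
    "jacket_c": [0.0, 0.2, 0.9, 0.1],
}
query = [1.0, 0.0, 0.1, 0.2]

results = rank_gallery(query, gallery, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

For a real index, the same computation is usually done as a single matrix product over a stacked, pre-normalized embedding matrix rather than a Python loop.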