---
title: Visual Product Recommender
emoji: 🛍️
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
license: mit
---
# Visual Product Recommender: Find Similar Products by Image or Text
**Assignment #3 · Embeddings, RecSys, Spaces · June 2026**
## Video Presentation
*(The recorded walk-through is also in the "Presentation" tab of the app above.)*
## Overview
This project builds a recommendation app on the vision modality. The idea is simple:
given a product (a photo you upload or a short text description), return the 3 most similar products from a real e-commerce catalogue.
The pipeline goes from a raw Hugging Face image dataset to a working Gradio Space: I load the
catalogue, explore it, turn every product image into a CLIP embedding, check that similar
products really do land near each other (using clustering and 2D plots), save the embeddings to a
file, and serve the Top-3 most similar products live. On a held-out test, the top result is in the
right category about 39% of the time, roughly 3x a random guess, and the strongest demo is
uploading a photo (for example, a Canon camera returns three other cameras at about 0.94 similarity).
## Dataset
Downloaded directly from Hugging Face with `datasets.load_dataset`.
| Property | Details |
|---|---|
| Source | [`Shopify/product-catalogue`](https://huggingface.co/datasets/Shopify/product-catalogue) |
| Total size | 48,289 products (train 38,631 + test 9,658) |
| Used here | the 38,631-row train split; embeddings on a balanced 11,912 sample |
| Modality | product images (median ~900 x 900 px) + metadata |
| Top-level categories | 25 (Home & Garden, Sporting Goods, Electronics, Apparel, etc.) |
| Brands | 24,245 (very long tail, none dominant) |
| Target / task | unsupervised: visual similarity retrieval (no label is predicted) |
### Feature Mix
| Type | Columns |
|---|---|
| Image | `product_image` (the signal the whole project is built on) |
| Categorical | `ground_truth_category` (Google taxonomy, median depth 4), `ground_truth_brand`, `ground_truth_is_secondhand` |
| Textual | `product_title`, `product_description` |
| Derived | `top_category` (first taxonomy segment, used as the EDA/clustering label) |
## Setup
```python
# Core numerics and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Model and data loading
import torch
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor

# Clustering, projections and nearest-neighbour search
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import umap
import faiss

SEED = 42  # random seed reused for reproducibility
```
Built on an Apple M-series laptop (CLIP runs on the `mps` device); the Space itself runs on CPU.
## Data Loading and Preparation
Loading: `load_dataset("Shopify/product-catalogue", split="train")` pulls the images and metadata
straight from the Hub (about 7 GB across 15 shards, cached after the first run).
Top-level category: the raw category is a deep path like *Home & Garden > Decor > Piggy Banks*.
I take the first segment as `top_category`, which gives 25 interpretable groups to reason about.
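
A minimal sketch of that derivation, assuming the taxonomy path always uses ` > ` as its separator (as in the example above). The helper name `top_level` is illustrative and not taken from the notebook.

```python
def top_level(category_path: str, sep: str = " > ") -> str:
    """Return the first segment of a taxonomy path such as 'A > B > C'."""
    # Assumption: segments are joined by ' > ', as in the example in the text.
    return category_path.split(sep)[0].strip()


paths = [
    "Home & Garden > Decor > Piggy Banks",
    "Electronics",  # a path with a single segment is returned unchanged
]
top_categories = [top_level(p) for p in paths]
print(top_categories)  # ['Home & Garden', 'Electronics']
```

On a pandas DataFrame the same helper could be applied with `df["ground_truth_category"].map(top_level)`.
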
Balanced sampling: the categories are imbalanced, so embedding the raw data would let a few big
categories dominate the clusters. I drop categories with fewer than 150 products and then cap each
remaining category so the embedded sample is balanced: 11,912 products across 18 categories.
This keeps the analysis in the sweet spot the brief asks for (large enough to show structure, small
enough to stay fast).
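
The drop-then-cap step could look roughly like the sketch below. The 150-product minimum comes from the text; the `cap` value and the function name `balanced_sample` are assumptions for illustration, since the per-category cap is not stated here.

```python
import pandas as pd


def balanced_sample(df, label_col="top_category", min_count=150, cap=700, seed=42):
    """Drop rare categories, then keep at most `cap` random rows per category."""
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_count].index             # categories with enough products
    kept = df[df[label_col].isin(keep)]
    shuffled = kept.sample(frac=1.0, random_state=seed)   # shuffle so head() is a random draw
    return shuffled.groupby(label_col).head(cap).reset_index(drop=True)


# Toy catalogue: one large, one medium and one too-small category.
toy = pd.DataFrame({"top_category": ["A"] * 1000 + ["B"] * 200 + ["C"] * 100})
sample = balanced_sample(toy, cap=300)
print(sample["top_category"].value_counts().to_dict())
```

In the toy run, category C is dropped (fewer than 150 rows), A is capped at 300, and B keeps all 200 of its rows.
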
Embedding storage: every image embedding is L2-normalized and saved with its metadata and a
thumbnail to `catalog.parquet`, which the app loads at runtime.
## Exploratory Data Analysis (EDA)
Because this is an image dataset, the EDA looks at both the metadata and the images themselves.
### 1. Category Distribution

Home & Garden, Sporting Goods and Arts & Entertainment dominate, while categories like Media and
Luggage are tiny. This imbalance is exactly why I sample in a balanced way before embedding, instead
of feeding the raw distribution into K-Means.
### 2. Brands (the long tail)

There are over 24,000 brands and not one of them owns a meaningful share of the catalogue. So brand
is useless as a grouping signal here. The thing that actually separates products is what they
look like, which is the whole motivation for using image embeddings.
### 3. Text Field Lengths

Titles are short (a few words), descriptions vary a lot, and a good chunk of the titles are not in
English (Hebrew, Japanese, Dutch, Portuguese all show up). That multilinguality is worth keeping in
mind for the text-query mode later, and it is another reason the image is the more reliable signal.
### 4. What the Products Look Like

One random product per category. Two things jump out: the photos are clean, single-product studio
shots, and almost all of them sit on a white background. That observation turns out to matter a lot.
### 5. Image Dimensions

Most images are close to square at around 900 x 900 px (aspect ratio clustered near 1.0). This is
consistent product-listing imagery, which is good news: I do not have to worry about wildly different
shapes confusing the encoder, and CLIP resizes everything to 224 x 224 anyway.
### 6. Brightness and Composition

This is the most important EDA finding. Median brightness is about 205 out of 255, and roughly
70% of products have a near-white background. White backgrounds make products from different
categories look similar to each other, which is the thread that runs through the rest of the project:
it is why the clustering silhouette is modest, and why the recommender works best on clean,
single-product photos rather than busy lifestyle shots.
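
For readers who want to reproduce this kind of statistic, here is one possible way to measure it. This is a sketch under stated assumptions: the border width, the grey level of 240 and the 90% rule are illustrative thresholds, not the values used in the notebook.

```python
import numpy as np


def brightness_stats(rgb, border=8, white_level=240, white_frac=0.9):
    """Return (mean brightness 0-255, True if the outer frame is mostly near-white)."""
    gray = rgb.astype(np.float32).mean(axis=2)       # simple per-pixel grey level
    mask = np.ones(gray.shape, dtype=bool)
    mask[border:-border, border:-border] = False     # True only on the outer frame
    frame = gray[mask]
    near_white = (frame >= white_level).mean() >= white_frac
    return float(gray.mean()), bool(near_white)


# Synthetic "product shot": a dark square on a white canvas.
img = np.full((100, 100, 3), 255, dtype=np.uint8)
img[30:70, 30:70] = 40
mean_brightness, white_bg = brightness_stats(img)
print(round(mean_brightness, 1), white_bg)
```

A real image can be passed in as `np.asarray(pil_image.convert("RGB"))`; the median over the catalogue and the share of `True` flags then give the two numbers quoted above.
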
## Embeddings
I use [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32), a
small/medium CLIP model. The reason for CLIP specifically is that it embeds images and text into
the same 512-dimensional space, which is what lets one app accept either a photo or a text query
and compare both against the same catalogue vectors.
Each product image is passed through the vision encoder, projected into the shared space, and
L2-normalized (so a dot product equals cosine similarity). I process the images in small batches so
it does not run out of memory, then store the 11,912 vectors in `catalog.parquet`.
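
The batching and normalisation can be sketched independently of the model. In the snippet below the encoder is passed in as a callable; `fake_encoder` is a hypothetical stand-in for CLIP's image encoder, and the batch size of 32 is an assumption.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def embed_in_batches(items, encode_batch, batch_size=32):
    """Encode `items` chunk by chunk and return one normalised matrix."""
    chunks = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        chunks.append(np.asarray(encode_batch(batch), dtype=np.float32))
    return l2_normalize(np.vstack(chunks))


rng = np.random.default_rng(42)


def fake_encoder(batch):
    """Hypothetical stand-in for the CLIP image encoder: random 512-d vectors."""
    return rng.normal(size=(len(batch), 512))


emb = embed_in_batches(list(range(100)), fake_encoder, batch_size=32)
print(emb.shape, np.allclose(np.linalg.norm(emb, axis=1), 1.0))
```

Because every row has unit length, `emb @ emb[0]` gives the cosine similarity of each product to the first one without any extra division.
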
## Clustering: is the embedding space meaningful?
Before trusting nearest-neighbour search, I check that the embeddings actually organise products
sensibly. CLIP vectors have 512 dimensions, which is a lot for K-Means to handle well, so I first
reduce them to 50 PCA components (this keeps most of the useful signal and drops some of the noise).
### 7. Choosing K with the Silhouette Score

I tested K from 4 to 10. The silhouette score stays low and fairly flat (about 0.08 at its best,
K=4), so there is no single obvious number of clusters and K-Means only finds broad groups. This
fits the white-background overlap from EDA #6: products from different categories look similar, so
they are hard to separate cleanly. I use K=4 as a simple summary and lean on the next two plots,
not the silhouette, to judge whether the grouping is real.
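
The sweep described above can be sketched as follows. The PCA size (50) and the K range (4 to 10) come from the text; `n_init=10`, the function name `silhouette_sweep` and the synthetic data are assumptions used only to keep the example runnable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

SEED = 42


def silhouette_sweep(emb, k_values=range(4, 11), n_components=50):
    """Reduce with PCA, then score K-Means for each K by silhouette."""
    reduced = PCA(n_components=n_components, random_state=SEED).fit_transform(emb)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=SEED).fit_predict(reduced)
        scores[k] = float(silhouette_score(reduced, labels))
    return scores


# Synthetic stand-in for the CLIP matrix: 300 random unit vectors in 512-d.
rng = np.random.default_rng(SEED)
emb = rng.normal(size=(300, 512)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
scores = silhouette_sweep(emb)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On random vectors the scores are close to zero for every K, which is a useful reference point: a silhouette of about 0.08 is low, but it is still above what structureless data would give.
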
### 8. UMAP Projection, coloured by category

Projecting the 512-d embeddings to 2D with UMAP and colouring by the true category shows real
structure: products of the same category land near each other, even though the projection never saw
the labels.
### 9. UMAP Projection, coloured by K-Means cluster

The same map coloured by the four K-Means clusters lines up with the category structure above, which
is the visual confirmation that the clusters are not arbitrary.
### 10. Cluster vs Category, and the reasoning

This heatmap (each row adds up to 100%) is spread out rather than showing one strong block per row,
which matches the low silhouette: the categories overlap a fair amount. Even so, each cluster has a
clear set of categories that show up most, giving four broad visual product families (these are read
off the data, not chosen by me):
- Consumables and packaged goods (Food, Health & Beauty, Pet Supplies) - bottles, boxes, tubes.
- Furnishings and soft goods (Furniture, Apparel, Baby & Toddler) - larger lifestyle items.
- Tech and hardware (Electronics, Cameras & Optics, Hardware) - devices, tools, metal parts.
- Toys, office and media (Toys & Games, Office Supplies, Arts & Entertainment) - small, colourful items.
The clusters were found from purely visual embeddings with no access to the labels, yet they recover
human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation
work: similar-looking products really are near each other in the space.
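
A row-normalised table like the one behind the heatmap can be built with `pd.crosstab`. The labels below are toy values standing in for the real cluster ids and categories.

```python
import pandas as pd

# Toy labels standing in for the K-Means cluster ids and the true top-level categories.
clusters = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="cluster")
categories = pd.Series(
    ["Food", "Food", "Food", "Toys", "Electronics", "Electronics", "Electronics", "Toys"],
    name="top_category",
)

# normalize="index" divides each row by its total, so every row sums to 100%.
share = pd.crosstab(clusters, categories, normalize="index") * 100
dominant = share.idxmax(axis=1)  # most common category in each cluster
print(share.round(1))
print(dominant.to_dict())
```

Reading off the largest entries in each row is how the four families listed above would be identified from the real table.
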
### 11. Two more projections: PCA and t-SNE
PCA is a fast linear view (it is also what I cluster on); t-SNE emphasises local neighbourhoods. I
ran both as cross-checks against UMAP, and both show the same per-category grouping, so the structure
is not a UMAP artefact.


### 12. A second clustering algorithm: DBSCAN

K-Means forces every product into one of K round clusters; DBSCAN instead looks for dense regions
and labels the leftovers as noise. I picked its `eps` setting from the k-distance plot (left) rather
than guessing. DBSCAN splits the catalogue into roughly two dozen small clusters with almost no
noise, which means the products group into many tight little neighbourhoods. K-Means with K=4 is just
a simpler, easier-to-read summary of the same thing, so the two methods agree the space is well organised.
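
A sketch of the k-distance approach on synthetic data. The values `k=5` and `min_samples=5` are assumptions, and taking the 95th percentile of the curve is only a crude, automatic stand-in for reading the elbow off the plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(x, k=5):
    """Sorted distance from every point to its k-th nearest neighbour."""
    # n_neighbors=k+1 because each point is returned as its own nearest neighbour.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(x).kneighbors(x)
    return np.sort(dists[:, -1])


# Two tight, well-separated synthetic blobs standing in for the reduced embeddings.
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(0, 0.05, (100, 2)), rng.normal(5, 0.05, (100, 2))])

curve = k_distance_curve(x, k=5)
eps = float(np.quantile(curve, 0.95))  # assumption: percentile instead of a visual elbow
labels = DBSCAN(eps=eps, min_samples=5).fit_predict(x)
n_clusters = len(set(labels) - {-1})   # label -1 marks noise points
noise_share = float((labels == -1).mean())
print(n_clusters, round(noise_share, 3))
```

Plotting `curve` gives the k-distance plot; the point where it bends sharply upward is the usual candidate for `eps`.
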
### 13. What each cluster actually contains

Labels and numbers are one thing; the honest test is to look at the products closest to each
centroid. Each row is clearly one visual family (packaged goods, tech, furnishings, small colourful
items). This is what convinced me the space was worth recommending from.
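
Finding the products nearest each centroid takes only a few lines. This is a minimal sketch: `nearest_to_centroids` is an illustrative name, and `emb` is assumed to be whichever matrix the clustering was run on.

```python
import numpy as np


def nearest_to_centroids(emb, labels, n_show=5):
    """For each cluster, return the indices of the rows closest to its centroid."""
    closest = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroid = emb[idx].mean(axis=0)
        dist = np.linalg.norm(emb[idx] - centroid, axis=1)
        closest[int(c)] = idx[np.argsort(dist)[:n_show]].tolist()
    return closest


# Toy data: two clusters on a line; rows 1 and 4 sit exactly on their centroids.
emb = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(nearest_to_centroids(emb, labels, n_show=1))  # {0: [1], 1: [4]}
```

The returned indices can be used to pull the matching thumbnails and lay them out one row per cluster.
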
## The Recommender (Inputs and Outputs)
The recommendation itself is four steps, kept as small standalone functions:
1. Encode the query. A text query is wrapped in a short template (`"a product photo of ..."`;
CLIP generally responds better to caption-like prompts than to bare keywords) and passed through
the text encoder; an uploaded image is passed through the image encoder. Either way the result is
one 512-d vector in the shared space.
2. Score. Cosine similarity against all 11,912 catalogue vectors, which is a single matrix
multiply because everything is L2-normalized.
3. Rank and de-duplicate. Take the highest scores, skipping near-identical duplicates
(similarity > 0.985) so the three results are genuinely different products.
4. Return the Top-3 with their thumbnails, categories and similarity scores.
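
Steps 2 to 4 can be sketched in plain NumPy, assuming the query has already been encoded into a vector. One assumption to flag: the sketch treats a candidate as a duplicate when it is almost identical to a result that has already been picked, which is one reading of the rule above. The function name `recommend` is illustrative.

```python
import numpy as np

DUP_THRESHOLD = 0.985  # from the text: anything more similar than this counts as a duplicate


def recommend(query_vec, catalog_emb, k=3, dup_threshold=DUP_THRESHOLD):
    """Return [(index, score)] for the top-k catalogue rows, skipping near-duplicates."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = catalog_emb @ q                  # cosine similarity (rows are unit-length)
    picked = []
    for i in np.argsort(-scores):             # best score first
        # Assumption: compare each candidate against the results chosen so far.
        if any(catalog_emb[i] @ catalog_emb[j] > dup_threshold for j, _ in picked):
            continue
        picked.append((int(i), float(scores[i])))
        if len(picked) == k:
            break
    return picked


# Toy catalogue of unit vectors at known angles; rows 0 and 1 are only 1 degree apart.
angles = np.deg2rad([0.0, 1.0, 40.0, 90.0, 180.0])
catalog = np.column_stack([np.cos(angles), np.sin(angles)])
query = np.array([np.cos(np.deg2rad(5.0)), np.sin(np.deg2rad(5.0))])
results = recommend(query, catalog, k=3)
print([i for i, _ in results])  # [1, 2, 3]: row 0 is skipped as a near-duplicate of row 1
```

In the app, the returned indices would be used to look up each product's thumbnail, category and score.
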
Here is what it actually returns for five held-out photos (products the catalogue never saw):

The camera row is the clearest: a Canon body retrieves three other cameras at ~0.94 similarity.
### Bonus: a faster backend with FAISS
A linear `EMB @ q` scan is fine for 12K items, but a production catalogue can hold millions. I index the
embeddings with FAISS (the standard vector-search library) and confirm it returns the same
Top-3 as the brute-force scan, only faster. On 500 queries FAISS was about 50x faster with
100% agreement, so it is exact here, just quicker. That is the piece that would let the same app
scale to a production catalogue.
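
A sketch of how such a comparison could be set up. The index type is an assumption: `faiss.IndexFlatIP` is FAISS's exact inner-product index, which matches the 100% agreement reported above, but the notebook's actual index is not named here. The FAISS import is kept inside its function so the brute-force half runs without the library installed.

```python
import numpy as np


def brute_force_topk(emb, queries, k=3):
    """Exact top-k by inner product using one matrix multiply."""
    scores = queries @ emb.T
    return np.argsort(-scores, axis=1)[:, :k]


def faiss_topk(emb, queries, k=3):
    """Same search through a flat (exact) inner-product FAISS index."""
    import faiss  # imported lazily so the rest of the sketch runs without it

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.ascontiguousarray(emb, dtype=np.float32))
    _, ids = index.search(np.ascontiguousarray(queries, dtype=np.float32), k)
    return ids


def agreement(a, b):
    """Share of queries whose top-k id lists match exactly."""
    return float((np.asarray(a) == np.asarray(b)).all(axis=1).mean())


rng = np.random.default_rng(42)
emb = rng.normal(size=(1000, 512)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
queries = emb[:5]  # reuse a few catalogue rows as queries
exact = brute_force_topk(emb, queries)
print(exact[:, 0])  # each query's best match is itself: [0 1 2 3 4]
# With faiss installed: agreement(exact, faiss_topk(emb, queries))
```

For catalogues in the millions, an approximate index would trade a little recall for speed, at which point the agreement check becomes a recall measurement rather than an equality test.
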
## Evaluation
To put a number on quality I ran image-to-image retrieval on 80 held-out products (products the
catalogue sample never saw) and measured how often the returned products share the query's top-level
category:
| Metric | Score | Note |
|---|---|---|
| precision@1 | 0.39 | about 3x a random baseline (~0.12) |
| precision@3 | 0.35 | slightly below precision@1 (explained below) |
precision@3 being a little lower than precision@1 makes sense: the single closest product is usually
the safest match, while results 2 and 3 start drifting toward items that look alike but fall under a
different category. This also means the metric is a bit harsh on the model. Many of the apparent
"misses" are actually good visual matches across categories (a metal spear tip returning nails, a
scroll saw returning a microtome), because the model ranks by how things look, not by the category
label, which is exactly what you want from a "find similar-looking products" tool.
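
For reference, the category-match metric can be computed as below. This is a minimal sketch with toy labels; the function name `precision_at_k` is illustrative.

```python
def precision_at_k(query_labels, retrieved_labels, k):
    """Mean share of the top-k results whose category matches the query's."""
    per_query = [
        sum(label == q for label in row[:k]) / k
        for q, row in zip(query_labels, retrieved_labels)
    ]
    return sum(per_query) / len(per_query)


# Toy example: 2 queries, 3 retrieved categories each (best match first).
query_cats = ["Cameras & Optics", "Hardware"]
retrieved_cats = [
    ["Cameras & Optics", "Cameras & Optics", "Electronics"],
    ["Arts & Entertainment", "Hardware", "Hardware"],
]
print(precision_at_k(query_cats, retrieved_cats, k=1))            # 0.5
print(round(precision_at_k(query_cats, retrieved_cats, k=3), 3))  # 0.667
```
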
## Interactive App (the Space)
The app has three tabs: a Recommender (upload a photo or type a query, get Top-3), a
Dataset & Analysis tab with these plots and reasoning, and a Presentation tab with the video.
The Space loads `catalog.parquet` and the same CLIP model used to build it, so the live results are
exactly the pipeline described above.
## Business and Ethical Considerations
Business value. Visual similarity search is the engine behind "shop the look" and "more like
this" features. It needs no manual tagging (it runs on the product image alone), works across
languages (useful here, since titles are multilingual), and helps cold-start items that have no
clicks yet. The same `catalog.parquet` + FAISS setup would directly power a related-items carousel
or a visual search bar on a store.
Limits and ethics.
- Visual, not semantic. The model matches appearance, so it can pair items that look alike but
serve different purposes. Fine for shopping, but it should not be trusted where the *function*
matters (medical or safety products).
- Representation bias. CLIP is trained on web images and inherits their biases; a product shot in
an unusual style, or from an under-represented region, may embed poorly and be under-recommended.
- Catalogue gaps. Recommendations can only point inside the catalogue, so sparse categories give
weak results no matter how good the model is.
## Final Conclusions
CLIP gives a single shared space for images and text, and the clustering confirmed that space is
semantically meaningful before I relied on it for retrieval. The recommender is strongest in
image-to-image mode, where there is no text/image gap, and the EDA's white-background finding
explains both the modest silhouette and the cases where retrieval drifts across categories.
A few caveats worth flagging:
- Visual, not semantic. The model matches appearance, so a black sports pad can retrieve black
motorcycle pants. For a visual recommender that is a feature, but it does cap the category-match metric.
- Catalogue coverage. Some everyday text queries (for example "necklace" or "mug") return weak
results simply because few such products exist in the sampled catalogue.
- Single sample. Numbers come from one balanced 11,912-product sample and one 80-query eval set;
a larger sweep would tighten the estimates.
What I would do next. A larger or domain-tuned CLIP would help the weak text queries; a
single-domain catalogue (rather than a mixed marketplace) would reduce the white-background overlap
and sharpen the clusters; and real human relevance judgements would replace the conservative
category-match proxy. If I had set out to maximise the clustering metric I would also have filtered to
fewer, more visually distinct categories, but I kept the full mix because it is the honest, harder case.
## Repository Contents
| File | Description |
|---|---|
| `app.py` | The Gradio app (Recommender / Dataset & Analysis / Presentation tabs) |
| `Assignment_3_NoamFuchs.ipynb` | Full code notebook (all cells run, plots and outputs embedded) |
| `catalog.parquet` | The embeddings file: 11,912 vectors + metadata + thumbnails |
| `assets/` | All plots referenced above |
| `examples/` | Example product images for the app |
| `requirements.txt` | Dependencies |
| `README.md` | This file |
## Author
Noam Fuchs · *Introduction to Data Science* · Reichman University (IDC) · Spring 2026