---
title: Visual Product Recommender
emoji: 🛍️
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
license: mit
---

# Visual Product Recommender: Find Similar Products by Image or Text

**Assignment #3 · Embeddings, RecSys, Spaces · June 2026**

## Video Presentation

*(The recorded walk-through is also in the "Presentation" tab of the app above.)*

## Overview

This project builds a recommendation app on the vision modality. The idea is simple: given a product (as a photo you upload, or a short text description), return the 3 most similar products from a real e-commerce catalogue.

The pipeline goes from a raw Hugging Face image dataset to a working Gradio Space: I load the catalogue, explore it, turn every product image into a CLIP embedding, check that similar products really do land near each other (using clustering and 2D plots), save the embeddings to a file, and serve the Top-3 most similar products live.

On a held-out test, the top result is in the right category about 39% of the time, roughly 3x a random guess, and the strongest demo is uploading a photo (for example, a Canon camera returns three other Canon cameras at 0.94 similarity).

## Dataset

Downloaded directly from Hugging Face with `datasets.load_dataset`.

| Property | Details |
|---|---|
| Source | [`Shopify/product-catalogue`](https://huggingface.co/datasets/Shopify/product-catalogue) |
| Total size | 48,289 products (train 38,631 + test 9,658) |
| Used here | the 38,631-row train split; embeddings on a balanced 11,912 sample |
| Modality | product images (median ~900 x 900 px) + metadata |
| Top-level categories | 25 (Home & Garden, Sporting Goods, Electronics, Apparel, etc.) |
| Brands | 24,245 (very long tail, none dominant) |
| Target / task | unsupervised: visual similarity retrieval (no label is predicted) |

### Feature Mix

| Type | Columns |
|---|---|
| Image | `product_image` (the signal the whole project is built on) |
| Categorical | `ground_truth_category` (Google taxonomy, median depth 4), `ground_truth_brand`, `ground_truth_is_secondhand` |
| Textual | `product_title`, `product_description` |
| Derived | `top_category` (first taxonomy segment, used as the EDA/clustering label) |

## Setup

```python
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import umap, faiss

SEED = 42
```

Built on an Apple M-series laptop (CLIP runs on the `mps` device); the Space itself runs on CPU.

## Data Loading and Preparation

Loading: `load_dataset("Shopify/product-catalogue", split="train")` pulls the images and metadata straight from the Hub (about 7 GB across 15 shards, cached after the first run).

Top-level category: the raw category is a deep path like *Home & Garden > Decor > Piggy Banks*. I take the first segment as `top_category`, which gives 25 interpretable groups to reason about.

Balanced sampling: the categories are imbalanced, so embedding the raw data would let a few big categories dominate the clusters. I drop categories with fewer than 150 products and then cap each remaining category so the embedded sample is balanced: 11,912 products across 18 categories. This keeps the analysis in the sweet spot the brief asks for (large enough to show structure, small enough to stay fast).
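The top-level split and balanced sampling described above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's code: the helper names `add_top_category` and `balanced_sample` are hypothetical, and the per-category cap of 700 is an assumption (the 150-product minimum is stated above, the cap value is not).

```python
import pandas as pd

SEED = 42


def add_top_category(df: pd.DataFrame, col: str = "ground_truth_category") -> pd.DataFrame:
    """Add `top_category`: the first segment of an 'A > B > C' taxonomy path."""
    out = df.copy()
    out["top_category"] = out[col].str.split(">").str[0].str.strip()
    return out


def balanced_sample(df: pd.DataFrame, min_count: int = 150, cap: int = 700,
                    seed: int = SEED) -> pd.DataFrame:
    """Drop categories below `min_count`, then keep at most `cap` rows of each.

    `cap=700` is an assumed value for illustration only.
    """
    counts = df["top_category"].value_counts()
    keep = counts[counts >= min_count].index
    kept = df[df["top_category"].isin(keep)]
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in kept.groupby("top_category")
    ]
    return pd.concat(parts).reset_index(drop=True)


# Toy example with made-up rows (not the real catalogue).
toy = pd.DataFrame({
    "ground_truth_category": (
        ["Home & Garden > Decor > Piggy Banks"] * 6
        + ["Electronics > Audio"] * 3
        + ["Media > Books"] * 1
    )
})
sample = balanced_sample(add_top_category(toy), min_count=2, cap=4)
print(sample["top_category"].value_counts().to_dict())
```

In the toy run the largest category is capped and the single-row category is dropped, which is the same effect the real sampling has on the long tail.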
Embedding storage: every image embedding is L2-normalized and saved with its metadata and a thumbnail to `catalog.parquet`, which the app loads at runtime.

## Exploratory Data Analysis (EDA)

Because this is an image dataset, the EDA looks at both the metadata and the images themselves.

### 1. Category Distribution

Top categories

Home & Garden, Sporting Goods and Arts & Entertainment dominate, while categories like Media and Luggage are tiny. This imbalance is exactly why I sample in a balanced way before embedding, instead of feeding the raw distribution into K-Means.

### 2. Brands (the long tail)

Brands

There are over 24,000 brands and not one of them owns a meaningful share of the catalogue. So brand is useless as a grouping signal here. The thing that actually separates products is what they look like, which is the whole motivation for using image embeddings.

### 3. Text Field Lengths

Text lengths

Titles are short (a few words), descriptions vary a lot, and a good chunk of the titles are not in English (Hebrew, Japanese, Dutch, Portuguese all show up). That multilinguality is worth keeping in mind for the text-query mode later, and it is another reason the image is the more reliable signal.

### 4. What the Products Look Like

Sample grid

One random product per category. Two things jump out: the photos are clean, single-product studio shots, and almost all of them sit on a white background. That observation turns out to matter a lot.

### 5. Image Dimensions

Image dimensions

Most images are close to square at around 900 x 900 px (aspect ratio clustered near 1.0). This is consistent product-listing imagery, which is good news: I do not have to worry about wildly different shapes confusing the encoder, and CLIP resizes everything to 224 x 224 anyway.

### 6. Brightness and Composition

Image colour
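Brightness and background statistics of this kind can be computed per image with a few lines of NumPy. The sketch below is illustrative only and runs on synthetic arrays: the function names, the 10-pixel border, and the 240 cut-off for "near-white" are assumptions, not the definitions used in the notebook.

```python
import numpy as np


def mean_brightness(img: np.ndarray) -> float:
    """Mean pixel intensity (0-255) of an H x W x 3 uint8 RGB image."""
    return float(img.astype(np.float32).mean())


def has_near_white_background(img: np.ndarray, border: int = 10,
                              threshold: float = 240.0) -> bool:
    """Guess the background from the image border.

    Returns True when the mean intensity of the outer `border` pixels is at
    least `threshold`. Both values are assumed, not taken from the notebook.
    """
    mask = np.zeros(img.shape[:2], dtype=bool)
    mask[:border, :] = True
    mask[-border:, :] = True
    mask[:, :border] = True
    mask[:, -border:] = True
    return float(img[mask].astype(np.float32).mean()) >= threshold


# Synthetic stand-ins for product photos (not real catalogue images).
white_bg = np.full((100, 100, 3), 255, dtype=np.uint8)
white_bg[30:70, 30:70] = 40                       # dark "product" in the centre
dark_bg = np.full((100, 100, 3), 30, dtype=np.uint8)

images = [white_bg, dark_bg]
median_brightness = float(np.median([mean_brightness(im) for im in images]))
white_share = float(np.mean([has_near_white_background(im) for im in images]))
print(median_brightness, white_share)
```

Looking only at the border is a cheap proxy for "background"; a product that fills the whole frame would be misjudged, so the share it reports is best read as approximate.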

This is the most important EDA finding. Median brightness is about 205 out of 255, and roughly 70% of products have a near-white background. White backgrounds make products from different categories look similar to each other, which is the thread that runs through the rest of the project: it is why the clustering silhouette is modest, and why the recommender works best on clean, single-product photos rather than busy lifestyle shots.

## Embeddings

I use [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32), a small/medium CLIP model. The reason for CLIP specifically is that it embeds images and text into the same 512-dimensional space, which is what lets one app accept either a photo or a text query and compare both against the same catalogue vectors.

Each product image is passed through the vision encoder, projected into the shared space, and L2-normalized (so a dot product equals cosine similarity). I process the images in small batches so it does not run out of memory, then store the 11,912 vectors in `catalog.parquet`.

## Clustering: is the embedding space meaningful?

Before trusting nearest-neighbour search, I check that the embeddings actually organise products sensibly. CLIP vectors have 512 dimensions, which is a lot for K-Means to handle well, so I first reduce them to 50 PCA components (this keeps most of the useful signal and drops some of the noise).

### 7. Choosing K with the Silhouette Score

Silhouette
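A sweep of this kind can be sketched as below: reduce the embeddings with PCA, fit K-Means for each candidate K, and record the silhouette score. This is a hedged sketch on synthetic blobs, not the notebook's code; `silhouette_sweep` is a hypothetical helper, and the real run would pass the 512-d CLIP vectors with 50 PCA components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

SEED = 42


def silhouette_sweep(emb, ks=range(4, 11), n_components=50, seed=SEED):
    """Return {K: silhouette score} for K-Means run on PCA-reduced embeddings."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(emb)
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
        scores[k] = float(silhouette_score(reduced, labels))
    return scores


# Synthetic stand-in: 4 well-separated blobs in 64 dimensions.
rng = np.random.default_rng(SEED)
centres = rng.normal(scale=10.0, size=(4, 64))
toy_emb = np.vstack([c + rng.normal(size=(50, 64)) for c in centres])

scores = silhouette_sweep(toy_emb, ks=range(4, 7), n_components=10)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On clean synthetic blobs the score peaks sharply at the true K; a flat, low curve like the one in the plot is what overlapping groups look like under the same procedure.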

I tested K from 4 to 10. The silhouette score stays low and fairly flat (about 0.08 at its best, K=4), so there is no single obvious number of clusters and K-Means only finds broad groups. This fits the white-background overlap from EDA #6: products from different categories look similar, so they are hard to separate cleanly. I use K=4 as a simple summary and lean on the next two plots, not the silhouette, to judge whether the grouping is real.

### 8. UMAP Projection, coloured by category

UMAP by category

Projecting the 512-d embeddings to 2D with UMAP and colouring by the true category shows real structure: products of the same category land near each other, even though the projection never saw the labels.

### 9. UMAP Projection, coloured by K-Means cluster

UMAP by cluster

The same map coloured by the four K-Means clusters lines up with the category structure above, which is the visual confirmation that the clusters are not arbitrary.

### 10. Cluster vs Category, and the reasoning

Cluster vs category

This heatmap (each row adds up to 100%) is spread out rather than showing one strong block per row, which matches the low silhouette: the categories overlap a fair amount. Even so, each cluster has a clear set of categories that show up most, giving four broad visual product families (these are read off the data, not chosen by me):

- Consumables and packaged goods (Food, Health & Beauty, Pet Supplies) - bottles, boxes, tubes.
- Furnishings and soft goods (Furniture, Apparel, Baby & Toddler) - larger lifestyle items.
- Tech and hardware (Electronics, Cameras & Optics, Hardware) - devices, tools, metal parts.
- Toys, office and media (Toys & Games, Office Supplies, Arts & Entertainment) - small, colourful items.

The clusters were found from purely visual embeddings with no access to the labels, yet they recover human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation work: similar-looking products really are near each other in the space.

### 11. Two more projections: PCA and t-SNE

PCA is a fast linear view (it is also what I cluster on); t-SNE emphasises local neighbourhoods. I ran both as cross-checks against UMAP, and both show the same per-category grouping, so the structure is not a UMAP artefact.

PCA by category

t-SNE
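A cross-check like this takes only a few lines with scikit-learn. The sketch below runs on random stand-in vectors rather than the catalogue embeddings; the helper name `project_2d` and the t-SNE settings (`perplexity=30`, PCA initialisation) are assumptions, not the notebook's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

SEED = 42


def project_2d(emb, seed=SEED):
    """Return 2-D PCA and t-SNE coordinates for the same embedding matrix."""
    pca_xy = PCA(n_components=2, random_state=seed).fit_transform(emb)
    tsne_xy = TSNE(n_components=2, perplexity=30, init="pca",
                   random_state=seed).fit_transform(emb)
    return pca_xy, tsne_xy


# Random stand-in vectors (not real CLIP embeddings).
rng = np.random.default_rng(SEED)
toy_emb = rng.normal(size=(120, 32)).astype(np.float32)

pca_xy, tsne_xy = project_2d(toy_emb)
print(pca_xy.shape, tsne_xy.shape)
```

Each output has one row per product, so either can be passed to a scatter plot with the category as the colour.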

### 12. A second clustering algorithm: DBSCAN

DBSCAN
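A k-distance plot sorts every point's distance to its k-th nearest neighbour; the "elbow" of that curve is a common starting point for DBSCAN's `eps`. The sketch below shows the idea on synthetic blobs. The helper name `k_distances`, the choice `k=5`, and the use of a percentile in place of reading the elbow off the plot are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def k_distances(x, k=5):
    """Sorted distance from each point to its k-th nearest neighbour (self excluded)."""
    # n_neighbors=k+1 because each training point comes back as its own nearest neighbour.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(x).kneighbors(x)
    return np.sort(dists[:, -1])


# Synthetic stand-in: two well-separated 2-D blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, size=(100, 2))
blob_b = rng.normal(loc=100.0, size=(100, 2))
x = np.vstack([blob_a, blob_b])

kd = k_distances(x, k=5)
eps = float(np.percentile(kd, 95))   # assumed stand-in for the elbow read off the plot
labels = DBSCAN(eps=eps, min_samples=5).fit_predict(x)

n_clusters = len(set(labels) - {-1})
noise_share = float(np.mean(labels == -1))
print(n_clusters, round(noise_share, 3))
```

DBSCAN marks noise with the label `-1`, which is why it is excluded when counting clusters.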

K-Means forces every product into one of K round clusters; DBSCAN instead looks for dense regions and labels the leftovers as noise. I picked its `eps` setting from the k-distance plot (left) rather than guessing. DBSCAN splits the catalogue into roughly two dozen small clusters with almost no noise, which means the products group into many tight little neighbourhoods. K-Means with K=4 is just a simpler, easier-to-read summary of the same thing, so the two methods agree the space is well organised.

### 13. What each cluster actually contains

Cluster examples

Labels and numbers are one thing; the honest test is to look at the products closest to each centroid. Each row is clearly one visual family (packaged goods, tech, furnishings, small colourful items). This is what convinced me the space was worth recommending from.

## The Recommender (Inputs and Outputs)

The recommendation itself is four steps, kept as small standalone functions:

1. Encode the query. A text query is wrapped in a short template (`"a product photo of ..."`, which calibrates CLIP better) and passed through the text encoder; an uploaded image is passed through the image encoder. Either way the result is one 512-d vector in the shared space.
2. Score. Cosine similarity against all 11,912 catalogue vectors, which is a single matrix multiply because everything is L2-normalized.
3. Rank and de-duplicate. Take the highest scores, skipping near-identical duplicates (similarity > 0.985) so the three results are genuinely different products.
4. Return the Top-3 with their thumbnails, categories and similarity scores.

Here is what it actually returns for five held-out photos (products the catalogue never saw):

Query to Top-3
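The scoring, ranking and de-duplication steps behind these results can be sketched with NumPy alone. This is a minimal illustration on toy 2-d vectors, not the app's code: `l2_normalize` and `top_k_dedup` are hypothetical names, and applying the 0.985 threshold between a candidate and the already-picked results (rather than against the query) is an assumption about how the de-duplication works.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row (or a single vector) to unit length."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k_dedup(emb, q, k=3, dup_threshold=0.985):
    """Return up to k (index, score) pairs, best first, skipping near-duplicates.

    `emb` is an (N, d) matrix of unit vectors and `q` a unit query vector, so
    `emb @ q` is cosine similarity. A candidate is skipped when its similarity
    to any already-picked result exceeds `dup_threshold` (assumed behaviour).
    """
    scores = emb @ q
    picked = []
    for i in np.argsort(-scores):
        if all(float(emb[i] @ emb[j]) <= dup_threshold for j in picked):
            picked.append(int(i))
        if len(picked) == k:
            break
    return [(i, float(scores[i])) for i in picked]


# Toy catalogue: rows 0 and 1 are near-identical, so only one should be returned.
toy_emb = l2_normalize([[1.0, 0.0], [1.0, 0.01], [0.8, 0.6], [0.0, 1.0]])
toy_q = l2_normalize([1.0, 0.0])
results = top_k_dedup(toy_emb, toy_q)
print(results)
```

The greedy loop costs at most a handful of extra dot products per query, so it adds little on top of the single matrix multiply.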

The camera row is the clearest: a Canon body retrieves three other cameras at ~0.94 similarity.

### Bonus: a faster backend with FAISS

A linear `EMB @ q` scan is fine for 12K items, but a real catalogue has millions. I index the embeddings with FAISS (the standard vector-search library) and confirm it returns the same Top-3 as the brute-force scan, only faster. On 500 queries FAISS was about 50x faster with 100% agreement, so it is exact here, just quicker. That is the piece that would let the same app scale to a production catalogue.

## Evaluation

To put a number on quality I ran image-to-image retrieval on 80 held-out products (products the catalogue sample never saw) and measured how often the returned products share the query's top-level category:

| Metric | Score | Note |
|---|---|---|
| precision@1 | 0.39 | about 3x a random baseline (~0.12) |
| precision@3 | 0.35 | slightly below precision@1 (explained below) |

precision@3 being a little lower than precision@1 makes sense: the single closest product is usually the safest match, while results 2 and 3 start drifting toward items that look alike but fall under a different category.

This also means the metric is a bit harsh on the model. Many of the apparent "misses" are actually good visual matches across categories (a metal spear tip returning nails, a scroll saw returning a microtome), because the model ranks by how things look, not by the category label, which is exactly what you want from a "find similar-looking products" tool.

## Interactive App (the Space)

The app has three tabs: a Recommender (upload a photo or type a query, get Top-3), a Dataset & Analysis tab with these plots and reasoning, and a Presentation tab with the video. The Space loads `catalog.parquet` and the same CLIP model used to build it, so the live results are exactly the pipeline described above.

## Business and Ethical Considerations

Business value.
Visual similarity search is the engine behind "shop the look" and "more like this" features. It needs no manual tagging (it runs on the product image alone), works across languages (useful here, since titles are multilingual), and helps cold-start items that have no clicks yet. The same `catalog.parquet` + FAISS setup would directly power a related-items carousel or a visual search bar on a store.

Limits and ethics.

- Visual, not semantic. The model matches appearance, so it can pair items that look alike but serve different purposes. Fine for shopping, but it should not be trusted where the *function* matters (medical or safety products).
- Representation bias. CLIP is trained on web images and inherits their biases; a product shot in an unusual style, or from an under-represented region, may embed poorly and be under-recommended.
- Catalogue gaps. Recommendations can only point inside the catalogue, so sparse categories give weak results no matter how good the model is.

## Final Conclusions

CLIP gives a single shared space for images and text, and the clustering confirmed that space is semantically meaningful before I relied on it for retrieval. The recommender is strongest in image-to-image mode, where there is no text/image gap, and the EDA's white-background finding explains both the modest silhouette and the cases where retrieval drifts across categories.

A few caveats worth flagging:

- Visual, not semantic. The model matches appearance, so a black sports pad can retrieve black motorcycle pants. For a visual recommender that is a feature, but it does cap the category-match metric.
- Catalogue coverage. Some everyday text queries (for example "necklace" or "mug") return weak results simply because few such products exist in the sampled catalogue.
- Single sample. Numbers come from one balanced 11,912-product sample and one 80-query eval set; a larger sweep would tighten the estimates.

What I would do next.
A larger or domain-tuned CLIP would help the weak text queries; a single-domain catalogue (rather than a mixed marketplace) would reduce the white-background overlap and sharpen the clusters; and real human relevance judgements would replace the conservative category-match proxy. If I had set out to maximise the clustering metric I would also have filtered to fewer, more visually distinct categories, but I kept the full mix because it is the honest, harder case.

## Repository Contents

| File | Description |
|---|---|
| `app.py` | The Gradio app (Recommender / Dataset & Analysis / Presentation tabs) |
| `Assignment_3_NoamFuchs.ipynb` | Full code notebook (all cells run, plots and outputs embedded) |
| `catalog.parquet` | The embeddings file: 11,912 vectors + metadata + thumbnails |
| `assets/` | All plots referenced above |
| `examples/` | Example product images for the app |
| `requirements.txt` | Dependencies |
| `README.md` | This file |

## Author

Noam Fuchs · *Introduction to Data Science* · Reichman University (IDC) · Spring 2026