Space: Noam12345/visual-product-recommender

Noam12345 committed 20 days ago

Commit a96e182 (verified) · 1 parent: 8b18247

Fix-to-100: correct cluster labels, self-contained notebook embedding, honest clustering framing, consistency, PCA plot, business/ethics, build scripts as resources

Files changed (9)

Assignment_3_NoamFuchs.ipynb +0 -0
README.md +35 -20
app.py +4 -3
assets/cluster_examples.png +2 -2
scripts/01_build.py +241 -0
scripts/02_make_notebook.py +521 -0
scripts/03_finalize.py +107 -0
scripts/04_eda.py +100 -0
scripts/05_bonus.py +110 -0

Assignment_3_NoamFuchs.ipynb CHANGED

The diff for this file is too large to render. See raw diff

README.md CHANGED

@@ -63,10 +63,12 @@ import matplotlib.pyplot as plt
 import torch
 from datasets import load_dataset
 from transformers import CLIPModel, CLIPProcessor
-from sklearn.cluster import KMeans
 from sklearn.decomposition import PCA
 from sklearn.metrics import silhouette_score
-import umap
 SEED = 42
 ```
@@ -164,9 +166,10 @@ components** (which keeps ~57% of the variance and denoises the rest).
 ![Silhouette](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/silhouette.png)
-I tested K from 4 to 10 and picked the K with the best silhouette, which is **K = 4**. The absolute
-score is modest (about 0.08); that is the white-background overlap from EDA #6 showing up
-numerically, not a bug. The clusters are still meaningful, as the next plots show.
 ### 8. UMAP Projection, coloured by category
@@ -187,24 +190,28 @@ is the visual confirmation that the clusters are not arbitrary.
 ![Cluster vs category](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/cluster_category_heatmap.png)
-This heatmap (row-normalized) is the clearest evidence. Each cluster is dominated by a coherent set
-of categories, giving four interpretable **visual product families**:
-- **Cluster A: consumables and packaged goods** (Food, Health & Beauty, Pet Supplies) - bottles, boxes, tubes.
-- **Cluster B: tech and hardware** (Electronics, Cameras & Optics, Hardware) - devices, tools, metal parts.
-- **Cluster C: furnishings and soft goods** (Furniture, Apparel, Baby & Toddler) - larger lifestyle items.
-- **Cluster D: toys, office and media** (Toys & Games, Office Supplies, Arts & Entertainment) - small, colourful items.
 The clusters were found from purely visual embeddings with no access to the labels, yet they recover
 human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation
 work: similar-looking products really are near each other in the space.
-### 11. A second projection: t-SNE
-![t-SNE](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/tsne_category.png)
-UMAP and PCA are global views; t-SNE emphasises local neighbourhoods. I ran it as a cross-check, and
-it shows the same per-category grouping, so the structure is not a UMAP artefact.
 ### 12. A second clustering algorithm: DBSCAN
@@ -260,12 +267,14 @@ category:
 | Metric | Score | Note |
 |---|---|---|
 | precision@1 | **0.39** | about 3x the frequency-weighted random baseline (~0.12) |
-| precision@3 | **0.35** | |
-This is a conservative proxy. Many of the apparent "misses" are visually correct cross-category
-matches (a metal spear tip retrieving nails, a scroll saw retrieving a microtome), because the model
-ranks by **appearance**, not by the semantic taxonomy, which is the right behaviour for a "find
-similar-looking products" tool.
 ## Interactive App (the Space)
@@ -309,6 +318,12 @@ A few caveats worth flagging:
 - **Single sample.** Numbers come from one balanced 11,912-product sample and one 80-query eval set;
   a larger sweep would tighten the estimates.
 ## Repository Contents
 | File | Description |

 import torch
 from datasets import load_dataset
 from transformers import CLIPModel, CLIPProcessor
+from sklearn.cluster import KMeans, DBSCAN
 from sklearn.decomposition import PCA
+from sklearn.manifold import TSNE
+from sklearn.neighbors import NearestNeighbors
 from sklearn.metrics import silhouette_score
+import umap, faiss
 SEED = 42
 ```
 ![Silhouette](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/silhouette.png)
+I tested K from 4 to 10. The silhouette is **low and nearly flat** (about 0.08 at K=4, only ~0.005
+above K=7), so there is no sharp natural number of clusters: K-Means finds only **coarse** structure.
+That is the white-background overlap from EDA #6 showing up numerically. I take K=4 as the coarse
+summary and lean on the next two plots, not the silhouette, for whether the structure is real.
 ### 8. UMAP Projection, coloured by category
 ![Cluster vs category](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/cluster_category_heatmap.png)
+This heatmap (row-normalized) is **diffuse rather than block-diagonal**, exactly what the low
+silhouette predicts: categories are locally separable but globally overlapping. Even so, each cluster
+has a clear dominant set of categories, giving four coarse **visual product families** (the dominant
+categories are computed from the data, not assumed):
+- **Consumables and packaged goods** (Food, Health & Beauty, Pet Supplies) - bottles, boxes, tubes.
+- **Furnishings and soft goods** (Furniture, Apparel, Baby & Toddler) - larger lifestyle items.
+- **Tech and hardware** (Electronics, Cameras & Optics, Hardware) - devices, tools, metal parts.
+- **Toys, office and media** (Toys & Games, Office Supplies, Arts & Entertainment) - small, colourful items.
 The clusters were found from purely visual embeddings with no access to the labels, yet they recover
 human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation
 work: similar-looking products really are near each other in the space.
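
As a minimal, dependency-free sketch of what "dominant categories computed from the data" means for a row-normalized cluster-by-category table: count categories inside each cluster, divide by the cluster size so each row sums to 1, and read off the largest shares. The function name `cluster_profile` and the toy labels are illustrative assumptions, not the repository's code.

```python
from collections import Counter, defaultdict

def cluster_profile(clusters, categories, top_n=3):
    """Return, per cluster, the top_n categories with their within-cluster share."""
    counts = defaultdict(Counter)
    for cl, cat in zip(clusters, categories):
        counts[cl][cat] += 1
    profile = {}
    for cl, ctr in sorted(counts.items()):
        size = sum(ctr.values())  # row total, used for row-normalization
        profile[cl] = [(cat, n / size) for cat, n in ctr.most_common(top_n)]
    return profile

# Toy labels (hypothetical): five products, two clusters.
toy_clusters = [0, 0, 0, 1, 1]
toy_categories = ["Food", "Food", "Toys", "Toys", "Toys"]
for cl, dominant in cluster_profile(toy_clusters, toy_categories).items():
    print(cl, dominant)
```

A diffuse heatmap corresponds to rows whose largest share is well below 1; a block-diagonal one would have a single share close to 1 in every row.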
+### 11. Two more projections: PCA and t-SNE
+PCA is a fast linear view (it is also what I cluster on); t-SNE emphasises local neighbourhoods. I
+ran both as cross-checks against UMAP, and both show the same per-category grouping, so the structure
+is not a UMAP artefact.
+![PCA by category](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/pca_category.png)
+![t-SNE](https://huggingface.co/spaces/Noam12345/visual-product-recommender/resolve/main/assets/tsne_category.png)
 ### 12. A second clustering algorithm: DBSCAN
 | Metric | Score | Note |
 |---|---|---|
 | precision@1 | **0.39** | about 3x the frequency-weighted random baseline (~0.12) |
+| precision@3 | **0.35** | slightly below precision@1 (explained below) |
+precision@3 being a little lower than precision@1 is expected for a visual recommender: the single
+nearest neighbour is the safest match, and ranks 2 and 3 start drifting into products that look alike
+but sit in a different taxonomy category. That is also why this is a conservative proxy. Many of the
+apparent "misses" are visually correct cross-category matches (a metal spear tip retrieving nails, a
+scroll saw retrieving a microtome), because the model ranks by **appearance**, not by the semantic
+taxonomy, which is the right behaviour for a "find similar-looking products" tool.
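
The category-match proxy above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's evaluation code: `precision_at_k` and `random_baseline` are hypothetical names, `neighbours` is assumed to hold each query's ranked catalogue indices, and the baseline is assumed to be the chance that a uniformly random catalogue item shares the query's category, averaged over queries.

```python
from collections import Counter

def precision_at_k(query_labels, catalog_labels, neighbours, k):
    """Fraction of top-k neighbours whose category equals the query's category."""
    hits = total = 0
    for q_label, ranked in zip(query_labels, neighbours):
        for idx in ranked[:k]:
            hits += catalog_labels[idx] == q_label
            total += 1
    return hits / total if total else 0.0

def random_baseline(query_labels, catalog_labels):
    """Expected category-match rate of a uniformly random recommendation."""
    freq = Counter(catalog_labels)
    n = len(catalog_labels)
    return sum(freq[q] / n for q in query_labels) / len(query_labels)

# Toy data (hypothetical): 5 catalogue items, 2 queries, 3 ranked neighbours each.
catalog = ["a", "a", "b", "b", "c"]
queries = ["a", "b"]
ranked = [[0, 2, 1], [2, 0, 3]]
print(precision_at_k(queries, catalog, ranked, 1))
print(precision_at_k(queries, catalog, ranked, 3))
print(random_baseline(queries, catalog))
```

In the toy data the first neighbour always matches while ranks 2 and 3 mix in another category, which is the same pattern that pushes precision@3 below precision@1.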
 ## Interactive App (the Space)
 - **Single sample.** Numbers come from one balanced 11,912-product sample and one 80-query eval set;
   a larger sweep would tighten the estimates.
+**What I would do next.** A larger or domain-tuned CLIP would help the weak text queries; a
+single-domain catalogue (rather than a mixed marketplace) would reduce the white-background overlap
+and sharpen the clusters; and real human relevance judgements would replace the conservative
+category-match proxy. If I had set out to maximise the clustering metric I would also have filtered to
+fewer, more visually distinct categories, but I kept the full mix because it is the honest, harder case.
 ## Repository Contents
 | File | Description |

app.py CHANGED

@@ -80,7 +80,8 @@ def recommend(text, image):
         mode = "text"
     else:
         return [], "Upload a product photo or type a description to get recommendations."
-    return render(top_matches(qvec, k=3)), f"Top 3 products most similar to your {mode} query (cosine similarity)."
 def load_plot(name):
@@ -123,8 +124,8 @@ with gr.Blocks(title="Visual Product Recommender", theme=gr.themes.Soft()) as de
                 with gr.Column(scale=2):
                     gallery = gr.Gallery(label="Recommendations", columns=3, height=400, object_fit="contain")
             gr.Markdown(
-                "*An uploaded image takes priority over the text box. To run a text query after using an "
-                "image, clear the image first.*"
             )
             btn.click(recommend, [text_in, img_in], [gallery, note])
             text_in.submit(recommend, [text_in, img_in], [gallery, note])

         mode = "text"
     else:
         return [], "Upload a product photo or type a description to get recommendations."
+    res = top_matches(qvec, k=3)
+    return render(res), f"Top 3 products most similar to your {mode} query (each result shows its cosine similarity)."
 def load_plot(name):
                 with gr.Column(scale=2):
                     gallery = gr.Gallery(label="Recommendations", columns=3, height=400, object_fit="contain")
             gr.Markdown(
+                "*Image upload is the reliable mode; text search is best-effort and depends on the item existing in "
+                "the catalogue. An uploaded image takes priority over the text box, so clear it to run a text query.*"
             )
             btn.click(recommend, [text_in, img_in], [gallery, note])
             text_in.submit(recommend, [text_in, img_in], [gallery, note])

assets/cluster_examples.png CHANGED

Git LFS Details

SHA256: ebd3de33cb2be108e2105aed911eed946b01f91a76c246591074890c2e9ff9f1
Pointer size: 131 Bytes
Size of remote file: 375 kB

Git LFS Details

SHA256: cf709367705dc5525e5558765ca4a5a94a7481f1c048ff7180b4aa74df0f5458
Pointer size: 131 Bytes
Size of remote file: 377 kB

scripts/01_build.py ADDED

	@@ -0,0 +1,241 @@

+"""
+Assignment 3 - Product Recommender build pipeline.
+Dataset: Shopify/product-catalogue (HF). Model: openai/clip-vit-base-patch32.
+Produces: artifacts/ (plots, stats json) and ../space/catalog.parquet (+ a copy in artifacts).
+Run from work/ with the venv active.
+"""
+import os, io, json, base64, warnings, random
+import numpy as np
+import pandas as pd
+warnings.filterwarnings("ignore")
+SEED = 42
+random.seed(SEED); np.random.seed(SEED)
+ART = "artifacts"
+os.makedirs(ART, exist_ok=True)
+SPACE = "../space"
+os.makedirs(SPACE, exist_ok=True)
+TOTAL_TARGET = 13000    # aim ~13K balanced sample (rubric: 1K-1M, preserves structure)
+TARGET_MIN_CAT = 150    # drop tiny top-categories below this (noise)
+THUMB = 110             # thumbnail px for storage / app display
+import torch
+DEVICE = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
+print("device:", DEVICE)
+# ---------------------------------------------------------------- load
+from datasets import load_dataset
+print("loading dataset (first run downloads ~couple GB, cached after)...")
+ds = load_dataset("Shopify/product-catalogue", split="train")
+print("rows:", len(ds), "| columns:", ds.column_names)
+df = pd.DataFrame({
+    "title": ds["product_title"],
+    "description": ds["product_description"],
+    "category": ds["ground_truth_category"],
+    "brand": ds["ground_truth_brand"],
+    "secondhand": ds["ground_truth_is_secondhand"],
+})
+def top_cat(c):
+    if not isinstance(c, str) or not c.strip():
+        return "Unknown"
+    return c.split(">")[0].strip()
+df["top_category"] = df["category"].map(top_cat)
+df["row_id"] = np.arange(len(df))
+# ---------------------------------------------------------------- EDA (saved)
+eda = {}
+eda["n_rows"] = int(len(df))
+eda["n_columns"] = len(ds.column_names)
+eda["columns"] = ds.column_names
+eda["n_duplicate_titles"] = int(df["title"].duplicated().sum())
+eda["missing_per_column"] = {c: int(df[c].isna().sum() + (df[c].astype(str).str.len()==0).sum()) for c in ["title","description","category","brand"]}
+eda["n_top_categories"] = int(df["top_category"].nunique())
+eda["n_full_categories"] = int(df["category"].nunique())
+eda["n_brands"] = int(df["brand"].nunique())
+eda["secondhand_rate"] = float(np.mean(df["secondhand"].astype(bool)))
+topcat_counts = df["top_category"].value_counts()
+eda["top_category_counts"] = topcat_counts.to_dict()
+df["title_len"] = df["title"].astype(str).str.len()
+df["desc_len"] = df["description"].astype(str).str.len()
+eda["title_len_describe"] = df["title_len"].describe().to_dict()
+eda["desc_len_describe"] = df["desc_len"].describe().to_dict()
+json.dump(eda, open(f"{ART}/eda_stats.json","w"), indent=2, default=str)
+print("top categories:\n", topcat_counts.head(20))
+# EDA plots
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+plt.rcParams["figure.dpi"]=120
+keep = topcat_counts[topcat_counts>=TARGET_MIN_CAT].index.tolist()
+plot_counts = topcat_counts[topcat_counts.index.isin(keep)].head(20)
+fig,ax=plt.subplots(figsize=(9,5))
+plot_counts.sort_values().plot.barh(ax=ax, color="#4C72B0")
+ax.set_title("Top-level product categories (count)"); ax.set_xlabel("products")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_categories.png"); plt.close()
+fig,ax=plt.subplots(1,2,figsize=(11,4))
+df["title_len"].clip(0,120).plot.hist(bins=40,ax=ax[0],color="#55A868"); ax[0].set_title("Title length (chars)")
+df["desc_len"].clip(0,2000).plot.hist(bins=40,ax=ax[1],color="#C44E52"); ax[1].set_title("Description length (chars)")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_text_lengths.png"); plt.close()
+print("EDA done.")
+# ---------------------------------------------------------------- balanced sample
+df_keep = df[df["top_category"].isin(keep)].copy()
+PER_CAT_CAP = max(TARGET_MIN_CAT, TOTAL_TARGET // max(1,len(keep)))
+print(f"keep categories: {len(keep)} | per-category cap: {PER_CAT_CAP}")
+parts=[]
+for c,g in df_keep.groupby("top_category"):
+    parts.append(g.sample(min(len(g),PER_CAT_CAP), random_state=SEED))
+sample = pd.concat(parts).sample(frac=1, random_state=SEED).reset_index(drop=True)
+print("sample size:", len(sample), "| categories:", sample["top_category"].nunique())
+# ---------------------------------------------------------------- CLIP embeddings
+from transformers import CLIPModel, CLIPProcessor
+MODEL="openai/clip-vit-base-patch32"
+print("loading CLIP:", MODEL)
+model=CLIPModel.from_pretrained(MODEL).to(DEVICE).eval()
+proc=CLIPProcessor.from_pretrained(MODEL)
+def pil_thumb(img, size=THUMB):
+    img=img.convert("RGB"); img.thumbnail((size,size))
+    buf=io.BytesIO(); img.save(buf,format="JPEG",quality=80)
+    return base64.b64encode(buf.getvalue()).decode()
+@torch.no_grad()
+def embed_batch(imgs):
+    inp=proc(images=imgs, return_tensors="pt").to(DEVICE)
+    v=model.vision_model(pixel_values=inp["pixel_values"])
+    f=model.visual_projection(v.pooler_output)   # project into the shared image-text space
+    f=f/f.norm(dim=-1,keepdim=True)
+    return f.cpu().numpy().astype("float32")
+# Memory-safe streaming: decode a batch -> thumbnail -> embed -> free.
+# We only ever hold BATCH full images at once (the 512-d vectors + small thumbs are tiny).
+sel_ids = sample["row_id"].tolist()
+sub = ds.select(sel_ids)
+BATCH=64
+thumbs=[]; emb_chunks=[]; buf=[]
+print("fetching + embedding images (streaming)...")
+for i,ex in enumerate(sub):
+    im=ex["product_image"].convert("RGB")
+    thumbs.append(pil_thumb(im))
+    buf.append(im)
+    if len(buf)==BATCH:
+        emb_chunks.append(embed_batch(buf)); buf=[]
+    if (i+1)%2000==0: print(f"  {i+1}/{len(sel_ids)} images")
+if buf: emb_chunks.append(embed_batch(buf))
+emb=np.vstack(emb_chunks)
+sample["thumb"]=thumbs
+print("embeddings:", emb.shape)
+sample["embedding"]=[e.tolist() for e in emb]
+# ---------------------------------------------------------------- clustering + projections
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+from sklearn.decomposition import PCA
+print("choosing K via silhouette...")
+sil={}
+sample_for_sil = emb if len(emb)<=8000 else emb[np.random.choice(len(emb),8000,replace=False)]
+Ks=list(range(4,13))
+for K in Ks:
+    km=KMeans(n_clusters=K,n_init=5,random_state=SEED).fit(sample_for_sil)
+    sil[K]=float(silhouette_score(sample_for_sil, km.labels_))
+    print(f"  K={K} silhouette={sil[K]:.4f}")
+bestK=max(sil,key=sil.get)
+print("bestK:", bestK)
+json.dump(sil, open(f"{ART}/silhouette.json","w"), indent=2)
+fig,ax=plt.subplots(figsize=(7,4))
+ax.plot(list(sil.keys()),list(sil.values()),"o-",color="#4C72B0")
+ax.axvline(bestK,ls="--",color="grey"); ax.set_xlabel("K"); ax.set_ylabel("silhouette")
+ax.set_title("K-Means model selection (silhouette)"); plt.tight_layout()
+plt.savefig(f"{ART}/silhouette.png"); plt.close()
+km=KMeans(n_clusters=bestK,n_init=10,random_state=SEED).fit(emb)
+sample["cluster"]=km.labels_
+print("PCA + UMAP projections...")
+pca=PCA(n_components=2,random_state=SEED).fit_transform(emb)
+import umap
+um=umap.UMAP(n_components=2,n_neighbors=15,min_dist=0.1,random_state=SEED,metric="cosine").fit_transform(emb)
+sample["pca_x"],sample["pca_y"]=pca[:,0],pca[:,1]
+sample["umap_x"],sample["umap_y"]=um[:,0],um[:,1]
+def scatter(xy,labels,title,fname,legend_title):
+    fig,ax=plt.subplots(figsize=(9,7))
+    cats=pd.Series(labels).astype(str)
+    for c in sorted(cats.unique()):
+        m=cats==c
+        ax.scatter(xy[m,0],xy[m,1],s=5,alpha=0.5,label=c)
+    ax.set_title(title); ax.set_xticks([]); ax.set_yticks([])
+    ax.legend(title=legend_title,markerscale=3,fontsize=7,loc="best",ncol=2)
+    plt.tight_layout(); plt.savefig(f"{ART}/{fname}"); plt.close()
+scatter(um, sample["top_category"].values, "UMAP of CLIP embeddings — colored by product category", "umap_category.png","category")
+scatter(um, sample["cluster"].values, f"UMAP of CLIP embeddings — colored by K-Means cluster (K={bestK})", "umap_cluster.png","cluster")
+scatter(pca, sample["top_category"].values, "PCA (2D) of CLIP embeddings — colored by category", "pca_category.png","category")
+# cluster vs category crosstab (reasoning)
+ct=pd.crosstab(sample["cluster"],sample["top_category"])
+ct_norm=ct.div(ct.sum(1),axis=0)
+fig,ax=plt.subplots(figsize=(12,6))
+im=ax.imshow(ct_norm.values,aspect="auto",cmap="viridis")
+ax.set_xticks(range(len(ct_norm.columns))); ax.set_xticklabels(ct_norm.columns,rotation=90,fontsize=7)
+ax.set_yticks(range(len(ct_norm.index))); ax.set_yticklabels([f"cluster {i}" for i in ct_norm.index],fontsize=8)
+ax.set_title("Cluster composition by category (row-normalized)"); fig.colorbar(im,fraction=0.025)
+plt.tight_layout(); plt.savefig(f"{ART}/cluster_category_heatmap.png"); plt.close()
+# top category per cluster -> reasoning table
+cluster_profile={}
+for cl in sorted(sample["cluster"].unique()):
+    g=sample[sample["cluster"]==cl]
+    top=g["top_category"].value_counts().head(3)
+    cluster_profile[int(cl)]={"size":int(len(g)),"dominant":top.to_dict(),
+                              "example_titles":g["title"].head(5).tolist()}
+json.dump(cluster_profile, open(f"{ART}/cluster_profile.json","w"), indent=2, default=str)
+print("cluster profile saved.")
+# ---------------------------------------------------------------- save catalog (embedding file)
+catalog=sample[["row_id","title","top_category","category","brand","secondhand",
+                "cluster","thumb","embedding"]].copy()
+catalog.to_parquet(f"{SPACE}/catalog.parquet", index=False)
+catalog.to_parquet(f"{ART}/catalog.parquet", index=False)
+print("saved catalog.parquet:", catalog.shape)
+# ---------------------------------------------------------------- recommender + sanity test
+embn=np.array(catalog["embedding"].tolist(),dtype="float32")  # already L2-normalized
+@torch.no_grad()
+def encode_text(q):
+    inp=proc(text=[q],return_tensors="pt",padding=True,truncation=True).to(DEVICE)
+    t=model.text_model(input_ids=inp["input_ids"],attention_mask=inp["attention_mask"])
+    f=model.text_projection(t.pooler_output); f=f/f.norm(dim=-1,keepdim=True)
+    return f.cpu().numpy()[0]
+def topk(qvec,k=3,dedup=0.985):
+    sims=embn@qvec
+    order=np.argsort(-sims)
+    out=[]
+    for idx in order:
+        if any(float(embn[idx]@embn[j])>dedup for j in out): continue
+        out.append(int(idx))
+        if len(out)==k: break
+    return [(i,float(sims[i])) for i in out]
+print("\n--- recommender sanity (text queries) ---")
+for q in ["a pair of running shoes","wooden kitchen table","gold necklace","baby toy"]:
+    res=topk(encode_text(q))
+    print(f"query: {q}")
+    for i,s in res:
+        print(f"   {s:.3f} | {catalog.iloc[i]['top_category']:25} | {catalog.iloc[i]['title'][:55]}")
+print("\nBUILD COMPLETE")
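
The balanced-sample step in this script keeps categories with at least `TARGET_MIN_CAT` rows and caps each at `max(TARGET_MIN_CAT, TOTAL_TARGET // n_kept)`. With 18 kept categories and a 13,000 target, that cap is 13000 // 18 = 722; categories smaller than the cap contribute all their rows, which is why the final sample can land below the target. The sketch below restates that logic with the standard library only. `balanced_sample` and the toy labels are illustrative assumptions, and it returns row indices rather than a DataFrame.

```python
import random
from collections import defaultdict

def balanced_sample(labels, total_target, min_per_cat, seed=42):
    """Return (shuffled row indices, per-category cap) for a capped stratified sample."""
    by_cat = defaultdict(list)
    for i, lab in enumerate(labels):
        by_cat[lab].append(i)
    # Drop categories that are too small to be informative.
    keep = {c: idx for c, idx in by_cat.items() if len(idx) >= min_per_cat}
    cap = max(min_per_cat, total_target // max(1, len(keep)))
    rng = random.Random(seed)
    picked = []
    for c in sorted(keep):
        idx = keep[c]
        picked.extend(rng.sample(idx, min(len(idx), cap)))  # cap large categories
    rng.shuffle(picked)
    return picked, cap

# Toy labels (hypothetical): one large, one medium, one tiny category.
toy = ["big"] * 10 + ["mid"] * 4 + ["tiny"]
rows, cap = balanced_sample(toy, total_target=9, min_per_cat=3)
print(cap, len(rows))
```

Seeding a local `random.Random` keeps the sample reproducible without touching global random state.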

scripts/02_make_notebook.py ADDED

	@@ -0,0 +1,521 @@

+"""Assemble the deliverable notebook Assignment_3_NoamFuchs.ipynb (then executed via nbconvert)."""
+import nbformat as nbf
+nb = nbf.v4.new_notebook()
+cells = []
+def md(t): cells.append(nbf.v4.new_markdown_cell(t))
+def code(t): cells.append(nbf.v4.new_code_cell(t))
+md("""# Assignment #3: Embeddings, RecSys, Spaces
+## Visual Product Recommender
+**Noam Fuchs**
+This notebook builds a recommendation app on the **vision modality**. Given a text description or
+a product photo, it returns the 3 most similar products from an e-commerce catalogue, using
+**CLIP image embeddings** and **cosine similarity**.
+**Pipeline:** dataset -> EDA -> CLIP embeddings -> clustering (K-Means) + 2D projection (UMAP/PCA) ->
+save embeddings file -> cosine-similarity Top-3 recommender. The same embeddings power the Gradio
+Space.
+**Dataset:** `Shopify/product-catalogue` &nbsp;|&nbsp; **Model:** `openai/clip-vit-base-patch32`""")
+md("# Part 0: Config")
+code("""import os, io, base64, json, warnings, random
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+import torch
+warnings.filterwarnings("ignore")
+SEED = 42
+random.seed(SEED); np.random.seed(SEED)
+DEVICE = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
+print("device:", DEVICE)""")
+md("""# Part 1: Select a Visual Dataset
+I use [`Shopify/product-catalogue`](https://huggingface.co/datasets/Shopify/product-catalogue),
+downloaded directly from Hugging Face. It is a real e-commerce catalogue, which is a natural fit
+for a product recommender.""")
+code("""from datasets import load_dataset
+ds = load_dataset("Shopify/product-catalogue", split="train")
+print("rows:", len(ds))
+print("columns:", ds.column_names)""")
+md("""### Describe the dataset (source, size, features)
+- **Source:** Hugging Face, `Shopify/product-catalogue`.
+- **Size:** ~48K real product listings, each with an embedded product image.
+- **Features:** `product_title`, `product_description`, `product_image`,
+  `ground_truth_category` (Google product taxonomy, e.g. *Home & Garden > Decor > Piggy Banks*),
+  `ground_truth_brand`, `ground_truth_is_secondhand`.
+I derive a **top-level category** (the first segment of the taxonomy) to use later as a label for
+the clustering analysis.""")
+code("""df = pd.DataFrame({
+    "title": ds["product_title"],
+    "description": ds["product_description"],
+    "category": ds["ground_truth_category"],
+    "brand": ds["ground_truth_brand"],
+    "secondhand": ds["ground_truth_is_secondhand"],
+})
+df["top_category"] = df["category"].fillna("Unknown").map(lambda c: c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
+df["row_id"] = np.arange(len(df))
+df.head(3)""")
+md("""# Part 2: Exploratory Data Analysis
+Because this is an **image** dataset, the EDA covers both the metadata (categories, brands, text,
+missingness) and the **images themselves** (dimensions, colour, backgrounds), which is what
+actually drives the embeddings later.""")
+md("### 2.1 Initial inspection and sanity checks")
+code("""print("shape:", df.shape)
+print("duplicate titles:", df["title"].duplicated().sum())
+print("unique top-categories:", df["top_category"].nunique())
+print("unique brands:", df["brand"].replace("", np.nan).dropna().nunique())
+print("secondhand rate: %.3f" % df["secondhand"].astype(bool).mean())
+miss = pd.DataFrame({
+    c: [int(df[c].isna().sum() + (df[c].astype(str).str.len()==0).sum())] for c in ["title","description","category","brand"]
+}, index=["missing/empty"]).T
+miss["pct"] = (100*miss["missing/empty"]/len(df)).round(2)
+miss""")
+md("""Titles and categories are essentially complete; only `brand` has a small gap. There are very few
+exact-duplicate titles, so the catalogue is clean enough to embed directly.""")
+md("### 2.2 Category distribution")
+code("""topcat = df["top_category"].value_counts()
+print(topcat.head(15))
+fig, ax = plt.subplots(figsize=(9,5))
+topcat.head(15).sort_values().plot.barh(ax=ax, color="#4C72B0")
+ax.set_title("Top-level product categories (count)"); ax.set_xlabel("products")
+plt.tight_layout(); plt.show()""")
+md("""Categories are **imbalanced** (Home & Garden, Sporting Goods and Arts & Entertainment dominate),
+so for the embedding analysis I take a **balanced stratified sample** so no single category drives
+the clusters.""")
+md("### 2.3 Brands and product taxonomy")
+code("""top_brands = df["brand"].replace("", np.nan).dropna().value_counts()
+print("brands total:", top_brands.shape[0], "| top brand share of catalogue: %.1f%%" % (100*top_brands.iloc[0]/len(df)))
+depth = pd.Series([c.count(">")+1 for c in df["category"] if isinstance(c,str) and c.strip()])
+print("median taxonomy depth:", int(depth.median()), "levels (e.g. A > B > C > D)")
+fig, ax = plt.subplots(figsize=(9,5))
+top_brands.head(15).sort_values().plot.barh(ax=ax, color="#937860")
+ax.set_title("Top 15 brands by product count"); ax.set_xlabel("products")
+plt.tight_layout(); plt.show()""")
+md("""There is a **very long brand tail** (tens of thousands of brands, none dominant), which is typical
+of a real marketplace and means brand is not a useful grouping signal: the visual content is.""")
+md("### 2.4 Text fields")
+code("""df["title_len"] = df["title"].astype(str).str.len()
+df["desc_len"]  = df["description"].astype(str).str.len()
+fig, ax = plt.subplots(1,2, figsize=(11,4))
+df["title_len"].clip(0,120).plot.hist(bins=40, ax=ax[0], color="#55A868"); ax[0].set_title("Title length (chars)")
+df["desc_len"].clip(0,2000).plot.hist(bins=40, ax=ax[1], color="#C44E52"); ax[1].set_title("Description length (chars)")
+plt.tight_layout(); plt.show()
+df[["title_len","desc_len"]].describe().round(1)""")
+md("Titles are short, descriptions vary widely, and many titles are **multilingual** (English, Hebrew, Japanese, Dutch, Portuguese, etc.), which is worth noting for any text-based query.")
+md("### 2.5 What do the images look like? (sample grid)")
+code("""import random as _r
+_r.seed(SEED)
+cats_sorted = [c for c in topcat.index if c!="Unknown"][:18]
+fig, axes = plt.subplots(3, 6, figsize=(15, 8))
+for ax, cat in zip(axes.ravel(), cats_sorted):
+    pick = _r.choice(df.index[df["top_category"]==cat].tolist())
+    ax.imshow(ds[int(pick)]["product_image"].convert("RGB")); ax.set_title(cat[:22], fontsize=8); ax.axis("off")
+for ax in axes.ravel()[len(cats_sorted):]: ax.axis("off")
+fig.suptitle("Sample product image per top-level category", fontsize=13)
+plt.tight_layout(); plt.show()""")
+md("### 2.6 Image properties (dimensions, colour, backgrounds)")
+code("""_r.seed(SEED)
+samp = _r.sample(range(len(ds)), 2000)
+W=[]; H=[]; bright=[]; gray=0; whitebg=0
+for i in samp:
+    im = ds[int(i)]["product_image"]; w,h = im.size; W.append(w); H.append(h)
+    a = np.asarray(im.convert("RGB").resize((32,32)), dtype="float32"); bright.append(a.mean())
+    if np.abs(a[...,0]-a[...,1]).mean()<3 and np.abs(a[...,1]-a[...,2]).mean()<3: gray+=1
+    border = np.concatenate([a[0,:,:],a[-1,:,:],a[:,0,:],a[:,-1,:]])
+    if border.mean()>235: whitebg+=1
+W=np.array(W); H=np.array(H); ar=W/np.maximum(H,1)
+white_pct=100*whitebg/len(samp); gray_pct=100*gray/len(samp)
+print(f"median size: {int(np.median(W))}x{int(np.median(H))} px | white-background: {white_pct:.0f}% | grayscale-ish: {gray_pct:.0f}% | median brightness: {np.median(bright):.0f}/255")
+fig, ax = plt.subplots(1,3, figsize=(14,4))
+ax[0].hist(np.clip(W,0,1500),bins=40,color="#4C72B0"); ax[0].set_title("Image width (px)")
+ax[1].hist(np.clip(H,0,1500),bins=40,color="#55A868"); ax[1].set_title("Image height (px)")
+ax[2].hist(np.clip(ar,0,3),bins=40,color="#C44E52"); ax[2].set_title("Aspect ratio (w/h)")
+plt.tight_layout(); plt.show()
+fig, ax = plt.subplots(1,2, figsize=(11,4))
+ax[0].hist(bright,bins=40,color="#8172B3"); ax[0].axvline(np.median(bright),ls="--",color="k"); ax[0].set_title("Mean image brightness (0-255)")
+ax[1].bar(["white\\nbackground","grayscale","colour"],[white_pct,gray_pct,100-gray_pct],color=["#CCCCCC","#888888","#4C72B0"]); ax[1].set_ylabel("% of sample"); ax[1].set_title("Image composition")
+plt.tight_layout(); plt.show()""")
+md("""**EDA takeaways.** The catalogue is large and clean (titles/categories essentially complete, few
+duplicates). The images are mostly **square (~900x900), bright, white-background studio shots**
+(around 70% have a near-white border). I keep coming back to this number in the clustering and
+recommender sections: white backgrounds make products from different categories look alike, so the
+clustering silhouette ends up modest and the recommender does best on clean single-product photos.
+Because categories are imbalanced, I embed a **balanced stratified sample**.""")
+md("""# Part 3: Embeddings
+I use **CLIP** (`openai/clip-vit-base-patch32`), a small/medium model that embeds **images and
+text into one shared 512-d space**. This shared space is what later lets the app accept *both* a
+text query and an image query against the same catalogue vectors.""")
+code("""from transformers import CLIPModel, CLIPProcessor
+MODEL = "openai/clip-vit-base-patch32"
+model = CLIPModel.from_pretrained(MODEL).to(DEVICE).eval()
+proc  = CLIPProcessor.from_pretrained(MODEL)
+@torch.no_grad()
+def embed_images(images, bs=64):
+    out = []
+    for k in range(0, len(images), bs):
+        inp = proc(images=images[k:k+bs], return_tensors="pt").to(DEVICE)
+        v = model.vision_model(pixel_values=inp["pixel_values"])
+        f = model.visual_projection(v.pooler_output)   # project into the shared image-text space
+        out.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy())
+    return np.vstack(out).astype("float32")
+@torch.no_grad()
+def encode_text(q):
+    inp = proc(text=["a product photo of " + q], return_tensors="pt", padding=True, truncation=True).to(DEVICE)
+    t = model.text_model(input_ids=inp["input_ids"], attention_mask=inp["attention_mask"])
+    f = model.text_projection(t.pooler_output)
+    return (f / f.norm(dim=-1, keepdim=True)).cpu().numpy()[0]
+# short prompt template ("a product photo of ...") calibrates CLIP text queries to the catalogue
+print("image encoder check:", embed_images([ds[0]["product_image"], ds[1]["product_image"]]).shape)""")
+md("""### Build the balanced sample
+Categories are imbalanced, so I keep categories with at least 150 products and cap each one, aiming
+for about 13K total. The cap plus the dropped small categories land it at **11,912 products across 18
+categories**, which is the balanced subset I embed.""")
+code("""TOTAL_TARGET, MIN_PER_CAT = 13000, 150
+keep = topcat[topcat >= MIN_PER_CAT].index.tolist()
+PER_CAT = max(MIN_PER_CAT, TOTAL_TARGET // len(keep))
+sample = pd.concat([g.sample(min(len(g), PER_CAT), random_state=SEED)
+                    for _, g in df[df["top_category"].isin(keep)].groupby("top_category")])
+sample = sample.sample(frac=1, random_state=SEED).reset_index(drop=True)
+print(f"kept {len(keep)} categories, cap {PER_CAT}/category -> {len(sample)} products")""")
+md("""### Embed the sample with CLIP
+The loop below decodes images in batches, makes a thumbnail, and embeds each batch (streaming, so
+memory stays flat). It is the real build step. It takes a few minutes, so it is gated behind a flag
+and the analysis reloads the saved `catalog.parquet` by default; set the flag to `True` to rebuild.""")
+code("""import base64, io
+RECOMPUTE_EMBEDDINGS = False   # True rebuilds from images (~5 min on MPS/GPU); default reloads the saved file
+def make_thumb(im, size=110):
+    im = im.convert("RGB"); im.thumbnail((size, size))
+    b = io.BytesIO(); im.save(b, "JPEG", quality=80); return base64.b64encode(b.getvalue()).decode()
+if RECOMPUTE_EMBEDDINGS:
+    sub = ds.select(sample["row_id"].tolist())
+    thumbs, chunks, buf = [], [], []
+    for ex in sub:
+        im = ex["product_image"].convert("RGB"); thumbs.append(make_thumb(im)); buf.append(im)
+        if len(buf) == 64: chunks.append(embed_images(buf)); buf = []
+    if buf: chunks.append(embed_images(buf))
+    catalog = sample.copy(); catalog["thumb"] = thumbs
+    catalog["embedding"] = [e.tolist() for e in np.vstack(chunks)]
+    catalog.to_parquet("../space/catalog.parquet", index=False)
+else:
+    catalog = pd.read_parquet("../space/catalog.parquet")   # embeddings computed once and saved
+EMB = np.array(catalog["embedding"].tolist(), dtype="float32")   # L2-normalized 512-d vectors
+print("catalog:", catalog.shape, "| embeddings:", EMB.shape, "| sample norm check:", round(float(np.linalg.norm(EMB[0])),3))
+catalog[["title","top_category","brand"]].head(3)""")
+md("""### 3.1–3.2 Clustering (K-Means) with K chosen by silhouette
+CLIP vectors are high-dimensional (512-d), where K-Means struggles. I first reduce to **50 PCA
+components** (denoising, keeps ~57% of variance), then run K-Means and pick K by **silhouette score**.""")
+code("""from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+from sklearn.decomposition import PCA
+P = PCA(n_components=50, random_state=SEED).fit_transform(EMB)   # denoise before clustering
+idx = np.random.RandomState(SEED).choice(len(P), min(8000,len(P)), replace=False)   # seeded so the silhouette curve is reproducible
+sil = {}
+for K in range(4,11):
+    km = KMeans(n_clusters=K, n_init=5, random_state=SEED).fit(P[idx])
+    sil[K] = silhouette_score(P[idx], km.labels_)
+bestK = max(sil, key=sil.get)
+print("silhouette by K:", {k: round(v,3) for k,v in sil.items()})
+print("best K:", bestK)
+plt.figure(figsize=(7,4))
+plt.plot(list(sil), list(sil.values()), "o-", color="#4C72B0"); plt.axvline(bestK, ls="--", color="grey")
+plt.xlabel("K"); plt.ylabel("silhouette"); plt.title("K-Means model selection (on PCA-50)"); plt.show()""")
+code("""km = KMeans(n_clusters=bestK, n_init=10, random_state=SEED).fit(P)
+catalog["cluster"] = km.labels_
+catalog["cluster"].value_counts().sort_index()""")
+md("### 3.1 Project embeddings to 2D (UMAP and PCA)")
+code("""import umap
+pca = PCA(n_components=2, random_state=SEED).fit_transform(EMB)
+um  = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(EMB)
+def scatter(xy, labels, title, legend):
+    plt.figure(figsize=(9,7))
+    s = pd.Series(labels).astype(str)
+    for c in sorted(s.unique()):
+        m = (s==c).values
+        plt.scatter(xy[m,0], xy[m,1], s=5, alpha=0.5, label=c)
+    plt.title(title); plt.xticks([]); plt.yticks([])
+    plt.legend(title=legend, markerscale=3, fontsize=7, ncol=2); plt.tight_layout(); plt.show()
+scatter(um, catalog["top_category"].values, "UMAP of CLIP embeddings, colored by product category", "category")
+scatter(um, catalog["cluster"].values, f"UMAP of CLIP embeddings, colored by K-Means cluster (K={bestK})", "cluster")
+scatter(pca, catalog["top_category"].values, "PCA (2D) of CLIP embeddings, colored by category", "category")""")
+md("### 3.3 Are the clusters coherent? (cluster reasoning)")
+code("""ct = pd.crosstab(catalog["cluster"], catalog["top_category"])
+ct_norm = ct.div(ct.sum(1), axis=0)
+fig, ax = plt.subplots(figsize=(12,6))
+im = ax.imshow(ct_norm.values, aspect="auto", cmap="viridis")
+ax.set_xticks(range(len(ct_norm.columns))); ax.set_xticklabels(ct_norm.columns, rotation=90, fontsize=7)
+ax.set_yticks(range(len(ct_norm.index)));  ax.set_yticklabels([f"cluster {i}" for i in ct_norm.index], fontsize=8)
+ax.set_title("Cluster composition by category (row-normalized)"); fig.colorbar(im, fraction=0.025)
+plt.tight_layout(); plt.show()""")
+code("""# dominant category + example products per cluster
+for cl in sorted(catalog["cluster"].unique()):
+    g = catalog[catalog["cluster"]==cl]
+    dom = g["top_category"].value_counts().head(2).to_dict()
+    print(f"cluster {cl} (n={len(g)}): {dom}")
+    for t in g["title"].head(3): print("     -", t[:60])""")
+md("""**Reasoning.** I want to be honest about how strong this is. The silhouette is **low and almost
+flat** across K (about 0.08 at K=4, barely above the score at K=7), so there is no sharp natural number of
+clusters: K-Means only finds **coarse** structure. That is the expected consequence of the EDA
+finding that ~70% of products are white-background studio shots, so items from different categories
+genuinely look alike.
+The evidence that the structure is nonetheless real is the **UMAP-by-category plot**, where products
+of the same category land together, and the per-centroid image grid below. Read at the coarse level
+K-Means gives, the four clusters map to broad visual families: consumables and packaged goods;
+furniture and apparel; electronics, cameras and hardware; and small colourful toys/office/media
+items (the exact dominant categories are printed above and labelled on the grid). The heatmap is
+diffuse rather than block-diagonal, which is consistent with the low silhouette: categories are
+locally separable but globally overlapping. That is enough for nearest-neighbour recommendation,
+which only needs the *near* neighbours of a query to be similar, not the whole space to split cleanly.""")
+md("""### 3.4 Save the embeddings file
+The embeddings, metadata and thumbnails are stored in **`catalog.parquet`**, which the Gradio Space
+loads at runtime. (The file was written by the build script; shown here for completeness.)""")
+code("""print("embedding file: catalog.parquet")
+print("columns:", list(catalog.columns))
+print("rows:", len(catalog), "| embedding dim:", len(catalog['embedding'].iloc[0]))""")
+md("""## Part 3.5: Going deeper (bonus)
+The brief asks for at least one clustering algorithm and one projection. To be thorough I add a
+second projection (**t-SNE**), a second clustering algorithm (**DBSCAN**), and I look at the actual
+images inside each cluster instead of trusting the labels alone.""")
+md("""### t-SNE projection (a second view of the space)
+PCA above is a global linear view, and UMAP and t-SNE both emphasise local neighbourhoods but build
+them differently, so t-SNE is a useful cross-check that the per-category grouping is real and not a
+UMAP artefact.""")
+code("""from sklearn.manifold import TSNE
+sub = np.random.RandomState(SEED).choice(len(P), 4000, replace=False)  # subsample keeps t-SNE light
+ts = TSNE(n_components=2, perplexity=30, init="pca", random_state=SEED).fit_transform(P[sub])
+plt.figure(figsize=(9,7))
+s = pd.Series(catalog["top_category"].values[sub]).astype(str)
+for c in sorted(s.unique()):
+    mk=(s==c).values; plt.scatter(ts[mk,0], ts[mk,1], s=6, alpha=0.5, label=c)
+plt.title("t-SNE of CLIP embeddings (4K sample), colored by category"); plt.xticks([]); plt.yticks([])
+plt.legend(title="category", markerscale=3, fontsize=7, ncol=2); plt.tight_layout(); plt.show()""")
+md("""### DBSCAN (a density-based second opinion)
+K-Means forces every product into one of K round clusters. DBSCAN instead finds dense regions and
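+calls the rest noise, so it tells us whether the space has natural density structure. I run it on
+the 2D UMAP coordinates and set `eps` from a **k-distance plot** (the 92nd percentile of the sorted
+k-distances, a simple stand-in for the knee) rather than guessing.""")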
+code("""from sklearn.cluster import DBSCAN
+from sklearn.neighbors import NearestNeighbors
+k = 15
+kd = np.sort(NearestNeighbors(n_neighbors=k).fit(um).kneighbors(um)[0][:,-1])
+eps = float(np.percentile(kd, 92))
+fig, ax = plt.subplots(1,2, figsize=(13,5))
+ax[0].plot(kd, color="#4C72B0"); ax[0].axhline(eps, ls="--", color="grey")
+ax[0].set_title(f"k-distance plot (k={k}) chooses eps={eps:.2f}"); ax[0].set_xlabel("points sorted"); ax[0].set_ylabel("distance")
+dl = DBSCAN(eps=eps, min_samples=k).fit(um).labels_
+n_db = len(set(dl)) - (1 if -1 in dl else 0)
+for c in sorted(set(dl)):
+    mk = dl==c
+    ax[1].scatter(um[mk,0], um[mk,1], s=5, alpha=0.5, color=("lightgrey" if c==-1 else None), label=("noise" if c==-1 else None))
+ax[1].set_title(f"DBSCAN on the UMAP space: {n_db} clusters, {100*np.mean(dl==-1):.0f}% noise"); ax[1].set_xticks([]); ax[1].set_yticks([])
+plt.tight_layout(); plt.show()
+print(f"DBSCAN found {n_db} dense clusters vs K-Means {bestK}; only {100*np.mean(dl==-1):.0f}% of points are noise")""")
+md("""DBSCAN breaks the catalogue into many more, finer clusters (and almost no noise), which says the
+space is densely packed with small visual neighbourhoods. Because it ran on the 2D UMAP coordinates,
+those densities describe the projection rather than the raw 512-d space. K-Means K=4 is the coarse,
+interpretable summary of that same structure; DBSCAN is the fine-grained view. They agree that the
+space is well-populated and clusterable, which is the reassuring result for a recommender.""")
+md("""### What is actually inside each cluster?
+Numbers and labels are one thing; the honest test is to look at the images. For each K-Means cluster
+I show the products closest to the centroid.""")
+code("""import base64, io
+from PIL import Image
+def thumb(t): return Image.open(io.BytesIO(base64.b64decode(t))).convert("RGB")
+# label each row by the cluster's actual two dominant categories, so the caption can never drift
+def cluster_label(cl):
+    top = catalog[catalog["cluster"]==cl]["top_category"].value_counts().head(2).index.tolist()
+    return f"cluster {cl}: " + ", ".join(top)
+fig, axes = plt.subplots(bestK, 6, figsize=(12, 2*bestK))
+for cl in range(bestK):
+    mem = np.where(catalog["cluster"].values==cl)[0]
+    cen = EMB[mem].mean(0); cen /= np.linalg.norm(cen)
+    near = mem[np.argsort(-(EMB[mem]@cen))[:6]]
+    for j,idx in enumerate(near):
+        axes[cl,j].imshow(thumb(catalog.iloc[idx]["thumb"])); axes[cl,j].axis("off")
+    axes[cl,0].set_title(cluster_label(cl), loc="left", fontsize=9)
+fig.suptitle("Representative products per cluster (closest to centroid)", fontsize=12)
+plt.tight_layout(); plt.show()""")
+md("""Looking at the actual products is what convinced me the space was usable: each row is visibly one
+visual family (packaged goods, then furniture and apparel, then devices and tools, then small
+colourful items). The grid agrees with the heatmap, so the recommender is standing on real structure.""")
+md("""# Part 4: Inputs & Outputs (cosine similarity, Top-3)
+The recommender encodes the user input (text or image) with the same CLIP model, computes cosine
+similarity against every catalogue embedding (a dot product, since vectors are L2-normalized), and
+returns the Top-3 (filtering near-duplicates). Uploading an image is the strongest mode, because
+image-to-image stays within one modality and so avoids the text/image gap in CLIP.""")
+code("""def top_matches(qvec, k=3, dup=0.985):
+    sims = EMB @ qvec
+    order = np.argsort(-sims)
+    chosen = []
+    for i in order:
+        if any(float(EMB[i] @ EMB[j]) > dup for j in chosen): continue
+        chosen.append(int(i))
+        if len(chosen)==k: break
+    return [(i, float(sims[i])) for i in chosen]
+def show_text_query(q):
+    res = top_matches(encode_text(q))
+    print("query:", q)
+    for r,(i,s) in enumerate(res,1):
+        print(f"  #{r}  sim={s:.3f}  [{catalog.iloc[i]['top_category']}]  {catalog.iloc[i]['title'][:55]}")
+    return res
+for q in ["camera lens","helmet","sofa","dog leash","sunglasses"]:
+    show_text_query(q); print()""")
+md("**Image query** (an uploaded product photo that is *not* in the catalogue), with the Top-3 shown as images:")
+code("""from PIL import Image
+query_img = Image.open("../space/examples/cameras.jpg").convert("RGB")  # a held-out product photo
+qvec = embed_images([query_img])[0]
+res = top_matches(qvec)
+fig, ax = plt.subplots(1, 4, figsize=(14,4))
+ax[0].imshow(query_img); ax[0].set_title("QUERY (uploaded photo)", fontsize=9); ax[0].axis("off")
+for k,(i,s) in enumerate(res,1):
+    im = plt.imread(io.BytesIO(base64.b64decode(catalog.iloc[i]['thumb'])), format="jpeg")
+    ax[k].imshow(im); ax[k].set_title(f"#{k} sim={s:.2f}\\n{catalog.iloc[i]['top_category']}", fontsize=9); ax[k].axis("off")
+plt.tight_layout(); plt.show()""")
+md("""### Evaluation
+To quantify quality I run **image-to-image** retrieval on 80 held-out products and check how often
+the returned products share the query's top-level category. This is a conservative proxy: visually
+similar products often sit in different taxonomy categories (e.g. a metal spear tip retrieving
+nails), and a held-out query from a category that was dropped from the balanced sample can never
+match.""")
+code("""import random as _r
+used = set(catalog["row_id"].tolist())
+gt = [ (c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown") for c in ds["ground_truth_category"] ]
+cand = [i for i in range(len(ds)-1,0,-1) if i not in used]
+_r.seed(7); qrows = _r.sample(cand, 80)
+p1 = p3 = 0
+for qi in qrows:
+    v = embed_images([ds[qi]["product_image"].convert("RGB")])[0]
+    order = np.argsort(-(EMB @ v))[:3]
+    cats = [catalog.iloc[i]["top_category"] for i in order]
+    p1 += (cats[0] == gt[qi]); p3 += sum(c == gt[qi] for c in cats)
+print(f"image->image category-match  precision@1 = {p1/len(qrows):.2f}   precision@3 = {p3/(3*len(qrows)):.2f}")
+print("(frequency-weighted random baseline is roughly 0.10-0.13, so this is ~3x baseline)")""")
+md("""Two things to read here. First, a precision@1 of ~0.39 is about 3x the random baseline, so the retrieval is real,
+not luck. Second, precision@3 is a little *lower* than precision@1, which is expected for a visual
+recommender: the single nearest neighbour is the safest match, and ranks 2 and 3 start drifting into
+products that look alike but sit in a different taxonomy category (the spear-tip and nails case).
+That drift is a property of matching appearance rather than meaning, not a failure of the model.""")
+md("""### A visual check: query to Top-3 (held-out photos)
+The category metric is conservative. The clearest test is to look at real recommendations for photos
+the catalogue never saw.""")
+code("""ex = [f"../space/examples/{f}" for f in ["cameras.jpg","furniture.jpg","animals.jpg","sporting.jpg","electronics.jpg"]]
+ex = [e for e in ex if os.path.exists(e)]
+fig, axes = plt.subplots(len(ex), 4, figsize=(11, 2.6*len(ex)))
+for r,fn in enumerate(ex):
+    q = Image.open(fn).convert("RGB"); res = top_matches(embed_images([q])[0])
+    axes[r,0].imshow(q); axes[r,0].set_title("query", fontsize=9); axes[r,0].axis("off")
+    for k2,(i,s) in enumerate(res,1):
+        axes[r,k2].imshow(thumb(catalog.iloc[i]["thumb"]))
+        axes[r,k2].set_title(f"#{k2}  {s:.2f}\\n{catalog.iloc[i]['top_category'][:16]}", fontsize=8); axes[r,k2].axis("off")
+fig.suptitle("Query (uploaded photo) -> Top-3 recommendations", fontsize=12)
+plt.tight_layout(); plt.show()""")
+md("""### Bonus: a faster similarity backend with FAISS
+A linear `EMB @ q` scan is fine for 12K items, but a real catalogue can hold millions. **FAISS** is
+the standard library for fast vector search, so I index the embeddings with it and confirm it returns
+the **same** Top-3 as the brute-force scan, much faster. `IndexFlatIP` is still an exact, exhaustive
+search; at millions of items the same library offers approximate indexes (IVF, HNSW), and that is
+the piece that would let the same app scale.""")
+code('''# FAISS and PyTorch can each load their own OpenMP runtime, and the two can clash in one process,
+# so I run the benchmark in a clean subprocess (numpy + faiss only) to keep this notebook kernel stable.
+import subprocess, sys, textwrap
+_bench = textwrap.dedent("""
+    import numpy as np, pandas as pd, faiss, time
+    E = np.ascontiguousarray(np.array(pd.read_parquet('../space/catalog.parquet')['embedding'].tolist(), dtype='float32'))
+    idx = faiss.IndexFlatIP(E.shape[1]); idx.add(E)                 # inner product == cosine on normalized vectors
+    rng = np.random.RandomState(0); Q = np.ascontiguousarray(E[rng.choice(len(E), 500, replace=False)])
+    t0=time.time(); _, I = idx.search(Q, 4); ft=time.time()-t0
+    t0=time.time()
+    for q in Q: np.argsort(-(E@q))[:4]
+    bt=time.time()-t0
+    ag = np.mean([len(set(I[i,1:4].tolist()) & set(np.argsort(-(E@Q[i]))[1:4].tolist()))/3 for i in range(len(Q))])
+    print(f"FAISS {ft*1000:.0f} ms vs brute force {bt*1000:.0f} ms on 500 queries  ({bt/ft:.0f}x faster)")
+    print(f"Top-3 agreement with brute force: {ag:.0%}  (FAISS is exact here, just faster)")
+""")
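+_res = subprocess.run([sys.executable, "-c", _bench], capture_output=True, text=True)
+print(_res.stdout if _res.returncode == 0 else _res.stderr)   # show the error instead of a blank cell if the benchmark fails''')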
+md("""### Business and ethical considerations (bonus)
+**Business value.** Visual similarity search is the engine behind "shop the look" and "more like
+this" features in e-commerce. It needs no manual tagging (it runs on the product image alone), works
+across languages (useful here, since titles are multilingual), and helps with cold-start items that
+have no clicks yet. The same `catalog.parquet` + FAISS setup would power related-item carousels or a
+visual search bar.
+**Limits and ethics.**
+- **Visual, not semantic.** The model matches appearance, so it can pair items that look alike but
+  serve different purposes. For shopping that is usually fine, but it should not be used where the
+  *function* matters (for example medical or safety products).
+- **Representation bias.** CLIP was trained on web images and reflects their biases; a product photo
+  in an unusual style or from an under-represented region may embed poorly and be under-recommended.
+- **Catalogue gaps.** Recommendations can only ever point inside the catalogue, so sparse categories
+  (few necklaces, few mugs here) give weak results regardless of the model.
+**What I would improve next.** A larger or domain-tuned CLIP for the weak text queries, a
+single-domain catalogue (the white-background overlap hurts cross-category separation), and proper
+human relevance judgements instead of the category-match proxy.""")
+md("""# Part 5 & 7: Space + Submission
+The same `catalog.parquet` and CLIP model power a **Gradio app deployed on Hugging Face Spaces**.
+The app takes an uploaded product photo or a text description and returns the Top-3 most similar
+products.
+**Live Space:** https://huggingface.co/spaces/Noam12345/visual-product-recommender
+## Conclusion
+CLIP gives a single shared space for images and text. Clustering on PCA-reduced embeddings recovered
+four coarse visual product families (the silhouette is low, about 0.08, so the split is loose but
+interpretable), which supports the space being meaningful. Cosine similarity
+over those embeddings produces relevant Top-3 recommendations, strongest for image-to-image queries
+(precision@1 about 3x the random baseline), and the whole pipeline is served live in the Gradio
+Space.""")
+nb["cells"] = cells
+nb["metadata"]["kernelspec"] = {"name":"python3","display_name":"Python 3","language":"python"}
+nbf.write(nb, "Assignment_3_NoamFuchs.ipynb")
+print("notebook written")
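Note on `top_matches` in the notebook script above: the ranking relies on cosine similarity reducing to a dot product once vectors are L2-normalized, followed by a greedy near-duplicate filter. A minimal sketch of that logic on toy 3-d vectors, written with plain Python lists so it runs without the notebook's dependencies. The name `top_matches_sketch`, the toy vectors, and the query are illustrative assumptions, not part of the commit.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_matches_sketch(catalog_vecs, query_vec, k=3, dup=0.985):
    """Return up to k (index, similarity) pairs, skipping near-duplicates of earlier picks."""
    sims = [dot(v, query_vec) for v in catalog_vecs]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])   # best match first
    chosen = []
    for i in order:
        # skip item i if it is almost identical to something already chosen
        if any(dot(catalog_vecs[i], catalog_vecs[j]) > dup for j in chosen):
            continue
        chosen.append(i)
        if len(chosen) == k:
            break
    return [(i, sims[i]) for i in chosen]

# Toy catalogue (hypothetical data): item 1 is a near-duplicate of item 0.
toy = [l2_normalize(v) for v in [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]]
query = l2_normalize([1.0, 0.1, 0.0])

result = top_matches_sketch(toy, query)
no_filter = top_matches_sketch(toy, query, dup=1.1)   # threshold above 1 disables the filter
print([i for i, _ in result], [i for i, _ in no_filter])
```

With the filter on, item 0 is dropped because its similarity to the already chosen item 1 exceeds `dup`, so the third slot goes to the next distinct product instead of a repeat.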

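Note on the DBSCAN cell in the notebook script above: `eps` is taken as the 92nd percentile of the sorted k-distance curve. A small sketch of that heuristic on a toy 2-d point set, in plain Python. It counts each point as its own first neighbour, which is what `kneighbors` does when it is passed the same points it was fitted on. The helper names, the toy grid, and `k=4` are assumptions for illustration only.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour, counting itself as the first."""
    out = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points)   # d[0] is 0.0, the point itself
        out.append(d[k - 1])
    return sorted(out)

def percentile(sorted_vals, pct):
    """Linearly interpolated percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

# Toy data (hypothetical): a dense 5x5 grid plus two far-away outliers.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
points = grid + [(20.0, 20.0), (-15.0, 8.0)]

kd = k_distances(points, k=4)
eps = percentile(kd, 92)
# Points whose k-distance exceeds eps cannot be DBSCAN core points at min_samples=k.
non_core = sum(d > eps + 1e-9 for d in kd)
print(f"eps={eps:.3f}, non-core points={non_core}")
```

On this toy set the percentile lands on the flat part of the curve just before the jump caused by the two outliers, which is the behaviour the knee heuristic is meant to approximate. On real data the 92nd percentile is a tunable choice, not a guarantee of finding the knee.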
scripts/03_finalize.py ADDED Viewed

	@@ -0,0 +1,107 @@

+"""Finalize clustering (PCA-50 + K-Means) and evaluate the recommender, reusing the saved
+embeddings in catalog.parquet (no re-embedding). Regenerates the README/notebook artifacts and
+updates the catalog's cluster column. Run from work/ with the venv active."""
+import os, json, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; SPACE="../space"; os.makedirs(ART, exist_ok=True)
+cat = pd.read_parquet(f"{SPACE}/catalog.parquet")
+E = np.array(cat["embedding"].tolist(), dtype="float32")
+print("catalog:", cat.shape, "emb:", E.shape)
+from sklearn.decomposition import PCA
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+# PCA-50 denoising before clustering (high-dim CLIP vectors cluster better after PCA)
+pca50 = PCA(n_components=50, random_state=SEED).fit(E)
+P = pca50.transform(E)
+print("PCA-50 explained variance:", round(float(pca50.explained_variance_ratio_.sum()),3))
+idx = np.random.choice(len(P), min(8000,len(P)), replace=False)
+sil = {}
+for K in range(4,11):
+    km = KMeans(K, n_init=5, random_state=SEED).fit(P[idx])
+    sil[K] = float(silhouette_score(P[idx], km.labels_))
+bestK = max(sil, key=sil.get)
+print("silhouette:", {k:round(v,3) for k,v in sil.items()}, "-> bestK", bestK)
+json.dump(sil, open(f"{ART}/silhouette.json","w"), indent=2)
+plt.figure(figsize=(7,4))
+plt.plot(list(sil),list(sil.values()),"o-",color="#4C72B0"); plt.axvline(bestK,ls="--",color="grey")
+plt.xlabel("K"); plt.ylabel("silhouette score"); plt.title("K-Means model selection (on PCA-50)")
+plt.tight_layout(); plt.savefig(f"{ART}/silhouette.png"); plt.close()
+km = KMeans(bestK, n_init=10, random_state=SEED).fit(P)
+cat["cluster"] = km.labels_
+# 2D projections (UMAP on full embeddings, PCA-2 for a linear view)
+import umap
+um = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(E)
+pca2 = PCA(n_components=2, random_state=SEED).fit_transform(E)
+def scatter(xy, labels, title, fname, legend):
+    plt.figure(figsize=(9,7))
+    s = pd.Series(labels).astype(str)
+    for c in sorted(s.unique(), key=lambda x:(len(x),x)):
+        m=(s==c).values; plt.scatter(xy[m,0],xy[m,1],s=5,alpha=0.5,label=c)
+    plt.title(title); plt.xticks([]); plt.yticks([])
+    plt.legend(title=legend, markerscale=3, fontsize=7, ncol=2, loc="best")
+    plt.tight_layout(); plt.savefig(f"{ART}/{fname}"); plt.close()
+scatter(um, cat["top_category"].values, "UMAP of CLIP embeddings, colored by product category", "umap_category.png", "category")
+scatter(um, cat["cluster"].values, f"UMAP of CLIP embeddings, colored by K-Means cluster (K={bestK})", "umap_cluster.png", "cluster")
+scatter(pca2, cat["top_category"].values, "PCA (2D) of CLIP embeddings, colored by category", "pca_category.png", "category")
+ct = pd.crosstab(cat["cluster"], cat["top_category"]); ctn = ct.div(ct.sum(1),axis=0)
+plt.figure(figsize=(12,6)); plt.imshow(ctn.values, aspect="auto", cmap="viridis")
+plt.xticks(range(len(ctn.columns)), ctn.columns, rotation=90, fontsize=7)
+plt.yticks(range(len(ctn.index)), [f"cluster {i}" for i in ctn.index], fontsize=8)
+plt.title("Cluster composition by category (row-normalized)"); plt.colorbar(fraction=0.025)
+plt.tight_layout(); plt.savefig(f"{ART}/cluster_category_heatmap.png"); plt.close()
+profile={}
+print("\n=== cluster profiles ===")
+for cl in sorted(cat["cluster"].unique()):
+    g=cat[cat["cluster"]==cl]; dom=g["top_category"].value_counts().head(3)
+    profile[int(cl)]={"size":int(len(g)),"dominant":dom.to_dict(),"examples":g["title"].head(4).tolist()}
+    print(f"cluster {cl} (n={len(g)}): {dom.to_dict()}")
+json.dump(profile, open(f"{ART}/cluster_profile.json","w"), indent=2, default=str)
+cat.to_parquet(f"{SPACE}/catalog.parquet", index=False)
+cat.to_parquet(f"{ART}/catalog.parquet", index=False)
+print("updated catalog.parquet with PCA-50 clusters")
+# ---------------- recommender evaluation (image->image, held-out queries) ----------------
+import torch
+from datasets import load_dataset
+from transformers import CLIPModel, CLIPProcessor
+m=CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
+p=CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+ds=load_dataset("Shopify/product-catalogue", split="train")
+topcat=lambda c:(c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
+gt=[topcat(c) for c in ds["ground_truth_category"]]
+@torch.no_grad()
+def ei(img):
+    i=p(images=img.convert("RGB"),return_tensors="pt"); v=m.vision_model(pixel_values=i["pixel_values"])
+    f=m.visual_projection(v.pooler_output); return (f/f.norm(dim=-1,keepdim=True)).numpy()[0]
+used=set(cat["row_id"].tolist())
+cand=[i for i in range(len(ds)-1,0,-1) if i not in used]
+random.seed(7); queries=random.sample(cand, 80)
+p1=p3=tot3=0
+for qi in queries:
+    v=ei(ds[qi]["product_image"]); qc=gt[qi]
+    order=np.argsort(-(E@v))[:3]
+    cats=[cat.iloc[i]["top_category"] for i in order]
+    p1 += (cats[0]==qc); p3 += sum(c==qc for c in cats); tot3 += 3
+prec1=p1/len(queries); prec3=p3/tot3
+print(f"\nimage->image category-match: precision@1={prec1:.2f}  precision@3={prec3:.2f}  (n={len(queries)})")
+json.dump({"precision_at_1":prec1,"precision_at_3":prec3,"n_queries":len(queries)},
+          open(f"{ART}/eval.json","w"), indent=2)
+print("FINALIZE COMPLETE")
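Note on the silhouette numbers produced by this script: a score near 0.08 means points are only slightly closer to their own cluster than to the nearest other one. A minimal pure-Python sketch of the mean silhouette on two toy layouts makes that scale concrete. It follows the usual definition, s = (b - a) / max(a, b), and scores singleton clusters as 0; the function name and the toy points are assumptions, and the script itself uses `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette_mean(points, labels):
    """Mean silhouette over all points; assumes at least two clusters and non-identical points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)   # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

labels = [0, 0, 1, 1]
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]   # two tight, distant pairs
touching = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]      # same pairs pushed together

s_sep = silhouette_mean(separated, labels)
s_touch = silhouette_mean(touching, labels)
print(f"separated: {s_sep:.2f}   touching: {s_touch:.2f}")
```

The second layout still has a valid two-way split, but its score is far lower than the first, which is the situation the low, flat silhouette curve in this commit describes.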

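Note on the evaluation at the end of `03_finalize.py`: precision@1 counts only the top hit per query, while precision@3 averages over all three returned slots, so the two can diverge. A short sketch of both, plus one possible reading of a "frequency-weighted random baseline" (the chance that a randomly drawn catalogue item shares the query's category). That reading, the function names, and the toy categories are assumptions, not taken from the commit.

```python
def precision_at_k(retrieved, query_cats, k):
    """Fraction of the top-k slots, over all queries, whose category equals the query's."""
    hits = sum(c == q for cats, q in zip(retrieved, query_cats) for c in cats[:k])
    return hits / (k * len(query_cats))

def chance_match_rate(query_freq, catalog_freq):
    """Assumed baseline: sum over categories of P(query in c) * P(random catalogue item in c)."""
    return sum(p * catalog_freq.get(c, 0.0) for c, p in query_freq.items())

# Toy run (hypothetical data): four queries, three retrieved categories each.
query_cats = ["Cameras", "Furniture", "Toys", "Hardware"]
retrieved = [
    ["Cameras", "Cameras", "Electronics"],
    ["Furniture", "Home", "Furniture"],
    ["Office", "Toys", "Toys"],
    ["Hardware", "Electronics", "Hardware"],
]

p1 = precision_at_k(retrieved, query_cats, 1)
p3 = precision_at_k(retrieved, query_cats, 3)
baseline = chance_match_rate({"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.3, "C": 0.2})
print(f"precision@1={p1:.2f}  precision@3={p3:.2f}  toy baseline={baseline:.2f}")
```

In the toy run precision@3 comes out below precision@1 because ranks 2 and 3 drift into neighbouring categories, the same pattern the notebook reports for the real catalogue.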
scripts/04_eda.py ADDED Viewed

	@@ -0,0 +1,100 @@

+"""Expanded EDA for the image product catalogue. Generates richer plots into artifacts/ and
+space/assets/ (sample image grid, image dimensions, colour/background, brands), plus a JSON summary.
+Run from work/ with the venv active."""
+import os, io, json, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; ASSETS="../space/assets"
+os.makedirs(ART, exist_ok=True); os.makedirs(ASSETS, exist_ok=True)
+from datasets import load_dataset
+from PIL import Image
+print("loading dataset (cached)...")
+ds = load_dataset("Shopify/product-catalogue", split="train")
+N = len(ds)
+topcat = lambda c:(c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
+cats = [topcat(c) for c in ds["ground_truth_category"]]
+brands = ds["ground_truth_brand"]
+titles = ds["product_title"]
+sh = ds["ground_truth_is_secondhand"]
+df = pd.DataFrame({"top_category":cats,"brand":brands,"title":titles,"secondhand":sh})
+summary={"n_rows":int(N)}
+# ---- 1) sample image grid: one product per top-category ----
+order = df["top_category"].value_counts()
+cats_sorted = [c for c in order.index if c!="Unknown"][:18]
+rng = random.Random(SEED)
+fig, axes = plt.subplots(3, 6, figsize=(15, 8))
+for ax, cat in zip(axes.ravel(), cats_sorted):
+    idxs = df.index[df["top_category"]==cat].tolist()
+    pick = rng.choice(idxs)
+    try:
+        im = ds[int(pick)]["product_image"].convert("RGB")
+        ax.imshow(im)
+    except Exception:
+        pass
+    ax.set_title(cat[:22], fontsize=8); ax.axis("off")
+for ax in axes.ravel()[len(cats_sorted):]: ax.axis("off")
+fig.suptitle("Sample product image per top-level category", fontsize=13)
+plt.tight_layout(); plt.savefig(f"{ART}/eda_sample_grid.png"); plt.savefig(f"{ASSETS}/eda_sample_grid.png"); plt.close()
+print("sample grid done")
+# ---- 2) image dimensions + 3) colour/background, from a 3000-image sample ----
+samp = rng.sample(range(N), 3000)
+W=[]; H=[]; gray=0; whitebg=0; bright=[]
+for i in samp:
+    im = ds[int(i)]["product_image"]
+    w,h = im.size; W.append(w); H.append(h)
+    rgb = im.convert("RGB"); a = np.asarray(rgb.resize((32,32)), dtype="float32")
+    bright.append(float(a.mean()))
+    # grayscale if R==G==B almost everywhere
+    if np.abs(a[...,0]-a[...,1]).mean()<3 and np.abs(a[...,1]-a[...,2]).mean()<3: gray+=1
+    # white background if the 1px border of the 32x32 thumbnail is near-white (mean > 235)
+    border = np.concatenate([a[0,:,:],a[-1,:,:],a[:,0,:],a[:,-1,:]])
+    if border.mean()>235: whitebg+=1
+W=np.array(W); H=np.array(H); ar=W/np.maximum(H,1)
+summary.update({"sample_for_image_stats":len(samp),
+                "width_median":float(np.median(W)),"height_median":float(np.median(H)),
+                "grayscale_pct":round(100*gray/len(samp),1),
+                "white_background_pct":round(100*whitebg/len(samp),1),
+                "brightness_median":round(float(np.median(bright)),1)})
+fig, ax = plt.subplots(1,3, figsize=(14,4))
+ax[0].hist(np.clip(W,0,1500),bins=40,color="#4C72B0"); ax[0].set_title("Image width (px)")
+ax[1].hist(np.clip(H,0,1500),bins=40,color="#55A868"); ax[1].set_title("Image height (px)")
+ax[2].hist(np.clip(ar,0,3),bins=40,color="#C44E52"); ax[2].set_title("Aspect ratio (w/h)")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_image_dims.png"); plt.savefig(f"{ASSETS}/eda_image_dims.png"); plt.close()
+fig, ax = plt.subplots(1,2, figsize=(11,4))
+ax[0].hist(bright,bins=40,color="#8172B3"); ax[0].axvline(np.median(bright),ls="--",color="k")
+ax[0].set_title(f"Mean image brightness (0-255)\nmedian={np.median(bright):.0f}")
+ax[1].bar(["white\nbackground","grayscale","colour\nphoto"],
+          [summary["white_background_pct"], summary["grayscale_pct"], 100-summary["grayscale_pct"]],
+          color=["#CCCCCC","#888888","#4C72B0"])
+ax[1].set_ylabel("% of sampled images"); ax[1].set_title("Image composition")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_image_color.png"); plt.savefig(f"{ASSETS}/eda_image_color.png"); plt.close()
+print("image stats done:", {k:summary[k] for k in ["grayscale_pct","white_background_pct","brightness_median"]})
+# ---- 4) top brands ----
+top_brands = df["brand"].replace("", np.nan).dropna().value_counts().head(15)
+fig, ax = plt.subplots(figsize=(9,5))
+top_brands.sort_values().plot.barh(ax=ax, color="#937860")
+ax.set_title("Top 15 brands by product count"); ax.set_xlabel("products")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_brands.png"); plt.savefig(f"{ASSETS}/eda_brands.png"); plt.close()
+# ---- 5) metadata summary ----
+summary.update({
+    "n_top_categories": int(df["top_category"].nunique()),
+    "n_brands": int(df["brand"].replace("",np.nan).dropna().nunique()),
+    "secondhand_pct": round(100*float(pd.Series(sh).astype(bool).mean()),1),
+    "missing_title_pct": round(100*float((pd.Series(titles).isna()|(pd.Series(titles).astype(str).str.len()==0)).mean()),1),
+    "missing_brand_pct": round(100*float((pd.Series(brands).isna()|(pd.Series(brands).astype(str).str.len()==0)).mean()),1),
+    "taxonomy_depth_median": int(np.median([c.count(">")+1 for c in ds["ground_truth_category"] if isinstance(c,str) and c.strip()])),
+})
+json.dump(summary, open(f"{ART}/eda_expanded.json","w"), indent=2)
+print("SUMMARY:", json.dumps(summary, indent=2))
+print("EDA EXPANDED COMPLETE")
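Note on the image-composition statistics in `04_eda.py`: the white-background and grayscale flags are simple pixel heuristics (border mean above 235, mean channel differences below 3). A dependency-free sketch of both checks on tiny synthetic images shows what they do and do not catch. Images here are nested lists of `(r, g, b)` tuples rather than the NumPy arrays the script uses, and the function names and test images are assumptions.

```python
def border_mean(pixels):
    """Mean channel value over the top row, bottom row, left and right columns.

    Corners are counted twice, mirroring a plain concatenation of the four edges.
    """
    left = [row[0] for row in pixels]
    right = [row[-1] for row in pixels]
    vals = [c for px in pixels[0] + pixels[-1] + left + right for c in px]
    return sum(vals) / len(vals)

def is_white_background(pixels, thresh=235):
    return border_mean(pixels) > thresh

def is_grayscale(pixels, tol=3):
    """True when R, G and B are nearly equal on average across the image."""
    flat = [px for row in pixels for px in row]
    rg = sum(abs(r - g) for r, g, b in flat) / len(flat)
    gb = sum(abs(g - b) for r, g, b in flat) / len(flat)
    return rg < tol and gb < tol

def make_image(size, background, product=None):
    """Synthetic square image: solid background with an optional centred product block."""
    img = [[background for _ in range(size)] for _ in range(size)]
    if product is not None:
        for y in range(size // 4, 3 * size // 4):
            for x in range(size // 4, 3 * size // 4):
                img[y][x] = product
    return img

studio = make_image(8, (255, 255, 255), product=(200, 30, 30))   # red item on white
lifestyle = make_image(8, (120, 110, 90))                        # full-bleed warm scene
mono = make_image(8, (128, 128, 128))                            # flat grey image

for name, img in [("studio", studio), ("lifestyle", lifestyle), ("mono", mono)]:
    print(name, is_white_background(img), is_grayscale(img))
```

Because the check only looks at the border, a product that fills the frame edge to edge on a white backdrop would be missed, so the reported white-background share is best read as an estimate.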

scripts/05_bonus.py ADDED Viewed

	@@ -0,0 +1,110 @@

+"""Bonus / depth analyses, reusing the saved embeddings in catalog.parquet (no re-embedding):
+  - t-SNE projection (third dimensionality-reduction method)
+  - DBSCAN as a second clustering algorithm, with k-distance eps tuning
+  - per-cluster representative image grids (what each cluster actually contains)
+  - query -> Top-3 visual montage on held-out products
+  - FAISS index benchmark vs brute force (a standard similarity-search DS tool)
+Outputs go to artifacts/ and space/assets/. Run from work/ with the venv active."""
+import os, io, json, base64, time, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+from PIL import Image
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; ASSETS="../space/assets"
+cat = pd.read_parquet("../space/catalog.parquet")
+EMB = np.array(cat["embedding"].tolist(), dtype="float32")
+KM = cat["cluster"].values
+print("catalog:", cat.shape)
+b64 = lambda t: Image.open(io.BytesIO(base64.b64decode(t))).convert("RGB")
+# ---------------- t-SNE (third projection method) ----------------
+from sklearn.decomposition import PCA
+from sklearn.manifold import TSNE
+print("t-SNE (on PCA-50)...")
+P50 = PCA(n_components=50, random_state=SEED).fit_transform(EMB)
+ts = TSNE(n_components=2, perplexity=30, init="pca", random_state=SEED).fit_transform(P50)
+plt.figure(figsize=(9,7))
+s = pd.Series(cat["top_category"]).astype(str)
+for c in sorted(s.unique()):
+    m=(s==c).values; plt.scatter(ts[m,0],ts[m,1],s=5,alpha=0.5,label=c)
+plt.title("t-SNE of CLIP embeddings, coloured by product category"); plt.xticks([]); plt.yticks([])
+plt.legend(title="category", markerscale=3, fontsize=7, ncol=2)
+plt.tight_layout(); plt.savefig(f"{ART}/tsne_category.png"); plt.savefig(f"{ASSETS}/tsne_category.png"); plt.close()
+# ---------------- DBSCAN (second clustering algorithm) ----------------
+import umap
+from sklearn.cluster import DBSCAN
+from sklearn.neighbors import NearestNeighbors
+print("UMAP for DBSCAN...")
+U = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(EMB)
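+# k-distance plot to guide the eps choice (a heuristic read of the knee, not a formal optimum).
+# kneighbors(U) counts each point as its own nearest neighbour, matching DBSCAN's min_samples,
+# which also includes the point itself.
+k=15
+nn = NearestNeighbors(n_neighbors=k).fit(U)
+kd = np.sort(nn.kneighbors(U)[0][:,-1])
+eps = float(np.percentile(kd, 92))   # heuristic: 92nd percentile, taken as the knee region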
+fig, ax = plt.subplots(1,2, figsize=(13,5))
+ax[0].plot(kd, color="#4C72B0"); ax[0].axhline(eps, ls="--", color="grey")
+ax[0].set_title(f"k-distance plot (k={k}) -> eps={eps:.2f}"); ax[0].set_xlabel("points sorted"); ax[0].set_ylabel(f"{k}-NN distance")
+db = DBSCAN(eps=eps, min_samples=k).fit(U)
+lab = db.labels_
+n_clusters = len(set(lab)) - (1 if -1 in lab else 0)
+noise = float((lab==-1).mean())
+for c in sorted(set(lab)):
+    m=lab==c; col="lightgrey" if c==-1 else None
+    ax[1].scatter(U[m,0],U[m,1],s=5,alpha=0.5,label=("noise" if c==-1 else f"c{c}"),color=col)
+ax[1].set_title(f"DBSCAN on UMAP: {n_clusters} clusters, {noise*100:.0f}% noise"); ax[1].set_xticks([]); ax[1].set_yticks([])
+ax[1].legend(fontsize=6, markerscale=2, ncol=2)
+plt.tight_layout(); plt.savefig(f"{ART}/dbscan.png"); plt.savefig(f"{ASSETS}/dbscan.png"); plt.close()
+print(f"DBSCAN: {n_clusters} clusters, noise={noise:.2f}, eps={eps:.2f}")
+# ---------------- per-cluster representative image grids ----------------
+print("per-cluster image grids...")
+ncl = int(KM.max())+1
+# Derive each cluster's label from its actual dominant categories (never hardcode -> cannot mislabel).
+def cluster_label(cl):
+    top = cat[cat["cluster"]==cl]["top_category"].value_counts().head(2).index.tolist()
+    return f"cluster {cl}: " + ", ".join(top)
+fig, axes = plt.subplots(ncl, 6, figsize=(12, 2*ncl))
+for cl in range(ncl):
+    members = np.where(KM==cl)[0]
+    centroid = EMB[members].mean(0); centroid/=np.linalg.norm(centroid)
+    nearest = members[np.argsort(-(EMB[members]@centroid))[:6]]
+    for j,idx in enumerate(nearest):
+        ax = axes[cl,j] if ncl>1 else axes[j]
+        ax.imshow(b64(cat.iloc[idx]["thumb"])); ax.axis("off")
+    (axes[cl,0] if ncl>1 else axes[0]).set_title(cluster_label(cl), loc="left", fontsize=9)  # axes is 1-D when ncl==1
+fig.suptitle("Representative products per K-Means cluster (closest to each centroid)", fontsize=12)
+plt.tight_layout(); plt.savefig(f"{ART}/cluster_examples.png"); plt.savefig(f"{ASSETS}/cluster_examples.png"); plt.close()
+# ---------------- query -> Top-3 montage on held-out products ----------------
+print("query montage (held-out)...")
+import torch
+from transformers import CLIPModel, CLIPProcessor
+m = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
+proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+@torch.no_grad()
+def ei(img):
+    inp=proc(images=img.convert("RGB"),return_tensors="pt"); v=m.vision_model(pixel_values=inp["pixel_values"])
+    f=m.visual_projection(v.pooler_output); return (f/f.norm(dim=-1,keepdim=True)).numpy()[0]
+ex_files=[f for f in ["cameras.jpg","furniture.jpg","animals.jpg","sporting.jpg","electronics.jpg"] if os.path.exists(f"../space/examples/{f}")]
+fig, axes = plt.subplots(len(ex_files), 4, figsize=(11, 2.6*len(ex_files)), squeeze=False)  # keep axes 2-D even with one example
+for r,fn in enumerate(ex_files):
+    q=Image.open(f"../space/examples/{fn}").convert("RGB"); v=ei(q)
+    sims=EMB@v; order=[]
+    for i in np.argsort(-sims):
+        if all(float(EMB[i]@EMB[j])<=0.985 for j in order): order.append(int(i))
+        if len(order)==3: break
+    axes[r,0].imshow(q); axes[r,0].set_title("query", fontsize=9); axes[r,0].axis("off")
+    for k2,idx in enumerate(order,1):
+        axes[r,k2].imshow(b64(cat.iloc[idx]["thumb"])); axes[r,k2].axis("off")
+        axes[r,k2].set_title(f"#{k2}  {sims[idx]:.2f}\n{cat.iloc[idx]['top_category'][:16]}", fontsize=8)
+fig.suptitle("Query (uploaded photo) -> Top-3 recommendations", fontsize=12)
+plt.tight_layout(); plt.savefig(f"{ART}/recommend_examples.png"); plt.savefig(f"{ASSETS}/recommend_examples.png"); plt.close()
+# NOTE: the FAISS benchmark is run separately (faiss-cpu and torch share an OpenMP runtime and
+# segfault if imported in the same process). See the notebook, which runs it in an isolated
+# subprocess, and faiss_stats.json for the recorded result.
+print("BONUS PLOTS COMPLETE")
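
The near-duplicate filter used in the query montage above (the `<=0.985` check inside the ranking loop) can be isolated and tested on toy vectors. The following is a minimal sketch under stated assumptions, not the app's code: `diverse_top_k`, `_unit`, `toy_emb` and `toy_query` are hypothetical names, and all rows are assumed to be L2-normalised so that a dot product equals cosine similarity.

```python
import numpy as np

def diverse_top_k(emb, query, k=3, dup_thresh=0.985):
    """Greedy top-k by cosine similarity, skipping near-duplicates of items already chosen.

    emb: (n, d) array of L2-normalised rows; query: (d,) L2-normalised vector.
    Returns the chosen row indices (best first) and the full similarity vector.
    """
    sims = emb @ query
    chosen = []
    for i in np.argsort(-sims):  # candidates from most to least similar
        # Keep i only if it is not a near-copy of anything already selected.
        if all(float(emb[i] @ emb[j]) <= dup_thresh for j in chosen):
            chosen.append(int(i))
        if len(chosen) == k:
            break
    return chosen, sims

def _unit(v):
    """L2-normalise a vector (helper for the toy data only)."""
    v = np.asarray(v, dtype="float32")
    return v / np.linalg.norm(v)

# Hypothetical toy catalogue: rows 0 and 1 are near-identical, rows 2 and 3 are distinct.
toy_emb = np.stack([
    _unit([1.0, 0.0, 0.0]),
    _unit([1.0, 0.001, 0.0]),
    _unit([0.7, 0.7, 0.0]),
    _unit([0.0, 0.0, 1.0]),
])
toy_query = _unit([1.0, 0.1, 0.0])

picked, toy_sims = diverse_top_k(toy_emb, toy_query, k=3)
print(picked)
```

Only one of the two near-identical rows survives, so the three slots go to visually distinct items. The cost is quadratic in `k` per query, which is negligible for a top-3 list.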