Fix-to-100: correct cluster labels, self-contained notebook embedding, honest clustering framing, consistency, PCA plot, business/ethics, build scripts as resources
- Assignment_3_NoamFuchs.ipynb +0 -0
- README.md +35 -20
- app.py +4 -3
- assets/cluster_examples.png +2 -2
- scripts/01_build.py +241 -0
- scripts/02_make_notebook.py +521 -0
- scripts/03_finalize.py +107 -0
- scripts/04_eda.py +100 -0
- scripts/05_bonus.py +110 -0
Assignment_3_NoamFuchs.ipynb
CHANGED
|
The diff for this file is too large to render.
README.md
CHANGED
|
@@ -63,10 +63,12 @@ import matplotlib.pyplot as plt
|
|
| 63 |
import torch
|
| 64 |
from datasets import load_dataset
|
| 65 |
from transformers import CLIPModel, CLIPProcessor
|
| 66 |
-
from sklearn.cluster import KMeans
|
| 67 |
from sklearn.decomposition import PCA
|
| 68 |
from sklearn.metrics import silhouette_score
|
| 69 |
-
import umap
|
| 70 |
|
| 71 |
SEED = 42
|
| 72 |
```
|
|
@@ -164,9 +166,10 @@ components** (which keeps ~57% of the variance and denoises the rest).
|
|
| 164 |
|
| 165 |

|
| 166 |
|
| 167 |
-
I tested K from 4 to 10
|
| 168 |
-
|
| 169 |
-
|
| 170 |
|
| 171 |
### 8. UMAP Projection, coloured by category
|
| 172 |
|
|
@@ -187,24 +190,28 @@ is the visual confirmation that the clusters are not arbitrary.
|
|
| 187 |
|
| 188 |

|
| 189 |
|
| 190 |
-
This heatmap (row-normalized) is
|
| 191 |
-
|
| 192 |
|
| 193 |
-
- **
|
| 194 |
-
- **
|
| 195 |
-
- **
|
| 196 |
-
- **
|
| 197 |
|
| 198 |
The clusters were found from purely visual embeddings with no access to the labels, yet they recover
|
| 199 |
human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation
|
| 200 |
work: similar-looking products really are near each other in the space.
|
| 201 |
|
| 202 |
-
### 11.
|
| 203 |
|
| 204 |
-
|
| 205 |
|
| 206 |
-
|
| 207 |
-
|
| 208 |
|
| 209 |
### 12. A second clustering algorithm: DBSCAN
|
| 210 |
|
|
@@ -260,12 +267,14 @@ category:
|
|
| 260 |
| Metric | Score | Note |
|
| 261 |
|---|---|---|
|
| 262 |
| precision@1 | **0.39** | about 3x the frequency-weighted random baseline (~0.12) |
|
| 263 |
-
| precision@3 | **0.35** | |
|
| 264 |
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
|
| 270 |
## Interactive App (the Space)
|
| 271 |
|
|
@@ -309,6 +318,12 @@ A few caveats worth flagging:
|
|
| 309 |
- **Single sample.** Numbers come from one balanced 11,912-product sample and one 80-query eval set;
|
| 310 |
a larger sweep would tighten the estimates.
|
| 311 |
|
| 312 |
## Repository Contents
|
| 313 |
|
| 314 |
| File | Description |
|
|
|
|
| 63 |
import torch
|
| 64 |
from datasets import load_dataset
|
| 65 |
from transformers import CLIPModel, CLIPProcessor
|
| 66 |
+
from sklearn.cluster import KMeans, DBSCAN
|
| 67 |
from sklearn.decomposition import PCA
|
| 68 |
+
from sklearn.manifold import TSNE
|
| 69 |
+
from sklearn.neighbors import NearestNeighbors
|
| 70 |
from sklearn.metrics import silhouette_score
|
| 71 |
+
import umap, faiss
|
| 72 |
|
| 73 |
SEED = 42
|
| 74 |
```
|
|
|
|
| 166 |
|
| 167 |

|
| 168 |
|
| 169 |
+
I tested K from 4 to 10. The silhouette is **low and nearly flat** (about 0.08 at K=4, only ~0.005
|
| 170 |
+
above K=7), so there is no sharp natural number of clusters: K-Means finds only **coarse** structure.
|
| 171 |
+
That is the white-background overlap from EDA #6 showing up numerically. I take K=4 as the coarse
|
| 172 |
+
summary and lean on the next two plots, not the silhouette, for whether the structure is real.
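The "low and nearly flat" reading can be checked numerically. The sketch below is illustrative only: it runs on synthetic, heavily overlapping blobs rather than the real CLIP embeddings, and the names `sil` and `spread` are assumptions made for this example, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

SEED = 42

# Synthetic stand-in for the embeddings: six blobs with a large spread,
# so the groups overlap heavily (assumption: this mimics the white-background overlap).
X, _ = make_blobs(n_samples=600, centers=6, n_features=16,
                  cluster_std=6.0, random_state=SEED)

# Sweep K and record the silhouette score for each value.
sil = {}
for k in range(4, 11):
    km = KMeans(n_clusters=k, n_init=5, random_state=SEED).fit(X)
    sil[k] = float(silhouette_score(X, km.labels_))

best_k = max(sil, key=sil.get)
# A small spread between the best and worst K means the curve is flat,
# i.e. the data has no sharp natural number of clusters.
spread = max(sil.values()) - min(sil.values())

for k, s in sil.items():
    print(f"K={k} silhouette={s:.4f}")
print("best K:", best_k, "| spread across K:", round(spread, 4))
```

When `spread` is small relative to the scores themselves, the argmax over K carries little information, which is why the choice of K here rests on interpretability rather than on the silhouette peak.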
|
| 173 |
|
| 174 |
### 8. UMAP Projection, coloured by category
|
| 175 |
|
|
|
|
| 190 |
|
| 191 |

|
| 192 |
|
| 193 |
+
This heatmap (row-normalized) is **diffuse rather than block-diagonal**, exactly what the low
|
| 194 |
+
silhouette predicts: categories are locally separable but globally overlapping. Even so, each cluster
|
| 195 |
+
has a clear dominant set of categories, giving four coarse **visual product families** (the dominant
|
| 196 |
+
categories are computed from the data, not assumed):
|
| 197 |
|
| 198 |
+
- **Consumables and packaged goods** (Food, Health & Beauty, Pet Supplies) - bottles, boxes, tubes.
|
| 199 |
+
- **Furnishings and soft goods** (Furniture, Apparel, Baby & Toddler) - larger lifestyle items.
|
| 200 |
+
- **Tech and hardware** (Electronics, Cameras & Optics, Hardware) - devices, tools, metal parts.
|
| 201 |
+
- **Toys, office and media** (Toys & Games, Office Supplies, Arts & Entertainment) - small, colourful items.
|
| 202 |
|
| 203 |
The clusters were found from purely visual embeddings with no access to the labels, yet they recover
|
| 204 |
human-meaningful groupings. That is exactly the property that makes nearest-neighbour recommendation
|
| 205 |
work: similar-looking products really are near each other in the space.
|
| 206 |
|
| 207 |
+
### 11. Two more projections: PCA and t-SNE
|
| 208 |
|
| 209 |
+
PCA is a fast linear view (it is also what I cluster on); t-SNE emphasises local neighbourhoods. I
|
| 210 |
+
ran both as cross-checks against UMAP, and both show the same per-category grouping, so the structure
|
| 211 |
+
is not a UMAP artefact.
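A minimal sketch of this cross-check, assuming L2-normalised embeddings in an array `emb`. The data here is synthetic and the t-SNE settings (`perplexity=30`, `init="pca"`) are illustrative defaults, not necessarily the ones used for the plots.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in for CLIP embeddings: 3 groups of 40 points in 32 dimensions.
labels = np.repeat(np.arange(3), 40)
centres = rng.normal(size=(3, 32))
emb = centres[labels] + 0.5 * rng.normal(size=(120, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise, as for cosine similarity

# PCA: fast, linear, preserves global variance directions.
pca_xy = PCA(n_components=2, random_state=SEED).fit_transform(emb)

# t-SNE: non-linear, emphasises local neighbourhoods (perplexity must be < n_samples).
tsne_xy = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=SEED).fit_transform(emb)

print("PCA:", pca_xy.shape, "| t-SNE:", tsne_xy.shape)
```

Each 2D array can then be scattered with one colour per category. If the same categories group together under all three projections, the grouping is unlikely to be an artefact of any single method.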
|
| 212 |
|
| 213 |
+

|
| 214 |
+

|
| 215 |
|
| 216 |
### 12. A second clustering algorithm: DBSCAN
|
| 217 |
|
|
|
|
| 267 |
| Metric | Score | Note |
|
| 268 |
|---|---|---|
|
| 269 |
| precision@1 | **0.39** | about 3x the frequency-weighted random baseline (~0.12) |
|
| 270 |
+
| precision@3 | **0.35** | slightly below precision@1 (explained below) |
|
| 271 |
|
| 272 |
+
precision@3 being a little lower than precision@1 is expected for a visual recommender: the single
|
| 273 |
+
nearest neighbour is the safest match, and ranks 2 and 3 start drifting into products that look alike
|
| 274 |
+
but sit in a different taxonomy category. That is also why this is a conservative proxy. Many of the
|
| 275 |
+
apparent "misses" are visually correct cross-category matches (a metal spear tip retrieving nails, a
|
| 276 |
+
scroll saw retrieving a microtome), because the model ranks by **appearance**, not by the semantic
|
| 277 |
+
taxonomy, which is the right behaviour for a "find similar-looking products" tool.
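The category-match proxy can be sketched as follows. This is a hedged illustration on toy data, not the evaluation script itself: `precision_at_k` and `random_baseline` are hypothetical helper names, and the baseline formula (the sum of squared category frequencies) is an assumption about what "frequency-weighted random baseline" means.

```python
import numpy as np

def precision_at_k(emb, labels, query_idx, k):
    """Mean fraction of the top-k neighbours that share the query's category.

    emb: (n, d) array with L2-normalised rows, so a dot product is cosine similarity.
    The query item itself is excluded from its own results.
    """
    emb = np.asarray(emb, dtype="float32")
    labels = np.asarray(labels)
    hits = []
    for q in query_idx:
        sims = emb @ emb[q]
        sims[q] = -np.inf                      # never return the query itself
        top = np.argsort(-sims)[:k]            # indices of the k most similar items
        hits.append(np.mean(labels[top] == labels[q]))
    return float(np.mean(hits))

def random_baseline(labels):
    """Chance that a random item matches a random query's category: sum of p_c^2 (assumed definition)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))

# Toy catalogue: 4 balanced categories, 25 items each.
rng = np.random.default_rng(42)
labels = np.repeat(np.arange(4), 25)
centres = rng.normal(size=(4, 16))
emb = centres[labels] + 0.8 * rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

queries = rng.choice(100, size=20, replace=False)
p1 = precision_at_k(emb, labels, queries, k=1)
p3 = precision_at_k(emb, labels, queries, k=3)
baseline = random_baseline(labels)
print(f"precision@1={p1:.2f} precision@3={p3:.2f} baseline={baseline:.2f}")
```

Because a hit requires an exact top-category match, a visually correct neighbour from another category counts as a miss, which is what makes the proxy conservative.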
|
| 278 |
|
| 279 |
## Interactive App (the Space)
|
| 280 |
|
|
|
|
| 318 |
- **Single sample.** Numbers come from one balanced 11,912-product sample and one 80-query eval set;
|
| 319 |
a larger sweep would tighten the estimates.
|
| 320 |
|
| 321 |
+
**What I would do next.** A larger or domain-tuned CLIP would help the weak text queries; a
|
| 322 |
+
single-domain catalogue (rather than a mixed marketplace) would reduce the white-background overlap
|
| 323 |
+
and sharpen the clusters; and real human relevance judgements would replace the conservative
|
| 324 |
+
category-match proxy. If I had set out to maximise the clustering metric I would also have filtered to
|
| 325 |
+
fewer, more visually distinct categories, but I kept the full mix because it is the honest, harder case.
|
| 326 |
+
|
| 327 |
## Repository Contents
|
| 328 |
|
| 329 |
| File | Description |
|
app.py
CHANGED
|
@@ -80,7 +80,8 @@ def recommend(text, image):
|
|
| 80 |
mode = "text"
|
| 81 |
else:
|
| 82 |
return [], "Upload a product photo or type a description to get recommendations."
|
| 83 |
-
|
| 84 |
|
| 85 |
|
| 86 |
def load_plot(name):
|
|
@@ -123,8 +124,8 @@ with gr.Blocks(title="Visual Product Recommender", theme=gr.themes.Soft()) as de
|
|
| 123 |
with gr.Column(scale=2):
|
| 124 |
gallery = gr.Gallery(label="Recommendations", columns=3, height=400, object_fit="contain")
|
| 125 |
gr.Markdown(
|
| 126 |
-
"*
|
| 127 |
-
"image, clear
|
| 128 |
)
|
| 129 |
btn.click(recommend, [text_in, img_in], [gallery, note])
|
| 130 |
text_in.submit(recommend, [text_in, img_in], [gallery, note])
|
|
|
|
| 80 |
mode = "text"
|
| 81 |
else:
|
| 82 |
return [], "Upload a product photo or type a description to get recommendations."
|
| 83 |
+
res = top_matches(qvec, k=3)
|
| 84 |
+
return render(res), f"Top 3 products most similar to your {mode} query (each result shows its cosine similarity)."
|
| 85 |
|
| 86 |
|
| 87 |
def load_plot(name):
|
|
|
|
| 124 |
with gr.Column(scale=2):
|
| 125 |
gallery = gr.Gallery(label="Recommendations", columns=3, height=400, object_fit="contain")
|
| 126 |
gr.Markdown(
|
| 127 |
+
"*Image upload is the reliable mode; text search is best-effort and depends on the item existing in "
|
| 128 |
+
"the catalogue. An uploaded image takes priority over the text box, so clear it to run a text query.*"
|
| 129 |
)
|
| 130 |
btn.click(recommend, [text_in, img_in], [gallery, note])
|
| 131 |
text_in.submit(recommend, [text_in, img_in], [gallery, note])
|
assets/cluster_examples.png
CHANGED
scripts/01_build.py
ADDED
|
@@ -0,0 +1,241 @@
| 1 |
+
"""
|
| 2 |
+
Assignment 3 - Product Recommender build pipeline.
|
| 3 |
+
Dataset: Shopify/product-catalogue (HF). Model: openai/clip-vit-base-patch32.
|
| 4 |
+
Produces: artifacts/ (plots, stats json) and ../space/catalog.parquet (+ a copy in artifacts).
|
| 5 |
+
Run from work/ with the venv active.
|
| 6 |
+
"""
|
| 7 |
+
import os, io, json, base64, warnings, random
|
| 8 |
+
import numpy as np
|
| 9 |
+
import pandas as pd
|
| 10 |
+
warnings.filterwarnings("ignore")
|
| 11 |
+
|
| 12 |
+
SEED = 42
|
| 13 |
+
random.seed(SEED); np.random.seed(SEED)
|
| 14 |
+
|
| 15 |
+
ART = "artifacts"
|
| 16 |
+
os.makedirs(ART, exist_ok=True)
|
| 17 |
+
SPACE = "../space"
|
| 18 |
+
os.makedirs(SPACE, exist_ok=True)
|
| 19 |
+
|
| 20 |
+
TOTAL_TARGET = 13000 # aim ~13K balanced sample (rubric: 1K-1M, preserves structure)
|
| 21 |
+
TARGET_MIN_CAT = 150 # drop tiny top-categories below this (noise)
|
| 22 |
+
THUMB = 110 # thumbnail px for storage / app display
|
| 23 |
+
|
| 24 |
+
import torch
|
| 25 |
+
DEVICE = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
|
| 26 |
+
print("device:", DEVICE)
|
| 27 |
+
|
| 28 |
+
# ---------------------------------------------------------------- load
|
| 29 |
+
from datasets import load_dataset
|
| 30 |
+
print("loading dataset (first run downloads ~couple GB, cached after)...")
|
| 31 |
+
ds = load_dataset("Shopify/product-catalogue", split="train")
|
| 32 |
+
print("rows:", len(ds), "| columns:", ds.column_names)
|
| 33 |
+
|
| 34 |
+
df = pd.DataFrame({
|
| 35 |
+
"title": ds["product_title"],
|
| 36 |
+
"description": ds["product_description"],
|
| 37 |
+
"category": ds["ground_truth_category"],
|
| 38 |
+
"brand": ds["ground_truth_brand"],
|
| 39 |
+
"secondhand": ds["ground_truth_is_secondhand"],
|
| 40 |
+
})
|
| 41 |
+
|
| 42 |
+
def top_cat(c):
|
| 43 |
+
    if not isinstance(c, str) or not c.strip():
|
| 44 |
+
        return "Unknown"
|
| 45 |
+
    return c.split(">")[0].strip()
|
| 46 |
+
|
| 47 |
+
df["top_category"] = df["category"].map(top_cat)
|
| 48 |
+
df["row_id"] = np.arange(len(df))
|
| 49 |
+
|
| 50 |
+
# ---------------------------------------------------------------- EDA (saved)
|
| 51 |
+
eda = {}
|
| 52 |
+
eda["n_rows"] = int(len(df))
|
| 53 |
+
eda["n_columns"] = len(ds.column_names)
|
| 54 |
+
eda["columns"] = ds.column_names
|
| 55 |
+
eda["n_duplicate_titles"] = int(df["title"].duplicated().sum())
|
| 56 |
+
eda["missing_per_column"] = {c: int(df[c].isna().sum() + (df[c].astype(str).str.len()==0).sum()) for c in ["title","description","category","brand"]}
|
| 57 |
+
eda["n_top_categories"] = int(df["top_category"].nunique())
|
| 58 |
+
eda["n_full_categories"] = int(df["category"].nunique())
|
| 59 |
+
eda["n_brands"] = int(df["brand"].nunique())
|
| 60 |
+
eda["secondhand_rate"] = float(np.mean(df["secondhand"].astype(bool)))
|
| 61 |
+
topcat_counts = df["top_category"].value_counts()
|
| 62 |
+
eda["top_category_counts"] = topcat_counts.to_dict()
|
| 63 |
+
df["title_len"] = df["title"].astype(str).str.len()
|
| 64 |
+
df["desc_len"] = df["description"].astype(str).str.len()
|
| 65 |
+
eda["title_len_describe"] = df["title_len"].describe().to_dict()
|
| 66 |
+
eda["desc_len_describe"] = df["desc_len"].describe().to_dict()
|
| 67 |
+
json.dump(eda, open(f"{ART}/eda_stats.json","w"), indent=2, default=str)
|
| 68 |
+
print("top categories:\n", topcat_counts.head(20))
|
| 69 |
+
|
| 70 |
+
# EDA plots
|
| 71 |
+
import matplotlib
|
| 72 |
+
matplotlib.use("Agg")
|
| 73 |
+
import matplotlib.pyplot as plt
|
| 74 |
+
plt.rcParams["figure.dpi"]=120
|
| 75 |
+
|
| 76 |
+
keep = topcat_counts[topcat_counts>=TARGET_MIN_CAT].index.tolist()
|
| 77 |
+
plot_counts = topcat_counts[topcat_counts.index.isin(keep)].head(20)
|
| 78 |
+
fig,ax=plt.subplots(figsize=(9,5))
|
| 79 |
+
plot_counts.sort_values().plot.barh(ax=ax, color="#4C72B0")
|
| 80 |
+
ax.set_title("Top-level product categories (count)"); ax.set_xlabel("products")
|
| 81 |
+
plt.tight_layout(); plt.savefig(f"{ART}/eda_categories.png"); plt.close()
|
| 82 |
+
|
| 83 |
+
fig,ax=plt.subplots(1,2,figsize=(11,4))
|
| 84 |
+
df["title_len"].clip(0,120).plot.hist(bins=40,ax=ax[0],color="#55A868"); ax[0].set_title("Title length (chars)")
|
| 85 |
+
df["desc_len"].clip(0,2000).plot.hist(bins=40,ax=ax[1],color="#C44E52"); ax[1].set_title("Description length (chars)")
|
| 86 |
+
plt.tight_layout(); plt.savefig(f"{ART}/eda_text_lengths.png"); plt.close()
|
| 87 |
+
print("EDA done.")
|
| 88 |
+
|
| 89 |
+
# ---------------------------------------------------------------- balanced sample
|
| 90 |
+
df_keep = df[df["top_category"].isin(keep)].copy()
|
| 91 |
+
PER_CAT_CAP = max(TARGET_MIN_CAT, TOTAL_TARGET // max(1,len(keep)))
|
| 92 |
+
print(f"keep categories: {len(keep)} | per-category cap: {PER_CAT_CAP}")
|
| 93 |
+
parts=[]
|
| 94 |
+
for c,g in df_keep.groupby("top_category"):
|
| 95 |
+
    parts.append(g.sample(min(len(g),PER_CAT_CAP), random_state=SEED))
|
| 96 |
+
sample = pd.concat(parts).sample(frac=1, random_state=SEED).reset_index(drop=True)
|
| 97 |
+
print("sample size:", len(sample), "| categories:", sample["top_category"].nunique())
|
| 98 |
+
|
| 99 |
+
# ---------------------------------------------------------------- CLIP embeddings
|
| 100 |
+
from transformers import CLIPModel, CLIPProcessor
|
| 101 |
+
MODEL="openai/clip-vit-base-patch32"
|
| 102 |
+
print("loading CLIP:", MODEL)
|
| 103 |
+
model=CLIPModel.from_pretrained(MODEL).to(DEVICE).eval()
|
| 104 |
+
proc=CLIPProcessor.from_pretrained(MODEL)
|
| 105 |
+
|
| 106 |
+
def pil_thumb(img, size=THUMB):
|
| 107 |
+
    img=img.convert("RGB"); img.thumbnail((size,size))
|
| 108 |
+
    buf=io.BytesIO(); img.save(buf,format="JPEG",quality=80)
|
| 109 |
+
    return base64.b64encode(buf.getvalue()).decode()
|
| 110 |
+
|
| 111 |
+
@torch.no_grad()
|
| 112 |
+
def embed_batch(imgs):
|
| 113 |
+
    inp=proc(images=imgs, return_tensors="pt").to(DEVICE)
|
| 114 |
+
    v=model.vision_model(pixel_values=inp["pixel_values"])
|
| 115 |
+
    f=model.visual_projection(v.pooler_output) # project into the shared image-text space
|
| 116 |
+
    f=f/f.norm(dim=-1,keepdim=True)
|
| 117 |
+
    return f.cpu().numpy().astype("float32")
|
| 118 |
+
|
| 119 |
+
# Memory-safe streaming: decode a batch -> thumbnail -> embed -> free.
|
| 120 |
+
# We only ever hold BATCH full images at once (the 512-d vectors + small thumbs are tiny).
|
| 121 |
+
sel_ids = sample["row_id"].tolist()
|
| 122 |
+
sub = ds.select(sel_ids)
|
| 123 |
+
BATCH=64
|
| 124 |
+
thumbs=[]; emb_chunks=[]; buf=[]
|
| 125 |
+
print("fetching + embedding images (streaming)...")
|
| 126 |
+
for i,ex in enumerate(sub):
|
| 127 |
+
    im=ex["product_image"].convert("RGB")
|
| 128 |
+
    thumbs.append(pil_thumb(im))
|
| 129 |
+
    buf.append(im)
|
| 130 |
+
    if len(buf)==BATCH:
|
| 131 |
+
        emb_chunks.append(embed_batch(buf)); buf=[]
|
| 132 |
+
    if (i+1)%2000==0: print(f" {i+1}/{len(sel_ids)} images")
|
| 133 |
+
if buf: emb_chunks.append(embed_batch(buf))
|
| 134 |
+
emb=np.vstack(emb_chunks)
|
| 135 |
+
sample["thumb"]=thumbs
|
| 136 |
+
print("embeddings:", emb.shape)
|
| 137 |
+
sample["embedding"]=[e.tolist() for e in emb]
|
| 138 |
+
|
| 139 |
+
# ---------------------------------------------------------------- clustering + projections
|
| 140 |
+
from sklearn.cluster import KMeans
|
| 141 |
+
from sklearn.metrics import silhouette_score
|
| 142 |
+
from sklearn.decomposition import PCA
|
| 143 |
+
|
| 144 |
+
print("choosing K via silhouette...")
|
| 145 |
+
sil={}
|
| 146 |
+
sample_for_sil = emb if len(emb)<=8000 else emb[np.random.choice(len(emb),8000,replace=False)]
|
| 147 |
+
Ks=list(range(4,13))
|
| 148 |
+
for K in Ks:
|
| 149 |
+
    km=KMeans(n_clusters=K,n_init=5,random_state=SEED).fit(sample_for_sil)
|
| 150 |
+
    sil[K]=float(silhouette_score(sample_for_sil, km.labels_))
|
| 151 |
+
    print(f" K={K} silhouette={sil[K]:.4f}")
|
| 152 |
+
bestK=max(sil,key=sil.get)
|
| 153 |
+
print("bestK:", bestK)
|
| 154 |
+
json.dump(sil, open(f"{ART}/silhouette.json","w"), indent=2)
|
| 155 |
+
|
| 156 |
+
fig,ax=plt.subplots(figsize=(7,4))
|
| 157 |
+
ax.plot(list(sil.keys()),list(sil.values()),"o-",color="#4C72B0")
|
| 158 |
+
ax.axvline(bestK,ls="--",color="grey"); ax.set_xlabel("K"); ax.set_ylabel("silhouette")
|
| 159 |
+
ax.set_title("K-Means model selection (silhouette)"); plt.tight_layout()
|
| 160 |
+
plt.savefig(f"{ART}/silhouette.png"); plt.close()
|
| 161 |
+
|
| 162 |
+
km=KMeans(n_clusters=bestK,n_init=10,random_state=SEED).fit(emb)
|
| 163 |
+
sample["cluster"]=km.labels_
|
| 164 |
+
|
| 165 |
+
print("PCA + UMAP projections...")
|
| 166 |
+
pca=PCA(n_components=2,random_state=SEED).fit_transform(emb)
|
| 167 |
+
import umap
|
| 168 |
+
um=umap.UMAP(n_components=2,n_neighbors=15,min_dist=0.1,random_state=SEED,metric="cosine").fit_transform(emb)
|
| 169 |
+
sample["pca_x"],sample["pca_y"]=pca[:,0],pca[:,1]
|
| 170 |
+
sample["umap_x"],sample["umap_y"]=um[:,0],um[:,1]
|
| 171 |
+
|
| 172 |
+
def scatter(xy,labels,title,fname,legend_title):
|
| 173 |
+
    fig,ax=plt.subplots(figsize=(9,7))
|
| 174 |
+
    cats=pd.Series(labels).astype(str)
|
| 175 |
+
    for c in sorted(cats.unique()):
|
| 176 |
+
        m=cats==c
|
| 177 |
+
        ax.scatter(xy[m,0],xy[m,1],s=5,alpha=0.5,label=c)
|
| 178 |
+
    ax.set_title(title); ax.set_xticks([]); ax.set_yticks([])
|
| 179 |
+
    ax.legend(title=legend_title,markerscale=3,fontsize=7,loc="best",ncol=2)
|
| 180 |
+
    plt.tight_layout(); plt.savefig(f"{ART}/{fname}"); plt.close()
|
| 181 |
+
|
| 182 |
+
scatter(um, sample["top_category"].values, "UMAP of CLIP embeddings — colored by product category", "umap_category.png","category")
|
| 183 |
+
scatter(um, sample["cluster"].values, f"UMAP of CLIP embeddings — colored by K-Means cluster (K={bestK})", "umap_cluster.png","cluster")
|
| 184 |
+
scatter(pca, sample["top_category"].values, "PCA (2D) of CLIP embeddings — colored by category", "pca_category.png","category")
|
| 185 |
+
|
| 186 |
+
# cluster vs category crosstab (reasoning)
|
| 187 |
+
ct=pd.crosstab(sample["cluster"],sample["top_category"])
|
| 188 |
+
ct_norm=ct.div(ct.sum(1),axis=0)
|
| 189 |
+
import numpy as _np
|
| 190 |
+
fig,ax=plt.subplots(figsize=(12,6))
|
| 191 |
+
im=ax.imshow(ct_norm.values,aspect="auto",cmap="viridis")
|
| 192 |
+
ax.set_xticks(range(len(ct_norm.columns))); ax.set_xticklabels(ct_norm.columns,rotation=90,fontsize=7)
|
| 193 |
+
ax.set_yticks(range(len(ct_norm.index))); ax.set_yticklabels([f"cluster {i}" for i in ct_norm.index],fontsize=8)
|
| 194 |
+
ax.set_title("Cluster composition by category (row-normalized)"); fig.colorbar(im,fraction=0.025)
|
| 195 |
+
plt.tight_layout(); plt.savefig(f"{ART}/cluster_category_heatmap.png"); plt.close()
|
| 196 |
+
|
| 197 |
+
# top category per cluster -> reasoning table
|
| 198 |
+
cluster_profile={}
|
| 199 |
+
for cl in sorted(sample["cluster"].unique()):
|
| 200 |
+
    g=sample[sample["cluster"]==cl]
|
| 201 |
+
    top=g["top_category"].value_counts().head(3)
|
| 202 |
+
    cluster_profile[int(cl)]={"size":int(len(g)),"dominant":top.to_dict(),
|
| 203 |
+
"example_titles":g["title"].head(5).tolist()}
|
| 204 |
+
json.dump(cluster_profile, open(f"{ART}/cluster_profile.json","w"), indent=2, default=str)
|
| 205 |
+
print("cluster profile saved.")
|
| 206 |
+
|
| 207 |
+
# ---------------------------------------------------------------- save catalog (embedding file)
|
| 208 |
+
catalog=sample[["row_id","title","top_category","category","brand","secondhand",
|
| 209 |
+
"cluster","thumb","embedding"]].copy()
|
| 210 |
+
catalog.to_parquet(f"{SPACE}/catalog.parquet", index=False)
|
| 211 |
+
catalog.to_parquet(f"{ART}/catalog.parquet", index=False)
|
| 212 |
+
print("saved catalog.parquet:", catalog.shape)
|
| 213 |
+
|
| 214 |
+
# ---------------------------------------------------------------- recommender + sanity test
|
| 215 |
+
embn=np.array(catalog["embedding"].tolist(),dtype="float32") # already L2-normalized
|
| 216 |
+
|
| 217 |
+
@torch.no_grad()
|
| 218 |
+
def encode_text(q):
|
| 219 |
+
    inp=proc(text=[q],return_tensors="pt",padding=True,truncation=True).to(DEVICE)
|
| 220 |
+
    t=model.text_model(input_ids=inp["input_ids"],attention_mask=inp["attention_mask"])
|
| 221 |
+
    f=model.text_projection(t.pooler_output); f=f/f.norm(dim=-1,keepdim=True)
|
| 222 |
+
    return f.cpu().numpy()[0]
|
| 223 |
+
|
| 224 |
+
def topk(qvec,k=3,dedup=0.985):
|
| 225 |
+
    sims=embn@qvec
|
| 226 |
+
    order=np.argsort(-sims)
|
| 227 |
+
    out=[]
|
| 228 |
+
    for idx in order:
|
| 229 |
+
        if any(float(embn[idx]@embn[j])>dedup for j in out): continue
|
| 230 |
+
        out.append(int(idx))
|
| 231 |
+
        if len(out)==k: break
|
| 232 |
+
    return [(i,float(sims[i])) for i in out]
|
| 233 |
+
|
| 234 |
+
print("\n--- recommender sanity (text queries) ---")
|
| 235 |
+
for q in ["a pair of running shoes","wooden kitchen table","gold necklace","baby toy"]:
|
| 236 |
+
    res=topk(encode_text(q))
|
| 237 |
+
    print(f"query: {q}")
|
| 238 |
+
    for i,s in res:
|
| 239 |
+
        print(f" {s:.3f} | {catalog.iloc[i]['top_category']:25} | {catalog.iloc[i]['title'][:55]}")
|
| 240 |
+
|
| 241 |
+
print("\nBUILD COMPLETE")
|
scripts/02_make_notebook.py
ADDED
|
@@ -0,0 +1,521 @@
| 1 |
+
"""Assemble the deliverable notebook Assignment_3_NoamFuchs.ipynb (then executed via nbconvert)."""
|
| 2 |
+
import nbformat as nbf
|
| 3 |
+
|
| 4 |
+
nb = nbf.v4.new_notebook()
|
| 5 |
+
cells = []
|
| 6 |
+
def md(t): cells.append(nbf.v4.new_markdown_cell(t))
|
| 7 |
+
def code(t): cells.append(nbf.v4.new_code_cell(t))
|
| 8 |
+
|
| 9 |
+
md("""# Assignment #3: Embeddings, RecSys, Spaces
|
| 10 |
+
## Visual Product Recommender
|
| 11 |
+
|
| 12 |
+
**Noam Fuchs**
|
| 13 |
+
|
| 14 |
+
This notebook builds a recommendation app on the **vision modality**. Given a text description or
|
| 15 |
+
a product photo, it returns the 3 most similar products from an e-commerce catalogue, using
|
| 16 |
+
**CLIP image embeddings** and **cosine similarity**.
|
| 17 |
+
|
| 18 |
+
**Pipeline:** dataset -> EDA -> CLIP embeddings -> clustering (K-Means) + 2D projection (UMAP/PCA) ->
|
| 19 |
+
save embeddings file -> cosine-similarity Top-3 recommender. The same embeddings power the Gradio
|
| 20 |
+
Space.
|
| 21 |
+
|
| 22 |
+
**Dataset:** `Shopify/product-catalogue` | **Model:** `openai/clip-vit-base-patch32`""")
|
| 23 |
+
|
| 24 |
+
md("# Part 0: Config")
|
| 25 |
+
code("""import os, io, base64, json, warnings, random
|
| 26 |
+
import numpy as np
|
| 27 |
+
import pandas as pd
|
| 28 |
+
import matplotlib.pyplot as plt
|
| 29 |
+
import torch
|
| 30 |
+
warnings.filterwarnings("ignore")
|
| 31 |
+
|
| 32 |
+
SEED = 42
|
| 33 |
+
random.seed(SEED); np.random.seed(SEED)
|
| 34 |
+
DEVICE = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
|
| 35 |
+
print("device:", DEVICE)""")
|
| 36 |
+
|
| 37 |
+
md("""# Part 1: Select a Visual Dataset
|
| 38 |
+
|
| 39 |
+
I use [`Shopify/product-catalogue`](https://huggingface.co/datasets/Shopify/product-catalogue),
|
| 40 |
+
downloaded directly from Hugging Face. It is a real e-commerce catalogue, which is a natural fit
|
| 41 |
+
for a product recommender.""")
|
| 42 |
+
code("""from datasets import load_dataset
|
| 43 |
+
ds = load_dataset("Shopify/product-catalogue", split="train")
|
| 44 |
+
print("rows:", len(ds))
|
| 45 |
+
print("columns:", ds.column_names)""")
|
| 46 |
+
|
| 47 |
+
md("""### Describe the dataset (source, size, features)
|
| 48 |
+
|
| 49 |
+
- **Source:** Hugging Face, `Shopify/product-catalogue`.
|
| 50 |
+
- **Size:** ~48K real product listings, each with an embedded product image.
|
| 51 |
+
- **Features:** `product_title`, `product_description`, `product_image`,
|
| 52 |
+
`ground_truth_category` (Google product taxonomy, e.g. *Home & Garden > Decor > Piggy Banks*),
|
| 53 |
+
`ground_truth_brand`, `ground_truth_is_secondhand`.
|
| 54 |
+
|
| 55 |
+
I derive a **top-level category** (the first segment of the taxonomy) to use later as a label for
|
| 56 |
+
the clustering analysis.""")
|
| 57 |
+
code("""df = pd.DataFrame({
|
| 58 |
+
"title": ds["product_title"],
|
| 59 |
+
"description": ds["product_description"],
|
| 60 |
+
"category": ds["ground_truth_category"],
|
| 61 |
+
"brand": ds["ground_truth_brand"],
|
| 62 |
+
"secondhand": ds["ground_truth_is_secondhand"],
|
| 63 |
+
})
|
| 64 |
+
df["top_category"] = df["category"].fillna("Unknown").map(lambda c: c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
|
| 65 |
+
df["row_id"] = np.arange(len(df))
|
| 66 |
+
df.head(3)""")
|
| 67 |
+
|
| 68 |
+
md("""# Part 2: Exploratory Data Analysis
|
| 69 |
+
|
| 70 |
+
Because this is an **image** dataset, the EDA covers both the metadata (categories, brands, text,
|
| 71 |
+
missingness) and the **images themselves** (dimensions, colour, backgrounds), which is what
|
| 72 |
+
actually drives the embeddings later.""")
|
| 73 |
+
|
| 74 |
+
md("### 2.1 Initial inspection and sanity checks")
|
| 75 |
+
code("""print("shape:", df.shape)
|
| 76 |
+
print("duplicate titles:", df["title"].duplicated().sum())
|
| 77 |
+
print("unique top-categories:", df["top_category"].nunique())
|
| 78 |
+
print("unique brands:", df["brand"].replace("", np.nan).dropna().nunique())
|
| 79 |
+
print("secondhand rate: %.3f" % df["secondhand"].astype(bool).mean())
|
| 80 |
+
miss = pd.DataFrame({
|
| 81 |
+
c: [int(df[c].isna().sum() + (df[c].astype(str).str.len()==0).sum())] for c in ["title","description","category","brand"]
|
| 82 |
+
}, index=["missing/empty"]).T
|
| 83 |
+
miss["pct"] = (100*miss["missing/empty"]/len(df)).round(2)
|
| 84 |
+
miss""")
|
| 85 |
+
md("""Titles and categories are essentially complete; only `brand` has a small gap. There are very few
|
| 86 |
+
exact-duplicate titles, so the catalogue is clean enough to embed directly.""")
|
| 87 |
+
|
| 88 |
+
md("### 2.2 Category distribution")
|
| 89 |
+
code("""topcat = df["top_category"].value_counts()
|
| 90 |
+
print(topcat.head(15))
|
| 91 |
+
fig, ax = plt.subplots(figsize=(9,5))
|
| 92 |
+
topcat.head(15).sort_values().plot.barh(ax=ax, color="#4C72B0")
|
| 93 |
+
ax.set_title("Top-level product categories (count)"); ax.set_xlabel("products")
|
| 94 |
+
plt.tight_layout(); plt.show()""")
|
| 95 |
+
md("""Categories are **imbalanced** (Home & Garden, Sporting Goods and Arts & Entertainment dominate),
|
| 96 |
+
so for the embedding analysis I take a **balanced stratified sample** so no single category drives
|
| 97 |
+
the clusters.""")
|
| 98 |
+
|
| 99 |
+
md("### 2.3 Brands and product taxonomy")
|
| 100 |
+
code("""top_brands = df["brand"].replace("", np.nan).dropna().value_counts()
|
| 101 |
+
print("brands total:", top_brands.shape[0], "| top brand share of catalogue: %.1f%%" % (100*top_brands.iloc[0]/len(df)))
|
| 102 |
+
depth = pd.Series([c.count(">")+1 for c in df["category"] if isinstance(c,str) and c.strip()])
|
| 103 |
+
print("median taxonomy depth:", int(depth.median()), "levels (e.g. A > B > C > D)")
|
| 104 |
+
fig, ax = plt.subplots(figsize=(9,5))
|
| 105 |
+
top_brands.head(15).sort_values().plot.barh(ax=ax, color="#937860")
|
| 106 |
+
ax.set_title("Top 15 brands by product count"); ax.set_xlabel("products")
|
| 107 |
+
plt.tight_layout(); plt.show()""")
|
| 108 |
+
md("""There is a **very long brand tail** (tens of thousands of brands, none dominant), which is typical
|
| 109 |
+
of a real marketplace and means brand is not a useful grouping signal: the visual content is.""")
|
| 110 |
+
|
| 111 |
+
md("### 2.4 Text fields")
|
| 112 |
+
code("""df["title_len"] = df["title"].astype(str).str.len()
|
| 113 |
+
df["desc_len"] = df["description"].astype(str).str.len()
|
| 114 |
+
fig, ax = plt.subplots(1,2, figsize=(11,4))
|
| 115 |
+
df["title_len"].clip(0,120).plot.hist(bins=40, ax=ax[0], color="#55A868"); ax[0].set_title("Title length (chars)")
|
| 116 |
+
df["desc_len"].clip(0,2000).plot.hist(bins=40, ax=ax[1], color="#C44E52"); ax[1].set_title("Description length (chars)")
|
| 117 |
+
plt.tight_layout(); plt.show()
|
| 118 |
+
df[["title_len","desc_len"]].describe().round(1)""")
|
| 119 |
+
md("Titles are short, descriptions vary widely, and many titles are **multilingual** (English, Hebrew, Japanese, Dutch, Portuguese, etc.), which is worth noting for any text-based query.")
|
| 120 |
+
|
| 121 |
+
md("### 2.5 What do the images look like? (sample grid)")
|
| 122 |
+
code("""import random as _r
|
| 123 |
+
_r.seed(SEED)
|
| 124 |
+
cats_sorted = [c for c in topcat.index if c!="Unknown"][:18]
|
| 125 |
+
fig, axes = plt.subplots(3, 6, figsize=(15, 8))
|
| 126 |
+
for ax, cat in zip(axes.ravel(), cats_sorted):
|
| 127 |
+
    pick = _r.choice(df.index[df["top_category"]==cat].tolist())
|
| 128 |
+
    ax.imshow(ds[int(pick)]["product_image"].convert("RGB")); ax.set_title(cat[:22], fontsize=8); ax.axis("off")
|
| 129 |
+
for ax in axes.ravel()[len(cats_sorted):]: ax.axis("off")
|
| 130 |
+
fig.suptitle("Sample product image per top-level category", fontsize=13)
|
| 131 |
+
plt.tight_layout(); plt.show()""")
|
| 132 |
+
|
| 133 |
+
md("### 2.6 Image properties (dimensions, colour, backgrounds)")
|
| 134 |
+
code("""_r.seed(SEED)
|
| 135 |
+
samp = _r.sample(range(len(ds)), 2000)
|
| 136 |
+
W=[]; H=[]; bright=[]; gray=0; whitebg=0
|
| 137 |
+
for i in samp:
|
| 138 |
+
    im = ds[int(i)]["product_image"]; w,h = im.size; W.append(w); H.append(h)
|
| 139 |
+
    a = np.asarray(im.convert("RGB").resize((32,32)), dtype="float32"); bright.append(a.mean())
|
| 140 |
+
    if np.abs(a[...,0]-a[...,1]).mean()<3 and np.abs(a[...,1]-a[...,2]).mean()<3: gray+=1
|
| 141 |
+
    border = np.concatenate([a[0,:,:],a[-1,:,:],a[:,0,:],a[:,-1,:]])
|
| 142 |
+
    if border.mean()>235: whitebg+=1
|
| 143 |
+
W=np.array(W); H=np.array(H); ar=W/np.maximum(H,1)
|
| 144 |
+
white_pct=100*whitebg/len(samp); gray_pct=100*gray/len(samp)
|
| 145 |
+
print(f"median size: {int(np.median(W))}x{int(np.median(H))} px | white-background: {white_pct:.0f}% | grayscale-ish: {gray_pct:.0f}% | median brightness: {np.median(bright):.0f}/255")
|
| 146 |
+
fig, ax = plt.subplots(1,3, figsize=(14,4))
|
| 147 |
+
ax[0].hist(np.clip(W,0,1500),bins=40,color="#4C72B0"); ax[0].set_title("Image width (px)")
|
| 148 |
+
ax[1].hist(np.clip(H,0,1500),bins=40,color="#55A868"); ax[1].set_title("Image height (px)")
|
| 149 |
+
ax[2].hist(np.clip(ar,0,3),bins=40,color="#C44E52"); ax[2].set_title("Aspect ratio (w/h)")
|
| 150 |
+
plt.tight_layout(); plt.show()
|
| 151 |
+
fig, ax = plt.subplots(1,2, figsize=(11,4))
|
| 152 |
+
ax[0].hist(bright,bins=40,color="#8172B3"); ax[0].axvline(np.median(bright),ls="--",color="k"); ax[0].set_title("Mean image brightness (0-255)")
|
| 153 |
+
ax[1].bar(["white\\nbackground","grayscale","colour"],[white_pct,gray_pct,100-gray_pct],color=["#CCCCCC","#888888","#4C72B0"]); ax[1].set_ylabel("% of sample"); ax[1].set_title("Image composition")
|
| 154 |
+
plt.tight_layout(); plt.show()""")
|
| 155 |
+
|
| 156 |
+
md("""**EDA takeaways.** The catalogue is large and clean (titles/categories essentially complete, few
|
| 157 |
+
duplicates). The images are mostly **square (~900x900), bright, white-background studio shots**
|
| 158 |
+
(around 70% have a near-white border). I keep coming back to this number in the clustering and
|
| 159 |
+
recommender sections: white backgrounds make products from different categories look alike, so the
|
| 160 |
+
clustering silhouette ends up modest and the recommender does best on clean single-product photos.
|
| 161 |
+
Because categories are imbalanced, I embed a **balanced stratified sample**.""")
|
| 162 |
+
|
| 163 |
+
md("""# Part 3: Embeddings
|
| 164 |
+
|
| 165 |
+
I use **CLIP** (`openai/clip-vit-base-patch32`), a small/medium model that embeds **images and
|
| 166 |
+
text into one shared 512-d space**. This shared space is what later lets the app accept *both* a
|
| 167 |
+
text query and an image query against the same catalogue vectors.""")
|
| 168 |
+
code("""from transformers import CLIPModel, CLIPProcessor
|
| 169 |
+
MODEL = "openai/clip-vit-base-patch32"
|
| 170 |
+
model = CLIPModel.from_pretrained(MODEL).to(DEVICE).eval()
|
| 171 |
+
proc = CLIPProcessor.from_pretrained(MODEL)
|
| 172 |
+
|
| 173 |
+
@torch.no_grad()
|
| 174 |
+
def embed_images(images, bs=64):
|
| 175 |
+
    out = []
|
| 176 |
+
    for k in range(0, len(images), bs):
|
| 177 |
+
        inp = proc(images=images[k:k+bs], return_tensors="pt").to(DEVICE)
|
| 178 |
+
        v = model.vision_model(pixel_values=inp["pixel_values"])
|
| 179 |
+
        f = model.visual_projection(v.pooler_output) # project into the shared image-text space
|
| 180 |
+
        out.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy())
|
| 181 |
+
    return np.vstack(out).astype("float32")
|
| 182 |
+
|
| 183 |
+
@torch.no_grad()
|
| 184 |
+
def encode_text(q):
|
| 185 |
+
    inp = proc(text=["a product photo of " + q], return_tensors="pt", padding=True, truncation=True).to(DEVICE)
|
| 186 |
+
    t = model.text_model(input_ids=inp["input_ids"], attention_mask=inp["attention_mask"])
|
| 187 |
+
    f = model.text_projection(t.pooler_output)
|
| 188 |
+
    return (f / f.norm(dim=-1, keepdim=True)).cpu().numpy()[0]
|
| 189 |
+
|
| 190 |
+
# short prompt template ("a product photo of ...") calibrates CLIP text queries to the catalogue
|
| 191 |
+
print("image encoder check:", embed_images([ds[0]["product_image"], ds[1]["product_image"]]).shape)""")
|
| 192 |
+
|
| 193 |
+
md("""### Build the balanced sample
|
| 194 |
+
Categories are imbalanced, so I keep categories with at least 150 products and cap each one, aiming
|
| 195 |
+
for about 13K total. The cap plus the dropped small categories land it at **11,912 products across 18
|
| 196 |
+
categories**, which is the balanced subset I embed.""")
|
| 197 |
+
code("""TOTAL_TARGET, MIN_PER_CAT = 13000, 150
|
| 198 |
+
keep = topcat[topcat >= MIN_PER_CAT].index.tolist()
|
| 199 |
+
PER_CAT = max(MIN_PER_CAT, TOTAL_TARGET // len(keep))
|
| 200 |
+
sample = pd.concat([g.sample(min(len(g), PER_CAT), random_state=SEED)
|
| 201 |
+
for _, g in df[df["top_category"].isin(keep)].groupby("top_category")])
|
| 202 |
+
sample = sample.sample(frac=1, random_state=SEED).reset_index(drop=True)
|
| 203 |
+
print(f"kept {len(keep)} categories, cap {PER_CAT}/category -> {len(sample)} products")""")
|
| 204 |
+
|
| 205 |
+
md("""### Embed the sample with CLIP
|
| 206 |
+
The loop below decodes images in batches, makes a thumbnail, and embeds each batch (streaming, so
|
| 207 |
+
memory stays flat). It is the real build step. It takes a few minutes, so it is gated behind a flag
|
| 208 |
+
and the analysis reloads the saved `catalog.parquet` by default; set the flag to `True` to rebuild.""")
|
| 209 |
+
code("""import base64, io
|
| 210 |
+
RECOMPUTE_EMBEDDINGS = False # True rebuilds from images (~5 min on MPS/GPU); default reloads the saved file
|
| 211 |
+
|
| 212 |
+
def make_thumb(im, size=110):
|
| 213 |
+
    im = im.convert("RGB"); im.thumbnail((size, size))
|
| 214 |
+
    b = io.BytesIO(); im.save(b, "JPEG", quality=80); return base64.b64encode(b.getvalue()).decode()
|
| 215 |
+
|
| 216 |
+
if RECOMPUTE_EMBEDDINGS:
|
| 217 |
+
    sub = ds.select(sample["row_id"].tolist())
|
| 218 |
+
    thumbs, chunks, buf = [], [], []
|
| 219 |
+
    for ex in sub:
|
| 220 |
+
        im = ex["product_image"].convert("RGB"); thumbs.append(make_thumb(im)); buf.append(im)
|
| 221 |
+
        if len(buf) == 64: chunks.append(embed_images(buf)); buf = []
|
| 222 |
+
    if buf: chunks.append(embed_images(buf))
|
| 223 |
+
    catalog = sample.copy(); catalog["thumb"] = thumbs
|
| 224 |
+
    catalog["embedding"] = [e.tolist() for e in np.vstack(chunks)]
|
| 225 |
+
    catalog.to_parquet("../space/catalog.parquet", index=False)
|
| 226 |
+
else:
|
| 227 |
+
    catalog = pd.read_parquet("../space/catalog.parquet") # embeddings computed once and saved
|
| 228 |
+
|
| 229 |
+
EMB = np.array(catalog["embedding"].tolist(), dtype="float32") # L2-normalized 512-d vectors
|
| 230 |
+
print("catalog:", catalog.shape, "| embeddings:", EMB.shape, "| sample norm check:", round(float(np.linalg.norm(EMB[0])),3))
|
| 231 |
+
catalog[["title","top_category","brand"]].head(3)""")
|
| 232 |
+
|
| 233 |
+
md("""### 3.1–3.2 Clustering (K-Means) with K chosen by silhouette
|
| 234 |
+
|
| 235 |
+
CLIP vectors are high-dimensional (512-d), where K-Means struggles. I first reduce to **50 PCA
|
| 236 |
+
components** (denoising, keeps ~57% of variance), then run K-Means and pick K by **silhouette score**.""")
|
| 237 |
+
code("""from sklearn.cluster import KMeans
|
| 238 |
+
from sklearn.metrics import silhouette_score
|
| 239 |
+
from sklearn.decomposition import PCA
|
| 240 |
+
P = PCA(n_components=50, random_state=SEED).fit_transform(EMB) # denoise before clustering
|
| 241 |
+
idx = np.random.choice(len(P), min(8000,len(P)), replace=False)
|
| 242 |
+
sil = {}
|
| 243 |
+
for K in range(4,11):
|
| 244 |
+
    km = KMeans(n_clusters=K, n_init=5, random_state=SEED).fit(P[idx])
|
| 245 |
+
    sil[K] = silhouette_score(P[idx], km.labels_)
|
| 246 |
+
bestK = max(sil, key=sil.get)
|
| 247 |
+
print("silhouette by K:", {k: round(v,3) for k,v in sil.items()})
|
| 248 |
+
print("best K:", bestK)
|
| 249 |
+
plt.figure(figsize=(7,4))
|
| 250 |
+
plt.plot(list(sil), list(sil.values()), "o-", color="#4C72B0"); plt.axvline(bestK, ls="--", color="grey")
|
| 251 |
+
plt.xlabel("K"); plt.ylabel("silhouette"); plt.title("K-Means model selection (on PCA-50)"); plt.show()""")
|
| 252 |
+
code("""km = KMeans(n_clusters=bestK, n_init=10, random_state=SEED).fit(P)
|
| 253 |
+
catalog["cluster"] = km.labels_
|
| 254 |
+
catalog["cluster"].value_counts().sort_index()""")
|
| 255 |
+
|
| 256 |
+
md("### 3.1 Project embeddings to 2D (UMAP and PCA)")
|
| 257 |
+
code("""import umap
|
| 258 |
+
pca = PCA(n_components=2, random_state=SEED).fit_transform(EMB)
|
| 259 |
+
um = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(EMB)
|
| 260 |
+
|
| 261 |
+
def scatter(xy, labels, title, legend):
|
| 262 |
+
    plt.figure(figsize=(9,7))
|
| 263 |
+
    s = pd.Series(labels).astype(str)
|
| 264 |
+
    for c in sorted(s.unique()):
|
| 265 |
+
        m = (s==c).values
|
| 266 |
+
        plt.scatter(xy[m,0], xy[m,1], s=5, alpha=0.5, label=c)
|
| 267 |
+
    plt.title(title); plt.xticks([]); plt.yticks([])
|
| 268 |
+
    plt.legend(title=legend, markerscale=3, fontsize=7, ncol=2); plt.tight_layout(); plt.show()
|
| 269 |
+
|
| 270 |
+
scatter(um, catalog["top_category"].values, "UMAP of CLIP embeddings, colored by product category", "category")
|
| 271 |
+
scatter(um, catalog["cluster"].values, f"UMAP of CLIP embeddings, colored by K-Means cluster (K={bestK})", "cluster")
|
| 272 |
+
scatter(pca, catalog["top_category"].values, "PCA (2D) of CLIP embeddings, colored by category", "category")""")
|
| 273 |
+
|
| 274 |
+
md("### 3.3 Are the clusters coherent? (cluster reasoning)")
|
| 275 |
+
code("""ct = pd.crosstab(catalog["cluster"], catalog["top_category"])
|
| 276 |
+
ct_norm = ct.div(ct.sum(1), axis=0)
|
| 277 |
+
fig, ax = plt.subplots(figsize=(12,6))
|
| 278 |
+
im = ax.imshow(ct_norm.values, aspect="auto", cmap="viridis")
|
| 279 |
+
ax.set_xticks(range(len(ct_norm.columns))); ax.set_xticklabels(ct_norm.columns, rotation=90, fontsize=7)
|
| 280 |
+
ax.set_yticks(range(len(ct_norm.index))); ax.set_yticklabels([f"cluster {i}" for i in ct_norm.index], fontsize=8)
|
| 281 |
+
ax.set_title("Cluster composition by category (row-normalized)"); fig.colorbar(im, fraction=0.025)
|
| 282 |
+
plt.tight_layout(); plt.show()""")
|
| 283 |
+
code("""# dominant category + example products per cluster
|
| 284 |
+
for cl in sorted(catalog["cluster"].unique()):
|
| 285 |
+
    g = catalog[catalog["cluster"]==cl]
|
| 286 |
+
    dom = g["top_category"].value_counts().head(2).to_dict()
|
| 287 |
+
    print(f"cluster {cl} (n={len(g)}): {dom}")
|
| 288 |
+
    for t in g["title"].head(3): print(" -", t[:60])""")
|
| 289 |
+
md("""**Reasoning.** I want to be honest about how strong this is. The silhouette is **low and almost
|
| 290 |
+
flat** across K (about 0.08 at K=4, only marginally higher than at K=7), so there is no sharp natural number of
|
| 291 |
+
clusters: K-Means only finds **coarse** structure. That is the expected consequence of the EDA
|
| 292 |
+
finding that ~70% of products are white-background studio shots, so items from different categories
|
| 293 |
+
genuinely look alike.
|
| 294 |
+
|
| 295 |
+
The evidence that the structure is nonetheless real is the **UMAP-by-category plot**, where products
|
| 296 |
+
of the same category land together, and the per-centroid image grid below. Read at the coarse level
|
| 297 |
+
K-Means gives, the four clusters map to broad visual families: consumables and packaged goods;
|
| 298 |
+
furniture and apparel; electronics, cameras and hardware; and small colourful toys/office/media
|
| 299 |
+
items (the exact dominant categories are printed above and labelled on the grid). The heatmap is
|
| 300 |
+
diffuse rather than block-diagonal, which is consistent with the low silhouette: categories are
|
| 301 |
+
locally separable but globally overlapping. That is enough for nearest-neighbour recommendation,
|
| 302 |
+
which only needs the *near* neighbours of a query to be similar, not the whole space to split cleanly.""")
|
| 303 |
+
|
| 304 |
+
md("""### 3.4 Save the embeddings file
|
| 305 |
+
|
| 306 |
+
The embeddings, metadata and thumbnails are stored in **`catalog.parquet`**, which the Gradio Space
|
| 307 |
+
loads at runtime. (The file was written by the build script; shown here for completeness.)""")
|
| 308 |
+
code("""print("embedding file: catalog.parquet")
|
| 309 |
+
print("columns:", list(catalog.columns))
|
| 310 |
+
print("rows:", len(catalog), "| embedding dim:", len(catalog['embedding'].iloc[0]))""")
|
| 311 |
+
|
| 312 |
+
md("""## Part 3.5: Going deeper (bonus)
|
| 313 |
+
|
| 314 |
+
The brief asks for at least one clustering algorithm and one projection. To be thorough I add a
|
| 315 |
+
second projection (**t-SNE**), a second clustering algorithm (**DBSCAN**), and I look at the actual
|
| 316 |
+
images inside each cluster instead of trusting the labels alone.""")
|
| 317 |
+
|
| 318 |
+
md("""### t-SNE projection (a second view of the space)
|
| 319 |
+
PCA above is a global linear view and UMAP a neighbour-graph embedding; t-SNE is an independent nonlinear method that emphasises local neighbourhoods, so it is a useful
|
| 320 |
+
cross-check that the per-category grouping is real and not a UMAP artefact.""")
|
| 321 |
+
code("""from sklearn.manifold import TSNE
|
| 322 |
+
sub = np.random.RandomState(SEED).choice(len(P), 4000, replace=False) # subsample keeps t-SNE light
|
| 323 |
+
ts = TSNE(n_components=2, perplexity=30, init="pca", random_state=SEED).fit_transform(P[sub])
|
| 324 |
+
plt.figure(figsize=(9,7))
|
| 325 |
+
s = pd.Series(catalog["top_category"].values[sub]).astype(str)
|
| 326 |
+
for c in sorted(s.unique()):
|
| 327 |
+
    mk=(s==c).values; plt.scatter(ts[mk,0], ts[mk,1], s=6, alpha=0.5, label=c)
|
| 328 |
+
plt.title("t-SNE of CLIP embeddings (4K sample), colored by category"); plt.xticks([]); plt.yticks([])
|
| 329 |
+
plt.legend(title="category", markerscale=3, fontsize=7, ncol=2); plt.tight_layout(); plt.show()""")
|
| 330 |
+
|
| 331 |
+
md("""### DBSCAN (a density-based second opinion)
|
| 332 |
+
K-Means forces every product into one of K round clusters. DBSCAN instead finds dense regions and
|
| 333 |
+
calls the rest noise, so it tells us whether the space has natural density structure. I pick `eps`
|
| 334 |
+
from a **k-distance plot** (the 92nd percentile of the sorted neighbour distances in the 2D UMAP space) rather than guessing.""")
|
| 335 |
+
code("""from sklearn.cluster import DBSCAN
|
| 336 |
+
from sklearn.neighbors import NearestNeighbors
|
| 337 |
+
k = 15
|
| 338 |
+
kd = np.sort(NearestNeighbors(n_neighbors=k).fit(um).kneighbors(um)[0][:,-1])
|
| 339 |
+
eps = float(np.percentile(kd, 92))
|
| 340 |
+
fig, ax = plt.subplots(1,2, figsize=(13,5))
|
| 341 |
+
ax[0].plot(kd, color="#4C72B0"); ax[0].axhline(eps, ls="--", color="grey")
|
| 342 |
+
ax[0].set_title(f"k-distance plot (k={k}) chooses eps={eps:.2f}"); ax[0].set_xlabel("points sorted"); ax[0].set_ylabel("distance")
|
| 343 |
+
dl = DBSCAN(eps=eps, min_samples=k).fit(um).labels_
|
| 344 |
+
n_db = len(set(dl)) - (1 if -1 in dl else 0)
|
| 345 |
+
for c in sorted(set(dl)):
|
| 346 |
+
    mk = dl==c
|
| 347 |
+
    ax[1].scatter(um[mk,0], um[mk,1], s=5, alpha=0.5, color=("lightgrey" if c==-1 else None), label=("noise" if c==-1 else None))
|
| 348 |
+
ax[1].set_title(f"DBSCAN on the UMAP space: {n_db} clusters, {100*np.mean(dl==-1):.0f}% noise"); ax[1].set_xticks([]); ax[1].set_yticks([])
|
| 349 |
+
plt.tight_layout(); plt.show()
|
| 350 |
+
print(f"DBSCAN found {n_db} dense clusters vs K-Means {bestK}; only {100*np.mean(dl==-1):.0f}% of points are noise")""")
|
| 351 |
+
md("""DBSCAN breaks the catalogue into many more, finer clusters (and almost no noise), which says the
|
| 352 |
+
space is densely packed with small visual neighbourhoods. K-Means K=4 is the coarse, interpretable
|
| 353 |
+
summary of that same structure; DBSCAN is the fine-grained view. They agree that the space is
|
| 354 |
+
well-populated and clusterable, which is the reassuring result for a recommender.""")
|
| 355 |
+
|
| 356 |
+
md("""### What is actually inside each cluster?
|
| 357 |
+
Numbers and labels are one thing; the honest test is to look at the images. For each K-Means cluster
|
| 358 |
+
I show the products closest to the centroid.""")
|
| 359 |
+
code("""import base64, io
|
| 360 |
+
from PIL import Image
|
| 361 |
+
def thumb(t): return Image.open(io.BytesIO(base64.b64decode(t))).convert("RGB")
|
| 362 |
+
# label each row by the cluster's actual two dominant categories, so the caption can never drift
|
| 363 |
+
def cluster_label(cl):
|
| 364 |
+
    top = catalog[catalog["cluster"]==cl]["top_category"].value_counts().head(2).index.tolist()
|
| 365 |
+
    return f"cluster {cl}: " + ", ".join(top)
|
| 366 |
+
fig, axes = plt.subplots(bestK, 6, figsize=(12, 2*bestK))
|
| 367 |
+
for cl in range(bestK):
|
| 368 |
+
    mem = np.where(catalog["cluster"].values==cl)[0]
|
| 369 |
+
    cen = EMB[mem].mean(0); cen /= np.linalg.norm(cen)
|
| 370 |
+
    near = mem[np.argsort(-(EMB[mem]@cen))[:6]]
|
| 371 |
+
    for j,idx in enumerate(near):
|
| 372 |
+
        axes[cl,j].imshow(thumb(catalog.iloc[idx]["thumb"])); axes[cl,j].axis("off")
|
| 373 |
+
    axes[cl,0].set_title(cluster_label(cl), loc="left", fontsize=9)
|
| 374 |
+
fig.suptitle("Representative products per cluster (closest to centroid)", fontsize=12)
|
| 375 |
+
plt.tight_layout(); plt.show()""")
|
| 376 |
+
md("""Looking at the actual products is what convinced me the space was usable: each row is visibly one
|
| 377 |
+
visual family (packaged goods, then furniture and apparel, then devices and tools, then small
|
| 378 |
+
colourful items). The grid agrees with the heatmap, so the recommender is standing on real structure.""")
|
| 379 |
+
|
| 380 |
+
md("""# Part 4: Inputs & Outputs (cosine similarity, Top-3)
|
| 381 |
+
|
| 382 |
+
The recommender encodes the user input (text or image) with the same CLIP model, computes cosine
|
| 383 |
+
similarity against every catalogue embedding (a dot product, since vectors are L2-normalized), and
|
| 384 |
+
returns the Top-3 (filtering near-duplicates). Uploading an image is the strongest mode, because
|
| 385 |
+
image-to-image avoids the small text/image gap in CLIP.""")
|
| 386 |
+
code("""def top_matches(qvec, k=3, dup=0.985):
|
| 387 |
+
    sims = EMB @ qvec
|
| 388 |
+
    order = np.argsort(-sims)
|
| 389 |
+
    chosen = []
|
| 390 |
+
    for i in order:
|
| 391 |
+
        if any(float(EMB[i] @ EMB[j]) > dup for j in chosen): continue
|
| 392 |
+
        chosen.append(int(i))
|
| 393 |
+
        if len(chosen)==k: break
|
| 394 |
+
    return [(i, float(sims[i])) for i in chosen]
|
| 395 |
+
|
| 396 |
+
def show_text_query(q):
|
| 397 |
+
    res = top_matches(encode_text(q))
|
| 398 |
+
    print("query:", q)
|
| 399 |
+
    for r,(i,s) in enumerate(res,1):
|
| 400 |
+
        print(f" #{r} sim={s:.3f} [{catalog.iloc[i]['top_category']}] {catalog.iloc[i]['title'][:55]}")
|
| 401 |
+
    return res
|
| 402 |
+
|
| 403 |
+
for q in ["camera lens","helmet","sofa","dog leash","sunglasses"]:
|
| 404 |
+
    show_text_query(q); print()""")
|
| 405 |
+
|
| 406 |
+
md("**Image query** (an uploaded product photo that is *not* in the catalogue), with the Top-3 shown as images:")
|
| 407 |
+
code("""from PIL import Image
|
| 408 |
+
query_img = Image.open("../space/examples/cameras.jpg").convert("RGB") # a held-out product photo
|
| 409 |
+
qvec = embed_images([query_img])[0]
|
| 410 |
+
res = top_matches(qvec)
|
| 411 |
+
|
| 412 |
+
fig, ax = plt.subplots(1, 4, figsize=(14,4))
|
| 413 |
+
ax[0].imshow(query_img); ax[0].set_title("QUERY (uploaded photo)", fontsize=9); ax[0].axis("off")
|
| 414 |
+
for k,(i,s) in enumerate(res,1):
|
| 415 |
+
    im = plt.imread(io.BytesIO(base64.b64decode(catalog.iloc[i]['thumb'])), format="jpeg")
|
| 416 |
+
    ax[k].imshow(im); ax[k].set_title(f"#{k} sim={s:.2f}\\n{catalog.iloc[i]['top_category']}", fontsize=9); ax[k].axis("off")
|
| 417 |
+
plt.tight_layout(); plt.show()""")
|
| 418 |
+
|
| 419 |
+
md("""### Evaluation
|
| 420 |
+
|
| 421 |
+
To quantify quality I run **image-to-image** retrieval on 80 held-out products and check how often
|
| 422 |
+
the returned products share the query's top-level category (a conservative proxy, since visually
|
| 423 |
+
similar products often sit in different taxonomy categories, e.g. a metal spear tip retrieving
|
| 424 |
+
nails).""")
|
| 425 |
+
code("""import random as _r
|
| 426 |
+
used = set(catalog["row_id"].tolist())
|
| 427 |
+
gt = [ (c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown") for c in ds["ground_truth_category"] ]
|
| 428 |
+
cand = [i for i in range(len(ds)-1,0,-1) if i not in used]
|
| 429 |
+
_r.seed(7); qrows = _r.sample(cand, 80)
|
| 430 |
+
p1 = p3 = 0
|
| 431 |
+
for qi in qrows:
|
| 432 |
+
    v = embed_images([ds[qi]["product_image"].convert("RGB")])[0]
|
| 433 |
+
    order = np.argsort(-(EMB @ v))[:3]
|
| 434 |
+
    cats = [catalog.iloc[i]["top_category"] for i in order]
|
| 435 |
+
    p1 += (cats[0] == gt[qi]); p3 += sum(c == gt[qi] for c in cats)
|
| 436 |
+
print(f"image->image category-match precision@1 = {p1/len(qrows):.2f} precision@3 = {p3/(3*len(qrows)):.2f}")
|
| 437 |
+
print("(frequency-weighted random baseline is roughly 0.10-0.13, so this is ~3x baseline)")""")
|
| 438 |
+
md("""Two things to read here. First, ~0.39 is about 3x the random baseline, so the retrieval is real,
|
| 439 |
+
not luck. Second, precision@3 is a little *lower* than precision@1, which is expected for a visual
|
| 440 |
+
recommender: the single nearest neighbour is the safest match, and ranks 2 and 3 start drifting into
|
| 441 |
+
products that look alike but sit in a different taxonomy category (the spear-tip and nails case).
|
| 442 |
+
That drift is a property of matching appearance rather than meaning, not a failure of the model.""")
|
| 443 |
+
|
| 444 |
+
md("""### A visual check: query to Top-3 (held-out photos)
|
| 445 |
+
The category metric is conservative. The clearest test is to look at real recommendations for photos
|
| 446 |
+
the catalogue never saw.""")
|
| 447 |
+
code("""ex = [f"../space/examples/{f}" for f in ["cameras.jpg","furniture.jpg","animals.jpg","sporting.jpg","electronics.jpg"]]
|
| 448 |
+
ex = [e for e in ex if os.path.exists(e)]
|
| 449 |
+
fig, axes = plt.subplots(len(ex), 4, figsize=(11, 2.6*len(ex)))
|
| 450 |
+
for r,fn in enumerate(ex):
|
| 451 |
+
    q = Image.open(fn).convert("RGB"); res = top_matches(embed_images([q])[0])
|
| 452 |
+
    axes[r,0].imshow(q); axes[r,0].set_title("query", fontsize=9); axes[r,0].axis("off")
|
| 453 |
+
    for k2,(i,s) in enumerate(res,1):
|
| 454 |
+
        axes[r,k2].imshow(thumb(catalog.iloc[i]["thumb"]))
|
| 455 |
+
        axes[r,k2].set_title(f"#{k2} {s:.2f}\\n{catalog.iloc[i]['top_category'][:16]}", fontsize=8); axes[r,k2].axis("off")
|
| 456 |
+
fig.suptitle("Query (uploaded photo) -> Top-3 recommendations", fontsize=12)
|
| 457 |
+
plt.tight_layout(); plt.show()""")
|
| 458 |
+
|
| 459 |
+
md("""### Bonus: a faster similarity backend with FAISS
|
| 460 |
+
A linear `EMB @ q` scan is fine for 12K items, but a real catalogue can hold millions. **FAISS** is the
|
| 461 |
+
standard library for fast vector search, so I index the embeddings with it and confirm it returns
|
| 462 |
+
the **same** Top-3 as the brute-force scan, much faster. This is the piece that would let the same
|
| 463 |
+
app scale.""")
|
| 464 |
+
code('''# FAISS and PyTorch can each load their own copy of the OpenMP runtime, which can crash the process, so I run the benchmark in a clean subprocess
|
| 465 |
+
# (numpy + faiss only) to keep this notebook kernel stable.
|
| 466 |
+
import subprocess, sys, textwrap
|
| 467 |
+
_bench = textwrap.dedent("""
|
| 468 |
+
import numpy as np, pandas as pd, faiss, time
|
| 469 |
+
E = np.ascontiguousarray(np.array(pd.read_parquet('../space/catalog.parquet')['embedding'].tolist(), dtype='float32'))
|
| 470 |
+
idx = faiss.IndexFlatIP(E.shape[1]); idx.add(E) # inner product == cosine on normalized vectors
|
| 471 |
+
rng = np.random.RandomState(0); Q = np.ascontiguousarray(E[rng.choice(len(E), 500, replace=False)])
|
| 472 |
+
t0=time.time(); _, I = idx.search(Q, 4); ft=time.time()-t0
|
| 473 |
+
t0=time.time()
|
| 474 |
+
for q in Q: np.argsort(-(E@q))[:4]
|
| 475 |
+
bt=time.time()-t0
|
| 476 |
+
ag = np.mean([len(set(I[i,1:4].tolist()) & set(np.argsort(-(E@Q[i]))[1:4].tolist()))/3 for i in range(len(Q))])
|
| 477 |
+
print(f"FAISS {ft*1000:.0f} ms vs brute force {bt*1000:.0f} ms on 500 queries ({bt/ft:.0f}x faster)")
|
| 478 |
+
print(f"Top-3 agreement with brute force: {ag:.0%} (FAISS is exact here, just faster)")
|
| 479 |
+
""")
|
| 480 |
+
print(subprocess.run([sys.executable, "-c", _bench], capture_output=True, text=True).stdout)''')
|
| 481 |
+
|
| 482 |
+
md("""### Business and ethical considerations (bonus)
|
| 483 |
+
|
| 484 |
+
**Business value.** Visual similarity search is the engine behind "shop the look" and "more like
|
| 485 |
+
this" features in e-commerce. It needs no manual tagging (it runs on the product image alone), works
|
| 486 |
+
across languages (useful here, since titles are multilingual), and helps with cold-start items that
|
| 487 |
+
have no clicks yet. The same `catalog.parquet` + FAISS setup would power related-item carousels or a
|
| 488 |
+
visual search bar.
|
| 489 |
+
|
| 490 |
+
**Limits and ethics.**
|
| 491 |
+
- **Visual, not semantic.** The model matches appearance, so it can pair items that look alike but
|
| 492 |
+
serve different purposes. For shopping that is usually fine, but it should not be used where the
|
| 493 |
+
*function* matters (for example medical or safety products).
|
| 494 |
+
- **Representation bias.** CLIP was trained on web images and reflects their biases; a product photo
|
| 495 |
+
in an unusual style or from an under-represented region may embed poorly and be under-recommended.
|
| 496 |
+
- **Catalogue gaps.** Recommendations can only ever point inside the catalogue, so sparse categories
|
| 497 |
+
(few necklaces, few mugs here) give weak results regardless of the model.
|
| 498 |
+
|
| 499 |
+
**What I would improve next.** A larger or domain-tuned CLIP for the weak text queries, a
|
| 500 |
+
single-domain catalogue (the white-background overlap hurts cross-category separation), and proper
|
| 501 |
+
human relevance judgements instead of the category-match proxy.""")
|
| 502 |
+
|
| 503 |
+
md("""# Part 5 & 7: Space + Submission
|
| 504 |
+
|
| 505 |
+
The same `catalog.parquet` and CLIP model power a **Gradio app deployed on Hugging Face Spaces**.
|
| 506 |
+
The app takes an uploaded product photo or a text description and returns the Top-3 most similar
|
| 507 |
+
products.
|
| 508 |
+
|
| 509 |
+
**Live Space:** https://huggingface.co/spaces/Noam12345/visual-product-recommender
|
| 510 |
+
|
| 511 |
+
## Conclusion
|
| 512 |
+
CLIP gives a single shared space for images and text. Clustering on PCA-reduced embeddings recovered
|
| 513 |
+
four coarse but interpretable visual product families (with a low silhouette of about 0.08), which suggests the space is meaningful. Cosine similarity
|
| 514 |
+
over those embeddings produces relevant Top-3 recommendations, strongest for image-to-image queries
|
| 515 |
+
(precision@1 about 3x the random baseline), and the whole pipeline is served live in the Gradio
|
| 516 |
+
Space.""")
|
| 517 |
+
|
| 518 |
+
nb["cells"] = cells
|
| 519 |
+
nb["metadata"]["kernelspec"] = {"name":"python3","display_name":"Python 3","language":"python"}
|
| 520 |
+
nbf.write(nb, "Assignment_3_NoamFuchs.ipynb")
|
| 521 |
+
print("notebook written")
|
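The evaluation cell in the notebook script above quotes a frequency-weighted random baseline without showing how such a number could be obtained. One way to estimate it, assuming results were drawn uniformly at random from the catalogue, is the sum over categories of P(query is in category c) times P(catalogue item is in category c). The sketch below is illustrative only: `random_baseline` and the toy label lists are hypothetical and not part of this commit.

```python
from collections import Counter

def random_baseline(query_cats, catalog_cats):
    """Expected category-match rate if each result were a uniformly random catalogue item.

    Hypothetical helper for illustration: sums P(query in c) * P(catalogue item in c).
    """
    q = Counter(query_cats)
    c = Counter(catalog_cats)
    nq, nc = len(query_cats), len(catalog_cats)
    # Counter returns 0 for categories missing from the catalogue, so they contribute nothing
    return sum((q[k] / nq) * (c[k] / nc) for k in q)

# Made-up labels, only to exercise the formula
catalog_cats = ["Home"] * 50 + ["Sports"] * 30 + ["Toys"] * 20
query_cats = ["Home"] * 6 + ["Sports"] * 3 + ["Toys"] * 1
baseline = random_baseline(query_cats, catalog_cats)
print(f"random baseline: {baseline:.3f}")
```

With the real data, `query_cats` would be the `gt` labels of the 80 held-out rows and `catalog_cats` the `top_category` column of `catalog`. For a perfectly balanced catalogue of n categories that covers every query category, the value reduces to 1/n.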
scripts/03_finalize.py
ADDED
|
@@ -0,0 +1,107 @@
|
+"""Finalize clustering (PCA-50 + K-Means) and evaluate the recommender, reusing the saved
+embeddings in catalog.parquet (no re-embedding). Regenerates the README/notebook artifacts and
+updates the catalog's cluster column. Run from work/ with the venv active."""
+import os, json, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; SPACE="../space"
+
+cat = pd.read_parquet(f"{SPACE}/catalog.parquet")
+E = np.array(cat["embedding"].tolist(), dtype="float32")
+print("catalog:", cat.shape, "emb:", E.shape)
+
+from sklearn.decomposition import PCA
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+
+# PCA-50 denoising before clustering (high-dim CLIP vectors cluster better after PCA)
+pca50 = PCA(n_components=50, random_state=SEED).fit(E)
+P = pca50.transform(E)
+print("PCA-50 explained variance:", round(float(pca50.explained_variance_ratio_.sum()),3))
+
+idx = np.random.choice(len(P), min(8000,len(P)), replace=False)
+sil = {}
+for K in range(4,11):
+    km = KMeans(K, n_init=5, random_state=SEED).fit(P[idx])
+    sil[K] = float(silhouette_score(P[idx], km.labels_))
+bestK = max(sil, key=sil.get)
+print("silhouette:", {k:round(v,3) for k,v in sil.items()}, "-> bestK", bestK)
+json.dump(sil, open(f"{ART}/silhouette.json","w"), indent=2)
+
+plt.figure(figsize=(7,4))
+plt.plot(list(sil),list(sil.values()),"o-",color="#4C72B0"); plt.axvline(bestK,ls="--",color="grey")
+plt.xlabel("K"); plt.ylabel("silhouette score"); plt.title("K-Means model selection (on PCA-50)")
+plt.tight_layout(); plt.savefig(f"{ART}/silhouette.png"); plt.close()
+
+km = KMeans(bestK, n_init=10, random_state=SEED).fit(P)
+cat["cluster"] = km.labels_
+
+# 2D projections (UMAP on full embeddings, PCA-2 for a linear view)
+import umap
+um = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(E)
+pca2 = PCA(n_components=2, random_state=SEED).fit_transform(E)
+
+def scatter(xy, labels, title, fname, legend):
+    plt.figure(figsize=(9,7))
+    s = pd.Series(labels).astype(str)
+    for c in sorted(s.unique(), key=lambda x:(len(x),x)):
+        m=(s==c).values; plt.scatter(xy[m,0],xy[m,1],s=5,alpha=0.5,label=c)
+    plt.title(title); plt.xticks([]); plt.yticks([])
+    plt.legend(title=legend, markerscale=3, fontsize=7, ncol=2, loc="best")
+    plt.tight_layout(); plt.savefig(f"{ART}/{fname}"); plt.close()
+
+scatter(um, cat["top_category"].values, "UMAP of CLIP embeddings — colored by product category", "umap_category.png", "category")
+scatter(um, cat["cluster"].values, f"UMAP of CLIP embeddings — colored by K-Means cluster (K={bestK})", "umap_cluster.png", "cluster")
+scatter(pca2, cat["top_category"].values, "PCA (2D) of CLIP embeddings — colored by category", "pca_category.png", "category")
+
+ct = pd.crosstab(cat["cluster"], cat["top_category"]); ctn = ct.div(ct.sum(1),axis=0)
+plt.figure(figsize=(12,6)); plt.imshow(ctn.values, aspect="auto", cmap="viridis")
+plt.xticks(range(len(ctn.columns)), ctn.columns, rotation=90, fontsize=7)
+plt.yticks(range(len(ctn.index)), [f"cluster {i}" for i in ctn.index], fontsize=8)
+plt.title("Cluster composition by category (row-normalized)"); plt.colorbar(fraction=0.025)
+plt.tight_layout(); plt.savefig(f"{ART}/cluster_category_heatmap.png"); plt.close()
+
+profile={}
+print("\n=== cluster profiles ===")
+for cl in sorted(cat["cluster"].unique()):
+    g=cat[cat["cluster"]==cl]; dom=g["top_category"].value_counts().head(3)
+    profile[int(cl)]={"size":int(len(g)),"dominant":dom.to_dict(),"examples":g["title"].head(4).tolist()}
+    print(f"cluster {cl} (n={len(g)}): {dom.to_dict()}")
+json.dump(profile, open(f"{ART}/cluster_profile.json","w"), indent=2, default=str)
+
+cat.to_parquet(f"{SPACE}/catalog.parquet", index=False)
+cat.to_parquet(f"{ART}/catalog.parquet", index=False)
+print("updated catalog.parquet with PCA-50 clusters")
+
+# ---------------- recommender evaluation (image->image, held-out queries) ----------------
+import torch
+from datasets import load_dataset
+from transformers import CLIPModel, CLIPProcessor
+m=CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
+p=CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+ds=load_dataset("Shopify/product-catalogue", split="train")
+topcat=lambda c:(c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
+gt=[topcat(c) for c in ds["ground_truth_category"]]
+
+@torch.no_grad()
+def ei(img):
+    i=p(images=img.convert("RGB"),return_tensors="pt"); v=m.vision_model(pixel_values=i["pixel_values"])
+    f=m.visual_projection(v.pooler_output); return (f/f.norm(dim=-1,keepdim=True)).numpy()[0]
+
+used=set(cat["row_id"].tolist())
+cand=[i for i in range(len(ds)-1,0,-1) if i not in used]
+random.seed(7); queries=random.sample(cand, 80)
+p1=p3=tot3=0
+for qi in queries:
+    v=ei(ds[qi]["product_image"]); qc=gt[qi]
+    order=np.argsort(-(E@v))[:3]
+    cats=[cat.iloc[i]["top_category"] for i in order]
+    p1 += (cats[0]==qc); p3 += sum(c==qc for c in cats); tot3 += 3
+prec1=p1/len(queries); prec3=p3/tot3
+print(f"\nimage->image category-match: precision@1={prec1:.2f} precision@3={prec3:.2f} (n={len(queries)})")
+json.dump({"precision_at_1":prec1,"precision_at_3":prec3,"n_queries":len(queries)},
+          open(f"{ART}/eval.json","w"), indent=2)
+print("FINALIZE COMPLETE")
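The evaluation at the end of `scripts/03_finalize.py` scores the recommender by category match: for each held-out query it takes the three nearest catalogue items by dot product of unit-norm embeddings, then reports precision@1 and a pooled precision@3 (hits over `3 * n_queries`). The sketch below reproduces that bookkeeping on toy data so the metric can be checked by hand. It is an illustration, not the script's code: `top_k`, `precision_at_k`, and the toy vectors and category names are hypothetical, and it is written in plain Python so it runs without the ML stack.

```python
# Minimal sketch of the category-match precision@k used in the evaluation.
# All names and the toy data below are illustrative assumptions.

def top_k(catalog, query, k):
    """Indices of the k catalogue vectors with the highest dot product to the query.
    With unit-norm vectors the dot product equals cosine similarity."""
    sims = [sum(a * b for a, b in zip(row, query)) for row in catalog]
    return sorted(range(len(catalog)), key=lambda i: -sims[i])[:k]

def precision_at_k(catalog, catalog_cats, queries, query_cats, k):
    """Pooled precision: matching recommendations divided by all recommendations."""
    hits = total = 0
    for q, qc in zip(queries, query_cats):
        for i in top_k(catalog, q, k):
            hits += catalog_cats[i] == qc
            total += 1
    return hits / total

# Toy catalogue of unit-length 2-D "embeddings" with their top-level categories.
catalog = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
catalog_cats = ["Cameras", "Cameras", "Furniture", "Apparel"]

# Two held-out queries with known categories.
queries = [[1.0, 0.0], [0.0, 1.0]]
query_cats = ["Cameras", "Furniture"]

p_at_1 = precision_at_k(catalog, catalog_cats, queries, query_cats, k=1)
p_at_3 = precision_at_k(catalog, catalog_cats, queries, query_cats, k=3)
print(f"precision@1={p_at_1:.2f} precision@3={p_at_3:.2f}")
```

Because the pooled form divides by every recommendation shown, a query whose category is rare in the catalogue can pull precision@3 down even when its top result is correct, which is worth keeping in mind when reading the two numbers side by side.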
scripts/04_eda.py
ADDED
|
@@ -0,0 +1,100 @@
+"""Expanded EDA for the image product catalogue. Generates richer plots into artifacts/ and
+space/assets/ (sample image grid, image dimensions, colour/background, brands), plus a JSON summary.
+Run from work/ with the venv active."""
+import os, io, json, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; ASSETS="../space/assets"
+os.makedirs(ART, exist_ok=True); os.makedirs(ASSETS, exist_ok=True)
+
+from datasets import load_dataset
+from PIL import Image
+print("loading dataset (cached)...")
+ds = load_dataset("Shopify/product-catalogue", split="train")
+N = len(ds)
+topcat = lambda c:(c.split(">")[0].strip() if isinstance(c,str) and c.strip() else "Unknown")
+cats = [topcat(c) for c in ds["ground_truth_category"]]
+brands = ds["ground_truth_brand"]
+titles = ds["product_title"]
+sh = ds["ground_truth_is_secondhand"]
+df = pd.DataFrame({"top_category":cats,"brand":brands,"title":titles,"secondhand":sh})
+
+summary={"n_rows":int(N)}
+
+# ---- 1) sample image grid: one product per top-category ----
+order = df["top_category"].value_counts()
+cats_sorted = [c for c in order.index if c!="Unknown"][:18]
+rng = random.Random(SEED)
+fig, axes = plt.subplots(3, 6, figsize=(15, 8))
+for ax, cat in zip(axes.ravel(), cats_sorted):
+    idxs = df.index[df["top_category"]==cat].tolist()
+    pick = rng.choice(idxs)
+    try:
+        im = ds[int(pick)]["product_image"].convert("RGB")
+        ax.imshow(im)
+    except Exception:
+        pass
+    ax.set_title(cat[:22], fontsize=8); ax.axis("off")
+for ax in axes.ravel()[len(cats_sorted):]: ax.axis("off")
+fig.suptitle("Sample product image per top-level category", fontsize=13)
+plt.tight_layout(); plt.savefig(f"{ART}/eda_sample_grid.png"); plt.savefig(f"{ASSETS}/eda_sample_grid.png"); plt.close()
+print("sample grid done")
+
+# ---- 2) image dimensions + 3) colour/background, from a 3000-image sample ----
+samp = rng.sample(range(N), 3000)
+W=[]; H=[]; gray=0; whitebg=0; bright=[]
+for i in samp:
+    im = ds[int(i)]["product_image"]
+    w,h = im.size; W.append(w); H.append(h)
+    rgb = im.convert("RGB"); a = np.asarray(rgb.resize((32,32)), dtype="float32")
+    bright.append(float(a.mean()))
+    # grayscale if R==G==B almost everywhere
+    if np.abs(a[...,0]-a[...,1]).mean()<3 and np.abs(a[...,1]-a[...,2]).mean()<3: gray+=1
+    # white background if the 1px border is near-white
+    border = np.concatenate([a[0,:,:],a[-1,:,:],a[:,0,:],a[:,-1,:]])
+    if border.mean()>235: whitebg+=1
+W=np.array(W); H=np.array(H); ar=W/np.maximum(H,1)
+summary.update({"sample_for_image_stats":len(samp),
+                "width_median":float(np.median(W)),"height_median":float(np.median(H)),
+                "grayscale_pct":round(100*gray/len(samp),1),
+                "white_background_pct":round(100*whitebg/len(samp),1),
+                "brightness_median":round(float(np.median(bright)),1)})
+
+fig, ax = plt.subplots(1,3, figsize=(14,4))
+ax[0].hist(np.clip(W,0,1500),bins=40,color="#4C72B0"); ax[0].set_title("Image width (px)")
+ax[1].hist(np.clip(H,0,1500),bins=40,color="#55A868"); ax[1].set_title("Image height (px)")
+ax[2].hist(np.clip(ar,0,3),bins=40,color="#C44E52"); ax[2].set_title("Aspect ratio (w/h)")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_image_dims.png"); plt.savefig(f"{ASSETS}/eda_image_dims.png"); plt.close()
+
+fig, ax = plt.subplots(1,2, figsize=(11,4))
+ax[0].hist(bright,bins=40,color="#8172B3"); ax[0].axvline(np.median(bright),ls="--",color="k")
+ax[0].set_title(f"Mean image brightness (0-255)\nmedian={np.median(bright):.0f}")
+ax[1].bar(["white\nbackground","grayscale","colour\nphoto"],
+          [summary["white_background_pct"], summary["grayscale_pct"], 100-summary["grayscale_pct"]],
+          color=["#CCCCCC","#888888","#4C72B0"])
+ax[1].set_ylabel("% of sampled images"); ax[1].set_title("Image composition")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_image_color.png"); plt.savefig(f"{ASSETS}/eda_image_color.png"); plt.close()
+print("image stats done:", {k:summary[k] for k in ["grayscale_pct","white_background_pct","brightness_median"]})
+
+# ---- 4) top brands ----
+top_brands = df["brand"].replace("", np.nan).dropna().value_counts().head(15)
+fig, ax = plt.subplots(figsize=(9,5))
+top_brands.sort_values().plot.barh(ax=ax, color="#937860")
+ax.set_title("Top 15 brands by product count"); ax.set_xlabel("products")
+plt.tight_layout(); plt.savefig(f"{ART}/eda_brands.png"); plt.savefig(f"{ASSETS}/eda_brands.png"); plt.close()
+
+# ---- 5) metadata summary ----
+summary.update({
+    "n_top_categories": int(df["top_category"].nunique()),
+    "n_brands": int(df["brand"].replace("",np.nan).dropna().nunique()),
+    "secondhand_pct": round(100*float(pd.Series(sh).astype(bool).mean()),1),
+    "missing_title_pct": round(100*float((pd.Series(titles).isna()|(pd.Series(titles).astype(str).str.len()==0)).mean()),1),
+    "missing_brand_pct": round(100*float((pd.Series(brands).isna()|(pd.Series(brands).astype(str).str.len()==0)).mean()),1),
+    "taxonomy_depth_median": int(np.median([c.count(">")+1 for c in ds["ground_truth_category"] if isinstance(c,str) and c.strip()])),
+})
+json.dump(summary, open(f"{ART}/eda_expanded.json","w"), indent=2)
+print("SUMMARY:", json.dumps(summary, indent=2))
+print("EDA EXPANDED COMPLETE")
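`scripts/04_eda.py` classifies each sampled image with two cheap heuristics on a 32x32 RGB thumbnail: "grayscale" when the mean absolute R-G and G-B differences are both below 3, and "white background" when the mean value of the 1px border exceeds 235. The sketch below restates those two rules on nested lists of `(r, g, b)` tuples so the thresholds can be tried without NumPy or Pillow. The function names and the tiny 4x4 test images are hypothetical; only the thresholds and the border construction (corners counted twice, as in the script's concatenation) come from the script.

```python
# Minimal sketch of the two image heuristics, assuming pixels[y][x] = (r, g, b) in 0..255.
# Function names and test images are illustrative, not part of the script.

def is_grayscale(pixels, tol=3.0):
    """True when R, G and B are nearly equal on average across the image."""
    flat = [px for row in pixels for px in row]
    rg = sum(abs(r - g) for r, g, b in flat) / len(flat)
    gb = sum(abs(g - b) for r, g, b in flat) / len(flat)
    return rg < tol and gb < tol

def has_white_border(pixels, thresh=235.0):
    """True when the mean channel value over the 1px border exceeds thresh."""
    border = (pixels[0] + pixels[-1]
              + [row[0] for row in pixels] + [row[-1] for row in pixels])
    values = [ch for px in border for ch in px]
    return sum(values) / len(values) > thresh

WHITE, RED, GRAY = (255, 255, 255), (255, 0, 0), (128, 128, 128)

# A red product centred on a white background.
product = [[WHITE] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        product[y][x] = RED

# A uniformly mid-gray image.
gray_img = [[GRAY] * 4 for _ in range(4)]

print(is_grayscale(product), has_white_border(product))
print(is_grayscale(gray_img), has_white_border(gray_img))
```

Both rules are averages, so they are tolerant of a few off pixels but can misfire on, for example, a colour photo with a very small coloured region; the reported percentages are best read as rough estimates.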
scripts/05_bonus.py
ADDED
|
@@ -0,0 +1,110 @@
+"""Bonus / depth analyses, reusing the saved embeddings in catalog.parquet (no re-embedding):
+- t-SNE projection (third dimensionality-reduction method)
+- DBSCAN as a second clustering algorithm, with k-distance eps tuning
+- per-cluster representative image grids (what each cluster actually contains)
+- query -> Top-3 visual montage on held-out products
+- FAISS index benchmark vs brute force (a standard similarity-search DS tool)
+Outputs go to artifacts/ and space/assets/. Run from work/ with the venv active."""
+import os, io, json, base64, time, warnings, random
+import numpy as np, pandas as pd
+warnings.filterwarnings("ignore")
+import matplotlib; matplotlib.use("Agg"); import matplotlib.pyplot as plt
+from PIL import Image
+plt.rcParams["figure.dpi"]=120
+SEED=42; random.seed(SEED); np.random.seed(SEED)
+ART="artifacts"; ASSETS="../space/assets"
+
+cat = pd.read_parquet("../space/catalog.parquet")
+EMB = np.array(cat["embedding"].tolist(), dtype="float32")
+KM = cat["cluster"].values
+print("catalog:", cat.shape)
+b64 = lambda t: Image.open(io.BytesIO(base64.b64decode(t))).convert("RGB")
+
+# ---------------- t-SNE (third projection method) ----------------
+from sklearn.decomposition import PCA
+from sklearn.manifold import TSNE
+print("t-SNE (on PCA-50)...")
+P50 = PCA(n_components=50, random_state=SEED).fit_transform(EMB)
+ts = TSNE(n_components=2, perplexity=30, init="pca", random_state=SEED).fit_transform(P50)
+plt.figure(figsize=(9,7))
+s = pd.Series(cat["top_category"]).astype(str)
+for c in sorted(s.unique()):
+    m=(s==c).values; plt.scatter(ts[m,0],ts[m,1],s=5,alpha=0.5,label=c)
+plt.title("t-SNE of CLIP embeddings, coloured by product category"); plt.xticks([]); plt.yticks([])
+plt.legend(title="category", markerscale=3, fontsize=7, ncol=2)
+plt.tight_layout(); plt.savefig(f"{ART}/tsne_category.png"); plt.savefig(f"{ASSETS}/tsne_category.png"); plt.close()
+
+# ---------------- DBSCAN (second clustering algorithm) ----------------
+import umap
+from sklearn.cluster import DBSCAN
+from sklearn.neighbors import NearestNeighbors
+print("UMAP for DBSCAN...")
+U = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=SEED).fit_transform(EMB)
+# k-distance plot to choose eps (rigorous eps selection rather than guessing)
+k=15
+nn = NearestNeighbors(n_neighbors=k).fit(U)
+kd = np.sort(nn.kneighbors(U)[0][:,-1])
+eps = float(np.percentile(kd, 92))   # knee region
+fig, ax = plt.subplots(1,2, figsize=(13,5))
+ax[0].plot(kd, color="#4C72B0"); ax[0].axhline(eps, ls="--", color="grey")
+ax[0].set_title(f"k-distance plot (k={k}) -> eps={eps:.2f}"); ax[0].set_xlabel("points sorted"); ax[0].set_ylabel(f"{k}-NN distance")
+db = DBSCAN(eps=eps, min_samples=k).fit(U)
+lab = db.labels_
+n_clusters = len(set(lab)) - (1 if -1 in lab else 0)
+noise = float((lab==-1).mean())
+for c in sorted(set(lab)):
+    m=lab==c; col="lightgrey" if c==-1 else None
+    ax[1].scatter(U[m,0],U[m,1],s=5,alpha=0.5,label=("noise" if c==-1 else f"c{c}"),color=col)
+ax[1].set_title(f"DBSCAN on UMAP: {n_clusters} clusters, {noise*100:.0f}% noise"); ax[1].set_xticks([]); ax[1].set_yticks([])
+ax[1].legend(fontsize=6, markerscale=2, ncol=2)
+plt.tight_layout(); plt.savefig(f"{ART}/dbscan.png"); plt.savefig(f"{ASSETS}/dbscan.png"); plt.close()
+print(f"DBSCAN: {n_clusters} clusters, noise={noise:.2f}, eps={eps:.2f}")
+
+# ---------------- per-cluster representative image grids ----------------
+print("per-cluster image grids...")
+ncl = int(KM.max())+1
+# Derive each cluster's label from its actual dominant categories (never hardcode -> cannot mislabel).
+def cluster_label(cl):
+    top = cat[cat["cluster"]==cl]["top_category"].value_counts().head(2).index.tolist()
+    return f"cluster {cl}: " + ", ".join(top)
+fig, axes = plt.subplots(ncl, 6, figsize=(12, 2*ncl))
+for cl in range(ncl):
+    members = np.where(KM==cl)[0]
+    centroid = EMB[members].mean(0); centroid/=np.linalg.norm(centroid)
+    nearest = members[np.argsort(-(EMB[members]@centroid))[:6]]
+    for j,idx in enumerate(nearest):
+        ax = axes[cl,j] if ncl>1 else axes[j]
+        ax.imshow(b64(cat.iloc[idx]["thumb"])); ax.axis("off")
+    axes[cl,0].set_title(cluster_label(cl), loc="left", fontsize=9)
+fig.suptitle("Representative products per K-Means cluster (closest to each centroid)", fontsize=12)
+plt.tight_layout(); plt.savefig(f"{ART}/cluster_examples.png"); plt.savefig(f"{ASSETS}/cluster_examples.png"); plt.close()
+
+# ---------------- query -> Top-3 montage on held-out products ----------------
+print("query montage (held-out)...")
+import torch
+from transformers import CLIPModel, CLIPProcessor
+m = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
+proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+@torch.no_grad()
+def ei(img):
+    inp=proc(images=img.convert("RGB"),return_tensors="pt"); v=m.vision_model(pixel_values=inp["pixel_values"])
+    f=m.visual_projection(v.pooler_output); return (f/f.norm(dim=-1,keepdim=True)).numpy()[0]
+ex_files=[f for f in ["cameras.jpg","furniture.jpg","animals.jpg","sporting.jpg","electronics.jpg"] if os.path.exists(f"../space/examples/{f}")]
+fig, axes = plt.subplots(len(ex_files), 4, figsize=(11, 2.6*len(ex_files)))
+for r,fn in enumerate(ex_files):
+    q=Image.open(f"../space/examples/{fn}").convert("RGB"); v=ei(q)
+    sims=EMB@v; order=[]
+    for i in np.argsort(-sims):
+        if all(float(EMB[i]@EMB[j])<=0.985 for j in order): order.append(int(i))
+        if len(order)==3: break
+    axes[r,0].imshow(q); axes[r,0].set_title("query", fontsize=9); axes[r,0].axis("off")
+    for k2,idx in enumerate(order,1):
+        axes[r,k2].imshow(b64(cat.iloc[idx]["thumb"])); axes[r,k2].axis("off")
+        axes[r,k2].set_title(f"#{k2} {sims[idx]:.2f}\n{cat.iloc[idx]['top_category'][:16]}", fontsize=8)
+fig.suptitle("Query (uploaded photo) -> Top-3 recommendations", fontsize=12)
+plt.tight_layout(); plt.savefig(f"{ART}/recommend_examples.png"); plt.savefig(f"{ASSETS}/recommend_examples.png"); plt.close()
+
+# NOTE: the FAISS benchmark is run separately (faiss-cpu and torch share an OpenMP runtime and
+# segfault if imported in the same process). See the notebook, which runs it in an isolated
+# subprocess, and faiss_stats.json for the recorded result.
+print("BONUS PLOTS COMPLETE")
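The query montage in `scripts/05_bonus.py` does not simply take the three highest-scoring items: it walks the catalogue in descending similarity and skips any candidate whose similarity to an already chosen item exceeds 0.985, which keeps near-duplicate listings of the same product out of the Top-3. The sketch below isolates that greedy filter on toy unit vectors. It is an illustration under stated assumptions: `diverse_top_k` and the toy catalogue are hypothetical names and data, and plain Python lists stand in for the NumPy arrays the script uses.

```python
# Minimal sketch of greedy near-duplicate filtering for top-k retrieval.
# Assumes all vectors are unit-length, so a dot product is a cosine similarity.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diverse_top_k(catalog, query, k=3, dup_threshold=0.985):
    """Pick up to k indices by descending similarity to the query, skipping any
    candidate that is more similar than dup_threshold to an item already picked."""
    sims = [dot(row, query) for row in catalog]
    ranked = sorted(range(len(catalog)), key=lambda i: -sims[i])
    picked = []
    for i in ranked:
        if all(dot(catalog[i], catalog[j]) <= dup_threshold for j in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return picked

# Index 1 is an exact duplicate of index 0 and should be skipped.
catalog = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.0]

print(diverse_top_k(catalog, query))        # duplicate filtered out
print(diverse_top_k(catalog, query, k=2))   # same rule, shorter list
```

The trade-off of this design is that the returned list can contain items with noticeably lower similarity than the ones skipped, so the threshold controls how much relevance is given up for variety.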