	---
	library_name: rfdetr
	license: apache-2.0
	pipeline_tag: object-detection
	tags:
	- object-detection
	- brand-detection
	- logo-detection
	- rf-detr
	- rfdetr-small
	- owlv2
	- dinov2
	- pytorch
	model-index:
	- name: logo-detector-rfdetr
	  results:
	  - task:
	      type: object-detection
	      name: Logo detection
	    dataset:
	      type: custom
	      name: brand_detection hf_dataset (42 human-labeled images)
	    metrics:
	    - type: mAP@0.50
	      value: 93.23
	    - type: mAP@0.75
	      value: 84.71
	    - type: mAP@0.85
	      value: 61.81
	    - type: mAP@0.95
	      value: 8.72
	---

	# logo-detector-rfdetr

	Two-stage brand-logo detector for 7 brands — `lenscrafters`, `lufthansa`, `mediamarkt`, `meta`, `nuance`, `oakley`, `ray_ban`.

	The pipeline started as a zero-shot detector (Stage 1) and was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

	\| Stage \| Model \| Supervision \| Output \|
	\|---\|---\|---\|---\|
	\| Stage 1 \| [OWLv2](https://huggingface.co/google/owlv2-large-patch14-ensemble) image-guided detection \| Zero-shot, uses 52 hand-labeled crops as exemplars \| `stage1_gallery/embeddings.pt` (mean image embedding per brand) + `metadata.json` \|
	\| Stage 2 \| [RF-DETR-Small](https://github.com/roboflow/rf-detr) with a frozen DINOv2 backbone \| Fine-tuned on synthetic copy-paste + real images \| `checkpoint_best_ema.pth` — RF-DETR-Small best-EMA checkpoint \|

	Stage 2 far outperforms Stage 1 on the 42-image evaluation set: mAP@0.50 is 93.23% versus 0.46% (see table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

	## Side-by-side evaluation — 42 real human-labeled images

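	Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. Both stages were evaluated on the same 42-image ground-truth set (`brand_detection/data/hf_dataset/`), so the numbers are directly comparable. These 42 images are also part of the Stage 2 train/val split (see Dataset), so the Stage 2 figures should not be read as held-out performance.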

	\| Class \| S1 AP@0.50 \| S2 AP@0.50 \| S1 AP@0.75 \| S2 AP@0.75 \| S1 AP@0.85 \| S2 AP@0.85 \| S1 AP@0.95 \| S2 AP@0.95 \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| lenscrafters \| 0.12 \| 100.00 \| 0.00 \| 100.00 \| 0.00 \| 51.72 \| 0.00 \| 0.00 \|
	\| lufthansa \| 0.36 \| 90.14 \| 0.25 \| 72.91 \| 0.13 \| 56.83 \| 0.00 \| 21.53 \|
	\| mediamarkt \| 0.01 \| 82.58 \| 0.01 \| 82.58 \| 0.01 \| 74.59 \| 0.00 \| 2.48 \|
	\| meta \| 0.05 \| 88.64 \| 0.01 \| 76.84 \| 0.00 \| 25.74 \| 0.00 \| 0.00 \|
	\| nuance \| 0.40 \| 100.00 \| 0.20 \| 100.00 \| 0.01 \| 85.15 \| 0.00 \| 24.09 \|
	\| oakley \| 0.12 \| 91.26 \| 0.00 \| 91.26 \| 0.00 \| 71.29 \| 0.00 \| 12.97 \|
	\| ray_ban \| 2.15 \| 100.00 \| 0.01 \| 69.41 \| 0.01 \| 67.33 \| 0.00 \| 0.00 \|
	\| mAP \| 0.46 \| 93.23 \| 0.07 \| 84.71 \| 0.02 \| 61.81 \| 0.00 \| 8.72 \|

	S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).
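
To make the IoU thresholds in the table concrete, here is a minimal sketch of box IoU for `(x0, y0, x1, y1)` boxes and of which thresholds a single prediction would clear. The function name `box_iou` and the example coordinates are illustrative assumptions, not part of the project's evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


IOU_THRESHOLDS = (0.50, 0.75, 0.85, 0.95)

gt = (100, 100, 200, 150)    # hypothetical ground-truth logo box
pred = (104, 102, 204, 152)  # hypothetical prediction, shifted by a few pixels

iou = box_iou(gt, pred)
passed = [t for t in IOU_THRESHOLDS if iou >= t]
print(f"IoU={iou:.3f}, counts as a match at thresholds {passed}")
```

In this toy case a shift of only a few pixels on a 100 × 50 box already fails the 0.95 threshold, which is one plausible reason AP@0.95 is so much lower than AP@0.50 for small logo boxes.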

	## Training details (Stage 2)

	### Dataset

	The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

	- Real labeled images : 42
	- Synthetic copy-paste images : 1400 (~200 per brand × 7 brands)
	- COCO val2017 backgrounds : 50
	- YOLO split : 1156 train / 286 val (stratified by class, seed 42)
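
As a rough illustration of the copy-paste step, the sketch below picks a random position for a logo crop on a background and derives the matching YOLO-format label line (class id plus normalized center and size). The function `place_crop`, the image sizes, and the sampling strategy are assumptions for illustration; the project's actual generator is not shown in this card and may also scale, rotate, or blend the crops.

```python
import random


def place_crop(bg_size, crop_size, class_id, rng):
    """Pick a top-left corner so the crop fits inside the background.

    Returns the (x, y) paste position and the matching YOLO label line.
    """
    bg_w, bg_h = bg_size
    crop_w, crop_h = crop_size
    if crop_w > bg_w or crop_h > bg_h:
        raise ValueError("crop must fit inside the background")
    x = rng.randint(0, bg_w - crop_w)
    y = rng.randint(0, bg_h - crop_h)
    # YOLO labels: "class cx cy w h", all coordinates normalized to [0, 1].
    cx = (x + crop_w / 2) / bg_w
    cy = (y + crop_h / 2) / bg_h
    label = f"{class_id} {cx:.6f} {cy:.6f} {crop_w / bg_w:.6f} {crop_h / bg_h:.6f}"
    return (x, y), label


rng = random.Random(42)  # fixed seed so placements are reproducible
(x, y), label = place_crop((640, 480), (120, 60), class_id=1, rng=rng)
print((x, y), label)
# With Pillow, the paste itself would be: background.paste(crop, (x, y))
```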

	### Hyperparameters

	- Architecture : RF-DETR-Small with DINOv2-Small (windowed) backbone
	- Backbone : frozen (`lr_encoder = 0`)
	- Head learning rate : `1e-4`
	- Resolution : `640`
	- Effective batch size : 16 (per-device 4 × grad_accum 4)
	- Epochs : 100 with early stopping (patience 20)
	- EMA : enabled, used for best-checkpoint selection
	- Seed : 42
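
For readers unfamiliar with EMA checkpoints, the sketch below shows the general idea of an exponential moving average over weights, using plain Python lists in place of tensors. The decay value and the helper name `ema_update` are assumptions; the card does not state the decay used in this run, and `rfdetr` handles EMA internally.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """Blend the current weights into the running average, element by element."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: raw weights jump around, the EMA copy moves smoothly.
raw_steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
ema = list(raw_steps[0])  # initialise the average from the first weights
for weights in raw_steps[1:]:
    ema = ema_update(ema, weights, decay=0.9)
print(ema)
```

The EMA copy is the one evaluated for best-checkpoint selection, which is why the shipped file is named `checkpoint_best_ema.pth`.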

	The best EMA checkpoint reached `mAP@0.50:0.95 = 0.8981` (≈ 89.81%) on the synthetic + real val split at epoch `33`; this is the checkpoint shipped here (`checkpoint_best_ema.pth`). The training run halted shortly afterwards (around epoch 35) because of a GPU deadlock caused by the host machine going to sleep, which is unrelated to the model itself. The best EMA checkpoint had already been saved by then. No re-training was performed because validation mAP had already started to plateau by epochs 32 and 33, and with patience 20 early stopping would have fired by epoch 53 (33 + 20) unless a later epoch improved on the best score.
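
The early-stopping arithmetic can be sketched as follows. This is a minimal patience counter under the assumption that "patience 20" means stopping once 20 epochs pass without a new best score; the class `EarlyStopping` and the synthetic score curve are hypothetical and are not taken from the `rfdetr` training loop.

```python
class EarlyStopping:
    """Stop once `patience` epochs have passed without a new best score."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None

    def step(self, epoch, score):
        """Record a validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score, self.best_epoch = score, epoch
            return False
        return epoch - self.best_epoch >= self.patience


stopper = EarlyStopping(patience=20)
stop_epoch = None
for epoch in range(1, 101):
    # Hypothetical curve: improves until epoch 33, then stays flat.
    score = 0.8981 if epoch >= 33 else 0.5 + 0.01 * epoch
    if stopper.step(epoch, score):
        stop_epoch = epoch
        break
print(stopper.best_epoch, stop_epoch)
```

With a best score at epoch 33 and no later improvement, this counter stops at epoch 53, well short of the 100-epoch budget.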

	Model parameters : 32.11 M.

	### Hardware

	Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint was ≈ 2.5 h.

	## Usage

	### Stage 2 (recommended) — run RF-DETR-Small inference

	```python
	from huggingface_hub import hf_hub_download
	from rfdetr import RFDETRSmall
	from PIL import Image

	ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
	model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

	CLASS_NAMES = [
	    "lenscrafters", "lufthansa", "mediamarkt", "meta",
	    "nuance", "oakley", "ray_ban",
	]

	# Run inference and print one line per detection above the score threshold.
	img = Image.open("your_image.jpg").convert("RGB")
	dets = model.predict(img, threshold=0.5)
	for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
	    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f} box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
	```

	> The `class_id` returned by `rfdetr` is 0-indexed (`0 = lenscrafters … 6 = ray_ban`). If you compare against the COCO-format ground-truth file shipped with this project, add `+1` to convert back to category IDs `1..7`.
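
The conversion described in the note can be sketched as below. It turns 0-indexed class ids and `xyxy` boxes into COCO-style result entries with 1-indexed `category_id` and `[x, y, width, height]` boxes. The helper name `to_coco_results` and the sample values are hypothetical; the inputs are plain Python sequences standing in for the arrays returned by `predict`.

```python
def to_coco_results(image_id, xyxy, class_ids, scores):
    """Convert detections for one image into COCO result dictionaries."""
    results = []
    for (x0, y0, x1, y1), c, s in zip(xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,
            "category_id": int(c) + 1,  # 0..6 from rfdetr -> 1..7 in the ground truth
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
            "score": float(s),
        })
    return results


# Hypothetical single detection of class 6 (ray_ban) on image 7.
rows = to_coco_results(7, [(10.0, 20.0, 110.0, 70.0)], [6], [0.91])
print(rows)
```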

	### Stage 1 — OWLv2 exemplar gallery

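	The mean OWLv2 image embedding for each brand (tensor of shape `[hidden_dim]`) is stored alongside the Stage 2 checkpoint for reproducibility. It is the per-brand query embedding behind the zero-shot baseline. Note that `Owlv2ForObjectDetection.image_guided_detection` in `transformers` accepts query pixel values rather than precomputed embeddings, so re-running the baseline from these tensors means supplying them as query embeddings at the model's class-prediction step instead of passing them to that method directly.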

	```python
	import json

	import torch
	from huggingface_hub import hf_hub_download

	emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
	meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")

	# Per-brand mean embeddings: dict[class_name] -> Tensor
	embeddings = torch.load(emb_path, map_location="cpu")
	with open(meta_path, encoding="utf-8") as f:
	    metadata = json.load(f)

	print(metadata["classes"])  # ['lenscrafters', ..., 'ray_ban']
	print(embeddings["lufthansa"].shape)  # torch.Size([hidden_dim])
	```
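
For intuition about what "mean image embedding per brand" means, here is a small sketch that averages several exemplar embeddings into one vector per brand. It uses plain Python lists and toy 3-dimensional values as a stand-in for tensors; the helper `mean_embedding` and the data are assumptions, and the card does not say whether the real gallery also normalizes the vectors.

```python
def mean_embedding(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one exemplar embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy exemplar embeddings (hypothetical values, one list per labeled crop).
exemplars = {
    "lufthansa": [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
    "oakley": [[0.5, 0.5, 0.5]],
}
gallery = {brand: mean_embedding(vecs) for brand, vecs in exemplars.items()}
print(gallery)
```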

	<details><summary>Full <code>stage1/metrics.md</code></summary>

	# Stage 1 — OWLv2 exemplar-gallery — detection metrics

	Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

	Predictions: 126731 boxes loaded from `predictions.json`


	Per-class Average Precision (%):


	\| class \| AP@0.50 \| AP@0.75 \| AP@0.85 \| AP@0.95 \| mean \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| lenscrafters \| 0.12 \| 0.00 \| 0.00 \| 0.00 \| 0.03 \|
	\| lufthansa \| 0.36 \| 0.25 \| 0.13 \| 0.00 \| 0.19 \|
	\| mediamarkt \| 0.01 \| 0.01 \| 0.01 \| 0.00 \| 0.00 \|
	\| meta \| 0.05 \| 0.01 \| 0.00 \| 0.00 \| 0.01 \|
	\| nuance \| 0.40 \| 0.20 \| 0.01 \| 0.00 \| 0.15 \|
	\| oakley \| 0.12 \| 0.00 \| 0.00 \| 0.00 \| 0.03 \|
	\| ray_ban \| 2.15 \| 0.01 \| 0.01 \| 0.00 \| 0.54 \|
	\| mAP \| 0.46 \| 0.07 \| 0.02 \| 0.00 \| 0.14 \|

	```
	pycocotools summary (using our custom IoU thresholds):
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= all \| maxDets=100 ] = 0.001
	Average Precision (AP) @[ IoU=0.50 \| area= all \| maxDets=100 ] = 0.005
	Average Precision (AP) @[ IoU=0.75 \| area= all \| maxDets=100 ] = 0.001
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= small \| maxDets=100 ] = 0.002
	Average Precision (AP) @[ IoU=0.50:0.95 \| area=medium \| maxDets=100 ] = 0.009
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= large \| maxDets=100 ] = 0.001
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets= 1 ] = 0.000
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets= 10 ] = 0.015
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets=100 ] = 0.250
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= small \| maxDets=100 ] = 0.500
	Average Recall (AR) @[ IoU=0.50:0.95 \| area=medium \| maxDets=100 ] = 0.319
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= large \| maxDets=100 ] = 0.206
	```


	</details>

	<details><summary>Full <code>stage2/metrics.md</code></summary>

	# Stage 2 — RF-DETR-Small (frozen DINOv2 backbone) — detection metrics

	Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

	Predictions: 6759 boxes loaded from `predictions.json`


	Per-class Average Precision (%):


	\| class \| AP@0.50 \| AP@0.75 \| AP@0.85 \| AP@0.95 \| mean \|
	\| --- \| --- \| --- \| --- \| --- \| --- \|
	\| lenscrafters \| 100.00 \| 100.00 \| 51.72 \| 0.00 \| 62.93 \|
	\| lufthansa \| 90.14 \| 72.91 \| 56.83 \| 21.53 \| 60.35 \|
	\| mediamarkt \| 82.58 \| 82.58 \| 74.59 \| 2.48 \| 60.56 \|
	\| meta \| 88.64 \| 76.84 \| 25.74 \| 0.00 \| 47.81 \|
	\| nuance \| 100.00 \| 100.00 \| 85.15 \| 24.09 \| 77.31 \|
	\| oakley \| 91.26 \| 91.26 \| 71.29 \| 12.97 \| 66.70 \|
	\| ray_ban \| 100.00 \| 69.41 \| 67.33 \| 0.00 \| 59.18 \|
	\| mAP \| 93.23 \| 84.71 \| 61.81 \| 8.72 \| 62.12 \|
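
The `mAP` row and the `mean` column in the table above are plain averages of the per-class AP values, which the following sketch recomputes from the numbers copied out of the table. The variable names are illustrative and this is not the project's metrics script.

```python
# Stage 2 per-class AP (%) at IoU 0.50, 0.75, 0.85, 0.95, copied from the table.
stage2_ap = {
    "lenscrafters": [100.00, 100.00, 51.72, 0.00],
    "lufthansa": [90.14, 72.91, 56.83, 21.53],
    "mediamarkt": [82.58, 82.58, 74.59, 2.48],
    "meta": [88.64, 76.84, 25.74, 0.00],
    "nuance": [100.00, 100.00, 85.15, 24.09],
    "oakley": [91.26, 91.26, 71.29, 12.97],
    "ray_ban": [100.00, 69.41, 67.33, 0.00],
}

# "mean" column: average over the four IoU thresholds for each class.
per_class_mean = {name: sum(aps) / len(aps) for name, aps in stage2_ap.items()}

# "mAP" row: average over the seven classes at each IoU threshold.
map_per_iou = [
    sum(aps[i] for aps in stage2_ap.values()) / len(stage2_ap)
    for i in range(4)
]

print([round(m, 2) for m in map_per_iou])
print(round(sum(map_per_iou) / len(map_per_iou), 2))
```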

	```
	pycocotools summary (using our custom IoU thresholds):
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= all \| maxDets=100 ] = 0.621
	Average Precision (AP) @[ IoU=0.50 \| area= all \| maxDets=100 ] = 0.932
	Average Precision (AP) @[ IoU=0.75 \| area= all \| maxDets=100 ] = 0.847
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= small \| maxDets=100 ] = 0.750
	Average Precision (AP) @[ IoU=0.50:0.95 \| area=medium \| maxDets=100 ] = 0.547
	Average Precision (AP) @[ IoU=0.50:0.95 \| area= large \| maxDets=100 ] = 0.576
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets= 1 ] = 0.605
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets= 10 ] = 0.680
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= all \| maxDets=100 ] = 0.703
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= small \| maxDets=100 ] = 0.750
	Average Recall (AR) @[ IoU=0.50:0.95 \| area=medium \| maxDets=100 ] = 0.667
	Average Recall (AR) @[ IoU=0.50:0.95 \| area= large \| maxDets=100 ] = 0.648
	```


	</details>

	## Files in this repo

	\| Path \| Purpose \|
	\|---\|---\|
	\| `checkpoint_best_ema.pth` \| RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) \|
	\| `stage1_gallery/embeddings.pt` \| Per-class OWLv2 mean image embedding (Stage 1) \|
	\| `stage1_gallery/metadata.json` \| Stage 1 gallery metadata (classes, exemplar counts) \|
	\| `README.md` \| This model card \|