---
library_name: rfdetr
license: apache-2.0
pipeline_tag: object-detection
tags:
- object-detection
- brand-detection
- logo-detection
- rf-detr
- rfdetr-small
- owlv2
- dinov2
- pytorch
model-index:
- name: logo-detector-rfdetr
  results:
  - task:
      type: object-detection
      name: Logo detection
    dataset:
      type: custom
      name: brand_detection hf_dataset (42 human-labeled images)
    metrics:
    - type: mAP@0.50
      value: 93.23
    - type: mAP@0.75
      value: 84.71
    - type: mAP@0.85
      value: 61.81
    - type: mAP@0.95
      value: 8.72
---

# logo-detector-rfdetr

Two-stage brand-logo detector for **7 brands**: `lenscrafters`, `lufthansa`, `mediamarkt`, `meta`, `nuance`, `oakley`, `ray_ban`.

The pipeline started as a *zero-shot* detector (Stage 1), which was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

| Stage | Model | Supervision | Output |
|---|---|---|---|
| **Stage 1** | [OWLv2](https://huggingface.co/google/owlv2-large-patch14-ensemble) image-guided detection | Zero-shot, uses 52 hand-labeled crops as exemplars | `stage1_gallery/embeddings.pt` (mean image embedding per brand) + `metadata.json` |
| **Stage 2** | [RF-DETR-Small](https://github.com/roboflow/rf-detr) with a frozen DINOv2 backbone | Fine-tuned on synthetic copy-paste + real images | `checkpoint_best_ema.pth` (RF-DETR-Small best-EMA checkpoint) |

Stage 2 outperforms Stage 1 by a wide margin (mAP@0.50 of 93.23 vs. 0.46; see the table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

## Side-by-side evaluation: 42 real human-labeled images

Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. **Both stages were evaluated on the same 42-image ground-truth set** (`brand_detection/data/hf_dataset/`), so the numbers are directly comparable.

| Class | S1 AP@0.50 | S2 AP@0.50 | S1 AP@0.75 | S2 AP@0.75 | S1 AP@0.85 | S2 AP@0.85 | S1 AP@0.95 | S2 AP@0.95 |
|---|---|---|---|---|---|---|---|---|
| lenscrafters | 0.12 | 100.00 | 0.00 | 100.00 | 0.00 | 51.72 | 0.00 | 0.00 |
| lufthansa | 0.36 | 90.14 | 0.25 | 72.91 | 0.13 | 56.83 | 0.00 | 21.53 |
| mediamarkt | 0.01 | 82.58 | 0.01 | 82.58 | 0.01 | 74.59 | 0.00 | 2.48 |
| meta | 0.05 | 88.64 | 0.01 | 76.84 | 0.00 | 25.74 | 0.00 | 0.00 |
| nuance | 0.40 | 100.00 | 0.20 | 100.00 | 0.01 | 85.15 | 0.00 | 24.09 |
| oakley | 0.12 | 91.26 | 0.00 | 91.26 | 0.00 | 71.29 | 0.00 | 12.97 |
| ray_ban | 2.15 | 100.00 | 0.01 | 69.41 | 0.01 | 67.33 | 0.00 | 0.00 |
| **mAP** | 0.46 | 93.23 | 0.07 | 84.71 | 0.02 | 61.81 | 0.00 | 8.72 |

*S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).*

## Training details (Stage 2)

### Dataset

The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

- Real labeled images: **42**
- Synthetic copy-paste images: **1400** (~200 per brand × 7 brands)
- COCO val2017 backgrounds: 50
- YOLO split: **1156 train / 286 val** (stratified by class, seed 42)

### Hyperparameters

- Architecture: RF-DETR-Small with DINOv2-Small (windowed) backbone
- Backbone: **frozen** (`lr_encoder = 0`)
- Head learning rate: `1e-4`
- Resolution: `640`
- Effective batch size: 16 (per-device 4 × grad_accum 4)
- Epochs: 100 with early stopping (patience 20)
- EMA: enabled, used for best-checkpoint selection
- Seed: 42

The best EMA checkpoint reached `mAP@0.50:0.95 = 0.8981` (≈ 89.81%) on the synthetic + real val split at epoch `33`; this is the checkpoint shipped here (`checkpoint_best_ema.pth`). The training run was halted shortly after (around epoch 35) by a GPU deadlock caused by the host machine going to sleep (unrelated to the model itself), and the best EMA checkpoint was already saved.
No re-training was performed because the validation metric had already started to plateau (epochs 32, 33), and the patience-20 early-stopping criterion would almost certainly have fired by epoch 53 (best epoch 33 + patience 20).

Model parameters: **32.11 M**.

### Hardware

Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint took ≈ 2.5 h.

## Usage

### Stage 2 (recommended): run RF-DETR-Small inference

```python
from huggingface_hub import hf_hub_download
from rfdetr import RFDETRSmall
from PIL import Image

# Download the fine-tuned checkpoint and load it into RF-DETR-Small.
ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

# Class names in the order of the 0-indexed class IDs returned by the model.
CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

img = Image.open("your_image.jpg").convert("RGB")
dets = model.predict(img, threshold=0.5)

# Print one line per detection: class name, confidence, and pixel box corners.
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f} box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
```

> The `class_id` returned by `rfdetr` is 0-indexed (`0 = lenscrafters … 6 = ray_ban`). If you compare against the COCO-format ground-truth file shipped with this project, add `+1` to convert back to category IDs `1..7`.

### Stage 1: OWLv2 exemplar gallery

The mean OWLv2 image embedding for each brand (a tensor of shape `[hidden_dim]`) is stored alongside the Stage 2 checkpoint for reproducibility. It is the input you would feed into `Owlv2ForObjectDetection.image_guided_detection` if you wanted to re-run the zero-shot baseline.

```python
import json

import torch
from huggingface_hub import hf_hub_download

# Download the Stage 1 gallery artefacts from the model repo.
emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")

embeddings = torch.load(emb_path, map_location="cpu")  # dict[class_name] -> Tensor
with open(meta_path) as f:
    metadata = json.load(f)

print(metadata["classes"])            # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape)  # torch.Size([hidden_dim])
```
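
Returning to the Stage 2 note on 0-indexed `class_id` values: the following is a minimal sketch of turning detections into COCO-style result records, assuming the ground-truth file uses category IDs `1..7` as stated above and the standard COCO `[x, y, width, height]` box layout. The helper name `to_coco_results` and the `image_id` value are hypothetical and not part of this repository.

```python
def to_coco_results(image_id, boxes_xyxy, class_ids, scores):
    """Convert (x0, y0, x1, y1) detections with 0-indexed classes to COCO result dicts."""
    results = []
    for (x0, y0, x1, y1), c, s in zip(boxes_xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,            # hypothetical ID; must match the ground-truth file
            "category_id": int(c) + 1,       # 0-indexed class_id -> category IDs 1..7
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],  # xyxy -> xywh
            "score": float(s),
        })
    return results


# Illustrative detection: class_id 6 (ray_ban) becomes category_id 7.
example = to_coco_results(
    image_id=1,
    boxes_xyxy=[(10.0, 20.0, 110.0, 70.0)],
    class_ids=[6],
    scores=[0.91],
)
print(example)
```

With real model output, `dets.xyxy`, `dets.class_id`, and `dets.confidence` from the Stage 2 snippet could be passed in place of the illustrative lists.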

## Full `stage1/metrics.md`

### Stage 1: OWLv2 exemplar-gallery detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 126731 boxes loaded from `predictions.json`

Per-class Average Precision (%):

| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| lufthansa | 0.36 | 0.25 | 0.13 | 0.00 | 0.19 |
| mediamarkt | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 |
| meta | 0.05 | 0.01 | 0.00 | 0.00 | 0.01 |
| nuance | 0.40 | 0.20 | 0.01 | 0.00 | 0.15 |
| oakley | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| ray_ban | 2.15 | 0.01 | 0.01 | 0.00 | 0.54 |
| **mAP** | **0.46** | **0.07** | **0.02** | **0.00** | **0.14** |

```
pycocotools summary (using our custom IoU thresholds):
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206
```
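
The IoU thresholds in the table above decide whether a predicted box counts as a match for a ground-truth box. A minimal sketch of box IoU for `(x0, y0, x1, y1)` boxes follows; the function name `box_iou` and the example boxes are illustrative assumptions, and this is not the evaluation code behind the reported metrics (those come from pycocotools).

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap region, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Illustrative boxes: a 100x100 ground truth and a prediction shifted 10 px to the right.
gt = (0.0, 0.0, 100.0, 100.0)
pred = (10.0, 0.0, 110.0, 100.0)
iou = box_iou(gt, pred)

# Which of the four reported thresholds would this prediction satisfy?
matches = {t: iou >= t for t in (0.50, 0.75, 0.85, 0.95)}
print(f"IoU = {iou:.3f}", matches)
```

In this example a 10-pixel horizontal offset on a 100-pixel box passes at 0.50 and 0.75 but fails at 0.85 and 0.95, which illustrates how quickly the stricter thresholds penalize small localization errors.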

## Full `stage2/metrics.md`

### Stage 2: RF-DETR-Small (frozen DINOv2 backbone) detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 6759 boxes loaded from `predictions.json`

Per-class Average Precision (%):

| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 100.00 | 100.00 | 51.72 | 0.00 | 62.93 |
| lufthansa | 90.14 | 72.91 | 56.83 | 21.53 | 60.35 |
| mediamarkt | 82.58 | 82.58 | 74.59 | 2.48 | 60.56 |
| meta | 88.64 | 76.84 | 25.74 | 0.00 | 47.81 |
| nuance | 100.00 | 100.00 | 85.15 | 24.09 | 77.31 |
| oakley | 91.26 | 91.26 | 71.29 | 12.97 | 66.70 |
| ray_ban | 100.00 | 69.41 | 67.33 | 0.00 | 59.18 |
| **mAP** | **93.23** | **84.71** | **61.81** | **8.72** | **62.12** |

```
pycocotools summary (using our custom IoU thresholds):
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.847
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.605
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.680
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
```
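
The **mAP** row can be checked as the unweighted mean of the per-class AP values in each column. A small sketch using the Stage 2 numbers from the table above; the variable names are illustrative, and the inputs are the rounded table values rather than the raw evaluator output.

```python
# Per-class Stage 2 AP (%) at IoU 0.50, 0.75, 0.85, 0.95, copied from the table.
stage2_ap = {
    "lenscrafters": [100.00, 100.00, 51.72, 0.00],
    "lufthansa":    [90.14, 72.91, 56.83, 21.53],
    "mediamarkt":   [82.58, 82.58, 74.59, 2.48],
    "meta":         [88.64, 76.84, 25.74, 0.00],
    "nuance":       [100.00, 100.00, 85.15, 24.09],
    "oakley":       [91.26, 91.26, 71.29, 12.97],
    "ray_ban":      [100.00, 69.41, 67.33, 0.00],
}
thresholds = [0.50, 0.75, 0.85, 0.95]

# Unweighted mean over the 7 classes for each IoU threshold.
raw_means = [
    sum(ap[i] for ap in stage2_ap.values()) / len(stage2_ap)
    for i in range(len(thresholds))
]
map_per_threshold = [round(m, 2) for m in raw_means]

# Mean over the four thresholds (the bottom-right cell of the table).
overall = round(sum(raw_means) / len(raw_means), 2)

print(map_per_threshold)  # [93.23, 84.71, 61.81, 8.72]
print(overall)            # 62.12
```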

## Files in this repo

| Path | Purpose |
|---|---|
| `checkpoint_best_ema.pth` | RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) |
| `stage1_gallery/embeddings.pt` | Per-class OWLv2 mean image embedding (Stage 1) |
| `stage1_gallery/metadata.json` | Stage 1 gallery metadata (classes, exemplar counts) |
| `README.md` | This model card |