---
library_name: rfdetr
license: apache-2.0
pipeline_tag: object-detection
tags:
- object-detection
- brand-detection
- logo-detection
- rf-detr
- rfdetr-small
- owlv2
- dinov2
- pytorch
model-index:
- name: logo-detector-rfdetr
  results:
  - task:
      type: object-detection
      name: Logo detection
    dataset:
      type: custom
      name: brand_detection hf_dataset (42 human-labeled images)
    metrics:
    - type: mAP@0.50
      value: 93.23
    - type: mAP@0.75
      value: 84.71
    - type: mAP@0.85
      value: 61.81
    - type: mAP@0.95
      value: 8.72
---

# logo-detector-rfdetr

Two-stage brand-logo detector for **7 brands**: `lenscrafters`, `lufthansa`, `mediamarkt`, `meta`, `nuance`, `oakley`, `ray_ban`.

The pipeline started as a *zero-shot* detector (Stage 1), which was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

| Stage | Model | Supervision | Output |
|---|---|---|---|
| **Stage 1** | [OWLv2](https://huggingface.co/google/owlv2-large-patch14-ensemble) image-guided detection | Zero-shot, uses 52 hand-labeled crops as exemplars | `stage1_gallery/embeddings.pt` (mean image embedding per brand) + `metadata.json` |
| **Stage 2** | [RF-DETR-Small](https://github.com/roboflow/rf-detr) with a frozen DINOv2 backbone | Fine-tuned on synthetic copy-paste + real images | `checkpoint_best_ema.pth` (RF-DETR-Small best-EMA checkpoint) |

Stage 2 far outperforms Stage 1: 93.23 vs. 0.46 mAP@0.50 on the 42-image evaluation set (see table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

## Side-by-side evaluation: 42 real human-labeled images

Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. **Both stages were evaluated on the same 42-image ground-truth set** (`brand_detection/data/hf_dataset/`), so the numbers are directly comparable.

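To make the IoU thresholds concrete, the following minimal sketch computes the intersection over union of two axis-aligned boxes and checks it against the four thresholds used here. The boxes are made-up values, not taken from the evaluation set, and `iou_xyxy` is a hypothetical helper name rather than part of the project's evaluation code.

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Illustrative boxes: a 100 x 100 ground truth and a prediction shifted 5 px right.
gt = (0, 0, 100, 100)
pred = (5, 0, 105, 100)

iou = iou_xyxy(gt, pred)
matches = {t: iou >= t for t in (0.50, 0.75, 0.85, 0.95)}
print(round(iou, 4), matches)
# 0.9048 {0.5: True, 0.75: True, 0.85: True, 0.95: False}
```

A 5 px shift on a 100 px box already fails the 0.95 threshold, which helps explain why AP at IoU 0.95 is so much lower than at 0.50.
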
| Class | S1 AP@0.50 | S2 AP@0.50 | S1 AP@0.75 | S2 AP@0.75 | S1 AP@0.85 | S2 AP@0.85 | S1 AP@0.95 | S2 AP@0.95 |
|---|---|---|---|---|---|---|---|---|
| lenscrafters | 0.12 | 100.00 | 0.00 | 100.00 | 0.00 | 51.72 | 0.00 | 0.00 |
| lufthansa | 0.36 | 90.14 | 0.25 | 72.91 | 0.13 | 56.83 | 0.00 | 21.53 |
| mediamarkt | 0.01 | 82.58 | 0.01 | 82.58 | 0.01 | 74.59 | 0.00 | 2.48 |
| meta | 0.05 | 88.64 | 0.01 | 76.84 | 0.00 | 25.74 | 0.00 | 0.00 |
| nuance | 0.40 | 100.00 | 0.20 | 100.00 | 0.01 | 85.15 | 0.00 | 24.09 |
| oakley | 0.12 | 91.26 | 0.00 | 91.26 | 0.00 | 71.29 | 0.00 | 12.97 |
| ray_ban | 2.15 | 100.00 | 0.01 | 69.41 | 0.01 | 67.33 | 0.00 | 0.00 |
| **mAP** | 0.46 | 93.23 | 0.07 | 84.71 | 0.02 | 61.81 | 0.00 | 8.72 |

*S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).*

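The **mAP** row appears to be the unweighted mean of the seven per-class AP values in its column. A quick check for the S2 AP@0.50 column, with the values copied from the table above (the variable names are illustrative only):

```python
# Per-class S2 AP@0.50 values (%), copied from the comparison table.
s2_ap50 = {
    "lenscrafters": 100.00,
    "lufthansa": 90.14,
    "mediamarkt": 82.58,
    "meta": 88.64,
    "nuance": 100.00,
    "oakley": 91.26,
    "ray_ban": 100.00,
}

# mAP as the plain average over classes.
map50 = sum(s2_ap50.values()) / len(s2_ap50)
print(f"{map50:.2f}")  # 93.23
```
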
## Training details (Stage 2)

### Dataset

The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

- Real labeled images: **42**
- Synthetic copy-paste images: **1400** (~200 per brand × 7 brands)
- COCO val2017 backgrounds: 50
- YOLO split: **1156 train / 286 val** (stratified by class, seed 42)

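The generation script for the synthetic split is not part of this card. As a rough sketch of the copy-paste idea only, the snippet below picks a random placement for a logo crop on a background and derives the matching YOLO-format label. The function names, the 10 to 30 % scale range, and the example sizes are all assumptions made for illustration.

```python
import random


def place_logo(bg_size, logo_size, rng, scale_range=(0.10, 0.30)):
    """Pick a random paste box (x0, y0, x1, y1) that keeps the logo's aspect ratio."""
    bg_w, bg_h = bg_size
    logo_w, logo_h = logo_size
    # Target width as a random fraction of the background width (assumed range).
    new_w = max(1, round(bg_w * rng.uniform(*scale_range)))
    new_h = max(1, round(logo_h * new_w / logo_w))
    if new_h > bg_h:  # very tall crops: shrink to fit the background height
        new_w = max(1, round(new_w * bg_h / new_h))
        new_h = bg_h
    x0 = rng.randint(0, bg_w - new_w)
    y0 = rng.randint(0, bg_h - new_h)
    return (x0, y0, x0 + new_w, y0 + new_h)


def yolo_label(class_id, box, bg_size):
    """Format a box as a YOLO line: class cx cy w h, normalised to [0, 1]."""
    x0, y0, x1, y1 = box
    bg_w, bg_h = bg_size
    cx, cy = (x0 + x1) / 2 / bg_w, (y0 + y1) / 2 / bg_h
    w, h = (x1 - x0) / bg_w, (y1 - y0) / bg_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


rng = random.Random(42)  # fixed seed, mirroring the seed used for the split
box = place_logo((640, 480), (200, 80), rng)
label = yolo_label(1, box, (640, 480))
print(box, label)
```

With Pillow, the actual compositing step would then resize the crop to the box size and call `Image.paste` at `(x0, y0)`, using the crop's alpha channel as the mask if it has one.
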
### Hyperparameters

- Architecture: RF-DETR-Small with DINOv2-Small (windowed) backbone
- Backbone: **frozen** (`lr_encoder = 0`)
- Head learning rate: `1e-4`
- Resolution: `640`
- Effective batch size: 16 (per-device 4 × grad_accum 4)
- Epochs: 100 with early stopping (patience 20)
- EMA: enabled, used for best-checkpoint selection
- Seed: 42

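For orientation, the settings above can be collected into a single keyword dictionary. This is a sketch, not the project's training script: the key names are assumptions modelled on the `rfdetr` training arguments and may differ between library versions, and `dataset_dir` is a placeholder path.

```python
# Assumed keyword names; check them against the installed rfdetr version.
train_kwargs = {
    "dataset_dir": "path/to/dataset",  # hypothetical placeholder
    "epochs": 100,
    "batch_size": 4,                   # per-device batch size
    "grad_accum_steps": 4,             # gradient accumulation steps
    "lr": 1e-4,                        # head learning rate
    "lr_encoder": 0.0,                 # backbone frozen
    "resolution": 640,
    "early_stopping": True,
    "early_stopping_patience": 20,
    "use_ema": True,
    "gradient_checkpointing": True,
    "seed": 42,
}

# Effective batch size = per-device batch size x accumulation steps.
effective_batch = train_kwargs["batch_size"] * train_kwargs["grad_accum_steps"]
print(effective_batch)  # 16

# Assumed usage, left commented out because it needs the dataset and a GPU:
# from rfdetr import RFDETRSmall
# RFDETRSmall().train(**train_kwargs)
```
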
The best EMA checkpoint reached `mAP@0.50:0.95 = 0.8981` (89.81%) on the synthetic + real val split at epoch `33`; this is the checkpoint shipped here (`checkpoint_best_ema.pth`). The training run was halted shortly afterwards (around epoch 35) by a GPU deadlock caused by the host machine going to sleep, which is unrelated to the model itself, and the best EMA checkpoint had already been saved. No re-training was performed because the epochs following the best checkpoint had already started to plateau, and with the best score at epoch 33 the patience-20 early-stopping criterion would most likely have ended the run by epoch 53 anyway.

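The interaction between the best epoch and the patience setting can be sketched with a simple counter. This is a generic patience rule written for illustration; the exact counting convention used by the training library is not verified here, and the validation curve below is invented.

```python
def early_stop_epoch(val_scores, patience):
    """Return the 1-indexed epoch at which a simple patience rule stops, or None."""
    best = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None


# Invented curve: rises to its peak at epoch 33, then stays slightly below it.
scores = [0.5 + 0.3981 * e / 33 for e in range(1, 34)] + [0.89] * 67

stop = early_stop_epoch(scores, patience=20)
print(stop)  # 53
```

Under this rule, a run whose best score falls at epoch 33 and never improves again stops 20 epochs later, well short of the 100-epoch budget.
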
Model parameters: **32.11 M**.

### Hardware

Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint took ≈ 2.5 h.

## Usage

### Stage 2 (recommended): run RF-DETR-Small inference

```python
from huggingface_hub import hf_hub_download
from PIL import Image
from rfdetr import RFDETRSmall

# Download the fine-tuned checkpoint from the Hub and load it into RF-DETR-Small.
ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

# Class names in model output order (class_id 0..6).
CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

img = Image.open("your_image.jpg").convert("RGB")
dets = model.predict(img, threshold=0.5)  # confidence threshold
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f} box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
```

> The `class_id` returned by `rfdetr` is 0-indexed (`0 = lenscrafters … 6 = ray_ban`). If you compare against the COCO-format ground-truth file shipped with this project, add `+1` to convert back to category IDs `1..7`.

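The ID conversion described in the note can be sketched as a small helper that turns detections into COCO-style result entries. The function name and the example values are hypothetical; the output layout follows the standard COCO results format (`image_id`, `category_id`, `bbox` as `[x, y, width, height]`, `score`).

```python
def to_coco_results(image_id, boxes_xyxy, class_ids, scores):
    """Convert 0-indexed xyxy detections into COCO-style result dicts."""
    results = []
    for (x0, y0, x1, y1), c, s in zip(boxes_xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,
            "category_id": int(c) + 1,  # 0..6 -> COCO category IDs 1..7
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
            "score": float(s),
        })
    return results


# Made-up detection: class_id 6 (ray_ban) on image 1.
example = to_coco_results(1, [(10.0, 20.0, 110.0, 70.0)], [6], [0.91])
print(example)
# [{'image_id': 1, 'category_id': 7, 'bbox': [10.0, 20.0, 100.0, 50.0], 'score': 0.91}]
```

In practice the three sequences would come from `dets.xyxy`, `dets.class_id`, and `dets.confidence` in the inference snippet above.
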
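### Stage 1: OWLv2 exemplar gallery

The mean OWLv2 image embedding for each brand (a tensor of shape `[hidden_dim]`) is stored alongside the Stage 2 checkpoint for reproducibility. It plays the role of the query embedding that `Owlv2ForObjectDetection.image_guided_detection` derives from a query image. That method itself takes pixel values, so re-running the zero-shot baseline from the stored tensors means supplying them as query embeddings to the model's class-prediction head rather than calling `image_guided_detection` directly.
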
```python
import json

import torch
from huggingface_hub import hf_hub_download

emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")

embeddings = torch.load(emb_path, map_location="cpu")  # dict[class_name] -> Tensor
with open(meta_path) as f:
    metadata = json.load(f)

print(metadata["classes"])            # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape)  # torch.Size([hidden_dim])
```

<details><summary>Full <code>stage1/metrics.md</code></summary>

# Stage 1: OWLv2 exemplar-gallery detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 126731 boxes loaded from `predictions.json`

Per-class Average Precision (%):

| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| lufthansa | 0.36 | 0.25 | 0.13 | 0.00 | 0.19 |
| mediamarkt | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 |
| meta | 0.05 | 0.01 | 0.00 | 0.00 | 0.01 |
| nuance | 0.40 | 0.20 | 0.01 | 0.00 | 0.15 |
| oakley | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| ray_ban | 2.15 | 0.01 | 0.01 | 0.00 | 0.54 |
| **mAP** | **0.46** | **0.07** | **0.02** | **0.00** | **0.14** |

```
pycocotools summary (using our custom IoU thresholds):
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.005
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206
```

</details>

<details><summary>Full <code>stage2/metrics.md</code></summary>

# Stage 2: RF-DETR-Small (frozen DINOv2 backbone) detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 6759 boxes loaded from `predictions.json`

Per-class Average Precision (%):

| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 100.00 | 100.00 | 51.72 | 0.00 | 62.93 |
| lufthansa | 90.14 | 72.91 | 56.83 | 21.53 | 60.35 |
| mediamarkt | 82.58 | 82.58 | 74.59 | 2.48 | 60.56 |
| meta | 88.64 | 76.84 | 25.74 | 0.00 | 47.81 |
| nuance | 100.00 | 100.00 | 85.15 | 24.09 | 77.31 |
| oakley | 91.26 | 91.26 | 71.29 | 12.97 | 66.70 |
| ray_ban | 100.00 | 69.41 | 67.33 | 0.00 | 59.18 |
| **mAP** | **93.23** | **84.71** | **61.81** | **8.72** | **62.12** |

```
pycocotools summary (using our custom IoU thresholds):
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.847
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.605
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.680
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.703
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
```

</details>

## Files in this repo

| Path | Purpose |
|---|---|
| `checkpoint_best_ema.pth` | RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) |
| `stage1_gallery/embeddings.pt` | Per-class OWLv2 mean image embedding (Stage 1) |
| `stage1_gallery/metadata.json` | Stage 1 gallery metadata (classes, exemplar counts) |
| `README.md` | This model card |