---
library_name: rfdetr
license: apache-2.0
pipeline_tag: object-detection
tags:
- object-detection
- brand-detection
- logo-detection
- rf-detr
- rfdetr-small
- owlv2
- dinov2
- pytorch
model-index:
- name: logo-detector-rfdetr
  results:
  - task:
      type: object-detection
      name: Logo detection
    dataset:
      type: custom
      name: brand_detection hf_dataset (42 human-labeled images)
    metrics:
    - type: mAP@0.50
      value: 93.23
    - type: mAP@0.75
      value: 84.71
    - type: mAP@0.85
      value: 61.81
    - type: mAP@0.95
      value: 8.72
---

# logo-detector-rfdetr

Two-stage brand-logo detector for **7 brands** — `lenscrafters`, `lufthansa`, `mediamarkt`, `meta`, `nuance`, `oakley`, `ray_ban`.

The pipeline started as a *zero-shot* detector (Stage 1) and was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

| Stage | Model | Supervision | Output |
|---|---|---|---|
| **Stage 1** | [OWLv2](https://huggingface.co/google/owlv2-large-patch14-ensemble) image-guided detection | Zero-shot, uses 52 hand-labeled crops as exemplars | `stage1_gallery/embeddings.pt` (mean image embedding per brand) + `metadata.json` |
| **Stage 2** | [RF-DETR-Small](https://github.com/roboflow/rf-detr) with a frozen DINOv2 backbone | Fine-tuned on synthetic copy-paste + real images | `checkpoint_best_ema.pth` — RF-DETR-Small best-EMA checkpoint |

Stage 2 far outperforms Stage 1 on the 42-image evaluation set: mAP@0.50 is 93.23% versus 0.46% (see the table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

## Side-by-side evaluation — 42 real human-labeled images

Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. **Both stages were evaluated on the same 42-image ground-truth set** (`brand_detection/data/hf_dataset/`), so the numbers are directly comparable.

| Class | S1 AP@0.50 | S2 AP@0.50 | S1 AP@0.75 | S2 AP@0.75 | S1 AP@0.85 | S2 AP@0.85 | S1 AP@0.95 | S2 AP@0.95 |
|---|---|---|---|---|---|---|---|---|
| lenscrafters | 0.12 | 100.00 | 0.00 | 100.00 | 0.00 | 51.72 | 0.00 | 0.00 |
| lufthansa | 0.36 | 90.14 | 0.25 | 72.91 | 0.13 | 56.83 | 0.00 | 21.53 |
| mediamarkt | 0.01 | 82.58 | 0.01 | 82.58 | 0.01 | 74.59 | 0.00 | 2.48 |
| meta | 0.05 | 88.64 | 0.01 | 76.84 | 0.00 | 25.74 | 0.00 | 0.00 |
| nuance | 0.40 | 100.00 | 0.20 | 100.00 | 0.01 | 85.15 | 0.00 | 24.09 |
| oakley | 0.12 | 91.26 | 0.00 | 91.26 | 0.00 | 71.29 | 0.00 | 12.97 |
| ray_ban | 2.15 | 100.00 | 0.01 | 69.41 | 0.01 | 67.33 | 0.00 | 0.00 |
| **mAP** | 0.46 | 93.23 | 0.07 | 84.71 | 0.02 | 61.81 | 0.00 | 8.72 |

*S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).*
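As a rough illustration of how to read the table, the sketch below shows the two pieces of arithmetic behind it: a prediction counts as a match at a given threshold only if its IoU with the ground-truth box reaches that threshold, and the **mAP** row is the unweighted mean of the per-class APs. The boxes are hypothetical and `iou_xyxy` is an illustrative helper, not the evaluation script used for this card.

```python
from statistics import mean

IOU_THRESHOLDS = (0.50, 0.75, 0.85, 0.95)


def iou_xyxy(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: a prediction shifted by 5 px from a 100x100 ground-truth box.
gt = (0, 0, 100, 100)
pred = (5, 5, 105, 105)
overlap = iou_xyxy(gt, pred)
passed = [t for t in IOU_THRESHOLDS if overlap >= t]
print(f"IoU = {overlap:.3f}, counts as a match at thresholds {passed}")

# Stage 2 AP@0.50 per class, copied from the table above.
s2_ap50 = {
    "lenscrafters": 100.00, "lufthansa": 90.14, "mediamarkt": 82.58,
    "meta": 88.64, "nuance": 100.00, "oakley": 91.26, "ray_ban": 100.00,
}
s2_map50 = round(mean(s2_ap50.values()), 2)
print(f"Stage 2 mAP@0.50 = {s2_map50}")
```

A 5 px shift on a 100 px box already fails the 0.85 and 0.95 thresholds, which helps explain why AP drops sharply in the two right-most column pairs.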

## Training details (Stage 2)

### Dataset

The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

- Real labeled images : **42**
- Synthetic copy-paste images : **1400** (~200 per brand × 7 brands)
- COCO val2017 backgrounds : 50
- YOLO split : **1156 train / 286 val** (stratified by class, seed 42)
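The split script itself is not part of this repo. A minimal sketch of a class-stratified split with a fixed seed could look like the following; `stratified_split` is a hypothetical helper, the 0.2 validation fraction is inferred from 286 / 1442 ≈ 0.198, and each image is assumed to carry a single class label.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (image_id, class_name) pairs so each class keeps ~val_fraction in val."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, class_name in samples:
        by_class[class_name].append(image_id)

    train, val = [], []
    for class_name in sorted(by_class):       # sorted for determinism
        ids = sorted(by_class[class_name])
        rng.shuffle(ids)                       # same seed -> same shuffle
        n_val = round(len(ids) * val_fraction)
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical toy data: 10 images for each of two brands.
samples = [(f"{brand}_{i:02d}", brand) for brand in ("meta", "oakley") for i in range(10)]
train_ids, val_ids = stratified_split(samples)
print(len(train_ids), "train /", len(val_ids), "val")
```

Splitting per class keeps every brand represented in the validation set, which matters when some brands contribute only a handful of real images.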

### Hyperparameters

- Architecture : RF-DETR-Small with DINOv2-Small (windowed) backbone
- Backbone : **frozen** (`lr_encoder = 0`)
- Head learning rate : `1e-4`
- Resolution : `640`
- Effective batch size : 16 (per-device 4 × grad_accum 4)
- Epochs : 100 with early stopping (patience 20)
- EMA : enabled, used for best-checkpoint selection
- Seed : 42
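For reference, the values above can be gathered into a single keyword-argument dictionary. The key names below are assumptions modelled on the arguments of `rfdetr`'s `train` method and are not confirmed by this card, so check them against the version you have installed. The dataset path in the commented call is hypothetical as well.

```python
# Hyperparameters from the list above, collected in one place.
# Key names are assumptions; verify them against your installed rfdetr version.
train_kwargs = {
    "epochs": 100,
    "batch_size": 4,                 # per-device batch size
    "grad_accum_steps": 4,           # gradient accumulation steps
    "lr": 1e-4,                      # head learning rate
    "lr_encoder": 0.0,               # 0 keeps the DINOv2 backbone frozen
    "resolution": 640,
    "use_ema": True,
    "early_stopping": True,
    "early_stopping_patience": 20,
}

# The effective batch size is the per-device batch times the accumulation steps.
effective_batch_size = train_kwargs["batch_size"] * train_kwargs["grad_accum_steps"]
print("effective batch size:", effective_batch_size)

# Hypothetical call, left commented out because it needs the dataset on disk:
# model.train(dataset_dir="path/to/dataset", **train_kwargs)
```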

The best EMA checkpoint reached `mAP@0.50:0.95 = 0.8981` (89.81%) on the synthetic + real val split at epoch `33`; this is the checkpoint shipped here (`checkpoint_best_ema.pth`). The training run halted shortly afterwards (around epoch 35) because the host machine went to sleep and the GPU deadlocked, which is unrelated to the model itself; the best EMA checkpoint had already been saved by then. The run was not restarted because the validation metric had already started to plateau around the best epoch, so the patience-20 early-stopping criterion would most likely have fired around epoch 53, 20 epochs after the best one.
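The early-stopping reasoning can be sketched with a small patience counter. This is an illustrative stand-in, not RF-DETR's own implementation, and the metric curve is made up: it peaks at 0.8981 at epoch 33 and then stays flat just below that value. The exact stopping epoch depends on how a given implementation counts patience.

```python
class EarlyStopper:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = None

    def step(self, epoch, metric):
        """Record one epoch; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best, self.best_epoch = metric, epoch
            return False
        return epoch - self.best_epoch >= self.patience


# Hypothetical curve: rises steadily, peaks at epoch 33, then plateaus below the peak.
stopper = EarlyStopper(patience=20)
stop_epoch = None
for epoch in range(100):
    metric = 0.8981 if epoch == 33 else min(0.89, 0.5 + 0.0125 * epoch)
    if stopper.step(epoch, metric):
        stop_epoch = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stop_epoch}")
```

Under this convention a plateau after epoch 33 ends the run 20 epochs later, far short of the 100-epoch budget.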

Model parameters : **32.11 M**.

### Hardware

Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s, so reaching the best EMA checkpoint took ≈ 2.5 h.
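The timing figures can be cross-checked with a few lines of arithmetic. The sketch assumes a constant epoch time and that the best checkpoint corresponds to 33 completed epochs; the full-run figure is an extrapolation, not a measurement.

```python
SECONDS_PER_EPOCH = 4 * 60 + 40   # about 4 min 40 s per epoch, from the text above
BEST_EPOCH = 33                   # epoch of the shipped checkpoint
MAX_EPOCHS = 100                  # configured epoch budget

hours_to_best = BEST_EPOCH * SECONDS_PER_EPOCH / 3600
hours_full_run = MAX_EPOCHS * SECONDS_PER_EPOCH / 3600

print(f"to best checkpoint: {hours_to_best:.2f} h")
print(f"full 100 epochs:    {hours_full_run:.2f} h")
```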

## Usage

### Stage 2 (recommended) — run RF-DETR-Small inference

```python
from huggingface_hub import hf_hub_download
from rfdetr import RFDETRSmall
from PIL import Image

ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

img = Image.open("your_image.jpg").convert("RGB")
dets = model.predict(img, threshold=0.5)
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f}  box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
```

> The `class_id` returned by `rfdetr` is 0-indexed (`0 = lenscrafters … 6 = ray_ban`). If you compare against the COCO-format ground-truth file shipped with this project, add `+1` to convert back to category IDs `1..7`.
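A minimal sketch of that conversion, assuming you want COCO-style result dictionaries (`bbox` as `[x, y, width, height]`, category IDs `1..7`). `to_coco_results` is a hypothetical helper, and the detection values are made up for illustration.

```python
def to_coco_results(image_id, boxes_xyxy, class_ids, scores):
    """Convert 0-indexed detections to COCO-style result dicts (category IDs 1..7)."""
    results = []
    for (x0, y0, x1, y1), class_id, score in zip(boxes_xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,
            "category_id": int(class_id) + 1,   # 0..6 -> 1..7
            # COCO boxes are [x, y, width, height], not corner coordinates.
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
            "score": float(score),
        })
    return results


# Hypothetical detection: class 6 (ray_ban) with a made-up box and score.
results = to_coco_results(
    image_id=1,
    boxes_xyxy=[(10, 20, 110, 70)],
    class_ids=[6],
    scores=[0.91],
)
print(results[0])
```

With the usage example above, the same function could be called with `dets.xyxy`, `dets.class_id`, and `dets.confidence`.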

### Stage 1 — OWLv2 exemplar gallery

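The mean OWLv2 image embedding for each brand (a tensor of shape `[hidden_dim]`) is stored alongside the Stage 2 checkpoint for reproducibility. It serves as the per-brand query embedding for re-running the zero-shot baseline. Note that `Owlv2ForObjectDetection.image_guided_detection` itself takes query images (`query_pixel_values`) rather than precomputed embeddings, so reusing a stored embedding means passing it as the query embedding to the model's class predictor instead.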

```python
import json

import torch
from huggingface_hub import hf_hub_download

emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")
embeddings = torch.load(emb_path, map_location="cpu")   # dict[class_name] -> Tensor
with open(meta_path, encoding="utf-8") as f:            # close the file after reading
    metadata = json.load(f)
print(metadata["classes"])              # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape)    # torch.Size([hidden_dim])
```

<details><summary>Full <code>stage1/metrics.md</code></summary>

# Stage 1 — OWLv2 exemplar-gallery — detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 126731 boxes loaded from `predictions.json`


Per-class Average Precision (%):


| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| lufthansa | 0.36 | 0.25 | 0.13 | 0.00 | 0.19 |
| mediamarkt | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 |
| meta | 0.05 | 0.01 | 0.00 | 0.00 | 0.01 |
| nuance | 0.40 | 0.20 | 0.01 | 0.00 | 0.15 |
| oakley | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| ray_ban | 2.15 | 0.01 | 0.01 | 0.00 | 0.54 |
| **mAP** | **0.46** | **0.07** | **0.02** | **0.00** | **0.14** |

```
pycocotools summary (using our custom IoU thresholds):
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206
```


</details>

<details><summary>Full <code>stage2/metrics.md</code></summary>

# Stage 2 — RF-DETR-Small (frozen DINOv2 backbone) — detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 6759 boxes loaded from `predictions.json`


Per-class Average Precision (%):


| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 100.00 | 100.00 | 51.72 | 0.00 | 62.93 |
| lufthansa | 90.14 | 72.91 | 56.83 | 21.53 | 60.35 |
| mediamarkt | 82.58 | 82.58 | 74.59 | 2.48 | 60.56 |
| meta | 88.64 | 76.84 | 25.74 | 0.00 | 47.81 |
| nuance | 100.00 | 100.00 | 85.15 | 24.09 | 77.31 |
| oakley | 91.26 | 91.26 | 71.29 | 12.97 | 66.70 |
| ray_ban | 100.00 | 69.41 | 67.33 | 0.00 | 59.18 |
| **mAP** | **93.23** | **84.71** | **61.81** | **8.72** | **62.12** |

```
pycocotools summary (using our custom IoU thresholds):
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.847
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.605
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.680
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
```


</details>

## Files in this repo

| Path | Purpose |
|---|---|
| `checkpoint_best_ema.pth` | RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) |
| `stage1_gallery/embeddings.pt` | Per-class OWLv2 mean image embedding (Stage 1) |
| `stage1_gallery/metadata.json` | Stage 1 gallery metadata (classes, exemplar counts) |
| `README.md` | This model card |