qwen3-vl-32b-soccer-v11-fp8

LoRA-merged FP8 quantized variant of Qwen3-VL-32B for soccer event classification.

This is the production checkpoint that powers the dual-pass event detector in the soccer-video-pipeline project. It's a single ~34 GB artifact you can serve directly with vLLM — no separate base + adapter merge step needed.

What it does

Given a short window of soccer match frames (4-8 frames sampled at 1 Hz over a 5-10 second clip), the model classifies the event happening in the window as one of:

goal
shot_on_target
free_kick_shot
catch
shot_stop_diving, shot_stop_standing
corner_kick, goal_kick, throw_in
kickoff_restart, active_play, idle (auxiliary states)

The model was fine-tuned to suppress some noisy auxiliary labels (notably kickoff_restart) for cleaner downstream event classification. For detecting kickoff restarts (used in the goal-recall pipeline), use the base Qwen/Qwen3-VL-32B-Instruct-FP8 instead — see the architecture doc in the GitHub repo.

How to serve

vLLM 0.19.1 is the only known-working version. Newer vLLM releases silently break this checkpoint (garbage token output). Pin it.

pip install vllm==0.19.1

vllm serve acatorcini/qwen3-vl-32b-soccer-v11-fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 16 \
  --port 8000 \
  --host 0.0.0.0 \
  --dtype auto \
  --served-model-name qwen3-vl-32b \
  --quantization compressed-tensors

--quantization compressed-tensors is required (this is a LoRA-merged FP8 checkpoint). Using --quantization fp8 will fail. Conversely, the base model is served with --quantization fp8 — don't mix them up.

Hardware

Minimum: 2× RTX 3090 / 4090 over NVLink (48 GB VRAM total), tensor-parallel 2
Single GPU: needs ≥40 GB VRAM (A100, H100)

Training data

Custom-curated set of ~10,000 short soccer event clips with manual labels, drawn from amateur and youth-level matches (1080p, sideline camera at ~50m). Multiple games, multiple venues, varied lighting. Training data is not redistributed.

Intended use

Personal soccer analytics, research on amateur sports video understanding, component of the open-source soccer-video-pipeline system. Not intended for professional broadcast use.

Limitations

The model's ViT cannot reliably distinguish the ball at >50m camera distance (the ball is 3-5 px). This affects raw goal recall — the upstream pipeline compensates with a kickoff-restart ensemble.
Performance degrades on dramatically different camera framings than the training corpus (e.g., behind-goal cameras, drone footage).
Trained on English commentary / labels only.

License

Inherits the Qwen3-VL Tongyi Qianwen License. Commercial use is permitted for products with <100M MAU; otherwise license terms apply. Read the linked license for the authoritative terms.