Qwen2.5-VL floor plan GRPO adapter (stage 2)

Hub: mudasir13cs/qwen25-vl-3b-floorplan-grpo

Improved using Qwen. This is a LoRA adapter that continues from mudasir13cs/qwen25-vl-3b-floorplan-sft and is trained with GRPO and geometric rewards. Base checkpoint: Qwen2.5-VL-3B-Instruct (LICENSE).

Intended for non-commercial research use, consistent with the CubiCasa5K license (CC BY-NC 4.0) and the Qwen research license.

Paper & upstream training material

Method: FloorplanVLM (arXiv:2602.06507)
Original Hub collection: manitocross/floorplan-vlm-training (README, CubiCasa5K wiring, GRPO overview, JSON schema).
GRPO recipe (canonical source files on Hub): train_floorplan_grpo.py. The reward is R = 0.1·R_val + 0.5·R_ext + α·0.4·R_int; SFT_MODEL_ID, HUB_MODEL_ID, and OUTPUT_DIR are set at the top of that script.
SFT stage (prompts, dataset → JSON targets): train_floorplan_vlm.py.
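
The weighted sum in the GRPO recipe can be sketched in a few lines. This is only an illustration of the formula as written above, not the code in train_floorplan_grpo.py: the function name `combined_reward` is hypothetical, the three component rewards are taken as already-computed numbers, and the default `alpha=1.0` is an assumption, since this card does not say how α is set.

```python
def combined_reward(r_val: float, r_ext: float, r_int: float, alpha: float = 1.0) -> float:
    """Weighted sum R = 0.1*R_val + 0.5*R_ext + alpha*0.4*R_int.

    The component rewards are assumed to be computed elsewhere; alpha is an
    assumed scalar multiplier on the R_int term (its schedule is not given here).
    """
    return 0.1 * r_val + 0.5 * r_ext + alpha * 0.4 * r_int


# With alpha = 1 the three weights sum to 1.
print(round(combined_reward(1.0, 1.0, 1.0), 6))
# With alpha = 0 the R_int term drops out of the sum.
print(round(combined_reward(1.0, 1.0, 1.0, alpha=0.0), 6))
```

Read this way, R_ext carries the largest fixed weight, and α only scales how much R_int contributes.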

If you are working inside a clone of your training repo, the same files may exist locally beside this folder (relative paths).

Quick install

pip install torch torchvision transformers trl peft accelerate pillow

(Training also uses Shapely, datasets, numpy; inference does not strictly need Shapely.)
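
To confirm that the install succeeded before loading any weights, a small check along these lines may help. It is a sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list holds the import names for the packages in the pip command above (pillow is imported as `PIL`).

```python
from importlib.util import find_spec

# Import names for the packages in the pip command above (pillow imports as PIL).
INFERENCE_MODULES = ["torch", "torchvision", "transformers", "trl", "peft", "accelerate", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(INFERENCE_MODULES)
print("missing:", missing if missing else "none")
```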

Loading the adapter

Use Hub IDs (recommended):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen2.5-VL-3B-Instruct"
ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-grpo"

processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

If you cloned weights into this folder locally: set ADAPTER = "./floorplan-vlm-grpo" (or an absolute path) instead of the Hub repo id.
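
The choice between a local folder and the Hub ID can also be made at run time. The sketch below assumes the local folder was produced by PEFT, which writes an adapter_config.json next to the adapter weights; the helper name `resolve_adapter` is hypothetical.

```python
from pathlib import Path


def resolve_adapter(local_dir: str, hub_id: str) -> str:
    """Return local_dir if it looks like a saved PEFT adapter, otherwise hub_id."""
    # A saved PEFT adapter directory contains adapter_config.json.
    if (Path(local_dir) / "adapter_config.json").is_file():
        return local_dir
    return hub_id


ADAPTER = resolve_adapter("./floorplan-vlm-grpo", "mudasir13cs/qwen25-vl-3b-floorplan-grpo")
print(ADAPTER)
```

The returned string can be passed to PeftModel.from_pretrained in place of the hard-coded ADAPTER value.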

Using the model (inference)

For best alignment with training, reuse the system prompt that includes the full JSON schema from stage 1 (see SYSTEM_PROMPT in train_floorplan_vlm.py). GRPO training uses a shorter system string in train_floorplan_grpo.py; either works, but the schema-explicit SFT prompt below usually yields more stable JSON.

User text (identical in both scripts): "Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms."

Minimal pattern (aligned with the inference test in train_floorplan_vlm.py):

import json, re, torch
from PIL import Image

# Same strings as train_floorplan_vlm.py (schema-in-the-prompt; recommended for decoding).
SYSTEM_PROMPT = (
    "You are a floor plan vectorization expert. Extract wall, door, window geometry "
    "from floor plan images into structured JSON.\n\n"
    "Output ONLY valid JSON with this schema:\n"
    '{"walls":[{"id":"wall_N","start":[x,y],"end":[x,y],"thickness":T,"curvature":0,'
    '"openings":[{"type":"door"|"window","center":D,"width":W}]}],'
    '"rooms":[{"label":"room_type","walls":["wall_N",...]}]}\n\n'
    "Coordinates normalized so longer image edge = 1024."
)
USER_PROMPT = "Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms."

image = Image.open("plan.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": USER_PROMPT}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

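# `inputs` is a plain dict after the device move above, so index it by key.
prompt_len = inputs["input_ids"].shape[1]
raw = processor.batch_decode(out[:, prompt_len:], skip_special_tokens=True)[0]
# Parse the outermost {...} span; plan stays None if nothing parseable is found.
m = re.search(r"\{[\s\S]*\}", raw)
try:
    plan = json.loads(m.group()) if m else None
except json.JSONDecodeError:
    plan = None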
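
The system prompt states that coordinates are normalized so the longer image edge equals 1024. A minimal sketch of mapping wall endpoints back to pixel coordinates follows, assuming a single uniform scale factor and the same origin as the image. Neither assumption is confirmed by this card, and the function name `wall_endpoints_to_pixels` is hypothetical. Thickness and opening fields are left untouched because their units are not stated here.

```python
def wall_endpoints_to_pixels(plan: dict, image_size: tuple, norm_edge: float = 1024.0) -> list:
    """Map each wall's start/end from normalized units to pixel coordinates."""
    width, height = image_size
    # Assumption: one uniform scale, chosen so the longer edge maps to norm_edge.
    scale = max(width, height) / norm_edge
    segments = []
    for wall in plan.get("walls", []):
        x0, y0 = wall["start"]
        x1, y1 = wall["end"]
        segments.append(((x0 * scale, y0 * scale), (x1 * scale, y1 * scale)))
    return segments


example = {"walls": [{"id": "wall_1", "start": [0, 0], "end": [1024, 512]}]}
print(wall_endpoints_to_pixels(example, (2048, 1000)))
```

With the inference snippet above, `image.size` can be passed as `image_size`, since PIL reports it as (width, height).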

Output shape: top-level walls (with optional openings) and rooms. Example JSON shape is spelled out under Output JSON Schema in the manitocross training README.
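
Because the model output is free text, it can help to check a parsed result against the shape described above before using it. The sketch below checks only the fields named in the system prompt. The function name `check_plan` and the choice of `id`, `start`, and `end` as required wall keys are assumptions, not part of the training scripts.

```python
def check_plan(plan: dict) -> list:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    walls = plan.get("walls")
    rooms = plan.get("rooms")
    if not isinstance(walls, list):
        problems.append("'walls' is missing or not a list")
        walls = []
    if not isinstance(rooms, list):
        problems.append("'rooms' is missing or not a list")
        rooms = []

    wall_ids = set()
    for i, wall in enumerate(walls):
        if not isinstance(wall, dict):
            problems.append(f"walls[{i}] is not an object")
            continue
        for key in ("id", "start", "end"):
            if key not in wall:
                problems.append(f"walls[{i}] lacks '{key}'")
        if isinstance(wall.get("id"), str):
            wall_ids.add(wall["id"])
        # Openings are optional; when present, type must be door or window.
        for j, opening in enumerate(wall.get("openings") or []):
            if not isinstance(opening, dict) or opening.get("type") not in ("door", "window"):
                problems.append(f"walls[{i}].openings[{j}] has an unexpected type")

    # Every wall id listed by a room should exist in the walls list.
    for i, room in enumerate(rooms):
        if not isinstance(room, dict):
            problems.append(f"rooms[{i}] is not an object")
            continue
        for ref in room.get("walls") or []:
            if ref not in wall_ids:
                problems.append(f"rooms[{i}] references unknown wall {ref!r}")
    return problems


sample = {
    "walls": [{"id": "wall_1", "start": [0, 0], "end": [10, 0], "thickness": 5,
               "curvature": 0, "openings": [{"type": "door", "center": 5, "width": 2}]}],
    "rooms": [{"label": "kitchen", "walls": ["wall_1", "wall_2"]}],
}
for problem in check_plan(sample):
    print(problem)
```

Call it only when parsing succeeded, that is, when `plan` from the inference snippet is not None.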

Reproducing stage 2

Finish or download stage 1: mudasir13cs/qwen25-vl-3b-floorplan-sft.
Run train_floorplan_grpo.py from a checkout with the environment described there: CubiCasa5K under ./cubicasa_data, and huggingface-cli login if PUSH_TO_HUB = True. The config block is at the top of the file.

Citation

@article{floorplanvlm2026,
  title={FloorplanVLM: A Vision-Language Model for Floorplan Vectorization},
  journal={arXiv preprint arXiv:2602.06507},
  year={2026}
}

Acknowledgments

FloorplanVLM (arXiv:2602.06507)
CubiCasa5K (arXiv:1904.01920)
Qwen2.5-VL-3B-Instruct
Upstream training reference: manitocross/floorplan-vlm-training
Stage 1 adapter: mudasir13cs/qwen25-vl-3b-floorplan-sft

Author / contact

Mudasir: multimodal AI, VLM fine-tuning, retrieval/RAG research, and engineering; MS AI Convergence, Soongsil University (숭실대학교), Seoul. More credentials, publications, and projects: mudasir13cs.github.io

Hugging Face: @mudasir13cs
GitHub: @mudasir13cs
Email: mudasir13cs@gmail.com
