
Qwen2.5-VL floor plan GRPO adapter (stage 2)

Hub: mudasir13cs/qwen25-vl-3b-floorplan-grpo

Improved using Qwen. This is a LoRA adapter that continues from the stage 1 checkpoint mudasir13cs/qwen25-vl-3b-floorplan-sft and is trained with GRPO using geometric rewards. Base checkpoint: Qwen2.5-VL-3B-Instruct (LICENSE).

Intended for non-commercial research use, consistent with CubiCasa5K (CC BY-NC 4.0) and the Qwen research license.

Paper & upstream training material

If you are working inside a clone of the training repo, the same files may also exist locally beside this folder (relative paths).

Quick install

pip install torch torchvision transformers trl peft accelerate pillow

(Training also uses Shapely, datasets, numpy; inference does not strictly need Shapely.)

Loading the adapter

Use Hub IDs (recommended):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen2.5-VL-3B-Instruct"
ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-grpo"

processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

If you cloned weights into this folder locally: set ADAPTER = "./floorplan-vlm-grpo" (or an absolute path) instead of the Hub repo id.
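
One way to switch between the local folder and the Hub ID automatically is a small helper like the sketch below. The function name resolve_adapter is hypothetical (not part of the training scripts), and the check assumes the local folder contains adapter_config.json, which PEFT saves alongside adapter weights.

```python
from pathlib import Path

HUB_ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-grpo"


def resolve_adapter(local_dir: str = "./floorplan-vlm-grpo",
                    hub_id: str = HUB_ADAPTER) -> str:
    """Return the local adapter folder if it looks like a PEFT adapter, else the Hub id.

    Hypothetical helper: it only checks that adapter_config.json is present.
    """
    local_path = Path(local_dir)
    if (local_path / "adapter_config.json").is_file():
        return str(local_path)
    return hub_id


# Pass the result to PeftModel.from_pretrained(model, ADAPTER).
ADAPTER = resolve_adapter()
```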

Using the model (inference)

For best alignment with training, reuse the system prompt that includes the full JSON schema from stage 1 (see SYSTEM_PROMPT in train_floorplan_vlm.py). GRPO training uses a shorter system string in train_floorplan_grpo.py; either works, but the schema-explicit SFT prompt below usually yields more stable JSON.

User text (same as both scripts): “Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms.”

Minimal pattern (aligned with the inference test in train_floorplan_vlm.py):

import json, re, torch
from PIL import Image

# Same strings as train_floorplan_vlm.py (schema-in-the-prompt; recommended for decoding).
SYSTEM_PROMPT = (
    "You are a floor plan vectorization expert. Extract wall, door, window geometry "
    "from floor plan images into structured JSON.\n\n"
    "Output ONLY valid JSON with this schema:\n"
    '{"walls":[{"id":"wall_N","start":[x,y],"end":[x,y],"thickness":T,"curvature":0,'
    '"openings":[{"type":"door"|"window","center":D,"width":W}]}],'
    '"rooms":[{"label":"room_type","walls":["wall_N",...]}]}\n\n'
    "Coordinates normalized so longer image edge = 1024."
)
USER_PROMPT = "Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms."

image = Image.open("plan.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": USER_PROMPT}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# `inputs` is a plain dict after the device move above, so index it by key.
raw = processor.batch_decode(out[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
m = re.search(r"\{[\s\S]*\}", raw)
plan = json.loads(m.group()) if m else None
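
The regex above is greedy: it spans from the first "{" to the last "}", so json.loads raises if the model emits trailing text containing a brace or if generation is cut off at max_new_tokens. A more defensive alternative, sketched here with the standard library only, scans for the first position where a complete JSON object decodes. The function name and the required_key parameter are hypothetical; requiring "walls" is an assumption based on the schema in the system prompt and keeps a truncated reply from returning a nested wall object.

```python
import json
from typing import Optional


def extract_first_json_object(raw: str, required_key: str = "walls") -> Optional[dict]:
    """Return the first decodable JSON object in `raw` that has `required_key`, else None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and required_key in obj:
            return obj
        start = raw.find("{", start + 1)
    return None
```

With this helper, plan = extract_first_json_object(raw) replaces the last two lines of the snippet above and yields None instead of an exception on malformed output.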

Output shape: top-level walls (with optional openings) and rooms. Example JSON shape is spelled out under Output JSON Schema in the manitocross training README.
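
Before using a decoded plan downstream, it can help to check that it matches the shape described in the system prompt and to map coordinates back to image pixels. The sketch below is illustrative only: check_plan and to_pixels are hypothetical names, the checks cover just the fields listed in the schema string, and to_pixels assumes uniform scaling from the top-left origin so that the longer image edge corresponds to 1024.

```python
NORMALIZED_LONG_EDGE = 1024  # "Coordinates normalized so longer image edge = 1024."


def _is_point(value) -> bool:
    """True if value is an [x, y] pair of numbers."""
    return (isinstance(value, list) and len(value) == 2
            and all(isinstance(v, (int, float)) for v in value))


def check_plan(plan: dict) -> list:
    """Return human-readable problems found in a decoded plan (empty list if none)."""
    problems = []
    walls = plan.get("walls")
    rooms = plan.get("rooms")
    if not isinstance(walls, list):
        problems.append("'walls' is missing or not a list")
        walls = []
    if not isinstance(rooms, list):
        problems.append("'rooms' is missing or not a list")
        rooms = []

    wall_ids = set()
    for i, wall in enumerate(walls):
        if not isinstance(wall, dict):
            problems.append(f"walls[{i}] is not an object")
            continue
        wall_ids.add(wall.get("id"))
        for key in ("start", "end"):
            if not _is_point(wall.get(key)):
                problems.append(f"walls[{i}].{key} is not an [x, y] pair")
        for k, opening in enumerate(wall.get("openings") or []):
            if not isinstance(opening, dict) or opening.get("type") not in ("door", "window"):
                problems.append(f"walls[{i}].openings[{k}] has an unknown type")

    for j, room in enumerate(rooms):
        if not isinstance(room, dict):
            problems.append(f"rooms[{j}] is not an object")
            continue
        for wall_id in room.get("walls") or []:
            if wall_id not in wall_ids:
                problems.append(f"rooms[{j}] references unknown wall id {wall_id!r}")
    return problems


def to_pixels(point, image_width: int, image_height: int) -> tuple:
    """Scale a normalized [x, y] point to pixel coordinates (assumes uniform scaling)."""
    scale = max(image_width, image_height) / NORMALIZED_LONG_EDGE
    return (point[0] * scale, point[1] * scale)
```

For the inference snippet above, to_pixels(wall["start"], *image.size) would give a wall endpoint in the pixel space of plan.png under the stated assumption.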

Reproducing stage 2

  1. Finish or download stage 1: mudasir13cs/qwen25-vl-3b-floorplan-sft.
  2. Run train_floorplan_grpo.py from a checkout with the environment described there: CubiCasa5K under ./cubicasa_data, and huggingface-cli login if PUSH_TO_HUB = True. The config block is at the top of the file.
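
For readers new to GRPO: the method samples several completions per prompt, scores each with the reward functions, and weights each completion by how its reward compares with the rest of its group. The sketch below shows only that normalization step in its commonly described form. It is not taken from train_floorplan_grpo.py, the reward values are made-up numbers rather than outputs of the actual geometric rewards, and implementations differ in details such as population versus sample standard deviation and the epsilon used.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions sampled from the same floor plan prompt.
example_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(example_rewards)
# Completions above the group mean get positive advantages, those below get negative ones.
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.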

Citation

@article{floorplanvlm2026,
  title={FloorplanVLM: A Vision-Language Model for Floorplan Vectorization},
  journal={arXiv preprint arXiv:2602.06507},
  year={2026}
}

Acknowledgments

Author / contact

Mudasir: multimodal AI, VLM fine-tuning, retrieval/RAG research, and engineering; MS AI Convergence, Soongsil University (숭실대학교), Seoul. More credentials, publications, and projects: mudasir13cs.github.io
