How to use mmrech/pitvqa-qwen2vl-merged with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="mmrech/pitvqa-qwen2vl-merged")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM
processor = AutoProcessor.from_pretrained("mmrech/pitvqa-qwen2vl-merged")
model = AutoModelForMultimodalLM.from_pretrained("mmrech/pitvqa-qwen2vl-merged")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
How to use mmrech/pitvqa-qwen2vl-merged with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mmrech/pitvqa-qwen2vl-merged"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "mmrech/pitvqa-qwen2vl-merged",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
docker model run hf.co/mmrech/pitvqa-qwen2vl-merged
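The curl request above can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_payload` and `send_chat_request` are illustrative and not part of vLLM or this model's tooling.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the first choice's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "mmrech/pitvqa-qwen2vl-merged",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running (assumed address: http://localhost:8000):
# print(send_chat_request(payload))
```

For a server on a different port, such as an SGLang server on port 30000, pass `base_url="http://localhost:30000"` to `send_chat_request`.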
How to use mmrech/pitvqa-qwen2vl-merged with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "mmrech/pitvqa-qwen2vl-merged" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "mmrech/pitvqa-qwen2vl-merged",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "mmrech/pitvqa-qwen2vl-merged" \
--host 0.0.0.0 \
--port 30000
How to use mmrech/pitvqa-qwen2vl-merged with Docker Model Runner:
docker model run hf.co/mmrech/pitvqa-qwen2vl-merged
A ready-to-use version of PitVQA with the Stage 4 (Unified) adapter merged into the base weights. No adapter loading required - just load and run.
| Feature | This (Merged) | Adapter Version |
|---|---|---|
| Setup complexity | Simple | Requires PEFT |
| Load time | Faster | Slower |
| Flexibility | Single task mode | Switch adapters |
| Best for | Production deployment | Research/experimentation |
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
# Load merged model directly - no adapter loading needed!
model = Qwen2VLForConditionalGeneration.from_pretrained(
"mmrech/pitvqa-qwen2vl-merged",
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained("mmrech/pitvqa-qwen2vl-merged")
# Run inference
from PIL import Image
image = Image.open("surgical_frame.jpg")
messages = [{"role": "user", "content": [
{"type": "image", "image": image},
{"type": "text", "text": "Point to the suction device in this surgical image."}
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
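output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens so the prompt is not echoed back
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
# Example output: <point x='75.8' y='75.1'>suction device</point>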
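# Optional: 4-bit quantized loading to reduce memory use (requires the bitsandbytes package)
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration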
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"mmrech/pitvqa-qwen2vl-merged",
quantization_config=bnb_config,
device_map="auto"
)
This merged model supports ALL tasks from Stage 4 training:
Prompt: "Point to the suction device in this surgical image."
Output: <point x='75.8' y='75.1'>suction device</point>
Prompt: "Draw a bounding box around the tumor region."
Output: <box x1='30' y1='30' x2='70' y2='70'>tumor region</box>
Prompt: "What surgical phase is shown?"
Output: sellar phase
Supported phases: nasal, sellar, tumor_removal, closure
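Because the example answer is free text ("sellar phase") while the supported labels are identifiers, downstream code may want to map one onto the other. The sketch below is one possible approach; the helper `normalize_phase` and its substring-matching rule are assumptions for illustration and are not part of the model or its training code.

```python
from typing import Optional

# Labels copied from the "Supported phases" line above.
SUPPORTED_PHASES = ("nasal", "sellar", "tumor_removal", "closure")


def normalize_phase(answer: str) -> Optional[str]:
    """Return the supported phase label mentioned in a model answer, or None."""
    # Lowercase and unify separators so "Tumor removal" matches "tumor_removal".
    cleaned = answer.lower().replace("-", "_").replace(" ", "_")
    for phase in SUPPORTED_PHASES:
        if phase in cleaned:
            return phase
    return None


print(normalize_phase("sellar phase"))  # sellar
```

Returning `None` for unrecognized answers lets the caller decide whether to retry, log, or discard the frame.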
Prompt: "What instruments are visible in this surgical scene?"
Output: The image shows a suction device positioned in the surgical field...
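The point and box outputs in the examples above follow a simple tag format, so they can be turned into structured data with a small parser. This is a sketch under stated assumptions: the helper names are hypothetical, attributes are assumed to be single-quoted numbers as in the examples, and `point_to_pixels` assumes coordinates lie on a 0 to 100 scale, which the examples suggest but this card does not confirm.

```python
import re
from typing import Dict, List, Tuple

# Attribute values are single-quoted in the documented examples.
_ATTR = re.compile(r"(\w+)='([^']*)'")
# Matches <point ...>label</point> and <box ...>label</box>.
_TAG = re.compile(r"<(point|box)\s+([^>]*)>(.*?)</\1>", re.DOTALL)


def parse_spatial_output(text: str) -> List[Dict]:
    """Extract every <point> and <box> tag as a dict with kind, label, and float coordinates."""
    results = []
    for kind, attrs, label in _TAG.findall(text):
        coords = {name: float(value) for name, value in _ATTR.findall(attrs)}
        results.append({"kind": kind, "label": label.strip(), **coords})
    return results


def point_to_pixels(
    x: float, y: float, width: int, height: int, scale: float = 100.0
) -> Tuple[int, int]:
    """Map a point to pixel coordinates, ASSUMING values lie on a 0..scale range."""
    return round(x / scale * width), round(y / scale * height)


print(parse_spatial_output("<point x='75.8' y='75.1'>suction device</point>"))
print(parse_spatial_output("<box x1='30' y1='30' x2='70' y2='70'>tumor region</box>"))
```

If the model emits coordinates on a different scale, such as 0 to 1 or raw pixels, adjust or drop the `scale` argument accordingly.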
merge_and_unload()
@misc{pitvqa2026,
title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
author={Matheus Rech},
year={2026},
url={https://huggingface.co/mmrech/pitvqa-qwen2vl-merged}
}
Apache 2.0