Chest X-ray Report Generation — Qwen2.5-VL 7B + RAG

Fine-tuned vision-language model that reads a chest X-ray image and generates a short radiology-style findings paragraph. Built by a first-year engineering student using only a free Colab T4 GPU.

⚠️ Research prototype only — not for medical diagnosis or clinical use.

What it does

Upload a chest X-ray → the system retrieves the most visually similar historical case (CLIP + FAISS) → injects it as context → fine-tuned Qwen2.5-VL 7B generates a findings paragraph.
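A minimal sketch of how the retrieved report might be injected as context, assuming it is simply prepended to the instruction text. The helper name `build_messages` and the wording of the context prefix are hypothetical; this card does not show the prompt template actually used.

```python
# Hypothetical helper: the real RAG prompt template is not shown in this card.
INSTRUCTION = (
    "Write only one short radiology findings paragraph "
    "under 50 words. Mention the main visible abnormality "
    "and its anatomical location."
)


def build_messages(image, retrieved_report=None):
    """Build a chat-style message list, optionally with a retrieved report as context."""
    text = INSTRUCTION
    if retrieved_report:
        # Assumed layout: reference report first, then the instruction.
        text = (
            "Report from a visually similar prior case (for reference only):\n"
            f"{retrieved_report.strip()}\n\n"
            f"{INSTRUCTION}"
        )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": text},
            ],
        }
    ]
```

The returned list has the same shape as the `messages` variable in the quick start, so it can be passed to `tokenizer.apply_chat_template` in the same way.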

Example output:

"Cardiomegaly noted; no new consolidations identified in lungs bilaterally compared to prior studies."

How it works

1. A chest X-ray image is uploaded through the Gradio interface.
2. CLIP (ViT-B/32) encodes the image into an embedding vector.
3. FAISS searches a pre-built index of 62 X-ray cases for the most similar embedding.
4. The report from the most visually similar case is injected as context (RAG).
5. The fine-tuned Qwen2.5-VL 7B generates the findings paragraph.
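The retrieval part of the pipeline above (encode, search, pick the closest case) amounts to a nearest-neighbour search over image embeddings. Here is a dependency-free sketch using cosine similarity on toy vectors. It stands in for the CLIP encoder and the FAISS index, whose actual code is not shown in this card; the names `top_match`, `index_vectors`, and `reports` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query, index_vectors, reports):
    """Return (report, score) for the stored vector most similar to `query`."""
    scores = [cosine(query, v) for v in index_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return reports[best], scores[best]


# Toy 3-dimensional "embeddings"; CLIP ViT-B/32 image embeddings have 512 dimensions.
index_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
reports = ["report A", "report B", "report C"]

report, score = top_match([0.5, 0.9, 0.1], index_vectors, reports)
print(report, round(score, 3))
```

When embeddings are L2-normalised, an inner-product search gives the same ranking as cosine similarity, so an exact inner-product index would be one way to get this behaviour at larger scale.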

Model details

Property	Value
Base model	Qwen/Qwen2.5-VL-7B-Instruct
Fine-tuning method	LoRA (r=8, alpha=8)
Quantization	4-bit (bitsandbytes)
Training framework	Unsloth + TRL SFTTrainer
Training data	50 examples — CheXpert-plus-RRG
Training steps	30 steps
Hardware	Google Colab T4 GPU (free tier)
RAG encoder	openai/clip-vit-base-patch32
RAG index size	62 images
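To give a sense of what `r=8, alpha=8` means, here is a short worked calculation. In standard LoRA, the update to a weight matrix of shape d_out x d_in is expressed through two low-rank factors, which adds r * (d_in + d_out) trainable parameters, and the update is scaled by alpha / r. The 3584 x 3584 layer size below is an illustrative assumption, not a figure taken from this card.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


r, alpha = 8, 8          # values from the table above
d_in = d_out = 3584      # assumed layer size, for illustration only

added = lora_extra_params(d_in, d_out, r)
full = d_in * d_out
print(f"adapter params: {added:,}")
print(f"full matrix:    {full:,}")
print(f"fraction:       {added / full:.4%}")
print(f"scaling:        {lora_scaling(alpha, r)}")
```

With alpha equal to r the scaling factor is 1, so the low-rank update is applied without extra amplification or damping.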

Quick start

from unsloth import FastVisionModel
from PIL import Image
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="mahdisetti/xray-qwen-lora",
    load_in_4bit=True,
)
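# Switch the model to inference mode
FastVisionModel.for_inference(model)

# Load the image and shrink it in place to fit within 512x512 (aspect ratio is kept)
image = Image.open("your_xray.jpg").convert("RGB")
image.thumbnail((512, 512))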

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Write only one short radiology findings paragraph "
                "under 50 words. Mention the main visible abnormality "
                "and its anatomical location."
            )},
        ],
    }
]

input_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
inputs = tokenizer(
    image, input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=70,
        do_sample=False,
        repetition_penalty=1.3,
        no_repeat_ngram_size=5,
    )

# Decode only the newly generated tokens, skipping the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

RAG files

The FAISS index and the pickled reports are stored at mahdisetti/xray-rag-files.

Known limitations

Trained on only 50 examples — outputs should not be trusted clinically
Occasional hallucinations on ambiguous scans
Sensitive to prompt wording
May mislocalize findings (e.g. left vs right)
No formal evaluation metrics computed (BLEU/ROUGE planned)
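Since BLEU/ROUGE evaluation is still planned, here is a simplified sketch of unigram-overlap F1, which is the idea behind ROUGE-1. It is not the official ROUGE implementation (whitespace tokenisation only, no stemming), and the reference sentence below is made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = "cardiomegaly noted no new consolidations identified"
reference = "cardiomegaly is noted with no new consolidations"  # made-up reference
print(round(unigram_f1(generated, reference), 3))
```

A score like this only measures word overlap, so it would not catch a left/right mislocalization that changes a single token; it is best read alongside a manual review.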

Dataset used

X-iZhang/CheXpert-plus-RRG

Author

Mahdi Setti, first-year engineering student. Built as a personal learning project to explore fine-tuning, multimodal models, and RAG in a real-world domain.
