Chest X-ray Report Generation: Qwen2.5-VL 7B + RAG

Fine-tuned vision-language model that reads a chest X-ray image and generates a short radiology-style findings paragraph. Built by a first-year engineering student using only a free Colab T4 GPU.

⚠️ Research prototype only. Not for medical diagnosis or clinical use.


What it does

Upload a chest X-ray → the system retrieves the most visually similar historical case (CLIP + FAISS) → injects it as context → fine-tuned Qwen2.5-VL 7B generates a findings paragraph.

Example output:

"Cardiomegaly noted; no new consolidations identified in lungs bilaterally compared to prior studies."


How it works

  1. Chest X-ray image uploaded via Gradio interface
  2. CLIP (ViT-B/32) encodes the image into an embedding vector
  3. FAISS searches a pre-built index of 62 X-ray cases for the nearest embedding
  4. The report from the most visually similar historical case is injected as context (RAG)
  5. Fine-tuned Qwen2.5-VL 7B generates the findings paragraph
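
Steps 2 to 4 amount to a nearest-neighbour lookup over image embeddings. The sketch below is a pure-Python stand-in, not the project's code: tiny hand-written vectors take the place of CLIP embeddings, and a brute-force cosine search takes the place of FAISS. The names `retrieve_most_similar`, `case_embeddings`, and `case_reports` are hypothetical, and the reports are placeholders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_most_similar(query_embedding, case_embeddings, case_reports):
    """Return (report, score) for the indexed case closest to the query."""
    query = l2_normalize(query_embedding)
    best_idx, best_score = -1, float("-inf")
    for idx, emb in enumerate(case_embeddings):
        score = sum(q * e for q, e in zip(query, l2_normalize(emb)))
        if score > best_score:
            best_idx, best_score = idx, score
    return case_reports[best_idx], best_score


# Toy 3-dimensional "embeddings": made-up values for illustration only.
case_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
]
case_reports = [
    "placeholder report A",
    "placeholder report B",
    "placeholder report C",
]

query_embedding = [0.2, 0.7, 0.4]
report, score = retrieve_most_similar(query_embedding, case_embeddings, case_reports)
print(report, round(score, 3))
```

With only 62 indexed cases, an exhaustive search like this is cheap; an index library mainly pays off as the case collection grows.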

Model details

Property              Value
Base model            Qwen/Qwen2.5-VL-7B-Instruct
Fine-tuning method    LoRA (r=8, alpha=8)
Quantization          4-bit (bitsandbytes)
Training framework    Unsloth + TRL SFTTrainer
Training data         50 examples from CheXpert-plus-RRG
Training steps        30
Hardware              Google Colab T4 GPU (free tier)
RAG encoder           openai/clip-vit-base-patch32
RAG index size        62 images
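
To make the LoRA settings concrete: an adapter of rank r on a d_out x d_in weight matrix adds r * (d_in + d_out) trainable parameters, and in standard LoRA the update is scaled by alpha / r. The sketch below works this through for r=8, alpha=8. The 3584 x 3584 layer shape is an illustrative assumption, not something stated in this card, and the helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


r, alpha = 8, 8          # LoRA settings listed in the model details
d_in = d_out = 3584      # assumed layer shape, for illustration only

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r)

print("scaling:", lora_scaling(alpha, r))
print("full matrix params:", full_params)
print("LoRA params:", adapter_params)
print("fraction trained:", adapter_params / full_params)
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is part of what makes fine-tuning feasible on a single T4.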

Quick start

from unsloth import FastVisionModel
from PIL import Image
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="mahdisetti/xray-qwen-lora",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

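image = Image.open("your_xray.jpg").convert("RGB")
# thumbnail() resizes in place and keeps the aspect ratio,
# so the longest side ends up at most 512 px.
image.thumbnail((512, 512))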

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Write only one short radiology findings paragraph "
                "under 50 words. Mention the main visible abnormality "
                "and its anatomical location."
            )},
        ],
    }
]

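# Render the chat messages into the model's prompt format.
input_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
# In Unsloth's vision API the object returned as `tokenizer` also handles
# images, so it takes the image and the templated text together.
inputs = tokenizer(
    image, input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to("cuda")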

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=70,
        do_sample=False,
        repetition_penalty=1.3,
        no_repeat_ngram_size=5,
    )

# generate() returns prompt + completion; keep only the newly generated tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
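
The quick-start prompt does not include any retrieved context. One way step 4 of the pipeline could fold a retrieved report into the text prompt is sketched below. The function names `build_rag_prompt` and `build_messages`, the wording of the context preamble, and the 600-character cap are all hypothetical, since this card does not show the actual prompt template used.

```python
BASE_INSTRUCTION = (
    "Write only one short radiology findings paragraph "
    "under 50 words. Mention the main visible abnormality "
    "and its anatomical location."
)


def build_rag_prompt(retrieved_report, instruction=BASE_INSTRUCTION, max_context_chars=600):
    """Prepend a retrieved report to the instruction as reference context."""
    if not retrieved_report or not retrieved_report.strip():
        return instruction  # nothing retrieved: fall back to the plain prompt
    # Collapse whitespace and cap the length so the context stays short.
    context = " ".join(retrieved_report.split())[:max_context_chars]
    return (
        "Reference report from a visually similar prior case "
        "(for context only; describe the current image):\n"
        f"{context}\n\n"
        f"{instruction}"
    )


def build_messages(image, retrieved_report):
    """Assemble the chat message in the same shape as the quick-start example."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": build_rag_prompt(retrieved_report)},
            ],
        }
    ]


# image=None is a placeholder here; pass the PIL image in real use.
messages = build_messages(image=None, retrieved_report="  placeholder   retrieved report  ")
print(messages[0]["content"][1]["text"])
```

Placing the instruction last keeps the task statement closest to the generation point; whether that ordering helps this particular model is untested here.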

RAG files

The FAISS index and report pickle are stored at mahdisetti/xray-rag-files.


Known limitations

  • Trained on only 50 examples; outputs should not be trusted clinically
  • Occasional hallucinations on ambiguous scans
  • Sensitive to prompt wording
  • May mislocalize findings (e.g. left vs right)
  • No formal evaluation metrics computed (BLEU/ROUGE planned)
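
As a starting point for the planned evaluation, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) between a generated and a reference paragraph. It is a minimal illustration with a basic regex tokenizer, not the official ROUGE implementation, and the two example strings are made up for the demonstration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 with clipped counts (simplified ROUGE-1)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = "Cardiomegaly noted; no new consolidations."
reference = "Cardiomegaly is noted. No consolidation."
print(round(rouge1_f1(generated, reference), 3))
```

Note that "consolidations" and "consolidation" do not match here because no stemming is applied, and that word-overlap scores do not check clinical correctness (for example, left vs right errors can still score highly).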

Dataset used

X-iZhang/CheXpert-plus-RRG


Author

Mahdi Setti, first-year engineering student. Built as a personal learning project to explore fine-tuning, multimodal models, and RAG in a real-world domain.
