# ViX-Ray: Fine-tuned Medical Vision-Language Models

Fine-tuned weights for Vietnamese chest X-ray report generation across 3 clinical tasks and 6 model architectures.

> **Best overall performance**: Qwen2-VL-7B across all 3 tasks.

---

## Tasks

| # | Task | Description |
|---|------|-------------|
| 1 | `finding` | Generate radiology **findings** from a chest X-ray image |
| 2 | `impression` | Generate the clinical **impression** (final diagnosis) from a chest X-ray image |
| 3 | `multi` | **Multi-turn dialogue**: findings → impression via conversation history |

---

## Models

| Key | Base model | Size |
|-----|-----------|------|
| `Intern` | InternVL2.5-1B | 1B |
| `Vintern` | Vintern-1B-v3.5 | 1B |
| `Qwen2B` | Qwen2-VL-2B-Instruct | 2B |
| `Qwen7B` | Qwen2-VL-7B-Instruct ⭐ | 7B |
| `MiniCPM` | MiniCPM-V-2_6 | 8B |
| `LaVy` | LaVy-Instruct | 7B |

---

## Quick Start

### 1. Install

```bash
pip install huggingface_hub transformers torch torchvision pillow
```

For Qwen models, also install:

```bash
pip install qwen-vl-utils
```

For Intern / Vintern models, also install:

```bash
pip install decord
```

For MiniCPM, pin versions:

```bash
pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99 decord
```

---

### 2. Download a model zip

```bash
# task  : finding | impression | multi
# model : Intern | Vintern | Qwen2B | Qwen7B | MiniCPM | LaVy
huggingface-cli download presencesw/ViX-Ray /.zip \
    --repo-type model --local-dir ./
```

Example: download the best model for finding:

```bash
huggingface-cli download presencesw/ViX-Ray finding/Qwen7B.zip \
    --repo-type model --local-dir ./
```

Download all models at once:

```bash
huggingface-cli download presencesw/ViX-Ray \
    --repo-type model --local-dir ./vix_ray_models
```

---

### 3. Unzip

```bash
unzip /.zip -d ./models//
# result: ./models///
```

Or in Python:

```python
import zipfile

with zipfile.ZipFile("/.zip") as zf:
    zf.extractall("./models//")
```

---

### 4. Load & infer

Set `model_path = "./models//"` then use the snippet for your model family.

#### Qwen2-VL (Qwen2B / Qwen7B)

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_path = "./models//"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

#### InternVL / Vintern (Intern / Vintern)

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

model_path = "./models//"

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),
])

pixel_values = transform(Image.open("your_image.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()
response = model.chat(tokenizer, pixel_values, "\nMô tả hình ảnh X-quang ngực này.", dict(max_new_tokens=512, do_sample=True))
print(response)
```

#### MiniCPM-V

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "./models//"

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("your_image.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Mô tả hình ảnh X-quang ngực này."]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```

#### LaVy

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "./models//"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

inputs = processor(
    images=Image.open("your_image.jpg").convert("RGB"),
    text="Mô tả hình ảnh X-quang ngực này.",
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```

---

## Multi-turn (Task 3)

For the `multi` task, pass conversation history between turns:

```python
# Turn 1: findings
response1 = ...  # run inference as above

# Turn 2: impression (append assistant turn then ask)
messages.append({"role": "assistant", "content": [{"type": "text", "text": response1}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Kết luận bệnh gì?"}]})
response2 = ...  # run inference again with updated messages
```

See `readme/_.md` for the full per-model multi-turn example.

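
The history-passing step can be isolated into a small helper, which makes the turn structure easy to check before running a model. The sketch below is only an illustration: `add_follow_up` is a hypothetical name (not part of this repository), the message layout is assumed to match the Qwen2-VL chat format used in the snippet for that family, and `response1` is a placeholder string standing in for real model output.

```python
from copy import deepcopy


def add_follow_up(messages, assistant_reply, follow_up_question):
    """Return a new message list extended with the assistant reply and the next user question.

    Hypothetical helper for illustration; the input list is left unmodified.
    """
    updated = deepcopy(messages)
    updated.append({"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]})
    updated.append({"role": "user", "content": [{"type": "text", "text": follow_up_question}]})
    return updated


# Turn 1 request, in the same layout as the Qwen2-VL snippet.
turn1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            # Vietnamese prompt: "Describe this chest X-ray image."
            {"type": "text", "text": "Mô tả hình ảnh X-quang ngực này."},
        ],
    }
]

# Placeholder for the findings text produced by the model in turn 1.
response1 = "<findings text returned by the model>"

# Vietnamese prompt: "What is the diagnosis?"
turn2 = add_follow_up(turn1, response1, "Kết luận bệnh gì?")
print([m["role"] for m in turn2])  # ['user', 'assistant', 'user']
```

The resulting `turn2` list would then go through the same chat-template and generation calls as a single-turn request. Returning a copy rather than mutating `messages` keeps the turn 1 request available for reuse.
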

---

## Full Model Table

| Task | Model | Base | Zip path |
|------|-------|------|----------|
| finding | Intern | InternVL2.5-1B | `finding/Intern.zip` |
| finding | Vintern | Vintern-1B-v3.5 | `finding/Vintern.zip` |
| finding | Qwen2B | Qwen2-VL-2B | `finding/Qwen2B.zip` |
| finding | Qwen7B ⭐ | Qwen2-VL-7B | `finding/Qwen7B.zip` |
| finding | MiniCPM | MiniCPM-V-2_6 | `finding/MiniCPM.zip` |
| finding | LaVy | LaVy-Instruct | `finding/LaVy.zip` |
| impression | Intern | InternVL2.5-1B | `impression/Intern.zip` |
| impression | Vintern | Vintern-1B-v3.5 | `impression/Vintern.zip` |
| impression | Qwen2B | Qwen2-VL-2B | `impression/Qwen2B.zip` |
| impression | Qwen7B ⭐ | Qwen2-VL-7B | `impression/Qwen7B.zip` |
| impression | MiniCPM | MiniCPM-V-2_6 | `impression/MiniCPM.zip` |
| impression | LaVy | LaVy-Instruct | `impression/LaVy.zip` |
| multi | Intern | InternVL2.5-1B | `multi/Intern.zip` |
| multi | Vintern | Vintern-1B-v3.5 | `multi/Vintern.zip` |
| multi | Qwen2B | Qwen2-VL-2B | `multi/Qwen2B.zip` |
| multi | Qwen7B ⭐ | Qwen2-VL-7B | `multi/Qwen7B.zip` |
| multi | MiniCPM | MiniCPM-V-2_6 | `multi/MiniCPM.zip` |
| multi | LaVy | LaVy-Instruct | `multi/LaVy.zip` |

Per-model details (installation, full inference code) are in `readme/_.md`.

---

## Citation

If you use these models or the ViX-Ray dataset in your research, please cite:

```bibtex
@article{nguyen2026vix,
  title={ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models},
  author={Nguyen, Duy Vu Minh and Truong, Chinh Thanh and Tran, Phuc Hoang and Le, Hung Tuan and Dat, Nguyen Van-Thanh and Pham, Trung Hieu and Van Nguyen, Kiet},
  journal={arXiv preprint arXiv:2603.15513},
  year={2026}
}
```