---
license: mit
tags:
- vision-language
- visual-question-answering
- VQA
- florence
- microsoft
- inclusive-ai
datasets:
- vizwiz
language:
- en
library_name: transformers
pipeline_tag: visual-question-answering
model-index:
- name: florence-2-vqa-vizwiz
  results:
  - task:
      type: visual-question-answering
      name: Visual Question Answering
    dataset:
      name: VizWiz
      type: vizwiz
    metrics:
    - name: Accuracy
      type: accuracy
      value: 58.21
    - name: BLEU-1
      type: bleu
      value: 0.6386
---

# Florence-2 VQA (Fine-tuned on VizWiz)

Florence-2 is a vision-language model (VLM) developed by Microsoft that handles a wide range of multimodal tasks, including Visual Question Answering (VQA). This version has been **fine-tuned on the VizWiz dataset**, which contains real-world visual questions submitted by blind and low-vision users.

## Model Description

- **Model Type**: Vision-Language Transformer (Florence-2)
- **Architecture**: DaViT image encoder + transformer encoder-decoder for text generation
- **Pretrained by**: Microsoft
- **Fine-tuned on**: VizWiz VQA dataset
- **Framework**: PyTorch + Hugging Face Transformers

## Intended Uses

This model is specifically optimised for **inclusive AI applications**, such as assistive technology for visually impaired users. Given an image and a natural language question, the model predicts a textual answer.

### Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig
from PIL import Image
import requests

# Choose device (GPU preferred if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and processor from Hugging Face Hub
model_id = "Zagarsuren/florence2-finetuned-vizwiz"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Ensure compatibility with the Florence-2 vision encoder (DaViT)
if getattr(config.vision_config, "model_type", None) != "davit":
    config.vision_config.model_type = "davit"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, trust_remote_code=True
).to(device)
model.eval()  # inference only

# Prepare inputs
task_prompt = "Question answering: "
text_input = "What is written on the sign?"
image_url = "https://example.com/sample.jpg"  # placeholder; replace with a real image URL

# Download the image, failing early on HTTP errors or a stalled connection
response = requests.get(image_url, stream=True, timeout=30)
response.raise_for_status()
image = Image.open(response.raw).convert("RGB")

# Preprocess
inputs = processor(
    text=task_prompt + text_input, images=image, return_tensors="pt"
).to(device)

# Generate (no gradients are needed at inference time)
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

# Decode and post-process
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
answer = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)

# With the upstream Florence-2 processor, the result is typically a dict keyed by `task`
print("Answer:", answer)
```

## Evaluation Results

| Metric                      | Score   |
| --------------------------- | ------- |
| Accuracy                    | 58.21%  |
| BLEU-1                      | 0.6386  |
| Average response time (CPU) | ~10.3 s |

The model was benchmarked on a sample of the VizWiz dataset. It performs well across question categories, including **yes/no**, **number**, **other**, and **unanswerable**.

## Limitations

* Not optimised for synthetic datasets such as CLEVR.
* Computationally heavy: real-time inference requires a GPU (CPU responses averaged ~10.3 s).
* May produce hallucinated answers if the question is ambiguous or the image is of low quality.

## Citation

If you use this model, please cite:

```
@misc{sukhbaatar2025visionaid,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/florence-2-finetuned-vizwiz}
}
```

## License

MIT License
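
## Appendix: Metric Sketch

This card does not say how the accuracy and BLEU-1 figures in the evaluation results were computed. As a rough illustration only, the sketch below assumes the standard VQA-style accuracy (an answer counts as fully correct when at least three annotators gave it) and a single-reference unigram BLEU with a brevity penalty. The functions `normalize`, `vqa_accuracy`, and `bleu1` are hypothetical helpers, not part of this model or of Transformers, and the normalization here is simpler than what VQA evaluation scripts typically apply.

```python
import math
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """VQA-style accuracy: min(number of annotators who gave this answer / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(1 for answer in human_answers if normalize(answer) == pred)
    return min(matches / 3.0, 1.0)


def bleu1(prediction: str, reference: str) -> float:
    """Unigram BLEU against one reference: clipped precision times brevity penalty."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    # Each predicted token is credited at most as often as it occurs in the reference
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(pred_tokens).items())
    precision = clipped / len(pred_tokens)
    c, r = len(pred_tokens), len(ref_tokens)
    # Penalize predictions that are shorter than the reference
    brevity_penalty = 1.0 if c >= r else math.exp(1.0 - r / c)
    return brevity_penalty * precision


if __name__ == "__main__":
    # Illustrative, made-up annotations (not taken from VizWiz)
    humans = ["stop"] * 2 + ["stop sign"] * 8
    print(vqa_accuracy("stop", humans))
    print(bleu1("stop", "stop sign"))
```

Averaging `vqa_accuracy` over all evaluated questions gives a dataset-level score comparable in spirit to the accuracy reported above. The exact figures depend on the normalization and reference-selection choices, which are assumptions here.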