---
license: mit
tags:
  - vision-language
  - visual-question-answering
  - VQA
  - florence
  - microsoft
  - inclusive-ai
datasets:
  - vizwiz
language:
  - en
library_name: transformers
pipeline_tag: visual-question-answering
model-index:
  - name: florence-2-vqa-vizwiz
    results:
      - task:
          type: visual-question-answering
          name: Visual Question Answering
        dataset:
          name: VizWiz
          type: vizwiz
        metrics:
          - name: Accuracy
            type: accuracy
            value: 58.21
          - name: BLEU-1
            type: bleu
            value: 0.6386
---

# Florence-2 VQA (Fine-tuned on VizWiz)

Florence-2 is a state-of-the-art Vision-Language Model (VLM) developed by Microsoft, designed to perform a wide range of multimodal tasks, including Visual Question Answering (VQA). This version has been **fine-tuned on the VizWiz dataset**, which contains real-world visual questions submitted by blind and low-vision users.

## Model Description

- **Model Type**: Vision-Language Transformer (Florence-2)
- **Architecture**: Image encoder (DaViT) + transformer encoder-decoder for text generation
- **Pretrained by**: Microsoft
- **Fine-tuned on**: VizWiz VQA dataset
- **Framework**: PyTorch + Hugging Face Transformers

## Intended Uses

This model is specifically optimised for **inclusive AI applications**, such as assistive technology for visually impaired users. Given an image and a natural language question, the model predicts a textual answer.

### Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig
from PIL import Image
import requests

# Choose device (GPU preferred if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and processor from Hugging Face Hub
model_id = "Zagarsuren/florence2-finetuned-vizwiz"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Ensure compatibility with Florence-2 vision encoder
if getattr(config.vision_config, "model_type", None) != "davit":
    config.vision_config.model_type = "davit"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True).to(device)

# Prepare inputs
task_prompt = "Question answering: "
text_input = "What is written on the sign?"
# Placeholder URL: replace with the address of a real image
image_url = "https://example.com/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Preprocess
inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to(device)

# Generate
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)

# Decode and post-process
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
answer = processor.post_process_generation(
    generated_text,
    task=task_prompt,
    image_size=(image.width, image.height)
)

# For Florence-2 processors the result is typically a dict keyed by the task string
print("Answer:", answer)
```

## Evaluation Results

| Metric                   | Score   |
| ------------------------ | ------- |
| Accuracy                 | 58.21%  |
| BLEU-1                   | 0.6386  |
| Avg. response time (CPU) | ~10.3 s |

The model was benchmarked on the VizWiz sample dataset, which covers the **yes/no**, **number**, **other**, and **unanswerable** question categories.
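The card does not state which accuracy definition or BLEU implementation produced the scores above. As a rough illustration only, the sketch below assumes the simplified VQA accuracy commonly used with crowd-sourced answers (a prediction counts as fully correct when at least three annotators gave it) and a whitespace-tokenised BLEU-1. The function names `vqa_accuracy` and `bleu1` are hypothetical and are not part of this repository.

```python
import math
from collections import Counter


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def bleu1(prediction: str, references: list[str]) -> float:
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    pred_tokens = prediction.lower().split()
    ref_token_lists = [r.lower().split() for r in references]
    if not pred_tokens or not ref_token_lists:
        return 0.0

    # Clip each predicted unigram count by its maximum count in any reference.
    max_ref_counts = Counter()
    for ref in ref_token_lists:
        for tok, cnt in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    pred_counts = Counter(pred_tokens)
    clipped = sum(min(cnt, max_ref_counts[tok]) for tok, cnt in pred_counts.items())
    precision = clipped / len(pred_tokens)

    # Brevity penalty uses the reference length closest to the prediction length.
    pred_len = len(pred_tokens)
    ref_len = min((len(r) for r in ref_token_lists),
                  key=lambda n: (abs(n - pred_len), n))
    bp = 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)
    return precision * bp


if __name__ == "__main__":
    answers = ["stop", "stop", "stop sign", "stop"]
    print("accuracy:", vqa_accuracy("Stop", answers))
    print("BLEU-1:", bleu1("stop sign", answers))
```

Official evaluation scripts usually normalise answers further (punctuation, articles, number words), so scores from this sketch may differ from the reported figures.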

## Limitations

* Not optimised for synthetic datasets such as CLEVR.
* Computationally heavy: CPU inference averaged about 10.3 s per response in the evaluation above, so real-time use requires a GPU.
* May produce hallucinated answers if the question is ambiguous or the image is of low quality.
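To judge whether inference is fast enough on a given machine, the response time can be measured directly. The helper below is a minimal, model-agnostic sketch; `average_latency` is a hypothetical name, and the warm-up and run counts are arbitrary defaults rather than the settings used for the figure reported above.

```python
import time
from statistics import mean


def average_latency(fn, runs: int = 5, warmup: int = 1) -> float:
    """Return the mean wall-clock time in seconds of calling fn() `runs` times."""
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs the model on one input.
    print(f"{average_latency(lambda: sum(range(100_000))):.6f} s")
```

With the model from the example loaded, `fn` could be a zero-argument function that wraps the `model.generate(...)` call. On a GPU, call `torch.cuda.synchronize()` inside `fn` after generation so that the timer captures the full computation.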

## Citation

If you use this model, please cite:

```
@misc{sukhbaatar2025visionaidvqa,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/florence-2-finetuned-vizwiz}
}
```

## License

MIT License