---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- custom-indian-invoice-dataset
language:
- en
- hi
- ta
- ml
- te
- kn
- bn
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- qwen2.5-vl
- vision-language-model
- invoice-extraction
- document-understanding
- ocr
- indian-invoices
- gst
- lora
- peft
- unsloth
- fine-tuned
---

---

# Qwen2.5-VL 7B — Indian Invoice Extraction

Fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) specialized for extracting structured JSON from Indian GST invoices (B2B, B2C, export, IRN/ACK, multi-layout). Trained with QLoRA + Unsloth on an NVIDIA A100 80 GB. Merged via PEFT merge_and_unload().

---

## Available Versions

| Version | Link | Use case |
|---|---|---|
| Merged bfloat16 | [gouri100/Unsloth_Qwen-2.5_7B-Invoice-962](https://huggingface.co/gouri100/Unsloth_Qwen-2.5_7B-Invoice-962) | Full precision inference |
| GGUF Q4_K_M | [gouri100/Unsloth_Qwen-2.5_7B-Invoice-962-GGUF](https://huggingface.co/gouri100/Unsloth_Qwen-2.5_7B-Invoice-962-GGUF) | llama.cpp / Ollama — light GPU |
| GGUF Q8_0 | [gouri100/Unsloth_Qwen-2.5_7B-Invoice-962-GGUF](https://huggingface.co/gouri100/Unsloth_Qwen-2.5_7B-Invoice-962-GGUF) | llama.cpp / Ollama — higher quality |

---

## Model Summary

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-VL-7B-Instruct |
| **Fine-tuning method** | QLoRA (r=64, alpha=128) |
| **Merge method** | PEFT merge_and_unload() — bfloat16 safetensors |
| **Framework** | Unsloth + TRL SFTTrainer |
| **Hardware** | NVIDIA A100 80 GB |
| **Task** | Invoice image to Structured JSON |
| **Input types** | JPG, PNG, PDF (page 1 at 200 DPI) |
| **Languages** | English, Hindi, Tamil, Malayalam, Telugu, Kannada, Bengali |
| **License** | Apache 2.0 |

---

## Training Dataset

| Property | Value |
|---|---|
| **Total samples** | 962 |
| **File types** | JPG, PNG, PDF |
| **PDF handling** | Page 1 extracted at 200 DPI, resized to max 1280px |
| **Invoice types** | B2B GST, B2C, Export, IRN/ACK |
| **Annotation** | Manually labeled JSON per invoice |

---

## Output JSON Schema

```json
{
  "metadata": {
    "invoice_no": "string",
    "invoice_date": "YYYY-MM-DD",
    "irn": "string | null",
    "ack_no": "string | null",
    "ack_date": "string | null"
  },
  "supplier": {
    "name": "string",
    "gstin": "string",
    "address": "string",
    "state_code": "string"
  },
  "buyer": {
    "name": "string",
    "gstin": "string",
    "address": "string",
    "state_code": "string"
  },
  "line_items": [{
    "sl_no": "number",
    "description": "string",
    "hsn_sac": "string",
    "qty": "number",
    "unit": "string",
    "rate": "number",
    "amount": "number"
  }],
  "tax": {
    "taxable_value": "number",
    "cgst_rate": "number",
    "cgst_amount": "number",
    "sgst_rate": "number",
    "sgst_amount": "number",
    "igst_rate": "number",
    "igst_amount": "number",
    "total_tax": "number",
    "grand_total": "number",
    "round_off": "number"
  }
}
```

---

## Training Configuration

| Hyperparameter | Value |
|---|---|
| **Epochs** | 3 |
| **Learning rate** | 0.0002 |
| **LR scheduler** | Cosine |
| **Warmup ratio** | 0.05 |
| **Per device batch size** | 2 |
| **Gradient accumulation steps** | 8 |
| **Effective batch size** | 16 |
| **Max sequence length** | 2048 |
| **Precision** | bfloat16 |
| **LoRA rank (r)** | 64 |
| **LoRA alpha** | 128 |
| **LoRA dropout** | 0.05 |
| **LoRA target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Vision layers fine-tuned** | Yes |
| **Gradient checkpointing** | Unsloth optimized |

---

## Training Results

| Metric | Value |
|---|---|
| **Final training loss** | 0.2594 |
| **Total steps** | N/A |
| **Training time** | 2243.16s (37.4 min) |
| **Steps per second** | 0.082 |

---

## Inference

### With transformers (merged model)

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch, json

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "gouri100/Unsloth_Qwen-2.5_7B-Invoice-962",
    torch_dtype = torch.bfloat16,
    device_map  = 'auto',
)
processor = AutoProcessor.from_pretrained("gouri100/Unsloth_Qwen-2.5_7B-Invoice-962")

image = Image.open('invoice.jpg').convert('RGB')

SYSTEM_PROMPT = (
    'You are an expert system for extracting structured data from invoices. '
    'Return ONLY valid JSON. Do NOT include explanations or extra text.'
)

messages = [
    {'role': 'system', 'content': [{'type': 'text', 'text': SYSTEM_PROMPT}]},
    {'role': 'user', 'content': [
        {'type': 'image', 'image': image},
        {'type': 'text',  'text': 'Extract structured invoice data as JSON.'}
    ]}
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt = True,
    tokenize              = True,
    return_tensors        = 'pt',
    return_dict           = True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 1024,
        temperature    = 0.1,
        do_sample      = False,
    )

decoded = processor.decode(
    output_ids[0][inputs['input_ids'].shape[1]:],
    skip_special_tokens = True,
)
result = json.loads(decoded)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

### Load in 4-bit (lighter GPUs)

```python
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit              = True,
    bnb_4bit_compute_dtype    = torch.bfloat16,
    bnb_4bit_quant_type       = 'nf4',
    bnb_4bit_use_double_quant = True,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "gouri100/Unsloth_Qwen-2.5_7B-Invoice-962",
    quantization_config = bnb_config,
    device_map          = 'auto',
)
```

### From PDF

```python
from pdf2image import convert_from_path
pages = convert_from_path('invoice.pdf', dpi=200)
image = pages[0]
# then follow inference code above
```

### With Ollama (GGUF)

```bash
ollama run gouri100/Unsloth_Qwen-2.5_7B-Invoice-962-GGUF
```

---

## Limitations

- Optimized for Indian GST invoice formats — may underperform on foreign layouts
- Scans below 100 DPI or heavily skewed images reduce accuracy
- Handwritten invoices are not supported
- Multi-page invoices: only page 1 was used during training
- Always validate extracted JSON against your business logic before use

---

## Citation

```bibtex
@misc{qwen2.5-vl-7b-indian-invoice,
  title        = {Qwen2.5-VL-7B Fine-tuned for Indian Invoice Extraction},
  author       = {Your Name},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/gouri100/Unsloth_Qwen-2.5_7B-Invoice-962}}
}
```

*Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) · Merged with [PEFT](https://github.com/huggingface/peft) · Trained on NVIDIA A100 80 GB*