---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- qwen
- qwen3.5
- vision-language
- handwritten-math
- math-ocr
- latex-ocr
- image-to-text
- sft
- dpo
---

# Qwen3.5-2B-MathParser-pro

## Model Summary

Qwen3.5-2B-MathParser-pro is a compact vision-language model for handwritten mathematical formula OCR. It is optimized to transcribe single-line and multi-line handwritten mathematical expressions into LaTeX.

This 2B release targets lower-memory local deployment. The larger companion release is `Qwen3.5-4B-MathParser-pro`.

## Intended Use

- Handwritten mathematical formula recognition
- Multi-line LaTeX transcription
- OCR for mathematical expressions and derivations
- Research and application prototyping around handwritten math parsing

This model is not intended to be a general mathematical reasoning model. It should be used as an OCR/transcription model.

## Training Recipe

The model follows a two-stage MathParser training recipe:

1. **Stage 1 SFT** builds a stable handwritten mathematical OCR base and teaches direct LaTeX transcription.
2. **Stage 2 DPO v34** trains the model to prefer concise, stable, line-count-faithful transcriptions, reducing malformed outputs, repetition, max-token runaway, and very low-similarity failures.
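
For readers unfamiliar with DPO, the general form of the objective used in preference tuning can be sketched as below. This is the standard DPO loss for a single preference pair, not the training code behind this release; the function name, the `beta` value, and the choice of reference model are assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from this model card
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under a frozen reference model.
    """
    # Implicit rewards: how much more likely each answer became vs. the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as softplus(-logits) for stability.
    return softplus(-logits)
```

The loss falls below `log(2)` once the policy favors the chosen transcription more strongly than the reference does, which is how a preference for concise, non-repetitive outputs would be rewarded.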

The released weights are fully merged model weights, not LoRA adapters.

## Evaluation

Evaluation set: 756 multi-line handwritten mathematical formula samples.

Metrics:

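- **Avg Sim / Median Sim**: mean and median normalized edit similarity between the predicted and reference LaTeX, higher is better.
- **Line Match**: number of samples whose predicted line count exactly matches the ground truth, higher is better.
- **Within +/-1**: number of samples whose predicted line count differs from the ground truth by at most one, higher is better.
- **Runaway**: number of generations that hit the max-token limit or are obviously overlong or repetitive, lower is better.
- **Bad <0.50**: number of samples with similarity below 0.50, lower is better.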
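
The per-sample metrics above can be approximated with a short sketch. The exact normalization and line-splitting rules used for the reported numbers are not specified here, so this version assumes character-level Levenshtein distance divided by the longer string's length, and counts non-empty newline-separated lines; all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def count_lines(latex: str) -> int:
    """Assumed rule: count non-empty newline-separated lines."""
    return sum(1 for line in latex.splitlines() if line.strip())


def line_count_metrics(pred: str, ref: str) -> dict:
    """Per-sample flags behind the Line Match and Within +/-1 counts."""
    diff = abs(count_lines(pred) - count_lines(ref))
    return {"line_match": diff == 0, "within_1": diff <= 1}
```

Under these assumptions, a sample would count toward **Bad <0.50** when `edit_similarity(pred, ref) < 0.50`.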

| Model | Samples | Avg Sim | Median Sim | Line Match | Within +/-1 | Runaway | Bad <0.50 |
|---|---:|---:|---:|---:|---:|---:|---:|
| Qwen3.5-0.8B Base | 756 | 0.544843 | 0.580742 | 149 | 235 | 108 | 262 |
| Qwen3.5-2B Base | 756 | 0.599258 | 0.651649 | 252 | 392 | 19 | 236 |
| Qwen3.5-4B Base | 756 | 0.534456 | 0.541674 | 264 | 368 | 5 | 295 |
| Qwen3.5-2B SFT | 756 | 0.906516 | 0.952732 | 550 | 706 | 13 | 25 |
| Qwen3.5-2B SFT+DPO | 756 | 0.916060 | 0.951464 | 569 | 714 | 3 | 15 |
| Qwen3.5-4B SFT | 756 | 0.942045 | 0.966546 | 612 | 730 | 0 | 2 |
| Qwen3.5-4B SFT+DPO | 756 | 0.942878 | 0.968560 | 611 | 730 | 0 | 1 |

For this release, the main result is:

| Release | Avg Sim | Median Sim | Line Match | Within +/-1 | Runaway | Bad <0.50 |
|---|---:|---:|---:|---:|---:|---:|
| Qwen3.5-2B-MathParser-pro | 0.916060 | 0.951464 | 569 | 714 | 3 | 15 |
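
Because Line Match, Within +/-1, Runaway, and Bad <0.50 are reported as counts out of 756 samples, it can help to view them as rates. The snippet below simply converts the numbers from the table above; the dictionary keys are illustrative names.

```python
TOTAL_SAMPLES = 756

# Counts copied from the Qwen3.5-2B-MathParser-pro row above.
counts = {
    "line_match": 569,
    "within_1": 714,
    "runaway": 3,
    "bad_below_0_50": 15,
}

# Percentage of the evaluation set, rounded to one decimal place.
rates = {name: round(100 * count / TOTAL_SAMPLES, 1) for name, count in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```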

## Figures

![Overall average similarity](figures/overall_avg_similarity.png)

![Error reduction](figures/error_reduction.png)

![Bucket average similarity](figures/bucket_avg_similarity.png)

![Model size quality tradeoff](figures/model_size_quality_tradeoff.png)

## Usage

```python
from PIL import Image
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "sugartai/Qwen3.5-2B-MathParser-pro"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
).eval()

image = Image.open("formula.png").convert("RGB")
messages = [
    {
        "role": "system",
        "content": "You are a handwritten mathematical OCR model. Return only the LaTeX transcription.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the handwritten mathematical formula into LaTeX only."},
        ],
    },
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Treat both EOS and PAD (when defined) as stop tokens.
eos_ids = [processor.tokenizer.eos_token_id]
pad_id = processor.tokenizer.pad_token_id
if pad_id is not None and pad_id not in eos_ids:
    eos_ids.append(pad_id)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1536,
        do_sample=False,
        num_beams=1,
        eos_token_id=eos_ids,
        pad_token_id=pad_id if pad_id is not None else eos_ids[0],
    )

# Drop the prompt tokens and decode only the newly generated LaTeX.
new_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.decode(new_ids[0], skip_special_tokens=True))
```
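
Since a small number of generations can still run away (see the Runaway metric), an application may want a cheap guard before accepting a transcription. The following is a heuristic sketch, not the criterion used in the evaluation; the function name and the `max_repeats` threshold are assumptions.

```python
from collections import Counter


def looks_like_runaway(
    text: str,
    num_new_tokens: int,
    max_new_tokens: int = 1536,  # matches the generation cap in the usage example
    max_repeats: int = 4,        # assumed threshold for repeated lines
) -> bool:
    """Flag outputs that hit the token cap or repeat the same line many times."""
    # Hitting the cap usually means generation never emitted a stop token.
    if num_new_tokens >= max_new_tokens:
        return True
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    # The same non-empty line appearing many times suggests a repetition loop.
    most_common_count = Counter(lines).most_common(1)[0][1]
    return most_common_count >= max_repeats
```

With the usage example above, `num_new_tokens` would be `new_ids.shape[1]` and `text` the decoded string; flagged outputs could be retried with a cropped image or sent for manual review.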

## Limitations

- The model is specialized for handwritten mathematical OCR and LaTeX transcription.
- It is not a general reasoning or theorem-proving model.
- Very noisy images, unusual notation, extreme layout variation, or out-of-distribution handwriting may degrade quality.
- The reported metrics are from an internal 756-sample multi-line handwritten formula evaluation set.

## License

This model is released under Apache 2.0, following the base model license of `Qwen/Qwen3.5-2B`.

## Citation

If you use this model, please cite or link this model page and the Qwen3.5 base model.