An OpenVINO implementation of PaddlePaddle/PaddleOCR-VL-1.5 in FP16. This repo includes a standalone script which refactors an openvino_notebooks notebook for PaddleOCR-VL-1.5 to work mostly without the transformers dependency, which makes the model much easier to set up and run.

It's not quantized:

  • vision encoder: fp16
  • language model: fp16

Though the notebook demonstrates int8_asym weight-only quantization, I found the speed gain was not worth the drop in accuracy.
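
For readers unfamiliar with the term, here is a minimal sketch of what asymmetric 8-bit weight quantization does in general. It is plain Python written for illustration, not the notebook's code or any library's implementation; the function names and the sample weights are made up. Each float is mapped to one of 256 codes using a scale and a zero point, so every weight picks up a small rounding error, which is where an accuracy drop can come from.

```python
def quantize_int8_asym(weights):
    """Map floats to codes in 0..255 using a scale and a zero point."""
    lo, hi = min(weights), max(weights)
    span = hi - lo
    scale = span / 255 if span > 0 else 1.0  # avoid dividing by zero on a constant row
    zero_point = round(-lo / scale)          # integer offset so that 0.0 maps near this code
    codes = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weight row, chosen only to show the round trip.
weights = [-0.42, -0.07, 0.0, 0.013, 0.9]
codes, scale, zero_point = quantize_int8_asym(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"step={scale:.5f} max round-trip error={max_error:.5f}")
```

Real weight-only schemes typically quantize per channel or per group rather than per row of five values, so treat this only as a picture of the rounding involved.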

Performance

Not tested on GPU yet.

CPU Xeon Gold 6242:

100dpi png:

image_tokens: 1186
prefill: 1.08s (1095 tok/s)
decode: 56.97s (35 tok/s)
duration: 64.27s
new_tokens: 1968

150dpi png:

image_tokens: 1256
prefill: 1.19s (1051 tok/s)
decode: 58.01s (34 tok/s)
duration: 65.89s
new_tokens: 1968

I got impatient waiting for the transformers implementation and did not want to fight paddlepaddle to test there, so there is no baseline here and the benchmark on its own probably isn't useful. TENSOR_PARALLEL on CPU also did not work (dual-socket machine).
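
As a quick sanity check, the rates above can be recomputed from the token counts and timings. This is a small sketch, not part of the repo's script; the dictionary keys and the `summarize` helper are names made up here, and it assumes the prefill rate counts image tokens only.

```python
# Figures copied from the two runs listed above.
runs = {
    "100dpi": {"image_tokens": 1186, "prefill_s": 1.08, "decode_s": 56.97,
               "duration_s": 64.27, "new_tokens": 1968},
    "150dpi": {"image_tokens": 1256, "prefill_s": 1.19, "decode_s": 58.01,
               "duration_s": 65.89, "new_tokens": 1968},
}


def summarize(run):
    """Recompute throughput and the time not covered by prefill or decode."""
    return {
        # Assumption: the prefill rate counts image tokens only.
        "prefill_tok_s": run["image_tokens"] / run["prefill_s"],
        "decode_tok_s": run["new_tokens"] / run["decode_s"],
        "other_s": run["duration_s"] - run["prefill_s"] - run["decode_s"],
    }


summary = {name: summarize(run) for name, run in runs.items()}
for name, s in summary.items():
    print(f"{name}: prefill {s['prefill_tok_s']:.0f} tok/s, "
          f"decode {s['decode_tok_s']:.0f} tok/s, other {s['other_s']:.2f}s")
```

The decode rates round to the reported 35 and 34 tok/s. The recomputed prefill rates differ from the reported ones by less than 1%, plausibly because the printed times are rounded. Between six and seven seconds per run fall outside prefill and decode; the figures do not say where that time goes.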

Other notes

At 100dpi the model would sometimes fail to stay accurate across the entire image: the OCR task would map image to text accurately for around 3/4 of the width, reading left to right, before having issues on the last column. Based on the paper, using this model alone might be the issue; historically you would use multiple Paddle models together in a pipeline to break the work into smaller chunks and increase accuracy that way.

Spotting task

PaddleOCR-VL-1.5 introduces an element-level recognition task, Text Spotting (Section 2.2.2 in the paper). To activate it, provide the model an image and place only "spotting" in the user message. Baidu trained PaddleOCR-VL to use special tokens, drawing on learned embeddings to understand spatial relationships between elements by marking each element's corners as Top-Left, Top-Right, Bottom-Right, Bottom-Left. Each corner is then pixel-normalized (which works because of image resizing). Taken together, this means we can programmatically rebuild the layout of an analyzed image by parsing the output for these tokens, for example for use with language models.
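
To make the token layout concrete, here is a minimal sketch that reads one line of "spotting" output into four named corners and scales them to a target image size. It assumes each line is the recognized text followed by eight LOC tokens in x, y order for Top-Left, Top-Right, Bottom-Right, Bottom-Left. The names `Quad`, `parse_spotting_line` and `to_pixels` are made up for illustration, the sample line is invented, and `coord_range` (the maximum value of the normalized grid) is left as a required argument because it should be checked against the paper rather than guessed.

```python
import re
from typing import NamedTuple

# Eight LOC tokens in a row: x, y for each of the four corners.
LOC_QUAD = re.compile(r"<\|LOC_(\d+)\|>" * 8)


class Quad(NamedTuple):
    top_left: tuple
    top_right: tuple
    bottom_right: tuple
    bottom_left: tuple


def parse_spotting_line(line):
    """Return (text, Quad) for one output line, or None if it has no LOC quad."""
    match = LOC_QUAD.search(line)
    if match is None:
        return None
    values = [int(v) for v in match.groups()]
    corners = [(values[i], values[i + 1]) for i in range(0, 8, 2)]
    return line[:match.start()].strip(), Quad(*corners)


def to_pixels(quad, width, height, coord_range):
    """Scale normalized corners to a width x height image.

    coord_range is an assumption supplied by the caller, not a value
    taken from the model card.
    """
    return Quad(*[(x * width / coord_range, y * height / coord_range)
                  for x, y in quad])


# Invented sample line, for illustration only.
sample = ("Invoice<|LOC_100|><|LOC_50|><|LOC_300|><|LOC_50|>"
          "<|LOC_300|><|LOC_80|><|LOC_100|><|LOC_80|>")
text, quad = parse_spotting_line(sample)
scaled = to_pixels(quad, width=2000, height=1000, coord_range=1000)
print(text, quad.top_left, scaled.top_left)
```

Keeping the corners named rather than collapsing them to a bounding box preserves rotation and skew, which a later step can use or ignore.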

Here's a snippet which builds an HTML page from the output of the "spotting" task:

#!/usr/bin/env python3
"""
Reconstruct the image layout from a LOC-tagged text file into an HTML page.
Each text block is placed independently at its exact bounding box coordinates.
"""

import re
import sys

def parse_line(line: str):
    """Parse a single line and return (text, bbox) or None."""
    # Eight consecutive LOC tokens: x, y for each of the four corners
    pattern = r"<\|LOC_(\d+)\|>" * 8
    match = re.search(pattern, line)
    if not match:
        return None
    text_end = match.start()
    text = line[:text_end].strip()
    coords = tuple(int(match.group(i)) for i in range(1, 9))
    return text, coords

def generate_html(blocks, output_file, scale=1.0, margin=20):
    """
    Generate HTML with each block placed independently at its exact coordinates.
    Uses the bounding box to set width and height of each div.
    """
    if not blocks:
        print("No data found!")
        return

    # Find the overall max dimensions
    max_x = 0
    max_y = 0
    for _, coords in blocks:
        x1, y1, x2, y2, x3, y3, x4, y4 = coords
        max_x = max(max_x, x1, x2, x3, x4)
        max_y = max(max_y, y1, y2, y3, y4)

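    # Container size: scaled content extent plus an unscaled margin on each side,
    # consistent with how each block's left/top offset is computed
    width = int(max_x * scale + margin * 2)
    height = int(max_y * scale + margin * 2)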

    html_parts = [
        '<!DOCTYPE html>',
        '<html lang="en">',
        '<head>',
        '<meta charset="UTF-8">',
        '<title>Reconstructed Layout</title>',
        '<style>',
        'body { margin:0; background:#222; }',
        f'#container {{ position:relative; width:{width}px; height:{height}px; background:white; margin:20px auto; overflow:hidden; }}',
        '.block { position:absolute; font-family:Arial, sans-serif; font-size:11px; line-height:1.1; overflow:hidden; }',
        '</style>',
        '</head>',
        '<body>',
        '<div id="container">',
    ]

    for text, coords in blocks:
        x1, y1, x2, y2, x3, y3, x4, y4 = coords
        
        # Use the bounding box extents
        min_x = min(x1, x2, x3, x4)
        min_y = min(y1, y2, y3, y4)
        max_xb = max(x1, x2, x3, x4)
        max_yb = max(y1, y2, y3, y4)
        
        left = min_x * scale + margin
        top = min_y * scale + margin
        w = (max_xb - min_x) * scale
        h = (max_yb - min_y) * scale
        
        # Boxes under 10px get no fixed size, so the text sizes the div instead of being clipped
        if w < 10:
            w = None
        if h < 10:
            h = None

        style = f'left:{left:.1f}px; top:{top:.1f}px;'
        if w: style += f' width:{w:.1f}px;'
        if h: style += f' height:{h:.1f}px;'

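        # Escape the OCR text so characters such as & and < cannot break the markup
        safe_text = text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
        html_parts.append(f'<div class="block" style="{style}">{safe_text}</div>')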

    html_parts.append('</div>')
    html_parts.append('</body>')
    html_parts.append('</html>')

    with open(output_file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(html_parts))

    print(f"HTML written to {output_file}")

def main():
    if len(sys.argv) < 2:
        print("Usage: python rebuild_layout.py <input_file.txt> [output.html]")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2] if len(sys.argv) > 2 else "reconstructed_layout.html"

    blocks = []
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            parsed = parse_line(line.strip())
            if parsed:
                blocks.append(parsed)

    if not blocks:
        print("No valid LOC-tagged data found.")
        sys.exit(1)

    # Remove exact duplicates (same text and same coordinates), keeping first-seen order
    unique_blocks = list(dict.fromkeys(blocks))
    print(f"Parsed {len(blocks)} blocks, {len(unique_blocks)} unique")

    generate_html(unique_blocks, output_file, scale=1.0, margin=20)

if __name__ == "__main__":
    main()
@misc{cui2026paddleocrvl15multitask09bvlm,
      title={PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing}, 
      author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
      year={2026},
      eprint={2601.21957},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.21957}, 
}