---
license: mit
base_model: baidu/Unlimited-OCR
base_model_relation: quantized
pipeline_tag: image-text-to-text
tags:
- mlx
- mlx-vlm
- quantized
- apple-silicon
- deepseek-ocr
- ocr
- vision-language
- multimodal
- document-parsing
language:
- multilingual
---

# Unlimited-OCR: MLX Affine int8 (group size 64)

MLX quantization of [**baidu/Unlimited-OCR**](https://huggingface.co/baidu/Unlimited-OCR), a 3B vision-language OCR model that pushes **DeepSeek-OCR** one step further (one-shot, long-horizon document parsing).

This variant uses **Affine int8 (group size 64)** quantization (9.41 effective bits/weight).

**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

> **Note on effective bpw**: mlx-vlm's quantizers only act on the language tower's linear
> weights. The vision encoder and embeddings stay at bf16, so the on-disk size averages
> the quantized text decoder with the full-precision vision components.

## About the model

- **Architecture:** *DeepEncoder* vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → **DeepSeek-V2 MoE** text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
- **Task:** multilingual OCR / document parsing: single image, multi-page, and PDF (one-shot long-horizon parsing). Supports *gundam* (crop) and *base* resolution modes.
- **License:** MIT (inherited from the base model).

## Benchmark results

Evaluated on Apple M4 Pro (24 GB) with MLX on the [FUNSD](https://huggingface.co/datasets/nielsr/funsd) test set (50 scanned form images).

| Metric | Affine int8 (group size 64) | FP16 baseline |
|---|---:|---:|
| **FUNSD CER ↓** | **1.5720** | 1.7588 |
| Decode tok/s | 205.2 | 146.2 |
| Peak memory | 5.06 GB | 7.62 GB |
| Disk size | 3747 MB | 6464 MB |

### All variants compared

| Variant | CER ↓ | Tok/s | Memory | Disk |
|---|---:|---:|---:|---:|
| FP16 (baseline) | 1.7588 | 146.2 | 7.62 GB | 6464 MB |
| MXFP8 | 1.4556 | 205.6 | 4.98 GB | 3660 MB |
| Int8 | 1.5720 | 205.2 | 5.06 GB | 3747 MB |
| MXFP4 | 2.3944 | 251.9 | 3.61 GB | 2260 MB |
| Int4 | 2.2879 | 252.6 | 3.7 GB | 2347 MB |

## Usage

```bash
pip install mlx-vlm
```

```python
from mlx_vlm import load, generate

model, processor = load("sahilchachra/unlimited-ocr-8bit-mlx")

# Single-image OCR (Gundam mode)
response = generate(
    model,
    processor,
    prompt="document parsing.",
    image="path/to/document.jpg",
    max_tokens=4096,
    verbose=True,
)
```

## Prompting guide

Unlimited-OCR uses the **DeepSeek-OCR** prompt vocabulary. The prompt is just an instruction; prefix it with `<|grounding|>` whenever you also want **bounding boxes** for what was read.

| Task | Prompt |
|---|---|
| **Document → Markdown** (layout-aware, with boxes) | `<\|grounding\|>Convert the document to markdown.` |
| **Plain text OCR** (just the text, no layout) | `Free OCR.` |
| **OCR with bounding boxes** | `<\|grounding\|>OCR this image.` |
| **Native parse** | `document parsing.` |
| **Parse a figure / chart / diagram** | `Parse the figure.` |
| **Describe the image** (general VQA) | `Describe this image in detail.` |

> **Note:** Unlike the GGUF/llama.cpp workflow, mlx-vlm requires the literal `` token
> in the prompt and a separate `image=` argument pointing to the file path.

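
The prompt table above can be kept in one place with a small helper. This is only a sketch: the task keys (`"markdown"`, `"free_ocr"`, and so on) and the `build_prompt` function are hypothetical names introduced here, not part of mlx-vlm or the base model. It builds the instruction text only and leaves image handling to the `generate()` call shown under Usage.

```python
# Hypothetical helper: the task keys and function name are illustrative,
# only the prompt strings and the grounding prefix come from the table above.
GROUNDING_PREFIX = "<|grounding|>"

PROMPTS = {
    "markdown": "Convert the document to markdown.",
    "free_ocr": "Free OCR.",
    "ocr": "OCR this image.",
    "parse": "document parsing.",
    "figure": "Parse the figure.",
    "describe": "Describe this image in detail.",
}


def build_prompt(task: str, grounding: bool = False) -> str:
    """Return the instruction for `task`, optionally prefixed for bounding boxes."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(PROMPTS)}")
    instruction = PROMPTS[task]
    return GROUNDING_PREFIX + instruction if grounding else instruction


if __name__ == "__main__":
    print(build_prompt("markdown", grounding=True))
    print(build_prompt("free_ocr"))
```

The returned string would be passed as `prompt=` in the Usage snippet. Keeping grounding as a flag rather than baking it into the strings makes it easy to request the same task with or without boxes.
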
### Understanding the output (grounding tokens)

With `<|grounding|>`, the model interleaves the recognized text with detection boxes:

```
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text [37, 483, 329, 543]<|/det|>Total Due: $44.00
```

Each `[x1, y1, x2, y2]` is the bounding box (top-left → bottom-right) of that span. Drop the `<|det|>...<|/det|>` tags if you only want the text, or parse them to overlay boxes / build a layout.

> **Tip (long documents):** For multi-page scans, run page-by-page and concatenate.

## Important: model_type mapping

The original `baidu/Unlimited-OCR` uses `model_type: "unlimited-ocr"`, which is not directly recognized by mlx-vlm. This quantized variant ships with the config already patched:

- `config.json` → `"model_type": "deepseekocr"` (was `"unlimited-ocr"`), `auto_map` removed
- `processor_config.json` → `"processor_class": "DeepseekOCRProcessor"` (was `"UnlimitedOCRHFProcessor"`)

**No manual patching needed**: just `load()` and go. If you are converting the original model yourself, apply these two changes before running `mlx_vlm convert`.

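
For anyone converting the original checkpoint themselves, the two config changes listed above could be scripted roughly as follows. This is a minimal sketch under assumptions: the function names are made up for illustration, and `model_dir` is assumed to be a local download of `baidu/Unlimited-OCR` containing both JSON files. It is not necessarily how this repository's configs were produced.

```python
import json
from pathlib import Path


def patch_model_config(config: dict) -> dict:
    """Return a copy of config.json contents with the two changes listed above."""
    patched = dict(config)
    patched["model_type"] = "deepseekocr"  # was "unlimited-ocr"
    patched.pop("auto_map", None)          # auto_map removed
    return patched


def patch_processor_config(config: dict) -> dict:
    """Return a copy of processor_config.json contents with the processor class swapped."""
    patched = dict(config)
    patched["processor_class"] = "DeepseekOCRProcessor"  # was "UnlimitedOCRHFProcessor"
    return patched


def patch_checkpoint(model_dir: str) -> None:
    """Rewrite both files in place. `model_dir` is a hypothetical local path."""
    root = Path(model_dir)
    for name, patch in (
        ("config.json", patch_model_config),
        ("processor_config.json", patch_processor_config),
    ):
        path = root / name
        data = json.loads(path.read_text(encoding="utf-8"))
        path.write_text(json.dumps(patch(data), indent=2) + "\n", encoding="utf-8")
```

Calling `patch_checkpoint("path/to/Unlimited-OCR")` on a local copy would apply both edits before conversion. The pure `patch_*` functions are kept separate so the changes can be checked without touching files.
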
## All variants in this collection

### MLX (Apple Silicon, this collection)

| Model | Variant | Disk |
|---|---|---:|
| [sahilchachra/unlimited-ocr-4bit-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-4bit-mlx) | Affine int4 | 2347 MB |
| [sahilchachra/unlimited-ocr-8bit-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-8bit-mlx) | Affine int8 ← this model | 3747 MB |
| [sahilchachra/unlimited-ocr-mxfp4-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-mxfp4-mlx) | Block float MX FP4 | 2260 MB |
| [sahilchachra/unlimited-ocr-mxfp8-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-mxfp8-mlx) | Block float MX FP8 | 3660 MB |

### GGUF (llama.cpp, cross-platform)

| Model | Notes |
|---|---|
| [sahilchachra/Unlimited-OCR-GGUF](https://huggingface.co/sahilchachra/Unlimited-OCR-GGUF) | K-quants & i-quants (BF16 → IQ2_M). Requires llama.cpp PR #17400. |

## Credits

- Base model: [baidu/Unlimited-OCR](https://huggingface.co/baidu/Unlimited-OCR) (MIT), which builds on [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR).
- Quantized by [sahilchachra](https://huggingface.co/sahilchachra).