---
license: mit
base_model: baidu/Unlimited-OCR
base_model_relation: quantized
pipeline_tag: image-text-to-text
tags:
  - mlx
  - mlx-vlm
  - quantized
  - apple-silicon
  - deepseek-ocr
  - ocr
  - vision-language
  - multimodal
  - document-parsing
language:
  - multilingual
---

# Unlimited-OCR — MLX Affine int8 (group size 64)

MLX quantization of [**baidu/Unlimited-OCR**](https://huggingface.co/baidu/Unlimited-OCR),
a 3B vision-language OCR model that extends **DeepSeek-OCR** with one-shot,
long-horizon document parsing. This variant uses **affine int8** quantization with group size 64
(9.41 effective bits/weight).

**Quantized by**: [sahilchachra](https://huggingface.co/sahilchachra)

> **Note on effective bpw**: mlx-vlm's quantizers only act on the language tower's linear
> weights. The vision encoder and embeddings stay at bf16, so the effective bits/weight
> figure averages the quantized text decoder with the full-precision vision components
> and therefore comes out above 8.
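
As a rough sanity check on that figure, the sketch below back-solves what share of the weights would have to be quantized. It assumes that affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights (about 8.5 bits per quantized weight) and that every other weight stays at 16 bits. Both are assumptions made for illustration, not numbers published with this model, and the function names are hypothetical.

```python
def quantized_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits per weight when each group carries one scale and one bias (assumed layout)."""
    return bits + 2 * meta_bits / group_size


def quantized_fraction(effective_bpw: float, quant_bpw: float, full_bpw: float = 16.0) -> float:
    """Solve effective = f * quant + (1 - f) * full for the quantized share f."""
    return (full_bpw - effective_bpw) / (full_bpw - quant_bpw)


int8_bpw = quantized_bits_per_weight(bits=8, group_size=64)
fraction = quantized_fraction(effective_bpw=9.41, quant_bpw=int8_bpw)
print(f"{int8_bpw:.2f} bits per quantized weight, {fraction:.1%} of weights quantized")
```

Under these assumptions, a little under nine tenths of the weights would be in int8, with the remainder left at 16 bits.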

## About the model

- **Architecture:** *DeepEncoder* vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → **DeepSeek-V2 MoE** text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
- **Task:** multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing). Supports *gundam* (crop) and *base* resolution modes.
- **License:** MIT (inherited from the base model).

## Benchmark results

Evaluated on Apple M4 Pro (24 GB) with MLX on the [FUNSD](https://huggingface.co/datasets/nielsr/funsd) test set (50 scanned form images).

| Metric | Affine int8 (group size 64) | FP16 baseline |
|---|---:|---:|
| **FUNSD CER ↓** | **1.5720** | 1.7588 |
| Decode tok/s | 205.2 | 146.2 |
| Peak memory | 5.06 GB | 7.62 GB |
| Disk size | 3747 MB | 6464 MB |
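
The card does not state how CER was computed. Under the common definition (character-level Levenshtein distance divided by the reference length), scores above 1.0 are possible whenever the output contains many more characters than the reference transcription. A minimal sketch of that definition, with hypothetical function names and no claim that it matches the evaluation script used here:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# A hypothesis much longer than the reference pushes CER above 1.0.
print(cer("ab", "abxxxx"))
```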

### All variants compared

| Variant | CER ↓ | Tok/s | Memory | Disk |
|---|---:|---:|---:|---:|
| FP16 (baseline) | 1.7588 | 146.2 | 7.62 GB | 6464 MB |
| MXFP8 | 1.4556 | 205.6 | 4.98 GB | 3660 MB |
| Int8 | 1.5720 | 205.2 | 5.06 GB | 3747 MB |
| MXFP4 | 2.3944 | 251.9 | 3.61 GB | 2260 MB |
| Int4 | 2.2879 | 252.6 | 3.70 GB | 2347 MB |
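
To read the trade-offs as ratios instead of absolute numbers, the sketch below hard-codes the rows of the table above and reports each variant relative to the FP16 baseline. The dictionary layout and function name are illustrative only.

```python
# Values copied from the comparison table above.
VARIANTS = {
    "FP16":  {"cer": 1.7588, "tok_s": 146.2, "mem_gb": 7.62, "disk_mb": 6464},
    "MXFP8": {"cer": 1.4556, "tok_s": 205.6, "mem_gb": 4.98, "disk_mb": 3660},
    "Int8":  {"cer": 1.5720, "tok_s": 205.2, "mem_gb": 5.06, "disk_mb": 3747},
    "MXFP4": {"cer": 2.3944, "tok_s": 251.9, "mem_gb": 3.61, "disk_mb": 2260},
    "Int4":  {"cer": 2.2879, "tok_s": 252.6, "mem_gb": 3.70, "disk_mb": 2347},
}


def relative_to_baseline(name: str, baseline: str = "FP16") -> dict:
    """Return decode speedup plus memory and disk ratios versus the baseline."""
    v, b = VARIANTS[name], VARIANTS[baseline]
    return {
        "speedup": v["tok_s"] / b["tok_s"],
        "mem_ratio": v["mem_gb"] / b["mem_gb"],
        "disk_ratio": v["disk_mb"] / b["disk_mb"],
        "cer_delta": v["cer"] - b["cer"],
    }


for name in VARIANTS:
    r = relative_to_baseline(name)
    print(f"{name:6s} speedup x{r['speedup']:.2f}  mem x{r['mem_ratio']:.2f}  "
          f"disk x{r['disk_ratio']:.2f}  CER {r['cer_delta']:+.4f}")
```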

## Usage

```bash
pip install mlx-vlm
```

```python
from mlx_vlm import load, generate

model, processor = load("sahilchachra/unlimited-ocr-8bit-mlx")

# Single-image OCR (Gundam mode)
response = generate(model, processor,
                    prompt="<image>document parsing.",
                    image="path/to/document.jpg",
                    max_tokens=4096, verbose=True)
```

## Prompting guide

Unlimited-OCR uses the **DeepSeek-OCR** prompt vocabulary. The prompt is just an instruction;
prefix it with `<|grounding|>` whenever you also want **bounding boxes** for what was read.

| Task | Prompt |
|---|---|
| **Document → Markdown** (layout-aware, with boxes) | `<image><\|grounding\|>Convert the document to markdown.` |
| **Plain text OCR** (just the text, no layout) | `<image>Free OCR.` |
| **OCR with bounding boxes** | `<image><\|grounding\|>OCR this image.` |
| **Native parse** | `<image>document parsing.` |
| **Parse a figure / chart / diagram** | `<image>Parse the figure.` |
| **Describe the image** (general VQA) | `<image>Describe this image in detail.` |

> **Note:** Unlike the GGUF/llama.cpp workflow, mlx-vlm requires the literal `<image>` token
> in the prompt and a separate `image=` argument pointing to the file path.
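
If you switch between tasks often, a small lookup keeps the prompt strings in one place. The sketch below copies the prompts from the table; the task keys and the `prompt_for` helper are hypothetical names, not part of mlx-vlm.

```python
# Prompt strings taken from the table above; the keys are illustrative.
PROMPTS = {
    "markdown": "<image><|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>Free OCR.",
    "ocr_boxes": "<image><|grounding|>OCR this image.",
    "parse": "<image>document parsing.",
    "figure": "<image>Parse the figure.",
    "describe": "<image>Describe this image in detail.",
}


def prompt_for(task: str) -> str:
    """Return the prompt for a task key, failing loudly on unknown keys."""
    try:
        return PROMPTS[task]
    except KeyError:
        known = ", ".join(sorted(PROMPTS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None


print(prompt_for("free_ocr"))
```

The returned string can be passed as `prompt=` to the `generate` call shown in the Usage section.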

### Understanding the output (grounding tokens)

With `<|grounding|>`, the model interleaves the recognized text with detection boxes:

```text
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00
```

Each `[x1, y1, x2, y2]` is the bounding box (top-left → bottom-right) of that span. Drop the
`<|det|>...<|/det|>` tags if you only want the text, or parse them to overlay boxes / build a layout.
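
A minimal parsing sketch, assuming the output follows exactly the `<|det|>label [x1, y1, x2, y2]<|/det|>text` layout of the sample above. The `Span` type and both function names are hypothetical, and real outputs may need a more tolerant pattern.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str
    box: Tuple[int, ...]  # (x1, y1, x2, y2), as emitted by the model
    text: str


# One detection tag followed by its text, up to the next tag or end of string.
DET_RE = re.compile(
    r"<\|det\|>\s*(\w+)\s*\[([^\]]+)\]\s*<\|/det\|>(.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_spans(output: str) -> List[Span]:
    """Extract (label, box, text) triples from grounded output."""
    spans = []
    for label, box, text in DET_RE.findall(output):
        coords = tuple(int(v) for v in box.split(","))
        spans.append(Span(label, coords, text.strip()))
    return spans


def strip_grounding(output: str) -> str:
    """Drop the detection tags and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output, flags=re.DOTALL)


SAMPLE = (
    "<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623\n"
    "<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra\n"
    "<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00\n"
)

for span in parse_spans(SAMPLE):
    print(span.label, span.box, span.text)
```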

> **Tip (long documents):** For multi-page scans, run the model on each page separately and concatenate the outputs in page order.
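
The page-by-page approach can be sketched as a small loop. To keep the example independent of any particular mlx-vlm version, the per-page OCR step is passed in as a callable; `parse_pages` and `ocr_page` are hypothetical names.

```python
from typing import Callable, Iterable


def parse_pages(
    page_paths: Iterable[str],
    ocr_page: Callable[[str], str],
    separator: str = "\n\n",
) -> str:
    """Run OCR on each page in the given order and join the results."""
    outputs = []
    for path in page_paths:
        # Trim stray whitespace so the separator is the only gap between pages.
        outputs.append(ocr_page(path).strip())
    return separator.join(outputs)


# Stand-in OCR function so the sketch runs without a model.
demo = parse_pages(["page1.jpg", "page2.jpg"], lambda p: f"text of {p}\n")
print(demo)
```

In practice `ocr_page` could wrap the `generate(...)` call from the Usage section with a fixed prompt; how the text is pulled out of its return value depends on the installed mlx-vlm version.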

## Important — model_type mapping

The original `baidu/Unlimited-OCR` uses `model_type: "unlimited-ocr"`, which mlx-vlm does not
recognize directly. This quantized variant ships with the config already patched:

- `config.json` → `"model_type": "deepseekocr"` (was `"unlimited-ocr"`), `auto_map` removed
- `processor_config.json` → `"processor_class": "DeepseekOCRProcessor"` (was `"UnlimitedOCRHFProcessor"`)

**No manual patching needed** — just `load()` and go.

If you are converting the original model yourself, apply the changes above to both files
before running `mlx_vlm.convert`.
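
For a local copy of the original checkpoint, the edits listed above can be scripted. This is a sketch that assumes both JSON files sit at the top level of the model directory; `patch_configs` is a hypothetical helper, not part of mlx-vlm.

```python
import json
from pathlib import Path


def patch_configs(model_dir) -> None:
    """Apply the model_type, auto_map, and processor_class edits in place."""
    model_dir = Path(model_dir)

    # config.json: switch model_type and drop auto_map.
    config_path = model_dir / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["model_type"] = "deepseekocr"
    config.pop("auto_map", None)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

    # processor_config.json: point at the DeepSeek-OCR processor class.
    proc_path = model_dir / "processor_config.json"
    proc = json.loads(proc_path.read_text(encoding="utf-8"))
    proc["processor_class"] = "DeepseekOCRProcessor"
    proc_path.write_text(json.dumps(proc, indent=2), encoding="utf-8")
```

Call it as `patch_configs("path/to/Unlimited-OCR")` on a downloaded copy before conversion; all other keys in both files are left untouched.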

## All variants in this collection

### MLX (Apple Silicon — this collection)

| Model | Variant | Disk |
|---|---|---:|
| [sahilchachra/unlimited-ocr-4bit-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-4bit-mlx) | Affine int4 | 2347 MB |
| [sahilchachra/unlimited-ocr-8bit-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-8bit-mlx) | Affine int8 ← this model | 3747 MB |
| [sahilchachra/unlimited-ocr-mxfp4-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-mxfp4-mlx) | Block float MX FP4 | 2260 MB |
| [sahilchachra/unlimited-ocr-mxfp8-mlx](https://huggingface.co/sahilchachra/unlimited-ocr-mxfp8-mlx) | Block float MX FP8 | 3660 MB |

### GGUF (llama.cpp — cross-platform)

| Model | Notes |
|---|---|
| [sahilchachra/Unlimited-OCR-GGUF](https://huggingface.co/sahilchachra/Unlimited-OCR-GGUF) | K-quants & i-quants (BF16 → IQ2_M). Requires llama.cpp PR #17400. |

## Credits

- Base model: [baidu/Unlimited-OCR](https://huggingface.co/baidu/Unlimited-OCR) (MIT) — builds on [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR).
- Quantized by [sahilchachra](https://huggingface.co/sahilchachra).