---
license: mit
base_model: baidu/Unlimited-OCR
base_model_relation: quantized
pipeline_tag: image-text-to-text
library_name: gguf
tags:
  - gguf
  - llama.cpp
  - deepseek-ocr
  - ocr
  - vision-language
  - multimodal
  - image-text-to-text
  - quantized
  - imatrix
  - document-parsing
language:
  - multilingual
---

# Unlimited-OCR — GGUF

GGUF quantizations of [**baidu/Unlimited-OCR**](https://huggingface.co/baidu/Unlimited-OCR),
a 3B vision-language OCR model that extends **DeepSeek-OCR** to one-shot, long-horizon
document parsing. This repo contains a full spread of **K-quants and i-quants** of the
language model plus the **vision projector (mmproj)** needed for image input.

> ⚠️ **Requires a DeepSeek-OCR–aware llama.cpp build (PR [#17400](https://github.com/ggml-org/llama.cpp/pull/17400)).**
> Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP *DeepEncoder* vision tower
> with a DeepSeek-V2 MoE text decoder). Support is **not yet merged into upstream `main`** —
> stock llama.cpp will not load these files. Build the PR branch (instructions below).

## Files

Every run needs **two** files: one language model GGUF (pick a quant) **plus** the shared
vision projector. The projector is fp16 and identical for all quants.

| File | Quant | Bits | Size | Notes |
|---|---|---|---|---|
| `Unlimited-OCR-BF16.gguf` | BF16 | 16 | 5.47 GiB | Full-precision conversion. The base every quant is made from; reference quality. |
| `Unlimited-OCR-Q8_0.gguf` | Q8_0 | 8 | 2.91 GiB | Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM. |
| `Unlimited-OCR-Q6_K.gguf` | Q6_K | 6 | 2.43 GiB | Very high quality, essentially indistinguishable from Q8_0 for OCR. |
| `Unlimited-OCR-Q5_K_M.gguf` | Q5_K_M | 5 | 2.07 GiB | High quality. Great balance when you can spare a bit more than Q4. |
| `Unlimited-OCR-Q5_K_S.gguf` | Q5_K_S | 5 | 1.95 GiB | High quality, slightly smaller than Q5_K_M. |
| `Unlimited-OCR-Q4_K_M.gguf` | Q4_K_M | 4 | 1.82 GiB | **Recommended default** — best overall size/quality trade-off. |
| `Unlimited-OCR-Q4_K_S.gguf` | Q4_K_S | 4 | 1.68 GiB | Slightly smaller than Q4_K_M with a small quality cost. |
| `Unlimited-OCR-Q3_K_M.gguf` | Q3_K_M | 3 | 1.45 GiB | Compact. Usable when memory is tight; some quality loss. |
| `Unlimited-OCR-IQ4_XS.gguf` | IQ4_XS | 4 | 1.53 GiB | i-quant: smaller than Q4_K_S at similar quality (built with imatrix). |
| `Unlimited-OCR-IQ4_NL.gguf` | IQ4_NL | 4 | 1.59 GiB | i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple. |
| `Unlimited-OCR-IQ3_M.gguf` | IQ3_M | 3 | 1.35 GiB | i-quant: solid 3-bit quality for the size (imatrix). |
| `Unlimited-OCR-IQ3_XXS.gguf` | IQ3_XXS | 3 | 1.24 GiB | i-quant: very small 3-bit; noticeable quality loss but runnable. |
| `Unlimited-OCR-IQ2_M.gguf` | IQ2_M | 2 | 1.15 GiB | i-quant: smallest here; experimental, lowest quality — for tight memory only. |

**Vision projector (required for all of the above):**

| File | Type | Size |
|---|---|---|
| `mmproj-Unlimited-OCR-F16.gguf` | F16 | 774.27 MiB |

*Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is
small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix
(imatrix) computed from a general-text calibration set.*
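
As a rough aid for choosing a file, the sketch below picks the largest quant from the table whose on-disk size, together with the projector, fits a given budget. The function name `largest_fit` and the `headroom_gib` parameter are illustrative assumptions, and file size is used only as a proxy for quality. Actual runtime memory is higher than the file sizes because of the KV cache and compute buffers, so leave headroom.

```python
# On-disk sizes in GiB, copied from the Files table above.
QUANTS_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "IQ4_NL": 1.59, "IQ4_XS": 1.53,
    "Q3_K_M": 1.45, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
# The shared F16 projector is 774.27 MiB and is needed with every quant.
MMPROJ_GIB = 774.27 / 1024


def largest_fit(budget_gib, headroom_gib=0.0):
    """Return the largest quant that fits with the projector, or None.

    headroom_gib is a hypothetical allowance for KV cache and buffers;
    it is not a measured value.
    """
    available = budget_gib - MMPROJ_GIB - headroom_gib
    fitting = [(size, name) for name, size in QUANTS_GIB.items() if size <= available]
    return max(fitting)[1] if fitting else None
```

For example, `largest_fit(3.0)` returns `"Q5_K_M"`, while a budget of 1.5 GiB returns `None` because the projector alone takes about 0.76 GiB.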

## Build llama.cpp with DeepSeek-OCR support

```bash
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release        # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server
```

## Quick start

Download one quant + the projector (you always need both):
```bash
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr
```

Run it on an image:
```bash
./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek-ocr --temp 0
```

> `--chat-template deepseek-ocr` and `--mmproj` are **required**. With `--image`, the image is
> injected automatically — you do **not** need to type a literal `<image>` token in `-p`.
> Use `--temp 0` for OCR (deterministic). Add `-n 4096` (or more) for long/dense documents.

---

## Prompting guide

Unlimited-OCR uses the **DeepSeek-OCR** prompt vocabulary. The prompt is just an instruction;
prefix it with `<|grounding|>` whenever you also want **bounding boxes** for what was read.

| Task | Prompt (`-p`) |
|---|---|
| **Document → Markdown** (layout-aware, with boxes) | `<|grounding|>Convert the document to markdown.` |
| **Plain text OCR** (just the text, no layout) | `Free OCR.` |
| **OCR with bounding boxes** | `<|grounding|>OCR this image.` |
| **Native Unlimited-OCR parse** | `document parsing.` |
| **Parse a figure / chart / diagram** | `Parse the figure.` |
| **Describe the image** (general VQA) | `Describe this image in detail.` |
| **Find specific text** (referring grounding) | `<|grounding|>Locate <|ref|>Total Due<|/ref|> in the image.` |

### Worked examples

**1) Document → clean Markdown (tables, headings, reading order):**
```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."
```

**2) Just the raw text, no layout / no boxes:**
```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image receipt.jpg --temp 0 -p "Free OCR."
```

**3) Locate a specific string and get its box:**
```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."
```

### Understanding the output (grounding tokens)

With `<|grounding|>`, the model interleaves the recognized text with detection boxes:

```
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00
```

Each `[x1, y1, x2, y2]` is the bounding box (top-left → bottom-right) of that span, in the
coordinate space of the model's input image. Drop the `<|det|>...<|/det|>` tags if you only
want the text, or parse them to overlay boxes / build a layout. Without `<|grounding|>` you get
plain text (or Markdown) with no box tags.
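
A minimal parsing sketch for the tagged output follows. The regular expression is inferred only from the three sample lines above (`<|det|>label [x1, y1, x2, y2]<|/det|>text`), so treat it as an assumption and adjust it if your build emits a different layout. The names `Span`, `parse_grounded`, and `strip_boxes` are illustrative.

```python
import re
from typing import NamedTuple

# Pattern inferred from the sample output: a label, four integers in
# brackets, then the recognized text up to the next <|det|> tag.
DET_RE = re.compile(
    r"<\|det\|>\s*(?P<label>\w+)\s*"
    r"\[\s*(?P<x1>\d+)\s*,\s*(?P<y1>\d+)\s*,\s*(?P<x2>\d+)\s*,\s*(?P<y2>\d+)\s*\]"
    r"\s*<\|/det\|>(?P<text>.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


class Span(NamedTuple):
    label: str
    box: tuple  # (x1, y1, x2, y2), top-left to bottom-right
    text: str


def parse_grounded(output):
    """Return one Span per <|det|>...<|/det|> tag found in the output."""
    spans = []
    for m in DET_RE.finditer(output):
        box = tuple(int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        spans.append(Span(m.group("label"), box, m.group("text").strip()))
    return spans


def strip_boxes(output):
    """Drop the detection tags and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output, flags=re.DOTALL)
```

Use `parse_grounded` when you want to overlay boxes or rebuild a layout, and `strip_boxes` when you only need the text from a grounded run.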

> **Tip — long documents:** Unlimited-OCR targets *one-shot long-horizon* parsing. For multi-page
> scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a
> mild repetition penalty, e.g. `--repeat-penalty 1.05`, and keep `--temp 0`.
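
The page-by-page approach from the tip can be sketched as below. It uses only the CLI flags already shown in this README. The helper names `build_cmd` and `ocr_pages`, the default paths, and the assumption that the model text arrives on stdout (with logs on stderr) are illustrative, so check the output of your own build before relying on it.

```python
import subprocess


def build_cmd(image, model, mmproj, prompt,
              cli="./build/bin/llama-mtmd-cli",
              max_tokens=4096, repeat_penalty=None):
    """Assemble one llama-mtmd-cli invocation for a single page image."""
    cmd = [
        cli, "-m", str(model), "--mmproj", str(mmproj),
        "--chat-template", "deepseek-ocr",
        "--image", str(image), "--temp", "0", "-n", str(max_tokens),
        "-p", prompt,
    ]
    if repeat_penalty is not None:
        # Optional mild penalty for pages where the output loops.
        cmd += ["--repeat-penalty", str(repeat_penalty)]
    return cmd


def ocr_pages(page_images, model, mmproj, prompt="Free OCR.", **kwargs):
    """Run the CLI once per page and join the results in page order."""
    parts = []
    for image in page_images:
        result = subprocess.run(
            build_cmd(image, model, mmproj, prompt, **kwargs),
            capture_output=True, text=True, check=True,
        )
        parts.append(result.stdout.strip())
    return "\n\n".join(parts)
```

Each call reloads the model, which is simple but slow. For many pages, a running `llama-server` avoids the repeated load.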

---

## Serving (OpenAI-compatible API)

```bash
./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --chat-template deepseek-ocr -c 8192 --host 0.0.0.0 --port 8080
```

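Call it with an image (base64 data URL). `base64 -w0` is the GNU coreutils form that disables
line wrapping; on macOS use `base64 -i document.png` instead: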
```bash
IMG=$(base64 -w0 document.png)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'
```

The Python OpenAI SDK takes the same request: point `base_url` at `http://localhost:8080/v1`,
then send a `text` part with the prompt above and an `image_url` part with the data URL.
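
A dependency-free sketch of the same request is shown below, using only the Python standard library. The payload mirrors the `curl` example. The helper names `build_payload` and `ocr` are illustrative, and reading the reply from `choices[0].message.content` assumes the server returns the usual OpenAI chat-completion response shape.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes, prompt, mime="image/png"):
    """Build the chat-completions body: one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }


def ocr(image_path, prompt, base_url="http://localhost:8080/v1"):
    """POST one image to a running llama-server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumed response shape: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `ocr("document.png", "<|grounding|>Convert the document to markdown.")` returns the model output as a string. The `messages` list built here can also be passed unchanged to the OpenAI SDK.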

## About the model

- **Architecture:** `DeepseekOCRForCausalLM` — *DeepEncoder* vision (SAM-ViT-B + CLIP-L/14,
  1024×1024 input, 16× downsample) → linear projector → **DeepSeek-V2 MoE** text decoder
  (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
- **Task:** multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot
  long-horizon parsing). The original supports *gundam* (crop) and *base* resolution modes.
- **License:** MIT (inherited from the base model).

## How these were made

1. Converted `baidu/Unlimited-OCR` to GGUF with the PR #17400 `convert_hf_to_gguf.py`. The
   converter targets DeepSeek-OCR, so the config's top-level `architectures` was set to
   `DeepseekOCRForCausalLM` and `language_config.architectures` to `DeepseekV2ForCausalLM`
   (the model's tensor layout is otherwise identical to DeepSeek-OCR's).
2. Exported the text decoder (BF16) and the vision tower (`--mmproj`, F16) separately.
3. Built an importance matrix from a general-text corpus and produced the K-/i-quants with
   `llama-quantize`.
4. **Verified**: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes)
   via `llama-mtmd-cli` before quantizing.

## Limitations

- Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in `main`.
- Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size — prefer **Q4_K_M** or
  higher for production OCR.
- The vision encoder runs in fp16 regardless of the chosen text quant.

## Credits

- Base model: [baidu/Unlimited-OCR](https://huggingface.co/baidu/Unlimited-OCR) (MIT) — builds on
  [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR).
- GGUF / DeepSeek-OCR llama.cpp support: [ggml-org/llama.cpp#17400](https://github.com/ggml-org/llama.cpp/pull/17400).
- Quantized by [sahilchachra](https://huggingface.co/sahilchachra).