---
license: mit
base_model: baidu/Unlimited-OCR
base_model_relation: quantized
pipeline_tag: image-text-to-text
library_name: gguf
tags:
- gguf
- llama.cpp
- deepseek-ocr
- ocr
- vision-language
- multimodal
- image-text-to-text
- quantized
- imatrix
- document-parsing
language:
- multilingual
---

# Unlimited-OCR GGUF

GGUF quantizations of [**baidu/Unlimited-OCR**](https://huggingface.co/baidu/Unlimited-OCR), a 3B vision-language OCR model that pushes **DeepSeek-OCR** one step further (one-shot, long-horizon document parsing). This repo contains a full spread of **K-quants and i-quants** of the language model plus the **vision projector (mmproj)** needed for image input.

> ⚠️ **Requires a DeepSeek-OCR-aware llama.cpp build (PR [#17400](https://github.com/ggml-org/llama.cpp/pull/17400)).**
> Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP *DeepEncoder* vision tower
> with a DeepSeek-V2 MoE text decoder). Support is **not yet merged into upstream `main`**, so
> stock llama.cpp will not load these files. Build the PR branch (instructions below).

## Files

Every run needs **two** files: one language model GGUF (pick a quant) **plus** the shared vision projector. The projector is fp16 and identical for all quants.

| File | Quant | Bits | Size | Notes |
|---|---|---|---|---|
| `Unlimited-OCR-BF16.gguf` | BF16 | 16 | 5.47 GiB | Full-precision conversion. The base every quant is made from; reference quality. |
| `Unlimited-OCR-Q8_0.gguf` | Q8_0 | 8 | 2.91 GiB | Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM. |
| `Unlimited-OCR-Q6_K.gguf` | Q6_K | 6 | 2.43 GiB | Very high quality, essentially indistinguishable from Q8_0 for OCR. |
| `Unlimited-OCR-Q5_K_M.gguf` | Q5_K_M | 5 | 2.07 GiB | High quality. Great balance when you can spare a bit more than Q4. |
| `Unlimited-OCR-Q5_K_S.gguf` | Q5_K_S | 5 | 1.95 GiB | High quality, slightly smaller than Q5_K_M. |
| `Unlimited-OCR-Q4_K_M.gguf` | Q4_K_M | 4 | 1.82 GiB | **Recommended default**: best overall size/quality trade-off. |
| `Unlimited-OCR-Q4_K_S.gguf` | Q4_K_S | 4 | 1.68 GiB | Slightly smaller than Q4_K_M with a small quality cost. |
| `Unlimited-OCR-Q3_K_M.gguf` | Q3_K_M | 3 | 1.45 GiB | Compact. Usable when memory is tight; some quality loss. |
| `Unlimited-OCR-IQ4_XS.gguf` | IQ4_XS | 4 | 1.53 GiB | i-quant: smaller than Q4_K_S at similar quality (built with imatrix). |
| `Unlimited-OCR-IQ4_NL.gguf` | IQ4_NL | 4 | 1.59 GiB | i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple. |
| `Unlimited-OCR-IQ3_M.gguf` | IQ3_M | 3 | 1.35 GiB | i-quant: solid 3-bit quality for the size (imatrix). |
| `Unlimited-OCR-IQ3_XXS.gguf` | IQ3_XXS | 3 | 1.24 GiB | i-quant: very small 3-bit; noticeable quality loss but runnable. |
| `Unlimited-OCR-IQ2_M.gguf` | IQ2_M | 2 | 1.15 GiB | i-quant: smallest here; experimental, lowest quality; for tight memory only. |

**Vision projector (required for all of the above):**

| File | Type | Size |
|---|---|---|
| `mmproj-Unlimited-OCR-F16.gguf` | F16 | 774.27 MiB |

*Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized): it is small, and quantizing it hurts OCR accuracy.
i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.*

## Build llama.cpp with DeepSeek-OCR support

```bash
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release   # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server
```

## Quick start

Download one quant + the projector (you always need both):

```bash
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr
```

Run it on an image:

```bash
./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek-ocr --temp 0
```

> `--chat-template deepseek-ocr` and `--mmproj` are **required**. With `--image`, the image is
> injected automatically, so you do **not** need to type a literal `<image>` token in `-p`.
> Use `--temp 0` for OCR (deterministic). Add `-n 4096` (or more) for long/dense documents.

---

## Prompting guide

Unlimited-OCR uses the **DeepSeek-OCR** prompt vocabulary. The prompt is just an instruction; prefix it with `<|grounding|>` whenever you also want **bounding boxes** for what was read.

| Task | Prompt (`-p`) |
|---|---|
| **Document → Markdown** (layout-aware, with boxes) | `<\|grounding\|>Convert the document to markdown.` |
| **Plain text OCR** (just the text, no layout) | `Free OCR.` |
| **OCR with bounding boxes** | `<\|grounding\|>OCR this image.` |
| **Native Unlimited-OCR parse** | `document parsing.` |
| **Parse a figure / chart / diagram** | `Parse the figure.` |
| **Describe the image** (general VQA) | `Describe this image in detail.` |
| **Find specific text** (referring grounding) | `<\|grounding\|>Locate <\|ref\|>Total Due<\|/ref\|> in the image.` |

### Worked examples

**1) Document → clean Markdown (tables, headings, reading order):**

```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."
```

**2) Just the raw text, no layout / no boxes:**

```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image receipt.jpg --temp 0 -p "Free OCR."
```

**3) Locate a specific string and get its box:**

```bash
./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."
```

### Understanding the output (grounding tokens)

With `<|grounding|>`, the model interleaves the recognized text with detection boxes:

```
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text [37, 483, 329, 543]<|/det|>Total Due: $44.00
```

Each `[x1, y1, x2, y2]` is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image.
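
As a rough illustration of consuming this output, here is a minimal Python sketch. It assumes the tag layout matches the sample in this section (`<|det|>label [x1, y1, x2, y2]<|/det|>` followed by the recognized text). The names `Span`, `parse_grounding`, and `strip_boxes` are hypothetical helpers, not part of llama.cpp or the model's tooling, and real output may need a more tolerant pattern.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str                       # e.g. "title" or "text"
    box: tuple[int, int, int, int]   # (x1, y1, x2, y2)
    text: str                        # recognized text that follows the tag


# Assumed layout: one <|det|>label [x1, y1, x2, y2]<|/det|> tag,
# then text running up to the next <|det|> tag or the end of the output.
_DET = re.compile(
    r"<\|det\|>\s*(\w+)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*<\|/det\|>"
    r"(.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_grounding(output: str) -> list[Span]:
    """Split grounded output into (label, box, text) spans."""
    spans = []
    for m in _DET.finditer(output):
        box = tuple(int(v) for v in m.group(2, 3, 4, 5))
        spans.append(Span(m.group(1), box, m.group(6).strip()))
    return spans


def strip_boxes(output: str) -> str:
    """Keep only the recognized text, one span per line."""
    return "\n".join(span.text for span in parse_grounding(output))
```

`parse_grounding` gives you the boxes for overlays or layout reconstruction, while `strip_boxes` is the shortcut when only the text matters.
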
Drop the `<|det|>...<|/det|>` tags if you only want the text, or parse them to overlay boxes / build a layout. Without `<|grounding|>` you get plain text (or Markdown) with no box tags.

> **Tip (long documents):** Unlimited-OCR targets *one-shot long-horizon* parsing. For multi-page
> scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a
> mild repetition penalty, e.g. `--repeat-penalty 1.05`, and keep `--temp 0`.

---

## Serving (OpenAI-compatible API)

```bash
./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --chat-template deepseek-ocr -c 8192 --host 0.0.0.0 --port 8080
```

Call it with an image (base64 data URL):

```bash
# -w0 disables line wrapping in the base64 output (GNU coreutils)
IMG=$(base64 -w0 document.png)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'
```

Python (OpenAI SDK) works the same way: point `base_url` at `http://localhost:8080/v1`, then send a `text` part with the prompt above and an `image_url` part with the data URL.

## About the model

- **Architecture:** `DeepseekOCRForCausalLM`, i.e. a *DeepEncoder* vision tower (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → **DeepSeek-V2 MoE** text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
- **Task:** multilingual OCR / document parsing for single images, multi-page documents, and PDFs (one-shot long-horizon parsing). The original supports *gundam* (crop) and *base* resolution modes.
- **License:** MIT (inherited from the base model).

## How these were made

1. Converted `baidu/Unlimited-OCR` to GGUF with the PR #17400 `convert_hf_to_gguf.py`.
   The converter targets DeepSeek-OCR, so the config's top-level `architectures` was set to `DeepseekOCRForCausalLM` and `language_config.architectures` to `DeepseekV2ForCausalLM` (the model's tensor layout is otherwise identical to DeepSeek-OCR's).
2. Exported the text decoder (BF16) and the vision tower (`--mmproj`, F16) separately.
3. Built an importance matrix from a general-text corpus and produced the K-/i-quants with `llama-quantize`.
4. **Verified**: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via `llama-mtmd-cli` before quantizing.

## Limitations

- Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in `main`.
- Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size; prefer **Q4_K_M** or higher for production OCR.
- The vision encoder runs in fp16 regardless of the chosen text quant.

## Credits

- Base model: [baidu/Unlimited-OCR](https://huggingface.co/baidu/Unlimited-OCR) (MIT), which builds on [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR).
- GGUF / DeepSeek-OCR llama.cpp support: [ggml-org/llama.cpp#17400](https://github.com/ggml-org/llama.cpp/pull/17400).
- Quantized by [sahilchachra](https://huggingface.co/sahilchachra).
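
## Appendix: Python client sketch

The Python route mentioned in the Serving section can be sketched as follows. To stay dependency-free, this version uses only the standard library (`urllib`) rather than the OpenAI SDK, so it is a stand-in for the SDK call, not a copy of it. The helper names `build_ocr_request` and `run_ocr` are hypothetical, and the sketch assumes `llama-server` is running on `http://localhost:8080` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import base64
import json
import urllib.request


def build_ocr_request(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build a chat-completions payload with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "temperature": 0,  # deterministic decoding, as recommended for OCR
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
    }


def run_ocr(image_path: str, prompt: str,
            base_url: str = "http://localhost:8080/v1") -> str:
    """POST the payload to a running llama-server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_ocr_request(f.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the server from the Serving section to be up):
# print(run_ocr("document.png", "<|grounding|>Convert the document to markdown."))
```

Keeping payload construction separate from the HTTP call makes the request shape easy to inspect or reuse with another client, including the OpenAI SDK.
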