Unlimited-OCR — NVFP4

NVFP4 (4-bit float) quantization of baidu/Unlimited-OCR, a 3B vision-language OCR model that pushes DeepSeek-OCR one step further (one-shot, long-horizon document parsing). This repo quantizes the DeepSeek-V2 MoE text decoder to NVFP4 while keeping the vision tower in BF16, so it stays a drop-in transformers model.

⚠️ Runtime requirements. This model uses custom remote code: load it with trust_remote_code=True, use transformers 4.57.x, and install compressed-tensors. NVFP4 runs natively on Blackwell GPUs (Jetson Thor, RTX 50-series, B200); on other GPUs, compressed-tensors transparently dequantizes the weights at load time.
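To see which of the two paths applies on a given machine, one option is to inspect the CUDA compute capability. The sketch below assumes that Blackwell-class GPUs report a major compute capability of 10 or higher; that threshold is an assumption to verify for your device, and `has_native_nvfp4` is a hypothetical helper, not part of transformers or compressed-tensors.

```python
def has_native_nvfp4(capability):
    """Return True if a (major, minor) compute capability looks Blackwell-class.

    Assumption: Blackwell GPUs report major >= 10. Check this against the
    documentation for your specific device.
    """
    major, _minor = capability
    return major >= 10


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        path = "native NVFP4" if has_native_nvfp4(cap) else "dequantized at load"
        print(f"compute capability {cap}: expecting {path}")
    else:
        print("No CUDA device visible; cannot check compute capability.")
```

The check is kept as a pure function of the capability tuple so it can be exercised without a GPU.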

This quant


Scheme	`NVFP4A16` · 4-bit float · group 16 · `nvfp4-pack-quantized`
Size	~2.93 GB (vs ~6.67 GB BF16)
Quantized	text-decoder Linears — 2196 modules (2112 experts + 33 shared + 48 attention + 3 dense)
Kept in BF16	vision tower (`sam_model`, `vision_model`), projector, token embeddings, `lm_head`, the MoE router gate, all norms
Method	data-free (`llm-compressor` `model_free_ptq`) — no calibration needed
Quantized by	sahilchachra
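
The module count above can be cross-checked against the architecture figures given later in this card (12 decoder layers, layer 0 dense, 64 routed experts per MoE layer). The sketch below is only a sanity check and rests on assumptions the card does not state: three Linears per MLP (gate, up, down), four per attention block (q, k, v, output), and the shared experts counted as one MLP per MoE layer.

```python
# Architecture figures taken from this model card.
NUM_LAYERS = 12        # decoder layers
NUM_DENSE_LAYERS = 1   # layer 0 is dense, the rest are MoE
ROUTED_EXPERTS = 64    # routed experts per MoE layer

# Assumptions (not stated in the card).
LINEARS_PER_MLP = 3        # gate, up and down projections
LINEARS_PER_ATTENTION = 4  # q, k, v and output projections

moe_layers = NUM_LAYERS - NUM_DENSE_LAYERS

counts = {
    "experts": moe_layers * ROUTED_EXPERTS * LINEARS_PER_MLP,
    # Assumption: the shared experts form a single MLP per MoE layer.
    "shared": moe_layers * LINEARS_PER_MLP,
    "attention": NUM_LAYERS * LINEARS_PER_ATTENTION,
    "dense": NUM_DENSE_LAYERS * LINEARS_PER_MLP,
}
total = sum(counts.values())

print(counts)
print("total quantized Linears:", total)
```

Under these assumptions the per-category counts and the total agree with the figures in the table.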

Quick start

pip install "transformers==4.57.3" compressed-tensors accelerate torch torchvision \
            einops addict easydict matplotlib pillow

import torch
from transformers import AutoModel, AutoTokenizer

repo = "sahilchachra/Unlimited-OCR-NVFP4"

# trust_remote_code=True is required: the model class ships with the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True,
                                  dtype=torch.bfloat16, device_map="cuda").eval()

# infer() is defined by the repo's remote code, not by transformers itself.
text = model.infer(
    tok,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="document.png", output_path="./out",
    base_size=1024, image_size=1024, crop_mode=False,   # "base" mode
    save_results=True, eval_mode=True,
)
print(text)

Prompting guide

Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt must contain <image>; prefix it with <|grounding|> whenever you also want bounding boxes for what was read.

Task	Prompt
Document → Markdown (layout-aware, with boxes)	`<image>\n<|grounding|>Convert the document to markdown.`
Plain text OCR (just the text, no layout)	`<image>\nFree OCR.`
OCR with bounding boxes	`\n<
Native Unlimited-OCR parse	`<image>document parsing.`
Parse a figure / chart / diagram	`<image>\nParse the figure.`
Describe the image (general VQA)	`<image>\nDescribe this image in detail.`
Find specific text (referring grounding)	`\n<
Multi-page / PDF	`<image>Multi page parsing.` via `model.infer_multi(...)`
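
The two prompt rules (always include `<image>`, add `<|grounding|>` when boxes are wanted) can be wrapped in a small helper. This is a minimal sketch; `make_prompt` is a hypothetical convenience function and not part of the model's remote code.

```python
IMAGE_TOKEN = "<image>"
GROUNDING_TOKEN = "<|grounding|>"


def make_prompt(instruction, grounding=False):
    """Build a prompt of the form '<image>\\n[<|grounding|>]instruction'."""
    prefix = IMAGE_TOKEN + "\n"
    if grounding:
        # Ask the model to emit bounding boxes alongside the text.
        prefix += GROUNDING_TOKEN
    return prefix + instruction


plain = make_prompt("Free OCR.")
boxed = make_prompt("Convert the document to markdown.", grounding=True)
print(repr(plain))
print(repr(boxed))
```

Some prompts in the table (for example `<image>document parsing.`) have no newline after `<image>`; pass those strings verbatim rather than through this helper.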

Resolution modes

base — base_size=1024, image_size=1024, crop_mode=False. Good default for normal pages.
gundam — base_size=1024, image_size=640, crop_mode=True. Tiles the page; use for dense or large/high-resolution documents.
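
The two modes differ only in three keyword arguments, so they can be kept as named presets. The sketch below uses the values listed above; `RESOLUTION_MODES` and `resolution_kwargs` are hypothetical names introduced for illustration.

```python
# Presets copied from the two modes described above.
RESOLUTION_MODES = {
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
}


def resolution_kwargs(mode):
    """Return a copy of the keyword arguments for the named resolution mode."""
    try:
        return dict(RESOLUTION_MODES[mode])
    except KeyError:
        known = ", ".join(sorted(RESOLUTION_MODES))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {known}") from None


# Intended use with the quick-start snippet (not executed here):
#   model.infer(tok, prompt=..., image_file=..., output_path=...,
#               save_results=True, eval_mode=True,
#               **resolution_kwargs("gundam"))
print(resolution_kwargs("gundam"))
```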

Understanding the output (grounding tokens)

With <|grounding|>, the model interleaves the recognized text with detection boxes:

<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00

Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want text, or parse them to overlay boxes / rebuild layout. Without <|grounding|> you get plain text (or Markdown) with no box tags.
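
Parsing the tags could look like the sketch below. It assumes the output follows the `<|det|>label [x1, y1, x2, y2]<|/det|>text` shape shown in the example above; real output may contain variations this regular expression does not cover, and `parse_grounded` is a hypothetical helper rather than part of the model's code.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    text: str


# Group 1 is the label, groups 2-5 are the box coordinates, and the
# "text" group runs until the next <|det|> tag or the end of the string.
DET_PATTERN = re.compile(
    r"<\|det\|>\s*(?P<label>\w+)\s*"
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*"
    r"<\|/det\|>(?P<text>.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_grounded(output: str) -> List[Span]:
    """Split grounded OCR output into (label, box, text) spans."""
    spans = []
    for match in DET_PATTERN.finditer(output):
        box = tuple(int(match.group(i)) for i in range(2, 6))
        spans.append(Span(match.group("label"), box, match.group("text").strip()))
    return spans


sample = (
    "<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623\n"
    "<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra\n"
    "<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00\n"
)
spans = parse_grounded(sample)
for span in spans:
    print(span.label, span.box, span.text)

# Text only, with the box tags dropped.
plain_text = "\n".join(span.text for span in spans)
```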

Serving

The original model ships an SGLang wheel and a vLLM path (see the base model card). For quantized serving, a runtime with compressed-tensors support can load the NVFP4 weights directly; otherwise use the transformers snippet above.

About the model

Architecture: UnlimitedOCRForCausalLM (DeepSeek-OCR architecture) — a DeepEncoder vision tower (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token; layer 0 dense).
Task: multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing).
License: MIT (inherited from the base model).

How this was made

NVFP4 was applied with llm-compressor's model_free_ptq, a data-free path that streams the safetensors files and quantizes weights tensor by tensor (no calibration, no model forward pass), so the custom VLM code never needs to run. The vision tower, projector, embeddings, lm_head, MoE router gate, and norms were excluded via ignore patterns and remain in BF16.
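
To illustrate how ignore patterns separate the quantized Linears from the ones left in BF16, here is a minimal sketch using plain regular expressions. The patterns and most module names are hypothetical (only `sam_model`, `vision_model`, and `lm_head` appear in this card); this is neither the actual pattern list used for this repo nor the llm-compressor API.

```python
import re

# Hypothetical ignore patterns; a match means "leave this module in BF16".
IGNORE_PATTERNS = [
    r"sam_model",
    r"vision_model",
    r"projector",
    r"embed_tokens",
    r"lm_head",
    r"mlp\.gate$",  # the MoE router gate, but not an expert's gate_proj
    r"norm",
]


def stays_bf16(module_name, patterns=IGNORE_PATTERNS):
    """Return True if any ignore pattern matches the module name."""
    return any(re.search(pattern, module_name) for pattern in patterns)


# Hypothetical module names, for illustration only.
examples = [
    "model.sam_model.blocks.0.attn.qkv",
    "model.layers.1.mlp.gate",
    "model.layers.1.mlp.experts.0.gate_proj",
    "model.layers.0.self_attn.q_proj",
    "model.layers.3.input_layernorm",
    "lm_head",
]
for name in examples:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'NVFP4'}")
```

Anchoring the router pattern with `$` is the design point: without it, a looser pattern such as `gate` would also exclude every expert's `gate_proj` from quantization.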

Verified

Loaded in transformers and run on a test document — OCR output is identical to BF16, e.g.:

<|det|>title [37, 64, 464, 130]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 480, 329, 540]<|/det|>Total Due: $44.00

Limitations

4-bit weight quantization trades a little accuracy for size; for the highest fidelity, use the original BF16 model. On the documents tested, the NVFP4 OCR output was effectively lossless.
The vision encoder stays BF16 regardless (small, and accuracy-sensitive).
Centered on English and multilingual text; verify critical fields on hard scans.