Unlimited-OCR: AWQ (W4A16)
AWQ 4-bit (W4A16) quantization of baidu/Unlimited-OCR,
a 3B vision-language OCR model that pushes DeepSeek-OCR one step further (one-shot,
long-horizon document parsing). This repo quantizes the DeepSeek-V2 MoE text decoder with
activation-aware scaling (AWQ) while keeping the vision tower in BF16, so it stays a drop-in
transformers model.
⚠️ Runtime requirements. This is custom remote code, so load with
trust_remote_code=True, transformers 4.57.x, and compressed-tensors installed. W4A16 (int4) runs on any CUDA GPU; compressed-tensors handles the 4-bit unpacking at load.
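A quick preflight check can catch a mismatched environment before the weights download. The sketch below is illustrative only: `check_runtime` is a hypothetical helper, not part of this repo, and it only inspects installed package versions through the standard library. It does not verify that a CUDA GPU is present.

```python
from importlib import metadata


def check_runtime(get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means the check passed.

    `get_version` is injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not a library API).
    """
    problems = []
    try:
        tf_version = get_version("transformers")
        # The card asks for the 4.57.x series, so compare major and minor only.
        if tf_version.split(".")[:2] != ["4", "57"]:
            problems.append(f"transformers {tf_version} found, 4.57.x expected")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    try:
        get_version("compressed-tensors")
    except metadata.PackageNotFoundError:
        problems.append("compressed-tensors is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_runtime():
        print("WARNING:", problem)
```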
This quant
| Property | Value |
|---|---|
| Scheme | W4A16 · int4 symmetric · group 128 · pack-quantized |
| Method | AWQ (llm-compressor) — activation-aware, text-calibrated |
| Calibration | 64 × 512-token general-text sequences (text-only forward) |
| Quantized | text-decoder Linears (attention q/k/v/o, all experts + shared gate/up/down, dense gate/up) |
| Kept in BF16 | vision tower (sam_model, vision_model), projector, token embeddings, lm_head, the MoE router gate, all norms, and the single dense layer-0 down_proj (width 6848 not divisible by group 128) |
| Quantized by | sahilchachra |
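To make "int4 symmetric, group 128" concrete, here is a minimal pure-Python sketch of group-wise symmetric quantization with plain round-to-nearest. It illustrates the storage scheme only: it does not perform AWQ's activation-aware scaling, and the scale convention (largest magnitude in a group maps to code 7) is an assumption that may differ from what compressed-tensors uses. It also shows why a width of 6848 cannot be grouped: 6848 = 53 × 128 + 64.

```python
def quantize_group_int4(weights, group_size=128):
    """Round-to-nearest symmetric int4 quantization with one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError(
            f"row width {len(weights)} is not divisible by group size {group_size}"
        )
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumed convention: the largest magnitude in the group maps to code 7.
        # An all-zero group falls back to a scale of 1.0 to avoid dividing by zero.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # int4 codes live in [-8, 7]; clamp after rounding.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_group_int4(codes, scales, group_size=128):
    """Rebuild approximate weights: each code times the scale of its group."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]
```

With this convention the per-weight reconstruction error is at most half a scale step, and calling `quantize_group_int4` on a 6848-wide row raises `ValueError`, which mirrors why the dense layer-0 down_proj stays in BF16.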
Quick start
pip install "transformers==4.57.3" compressed-tensors accelerate torch torchvision \
einops addict easydict matplotlib pillow
import torch
from transformers import AutoModel, AutoTokenizer

repo = "sahilchachra/Unlimited-OCR-AWQ"

# trust_remote_code=True is required: the model class lives in the repo's custom code.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="cuda",
).eval()

# Run OCR on a single image and print the recognized text.
text = model.infer(
    tok,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="document.png",
    output_path="./out",
    base_size=1024, image_size=1024, crop_mode=False,  # "base" mode
    save_results=True, eval_mode=True,
)
print(text)
Prompting guide
Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt must contain <image>;
put <|grounding|> directly before the instruction whenever you also want bounding boxes for what was read.
| Task | Prompt |
|---|---|
| Document → Markdown (layout-aware, with boxes) | <image>\n<\|grounding\|>Convert the document to markdown. |
| Plain text OCR (just the text, no layout) | <image>\nFree OCR. |
| OCR with bounding boxes | ` |
| Native Unlimited-OCR parse | <image>document parsing. |
| Parse a figure / chart / diagram | <image>\nParse the figure. |
| Describe the image (general VQA) | <image>\nDescribe this image in detail. |
| Find specific text (referring grounding) | ` |
| Multi-page / PDF | <image>Multi page parsing. via model.infer_multi(...) |
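The prompts in this table that use a newline share one shape: the image token, a newline, an optional grounding token, then the instruction. A small helper can assemble them consistently. This is a sketch; `build_prompt` is a hypothetical name, not part of the model's remote code.

```python
IMAGE_TOKEN = "<image>"
GROUNDING_TOKEN = "<|grounding|>"


def build_prompt(instruction, grounding=False):
    """Assemble '<image>\\n[<|grounding|>]instruction' (hypothetical helper)."""
    prefix = GROUNDING_TOKEN if grounding else ""
    return f"{IMAGE_TOKEN}\n{prefix}{instruction}"


if __name__ == "__main__":
    # repr() makes the embedded newline visible.
    print(repr(build_prompt("Free OCR.")))
    print(repr(build_prompt("Convert the document to markdown.", grounding=True)))
```

The native prompts listed without a newline (`<image>document parsing.` and `<image>Multi page parsing.`) do not fit this shape, so pass those strings verbatim.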
Resolution modes
- base: base_size=1024, image_size=1024, crop_mode=False. Good default for normal pages.
- gundam: base_size=1024, image_size=640, crop_mode=True. Tiles the page; use for dense or large/high-resolution documents.
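One way to keep these settings in one place is a lookup table keyed by mode name. The values below are copied from the two modes above; `RESOLUTION_MODES` and `resolution_kwargs` are hypothetical names for this sketch.

```python
RESOLUTION_MODES = {
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
}


def resolution_kwargs(mode):
    """Return a fresh dict of keyword arguments for the named mode."""
    try:
        return dict(RESOLUTION_MODES[mode])
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(RESOLUTION_MODES)}"
        ) from None
```

The result can be unpacked into the call from the quick start, for example `model.infer(tok, prompt=..., image_file=..., output_path=..., **resolution_kwargs("gundam"))`.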
Understanding the output (grounding tokens)
With <|grounding|>, the model interleaves the recognized text with detection boxes:
<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text [37, 483, 329, 543]<|/det|>Total Due: $44.00
Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the
coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want
text, or parse them to overlay boxes / rebuild layout. Without <|grounding|> you get plain text
(or Markdown) with no box tags.
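As an illustration, the grounded output can be split into labeled boxes and text with a regular expression. This sketch assumes the output follows exactly the `<|det|>label [x1, y1, x2, y2]<|/det|>text` shape of the sample above. `Span`, `parse_grounded`, and `strip_grounding` are hypothetical helpers, not part of the model's remote code.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    label: str   # e.g. "title" or "text"
    box: tuple   # (x1, y1, x2, y2)
    text: str    # recognized text that follows the tag


# One detection tag followed by its text, up to the next tag or the end of the string.
DET_PATTERN = re.compile(
    r"<\|det\|>\s*(?P<label>[^\[\]]*?)\s*"
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*<\|/det\|>"
    r"(?P<text>.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_grounded(output):
    """Return a list of Span(label, box, text) found in grounded model output."""
    spans = []
    for match in DET_PATTERN.finditer(output):
        # Groups 2..5 are the four integer coordinates.
        box = tuple(int(match.group(i)) for i in range(2, 6))
        spans.append(Span(match.group("label"), box, match.group("text").strip()))
    return spans


def strip_grounding(output):
    """Drop every <|det|>...<|/det|> tag and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output, flags=re.DOTALL)
```

`parse_grounded` is the starting point for overlaying boxes or rebuilding layout, while `strip_grounding` covers the text-only case.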
Serving
The original model ships an SGLang wheel and a vLLM path (see the
base model card). W4A16 / compressed-tensors
weights load directly in runtimes with compressed-tensors support (e.g. vLLM); otherwise use the
transformers snippet above.
About the model
- Architecture: UnlimitedOCRForCausalLM (DeepSeek-OCR architecture). A DeepEncoder vision tower (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token; layer 0 dense).
- Task: multilingual OCR and document parsing for single images, multi-page input, and PDFs (one-shot long-horizon parsing).
- License: MIT (inherited from the base model).
How this was made
Unlimited-OCR is custom remote code whose forward only runs the vision tower when images are
passed, so AWQ calibration feeds text only (images=None), exercising the pure DeepSeek-V2
decoder. Per-layer AWQ mappings were built from the live module tree (attention
input_layernorm→q,k,v and v→o; MoE post_attention_layernorm→ every expert + shared-expert
gate/up, plus per-expert up→down). The fx-based "sequential" pipeline can't trace this custom
model, so the basic pipeline (real end-to-end forward + activation hooks) was used.
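The mapping structure described above can be sketched as plain (smooth layer, balance layers) name pairs. This is not the script used for this repo. The module names follow the usual Hugging Face DeepSeek-V2 layout (`self_attn.q_proj`, `mlp.experts.N`, `mlp.shared_experts`), which is an assumption here. Applying up→down only to routed experts and skipping MLP mappings for the dense layer are also choices made for this illustration.

```python
def build_awq_mappings(num_layers, num_routed_experts, dense_layers=(0,)):
    """Return (smooth_layer, balance_layers) name pairs for each decoder layer.

    Hypothetical helper: names are assumed, not read from the live module tree.
    """
    mappings = []
    for i in range(num_layers):
        base = f"model.layers.{i}"
        attn = f"{base}.self_attn"
        # Attention: input_layernorm -> q, k, v, then v -> o.
        mappings.append(
            (f"{base}.input_layernorm", [f"{attn}.{p}_proj" for p in ("q", "k", "v")])
        )
        mappings.append((f"{attn}.v_proj", [f"{attn}.o_proj"]))
        if i in dense_layers:
            # Only attention and MoE mappings are described, so dense MLPs are skipped.
            continue
        mlp = f"{base}.mlp"
        experts = [f"{mlp}.experts.{e}" for e in range(num_routed_experts)]
        blocks = experts + [f"{mlp}.shared_experts"]
        # MoE: post_attention_layernorm -> gate/up of every expert and the shared expert.
        mappings.append(
            (
                f"{base}.post_attention_layernorm",
                [f"{b}.{p}_proj" for b in blocks for p in ("gate", "up")],
            )
        )
        # Per-expert up -> down.
        for expert in experts:
            mappings.append((f"{expert}.up_proj", [f"{expert}.down_proj"]))
    return mappings
```

For the decoder described in this card (12 layers, 64 routed experts, layer 0 dense), `build_awq_mappings(12, 64)` yields 739 pairs: 2 attention pairs for each of the 12 layers, plus 65 MoE pairs for each of the 11 MoE layers.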
Verified
Loaded in transformers and run on a test document — OCR output matches BF16, e.g.:
<|det|>title [37, 64, 464, 130]<|/det|>INVOICE #2026-0623
<|det|>text [37, 480, 329, 540]<|/det|>Total Due: $44.00
Limitations
- 4-bit weights trade a little accuracy for size; for the highest fidelity use the original BF16 model. On the test document used for this card (see Verified), the AWQ build's OCR output matched BF16.
- The vision encoder and MoE router stay BF16 (small, accuracy-sensitive).
- Focused on English and multilingual text; verify critical fields on hard scans.
Other formats
- NVFP4: sahilchachra/Unlimited-OCR-NVFP4
- GGUF (llama.cpp): sahilchachra/Unlimited-OCR-GGUF
- MLX: sahilchachra/unlimited-ocr-8bit-mlx
Credits
Base model baidu/Unlimited-OCR (MIT), built on DeepSeek-OCR. Quantized with llm-compressor. License: MIT.