Instructions for using Wigtn/Qwen3-VL-2B-WigtnOCR with libraries and local apps.

Libraries

How to use Wigtn/Qwen3-VL-2B-WigtnOCR with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Wigtn/Qwen3-VL-2B-WigtnOCR")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Wigtn/Qwen3-VL-2B-WigtnOCR")
model = AutoModelForMultimodalLM.from_pretrained("Wigtn/Qwen3-VL-2B-WigtnOCR")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Wigtn/Qwen3-VL-2B-WigtnOCR with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Wigtn/Qwen3-VL-2B-WigtnOCR"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Wigtn/Qwen3-VL-2B-WigtnOCR",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Wigtn/Qwen3-VL-2B-WigtnOCR

SGLang

How to use Wigtn/Qwen3-VL-2B-WigtnOCR with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Wigtn/Qwen3-VL-2B-WigtnOCR" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Wigtn/Qwen3-VL-2B-WigtnOCR",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Wigtn/Qwen3-VL-2B-WigtnOCR" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Wigtn/Qwen3-VL-2B-WigtnOCR",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Wigtn/Qwen3-VL-2B-WigtnOCR with Docker Model Runner:
```
docker model run hf.co/Wigtn/Qwen3-VL-2B-WigtnOCR
```



metadata

license: apache-2.0
language:
  - ko
  - en
  - zh
library_name: transformers
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
  - document-parsing
  - document-intelligence
  - ocr
  - vlm
  - vision-language-model
  - lora
  - distillation
  - korean
  - qwen3-vl
  - structure-preserving
  - rag
  - government-document
  - table-extraction
pipeline_tag: image-text-to-text
datasets:
  - Wigtn/KoGovDoc-Bench
metrics:
  - teds
  - hit@1
model-index:
  - name: WigtnOCR-2B
    results:
      - task:
          type: document-parsing
        dataset:
          name: OmniDocBench
          type: opendatalab/OmniDocBench
        metrics:
          - name: Text NED
            type: ned
            value: 0.288
          - name: Table TEDS
            type: teds
            value: 0.649
          - name: Table TEDS-S
            type: teds
            value: 0.732
          - name: Formula CDM F1
            type: f1
            value: 0.884
          - name: Reading Order NED
            type: ned
            value: 0.211
      - task:
          type: document-parsing
        dataset:
          name: KoGovDoc-Bench
          type: Wigtn/KoGovDoc-Bench
        metrics:
          - name: NED
            type: ned
            value: 0.285
          - name: Hit@1
            type: accuracy
            value: 0.739
          - name: MRR@10
            type: mrr
            value: 0.788

WigtnOCR-2B: Pseudo-Label Distillation for Structure-Preserving Document Parsing

Built by WIGTN Crew

A 2B VLM distilled from a 30B teacher that matches the teacher's document parsing quality on most metrics and achieves #1 retrieval (Hit@1, MRR@10) among 6 parsers on Korean government documents.

⭐️ Base Model: Qwen3-VL-2B-Instruct
⭐️ Dataset: huggingface.co/datasets/Wigtn/KoGovDoc-Bench
⭐️ GitHub: github.com/Hyeongseob91/research-vlm-based-document-parsing

Key Features

30B → 2B Distillation: Matches or exceeds the 30B teacher on 4 of 5 OmniDocBench metrics via quality-filtered pseudo-labeling
Table TEDS +12.6pp: Surpasses the teacher on table structure recognition (0.649 vs 0.523) through selective training on high-quality GT
#1 Retrieval: Best Hit@1 (0.739) and MRR@10 (0.788) among 6 parsers, evidence that structure-preserving parsing helps RAG retrieval
Korean Government Documents: Optimized for complex Korean government layouts (tables, forms, multi-column)
Production-Ready: 2B parameters, small enough to serve on a single GPU via vLLM

Highlights

Category	Metric	WigtnOCR-2B	vs 30B Teacher	vs PaddleOCR
Parsing	Text NED ↓	0.288	-0.001 (matches)	—
Tables	Table TEDS ↑	0.649	+12.6pp	—
Retrieval	Hit@1 ↑	0.739	+2.3pp	+22.7pp
Retrieval	MRR@10 ↑	0.788	+1.7pp	+19.6pp
Reliability	Skip Rate ↓	5.8%	-13.0pp from base	—

Quick Start

Transformers (Direct Inference)

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# The Auto class resolves the Qwen3-VL architecture from the checkpoint config
# (requires a transformers release with Qwen3-VL support).
model = AutoModelForImageTextToText.from_pretrained(
    "Wigtn/Qwen3-VL-2B-WigtnOCR",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Wigtn/Qwen3-VL-2B-WigtnOCR")

# One rendered document page (the card recommends 200 DPI).
image = Image.open("document_page.png")

messages = [
    {"role": "system", "content": "You are WigtnOCR, a document parser. Convert the document image to well-structured Markdown."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this document page to Markdown. Preserve all headings, tables, formulas, and reading order."},
    ]},
]

# Render the chat prompt as text, then tokenize it together with the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off).
output_ids = model.generate(**inputs, max_new_tokens=4096)
output = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(output)

vLLM (Production Serving)

vllm serve Wigtn/Qwen3-VL-2B-WigtnOCR \
    --max-model-len 16384 \
    --trust-remote-code

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

with open("document_page.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Wigtn/Qwen3-VL-2B-WigtnOCR",
    messages=[
        {"role": "system", "content": "You are WigtnOCR, a document parser. Convert the document image to well-structured Markdown."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "Convert this document page to Markdown."},
        ]},
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)
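
When many pages are sent to the server, the request payload above can be factored into a small helper. The following is a minimal sketch, assuming page images stored on disk; `SYSTEM_PROMPT`, `to_data_url`, and `build_messages` are illustrative names and not part of WigtnOCR, vLLM, or the OpenAI client.

```python
import base64
import mimetypes
from pathlib import Path

# Same system prompt as in the request above.
SYSTEM_PROMPT = (
    "You are WigtnOCR, a document parser. "
    "Convert the document image to well-structured Markdown."
)


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_messages(image_path: str,
                   prompt: str = "Convert this document page to Markdown.") -> list:
    """Build the chat messages for one page image (hypothetical helper)."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to PNG.
    mime_type = mimetypes.guess_type(path.name)[0] or "image/png"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": to_data_url(path.read_bytes(), mime_type)}},
            {"type": "text", "text": prompt},
        ]},
    ]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`, one call per page.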

OmniDocBench Results

Evaluated on OmniDocBench (CVPR 2025) — 1,355 pages across 9 document types.

Metric	Qwen3-VL-2B	WigtnOCR-2B	Qwen3-VL-30B	Marker	Direction
Text NED	0.364	0.288	0.289	0.218	lower=better
Table TEDS	0.561	0.649	0.523	0.586	higher=better
Table TEDS-S	0.667	0.732	0.657	0.658	higher=better
Formula CDM F1	0.865	0.884	0.939	0.863	higher=better
Formula ExpRate	0.504	0.600	0.692	0.582	higher=better
Reading Order NED	0.300	0.211	0.227	0.165	lower=better
Skip Rate	18.8%	5.8%	5.5%	0.4%	lower=better

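The student matches or exceeds the 30B teacher on 4 of the 5 headline metrics (Text NED, Table TEDS, Table TEDS-S, Reading Order NED) and trails on formula recognition (CDM F1 0.884 vs 0.939). Table TEDS exceeds the teacher by 12.6pp, suggesting that quality-filtered distillation provides a stronger training signal than the teacher's average output.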
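
The percentage-point comparison can be reproduced from the table with a few lines of arithmetic. This is a sketch under one assumption: the five "categories" are taken to be the five metrics listed in the model-index metadata (Text NED, Table TEDS, Table TEDS-S, Formula CDM F1, Reading Order NED). The function and variable names are illustrative.

```python
# (student, teacher, higher_is_better), copied from the OmniDocBench table.
SCORES = {
    "Text NED": (0.288, 0.289, False),
    "Table TEDS": (0.649, 0.523, True),
    "Table TEDS-S": (0.732, 0.657, True),
    "Formula CDM F1": (0.884, 0.939, True),
    "Reading Order NED": (0.211, 0.227, False),
}


def gain_pp(student: float, teacher: float, higher_is_better: bool) -> float:
    """Student's advantage in percentage points (positive = student is better)."""
    diff = student - teacher if higher_is_better else teacher - student
    return round(diff * 100, 1)


gains = {name: gain_pp(*values) for name, values in SCORES.items()}
wins = [name for name, gain in gains.items() if gain >= 0]

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}pp")
print(f"student matches or exceeds teacher on {len(wins)}/{len(gains)} metrics")
```

Under that assumption the script reports a 12.6pp gain on Table TEDS and 4 of 5 metrics at or above the teacher.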

KoGovDoc Retrieval Results

Semantic chunking (BGE-M3) → FAISS retrieval on KoGovDoc-Bench — 294 val pages, 564 queries, 6 parsers compared.

Model	Type	Hit@1 ↑	Hit@5 ↑	MRR@10 ↑	nDCG@10 ↑
WigtnOCR-2B	VLM (ours)	0.739	0.855	0.788	0.437
Qwen3-VL-30B	VLM (teacher)	0.716	0.839	0.771	0.411
Marker	PDF parser	0.711	0.853	0.771	0.412
Qwen3-VL-2B	VLM (base)	0.709	0.814	0.756	0.444
MinerU	PDF parser	0.608	0.789	0.682	0.384
PaddleOCR	Pure OCR	0.512	0.693	0.592	0.293

WigtnOCR-2B ranks #1 in Hit@1, Hit@5, and MRR@10 (the base Qwen3-VL-2B leads on nDCG@10, 0.444 vs 0.437), which suggests that structured VLM parsing improves RAG retrieval over traditional OCR pipelines.
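
For readers unfamiliar with the retrieval metrics, here is a minimal sketch of the standard Hit@k and MRR@k definitions. It is not the benchmark's evaluation code: how KoGovDoc-Bench marks a chunk as relevant is not stated here, so the chunk ids and relevance sets below are hypothetical toy data.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return float(any(chunk_id in relevant_ids for chunk_id in ranked_ids[:k]))


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant chunk within the top k, else 0.0."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_hit=1, k_mrr=10):
    """Average Hit@k_hit and MRR@k_mrr over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    hit = sum(hit_at_k(ranked, rel, k_hit) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank_at_k(ranked, rel, k_mrr) for ranked, rel in runs) / n
    return hit, mrr


# Toy data: three queries with hypothetical chunk ids.
runs = [
    (["a", "b", "c"], {"a"}),  # relevant chunk at rank 1
    (["x", "y", "z"], {"y"}),  # relevant chunk at rank 2
    (["p", "q"], {"r"}),       # relevant chunk not retrieved
]
hit1, mrr10 = evaluate(runs)
print(f"Hit@1={hit1:.3f}  MRR@10={mrr10:.3f}")
```

On the toy data, one of three queries has its relevant chunk at rank 1, so Hit@1 is 1/3, and the reciprocal ranks 1, 1/2, and 0 average to an MRR@10 of 0.5.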

BC vs. Retrieval: An Interesting Finding

Chunk quality, measured by Boundary Clarity (BC) and Chunk Stickiness (CS) from the MoC framework, does not predict retrieval performance in this comparison.

Model	BC ↑	CS ↓	Hit@1 ↑
MinerU	0.735	2.711	0.608 (5th)
WigtnOCR-2B	0.706	2.859	0.739 (1st)
PaddleOCR	0.654	3.420	0.512 (6th)

MinerU produces the cleanest chunk boundaries (highest BC, lowest CS) but ranks 5th in retrieval. In this comparison, text richness and structural fidelity appear to matter more than boundary quality for end-to-end RAG.

KoGovDoc Parsing Quality

Model	NED ↓	Evaluated
WigtnOCR-2B	0.285	289/294
Qwen3-VL-30B (Teacher)	0.334	294/294
Qwen3-VL-2B (Base)	0.390	294/294

WigtnOCR-2B achieves a lower NED than its 30B teacher on Korean government documents (0.285 vs 0.334), though its score covers 289 of the 294 pages.
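
NED (normalized edit distance) can be illustrated with a short pure-Python sketch. It assumes character-level Levenshtein distance divided by the length of the longer string; the benchmark's exact normalization and text preprocessing may differ, so this is not the evaluation code behind the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ned(prediction: str, reference: str) -> float:
    """Edit distance normalized to [0, 1]; 0.0 means an exact match."""
    if not prediction and not reference:
        return 0.0
    return levenshtein(prediction, reference) / max(len(prediction), len(reference))


print(levenshtein("kitten", "sitting"))  # 3
print(ned("abc", "abd"))
```

Lower is better: a single wrong character in a three-character string already gives an NED of 1/3.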

Ablation Study

Config	LoRA r	Epochs	Text NED ↓	Table TEDS ↑	TEDS-S ↑	CDM F1 ↑	RO NED ↓	Skip % ↓	Verdict
v1 (final)	8	3	0.288	0.649	0.732	0.884	0.211	5.8%	Best overall
v2-best	32	3	0.309	0.600	0.697	—	0.215	0.7%	Table regression
v2-last	32	5	0.306	0.610	0.695	0.892	0.214	0.0%	Overfitting on text

Key findings:

LoRA rank 8 outperforms rank 32: the larger adapter regresses on table structure (-4.9pp TEDS for v2-best) despite marginally better formula recognition (CDM F1 0.892 vs 0.884 for v2-last)
3 epochs is optimal: training for 5 epochs overfits (eval loss rises after epoch 3)
v2 reduces the skip rate to 0.7% (v2-best) or 0.0% (v2-last), at the cost of core parsing quality
v1 was selected as the final model for its superior table and text quality, which matter most for downstream RAG

Training Details

Parameter	Value
Base model	Qwen3-VL-2B-Instruct
Teacher	Qwen3-VL-30B-A3B-Instruct (FP8)
Judge	Qwen3.5-122B-A10B-NVFP4 (text-only, 5-dim scoring)
Method	LoRA (rank=8, alpha=32, target=all linear layers)
Training samples	2,667 (filtered from 4,501 pages, score ≥ 3/5)
Validation samples	294 (held out)
Training time	31 minutes
Framework	ms-swift + DeepSpeed ZeRO-2
Epochs	3
Learning rate	1e-4
Batch size	1 (gradient accumulation 8)
Hardware	2 × NVIDIA RTX PRO 6000 (98GB each)
Trainable params	8.7M (0.4% of total)
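
The adapter size and batch settings in the table follow from simple arithmetic, sketched below. Several things are assumptions rather than facts from this card: the 2048-wide layer is a hypothetical example (not the model's real layer shape), the `alpha / rank` scaling is the standard LoRA convention, and the effective batch size assumes plain data parallelism across both GPUs.

```python
RANK, ALPHA = 8, 32            # from the table: rank=8, alpha=32
PER_DEVICE_BATCH = 1           # from the table
GRAD_ACCUM_STEPS = 8           # from the table
NUM_GPUS = 2                   # from the table


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


# Standard LoRA scaling factor applied to the low-rank update.
scaling = ALPHA / RANK

# Hypothetical 2048 x 2048 projection, for illustration only.
example_layer = lora_params(2048, 2048, RANK)

# Assumes data parallelism: every GPU processes its own micro-batches.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS

# Total parameter count implied by "8.7M (0.4% of total)".
implied_total = 8.7e6 / 0.004

print(f"scaling={scaling}, example layer adds {example_layer} params")
print(f"effective batch size={effective_batch}")
print(f"implied total params ~ {implied_total / 1e9:.1f}B")
```

The implied total of roughly 2.2B parameters is consistent with a 2B-class base model.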

Training Data

Dataset	Documents	Pages	Language	Source
KoGovDoc	10	3,637	Korean	Government publications
ArXivPapers	39	864	English	arXiv (cs.CL, cs.CV, cs.LG)
Total	49	4,501	Bilingual	—

Ground truth (GT) was generated by Qwen3-VL-30B and validated by Qwen3.5-122B, with a 74–75% pass rate. Quality filtering removes hallucinations, repetitions, and chain-of-thought contamination.
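
The filtering step can be pictured as a threshold on the judge score plus simple text heuristics. The sketch below is illustrative only: the record fields (`score`, `markdown`), the repeated-line heuristic, and its 0.5 ratio are hypothetical, and it does not attempt the hallucination or chain-of-thought checks. Only the score threshold (≥ 3 out of 5) comes from this card.

```python
from collections import Counter

MIN_SCORE = 3  # from the training details: keep pages scored >= 3/5


def has_excessive_repetition(text: str, max_repeat_ratio: float = 0.5) -> bool:
    """Naive heuristic: flag text where one line makes up most of the output."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 4:
        return False  # too short to judge
    most_common_count = Counter(lines).most_common(1)[0][1]
    return most_common_count / len(lines) > max_repeat_ratio


def keep_sample(sample: dict) -> bool:
    """Keep a pseudo-labelled page only if it passes both checks."""
    return (sample["score"] >= MIN_SCORE
            and not has_excessive_repetition(sample["markdown"]))


# Hypothetical judged samples.
samples = [
    {"id": "p1", "score": 4, "markdown": "# Title\n\nBody text\n\n| a | b |"},
    {"id": "p2", "score": 2, "markdown": "# Title\n\nGarbled"},
    {"id": "p3", "score": 5, "markdown": "loop\nloop\nloop\nloop\nend"},
]
kept = [s["id"] for s in samples if keep_sample(s)]
print(kept)  # ['p1']
```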

Evaluation Stack

Component	Tool	Purpose
Preprocessing	PyMuPDF	PDF → page images (200 DPI)
Chunking	BGE-M3 (semantic)	Embedding-based boundary detection
BC/CS Metrics	Qwen2.5-1.5B	Perplexity computation (MoC, ACL 2025)
Embedding	BAAI/bge-m3	Chunk → vector
Retrieval	FAISS	Cosine similarity search
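
The retrieval step amounts to ranking chunk vectors by cosine similarity to the query vector. The sketch below is a pure-Python stand-in for the FAISS search, not the FAISS API itself, and it uses tiny hand-written vectors in place of BGE-M3 embeddings. It relies on the fact that cosine similarity equals the inner product of L2-normalized vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, chunk_vectors, k=10):
    """Return [(chunk_index, cosine_similarity)] for the k most similar chunks."""
    q = l2_normalize(query)
    scored = []
    for index, vec in enumerate(chunk_vectors):
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), index))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]


# Toy 2-dimensional "embeddings" (hypothetical values).
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([2.0, 0.1], chunks, k=3)
print([index for index, _ in results])  # [0, 2, 1]
```

The ranked chunk indices returned here are the kind of input that Hit@k and MRR@k are computed from.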

Intended Use

Korean government document digitization and parsing
RAG pipeline preprocessing (PDF → structured Markdown → chunks → retrieval)
Academic paper parsing (tables, formulas, reading order)
Bilingual (Korean + English) document processing

Limitations

Optimized for Korean and English; other languages may have reduced quality
Formula recognition still trails 30B teacher (CDM F1: 0.884 vs 0.939)
Best results at 200 DPI; lower resolution degrades quality
Skip rate 5.8% — some complex pages may fail (v2 achieves 0% but with quality trade-offs)

Example Output

Comparison on a complex Korean government document page (kogov_001 p.9 — survey tables + statistical charts + mixed layout).

	30B Teacher	WigtnOCR-2B (Ours)
Charts	`[Figure: ...]` placeholder	Extracts data into tables
Content	1,582 chars	1,912 chars (+21%)
Tables	3 tables	4 tables (chart → table)

PDF Original

30B Teacher Output (Qwen3-VL-30B) — 1,582 chars

- 지역 주민 의견 및 수요

## [군민 설문조사] 군민 478명 대상 설문조사로 도시문제 도출
- 군민 대상 설문조사 사항

| No. | 설문 항목 |
|-----|-----------|
| Q1 | 성별 / 연령 / 지역 / 불편사항 |
| Q2 | 안전 / 환경 / 에너지 / 교통 / 산업 / 행정 / 복지 / 문화 / 관광 / 농업 / 교육 |
| Q3 | 스마트도시 요소 / 지역 / 서비스 / 리빙랩 |

### - 군민 설문결과

[Figure: 보다 안전한 부여를 위해 개선해야 할 문제]
[Figure: 스마트도시 우선도입 서비스]

자료 : 부여군 스마트도시계획(2023)

## [농어업인 복지실태조사] 생활안전 개선을 위해 필요한 사항 설문결과

| 특성 | 도로안전시설 | 보행자길 정비 | 가로등 확충 | CCTV 설치 | 주민 방범 순찰 | 노후시설 | 안심 귀가 서비스 | 기타 |
|------|-------------|-------------|------------|----------|--------------|---------|----------------|------|
| 농어촌 | 10.1 | 21.0 | 23.1 | 25.7 | 8.1 | 8.2 | 3.4 | 0.3 |
| 읍 | 10.7 | 20.8 | 20.5 | 28.1 | 8.4 | 7.2 | 4.2 | 0.1 |
| 면 | 9.5 | 21.2 | 25.8 | 23.3 | 7.8 | 9.3 | 2.7 | 0.4 |
| 농어가 | 8.7 | 22.3 | 23.2 | 23.1 | 7.9 | 12.1 | 2.5 | 0.2 |
| 비농어가 | 10.6 | 20.5 | 23.1 | 26.6 | 8.2 | 6.9 | 3.7 | 0.3 |
| 30대 이하 | 14.6 | 16.5 | 27.6 | 25.2 | 6.4 | 5.8 | 3.6 | 0.2 |
| 40대 | 6.3 | 20.1 | 19.6 | 33.1 | 10.9 | 4.6 | 5.1 | 0.2 |
| 50대 | 10.8 | 19.4 | 23.0 | 27.2 | 6.8 | 8.4 | 4.1 | 0.3 |
| 60대 | 10.5 | 22.9 | 22.8 | 23.4 | 7.2 | 10.2 | 2.6 | 0.4 |
| 70대 이상 | 9.9 | 23.5 | 24.0 | 21.1 | 8.7 | 10.4 | 2.2 | 0.2 |

자료 : 농촌진흥청 2023 농어업인등에 대한 복지실태조사

| 구분 | 도시문제 | 주민 수요 | 수요 주민 |
|------|----------|-----------|-----------|
| 복지 | 독거노인 돌봄 | - 부여군 보건복지 분야 개선사항으로 지적 | 70대 남성 |
| 복지 | 독거노인 돌봄 | - 인공지능 돌봄서비스 시범 사용 희망 | 60대 여성 |
| 복지 | 시설노후화 | - 부여군 읍면 경로당 내 시설 노후화 | 80대 남성 |
| 복지 | 여가 콘텐츠 부족 | - 경로당 내 여가 콘텐츠 부족 | 60대 여성 |
| 안전 | 안전 인프라 부족 | - 부여시장 근교 노인 보행자 교통사고 위험 높음 | 60대 남성 |
| 관광 | 관광 콘텐츠 부족 | - 수년 동안 업데이트되지 않은 박물관 내 콘텐츠 | 50대 남성 |

WigtnOCR-2B Output (Ours) — 1,912 chars

- 지역 주민 의견 및 수요

[균민 설문조사] 균민 478명 대상 설문조사로 도시문제 도출
- 균민 대상 설문조사 사항

| No. | 설문 항목 |
| --- | --- |
| Q1 | 성별 / 연령 / 지역 / 불편사항 |
| Q2 | 안전 / 환경 / 에너지 / 교통 / 산업 / 행정 / 보건 / 복지 / 문화 / 관광 / 농업 / 교육 |
| Q3 | 스마트도시 요소 / 지역 / 서비스 / 리빙랩 |

- 균민 설문결과

| 보다 안전한 부여를 위해 개선해야 할 문제 | 스마트도시 우선도입 서비스 |
| --- | --- |
| 시설 노후화 | 34.1% |
| 교통사고 다발구간 | 13.7% |
| 자연재해감시 | 12.8% |
| 심야시간 범죄 | 10.0% |
| 통학 안전 | 9.3% |
| 인재 | 8.2% |
| 재난 예경보 | 8.7% |
| 기타 | 3.4% |
| 스마트 보건/의료/복지 | 17.4% |
| 스마트 교통 | 15.7% |
| 스마트 환경/에너지/수자원 | 10.5% |
| 스마트 문화/관광/스포츠 | 10.1% |
| 스마트 근로/고용 | 9.9% |
| 스마트 행정 | 8.9% |
| 스마트 교육 | 7.6% |
| 스마트 방법/방재 | 6.4% |
| 스마트 시설물관리 | 4.5% |
| 스마트 주거 | 3.2% |
| 스마트 물류 | 2.8% |
| 기타 | 2.9% |

자료 : 부여군 스마트도시계획(2023)

[농어업인 복지실례조사] 생활안전 개선을 위해 필요한 사항 설문결과

| 특성 | 도로안전시설 | 보행자길 정비 | 가로등 확충 | CCTV 설치 | 주민 방법순찰 | 노후시설 | 안심 귀가 서비스 | 기타 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 농어촌 | 10.1 | 21.0 | 23.1 | 25.7 | 8.1 | 8.2 | 3.4 | 0.3 |
| 읍 | 10.7 | 20.8 | 20.5 | 28.1 | 8.4 | 7.2 | 4.2 | 0.1 |
| 면 | 9.5 | 21.2 | 25.8 | 23.3 | 7.8 | 9.3 | 2.7 | 0.4 |
| 농어가 | 8.7 | 22.3 | 23.2 | 23.1 | 7.9 | 12.1 | 2.5 | 0.2 |
| 비농어가 | 10.6 | 20.5 | 23.1 | 26.6 | 8.2 | 6.9 | 3.7 | 0.3 |
| 30대 이하 | 14.6 | 16.5 | 27.6 | 25.2 | 6.4 | 5.8 | 3.6 | 0.2 |
| 40대 | 6.3 | 20.1 | 19.6 | 33.1 | 10.9 | 4.6 | 5.1 | 0.2 |
| 50대 | 10.8 | 19.4 | 23.0 | 27.2 | 6.8 | 8.4 | 4.1 | 0.3 |
| 60대 | 10.5 | 22.9 | 22.8 | 23.4 | 7.2 | 10.2 | 2.6 | 0.4 |
| 70대 이상 | 9.9 | 23.5 | 24.0 | 21.1 | 8.7 | 10.4 | 2.2 | 0.2 |

자료 : 농촌진흥청 2023 농어업인등에 대한 복지실례조사

| 구분 | 도시문제 | 주민 수요 | 수요 주민 |
| --- | --- | --- | --- |
| 복지 | 독거노인 돌봄 | - 부여군 보건복지 분야 개선사항으로 지적 | 70대 남성 |
| 복지 | 독거노인 돌봄 | - 인공지능 돌봄서비스 시범 사용 호평 | 60대 여성 |
| 복지 | 시설노후화 | - 부여군 읍면 경로당 내 시설 노후화 | 80대 남성 |
| 복지 | 여가 콘텐츠 부족 | - 경로당 내 여가 콘텐츠 부족 | 60대 여성 |
| 안전 | 안전 인프라 부족 | - 부여시장 근교 노인 보행자 교통사고 위험 높음 | 60대 남성 |
| 관광 | 관광 콘텐츠 부족 | - 수년 동안 업데이트되지 않은 박물관 내 콘텐츠 | 50대 남성 |

Key difference: The 30B teacher replaces charts with [Figure: ...] placeholders, while WigtnOCR-2B extracts the actual data from charts into structured markdown tables — producing 21% more content from the same page.
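
The table counts in the comparison can be checked programmatically by counting header separator rows such as `| --- | --- |`. This is a minimal sketch that assumes pipe-style Markdown tables. `count_tables` is an illustrative helper rather than part of the evaluation stack, and how the character counts above were measured is not specified.

```python
import re

# A separator row consists only of pipes, dashes, optional colons, and spaces.
SEPARATOR_ROW = re.compile(
    r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$"
)


def count_tables(markdown: str) -> int:
    """Count pipe tables by their separator rows (one per table)."""
    return sum(
        1
        for line in markdown.splitlines()
        if "|" in line and SEPARATOR_ROW.match(line)
    )


sample = (
    "| a | b |\n"
    "| --- | --- |\n"
    "| 1 | 2 |\n"
    "\n"
    "Some text\n"
    "\n"
    "| x |\n"
    "|-----|\n"
    "| y |\n"
)
print(count_tables(sample))  # 2
```

Requiring a pipe on the line keeps plain horizontal rules (`---`) from being counted as tables.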

📎 Citation

If you use WigtnOCR in your research, please cite:

@software{wigtnocr2026,
  title   = {WigtnOCR: VLM-based Korean Government Document Parser using Teacher-Student Pseudo-GT Pipeline},
  author  = {WIGTN Crew},
  year    = {2026},
  url     = {https://huggingface.co/Wigtn/Qwen3-VL-2B-WigtnOCR}
}

🏢 About WIGTN Crew

WIGTN Crew is an AI-native open-source research crew based in Korea.
We build practical, domain-specialized AI tools — starting with document intelligence for Korean government documents.

🌐 Website: https://wigtn.com
🐙 GitHub: https://github.com/wigtn
🤗 HuggingFace: https://huggingface.co/Wigtn