
ilessio-aiflowlab committed on Mar 19

Commit f2760c2 (verified), 1 parent: 20ee228

Add Qwen3-VL-4B-Instruct INT4 NF4 quantized (2.7GB)

Files changed (9)

.gitattributes +1 -0
README.md +205 -0
chat_template.jinja +120 -0
config.json +81 -0
generation_config.json +14 -0
model.safetensors +3 -0
processor_config.json +63 -0
tokenizer.json +3 -0
tokenizer_config.json +30 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,205 @@

+---
+license: apache-2.0
+base_model: Qwen/Qwen3-VL-4B-Instruct
+tags:
+  - robotics
+  - edge-deployment
+  - anima
+  - forge
+  - int4
+  - nf4
+  - quantized
+  - qwen3
+  - vision-language
+  - multimodal
+  - text-generation
+  - image-text-to-text
+  - instruction-following
+  - ros2
+  - jetson
+  - real-time
+library_name: transformers
+pipeline_tag: image-text-to-text
+model-index:
+  - name: qwen3-vl-4b-instruct-int4
+    results:
+      - task:
+          type: image-text-to-text
+        metrics:
+          - name: Model Size (MB)
+            type: model_size
+            value: 2741
+          - name: Compression Ratio
+            type: compression
+            value: 3.1
+          - name: Original Size (MB)
+            type: original_size
+            value: 8465
+---
+# Qwen3-VL-4B-Instruct — INT4 NF4 Quantized
+> Alibaba's latest Qwen3-VL-4B-Instruct quantized to 4-bit NF4 with double quantization for high-quality robotic visual reasoning. **3.1x smaller** — from 8.5 GB to 2.7 GB — delivering stronger visual understanding than the 2B variant while still fitting on edge GPUs.
+This model is part of the **[RobotFlowLabs](https://huggingface.co/robotflowlabs)** model library, built for the **ANIMA** agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
+## Why This Model Exists
+When robotic tasks demand higher visual reasoning quality — complex scene descriptions, multi-step visual planning, or precise spatial grounding — the 4B variant provides a significant accuracy boost over the 2B. Qwen3-VL-4B features a deeper language model (36 layers vs 28) with wider hidden dimensions (2560 vs 2048), delivering better performance on visual grounding, counting, and reasoning benchmarks. At 2.7 GB quantized, it fits on an L4 24GB alongside a vision encoder and action model.
+## Model Details
+| Property | Value |
+|----------|-------|
+| **Architecture** | Qwen3-VL (vision encoder + language decoder) |
+| **Total Parameters** | 4B |
+| **Text Hidden Dimension** | 2560 |
+| **Text Layers** | 36 |
+| **Text Attention Heads** | 32 (8 KV heads, GQA) |
+| **Text MLP Dimension** | 9728 (SiLU activation) |
+| **Vision Encoder** | 24-layer ViT (1024d, 16 heads, patch 16) |
+| **Vision Features** | DeepStack at layers [5, 11, 17] |
+| **Spatial Merge** | 2×2 (4 patches → 1 token) |
+| **Temporal Patch** | 2 frames per token |
+| **Context Length** | 262,144 tokens |
+| **Vocabulary** | 151,936 tokens |
+| **RoPE** | M-RoPE (interleaved, θ = 5,000,000) |
+| **Quantization** | NF4 double quantization (bitsandbytes) |
+| **Original Model** | [`Qwen/Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
+| **License** | Apache-2.0 |
+## Compression Results
+Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.
+| Metric | Original | INT4 Quantized | Change |
+|--------|----------|----------------|--------|
+| **Total Size** | 8,465 MB | 2,741 MB | **3.1x smaller** |
+| **Quantization** | BF16 | NF4 + double quant | 4-bit weights |
+| **Compute Dtype** | BF16 | BF16 | Preserved at inference |
+| **Format** | SafeTensors | SafeTensors | Direct HF loading |
+## Quick Start
+```python
+from transformers import AutoModelForImageTextToText, AutoProcessor
+from PIL import Image
+model = AutoModelForImageTextToText.from_pretrained(
+    "robotflowlabs/qwen3-vl-4b-instruct-int4",
+    device_map="auto",
+)
+processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-4b-instruct-int4")
+image = Image.open("workspace.jpg")
+messages = [
+    {"role": "system", "content": "You are a robotic vision assistant specialized in manipulation tasks."},
+    {"role": "user", "content": [
+        {"type": "image", "image": image},
+        {"type": "text", "text": "List all graspable objects, their approximate positions, and suggest a pick order."}
+    ]}
+]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=512)
+print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+```
+### With FORGE (ANIMA Integration)
+```python
+from forge.vlm import VLMRegistry
+vlm = VLMRegistry.load("qwen3-vl-4b-instruct-int4")
+plan = vlm.describe(image, "List all graspable objects and suggest a manipulation sequence.")
+```
+## Use Cases in ANIMA
+Qwen3-VL-4B serves as the **high-quality visual reasoning engine** in ANIMA:
+- **Complex Scene Analysis** — Detailed spatial reasoning about cluttered workspaces
+- **Visual Task Planning** — Multi-step manipulation plans from scene observation
+- **Precise Grounding** — Fine-grained object localization and counting
+- **Structured Output** — JSON scene graphs, object inventories, spatial relationship maps
+- **Video Reasoning** — Temporal understanding of task progress from camera feeds
+- **Safety Assessment** — Visual evaluation of workspace hazards before execution
+## Qwen3-VL Family on RobotFlowLabs
+| Model | Params | Quantized Size | Best For |
+|-------|--------|----------------|----------|
+| [qwen3-vl-2b-instruct-int4](https://huggingface.co/robotflowlabs/qwen3-vl-2b-instruct-int4) | 2B | 1.5 GB | Edge deployment, real-time |
+| **[qwen3-vl-4b-instruct-int4](https://huggingface.co/robotflowlabs/qwen3-vl-4b-instruct-int4)** | **4B** | **2.7 GB** | **Higher accuracy visual reasoning** |
+## About ANIMA
+**ANIMA** is a modular, ROS2-native agentic robotics platform developed by [RobotFlowLabs](https://huggingface.co/robotflowlabs). It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
+### Other Collections
+- **[ANIMA Vision](https://huggingface.co/collections/robotflowlabs/anima-vision-69bc77ca7ce15b06bbdd21bd)** — SAM2, DINOv2, CLIP, SigLIP, Depth Anything
+- **[ANIMA Language](https://huggingface.co/collections/robotflowlabs/anima-language-69bc77ca29dccc3f68f8c7fd)** — Qwen2.5, SmolLM2
+- **[ANIMA VLM](https://huggingface.co/collections/robotflowlabs/anima-vlm-69bc77ca53ae84ac21b0f012)** — Qwen3-VL, Qwen2.5-VL
+- **[ANIMA VLA](https://huggingface.co/collections/robotflowlabs/anima-vla-69bc77cbf1b8aa40002920bb)** — SmolVLA, RDT2-FM, FORGE students
+## Intended Use
+### Designed For
+- High-accuracy visual scene understanding for robotic manipulation
+- Complex visual task planning requiring spatial reasoning
+- Precise object grounding and counting in cluttered environments
+- Multi-turn visual dialogue with detailed scene descriptions
+### Limitations
+- INT4 quantization may slightly reduce fine-grained visual grounding precision
+- 262K context window is generous but may not cover extremely long video sequences
+- Requires GPU (bitsandbytes NF4 does not run on CPU)
+- Inherits biases from Qwen3-VL training data
+### Out of Scope
+- Safety-critical autonomous decision making without human oversight
+- Medical image analysis
+- Surveillance applications
+## Technical Details
+### Compression Pipeline
+```
+Original Qwen3-VL-4B-Instruct (BF16, 8.5 GB)
+    │
+    └─→ bitsandbytes NF4 double quantization
+        ├─→ bnb_4bit_quant_type: nf4
+        ├─→ bnb_4bit_use_double_quant: true
+        ├─→ bnb_4bit_compute_dtype: bfloat16
+        └─→ model.safetensors (2.7 GB)
+```
+- **Quantization**: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
+- **Compute**: BF16 at inference — weights dequantized on-the-fly
+- **Hardware**: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
+## Attribution
+- **Original Model**: [`Qwen/Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) by Alibaba Cloud
+- **License**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
+- **Compressed by**: [RobotFlowLabs](https://huggingface.co/robotflowlabs) using [FORGE](https://github.com/robotflowlabs/forge)
+## Citation
+```bibtex
+@article{qwen3vl,
+  title={Qwen3-VL Technical Report},
+  author={Qwen Team},
+  year={2025}
+}
+```
+---
+<p align="center">
+  <b>Built with FORGE by <a href="https://huggingface.co/robotflowlabs">RobotFlowLabs</a></b><br>
+  Optimizing foundation models for real robots.
+</p>
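
The compression pipeline in the README names NF4 quantization with BF16 compute and on-the-fly dequantization. The block below is a minimal pure-Python sketch of that idea, not the bitsandbytes implementation. The 16 code-book values are rounded approximations of the NF4 levels published with QLoRA, the helper names `quantize_block` and `dequantize_block` are hypothetical, and the second quantization pass over the scales ("double quantization") is left out.

```python
# Approximate NF4 code book: 16 levels in [-1, 1], rounded to 4 decimals (assumption).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Map one block of floats to (absmax scale, list of 4-bit code indices)."""
    # The block is scaled by its largest magnitude so every value lands in [-1, 1].
    absmax = max(abs(w) for w in weights) or 1.0
    # Each scaled weight is replaced by the index of the nearest code-book level.
    indices = [
        min(range(16), key=lambda k: abs(NF4_LEVELS[k] - w / absmax))
        for w in weights
    ]
    return absmax, indices


def dequantize_block(absmax, indices):
    """Rebuild approximate weights from the stored scale and code indices."""
    return [NF4_LEVELS[k] * absmax for k in indices]


block = [0.02, -0.5, 0.25, 0.0]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
print(scale, codes, restored)
```

Only the scale and the 4-bit indices need to be stored, which is where the size reduction comes from. The restored values differ slightly from the originals, which is the precision loss the Limitations section refers to.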

chat_template.jinja ADDED

@@ -0,0 +1,120 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- for message in messages %}
+    {%- if message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content_item in message.content %}
+                {%- if 'text' in content_item %}
+                    {{- content_item.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and message.content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
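
To make the template's output easier to picture, here is a hand-written Python approximation of what it renders for the common case of an optional leading system message followed by user and assistant turns. It is a simplified sketch rather than the template itself: the function name `render_prompt` is hypothetical, and tools, tool responses, videos, and `add_vision_id` are deliberately left out.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Approximate the chat template for text and image parts only.

    Assumes any system message comes first, as the template expects.
    """
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role not in ("system", "user", "assistant"):
            continue  # tool messages are out of scope for this sketch
        parts.append(f"<|im_start|>{role}\n")
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                is_image = (
                    item.get("type") == "image"
                    or "image" in item
                    or "image_url" in item
                )
                # Only user turns get an image placeholder in the template.
                if role == "user" and is_image:
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif "text" in item:
                    parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a robotic vision assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": "workspace.jpg"},
        {"type": "text", "text": "List all graspable objects."},
    ]},
]
prompt = render_prompt(messages)
print(prompt)
```

The single `<|image_pad|>` placeholder is what the processor later expands into one token per merged image patch.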

config.json ADDED

@@ -0,0 +1,81 @@

+{
+  "architectures": [
+    "Qwen3VLForConditionalGeneration"
+  ],
+  "dtype": "bfloat16",
+  "image_token_id": 151655,
+  "model_type": "qwen3_vl",
+  "quantization_config": {
+    "_load_in_4bit": true,
+    "_load_in_8bit": false,
+    "bnb_4bit_compute_dtype": "bfloat16",
+    "bnb_4bit_quant_storage": "uint8",
+    "bnb_4bit_quant_type": "nf4",
+    "bnb_4bit_use_double_quant": true,
+    "llm_int8_enable_fp32_cpu_offload": false,
+    "llm_int8_has_fp16_weight": false,
+    "llm_int8_skip_modules": null,
+    "llm_int8_threshold": 6.0,
+    "load_in_4bit": true,
+    "load_in_8bit": false,
+    "quant_method": "bitsandbytes"
+  },
+  "text_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2560,
+    "initializer_range": 0.02,
+    "intermediate_size": 9728,
+    "max_position_embeddings": 262144,
+    "model_type": "qwen3_vl_text",
+    "num_attention_heads": 32,
+    "num_hidden_layers": 36,
+    "num_key_value_heads": 8,
+    "pad_token_id": null,
+    "rms_norm_eps": 1e-06,
+    "rope_parameters": {
+      "mrope_interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_theta": 5000000,
+      "rope_type": "default"
+    },
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "vocab_size": 151936
+  },
+  "tie_word_embeddings": true,
+  "transformers_version": "5.3.0",
+  "video_token_id": 151656,
+  "vision_config": {
+    "deepstack_visual_indexes": [
+      5,
+      11,
+      17
+    ],
+    "depth": 24,
+    "dtype": "bfloat16",
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1024,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "model_type": "qwen3_vl",
+    "num_heads": 16,
+    "num_position_embeddings": 2304,
+    "out_hidden_size": 2560,
+    "patch_size": 16,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652
+}
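
A few quantities follow directly from the `text_config` values above. The sketch below derives the attention projection widths, the grouped-query ratio, and a rough KV-cache footprint. The cache figures assume an unquantized BF16 cache (2 bytes per element), which is an assumption of this example and not something the config states.

```python
# Values copied from text_config in config.json.
text_config = {
    "hidden_size": 2560,
    "head_dim": 128,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "num_hidden_layers": 36,
    "max_position_embeddings": 262144,
    "mrope_section": [24, 20, 20],
}

# Query and key/value projection widths (head count times head size).
q_width = text_config["num_attention_heads"] * text_config["head_dim"]
kv_width = text_config["num_key_value_heads"] * text_config["head_dim"]

# Grouped-query attention: how many query heads share one KV head.
gqa_group = text_config["num_attention_heads"] // text_config["num_key_value_heads"]

# The M-RoPE sections split the rotary frequency pairs, i.e. half the head size.
rotary_pairs = sum(text_config["mrope_section"])

# KV cache per token: keys and values, per layer, 2 bytes each (BF16 assumed).
BYTES_PER_ELEMENT = 2
kv_bytes_per_token = 2 * text_config["num_hidden_layers"] * kv_width * BYTES_PER_ELEMENT
kv_gib_full_context = (
    kv_bytes_per_token * text_config["max_position_embeddings"] / 2**30
)

print(q_width, kv_width, gqa_group, rotary_pairs)
print(kv_bytes_per_token, kv_gib_full_context)
```

Under that assumption a full 262,144-token context would need about 36 GiB of cache, far more than the 2.7 GB of quantized weights, so the usable context on a 24 GB card is bounded by the cache rather than by the model file.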

generation_config.json ADDED

@@ -0,0 +1,14 @@

+{
+  "bos_token_id": 151643,
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "repetition_penalty": 1.0,
+  "temperature": 0.7,
+  "top_k": 20,
+  "top_p": 0.8,
+  "transformers_version": "5.3.0"
+}
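
These defaults enable sampling with `temperature` 0.7, `top_k` 20, and `top_p` 0.8. The following pure-Python sketch shows how the three settings narrow a toy distribution when applied in that order. It illustrates the general technique under simplifying assumptions and is not the Transformers implementation; the function name `sampling_distribution` and the toy logits are made up for the example.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_id: probability} for the tokens that survive all filters."""
    # 1. Temperature: divide logits before the softmax (values below 1 sharpen it).
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors, subtracting the peak for numerical stability.
    peak = scaled[ranked[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(weights.values())
    # 3. Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, mass = {}, 0.0
    for i in ranked:
        p = weights[i] / total
        kept[i] = p
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: p / mass for i, p in kept.items()}


dist = sampling_distribution([2.0, 1.0, 0.5, 0.0, -1.0])
print(dist)
```

With these toy logits only the two most likely tokens remain, because together they already cover more than 80% of the probability mass after temperature scaling.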

model.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7e63de5678f982c34fbebb1fe8d6c258e85e31a0311f5ae4bd604b3ef597a6c0
+size 2874045882
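
The byte count in this LFS pointer can be cross-checked against the sizes quoted in the README. The short calculation below assumes the README's "MB" figures are mebibytes (2^20 bytes), which is the reading under which the numbers agree.

```python
# Size from the Git LFS pointer for model.safetensors, in bytes.
quantized_bytes = 2874045882
# Original BF16 size as stated in the README (assumed to be MiB).
original_mib = 8465

quantized_mib = quantized_bytes / 2**20
compression_ratio = original_mib / quantized_mib

print(round(quantized_mib), round(compression_ratio, 1))
```

Rounded, this reproduces the README's 2,741 MB and 3.1x figures. In decimal units the same file is about 2.87 GB.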

processor_config.json ADDED

@@ -0,0 +1,63 @@

+{
+  "image_processor": {
+    "data_format": "channels_first",
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_processor_type": "Qwen2VLImageProcessorFast",
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "merge_size": 2,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "size": {
+      "longest_edge": 16777216,
+      "shortest_edge": 65536
+    },
+    "temporal_patch_size": 2
+  },
+  "processor_class": "Qwen3VLProcessor",
+  "video_processor": {
+    "data_format": "channels_first",
+    "default_to_square": true,
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "do_sample_frames": true,
+    "fps": 2,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "max_frames": 768,
+    "merge_size": 2,
+    "min_frames": 4,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "return_metadata": false,
+    "size": {
+      "longest_edge": 25165824,
+      "shortest_edge": 4096
+    },
+    "temporal_patch_size": 2,
+    "video_processor_type": "Qwen3VLVideoProcessor"
+  }
+}
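
The image processor settings determine how many visual tokens an image costs. The sketch below works this out from `patch_size` and `merge_size`. It assumes, following the Qwen2-VL image processor convention, that `size.shortest_edge` and `size.longest_edge` are total pixel budgets (minimum and maximum) and not edge lengths; the helper `image_tokens` is hypothetical and expects dimensions that are already multiples of 32.

```python
# Values copied from the image_processor section of processor_config.json.
patch_size = 16
merge_size = 2
min_pixels = 65536       # size.shortest_edge (assumed to be a pixel budget)
max_pixels = 16777216    # size.longest_edge (assumed to be a pixel budget)

# After the 2x2 spatial merge, one visual token covers a 32x32 pixel area.
token_edge = patch_size * merge_size
pixels_per_token = token_edge ** 2

min_tokens = min_pixels // pixels_per_token
max_tokens = max_pixels // pixels_per_token


def image_tokens(height, width):
    """Visual tokens for an image whose sides are multiples of token_edge."""
    if height % token_edge or width % token_edge:
        raise ValueError("resize to multiples of 32 first")
    return (height // token_edge) * (width // token_edge)


print(min_tokens, max_tokens, image_tokens(960, 1280))
```

Under these assumptions a single image uses between 64 and 16,384 tokens, and a 1280x960 camera frame costs 1,200 tokens before the surrounding `<|vision_start|>` and `<|vision_end|>` markers.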

tokenizer.json ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+size 11422650

tokenizer_config.json ADDED

@@ -0,0 +1,30 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": true,
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "processor_class": "Qwen3VLProcessor",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}