ilessio-aiflowlab committed on
Commit
f2760c2
·
verified ·
1 Parent(s): 20ee228

Add Qwen3-VL-4B-Instruct INT4 NF4 quantized (2.7GB)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen3-VL-4B-Instruct
+ tags:
+ - robotics
+ - edge-deployment
+ - anima
+ - forge
+ - int4
+ - nf4
+ - quantized
+ - qwen3
+ - vision-language
+ - multimodal
+ - text-generation
+ - image-text-to-text
+ - instruction-following
+ - ros2
+ - jetson
+ - real-time
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ model-index:
+ - name: qwen3-vl-4b-instruct-int4
+   results:
+   - task:
+       type: image-text-to-text
+     metrics:
+     - name: Model Size (MB)
+       type: model_size
+       value: 2741
+     - name: Compression Ratio
+       type: compression
+       value: 3.1
+     - name: Original Size (MB)
+       type: original_size
+       value: 8465
+ ---
+
+ # Qwen3-VL-4B-Instruct — INT4 NF4 Quantized
+
+ > Alibaba's latest Qwen3-VL-4B-Instruct quantized to 4-bit NF4 with double quantization for high-quality robotic visual reasoning. **3.1x smaller** — from 8.5 GB to 2.7 GB — delivering stronger visual understanding than the 2B variant while still fitting on edge GPUs.
+
+ This model is part of the **[RobotFlowLabs](https://huggingface.co/robotflowlabs)** model library, built for the **ANIMA** agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
+
+ ## Why This Model Exists
+
+ When robotic tasks demand higher visual reasoning quality — complex scene descriptions, multi-step visual planning, or precise spatial grounding — the 4B variant provides a significant accuracy boost over the 2B. Qwen3-VL-4B features a deeper language model (36 layers vs 28) with wider hidden dimensions (2560 vs 2048), delivering better performance on visual grounding, counting, and reasoning benchmarks. At 2.7 GB quantized, it fits on an L4 24GB alongside a vision encoder and action model.
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | **Architecture** | Qwen3-VL (vision encoder + language decoder) |
+ | **Total Parameters** | 4B |
+ | **Text Hidden Dimension** | 2560 |
+ | **Text Layers** | 36 |
+ | **Text Attention Heads** | 32 (8 KV heads, GQA) |
+ | **Text MLP Dimension** | 9728 (SiLU activation) |
+ | **Vision Encoder** | 24-layer ViT (1024d, 16 heads, patch 16) |
+ | **Vision Features** | DeepStack at layers [5, 11, 17] |
+ | **Spatial Merge** | 2×2 (4 patches → 1 token) |
+ | **Temporal Patch** | 2 frames per token |
+ | **Context Length** | 262,144 tokens |
+ | **Vocabulary** | 151,936 tokens |
+ | **RoPE** | M-RoPE (interleaved, θ = 5,000,000) |
+ | **Quantization** | NF4 double quantization (bitsandbytes) |
+ | **Original Model** | [`Qwen/Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
+ | **License** | Apache-2.0 |
+
+ ## Compression Results
+
+ Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.
+
+ | Metric | Original | INT4 Quantized | Change |
+ |--------|----------|----------------|--------|
+ | **Total Size** | 8,465 MB | 2,741 MB | **3.1x smaller** |
+ | **Quantization** | BF16 | NF4 + double quant | 4-bit weights |
+ | **Compute Dtype** | BF16 | BF16 | Preserved at inference |
+ | **Format** | SafeTensors | SafeTensors | Direct HF loading |
+
+ ## Quick Start
+
+ ```python
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ from PIL import Image
+
+ model = AutoModelForImageTextToText.from_pretrained(
+     "robotflowlabs/qwen3-vl-4b-instruct-int4",
+     device_map="auto",
+ )
+ processor = AutoProcessor.from_pretrained("robotflowlabs/qwen3-vl-4b-instruct-int4")
+
+ image = Image.open("workspace.jpg")
+ messages = [
+     {"role": "system", "content": "You are a robotic vision assistant specialized in manipulation tasks."},
+     {"role": "user", "content": [
+         {"type": "image", "image": image},
+         {"type": "text", "text": "List all graspable objects, their approximate positions, and suggest a pick order."}
+     ]}
+ ]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```
+
+ ### With FORGE (ANIMA Integration)
+
+ ```python
+ from forge.vlm import VLMRegistry
+
+ vlm = VLMRegistry.load("qwen3-vl-4b-instruct-int4")
+ plan = vlm.describe(image, "List all graspable objects and suggest a manipulation sequence.")
+ ```
+
+ ## Use Cases in ANIMA
+
+ Qwen3-VL-4B serves as the **high-quality visual reasoning engine** in ANIMA:
+
+ - **Complex Scene Analysis** — Detailed spatial reasoning about cluttered workspaces
+ - **Visual Task Planning** — Multi-step manipulation plans from scene observation
+ - **Precise Grounding** — Fine-grained object localization and counting
+ - **Structured Output** — JSON scene graphs, object inventories, spatial relationship maps
+ - **Video Reasoning** — Temporal understanding of task progress from camera feeds
+ - **Safety Assessment** — Visual evaluation of workspace hazards before execution
+
+ ## Qwen3-VL Family on RobotFlowLabs
+
+ | Model | Params | Quantized Size | Best For |
+ |-------|--------|----------------|----------|
+ | [qwen3-vl-2b-instruct-int4](https://huggingface.co/robotflowlabs/qwen3-vl-2b-instruct-int4) | 2B | 1.5 GB | Edge deployment, real-time |
+ | **[qwen3-vl-4b-instruct-int4](https://huggingface.co/robotflowlabs/qwen3-vl-4b-instruct-int4)** | **4B** | **2.7 GB** | **Higher accuracy visual reasoning** |
+
+ ## About ANIMA
+
+ **ANIMA** is a modular, ROS2-native agentic robotics platform developed by [RobotFlowLabs](https://huggingface.co/robotflowlabs). It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
+
+ ### Other Collections
+
+ - **[ANIMA Vision](https://huggingface.co/collections/robotflowlabs/anima-vision-69bc77ca7ce15b06bbdd21bd)** — SAM2, DINOv2, CLIP, SigLIP, Depth Anything
+ - **[ANIMA Language](https://huggingface.co/collections/robotflowlabs/anima-language-69bc77ca29dccc3f68f8c7fd)** — Qwen2.5, SmolLM2
+ - **[ANIMA VLM](https://huggingface.co/collections/robotflowlabs/anima-vlm-69bc77ca53ae84ac21b0f012)** — Qwen3-VL, Qwen2.5-VL
+ - **[ANIMA VLA](https://huggingface.co/collections/robotflowlabs/anima-vla-69bc77cbf1b8aa40002920bb)** — SmolVLA, RDT2-FM, FORGE students
+
+ ## Intended Use
+
+ ### Designed For
+ - High-accuracy visual scene understanding for robotic manipulation
+ - Complex visual task planning requiring spatial reasoning
+ - Precise object grounding and counting in cluttered environments
+ - Multi-turn visual dialogue with detailed scene descriptions
+
+ ### Limitations
+ - INT4 quantization may slightly reduce fine-grained visual grounding precision
+ - 262K context window is generous but may not cover extremely long video sequences
+ - Requires GPU (bitsandbytes NF4 does not run on CPU)
+ - Inherits biases from Qwen3-VL training data
+
+ ### Out of Scope
+ - Safety-critical autonomous decision making without human oversight
+ - Medical image analysis
+ - Surveillance applications
+
+ ## Technical Details
+
+ ### Compression Pipeline
+
+ ```
+ Original Qwen3-VL-4B-Instruct (BF16, 8.5 GB)
+
+     └─→ bitsandbytes NF4 double quantization
+         ├─→ bnb_4bit_quant_type: nf4
+         ├─→ bnb_4bit_use_double_quant: true
+         ├─→ bnb_4bit_compute_dtype: bfloat16
+         └─→ model.safetensors (2.7 GB)
+ ```
+
+ - **Quantization**: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
+ - **Compute**: BF16 at inference — weights dequantized on-the-fly
+ - **Hardware**: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
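Reviewer note: the NF4 step described in the pipeline can be illustrated without bitsandbytes. The following is a minimal pure-Python sketch of blockwise absmax quantization against a 16-entry code book, not the bitsandbytes kernel. The code-book values are the NF4 levels as published with QLoRA, rounded here to four decimals; the function names and the 8-element block are hypothetical (bitsandbytes typically uses blocks of 64 weights and packs two codes per byte). Double quantization, which additionally quantizes the per-block scales, is left out, so the scale stays a plain float.

```python
# NF4 code book (rounded to 4 decimals for readability; an approximation).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return (absmax, codes): one float scale per block, one 4-bit index per weight."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return absmax, codes


def dequantize_block(absmax, codes):
    """Look each code up in the code book and rescale by the block's absmax."""
    return [NF4_LEVELS[c] * absmax for c in codes]


# Hypothetical block of weights, chosen only for illustration.
block = [0.02, -0.15, 0.31, -0.07, 0.0, 0.12, -0.31, 0.25]
absmax, codes = quantize_block(block)
restored = dequantize_block(absmax, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_err, 4))
```

At inference the dequantized values would be cast to the compute dtype (BF16 in this commit) before the matrix multiply, which is what "weights dequantized on-the-fly" refers to.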
+
+ ## Attribution
+
+ - **Original Model**: [`Qwen/Qwen3-VL-4B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) by Alibaba Cloud
+ - **License**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
+ - **Compressed by**: [RobotFlowLabs](https://huggingface.co/robotflowlabs) using [FORGE](https://github.com/robotflowlabs/forge)
+
+ ## Citation
+
+ ```bibtex
+ @article{qwen3vl,
+   title={Qwen3-VL Technical Report},
+   author={Qwen Team},
+   year={2025}
+ }
+ ```
+
+ ---
+
+ <p align="center">
+   <b>Built with FORGE by <a href="https://huggingface.co/robotflowlabs">RobotFlowLabs</a></b><br>
+   Optimizing foundation models for real robots.
+ </p>
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0].role == 'system' %}
+         {%- if messages[0].content is string %}
+             {{- messages[0].content }}
+         {%- else %}
+             {%- for content in messages[0].content %}
+                 {%- if 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '\n\n' }}
+     {%- endif %}
+     {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0].role == 'system' %}
+         {{- '<|im_start|>system\n' }}
+         {%- if messages[0].content is string %}
+             {{- messages[0].content }}
+         {%- else %}
+             {%- for content in messages[0].content %}
+                 {%- if 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- set image_count = namespace(value=0) %}
+ {%- set video_count = namespace(value=0) %}
+ {%- for message in messages %}
+     {%- if message.role == "user" %}
+         {{- '<|im_start|>' + message.role + '\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content in message.content %}
+                 {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                     {%- set image_count.value = image_count.value + 1 %}
+                     {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                     <|vision_start|><|image_pad|><|vision_end|>
+                 {%- elif content.type == 'video' or 'video' in content %}
+                     {%- set video_count.value = video_count.value + 1 %}
+                     {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                     <|vision_start|><|video_pad|><|vision_end|>
+                 {%- elif 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role + '\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content_item in message.content %}
+                 {%- if 'text' in content_item %}
+                     {{- content_item.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {%- if message.tool_calls %}
+             {%- for tool_call in message.tool_calls %}
+                 {%- if (loop.first and message.content) or (not loop.first) %}
+                     {{- '\n' }}
+                 {%- endif %}
+                 {%- if tool_call.function %}
+                     {%- set tool_call = tool_call.function %}
+                 {%- endif %}
+                 {{- '<tool_call>\n{"name": "' }}
+                 {{- tool_call.name }}
+                 {{- '", "arguments": ' }}
+                 {%- if tool_call.arguments is string %}
+                     {{- tool_call.arguments }}
+                 {%- else %}
+                     {{- tool_call.arguments | tojson }}
+                 {%- endif %}
+                 {{- '}\n</tool_call>' }}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {%- if message.content is string %}
+             {{- message.content }}
+         {%- else %}
+             {%- for content in message.content %}
+                 {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                     {%- set image_count.value = image_count.value + 1 %}
+                     {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                     <|vision_start|><|image_pad|><|vision_end|>
+                 {%- elif content.type == 'video' or 'video' in content %}
+                     {%- set video_count.value = video_count.value + 1 %}
+                     {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                     <|vision_start|><|video_pad|><|vision_end|>
+                 {%- elif 'text' in content %}
+                     {{- content.text }}
+                 {%- endif %}
+             {%- endfor %}
+         {%- endif %}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
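Reviewer note: to make the template's output concrete, here is a simplified pure-Python mirror of its most common path. It is a sketch under stated assumptions, not a replacement for `apply_chat_template`: it covers only a leading system message plus user and assistant turns, and it ignores tools, tool responses, video parts, and `add_vision_id`. The names `IMAGE_SLOT`, `render_content`, and `render_prompt` are hypothetical.

```python
IMAGE_SLOT = "<|vision_start|><|image_pad|><|vision_end|>"


def render_content(content, allow_images):
    """Flatten message content: strings pass through, parts are concatenated."""
    if isinstance(content, str):
        return content
    pieces = []
    for part in content:
        is_image = part.get("type") == "image" or "image" in part or "image_url" in part
        if allow_images and is_image:
            pieces.append(IMAGE_SLOT)  # placeholder later expanded by the processor
        elif "text" in part:
            pieces.append(part["text"])
    return "".join(pieces)


def render_prompt(messages, add_generation_prompt=True):
    """Simplified mirror of the template: no tools, no video, no add_vision_id."""
    prompt = ""
    for index, message in enumerate(messages):
        role = message["role"]
        if role == "system" and index != 0:
            continue  # the template only renders a leading system message
        if role not in ("system", "user", "assistant"):
            continue  # tool messages are out of scope for this sketch
        prompt += "<|im_start|>" + role + "\n"
        # Only user turns emit image slots; system and assistant turns are text-only.
        prompt += render_content(message["content"], allow_images=(role == "user"))
        prompt += "<|im_end|>\n"
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


example = [
    {"role": "system", "content": "You are a robotic vision assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": "workspace.jpg"},
        {"type": "text", "text": "List graspable objects."},
    ]},
]
print(render_prompt(example))
```

Note that the image slot and the following text are emitted back to back with no separator, because the template strips whitespace around every tag.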
config.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "architectures": [
+     "Qwen3VLForConditionalGeneration"
+   ],
+   "dtype": "bfloat16",
+   "image_token_id": 151655,
+   "model_type": "qwen3_vl",
+   "quantization_config": {
+     "_load_in_4bit": true,
+     "_load_in_8bit": false,
+     "bnb_4bit_compute_dtype": "bfloat16",
+     "bnb_4bit_quant_storage": "uint8",
+     "bnb_4bit_quant_type": "nf4",
+     "bnb_4bit_use_double_quant": true,
+     "llm_int8_enable_fp32_cpu_offload": false,
+     "llm_int8_has_fp16_weight": false,
+     "llm_int8_skip_modules": null,
+     "llm_int8_threshold": 6.0,
+     "load_in_4bit": true,
+     "load_in_8bit": false,
+     "quant_method": "bitsandbytes"
+   },
+   "text_config": {
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "dtype": "bfloat16",
+     "eos_token_id": 151645,
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 2560,
+     "initializer_range": 0.02,
+     "intermediate_size": 9728,
+     "max_position_embeddings": 262144,
+     "model_type": "qwen3_vl_text",
+     "num_attention_heads": 32,
+     "num_hidden_layers": 36,
+     "num_key_value_heads": 8,
+     "pad_token_id": null,
+     "rms_norm_eps": 1e-06,
+     "rope_parameters": {
+       "mrope_interleaved": true,
+       "mrope_section": [
+         24,
+         20,
+         20
+       ],
+       "rope_theta": 5000000,
+       "rope_type": "default"
+     },
+     "tie_word_embeddings": true,
+     "use_cache": true,
+     "vocab_size": 151936
+   },
+   "tie_word_embeddings": true,
+   "transformers_version": "5.3.0",
+   "video_token_id": 151656,
+   "vision_config": {
+     "deepstack_visual_indexes": [
+       5,
+       11,
+       17
+     ],
+     "depth": 24,
+     "dtype": "bfloat16",
+     "hidden_act": "gelu_pytorch_tanh",
+     "hidden_size": 1024,
+     "in_channels": 3,
+     "initializer_range": 0.02,
+     "intermediate_size": 4096,
+     "model_type": "qwen3_vl",
+     "num_heads": 16,
+     "num_position_embeddings": 2304,
+     "out_hidden_size": 2560,
+     "patch_size": 16,
+     "spatial_merge_size": 2,
+     "temporal_patch_size": 2
+   },
+   "vision_end_token_id": 151653,
+   "vision_start_token_id": 151652
+ }
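Reviewer note: the `text_config` values above are enough to estimate the KV-cache footprint, which matters when judging how much of the 262,144-token window is usable on a 24 GB card. The sketch below is back-of-the-envelope arithmetic only. It assumes the cache is stored in BF16 (2 bytes per value), batch size 1, and no cache quantization or paging; `kv_cache_bytes` is a hypothetical helper, not a transformers API.

```python
# Values copied from text_config in this commit's config.json.
TEXT_CONFIG = {
    "num_hidden_layers": 36,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "max_position_embeddings": 262144,
}
BYTES_PER_VALUE = 2  # assumption: BF16 cache entries


def kv_cache_bytes(num_tokens, cfg=TEXT_CONFIG):
    """Bytes needed to cache keys and values for num_tokens at batch size 1."""
    per_token = (
        2  # one key tensor and one value tensor
        * cfg["num_hidden_layers"]
        * cfg["num_key_value_heads"]
        * cfg["head_dim"]
        * BYTES_PER_VALUE
    )
    return per_token * num_tokens


GIB = 2**30
print(kv_cache_bytes(1) // 1024, "KiB per token")
print(kv_cache_bytes(8192) / GIB, "GiB at 8,192 tokens")
print(kv_cache_bytes(TEXT_CONFIG["max_position_embeddings"]) / GIB, "GiB at full context")
```

Under these assumptions the cache costs 144 KiB per token, about 1.1 GiB at 8,192 tokens, and 36 GiB at the full context length, so the 2.7 GB of weights is not the limiting factor for long inputs. Grouped-query attention with 8 KV heads instead of 32 already cuts this figure by a factor of four.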
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.0,
+   "temperature": 0.7,
+   "top_k": 20,
+   "top_p": 0.8,
+   "transformers_version": "5.3.0"
+ }
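Reviewer note: these defaults enable sampling with temperature 0.7, top-k 20, and top-p 0.8. The sketch below shows, in plain Python, how those three settings narrow a next-token distribution when applied in that order. It is an illustration of the general technique rather than the transformers implementation, and `filter_distribution` and the example logits are hypothetical.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep the k most likely candidates, most likely first.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving candidates sum to 1.
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}


dist = filter_distribution([2.0, 1.0, 0.5, -1.0])
print(dist)
```

For deterministic robot pipelines it may be preferable to override these defaults at call time, for example by passing `do_sample=False` to `generate`.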
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e63de5678f982c34fbebb1fe8d6c258e85e31a0311f5ae4bd604b3ef597a6c0
+ size 2874045882
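Reviewer note: the pointer's `size` field can be cross-checked against the figures quoted in the README. The sketch below parses the three-line Git LFS pointer and recomputes the size and compression ratio. It assumes the README's "MB" means MiB (2^20 bytes) and takes the 8,465 MB original size from the README as given; `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7e63de5678f982c34fbebb1fe8d6c258e85e31a0311f5ae4bd604b3ef597a6c0
size 2874045882
"""

ORIGINAL_MIB = 8465  # from the README's compression table


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash, and byte size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
size_mib = info["size"] / 2**20
ratio = ORIGINAL_MIB / size_mib
print(f"{size_mib:.0f} MiB, {ratio:.1f}x smaller")  # prints "2741 MiB, 3.1x smaller"
```

The result agrees with the README's 2,741 MB and 3.1x figures. In decimal units the same file is about 2.87 GB, so the "2.7 GB" headline is closer to a GiB value.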
processor_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "image_processor": {
+     "data_format": "channels_first",
+     "do_convert_rgb": true,
+     "do_normalize": true,
+     "do_rescale": true,
+     "do_resize": true,
+     "image_mean": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "image_processor_type": "Qwen2VLImageProcessorFast",
+     "image_std": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "merge_size": 2,
+     "patch_size": 16,
+     "resample": 3,
+     "rescale_factor": 0.00392156862745098,
+     "size": {
+       "longest_edge": 16777216,
+       "shortest_edge": 65536
+     },
+     "temporal_patch_size": 2
+   },
+   "processor_class": "Qwen3VLProcessor",
+   "video_processor": {
+     "data_format": "channels_first",
+     "default_to_square": true,
+     "do_convert_rgb": true,
+     "do_normalize": true,
+     "do_rescale": true,
+     "do_resize": true,
+     "do_sample_frames": true,
+     "fps": 2,
+     "image_mean": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "image_std": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "max_frames": 768,
+     "merge_size": 2,
+     "min_frames": 4,
+     "patch_size": 16,
+     "resample": 3,
+     "rescale_factor": 0.00392156862745098,
+     "return_metadata": false,
+     "size": {
+       "longest_edge": 25165824,
+       "shortest_edge": 4096
+     },
+     "temporal_patch_size": 2,
+     "video_processor_type": "Qwen3VLVideoProcessor"
+   }
+ }
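Reviewer note: `patch_size` 16 and `merge_size` 2 together mean that one visual token covers a 32×32 pixel cell, which makes the prompt cost of an image easy to estimate. The sketch below does that arithmetic. It rests on two assumptions that are not stated in this commit: that the image `size` fields act as total-pixel budgets (minimum and maximum pixels per image), as they do in the Qwen2-VL image processor, and that the processor resizes inputs so both sides are multiples of 32. `visual_tokens` is a hypothetical helper.

```python
# Values copied from the image_processor section of processor_config.json.
PATCH_SIZE = 16
MERGE_SIZE = 2
MIN_PIXELS = 65536      # "shortest_edge", assumed to be a pixel budget
MAX_PIXELS = 16777216   # "longest_edge", assumed to be a pixel budget

CELL = PATCH_SIZE * MERGE_SIZE  # side of the pixel square behind one token


def visual_tokens(height, width):
    """Token count for an image whose sides are already multiples of CELL."""
    if height % CELL or width % CELL:
        raise ValueError(f"sides must be multiples of {CELL} after resizing")
    return (height // CELL) * (width // CELL)


print(visual_tokens(480, 640), "tokens for a 640x480 frame")
print(MIN_PIXELS // CELL**2, "to", MAX_PIXELS // CELL**2, "tokens per image")
```

Under these assumptions a 640×480 camera frame costs 300 tokens, and the per-image budget ranges from 64 to 16,384 tokens.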
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
+ size 11422650
tokenizer_config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "is_local": true,
+   "model_max_length": 262144,
+   "pad_token": "<|endoftext|>",
+   "processor_class": "Qwen3VLProcessor",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }