Yaphabates committed about 11 hours ago

Commit

52f434f

verified

1 Parent(s): 7ccdd11

Initial release of MOSAIC-4B


Files changed (19)

.gitattributes +1 -0
README.md +149 -0
added_tokens.json +28 -0
chat_template.jinja +120 -0
config.json +424 -0
config_nas_vl.py +34 -0
generation_config.json +13 -0
merges.txt +0 -0
model-00001-of-00002.safetensors +3 -0
model-00002-of-00002.safetensors +3 -0
model.safetensors.index.json +834 -0
modeling_nas_child_vl.py +467 -0
nas_vl_layer.py +690 -0
preprocessor_config.json +39 -0
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer_config.json +240 -0
video_preprocessor_config.json +41 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,149 @@

+# MOSAIC-4B
+**MOSAIC-4B** is an efficient heterogeneous Vision-Language Model derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) via the **MOSAIC** (**M**ulti-**O**bjective **S**earch for **A**daptive **I**nter-layer **C**omposition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.
+> **Paper:** *MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models* (CVPR 2026)
+> **Authors:** Yuncheng Yang\*, Feiyang Ye\*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan — LiAuto Inc.
+---
+## Highlights
+| Metric | Value |
+|--------|-------|
+| **Decoding speedup (TPOT)** | **2.54×** vs. Qwen3-VL-4B-Instruct |
+| **Prefilling speedup (TTFT @ 96k tokens)** | **1.76×** vs. Qwen3-VL-4B-Instruct |
+| **Performance gap (19 benchmarks avg)** | **−0.6%** on image, **−0.8%** on video |
+| **Training cost** | **< 2%** of original Qwen3-VL-4B-Instruct |
+### Key Advantages
+- **Hardware-aware automatic architecture search.** MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.
+- **Heterogeneous operator mixing.** Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility lets the search reach performance-efficiency trade-offs that hand-designed fixed-ratio patterns cannot.
+- **Matches teacher performance at a fraction of the training cost.** MOSAIC-4B matches Qwen3-VL-4B-Instruct on image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%) across 19 representative benchmarks while using only ~32M publicly available training samples — less than 2% of the original model's training compute.
+- **Scalable inference acceleration.** The speedup grows with sequence length: TPOT reaches 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.
+- **Principled two-stage parameter recovery.** Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
+---
+## Architecture
+MOSAIC-4B has 36 decoder layers with the following per-layer operator assignment discovered by the search:
+- **Shallow layers (0–6):** Full attention (GQA) with two Sliding Window Attention (SWA) layers (2 and 5) for local context
+- **Middle layers (7–28):** Linear attention (GDN, KDA) interleaved with retained full-attention layers and one SWA layer (14) for efficient long-range modeling
+- **Deep layers (29–35):** Multi-head Latent Attention (MLA) alternating with KDA in layers 29–33, followed by full attention in the final two layers (34–35)
+This pattern, discovered purely through data-driven optimization, independently validates known architectural intuitions: local mechanisms suffice early, linear attention handles the bulk of computation, and global attention is preserved where representations matter most.
+---
+## Installation
+```bash
+pip install "transformers>=4.57.0" "torch>=2.0"
+pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators
+```
+---
+## Usage
+This model uses a custom architecture and requires `trust_remote_code=True`.
+### Basic Text + Image Inference
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+import torch
+model_id = "LiAuto-DSR/MOSAIC-4B"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://example.com/image.jpg"},
+            {"type": "text", "text": "Describe this image in detail."},
+        ],
+    }
+]
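+# Tokenize through the chat template so the image is actually loaded and its
+# pixel inputs are passed to the model together with the text.
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+with torch.no_grad():
+    output_ids = model.generate(**inputs, max_new_tokens=512)
+# Drop the prompt tokens and decode only the newly generated ones.
+response = processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)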
+print(response)
+```
+### Text-Only Inference
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+import torch
+model_id = "LiAuto-DSR/MOSAIC-4B"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+messages = [{"role": "user", "content": "Explain the concept of neural architecture search."}]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], return_tensors="pt").to(model.device)
+with torch.no_grad():
+    output_ids = model.generate(**inputs, max_new_tokens=512)
+response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+print(response)
+```
+---
+## Dependencies
+| Package | Version |
+|---------|---------|
+| transformers | ≥ 4.57.0 |
+| torch | ≥ 2.0 |
+| flash-linear-attention (fla) | latest |
+---
+## Citation
+```bibtex
+@inproceedings{yang2026mosaic,
+  title     = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
+  author    = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
+  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+  year      = {2026}
+}
+```
+---
+## License
+This model is released under the **Apache 2.0** license.
+The base model weights are derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), which is licensed under [Qwen Research License](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct/blob/main/LICENSE).
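
The per-layer operator selection described in the README can be illustrated with a toy version of the problem. The sketch below is not the MOSAIC solver: it collapses the multi-objective MIP into a single score maximized under one latency budget and solves it exactly with a small dynamic program. The function name `select_operators`, the candidate scores, and the latency units are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

# Per layer: operator name -> (quality score, latency in integer units).
# Every number here is made up for illustration.
Candidates = Dict[str, Tuple[float, int]]


def select_operators(layers: List[Candidates], budget: int) -> Tuple[float, List[str]]:
    """Pick one operator per layer, maximizing total score with total latency <= budget."""
    # best[c] holds the highest-scoring assignment of the layers seen so far
    # whose total latency is exactly c.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for cost, (score, picks) in best.items():
            for name, (gain, latency) in candidates.items():
                total = cost + latency
                if total > budget:
                    continue  # violates the latency constraint
                option = (score + gain, picks + [name])
                if total not in nxt or option[0] > nxt[total][0]:
                    nxt[total] = option
        best = nxt
    if not best:
        raise ValueError("no assignment fits the latency budget")
    return max(best.values(), key=lambda entry: entry[0])


LAYERS: List[Candidates] = [
    {"full_attention": (1.0, 4), "swa": (0.95, 2), "gdn": (0.7, 1)},
    {"full_attention": (1.0, 4), "swa": (0.8, 2), "gdn": (0.9, 1)},
    {"full_attention": (1.0, 4), "swa": (0.6, 2), "gdn": (0.5, 1)},
]
score, picks = select_operators(LAYERS, budget=7)
print(picks, round(score, 2))
```

With a budget that cannot afford full attention everywhere, the solver keeps it only in the layer where the cheaper operators lose the most score, which is the kind of non-uniform assignment the README describes.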

added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}
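
The special-token IDs in this file are the ones that `config.json` refers to by number. As a quick cross-check, the sketch below copies a subset of the mapping above and the token-ID fields from `config.json`, then resolves each field back to its token string. The helper name `resolve` is an assumption made for this example.

```python
# Subset of added_tokens.json (token string -> ID), copied from the diff above.
ADDED_TOKENS = {
    "<|endoftext|>": 151643,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
    "<|vision_start|>": 151652,
    "<|vision_end|>": 151653,
    "<|image_pad|>": 151655,
    "<|video_pad|>": 151656,
}

# Token-ID fields as they appear in config.json.
CONFIG_IDS = {
    "eos_token_id": 151645,
    "pad_token_id": 151643,
    "image_token_id": 151655,
    "video_token_id": 151656,
    "vision_start_token_id": 151652,
    "vision_end_token_id": 151653,
}

ID_TO_TOKEN = {token_id: token for token, token_id in ADDED_TOKENS.items()}


def resolve(config_ids):
    """Map each config field to the token string its ID points at."""
    return {field: ID_TO_TOKEN[token_id] for field, token_id in config_ids.items()}


resolved = resolve(CONFIG_IDS)
for field, token in resolved.items():
    print(f"{field:24s} -> {token}")
```

A `KeyError` from `resolve` would indicate a config field pointing at an ID that the tokenizer does not define.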

chat_template.jinja ADDED

	@@ -0,0 +1,120 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- for message in messages %}
+    {%- if message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content_item in message.content %}
+                {%- if 'text' in content_item %}
+                    {{- content_item.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and message.content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
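
To make the template's output concrete, here is a simplified plain-Python re-implementation of its no-tools path. It is a sketch rather than a replacement: tool definitions, `tool_calls`, and `tool` turns are omitted, and the names `render_chat` and `_render_content` are invented for this example.

```python
IMAGE = "<|vision_start|><|image_pad|><|vision_end|>"
VIDEO = "<|vision_start|><|video_pad|><|vision_end|>"


def _render_content(content, counts, add_vision_id, allow_vision):
    """Flatten a message's content (a string or a list of typed items) to text."""
    if isinstance(content, str):
        return content
    parts = []
    for item in content:
        is_image = item.get("type") == "image" or "image" in item or "image_url" in item
        is_video = item.get("type") == "video" or "video" in item
        if allow_vision and is_image:
            counts["image"] += 1
            if add_vision_id:
                parts.append(f"Picture {counts['image']}: ")
            parts.append(IMAGE)
        elif allow_vision and is_video:
            counts["video"] += 1
            if add_vision_id:
                parts.append(f"Video {counts['video']}: ")
            parts.append(VIDEO)
        elif "text" in item:
            parts.append(item["text"])
    return "".join(parts)


def render_chat(messages, add_generation_prompt=True, add_vision_id=False):
    counts = {"image": 0, "video": 0}  # shared across turns, like the namespaces
    out = []
    # A system message is only emitted when it is the first message.
    if messages and messages[0]["role"] == "system":
        text = _render_content(messages[0]["content"], counts, add_vision_id, False)
        out.append(f"<|im_start|>system\n{text}<|im_end|>\n")
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            continue  # system handled above; tool turns are outside this sketch
        body = _render_content(message["content"], counts, add_vision_id, role == "user")
        out.append(f"<|im_start|>{role}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
prompt = render_chat(example)
print(prompt)
```

Note that the template emits a single `<|image_pad|>` placeholder per image regardless of the image's size or URL.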

config.json ADDED

	@@ -0,0 +1,424 @@

+{
+  "architectures": [
+    "NasChildVLModelForCausalLM"
+  ],
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "image_token_id": 151655,
+  "model_type": "nas-child-vl",
+  "nas_layer_config": [
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "swa",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true,
+      "sliding_window": 1024
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "swa",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true,
+      "sliding_window": 1024
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "swa",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true,
+      "sliding_window": 1024
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "kda",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 8192,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "kda",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "kda",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "gdn",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "mla",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "kda",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "mla",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "kda",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "mla",
+      "block_metric": "mse",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 0,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    },
+    {
+      "attention_type": "full_attention",
+      "child_intermediate_size": 9728,
+      "child_num_attention_heads": 32,
+      "ffn_type": "ffn",
+      "gqa_num_kv_heads": 8,
+      "inherit": true
+    }
+  ],
+  "pad_token_id": 151643,
+  "text_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2560,
+    "initializer_range": 0.02,
+    "intermediate_size": 9728,
+    "layer_types": [
+      "full_attention",
+      "full_attention",
+      "sliding_attention",
+      "full_attention",
+      "full_attention",
+      "sliding_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "sliding_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention"
+    ],
+    "max_position_embeddings": 262144,
+    "model_type": "qwen3_vl_text",
+    "num_attention_heads": 32,
+    "num_hidden_layers": 36,
+    "num_key_value_heads": 8,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_type": "default"
+    },
+    "rope_theta": 5000000,
+    "sliding_window": 1024,
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "vocab_size": 151936
+  },
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.0",
+  "video_token_id": 151656,
+  "vision_config": {
+    "deepstack_visual_indexes": [
+      5,
+      11,
+      17
+    ],
+    "depth": 24,
+    "dtype": "bfloat16",
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1024,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "model_type": "qwen3_vl",
+    "num_heads": 16,
+    "num_position_embeddings": 2304,
+    "out_hidden_size": 2560,
+    "patch_size": 16,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652,
+  "auto_map": {
+    "AutoConfig": "config_nas_vl.NasChildVLConfig",
+    "AutoModelForCausalLM": "modeling_nas_child_vl.NasChildVLModelForCausalLM"
+  }
+}
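
The 36 entries of `nas_layer_config` are easier to read as a tally. The sketch below copies the `attention_type` of each layer from the config above, counts the operators, and checks that the SWA layers line up with the `sliding_attention` entries in `text_config.layer_types`. The last part is a rough, assumption-laden estimate of per-token KV-cache size: it assumes a bf16 cache and counts only full-attention layers, since SWA is capped by its 1024-token window, the linear layers carry a fixed-size state, and the MLA cache size cannot be derived from the values shown here.

```python
from collections import Counter

F, S, G, K, M = "full_attention", "swa", "gdn", "kda", "mla"

# attention_type of layers 0..35, copied from nas_layer_config.
ATTENTION_TYPES = [
    F, F, S, F, F, S, F,        # layers 0-6
    G, F, F, F, G, F, F,        # layers 7-13
    S, G, G, G, F, F, K, G,     # layers 14-21
    F, F, F, K, K, G, G,        # layers 22-28
    M, K, M, K, M, F, F,        # layers 29-35
]

counts = Counter(ATTENTION_TYPES)
swa_layers = [i for i, kind in enumerate(ATTENTION_TYPES) if kind == S]

# Indices marked "sliding_attention" in text_config.layer_types.
SLIDING_IN_LAYER_TYPES = [2, 5, 14]

# Values from text_config; 2 bytes per value assumes a bf16 cache.
NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 8, 128, 2
# Keys and values, per token, for one full-attention layer.
kv_bytes_per_token_per_layer = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

teacher_kv = len(ATTENTION_TYPES) * kv_bytes_per_token_per_layer  # all 36 layers full attention
mosaic_full_kv = counts[F] * kv_bytes_per_token_per_layer         # full-attention layers only

print(dict(counts))
print("SWA layers:", swa_layers)
print("KV bytes/token, homogeneous 36-layer stack:", teacher_kv)
print("KV bytes/token, MOSAIC full-attention layers only:", mosaic_full_kv)
```

The tally shows that full attention is retained in a little under half of the layers, with the remainder split between linear, sliding-window, and latent attention.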

config_nas_vl.py ADDED

	@@ -0,0 +1,34 @@

+from transformers import AutoConfig, Qwen3VLConfig
+class NasChildVLConfig(Qwen3VLConfig):
+    """Qwen3-VL config extended with a per-layer NAS operator assignment."""
+    model_type = "nas-child-vl"
+    def __init__(self, nas_layer_config=None, **kwargs):
+        # model_type is fixed by the class attribute; drop any serialized copy.
+        kwargs.pop("model_type", None)
+        super().__init__(**kwargs)
+        if nas_layer_config:
+            self.nas_layer_config = nas_layer_config
+        elif not getattr(self, "nas_layer_config", None):
+            # Nothing was passed in and the parent set nothing: default to empty.
+            self.nas_layer_config = []
+    def to_dict(self):
+        output = super().to_dict()
+        # Serialize the layer assignment at the top level only, not inside text_config.
+        output["nas_layer_config"] = self.nas_layer_config
+        if isinstance(output.get("text_config"), dict):
+            output["text_config"].pop("nas_layer_config", None)
+        return output
+# Registration raises if "nas-child-vl" is already registered (e.g. on re-import); ignore that.
+try:
+    AutoConfig.register("nas-child-vl", NasChildVLConfig)
+except Exception:
+    pass

generation_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "temperature": 0.7,
+  "top_k": 20,
+  "top_p": 0.8,
+  "transformers_version": "4.57.0"
+}
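
The three sampling settings above (`temperature`, `top_k`, `top_p`) act as successive filters on the next-token distribution. The sketch below shows that sequence on a plain dictionary of logits. It is a stdlib-only illustration of the usual filter order, not the library's implementation, and the function name `sampling_distribution` is an assumption made for this example.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return the renormalized token probabilities left after the three filters."""
    # 1) Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # 2) Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [(tok, math.exp(value - peak)) for tok, value in ranked]
    total = sum(w for _, w in weights)
    # 3) Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, w in weights:
        kept.append((tok, w / total))
        cumulative += w / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# One dominant token: it alone exceeds top_p, so nothing else survives.
peaked = sampling_distribution({"a": 10.0, "b": 0.0, "c": -1.0})
# Thirty equally likely tokens: top-k trims to 20, top-p trims further.
flat = sampling_distribution({f"t{i}": 0.0 for i in range(30)})
print(peaked)
print(len(flat), "tokens remain in the flat case")
```

With these defaults a confident prediction is effectively greedy, while a flat distribution is still cut to a bounded candidate set before sampling.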

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model-00001-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:001f2ca85f0cf1a1586fb80881b6d00f5de0ef03e82a8d3fd01463c573bd6a87
+size 4969968120

model-00002-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:35d966a49853f0718bb33d6129eebbd75b5d54b49d94fa4a332a1a478e7112b2
+size 4484711536
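
The two shard entries above are Git LFS pointer files, not the weights themselves: each records a spec version, a SHA-256 object ID, and the byte size of the real file. The sketch below parses both pointers and compares their combined size with the `total_size` value recorded in `model.safetensors.index.json`. The helper name `parse_lfs_pointer` is an assumption made for this example, and the reading of the small size difference as per-file safetensors header overhead is a presumption, not something the commit states.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


SHARD_1 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:001f2ca85f0cf1a1586fb80881b6d00f5de0ef03e82a8d3fd01463c573bd6a87
size 4969968120
"""

SHARD_2 = """\
version https://git-lfs.github.com/spec/v1
oid sha256:35d966a49853f0718bb33d6129eebbd75b5d54b49d94fa4a332a1a478e7112b2
size 4484711536
"""

# "total_size" from the metadata block of model.safetensors.index.json.
INDEX_TOTAL_SIZE = 9454589920

pointers = [parse_lfs_pointer(SHARD_1), parse_lfs_pointer(SHARD_2)]
total_on_disk = sum(p["size"] for p in pointers)
overhead = total_on_disk - INDEX_TOTAL_SIZE

print(f"shards on disk: {total_on_disk} bytes ({total_on_disk / 2**30:.2f} GiB)")
print(f"difference from index total_size: {overhead} bytes")
```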

model.safetensors.index.json ADDED

	@@ -0,0 +1,834 @@

+{
+  "metadata": {
+    "total_parameters": 4338338800,
+    "total_size": 9454589920
+  },
+  "weight_map": {
+    "embed_tokens.weight": "model-00001-of-00002.safetensors",
+    "layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.A_log": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.A_log": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.16.self_attn.A_log": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.29.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.31.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.A_log": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+    "layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+    "layers.33.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+    "layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.A_log": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+    "layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "lm_head.weight": "model-00002-of-00002.safetensors",
+    "norm.weight": "model-00002-of-00002.safetensors",
+    "visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
+    "visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.norm.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.0.norm.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.norm.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.1.norm.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.norm.bias": "model-00001-of-00002.safetensors",
+    "visual.deepstack_merger_list.2.norm.weight": "model-00001-of-00002.safetensors",
+    "visual.merger.linear_fc1.bias": "model-00001-of-00002.safetensors",
+    "visual.merger.linear_fc1.weight": "model-00001-of-00002.safetensors",
+    "visual.merger.linear_fc2.bias": "model-00001-of-00002.safetensors",
+    "visual.merger.linear_fc2.weight": "model-00001-of-00002.safetensors",
+    "visual.merger.norm.bias": "model-00001-of-00002.safetensors",
+    "visual.merger.norm.weight": "model-00001-of-00002.safetensors",
+    "visual.patch_embed.proj.bias": "model-00001-of-00002.safetensors",
+    "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors",
+    "visual.pos_embed.weight": "model-00001-of-00002.safetensors"
+  }
+}
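
The weight map above shows two key patterns under `self_attn`: most decoder layers store `q_norm`/`k_norm` weights, while `layers.7` instead stores `A_log`, `dt_bias`, and `*_conv1d` weights. The following is a minimal sketch, assuming only the `weight_map` layout shown above, of how one might group decoder layers by the attention parameters they carry. The helper names (`group_layers_by_attention_keys`, `label_layer`) and the two labels are illustrative assumptions, not part of this repository, and the labels are inferred from key names alone.

```python
import re
from collections import defaultdict

# Matches decoder-layer attention parameters such as "layers.7.self_attn.A_log".
LAYER_KEY = re.compile(r"^layers\.(\d+)\.self_attn\.(.+)$")


def group_layers_by_attention_keys(weight_map):
    """Map each layer index to a sorted tuple of its self_attn parameter names."""
    per_layer = defaultdict(set)
    for name in weight_map:
        match = LAYER_KEY.match(name)
        if match:
            per_layer[int(match.group(1))].add(match.group(2))
    return {idx: tuple(sorted(keys)) for idx, keys in sorted(per_layer.items())}


def label_layer(attn_keys):
    """Hypothetical labels, chosen purely from which parameter names are present."""
    if "A_log" in attn_keys or "dt_bias" in attn_keys:
        return "recurrent-style"
    return "softmax-style"


# A small excerpt of the weight map; the full map would come from
# json.load(...)["weight_map"] on model.safetensors.index.json.
sample = {
    "layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
    "layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.7.self_attn.A_log": "model-00001-of-00002.safetensors",
    "layers.7.self_attn.dt_bias": "model-00001-of-00002.safetensors",
    "layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "norm.weight": "model-00002-of-00002.safetensors",
}

groups = group_layers_by_attention_keys(sample)
labels = {idx: label_layer(keys) for idx, keys in groups.items()}
print(labels)
```

Grouping by key names needs only the index file, so the layer layout can be inspected without loading either shard.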

modeling_nas_child_vl.py ADDED

	@@ -0,0 +1,467 @@

+import torch
+import torch.nn as nn
+from typing import List, Optional, Any
+from dataclasses import dataclass
+from transformers.models.qwen3_vl import Qwen3VLPreTrainedModel
+from transformers.models.qwen3_vl.modeling_qwen3_vl import (
+    Qwen3VLModel,
+    Qwen3VLTextModel,
+    Qwen3VLVisionModel,
+    Qwen3VLTextRMSNorm,
+    Qwen3VLTextRotaryEmbedding,
+)
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from transformers.cache_utils import DynamicCache
+from transformers.utils import is_torchdynamo_compiling
+from transformers.generation import GenerationMixin
+from .config_nas_vl import NasChildVLConfig
+from .nas_vl_layer import NasVLDecoderLayer, ChildLayerVLConfig
+@dataclass
+class Qwen3VLCausalLMOutputWithPast(CausalLMOutputWithPast):
+    rope_deltas: Optional[torch.Tensor] = None
+class NasChildVLModelForCausalLM(Qwen3VLPreTrainedModel, GenerationMixin):
+    config_class = NasChildVLConfig
+    _checkpoint_conversion_mapping = {}
+    _tied_weights_keys = ["lm_head.weight"]
+    get_image_features   = Qwen3VLModel.get_image_features
+    get_video_features   = Qwen3VLModel.get_video_features
+    get_placeholder_mask = Qwen3VLModel.get_placeholder_mask
+    get_rope_index       = Qwen3VLModel.get_rope_index
+    _deepstack_process   = Qwen3VLTextModel._deepstack_process
+    def __init__(self, config: NasChildVLConfig):
+        super().__init__(config)
+        self.parent_config = config
+        self.is_vl = True
+        self.rope_deltas = None
+        text_config = config.text_config
+        self.parent_hidden_size = text_config.hidden_size
+        self.child_hidden_size = self.parent_hidden_size
+        self.vocab_size = text_config.vocab_size
+        self.visual = Qwen3VLVisionModel._from_config(config.vision_config)
+        self.embed_tokens = nn.Embedding(
+            text_config.vocab_size, text_config.hidden_size
+        )
+        if self.child_hidden_size != self.parent_hidden_size:
+            self.input_proj  = nn.Linear(self.parent_hidden_size, self.child_hidden_size, bias=False)
+            self.output_proj = nn.Linear(self.child_hidden_size, self.parent_hidden_size, bias=False)
+        else:
+            self.input_proj  = nn.Identity()
+            self.output_proj = nn.Identity()
+        layer_types = []
+        global_sliding_window = None
+        for i in range(text_config.num_hidden_layers):
+            cfg = config.nas_layer_config[i]
+            if isinstance(cfg, dict):
+                cfg = ChildLayerVLConfig(**cfg)
+            attn_type = str(cfg.attention_type).split('.')[-1].lower()
+            if attn_type == "swa":
+                layer_types.append("sliding_attention")
+                if global_sliding_window is None:
+                    sw_val = getattr(cfg, "sliding_window", 1024)
+                    global_sliding_window = int(sw_val) if sw_val else 1024
+            else:
+                layer_types.append("full_attention")
+        if hasattr(self.config, "text_config"):
+            self.config.text_config.layer_types = layer_types
+            if global_sliding_window is not None:
+                self.config.text_config.sliding_window = global_sliding_window
+        else:
+            self.config.layer_types = layer_types
+            if global_sliding_window is not None:
+                self.config.sliding_window = global_sliding_window
+        self.layers = nn.ModuleList()
+        for i in range(text_config.num_hidden_layers):
+            cfg = config.nas_layer_config[i]
+            self.layers.append(
+                NasVLDecoderLayer(
+                    layer_idx=i,
+                    nas_config=cfg,
+                    parent_config=config,
+                    parent_model=None,
+                )
+            )
+        self.norm = Qwen3VLTextRMSNorm(
+            self.child_hidden_size, eps=text_config.rms_norm_eps
+        )
+        self.lm_head = nn.Linear(
+            self.parent_hidden_size, self.vocab_size, bias=False
+        )
+        if config.tie_word_embeddings:
+            self.lm_head.weight = self.embed_tokens.weight
+        self.rotary_emb = Qwen3VLTextRotaryEmbedding(config=text_config)
+        self.has_sliding_layers = False
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embed_tokens
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def prepare_inputs_for_generation(
+        self, input_ids, past_key_values=None, attention_mask=None,
+        inputs_embeds=None, cache_position=None, position_ids=None,
+        use_cache=True, pixel_values=None, pixel_values_videos=None,
+        image_grid_thw=None, video_grid_thw=None, **kwargs,
+    ):
+        model_inputs = super().prepare_inputs_for_generation(
+            input_ids, past_key_values=past_key_values,
+            attention_mask=attention_mask, inputs_embeds=inputs_embeds,
+            cache_position=cache_position, position_ids=position_ids,
+            use_cache=use_cache, **kwargs,
+        )
+        model_inputs.update({
+            "pixel_values": pixel_values,
+            "pixel_values_videos": pixel_values_videos,
+            "image_grid_thw": image_grid_thw,
+            "video_grid_thw": video_grid_thw,
+        })
+        model_inputs["position_ids"] = None
+        if cache_position is not None and cache_position[0] != 0:
+            model_inputs["pixel_values"] = None
+            model_inputs["pixel_values_videos"] = None
+        return model_inputs
+    def _get_image_nums_and_video_nums(self, input_ids, inputs_embeds=None):
+        image_token_id = self.config.image_token_id
+        video_token_id = self.config.video_token_id
+        vision_start_token_id = self.config.vision_start_token_id
+        if inputs_embeds is not None:
+            dev = inputs_embeds.device
+            def _embed(tid):
+                return self.embed_tokens(
+                    torch.tensor(tid, dtype=torch.long, device=dev)
+                )
+            vision_start_mask = (inputs_embeds == _embed(vision_start_token_id))[..., 0]
+            image_mask = (inputs_embeds == _embed(image_token_id))[..., 0]
+            video_mask = (inputs_embeds == _embed(video_token_id))[..., 0]
+        else:
+            vision_start_mask = input_ids == vision_start_token_id
+            image_mask = input_ids == image_token_id
+            video_mask = input_ids == video_token_id
+        vision_first_mask = torch.roll(vision_start_mask, shifts=1, dims=1)
+        image_nums = torch.sum(vision_first_mask & image_mask, dim=1)
+        video_nums = torch.sum(vision_first_mask & video_mask, dim=1)
+        return image_nums, video_nums
+    def _expand_inputs_for_generation(
+        self, expand_size=1, is_encoder_decoder=False, input_ids=None,
+        **model_kwargs,
+    ):
+        if expand_size == 1:
+            return input_ids, model_kwargs
+        visual_keys = [
+            "pixel_values", "image_grid_thw",
+            "pixel_values_videos", "video_grid_thw",
+            "second_per_grid_ts",
+        ]
+        def _repeat_interleave_samples(x, lengths, repeat_times):
+            samples = torch.split(x, lengths)
+            repeat_args = [repeat_times] + [1] * (x.dim() - 1)
+            return torch.cat([s.repeat(*repeat_args) for s in samples], dim=0)
+        def _expand_visual(d):
+            image_grid_thw = model_kwargs.get("image_grid_thw")
+            video_grid_thw = model_kwargs.get("video_grid_thw")
+            image_nums, video_nums = self._get_image_nums_and_video_nums(
+                input_ids, inputs_embeds=model_kwargs.get("inputs_embeds")
+            )
+            for key in list(d.keys()):
+                if d[key] is None:
+                    continue
+                if key == "pixel_values":
+                    lens = [torch.prod(s, dim=1).sum()
+                            for s in torch.split(image_grid_thw, list(image_nums))]
+                    d[key] = _repeat_interleave_samples(d[key], lens, expand_size)
+                elif key == "image_grid_thw":
+                    d[key] = _repeat_interleave_samples(d[key], list(image_nums), expand_size)
+                elif key == "pixel_values_videos":
+                    lens = [torch.prod(s, dim=1).sum()
+                            for s in torch.split(video_grid_thw, list(video_nums))]
+                    d[key] = _repeat_interleave_samples(d[key], lens, expand_size)
+                elif key == "video_grid_thw":
+                    d[key] = _repeat_interleave_samples(d[key], list(video_nums), expand_size)
+                elif key == "second_per_grid_ts":
+                    d[key] = _repeat_interleave_samples(d[key], list(video_nums), expand_size)
+            return d
+        def _expand_general(d):
+            for key in d:
+                if (key != "cache_position" and d[key] is not None
+                        and isinstance(d[key], torch.Tensor) and key not in visual_keys):
+                    d[key] = d[key].repeat_interleave(expand_size, dim=0)
+            return d
+        model_kwargs = _expand_visual(model_kwargs)
+        if input_ids is not None:
+            input_ids = input_ids.repeat_interleave(expand_size, dim=0)
+        model_kwargs = _expand_general(model_kwargs)
+        if is_encoder_decoder:
+            if model_kwargs.get("encoder_outputs") is None:
+                raise ValueError(
+                    "`encoder_outputs` must be set when `is_encoder_decoder` is True."
+                )
+            model_kwargs["encoder_outputs"] = _expand_general(
+                model_kwargs["encoder_outputs"]
+            )
+        return input_ids, model_kwargs
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: torch.Tensor = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values=None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        pixel_values: Optional[torch.Tensor] = None,
+        pixel_values_videos: Optional[torch.FloatTensor] = None,
+        image_grid_thw: Optional[torch.Tensor] = None,
+        video_grid_thw: Optional[torch.Tensor] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs,
+    ):
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        return_dict = (
+            return_dict if return_dict is not None
+            else self.config.use_return_dict
+        )
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        image_mask = video_mask = None
+        deepstack_image_embeds = deepstack_video_embeds = None
+        if pixel_values is not None and self.visual is not None:
+            image_embeds, deepstack_image_embeds = self.get_image_features(
+                pixel_values, image_grid_thw
+            )
+            image_embeds = torch.cat(image_embeds, dim=0).to(
+                inputs_embeds.device, inputs_embeds.dtype
+            )
+            image_mask, _ = self.get_placeholder_mask(
+                input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
+            )
+            inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
+        if pixel_values_videos is not None and self.visual is not None:
+            video_embeds, deepstack_video_embeds = self.get_video_features(
+                pixel_values_videos, video_grid_thw
+            )
+            video_embeds = torch.cat(video_embeds, dim=0).to(
+                inputs_embeds.device, inputs_embeds.dtype
+            )
+            _, video_mask = self.get_placeholder_mask(
+                input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
+            )
+            inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
+        visual_pos_masks = None
+        deepstack_visual_embeds = None
+        if image_mask is not None and video_mask is not None:
+            image_mask = image_mask[..., 0]
+            video_mask = video_mask[..., 0]
+            visual_pos_masks = image_mask | video_mask
+            deepstack_visual_embeds = []
+            img_joint = image_mask[visual_pos_masks]
+            vid_joint = video_mask[visual_pos_masks]
+            for img_e, vid_e in zip(deepstack_image_embeds, deepstack_video_embeds):
+                joint = img_e.new_zeros(
+                    visual_pos_masks.sum(), img_e.shape[-1]
+                ).to(img_e.device)
+                joint[img_joint, :] = img_e
+                joint[vid_joint, :] = vid_e
+                deepstack_visual_embeds.append(joint)
+        elif image_mask is not None:
+            image_mask = image_mask[..., 0]
+            visual_pos_masks = image_mask
+            deepstack_visual_embeds = deepstack_image_embeds
+        elif video_mask is not None:
+            video_mask = video_mask[..., 0]
+            visual_pos_masks = video_mask
+            deepstack_visual_embeds = deepstack_video_embeds
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache(config=self.config)
+        if cache_position is None:
+            past_seen = (
+                past_key_values.get_seq_length()
+                if past_key_values is not None else 0
+            )
+            cache_position = torch.arange(
+                past_seen, past_seen + inputs_embeds.shape[1],
+                device=inputs_embeds.device,
+            )
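+        # A single-token step that still reports cache_position 0 while the
+        # attention mask spans earlier tokens is treated as a decode step:
+        # the true position is recovered from the mask length.
+        current_seq_len = inputs_embeds.shape[1]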
+        if (current_seq_len == 1 and cache_position[0] == 0
+                and attention_mask is not None):
+            real_past_seen = attention_mask.shape[-1] - 1
+            if real_past_seen > 0:
+                cache_position = torch.tensor(
+                    [real_past_seen], device=inputs_embeds.device
+                )
+        if position_ids is None:
+            attn_mask_tensor = (
+                attention_mask
+                if not isinstance(attention_mask, dict)
+                else attention_mask.get("full_attention")
+            )
+            if attn_mask_tensor is not None and attn_mask_tensor.ndim == 4:
+                attn_mask_tensor = torch.diagonal(
+                    attn_mask_tensor[:, 0], dim1=1, dim2=2
+                )
+                if attn_mask_tensor.dtype.is_floating_point:
+                    attn_mask_tensor = (
+                        attn_mask_tensor
+                        / torch.finfo(attn_mask_tensor.dtype).min
+                    )
+                    attn_mask_tensor = (1.0 - attn_mask_tensor).int()
+            is_real_prefill = (
+                (input_ids is not None and input_ids.shape[1] > 1)
+                or (inputs_embeds is not None and inputs_embeds.shape[1] > 1)
+            )
+            prefill_compiled = is_torchdynamo_compiling() and is_real_prefill
+            prefill_noncompiled = not is_torchdynamo_compiling() and (
+                (cache_position is not None and cache_position[0] == 0)
+                or (past_key_values is None
+                    or past_key_values.get_seq_length() == 0)
+            )
+            should_calc_rope = (
+                (prefill_compiled or prefill_noncompiled)
+                or self.rope_deltas is None
+            )
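+            # On a single-token step with cached rope_deltas, reuse them
+            # instead of recomputing the rope index.
+            if (should_calc_rope and not is_real_prefill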
+                    and self.rope_deltas is not None):
+                should_calc_rope = False
+            if should_calc_rope:
+                position_ids, rope_deltas = self.get_rope_index(
+                    input_ids, image_grid_thw, video_grid_thw,
+                    attention_mask=attn_mask_tensor,
+                )
+                self.rope_deltas = rope_deltas
+            else:
+                batch_size = inputs_embeds.shape[0]
+                seq_length = inputs_embeds.shape[1]
+                delta = (
+                    (cache_position[0] + self.rope_deltas).to(
+                        inputs_embeds.device
+                    )
+                    if cache_position is not None else 0
+                )
+                position_ids = torch.arange(
+                    seq_length, device=inputs_embeds.device
+                ).view(1, -1).expand(batch_size, -1)
+                if cache_position is not None:
+                    delta = delta.repeat_interleave(
+                        batch_size // delta.shape[0], dim=0
+                    )
+                position_ids = position_ids.add(delta)
+                position_ids = position_ids.unsqueeze(0).expand(3, -1, -1)
+        if position_ids.ndim == 3 and position_ids.shape[0] == 4:
+            text_position_ids = position_ids[0]
+            rope_position_ids = position_ids[1:]
+        elif position_ids.ndim == 3:
+            text_position_ids = position_ids[0]
+            rope_position_ids = position_ids
+        else:
+            text_position_ids = position_ids
+            rope_position_ids = position_ids
+        rotary_emb = self.rotary_emb(inputs_embeds, rope_position_ids)
+        hidden_states = self.input_proj(inputs_embeds)
+        all_hidden_states = () if output_hidden_states else None
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+        for i, layer in enumerate(self.layers):
+            layer_outputs = layer(
+                hidden_states,
+                attention_mask=attention_mask,
+                position_ids=text_position_ids,
+                position_embeddings=rotary_emb,
+                use_cache=use_cache,
+                past_key_values=past_key_values,
+                cache_position=cache_position,
+                **kwargs,
+            )
+            hidden_states = (
+                layer_outputs[0]
+                if isinstance(layer_outputs, tuple)
+                else layer_outputs
+            )
+            if (deepstack_visual_embeds is not None
+                    and i < len(deepstack_visual_embeds)):
+                hidden_states = self._deepstack_process(
+                    hidden_states,
+                    visual_pos_masks,
+                    deepstack_visual_embeds[i],
+                )
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+        hidden_states = self.norm(hidden_states)
+        hidden_states = self.output_proj(hidden_states)
+        logits = self.lm_head(hidden_states)
+        loss = None
+        if labels is not None:
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            loss = nn.CrossEntropyLoss()(
+                shift_logits.view(-1, self.vocab_size),
+                shift_labels.view(-1),
+            )
+        return Qwen3VLCausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=past_key_values,
+            hidden_states=all_hidden_states,
+            rope_deltas=self.rope_deltas,
+        )
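
Reviewer sketch (not part of the commit): in `_expand_inputs_for_generation`, the helper `_repeat_interleave_samples` repeats visual tensors per sample rather than per row, because rows belonging to different samples are packed along dim 0 of `pixel_values`. The standalone snippet below reproduces that helper on made-up data so the resulting row order can be compared with `repeat_interleave` on `input_ids`. The tensor values and the name `repeat_interleave_samples` are illustrative assumptions.

```python
import torch


def repeat_interleave_samples(x, lengths, repeat_times):
    # Split the packed rows back into per-sample chunks.
    samples = torch.split(x, lengths)
    # Repeat each chunk as a whole along dim 0; other dims are left alone.
    repeat_args = [repeat_times] + [1] * (x.dim() - 1)
    return torch.cat([s.repeat(*repeat_args) for s in samples], dim=0)


# Hypothetical packed features: sample A owns rows 0-1, sample B owns rows 2-4.
packed = torch.tensor([[0.0], [1.0], [2.0], [3.0], [4.0]])
expanded = repeat_interleave_samples(packed, [2, 3], repeat_times=2)

# Text inputs are expanded row by row, giving the sample order A, A, B, B.
ids = torch.tensor([[10], [20]]).repeat_interleave(2, dim=0)

print(expanded.squeeze(1).tolist())
print(ids.squeeze(1).tolist())
```

Both outputs follow the same sample order (A, A, B, B), which is what keeps each expanded text row aligned with its own block of visual rows.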

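Reviewer sketch (not part of the commit): `_get_image_nums_and_video_nums` counts an image or a video whenever its placeholder token directly follows a vision-start token. The plain-Python restatement below applies that rule to a single sequence. The token IDs and the function name `count_visual_items` are hypothetical. Unlike `torch.roll`, which is circular, this version does not wrap the last token around to position 0.

```python
def count_visual_items(input_ids, vision_start_id, image_id, video_id):
    """Count images and videos in one token sequence."""
    image_num = 0
    video_num = 0
    # Walk over (previous token, current token) pairs.
    for prev, cur in zip(input_ids, input_ids[1:]):
        if prev != vision_start_id:
            continue
        if cur == image_id:
            image_num += 1
        elif cur == video_id:
            video_num += 1
    return image_num, video_num


# Hypothetical IDs: 9 = vision start, 5 = image placeholder, 6 = video placeholder.
sequence = [1, 9, 5, 5, 2, 9, 6, 9, 5]
counts = count_visual_items(sequence, vision_start_id=9, image_id=5, video_id=6)
print(counts)
```

Only the first placeholder after each vision-start token is counted, so a run of repeated image tokens still counts as one image.
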
nas_vl_layer.py ADDED

	@@ -0,0 +1,690 @@

+from typing import Optional, Tuple
+from enum import Enum
+from dataclasses import dataclass, field
+from types import SimpleNamespace
+import torch
+import copy
+from transformers import Qwen3Config
+from transformers import GradientCheckpointingLayer, Cache
+from transformers.masking_utils import (
+    create_causal_mask,
+    create_sliding_window_causal_mask,
+)
+from transformers.models.qwen3.modeling_qwen3 import Qwen3Attention, Qwen3MLP, Qwen3RMSNorm
+from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLTextAttention, Qwen3VLTextMLP, Qwen3VLTextRMSNorm
+from fla.layers.delta_net import DeltaNet
+from fla.models.delta_net.configuration_delta_net import DeltaNetConfig
+from fla.layers.gated_deltanet import GatedDeltaNet
+from fla.models.gated_deltanet.configuration_gated_deltanet import GatedDeltaNetConfig
+from fla.layers.kda import KimiDeltaAttention
+from fla.models.kda.configuration_kda import KDAConfig
+from fla.models.kda.modeling_kda import KDAPreTrainedModel
+from fla.layers.mamba2 import Mamba2
+from fla.models.mamba2.configuration_mamba2 import Mamba2Config
+from fla.models.mamba2.modeling_mamba2 import Mamba2Block
+from fla.layers.gla import GatedLinearAttention
+from fla.models.gla.configuration_gla import GLAConfig
+from fla.layers.nsa import NativeSparseAttention
+from fla.models.nsa.configuration_nsa import NSAConfig
+from fla.layers.mla import MultiheadLatentAttention
+from fla.models.mla.configuration_mla import MLAConfig
+class FLACacheAdapter:
+    """Expose per-layer FLA states kept on a shared cache object (`cache.fla_states`)."""
+    def __init__(self, cache):
+        self.cache = cache
+        if not hasattr(self.cache, 'fla_states'):
+            self.cache.fla_states = {}
+    def get_seq_length(self, layer_idx=None):
+        if layer_idx is not None and layer_idx in self.cache.fla_states:
+            state = self.cache.fla_states[layer_idx]
+            if 'attn_state' in state:
+                attn_state = state['attn_state']
+                if (isinstance(attn_state, tuple) and len(attn_state) == 2
+                        and isinstance(attn_state[0], torch.Tensor)):
+                    return attn_state[0].shape[1]
+        return 0
+    def update(self, attn_state=None, layer_idx=None, offset=None,
+               cache_kwargs=None, **kwargs):
+        if layer_idx is None:
+            layer_idx = kwargs.pop('layer_idx', None)
+        if layer_idx is None:
+            return {}
+        if layer_idx not in self.cache.fla_states:
+            self.cache.fla_states[layer_idx] = {}
+        state = self.cache.fla_states[layer_idx]
+        if attn_state is not None:
+            if (isinstance(attn_state, tuple) and len(attn_state) == 2
+                    and isinstance(attn_state[0], torch.Tensor)
+                    and isinstance(attn_state[1], torch.Tensor)):
+                new_k, new_v = attn_state
+                if 'attn_state' in state:
+                    old_k, old_v = state['attn_state']
+                    new_k = torch.cat([old_k, new_k], dim=1)
+                    new_v = torch.cat([old_v, new_v], dim=1)
+                state['attn_state'] = (new_k, new_v)
+            else:
+                state['attn_state'] = attn_state
+        for key, value in kwargs.items():
+            if key != 'layer_idx':
+                state[key] = value
+        return state
+    def __getitem__(self, layer_idx):
+        return self.cache.fla_states.get(layer_idx, None)
+    def __setitem__(self, layer_idx, value):
+        self.cache.fla_states[layer_idx] = value
+    def __contains__(self, layer_idx):
+        return layer_idx in self.cache.fla_states
+    def __len__(self):
+        if not self.cache.fla_states:
+            return 0
+        return max(self.cache.fla_states.keys()) + 1
+class AttentionType(str, Enum):
+    FULL = "full_attention"
+    SWA = "swa"
+    MAMBA2 = "mamba2"
+    GLA = "gla"
+    GDN = "gdn"
+    DN = "dn"
+    KDA = "kda"
+    NSA = "nsa"
+    MLA = "mla"
+    NOOP = "no-op"
+    LINEAR = "linear"
+class FFNType(str, Enum):
+    FFN = "ffn"
+    MOE = "moe"
+    NOOP = "no-op"
+    LINEAR = "linear"
+    NFFN = "nffn"
+class MetricType(str, Enum):
+    mse = "mse"
+    cosine = "cosine"
+    kl = "kl"
+@dataclass
+class ChildLayerVLConfig:
+    attention_type: Optional[AttentionType] = field(default=None)
+    ffn_type: Optional[FFNType] = field(default=None)
+    block_metric: Optional[MetricType] = field(default=None)
+    child_hidden_size: Optional[int] = field(default=None)
+    child_intermediate_size: Optional[int] = field(default=None)
+    gqa_num_kv_heads: Optional[int] = field(default=None)
+    child_num_attention_heads: Optional[int] = field(default=None)
+    inherit: str = field(default="false")  # "true"/"yes"/"1" (any case) -> True; stored as bool after __post_init__
+    sliding_window: Optional[int] = field(default=1024)
+    def __post_init__(self):
+        if self.inherit is not None:
+            cleaned = str(self.inherit).strip().lower()
+            self.inherit = cleaned in ("true", "yes", "1")
+        else:
+            self.inherit = False
+class NonGatedFFN(torch.nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.up_proj = torch.nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = torch.nn.ReLU()
+    def forward(self, x):
+        return self.down_proj(self.act_fn(self.up_proj(x)))
+class NasVLDecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, layer_idx: int, nas_config, parent_config, parent_model=None):
+        super().__init__()
+        self.parent_config = parent_config
+        self.parent_text_config = parent_config.text_config
+        self.layer_idx = layer_idx
+        if isinstance(nas_config, dict):
+            nas_config = ChildLayerVLConfig(**nas_config)
+        elif not isinstance(nas_config, ChildLayerVLConfig):
+            nas_config = ChildLayerVLConfig(**vars(nas_config))
+        self.nas_config = nas_config
+        self.attention_type = nas_config.attention_type
+        self.inherit = nas_config.inherit
+        self.child_attn_heads = int(
+            getattr(nas_config, "child_num_attention_heads", 0)
+            or self.parent_text_config.num_attention_heads
+        )
+        self.child_kv_heads = int(
+            getattr(nas_config, "gqa_num_kv_heads", 0)
+            or self.parent_text_config.num_key_value_heads
+        )
+        self.child_inter_size = int(
+            getattr(nas_config, "child_intermediate_size", 0)
+            or self.parent_text_config.intermediate_size
+        )
+        self.hidden_size = self.parent_text_config.hidden_size
+        if nas_config.attention_type == AttentionType.FULL:
+            attn_config = copy.deepcopy(self.parent_text_config)
+            attn_config.num_attention_heads = self.child_attn_heads
+            attn_config.num_key_value_heads = self.child_kv_heads
+            attn_config._attn_implementation = "sdpa"
+            self.self_attn = Qwen3VLTextAttention(config=attn_config, layer_idx=layer_idx)
+            if parent_model is not None and self.inherit:
+                teacher_attn = parent_model.model.language_model.layers[layer_idx].self_attn
+                if (self.child_attn_heads == self.parent_text_config.num_attention_heads
+                        and self.child_kv_heads == self.parent_text_config.num_key_value_heads):
+                    self.self_attn.load_state_dict(teacher_attn.state_dict(), strict=True)
+                else:
+                    prune_qwen_attention_head(
+                        student_attn=self.self_attn,
+                        teacher_attn=teacher_attn,
+                        teacher_config=self.parent_text_config,
+                        target_q_heads=self.child_attn_heads,
+                        target_kv_heads=self.child_kv_heads,
+                    )
+        elif nas_config.attention_type == AttentionType.SWA:
+            self.sliding_window = int(
+                getattr(nas_config, "sliding_window", 1024) or 1024
+            )
+            self._swa_mask_config = copy.deepcopy(parent_config)
+            self._swa_mask_config.sliding_window = self.sliding_window
+            if hasattr(self._swa_mask_config, "text_config"):
+                self._swa_mask_config.text_config.sliding_window = self.sliding_window
+            self._swa_mask_config._attn_implementation = "sdpa"
+            if hasattr(self._swa_mask_config, "text_config"):
+                self._swa_mask_config.text_config._attn_implementation = "sdpa"
+            attn_config = copy.deepcopy(self.parent_text_config)
+            attn_config.num_attention_heads = self.child_attn_heads
+            attn_config.num_key_value_heads = self.child_kv_heads
+            attn_config._attn_implementation = "sdpa"
+            self.self_attn = Qwen3VLTextAttention(config=attn_config, layer_idx=layer_idx)
+            if parent_model is not None and self.inherit:
+                teacher_attn = parent_model.model.language_model.layers[layer_idx].self_attn
+                if (self.child_attn_heads == self.parent_text_config.num_attention_heads
+                        and self.child_kv_heads == self.parent_text_config.num_key_value_heads):
+                    self.self_attn.load_state_dict(teacher_attn.state_dict(), strict=True)
+                else:
+                    prune_qwen_attention_head(
+                        student_attn=self.self_attn,
+                        teacher_attn=teacher_attn,
+                        teacher_config=self.parent_text_config,
+                        target_q_heads=self.child_attn_heads,
+                        target_kv_heads=self.child_kv_heads,
+                    )
+        elif nas_config.attention_type == AttentionType.LINEAR:
+            self.self_attn = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+            if parent_model is not None and self.inherit:
+                prune_qwen_attention_head_linear(
+                    student_attn=self.self_attn,
+                    teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                    teacher_config=parent_config.text_config,
+                )
+        elif nas_config.attention_type == AttentionType.KDA:
+            config = KDAConfig(hidden_size=self.hidden_size)
+            config.expand_v = 1
+            self.self_attn = KimiDeltaAttention(
+                mode=config.attn_mode,
+                hidden_size=config.hidden_size,
+                expand_v=config.expand_v,
+                head_dim=config.head_dim,
+                num_heads=config.num_heads,
+                num_v_heads=config.num_v_heads,
+                use_short_conv=config.use_short_conv,
+                allow_neg_eigval=config.allow_neg_eigval,
+                conv_size=config.conv_size,
+                norm_eps=config.norm_eps,
+                layer_idx=layer_idx,
+            )
+            if parent_model is not None and self.inherit:
+                prune_qwen_attention_head_kda(
+                    student_attn=self.self_attn,
+                    teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                    teacher_config=parent_config.text_config,
+                )
+        elif nas_config.attention_type == AttentionType.GDN:
+            config = GatedDeltaNetConfig(hidden_size=self.hidden_size)
+            self.self_attn = GatedDeltaNet(
+                mode=config.attn_mode,
+                hidden_size=config.hidden_size,
+                expand_v=config.expand_v,
+                head_dim=config.head_dim,
+                num_heads=config.num_heads,
+                num_v_heads=config.num_v_heads,
+                use_gate=config.use_gate,
+                use_short_conv=config.use_short_conv,
+                allow_neg_eigval=config.allow_neg_eigval,
+                conv_size=config.conv_size,
+                norm_eps=config.norm_eps,
+                layer_idx=layer_idx,
+            )
+            if parent_model is not None and self.inherit:
+                prune_qwen_attention_head_gdn(
+                    student_attn=self.self_attn,
+                    teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                    teacher_config=parent_config.text_config,
+                )
+        elif nas_config.attention_type == AttentionType.NSA:
+            config = NSAConfig(hidden_size=self.hidden_size)
+            self.self_attn = NativeSparseAttention(
+                hidden_size=config.hidden_size,
+                num_heads=config.num_heads,
+                num_kv_heads=config.num_kv_heads,
+                head_dim=config.head_dim,
+                qkv_bias=config.qkv_bias,
+                block_size=config.block_size,
+                block_counts=config.block_counts,
+                window_size=config.window_size,
+                rope_theta=config.rope_theta,
+                max_position_embeddings=config.max_position_embeddings,
+                layer_idx=layer_idx,
+            )
+            if parent_model is not None and self.inherit:
+                prune_qwen_attention_head_nsa(
+                    student_attn=self.self_attn,
+                    teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                    teacher_config=parent_config.text_config,
+                )
+        elif nas_config.attention_type == AttentionType.MLA:
+            config = MLAConfig(hidden_size=self.hidden_size)
+            self.self_attn = MultiheadLatentAttention(
+                hidden_size=config.hidden_size,
+                num_heads=config.num_heads,
+                q_lora_rank=config.q_lora_rank,
+                qk_rope_head_dim=config.qk_rope_head_dim,
+                kv_lora_rank=config.kv_lora_rank,
+                v_head_dim=config.v_head_dim,
+                qk_nope_head_dim=config.qk_nope_head_dim,
+                qk_head_dim=config.qk_head_dim,
+                window_size=config.window_size,
+                rope_theta=config.rope_theta,
+                max_position_embeddings=config.max_position_embeddings,
+                rope_scaling=config.rope_scaling,
+                layer_idx=layer_idx,
+            )
+            if parent_model is not None and self.inherit:
+                prune_qwen_attention_head_mla(
+                    student_attn=self.self_attn,
+                    teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                    teacher_config=parent_config.text_config,
+                )
+        elif nas_config.attention_type == AttentionType.NOOP:
+            self.self_attn = None
+        else:
+            raise ValueError(f"Unsupported attention type: {nas_config.attention_type}")
+        if nas_config.ffn_type == FFNType.FFN:
+            mlp_config = copy.deepcopy(self.parent_text_config)
+            mlp_config.intermediate_size = self.child_inter_size
+            self.mlp = Qwen3VLTextMLP(mlp_config)
+            if parent_model is not None and self.inherit:
+                teacher_mlp = parent_model.model.language_model.layers[layer_idx].mlp
+                teacher_inter_size = teacher_mlp.up_proj.weight.shape[0]
+                if self.child_inter_size < teacher_inter_size:
+                    init_student_ffn(self.mlp, teacher_mlp, self.child_inter_size)
+                elif self.child_inter_size == teacher_inter_size:
+                    self.mlp.load_state_dict(teacher_mlp.state_dict(), strict=True)
+                else:
+                    raise ValueError(
+                        f"Layer {layer_idx}: Student intermediate size ({self.child_inter_size}) "
+                        f"is larger than Teacher ({teacher_inter_size})."
+                    )
+        elif nas_config.ffn_type == FFNType.LINEAR:
+            self.mlp = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+            if parent_model is not None and self.inherit:
+                init_student_ffn_linear(
+                    self.mlp, parent_model.model.language_model.layers[layer_idx].mlp
+                )
+        elif nas_config.ffn_type == FFNType.NFFN:
+            nffn_config = copy.deepcopy(self.parent_text_config)
+            nffn_config.intermediate_size = self.child_inter_size
+            self.mlp = NonGatedFFN(nffn_config)
+        elif nas_config.ffn_type == FFNType.NOOP:
+            self.mlp = None
+        else:
+            raise ValueError(f"Unsupported FFN type: {nas_config.ffn_type}")
+        norm_eps = self.parent_text_config.rms_norm_eps
+        if self.self_attn is not None:
+            self.input_layernorm = Qwen3VLTextRMSNorm(self.hidden_size, eps=norm_eps)
+            if parent_model is not None:
+                self.input_layernorm.load_state_dict(
+                    parent_model.model.language_model.layers[layer_idx].input_layernorm.state_dict()
+                )
+        else:
+            self.input_layernorm = None
+        if self.mlp is not None:
+            self.post_attention_layernorm = Qwen3VLTextRMSNorm(self.hidden_size, eps=norm_eps)
+            if parent_model is not None:
+                self.post_attention_layernorm.load_state_dict(
+                    parent_model.model.language_model.layers[layer_idx].post_attention_layernorm.state_dict()
+                )
+        else:
+            self.post_attention_layernorm = None
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs,
+    ) -> Tuple[torch.Tensor, Optional[Cache]]:
+        residual = hidden_states
+        present_key_values = past_key_values
+        mask_2d = None
+        mask_4d = None
+        if attention_mask is not None:
+            if attention_mask.ndim == 4:
+                mask_2d = attention_mask[:, 0, -1, :]
+            else:
+                mask_2d = attention_mask
+            if self.nas_config.attention_type == AttentionType.FULL:
+                if attention_mask.ndim == 4:
+                    mask_4d = attention_mask
+                else:
+                    if cache_position is None:
+                        past_seen_tokens = (
+                            past_key_values.get_seq_length()
+                            if past_key_values is not None
+                            else 0
+                        )
+                        cache_position = torch.arange(
+                            past_seen_tokens,
+                            past_seen_tokens + hidden_states.shape[1],
+                            device=hidden_states.device,
+                        )
+                    mask_4d = create_causal_mask(
+                        input_embeds=hidden_states,
+                        attention_mask=attention_mask,
+                        cache_position=cache_position,
+                        past_key_values=past_key_values,
+                        config=self.parent_config,
+                    )
+            elif self.nas_config.attention_type == AttentionType.SWA:
+                if attention_mask.ndim == 4:
+                    mask_4d = attention_mask
+                else:
+                    if cache_position is None:
+                        past_seen_tokens = (
+                            past_key_values.get_seq_length()
+                            if past_key_values is not None
+                            else 0
+                        )
+                        cache_position = torch.arange(
+                            past_seen_tokens,
+                            past_seen_tokens + hidden_states.shape[1],
+                            device=hidden_states.device,
+                        )
+                    mask_4d = create_sliding_window_causal_mask(
+                        config=self._swa_mask_config,
+                        input_embeds=hidden_states,
+                        attention_mask=attention_mask,
+                        cache_position=cache_position,
+                        past_key_values=past_key_values,
+                    )
+        if self.nas_config.attention_type == AttentionType.SWA and mask_4d is None:
+            if cache_position is None:
+                past_seen_tokens = (
+                    past_key_values.get_seq_length()
+                    if past_key_values is not None
+                    else 0
+                )
+                cache_position = torch.arange(
+                    past_seen_tokens,
+                    past_seen_tokens + hidden_states.shape[1],
+                    device=hidden_states.device,
+                )
+            mask_4d = create_sliding_window_causal_mask(
+                config=self._swa_mask_config,
+                input_embeds=hidden_states,
+                attention_mask=None,
+                cache_position=cache_position,
+                past_key_values=past_key_values,
+            )
+        if self.nas_config.attention_type in (AttentionType.FULL, AttentionType.SWA):
+            # FULL and SWA share one call path; only the 4D mask built above differs.
+            hidden_states = self.input_layernorm(hidden_states)
+            hidden_states, _ = self.self_attn(
+                hidden_states=hidden_states,
+                attention_mask=mask_4d,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                position_embeddings=position_embeddings,
+                **kwargs,
+            )
+            hidden_states = residual + hidden_states
+        elif self.nas_config.attention_type == AttentionType.LINEAR:
+            hidden_states = self.input_layernorm(hidden_states)
+            hidden_states = self.self_attn(hidden_states)
+            hidden_states = residual + hidden_states
+        elif self.nas_config.attention_type == AttentionType.NOOP:
+            hidden_states = residual
+        elif self.nas_config.attention_type in [
+            AttentionType.KDA,
+            AttentionType.GDN
+        ]:
+            fla_cache_proxy = None
+            if use_cache and past_key_values is not None:
+                fla_cache_proxy = FLACacheAdapter(past_key_values)
+            if self.training:
+                mode = "chunk"
+            else:
+                mode = "fused_recurrent" if use_cache else "chunk"
+            batch_size, q_len, _ = hidden_states.shape
+            if q_len > 64 or use_cache:
+                hidden_states = self.input_layernorm(hidden_states)
+                outputs = self.self_attn(
+                    hidden_states=hidden_states,
+                    attention_mask=mask_2d,
+                    past_key_values=fla_cache_proxy,
+                    use_cache=use_cache,
+                    mode=mode,
+                    **kwargs,
+                )
+                if isinstance(outputs, tuple):
+                    hidden_states = outputs[0]
+                else:
+                    hidden_states = outputs
+                hidden_states = residual + hidden_states
+            else:
+                hidden_states = residual
+        elif self.nas_config.attention_type == AttentionType.NSA:
+            hidden_states = self.input_layernorm(hidden_states)
+            if self.training:
+                nsa_kwargs = {k: v for k, v in kwargs.items() if k in ("cu_seqlens",)}
+                outputs = self.self_attn(
+                    hidden_states=hidden_states,
+                    attention_mask=mask_2d,
+                    past_key_values=None,
+                    use_cache=False,
+                    **nsa_kwargs,
+                )
+                if isinstance(outputs, tuple):
+                    hidden_states = outputs[0]
+                else:
+                    hidden_states = outputs
+            else:
+                if past_key_values is not None and use_cache:
+                    if not hasattr(past_key_values, "fla_states"):
+                        past_key_values.fla_states = {}
+                    nsa_state = past_key_values.fla_states.get(
+                        f"nsa_hidden_{self.layer_idx}", None
+                    )
+                    if nsa_state is not None:
+                        full_hidden = torch.cat([nsa_state, hidden_states], dim=1)
+                    else:
+                        full_hidden = hidden_states
+                    past_key_values.fla_states[f"nsa_hidden_{self.layer_idx}"] = (
+                        full_hidden.detach()
+                    )
+                    full_mask = None
+                    if mask_2d is not None:
+                        cached_len = full_hidden.shape[1] - hidden_states.shape[1]
+                        if cached_len > 0:
+                            prefix_mask = torch.ones(
+                                mask_2d.shape[0],
+                                cached_len,
+                                dtype=mask_2d.dtype,
+                                device=mask_2d.device,
+                            )
+                            full_mask = torch.cat([prefix_mask, mask_2d], dim=1)
+                        else:
+                            full_mask = mask_2d
+                    outputs = self.self_attn(
+                        hidden_states=full_hidden,
+                        attention_mask=full_mask,
+                        past_key_values=None,
+                        use_cache=False,
+                        **{k: v for k, v in kwargs.items() if k in ("cu_seqlens",)},
+                    )
+                    if isinstance(outputs, tuple):
+                        full_output = outputs[0]
+                    else:
+                        full_output = outputs
+                    hidden_states = full_output[:, -hidden_states.shape[1] :, :]
+                else:
+                    outputs = self.self_attn(
+                        hidden_states=hidden_states,
+                        attention_mask=mask_2d,
+                        past_key_values=None,
+                        use_cache=False,
+                    )
+                    if isinstance(outputs, tuple):
+                        hidden_states = outputs[0]
+                    else:
+                        hidden_states = outputs
+            if isinstance(hidden_states, tuple):
+                hidden_states = hidden_states[0]
+            hidden_states = residual + hidden_states
+        elif self.nas_config.attention_type == AttentionType.MLA:
+            hidden_states = self.input_layernorm(hidden_states)
+            fla_cache_proxy = None
+            if past_key_values is not None:
+                fla_cache_proxy = FLACacheAdapter(past_key_values)
+            outputs = self.self_attn(
+                hidden_states=hidden_states,
+                attention_mask=mask_2d,
+                past_key_values=fla_cache_proxy,
+                use_cache=use_cache,
+                **kwargs,
+            )
+            if isinstance(outputs, tuple):
+                hidden_states = outputs[0]
+            else:
+                hidden_states = outputs
+            hidden_states = residual + hidden_states
+        else:
+            raise ValueError(
+                f"Unsupported attention type: {self.nas_config.attention_type}"
+            )
+        if self.nas_config.ffn_type in [FFNType.FFN, FFNType.NFFN, FFNType.LINEAR]:
+            residual = hidden_states
+            hidden_states = self.post_attention_layernorm(hidden_states)
+            hidden_states = self.mlp(hidden_states)
+            hidden_states = residual + hidden_states
+        elif self.nas_config.ffn_type == FFNType.NOOP:
+            pass
+        else:
+            raise ValueError(f"Unsupported FFN type: {self.nas_config.ffn_type}")
+        return hidden_states, present_key_values
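
Review note: the `forward` method above derives `cache_position` from the number of cached tokens and then builds either a causal or a sliding-window causal mask. The following is a minimal pure-Python sketch of that idea, not the library implementation. The helper names and the window size of 2 are hypothetical, and the window rule (a query sees itself and the `window - 1` keys before it) is an assumption about what `create_sliding_window_causal_mask` produces.

```python
from typing import List


def cache_positions(past_seen_tokens: int, q_len: int) -> List[int]:
    """Absolute positions of the new tokens, mirroring the torch.arange call."""
    return list(range(past_seen_tokens, past_seen_tokens + q_len))


def sliding_window_causal_mask(
    positions: List[int], kv_len: int, window: int
) -> List[List[bool]]:
    """True where the query at absolute position q may attend to key k.

    Assumed rule: causal (k <= q) and inside the window (q - k < window).
    """
    return [
        [(k <= q) and (q - k < window) for k in range(kv_len)]
        for q in positions
    ]


# Prefill: 4 new tokens, empty cache, hypothetical window of 2.
prefill = sliding_window_causal_mask(cache_positions(0, 4), kv_len=4, window=2)

# Decode: 1 new token after 4 cached tokens, so 5 keys are visible in total.
decode = sliding_window_causal_mask(cache_positions(4, 1), kv_len=5, window=2)

for row in prefill:
    print(["x" if allowed else "." for allowed in row])
```

During decoding the query sits at position 4, so under the assumed rule only keys 3 and 4 remain visible. This is why the sliding-window branch needs `cache_position` even when no `attention_mask` is passed.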

preprocessor_config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "crop_size": null,
+  "data_format": "channels_first",
+  "default_to_square": true,
+  "device": null,
+  "disable_grouping": null,
+  "do_center_crop": null,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_pad": null,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "Qwen2VLImageProcessorFast",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "input_data_format": null,
+  "max_pixels": 4194304,
+  "merge_size": 2,
+  "min_pixels": 4096,
+  "pad_size": null,
+  "patch_size": 16,
+  "processor_class": "Qwen3VLProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "return_tensors": null,
+  "size": {
+    "longest_edge": 1605632,
+    "shortest_edge": 3136
+  },
+  "temporal_patch_size": 2
+}
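
Review note: the values above imply that pixels are first rescaled by `rescale_factor` (1/255) and then normalized with mean 0.5 and standard deviation 0.5, which maps 8-bit intensities to roughly [-1, 1]. The sketch below works through that arithmetic and estimates the number of visual tokens for an image. The helper names are hypothetical, and the token count assumes each `merge_size` × `merge_size` group of patches becomes one token, as in the Qwen2-VL family. It is not the processor's actual code.

```python
RESCALE_FACTOR = 0.00392156862745098  # from the config; equal to 1/255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5
PATCH_SIZE = 16
MERGE_SIZE = 2


def normalize_pixel(value: int) -> float:
    """Rescale an 8-bit intensity to [0, 1], then apply (x - mean) / std."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD


def visual_token_count(height: int, width: int) -> int:
    """Estimated token count for an already-resized image.

    Assumes both sides are multiples of PATCH_SIZE * MERGE_SIZE and that each
    MERGE_SIZE x MERGE_SIZE block of patches is merged into a single token.
    """
    unit = PATCH_SIZE * MERGE_SIZE
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    return patches // (MERGE_SIZE ** 2)


print(normalize_pixel(0), normalize_pixel(255))
print(visual_token_count(448, 448))
```

Under these assumptions a 448 × 448 image yields 28 × 28 = 784 patches, which merge into 196 tokens.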

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+size 11422654

tokenizer_config.json ADDED

	@@ -0,0 +1,240 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "processor_class": "Qwen3VLProcessor",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
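
Review note: in the config above, `eos_token` and `pad_token` are given by content, while `added_tokens_decoder` maps ids to content and a `special` flag. The sketch below copies a small subset of those entries and resolves the two tokens back to their ids. The helper names are hypothetical. Dropping entries flagged `special` is only an approximation of what decoding with `skip_special_tokens=True` is expected to do, not the tokenizer's actual logic.

```python
from typing import Dict, List, Tuple

# Subset of added_tokens_decoder from the config: id -> (content, special).
ADDED_TOKENS: Dict[int, Tuple[str, bool]] = {
    151643: ("<|endoftext|>", True),
    151645: ("<|im_end|>", True),
    151655: ("<|image_pad|>", True),
    151657: ("<tool_call>", False),
    151667: ("<think>", False),
}
EOS_TOKEN = "<|im_end|>"
PAD_TOKEN = "<|endoftext|>"


def token_id(content: str) -> int:
    """Reverse lookup from token content to its id."""
    for idx, (text, _special) in ADDED_TOKENS.items():
        if text == content:
            return idx
    raise KeyError(content)


def visible_tokens(ids: List[int]) -> List[str]:
    """Keep only entries whose `special` flag is false (assumed skip behavior)."""
    return [ADDED_TOKENS[i][0] for i in ids if not ADDED_TOKENS[i][1]]


print(token_id(EOS_TOKEN), token_id(PAD_TOKEN))
print(visible_tokens([151667, 151655, 151645]))
```

Note that `<think>` and `<tool_call>` are added tokens but are not flagged as special, so under this assumption they would survive a decode that skips special tokens.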

video_preprocessor_config.json ADDED

	@@ -0,0 +1,41 @@

+{
+  "crop_size": null,
+  "data_format": "channels_first",
+  "default_to_square": true,
+  "device": null,
+  "do_center_crop": null,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "do_sample_frames": true,
+  "fps": null,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "input_data_format": null,
+  "max_frames": 64,
+  "merge_size": 2,
+  "min_frames": 4,
+  "num_frames": null,
+  "pad_size": null,
+  "patch_size": 16,
+  "processor_class": "Qwen3VLProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "return_metadata": false,
+  "size": {
+    "longest_edge": 14680064,
+    "shortest_edge": 1228800
+  },
+  "temporal_patch_size": 2,
+  "video_metadata": null,
+  "video_processor_type": "Qwen3VLVideoProcessor"
+}
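
Review note: with `do_sample_frames` enabled, the number of sampled frames is bounded by `min_frames` (4) and `max_frames` (64). The sketch below shows one plausible way such bounds could be applied. Because `fps` is `null` in this config, the `target_fps` argument is a hypothetical caller-supplied value, and both helpers are illustrative assumptions rather than the `Qwen3VLVideoProcessor` implementation.

```python
from typing import List

MIN_FRAMES = 4   # from the config
MAX_FRAMES = 64  # from the config


def frames_to_sample(total_frames: int, video_fps: float, target_fps: float) -> int:
    """Frames wanted at target_fps, clamped to the configured bounds."""
    wanted = int(total_frames / video_fps * target_fps)
    clamped = max(MIN_FRAMES, min(wanted, MAX_FRAMES))
    # Never ask for more frames than the clip actually contains.
    return min(clamped, total_frames)


def sample_indices(total_frames: int, num_frames: int) -> List[int]:
    """Evenly spaced frame indices from the first frame to the last."""
    if num_frames <= 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


# A 10-second clip at 30 fps, sampled at a hypothetical 2 fps.
count = frames_to_sample(total_frames=300, video_fps=30.0, target_fps=2.0)
print(count, sample_indices(300, count)[:3])
```

Short clips are raised to the lower bound and long clips are capped at the upper bound, which keeps the visual token budget predictable.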

vocab.json ADDED

The diff for this file is too large to render. See raw diff