Yaphabates committed
Commit 52f434f · verified · 1 Parent(s): 7ccdd11

Initial release of MOSAIC-4B

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,149 @@
1
+ # MOSAIC-4B
2
+
3
+ **MOSAIC-4B** is an efficient heterogeneous Vision-Language Model derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) via the **MOSAIC** (**M**ulti-**O**bjective **S**earch for **A**daptive **I**nter-layer **C**omposition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.
4
+
5
+ > **Paper:** *MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models* (CVPR 2026)
6
+ > **Authors:** Yuncheng Yang\*, Feiyang Ye\*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan — LiAuto Inc.
7
+
8
+ ---
9
+
10
+ ## Highlights
11
+
12
+ | Metric | Value |
13
+ |--------|-------|
14
+ | **Decoding speedup (TPOT @ 1k decode length)** | **2.54×** vs. Qwen3-VL-4B-Instruct |
15
+ | **Prefilling speedup (TTFT @ 96k tokens)** | **1.76×** vs. Qwen3-VL-4B-Instruct |
16
+ | **Performance gap (19 benchmarks avg)** | **−0.6%** on image, **−0.8%** on video |
17
+ | **Training cost** | **< 2%** of original Qwen3-VL-4B-Instruct |
18
+
19
+ ### Key Advantages
20
+
21
+ - **Hardware-aware automatic architecture search.** MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.
22
+
23
+ - **Heterogeneous operator mixing.** Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility reaches points on the performance-efficiency frontier that hand-designed fixed-ratio patterns cannot.
24
+
25
+ - **Near-teacher performance at a fraction of the training cost.** MOSAIC-4B stays within 1% of Qwen3-VL-4B-Instruct on image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%) across 19 representative benchmarks, while using only ~32M publicly available training samples, which amounts to less than 2% of the original model's training compute.
26
+
27
+ - **Scalable inference acceleration.** The speedup grows with sequence length: TPOT reaches 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.
28
+
29
+ - **Principled two-stage parameter recovery.** Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
30
+
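The per-layer operator selection described in the first bullet can be pictured with a toy example. The sketch below is not the MOSAIC solver: it assumes hypothetical per-layer quality scores and latencies, and it uses exhaustive search in place of a MIP solver, which is only feasible for a handful of layers. The names `CANDIDATES` and `select_operators` and all numbers are illustrative assumptions.

```python
from itertools import product

# Hypothetical (quality score, latency in ms) per operator and layer.
# These numbers are made up for illustration; they are not from the paper.
CANDIDATES = [
    {"gqa": (1.00, 4.0), "swa": (0.97, 2.5), "gdn": (0.90, 1.5)},  # layer 0
    {"gqa": (1.00, 4.0), "swa": (0.92, 2.5), "gdn": (0.95, 1.5)},  # layer 1
    {"gqa": (1.00, 4.0), "swa": (0.80, 2.5), "gdn": (0.85, 1.5)},  # layer 2
]


def select_operators(candidates, latency_budget):
    """Pick one operator per layer, maximizing total quality under a latency budget."""
    best_choice, best_quality = None, float("-inf")
    # Enumerate every assignment of one operator to each layer.
    for choice in product(*(layer.keys() for layer in candidates)):
        quality = sum(layer[op][0] for layer, op in zip(candidates, choice))
        latency = sum(layer[op][1] for layer, op in zip(candidates, choice))
        if latency <= latency_budget and quality > best_quality:
            best_choice, best_quality = choice, quality
    return best_choice, best_quality


choice, quality = select_operators(CANDIDATES, latency_budget=8.0)
print(choice, round(quality, 2))
```

With this made-up data the budget only leaves room for one full-attention layer, and the search places it where replacing it would cost the most quality. A real search over 36 layers and five operator types would need an actual MIP solver, since the number of assignments grows exponentially with depth.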
31
+ ---
32
+
33
+ ## Architecture
34
+
35
+ MOSAIC-4B has 36 decoder layers with the following per-layer operator assignment discovered by the search:
36
+
37
+ - **Shallow layers (0–6):** Full attention (GQA), with Sliding Window Attention (SWA, window 1024) at layers 2 and 5 for local context
38
+ - **Middle layers (7–28):** Linear attention mechanisms (GDN, KDA) interleaved with full attention, plus one SWA layer (14), for efficient long-range modeling
39
+ - **Deep layers (29–35):** Multi-head Latent Attention (MLA) alternating with KDA at layers 29–33 for high-quality global representations, then full attention in the final two layers
40
+
41
+ This pattern, discovered purely through data-driven optimization, is consistent with known architectural intuitions: full and local attention dominate the early layers, linear attention takes over a large share of the middle and deep layers, and global attention is preserved where representations matter most.
42
+
43
+ ---
44
+
45
+ ## Installation
46
+
47
+ ```bash
48
+ pip install transformers torch
49
+ pip install flash-linear-attention # required for the KDA, GDN, and MLA operators
50
+ ```
51
+
52
+ ---
53
+
54
+ ## Usage
55
+
56
+ This model uses a custom architecture and requires `trust_remote_code=True`.
57
+
58
+ ### Basic Text + Image Inference
59
+
60
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ import torch
+
+ model_id = "LiAuto-DSR/MOSAIC-4B"
+
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "https://example.com/image.jpg"},
+             {"type": "text", "text": "Describe this image in detail."},
+         ],
+     }
+ ]
+
+ # Tokenize through the chat template so the image is loaded and passed to the
+ # model together with the text; formatting the prompt as text only would drop it.
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=512)
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
93
+
94
+ ### Text-Only Inference
95
+
96
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ import torch
+
+ model_id = "LiAuto-DSR/MOSAIC-4B"
+
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ messages = [{"role": "user", "content": "Explain the concept of neural architecture search."}]
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=512)
+
+ response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
120
+
121
+ ---
122
+
123
+ ## Dependencies
124
+
125
+ | Package | Version |
126
+ |---------|---------|
127
+ | transformers | ≥ 4.57.0 |
128
+ | torch | ≥ 2.0 |
129
+ | flash-linear-attention (fla) | latest |
130
+
131
+ ---
132
+
133
+ ## Citation
134
+
135
+ ```bibtex
136
+ @inproceedings{yang2026mosaic,
137
+ title = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
138
+ author = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
139
+ booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
140
+ year = {2026}
141
+ }
142
+ ```
143
+
144
+ ---
145
+
146
+ ## License
147
+
148
+ This model is released under the **Apache 2.0** license.
149
+ The base model weights are derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), which is licensed under [Qwen Research License](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct/blob/main/LICENSE).
added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {%- if messages[0].content is string %}
5
+ {{- messages[0].content }}
6
+ {%- else %}
7
+ {%- for content in messages[0].content %}
8
+ {%- if 'text' in content %}
9
+ {{- content.text }}
10
+ {%- endif %}
11
+ {%- endfor %}
12
+ {%- endif %}
13
+ {{- '\n\n' }}
14
+ {%- endif %}
15
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
16
+ {%- for tool in tools %}
17
+ {{- "\n" }}
18
+ {{- tool | tojson }}
19
+ {%- endfor %}
20
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
21
+ {%- else %}
22
+ {%- if messages[0].role == 'system' %}
23
+ {{- '<|im_start|>system\n' }}
24
+ {%- if messages[0].content is string %}
25
+ {{- messages[0].content }}
26
+ {%- else %}
27
+ {%- for content in messages[0].content %}
28
+ {%- if 'text' in content %}
29
+ {{- content.text }}
30
+ {%- endif %}
31
+ {%- endfor %}
32
+ {%- endif %}
33
+ {{- '<|im_end|>\n' }}
34
+ {%- endif %}
35
+ {%- endif %}
36
+ {%- set image_count = namespace(value=0) %}
37
+ {%- set video_count = namespace(value=0) %}
38
+ {%- for message in messages %}
39
+ {%- if message.role == "user" %}
40
+ {{- '<|im_start|>' + message.role + '\n' }}
41
+ {%- if message.content is string %}
42
+ {{- message.content }}
43
+ {%- else %}
44
+ {%- for content in message.content %}
45
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
46
+ {%- set image_count.value = image_count.value + 1 %}
47
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
48
+ <|vision_start|><|image_pad|><|vision_end|>
49
+ {%- elif content.type == 'video' or 'video' in content %}
50
+ {%- set video_count.value = video_count.value + 1 %}
51
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
52
+ <|vision_start|><|video_pad|><|vision_end|>
53
+ {%- elif 'text' in content %}
54
+ {{- content.text }}
55
+ {%- endif %}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {{- '<|im_end|>\n' }}
59
+ {%- elif message.role == "assistant" %}
60
+ {{- '<|im_start|>' + message.role + '\n' }}
61
+ {%- if message.content is string %}
62
+ {{- message.content }}
63
+ {%- else %}
64
+ {%- for content_item in message.content %}
65
+ {%- if 'text' in content_item %}
66
+ {{- content_item.text }}
67
+ {%- endif %}
68
+ {%- endfor %}
69
+ {%- endif %}
70
+ {%- if message.tool_calls %}
71
+ {%- for tool_call in message.tool_calls %}
72
+ {%- if (loop.first and message.content) or (not loop.first) %}
73
+ {{- '\n' }}
74
+ {%- endif %}
75
+ {%- if tool_call.function %}
76
+ {%- set tool_call = tool_call.function %}
77
+ {%- endif %}
78
+ {{- '<tool_call>\n{"name": "' }}
79
+ {{- tool_call.name }}
80
+ {{- '", "arguments": ' }}
81
+ {%- if tool_call.arguments is string %}
82
+ {{- tool_call.arguments }}
83
+ {%- else %}
84
+ {{- tool_call.arguments | tojson }}
85
+ {%- endif %}
86
+ {{- '}\n</tool_call>' }}
87
+ {%- endfor %}
88
+ {%- endif %}
89
+ {{- '<|im_end|>\n' }}
90
+ {%- elif message.role == "tool" %}
91
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
92
+ {{- '<|im_start|>user' }}
93
+ {%- endif %}
94
+ {{- '\n<tool_response>\n' }}
95
+ {%- if message.content is string %}
96
+ {{- message.content }}
97
+ {%- else %}
98
+ {%- for content in message.content %}
99
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
100
+ {%- set image_count.value = image_count.value + 1 %}
101
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
102
+ <|vision_start|><|image_pad|><|vision_end|>
103
+ {%- elif content.type == 'video' or 'video' in content %}
104
+ {%- set video_count.value = video_count.value + 1 %}
105
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
106
+ <|vision_start|><|video_pad|><|vision_end|>
107
+ {%- elif 'text' in content %}
108
+ {{- content.text }}
109
+ {%- endif %}
110
+ {%- endfor %}
111
+ {%- endif %}
112
+ {{- '\n</tool_response>' }}
113
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
114
+ {{- '<|im_end|>\n' }}
115
+ {%- endif %}
116
+ {%- endif %}
117
+ {%- endfor %}
118
+ {%- if add_generation_prompt %}
119
+ {{- '<|im_start|>assistant\n' }}
120
+ {%- endif %}
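To make the template's output easier to picture, here is a minimal pure-Python sketch that mirrors only its simplest path: no tools, optional leading system message, and user or assistant turns containing text and image parts. It is a simplified re-implementation written for illustration, not the code that `transformers` runs; the function names `render_prompt` and `_text_of` are assumptions, and video parts, tool calls, and tool responses are deliberately left out.

```python
def _text_of(content):
    """Return the text of a message whose content is a string or a list of parts."""
    if isinstance(content, str):
        return content
    return "".join(part["text"] for part in content if "text" in part)


def render_prompt(messages, add_generation_prompt=True, add_vision_id=False):
    """Simplified mirror of the template's no-tools path (text and image parts only)."""
    out = []
    image_count = 0
    # A leading system message is emitted once, before the conversation turns.
    if messages and messages[0]["role"] == "system":
        out.append("<|im_start|>system\n" + _text_of(messages[0]["content"]) + "<|im_end|>\n")
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            out.append("<|im_start|>user\n")
            if isinstance(content, str):
                out.append(content)
            else:
                for part in content:
                    if part.get("type") == "image" or "image" in part or "image_url" in part:
                        image_count += 1
                        if add_vision_id:
                            out.append(f"Picture {image_count}: ")
                        # One placeholder per image; the processor expands it later.
                        out.append("<|vision_start|><|image_pad|><|vision_end|>")
                    elif "text" in part:
                        out.append(part["text"])
            out.append("<|im_end|>\n")
        elif role == "assistant":
            out.append("<|im_start|>assistant\n" + _text_of(content) + "<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
print(render_prompt(messages))
```

The point of the sketch is the shape of the prompt: each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and every image contributes a single `<|image_pad|>` placeholder between the vision start and end tokens.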
config.json ADDED
@@ -0,0 +1,424 @@
1
+ {
2
+ "architectures": [
3
+ "NasChildVLModelForCausalLM"
4
+ ],
5
+ "dtype": "bfloat16",
6
+ "eos_token_id": 151645,
7
+ "image_token_id": 151655,
8
+ "model_type": "nas-child-vl",
9
+ "nas_layer_config": [
10
+ {
11
+ "attention_type": "full_attention",
12
+ "child_intermediate_size": 9728,
13
+ "child_num_attention_heads": 32,
14
+ "ffn_type": "ffn",
15
+ "gqa_num_kv_heads": 8,
16
+ "inherit": true
17
+ },
18
+ {
19
+ "attention_type": "full_attention",
20
+ "child_intermediate_size": 9728,
21
+ "child_num_attention_heads": 32,
22
+ "ffn_type": "ffn",
23
+ "gqa_num_kv_heads": 8,
24
+ "inherit": true
25
+ },
26
+ {
27
+ "attention_type": "swa",
28
+ "block_metric": "mse",
29
+ "child_intermediate_size": 9728,
30
+ "child_num_attention_heads": 32,
31
+ "ffn_type": "ffn",
32
+ "gqa_num_kv_heads": 8,
33
+ "inherit": true,
34
+ "sliding_window": 1024
35
+ },
36
+ {
37
+ "attention_type": "full_attention",
38
+ "child_intermediate_size": 9728,
39
+ "child_num_attention_heads": 32,
40
+ "ffn_type": "ffn",
41
+ "gqa_num_kv_heads": 8,
42
+ "inherit": true
43
+ },
44
+ {
45
+ "attention_type": "full_attention",
46
+ "child_intermediate_size": 9728,
47
+ "child_num_attention_heads": 32,
48
+ "ffn_type": "ffn",
49
+ "gqa_num_kv_heads": 8,
50
+ "inherit": true
51
+ },
52
+ {
53
+ "attention_type": "swa",
54
+ "block_metric": "mse",
55
+ "child_intermediate_size": 9728,
56
+ "child_num_attention_heads": 32,
57
+ "ffn_type": "ffn",
58
+ "gqa_num_kv_heads": 8,
59
+ "inherit": true,
60
+ "sliding_window": 1024
61
+ },
62
+ {
63
+ "attention_type": "full_attention",
64
+ "child_intermediate_size": 9728,
65
+ "child_num_attention_heads": 32,
66
+ "ffn_type": "ffn",
67
+ "gqa_num_kv_heads": 8,
68
+ "inherit": true
69
+ },
70
+ {
71
+ "attention_type": "gdn",
72
+ "block_metric": "mse",
73
+ "child_intermediate_size": 9728,
74
+ "child_num_attention_heads": 32,
75
+ "ffn_type": "ffn",
76
+ "gqa_num_kv_heads": 0,
77
+ "inherit": true
78
+ },
79
+ {
80
+ "attention_type": "full_attention",
81
+ "child_intermediate_size": 9728,
82
+ "child_num_attention_heads": 32,
83
+ "ffn_type": "ffn",
84
+ "gqa_num_kv_heads": 8,
85
+ "inherit": true
86
+ },
87
+ {
88
+ "attention_type": "full_attention",
89
+ "child_intermediate_size": 9728,
90
+ "child_num_attention_heads": 32,
91
+ "ffn_type": "ffn",
92
+ "gqa_num_kv_heads": 8,
93
+ "inherit": true
94
+ },
95
+ {
96
+ "attention_type": "full_attention",
97
+ "block_metric": "mse",
98
+ "child_intermediate_size": 8192,
99
+ "child_num_attention_heads": 32,
100
+ "ffn_type": "ffn",
101
+ "gqa_num_kv_heads": 8,
102
+ "inherit": true
103
+ },
104
+ {
105
+ "attention_type": "gdn",
106
+ "block_metric": "mse",
107
+ "child_intermediate_size": 9728,
108
+ "child_num_attention_heads": 32,
109
+ "ffn_type": "ffn",
110
+ "gqa_num_kv_heads": 0,
111
+ "inherit": true
112
+ },
113
+ {
114
+ "attention_type": "full_attention",
115
+ "child_intermediate_size": 9728,
116
+ "child_num_attention_heads": 32,
117
+ "ffn_type": "ffn",
118
+ "gqa_num_kv_heads": 8,
119
+ "inherit": true
120
+ },
121
+ {
122
+ "attention_type": "full_attention",
123
+ "child_intermediate_size": 9728,
124
+ "child_num_attention_heads": 32,
125
+ "ffn_type": "ffn",
126
+ "gqa_num_kv_heads": 8,
127
+ "inherit": true
128
+ },
129
+ {
130
+ "attention_type": "swa",
131
+ "block_metric": "mse",
132
+ "child_intermediate_size": 8192,
133
+ "child_num_attention_heads": 32,
134
+ "ffn_type": "ffn",
135
+ "gqa_num_kv_heads": 8,
136
+ "inherit": true,
137
+ "sliding_window": 1024
138
+ },
139
+ {
140
+ "attention_type": "gdn",
141
+ "block_metric": "mse",
142
+ "child_intermediate_size": 9728,
143
+ "child_num_attention_heads": 32,
144
+ "ffn_type": "ffn",
145
+ "gqa_num_kv_heads": 0,
146
+ "inherit": true
147
+ },
148
+ {
149
+ "attention_type": "gdn",
150
+ "block_metric": "mse",
151
+ "child_intermediate_size": 8192,
152
+ "child_num_attention_heads": 32,
153
+ "ffn_type": "ffn",
154
+ "gqa_num_kv_heads": 0,
155
+ "inherit": true
156
+ },
157
+ {
158
+ "attention_type": "gdn",
159
+ "block_metric": "mse",
160
+ "child_intermediate_size": 9728,
161
+ "child_num_attention_heads": 32,
162
+ "ffn_type": "ffn",
163
+ "gqa_num_kv_heads": 0,
164
+ "inherit": true
165
+ },
166
+ {
167
+ "attention_type": "full_attention",
168
+ "child_intermediate_size": 9728,
169
+ "child_num_attention_heads": 32,
170
+ "ffn_type": "ffn",
171
+ "gqa_num_kv_heads": 8,
172
+ "inherit": true
173
+ },
174
+ {
175
+ "attention_type": "full_attention",
176
+ "block_metric": "mse",
177
+ "child_intermediate_size": 8192,
178
+ "child_num_attention_heads": 32,
179
+ "ffn_type": "ffn",
180
+ "gqa_num_kv_heads": 8,
181
+ "inherit": true
182
+ },
183
+ {
184
+ "attention_type": "kda",
185
+ "block_metric": "mse",
186
+ "child_intermediate_size": 8192,
187
+ "child_num_attention_heads": 32,
188
+ "ffn_type": "ffn",
189
+ "gqa_num_kv_heads": 0,
190
+ "inherit": true
191
+ },
192
+ {
193
+ "attention_type": "gdn",
194
+ "block_metric": "mse",
195
+ "child_intermediate_size": 8192,
196
+ "child_num_attention_heads": 32,
197
+ "ffn_type": "ffn",
198
+ "gqa_num_kv_heads": 0,
199
+ "inherit": true
200
+ },
201
+ {
202
+ "attention_type": "full_attention",
203
+ "child_intermediate_size": 9728,
204
+ "child_num_attention_heads": 32,
205
+ "ffn_type": "ffn",
206
+ "gqa_num_kv_heads": 8,
207
+ "inherit": true
208
+ },
209
+ {
210
+ "attention_type": "full_attention",
211
+ "child_intermediate_size": 9728,
212
+ "child_num_attention_heads": 32,
213
+ "ffn_type": "ffn",
214
+ "gqa_num_kv_heads": 8,
215
+ "inherit": true
216
+ },
217
+ {
218
+ "attention_type": "full_attention",
219
+ "child_intermediate_size": 9728,
220
+ "child_num_attention_heads": 32,
221
+ "ffn_type": "ffn",
222
+ "gqa_num_kv_heads": 8,
223
+ "inherit": true
224
+ },
225
+ {
226
+ "attention_type": "kda",
227
+ "block_metric": "mse",
228
+ "child_intermediate_size": 9728,
229
+ "child_num_attention_heads": 32,
230
+ "ffn_type": "ffn",
231
+ "gqa_num_kv_heads": 0,
232
+ "inherit": true
233
+ },
234
+ {
235
+ "attention_type": "kda",
236
+ "block_metric": "mse",
237
+ "child_intermediate_size": 9728,
238
+ "child_num_attention_heads": 32,
239
+ "ffn_type": "ffn",
240
+ "gqa_num_kv_heads": 0,
241
+ "inherit": true
242
+ },
243
+ {
244
+ "attention_type": "gdn",
245
+ "block_metric": "mse",
246
+ "child_intermediate_size": 9728,
247
+ "child_num_attention_heads": 32,
248
+ "ffn_type": "ffn",
249
+ "gqa_num_kv_heads": 0,
250
+ "inherit": true
251
+ },
252
+ {
253
+ "attention_type": "gdn",
254
+ "block_metric": "mse",
255
+ "child_intermediate_size": 9728,
256
+ "child_num_attention_heads": 32,
257
+ "ffn_type": "ffn",
258
+ "gqa_num_kv_heads": 0,
259
+ "inherit": true
260
+ },
261
+ {
262
+ "attention_type": "mla",
263
+ "block_metric": "mse",
264
+ "child_intermediate_size": 9728,
265
+ "child_num_attention_heads": 32,
266
+ "ffn_type": "ffn",
267
+ "gqa_num_kv_heads": 0,
268
+ "inherit": true
269
+ },
270
+ {
271
+ "attention_type": "kda",
272
+ "block_metric": "mse",
273
+ "child_intermediate_size": 9728,
274
+ "child_num_attention_heads": 32,
275
+ "ffn_type": "ffn",
276
+ "gqa_num_kv_heads": 0,
277
+ "inherit": true
278
+ },
279
+ {
280
+ "attention_type": "mla",
281
+ "block_metric": "mse",
282
+ "child_intermediate_size": 9728,
283
+ "child_num_attention_heads": 32,
284
+ "ffn_type": "ffn",
285
+ "gqa_num_kv_heads": 0,
286
+ "inherit": true
287
+ },
288
+ {
289
+ "attention_type": "kda",
290
+ "block_metric": "mse",
291
+ "child_intermediate_size": 9728,
292
+ "child_num_attention_heads": 32,
293
+ "ffn_type": "ffn",
294
+ "gqa_num_kv_heads": 0,
295
+ "inherit": true
296
+ },
297
+ {
298
+ "attention_type": "mla",
299
+ "block_metric": "mse",
300
+ "child_intermediate_size": 9728,
301
+ "child_num_attention_heads": 32,
302
+ "ffn_type": "ffn",
303
+ "gqa_num_kv_heads": 0,
304
+ "inherit": true
305
+ },
306
+ {
307
+ "attention_type": "full_attention",
308
+ "child_intermediate_size": 9728,
309
+ "child_num_attention_heads": 32,
310
+ "ffn_type": "ffn",
311
+ "gqa_num_kv_heads": 8,
312
+ "inherit": true
313
+ },
314
+ {
315
+ "attention_type": "full_attention",
316
+ "child_intermediate_size": 9728,
317
+ "child_num_attention_heads": 32,
318
+ "ffn_type": "ffn",
319
+ "gqa_num_kv_heads": 8,
320
+ "inherit": true
321
+ }
322
+ ],
323
+ "pad_token_id": 151643,
324
+ "text_config": {
325
+ "attention_bias": false,
326
+ "attention_dropout": 0.0,
327
+ "bos_token_id": 151643,
328
+ "dtype": "bfloat16",
329
+ "eos_token_id": 151645,
330
+ "head_dim": 128,
331
+ "hidden_act": "silu",
332
+ "hidden_size": 2560,
333
+ "initializer_range": 0.02,
334
+ "intermediate_size": 9728,
335
+ "layer_types": [
336
+ "full_attention",
337
+ "full_attention",
338
+ "sliding_attention",
339
+ "full_attention",
340
+ "full_attention",
341
+ "sliding_attention",
342
+ "full_attention",
343
+ "full_attention",
344
+ "full_attention",
345
+ "full_attention",
346
+ "full_attention",
347
+ "full_attention",
348
+ "full_attention",
349
+ "full_attention",
350
+ "sliding_attention",
351
+ "full_attention",
352
+ "full_attention",
353
+ "full_attention",
354
+ "full_attention",
355
+ "full_attention",
356
+ "full_attention",
357
+ "full_attention",
358
+ "full_attention",
359
+ "full_attention",
360
+ "full_attention",
361
+ "full_attention",
362
+ "full_attention",
363
+ "full_attention",
364
+ "full_attention",
365
+ "full_attention",
366
+ "full_attention",
367
+ "full_attention",
368
+ "full_attention",
369
+ "full_attention",
370
+ "full_attention",
371
+ "full_attention"
372
+ ],
373
+ "max_position_embeddings": 262144,
374
+ "model_type": "qwen3_vl_text",
375
+ "num_attention_heads": 32,
376
+ "num_hidden_layers": 36,
377
+ "num_key_value_heads": 8,
378
+ "rms_norm_eps": 1e-06,
379
+ "rope_scaling": {
380
+ "mrope_interleaved": true,
381
+ "mrope_section": [
382
+ 24,
383
+ 20,
384
+ 20
385
+ ],
386
+ "rope_type": "default"
387
+ },
388
+ "rope_theta": 5000000,
389
+ "sliding_window": 1024,
390
+ "tie_word_embeddings": true,
391
+ "use_cache": true,
392
+ "vocab_size": 151936
393
+ },
394
+ "tie_word_embeddings": true,
395
+ "transformers_version": "4.57.0",
396
+ "video_token_id": 151656,
397
+ "vision_config": {
398
+ "deepstack_visual_indexes": [
399
+ 5,
400
+ 11,
401
+ 17
402
+ ],
403
+ "depth": 24,
404
+ "dtype": "bfloat16",
405
+ "hidden_act": "gelu_pytorch_tanh",
406
+ "hidden_size": 1024,
407
+ "in_channels": 3,
408
+ "initializer_range": 0.02,
409
+ "intermediate_size": 4096,
410
+ "model_type": "qwen3_vl",
411
+ "num_heads": 16,
412
+ "num_position_embeddings": 2304,
413
+ "out_hidden_size": 2560,
414
+ "patch_size": 16,
415
+ "spatial_merge_size": 2,
416
+ "temporal_patch_size": 2
417
+ },
418
+ "vision_end_token_id": 151653,
419
+ "vision_start_token_id": 151652,
420
+ "auto_map": {
421
+ "AutoConfig": "config_nas_vl.NasChildVLConfig",
422
+ "AutoModelForCausalLM": "modeling_nas_child_vl.NasChildVLModelForCausalLM"
423
+ }
424
+ }
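As a quick way to read the per-layer assignment in `nas_layer_config`, the sketch below tallies the `attention_type` values by depth band. The list is transcribed by hand from the config above, and the band boundaries (0–6, 7–28, 29–35) follow the README's description; the names `ATTENTION_TYPES`, `BANDS`, and `tally` are illustrative assumptions.

```python
from collections import Counter

# Per-layer `attention_type` values transcribed from `nas_layer_config` (layers 0-35).
ATTENTION_TYPES = [
    "full_attention", "full_attention", "swa", "full_attention", "full_attention", "swa",
    "full_attention", "gdn", "full_attention", "full_attention", "full_attention", "gdn",
    "full_attention", "full_attention", "swa", "gdn", "gdn", "gdn",
    "full_attention", "full_attention", "kda", "gdn", "full_attention", "full_attention",
    "full_attention", "kda", "kda", "gdn", "gdn", "mla",
    "kda", "mla", "kda", "mla", "full_attention", "full_attention",
]

# Depth bands as described in the README's Architecture section.
BANDS = {"shallow": range(0, 7), "middle": range(7, 29), "deep": range(29, 36)}


def tally(types, bands=BANDS):
    """Count attention types within each depth band."""
    return {name: Counter(types[i] for i in indices) for name, indices in bands.items()}


summary = tally(ATTENTION_TYPES)
for name, counts in summary.items():
    print(name, dict(counts))
print("total", dict(Counter(ATTENTION_TYPES)))
```

With a local copy of the file, the same list could be built with `[layer["attention_type"] for layer in json.load(open("config.json"))["nas_layer_config"]]` instead of the hand-transcribed literal.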
config_nas_vl.py ADDED
@@ -0,0 +1,34 @@
+ from transformers import Qwen3VLConfig, AutoConfig
+
+
+ class NasChildVLConfig(Qwen3VLConfig):
+     model_type = "nas-child-vl"
+
+     def __init__(self, nas_layer_config=None, **kwargs):
+         kwargs.pop("model_type", None)
+
+         _from_kwargs = kwargs.pop("nas_layer_config", None)
+         if nas_layer_config is None:
+             nas_layer_config = _from_kwargs
+
+         super().__init__(**kwargs)
+
+         if nas_layer_config is not None and len(nas_layer_config) > 0:
+             self.nas_layer_config = nas_layer_config
+         elif hasattr(self, "nas_layer_config") and self.nas_layer_config:
+             pass
+         else:
+             self.nas_layer_config = []
+
+     def to_dict(self):
+         output = super().to_dict()
+         output["nas_layer_config"] = self.nas_layer_config
+         if "text_config" in output and isinstance(output["text_config"], dict):
+             output["text_config"].pop("nas_layer_config", None)
+         return output
+
+
+ try:
+     AutoConfig.register("nas-child-vl", NasChildVLConfig)
+ except Exception:
+     pass
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "do_sample": true,
3
+ "eos_token_id": [
4
+ 151645,
5
+ 151645,
6
+ 151643
7
+ ],
8
+ "pad_token_id": 151643,
9
+ "temperature": 0.7,
10
+ "top_k": 20,
11
+ "top_p": 0.8,
12
+ "transformers_version": "4.57.0"
13
+ }
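The sampling settings above (`temperature` 0.7, `top_k` 20, `top_p` 0.8) can be illustrated with a small pure-Python sketch. This is a generic version of the three filters applied in the order temperature, top-k, top-p; it is not the `transformers` implementation, whose tie handling and numerical details may differ. The function name `filtered_distribution` and the example logits are assumptions.

```python
import math
import random


def filtered_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(order, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


dist = filtered_distribution([2.0, 1.0, 0.5, -1.0])
token = random.choices(list(dist), weights=list(dist.values()))[0]
print(dist, token)
```

For the four example logits, only the two most likely tokens survive the top-p cut, so sampling can never pick the low-probability tail. Lowering `temperature` sharpens the distribution before the cut and usually shrinks the kept set further.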
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:001f2ca85f0cf1a1586fb80881b6d00f5de0ef03e82a8d3fd01463c573bd6a87
3
+ size 4969968120
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:35d966a49853f0718bb33d6129eebbd75b5d54b49d94fa4a332a1a478e7112b2
3
+ size 4484711536
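The two `.safetensors` entries are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object id, and the size in bytes. A minimal sketch of parsing them follows, using the pointer contents shown above; the helper name `parse_lfs_pointer` and the `SHARDS` dictionary are illustrative assumptions.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest and size."""
    # Each line is "<key> <value>"; split on the first space only.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the two shard diffs above.
SHARDS = {
    "model-00001-of-00002.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:001f2ca85f0cf1a1586fb80881b6d00f5de0ef03e82a8d3fd01463c573bd6a87\n"
        "size 4969968120\n"
    ),
    "model-00002-of-00002.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:35d966a49853f0718bb33d6129eebbd75b5d54b49d94fa4a332a1a478e7112b2\n"
        "size 4484711536\n"
    ),
}

pointers = {name: parse_lfs_pointer(text) for name, text in SHARDS.items()}
total_bytes = sum(p["size"] for p in pointers.values())
print(total_bytes, f"{total_bytes / 2**30:.2f} GiB")
```

The summed file size comes out slightly larger than the `total_size` of 9454589920 recorded in `model.safetensors.index.json`. A plausible reason, stated here as an assumption, is that the index counts tensor bytes only while each file also carries a safetensors header.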
model.safetensors.index.json ADDED
@@ -0,0 +1,834 @@
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 4338338800,
4
+ "total_size": 9454589920
5
+ },
6
+ "weight_map": {
7
+ "embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
14
+ "layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
15
+ "layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
16
+ "layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
17
+ "layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
18
+ "layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
19
+ "layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
20
+ "layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
21
+ "layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
22
+ "layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
23
+ "layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
24
+ "layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
25
+ "layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
26
+ "layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
27
+ "layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
28
+ "layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
29
+ "layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
30
+ "layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
32
+ "layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
33
+ "layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
34
+ "layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
36
+ "layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
37
+ "layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
38
+ "layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
39
+ "layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
40
+ "layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
41
+ "layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
42
+ "layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
43
+ "layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
44
+ "layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.A_log": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.A_log": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.16.self_attn.A_log": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.dt_bias": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
+ "layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.a_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.g_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.29.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.31.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.A_log": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.b_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.dt_bias": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.f_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.f_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.g_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.g_proj.1.bias": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.g_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.k_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.o_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.q_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.v_conv1d.weight": "model-00002-of-00002.safetensors",
+ "layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.k_rope.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.kv_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.kv_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.kv_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.q_proj.0.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.q_proj.1.weight": "model-00002-of-00002.safetensors",
+ "layers.33.self_attn.q_proj.2.weight": "model-00002-of-00002.safetensors",
+ "layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
+ "layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
452
+ "layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
453
+ "layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
454
+ "layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
455
+ "layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
456
+ "layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
457
+ "layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
458
+ "layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
459
+ "layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
460
+ "layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
461
+ "layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
462
+ "layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
463
+ "layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
464
+ "layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
465
+ "layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
466
+ "layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
467
+ "layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
468
+ "layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
469
+ "layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
470
+ "layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
471
+ "layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
472
+ "layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
473
+ "layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
474
+ "layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
475
+ "layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
476
+ "layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
477
+ "layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
478
+ "layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
479
+ "layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
480
+ "layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
481
+ "layers.7.self_attn.A_log": "model-00001-of-00002.safetensors",
482
+ "layers.7.self_attn.a_proj.weight": "model-00001-of-00002.safetensors",
483
+ "layers.7.self_attn.b_proj.weight": "model-00001-of-00002.safetensors",
484
+ "layers.7.self_attn.dt_bias": "model-00001-of-00002.safetensors",
485
+ "layers.7.self_attn.g_proj.weight": "model-00001-of-00002.safetensors",
486
+ "layers.7.self_attn.k_conv1d.weight": "model-00001-of-00002.safetensors",
487
+ "layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
488
+ "layers.7.self_attn.o_norm.weight": "model-00001-of-00002.safetensors",
489
+ "layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
490
+ "layers.7.self_attn.q_conv1d.weight": "model-00001-of-00002.safetensors",
491
+ "layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
492
+ "layers.7.self_attn.v_conv1d.weight": "model-00001-of-00002.safetensors",
493
+ "layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
494
+ "layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
495
+ "layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
496
+ "layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
497
+ "layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
498
+ "layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
499
+ "layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
500
+ "layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
501
+ "layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
502
+ "layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
503
+ "layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
504
+ "layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
505
+ "layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
506
+ "layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
507
+ "layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
508
+ "layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
509
+ "layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
510
+ "layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
511
+ "layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
512
+ "layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
513
+ "layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
514
+ "layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
515
+ "layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
516
+ "lm_head.weight": "model-00002-of-00002.safetensors",
517
+ "norm.weight": "model-00002-of-00002.safetensors",
518
+ "visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
519
+ "visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
520
+ "visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
521
+ "visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
522
+ "visual.blocks.0.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
523
+ "visual.blocks.0.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
524
+ "visual.blocks.0.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
525
+ "visual.blocks.0.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
526
+ "visual.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
527
+ "visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
528
+ "visual.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
529
+ "visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
530
+ "visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
531
+ "visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
532
+ "visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
533
+ "visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
534
+ "visual.blocks.1.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
535
+ "visual.blocks.1.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
536
+ "visual.blocks.1.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
537
+ "visual.blocks.1.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
538
+ "visual.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
539
+ "visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
540
+ "visual.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
541
+ "visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
542
+ "visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
543
+ "visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
544
+ "visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
545
+ "visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
546
+ "visual.blocks.10.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
547
+ "visual.blocks.10.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
548
+ "visual.blocks.10.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
549
+ "visual.blocks.10.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
550
+ "visual.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
551
+ "visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
552
+ "visual.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
553
+ "visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
554
+ "visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
555
+ "visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
556
+ "visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
557
+ "visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
558
+ "visual.blocks.11.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
559
+ "visual.blocks.11.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
560
+ "visual.blocks.11.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
561
+ "visual.blocks.11.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
562
+ "visual.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
563
+ "visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
564
+ "visual.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
565
+ "visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
566
+ "visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
567
+ "visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
568
+ "visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
569
+ "visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
570
+ "visual.blocks.12.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
571
+ "visual.blocks.12.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
572
+ "visual.blocks.12.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
573
+ "visual.blocks.12.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
574
+ "visual.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
575
+ "visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
576
+ "visual.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
577
+ "visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
578
+ "visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
579
+ "visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
580
+ "visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
581
+ "visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
582
+ "visual.blocks.13.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
583
+ "visual.blocks.13.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
584
+ "visual.blocks.13.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
585
+ "visual.blocks.13.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
586
+ "visual.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
587
+ "visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
588
+ "visual.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
589
+ "visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
590
+ "visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
591
+ "visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
592
+ "visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
593
+ "visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
594
+ "visual.blocks.14.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
595
+ "visual.blocks.14.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
596
+ "visual.blocks.14.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
597
+ "visual.blocks.14.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
598
+ "visual.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
599
+ "visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
600
+ "visual.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
601
+ "visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
602
+ "visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
603
+ "visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
604
+ "visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
605
+ "visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
606
+ "visual.blocks.15.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
607
+ "visual.blocks.15.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
608
+ "visual.blocks.15.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
609
+ "visual.blocks.15.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
610
+ "visual.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
611
+ "visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
612
+ "visual.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
613
+ "visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
614
+ "visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
615
+ "visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
616
+ "visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
617
+ "visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
618
+ "visual.blocks.16.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
619
+ "visual.blocks.16.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
620
+ "visual.blocks.16.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
621
+ "visual.blocks.16.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
622
+ "visual.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
623
+ "visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
624
+ "visual.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
625
+ "visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
626
+ "visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
627
+ "visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
628
+ "visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
629
+ "visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
630
+ "visual.blocks.17.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
631
+ "visual.blocks.17.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
632
+ "visual.blocks.17.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
633
+ "visual.blocks.17.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
634
+ "visual.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
635
+ "visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
636
+ "visual.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
637
+ "visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
638
+ "visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
639
+ "visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
640
+ "visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
641
+ "visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
642
+ "visual.blocks.18.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
643
+ "visual.blocks.18.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
644
+ "visual.blocks.18.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
645
+ "visual.blocks.18.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
646
+ "visual.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
647
+ "visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
648
+ "visual.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
649
+ "visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
650
+ "visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
651
+ "visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
652
+ "visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
653
+ "visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
654
+ "visual.blocks.19.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
655
+ "visual.blocks.19.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
656
+ "visual.blocks.19.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
657
+ "visual.blocks.19.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
658
+ "visual.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
659
+ "visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
660
+ "visual.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
661
+ "visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
662
+ "visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
663
+ "visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
664
+ "visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
665
+ "visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
666
+ "visual.blocks.2.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
667
+ "visual.blocks.2.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
668
+ "visual.blocks.2.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
669
+ "visual.blocks.2.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
670
+ "visual.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
671
+ "visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
672
+ "visual.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
673
+ "visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
674
+ "visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
675
+ "visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
676
+ "visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
677
+ "visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
678
+ "visual.blocks.20.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
679
+ "visual.blocks.20.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
680
+ "visual.blocks.20.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
681
+ "visual.blocks.20.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
682
+ "visual.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
683
+ "visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
684
+ "visual.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
685
+ "visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
686
+ "visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
687
+ "visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
688
+ "visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
689
+ "visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
690
+ "visual.blocks.21.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
691
+ "visual.blocks.21.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
692
+ "visual.blocks.21.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
693
+ "visual.blocks.21.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
694
+ "visual.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
695
+ "visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
696
+ "visual.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
697
+ "visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
698
+ "visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
699
+ "visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
700
+ "visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
701
+ "visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
702
+ "visual.blocks.22.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
703
+ "visual.blocks.22.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
704
+ "visual.blocks.22.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
705
+ "visual.blocks.22.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
706
+ "visual.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
707
+ "visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
708
+ "visual.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
709
+ "visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
710
+ "visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
711
+ "visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
712
+ "visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
713
+ "visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
714
+ "visual.blocks.23.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
715
+ "visual.blocks.23.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
716
+ "visual.blocks.23.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
717
+ "visual.blocks.23.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
718
+ "visual.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
719
+ "visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
720
+ "visual.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
721
+ "visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
722
+ "visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
723
+ "visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
724
+ "visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
725
+ "visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
726
+ "visual.blocks.3.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
727
+ "visual.blocks.3.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
728
+ "visual.blocks.3.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
729
+ "visual.blocks.3.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
730
+ "visual.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
731
+ "visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
732
+ "visual.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
733
+ "visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
734
+ "visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
735
+ "visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
736
+ "visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
737
+ "visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
738
+ "visual.blocks.4.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
739
+ "visual.blocks.4.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
740
+ "visual.blocks.4.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
741
+ "visual.blocks.4.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
742
+ "visual.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
743
+ "visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
744
+ "visual.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
745
+ "visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
746
+ "visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
747
+ "visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
748
+ "visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
749
+ "visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
750
+ "visual.blocks.5.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
751
+ "visual.blocks.5.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
752
+ "visual.blocks.5.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
753
+ "visual.blocks.5.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
754
+ "visual.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
755
+ "visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
756
+ "visual.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
757
+ "visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
758
+ "visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
759
+ "visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
760
+ "visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
761
+ "visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
762
+ "visual.blocks.6.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
763
+ "visual.blocks.6.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
764
+ "visual.blocks.6.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
765
+ "visual.blocks.6.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
766
+ "visual.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
767
+ "visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
768
+ "visual.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
769
+ "visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
770
+ "visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
771
+ "visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
772
+ "visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
773
+ "visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
774
+ "visual.blocks.7.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
775
+ "visual.blocks.7.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
776
+ "visual.blocks.7.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
777
+ "visual.blocks.7.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
778
+ "visual.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
779
+ "visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
780
+ "visual.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
781
+ "visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
782
+ "visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
783
+ "visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
784
+ "visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
785
+ "visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
786
+ "visual.blocks.8.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
787
+ "visual.blocks.8.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
788
+ "visual.blocks.8.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
789
+ "visual.blocks.8.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
790
+ "visual.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
791
+ "visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
792
+ "visual.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
793
+ "visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
794
+ "visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
795
+ "visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
796
+ "visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
797
+ "visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
798
+ "visual.blocks.9.mlp.linear_fc1.bias": "model-00001-of-00002.safetensors",
799
+ "visual.blocks.9.mlp.linear_fc1.weight": "model-00001-of-00002.safetensors",
800
+ "visual.blocks.9.mlp.linear_fc2.bias": "model-00001-of-00002.safetensors",
801
+ "visual.blocks.9.mlp.linear_fc2.weight": "model-00001-of-00002.safetensors",
802
+ "visual.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
803
+ "visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
804
+ "visual.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
805
+ "visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.linear_fc1.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.linear_fc1.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.linear_fc2.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.linear_fc2.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.norm.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.0.norm.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.linear_fc1.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.linear_fc1.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.linear_fc2.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.linear_fc2.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.norm.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.1.norm.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.linear_fc1.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.linear_fc1.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.linear_fc2.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.linear_fc2.weight": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.norm.bias": "model-00001-of-00002.safetensors",
+ "visual.deepstack_merger_list.2.norm.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.linear_fc1.bias": "model-00001-of-00002.safetensors",
+ "visual.merger.linear_fc1.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.linear_fc2.bias": "model-00001-of-00002.safetensors",
+ "visual.merger.linear_fc2.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.norm.bias": "model-00001-of-00002.safetensors",
+ "visual.merger.norm.weight": "model-00001-of-00002.safetensors",
+ "visual.patch_embed.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.pos_embed.weight": "model-00001-of-00002.safetensors"
+ }
+ }
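
The listing above maps each tensor name to the shard file that stores it. As a minimal sketch of how such a map could be inspected, the snippet below groups tensor names by shard and lists decoder layers whose `self_attn` carries an `A_log` parameter. It assumes the mapping sits under a top-level `"weight_map"` key, as is usual for sharded safetensors indexes; the helper names `group_by_shard` and `layers_with_a_log` are hypothetical, and reading `A_log` as a marker for a non-standard attention block is an inference from the key names only.

```python
from collections import defaultdict


def group_by_shard(index: dict) -> dict:
    """Return {shard file: sorted tensor names} for a sharded-checkpoint index."""
    shards = defaultdict(list)
    # Assumption: the name -> shard mapping lives under "weight_map".
    for name, shard in index["weight_map"].items():
        shards[shard].append(name)
    return {shard: sorted(names) for shard, names in shards.items()}


def layers_with_a_log(index: dict) -> list:
    """Return sorted decoder-layer indices that own a `self_attn.A_log` tensor."""
    return sorted(
        int(name.split(".")[1])
        for name in index["weight_map"]
        if name.startswith("layers.") and name.endswith(".self_attn.A_log")
    )


# A few entries copied from the listing above, used as an inline sample.
sample = {
    "weight_map": {
        "layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
        "layers.7.self_attn.A_log": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
        "visual.pos_embed.weight": "model-00001-of-00002.safetensors",
    }
}

grouped = group_by_shard(sample)
special_layers = layers_with_a_log(sample)
```

On the full index, the same two calls would show which shard to open for a given tensor and which layers differ from the standard `q_proj`/`k_proj`/`v_proj`/`o_proj` layout.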
modeling_nas_child_vl.py ADDED
@@ -0,0 +1,467 @@
+ import torch
+ import torch.nn as nn
+ from typing import List, Optional, Any
+ from dataclasses import dataclass
+
+ from transformers.models.qwen3_vl import Qwen3VLPreTrainedModel
+ from transformers.models.qwen3_vl.modeling_qwen3_vl import (
+     Qwen3VLModel,
+     Qwen3VLTextModel,
+     Qwen3VLVisionModel,
+     Qwen3VLTextRMSNorm,
+     Qwen3VLTextRotaryEmbedding,
+ )
+ from transformers.modeling_outputs import CausalLMOutputWithPast
+ from transformers.cache_utils import DynamicCache
+ from transformers.utils import is_torchdynamo_compiling
+ from transformers.generation import GenerationMixin
+
+ from .config_nas_vl import NasChildVLConfig
+ from .nas_vl_layer import NasVLDecoderLayer, ChildLayerVLConfig
+
+
+ @dataclass
+ class Qwen3VLCausalLMOutputWithPast(CausalLMOutputWithPast):
+     rope_deltas: Optional[torch.Tensor] = None
+
+
+ class NasChildVLModelForCausalLM(Qwen3VLPreTrainedModel, GenerationMixin):
+     config_class = NasChildVLConfig
+     _checkpoint_conversion_mapping = {}
+     _tied_weights_keys = ["lm_head.weight"]
+
+     get_image_features = Qwen3VLModel.get_image_features
+     get_video_features = Qwen3VLModel.get_video_features
+     get_placeholder_mask = Qwen3VLModel.get_placeholder_mask
+     get_rope_index = Qwen3VLModel.get_rope_index
+     _deepstack_process = Qwen3VLTextModel._deepstack_process
+
39
+ def __init__(self, config: NasChildVLConfig):
40
+ super().__init__(config)
41
+
42
+ self.parent_config = config
43
+ self.is_vl = True
44
+ self.rope_deltas = None
45
+
46
+ text_config = config.text_config
47
+ self.parent_hidden_size = text_config.hidden_size
48
+ self.child_hidden_size = self.parent_hidden_size
49
+ self.vocab_size = text_config.vocab_size
50
+
51
+ self.visual = Qwen3VLVisionModel._from_config(config.vision_config)
52
+
53
+ self.embed_tokens = nn.Embedding(
54
+ text_config.vocab_size, text_config.hidden_size
55
+ )
56
+
57
+ if self.child_hidden_size != self.parent_hidden_size:
58
+ self.input_proj = nn.Linear(self.parent_hidden_size, self.child_hidden_size, bias=False)
59
+ self.output_proj = nn.Linear(self.child_hidden_size, self.parent_hidden_size, bias=False)
60
+ else:
61
+ self.input_proj = nn.Identity()
62
+ self.output_proj = nn.Identity()
63
+
64
+ layer_types = []
65
+ global_sliding_window = None
66
+
67
+ for i in range(text_config.num_hidden_layers):
68
+ cfg = config.nas_layer_config[i]
69
+ if isinstance(cfg, dict):
70
+ cfg = ChildLayerVLConfig(**cfg)
71
+ attn_type = str(cfg.attention_type).split('.')[-1].lower()
72
+ if attn_type == "swa":
73
+ layer_types.append("sliding_attention")
74
+ if global_sliding_window is None:
75
+ sw_val = getattr(cfg, "sliding_window", 1024)
76
+ global_sliding_window = int(sw_val) if sw_val else 1024
77
+ else:
78
+ layer_types.append("full_attention")
79
+
80
+ if hasattr(self.config, "text_config"):
81
+ self.config.text_config.layer_types = layer_types
82
+ if global_sliding_window is not None:
83
+ self.config.text_config.sliding_window = global_sliding_window
84
+ else:
85
+ self.config.layer_types = layer_types
86
+ if global_sliding_window is not None:
87
+ self.config.sliding_window = global_sliding_window
88
+
89
+ self.layers = nn.ModuleList()
90
+ for i in range(text_config.num_hidden_layers):
91
+ cfg = config.nas_layer_config[i]
92
+ self.layers.append(
93
+ NasVLDecoderLayer(
94
+ layer_idx=i,
95
+ nas_config=cfg,
96
+ parent_config=config,
97
+ parent_model=None,
98
+ )
99
+ )
100
+
101
+ self.norm = Qwen3VLTextRMSNorm(
102
+ self.child_hidden_size, eps=text_config.rms_norm_eps
103
+ )
104
+
105
+ self.lm_head = nn.Linear(
106
+ self.parent_hidden_size, self.vocab_size, bias=False
107
+ )
108
+ if config.tie_word_embeddings:
109
+ self.lm_head.weight = self.embed_tokens.weight
110
+
111
+ self.rotary_emb = Qwen3VLTextRotaryEmbedding(config=text_config)
112
+ self.has_sliding_layers = False
113
+
114
+ self.post_init()
115
+
116
+ def get_input_embeddings(self):
117
+ return self.embed_tokens
118
+
119
+ def set_input_embeddings(self, value):
120
+ self.embed_tokens = value
121
+
122
+ def get_output_embeddings(self):
123
+ return self.lm_head
124
+
125
+ def set_output_embeddings(self, new_embeddings):
126
+ self.lm_head = new_embeddings
127
+
128
+ def prepare_inputs_for_generation(
129
+ self, input_ids, past_key_values=None, attention_mask=None,
130
+ inputs_embeds=None, cache_position=None, position_ids=None,
131
+ use_cache=True, pixel_values=None, pixel_values_videos=None,
132
+ image_grid_thw=None, video_grid_thw=None, **kwargs,
133
+ ):
134
+ model_inputs = super().prepare_inputs_for_generation(
135
+ input_ids, past_key_values=past_key_values,
136
+ attention_mask=attention_mask, inputs_embeds=inputs_embeds,
137
+ cache_position=cache_position, position_ids=position_ids,
138
+ use_cache=use_cache, **kwargs,
139
+ )
140
+ model_inputs.update({
141
+ "pixel_values": pixel_values,
142
+ "pixel_values_videos": pixel_values_videos,
143
+ "image_grid_thw": image_grid_thw,
144
+ "video_grid_thw": video_grid_thw,
145
+ })
146
+ model_inputs["position_ids"] = None
147
+ if cache_position[0] != 0:
148
+ model_inputs["pixel_values"] = None
149
+ model_inputs["pixel_values_videos"] = None
150
+ return model_inputs
151
+
152
+ def _get_image_nums_and_video_nums(self, input_ids, inputs_embeds=None):
153
+ image_token_id = self.config.image_token_id
154
+ video_token_id = self.config.video_token_id
155
+ vision_start_token_id = self.config.vision_start_token_id
156
+
157
+ if inputs_embeds is not None:
158
+ dev = inputs_embeds.device
159
+ _embed = lambda tid: self.embed_tokens(
160
+ torch.tensor(tid, dtype=torch.long, device=dev)
161
+ )
162
+ vision_start_mask = (inputs_embeds == _embed(vision_start_token_id))[..., 0]
163
+ image_mask = (inputs_embeds == _embed(image_token_id))[..., 0]
164
+ video_mask = (inputs_embeds == _embed(video_token_id))[..., 0]
165
+ else:
166
+ vision_start_mask = input_ids == vision_start_token_id
167
+ image_mask = input_ids == image_token_id
168
+ video_mask = input_ids == video_token_id
169
+
170
+ vision_first_mask = torch.roll(vision_start_mask, shifts=1, dims=1)
171
+ image_nums = torch.sum(vision_first_mask & image_mask, dim=1)
172
+ video_nums = torch.sum(vision_first_mask & video_mask, dim=1)
173
+ return image_nums, video_nums
174
+
175
+ def _expand_inputs_for_generation(
176
+ self, expand_size=1, is_encoder_decoder=False, input_ids=None,
177
+ **model_kwargs,
178
+ ):
179
+ if expand_size == 1:
180
+ return input_ids, model_kwargs
181
+
182
+ visual_keys = [
183
+ "pixel_values", "image_grid_thw",
184
+ "pixel_values_videos", "video_grid_thw",
185
+ "second_per_grid_ts",
186
+ ]
187
+
188
+ def _repeat_interleave_samples(x, lengths, repeat_times):
189
+ samples = torch.split(x, lengths)
190
+ repeat_args = [repeat_times] + [1] * (x.dim() - 1)
191
+ return torch.cat([s.repeat(*repeat_args) for s in samples], dim=0)
192
+
193
+ def _expand_visual(d):
194
+ image_grid_thw = model_kwargs.get("image_grid_thw")
195
+ video_grid_thw = model_kwargs.get("video_grid_thw")
196
+ image_nums, video_nums = self._get_image_nums_and_video_nums(
197
+ input_ids, inputs_embeds=model_kwargs.get("inputs_embeds")
198
+ )
199
+ for key in list(d.keys()):
200
+ if d[key] is None:
201
+ continue
202
+ if key == "pixel_values":
203
+ lens = [torch.prod(s, dim=1).sum()
204
+ for s in torch.split(image_grid_thw, list(image_nums))]
205
+ d[key] = _repeat_interleave_samples(d[key], lens, expand_size)
206
+ elif key == "image_grid_thw":
207
+ d[key] = _repeat_interleave_samples(d[key], list(image_nums), expand_size)
208
+ elif key == "pixel_values_videos":
209
+ lens = [torch.prod(s, dim=1).sum()
210
+ for s in torch.split(video_grid_thw, list(video_nums))]
211
+ d[key] = _repeat_interleave_samples(d[key], lens, expand_size)
212
+ elif key == "video_grid_thw":
213
+ d[key] = _repeat_interleave_samples(d[key], list(video_nums), expand_size)
214
+ elif key == "second_per_grid_ts":
215
+ d[key] = _repeat_interleave_samples(d[key], list(video_nums), expand_size)
216
+ return d
217
+
218
+ def _expand_general(d):
219
+ for key in d:
220
+ if (key != "cache_position" and d[key] is not None
221
+ and isinstance(d[key], torch.Tensor) and key not in visual_keys):
222
+ d[key] = d[key].repeat_interleave(expand_size, dim=0)
223
+ return d
224
+
225
+ model_kwargs = _expand_visual(model_kwargs)
226
+ if input_ids is not None:
227
+ input_ids = input_ids.repeat_interleave(expand_size, dim=0)
228
+ model_kwargs = _expand_general(model_kwargs)
229
+
230
+ if is_encoder_decoder:
231
+ if model_kwargs.get("encoder_outputs") is None:
232
+ raise ValueError("encoder_outputs required for encoder-decoder")
233
+ model_kwargs["encoder_outputs"] = _expand_general(
234
+ model_kwargs["encoder_outputs"]
235
+ )
236
+ return input_ids, model_kwargs
237
+
238
+ def forward(
239
+ self,
240
+ input_ids: torch.LongTensor = None,
241
+ attention_mask: torch.Tensor = None,
242
+ position_ids: Optional[torch.LongTensor] = None,
243
+ past_key_values=None,
244
+ inputs_embeds: Optional[torch.FloatTensor] = None,
245
+ labels: Optional[torch.LongTensor] = None,
246
+ use_cache: Optional[bool] = None,
247
+ output_hidden_states: Optional[bool] = None,
248
+ return_dict: Optional[bool] = None,
249
+ pixel_values: Optional[torch.Tensor] = None,
250
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
251
+ image_grid_thw: Optional[torch.Tensor] = None,
252
+ video_grid_thw: Optional[torch.Tensor] = None,
253
+ cache_position: Optional[torch.LongTensor] = None,
254
+ **kwargs,
255
+ ):
256
+ output_hidden_states = (
257
+ output_hidden_states if output_hidden_states is not None
258
+ else self.config.output_hidden_states
259
+ )
260
+ return_dict = (
261
+ return_dict if return_dict is not None
262
+ else self.config.use_return_dict
263
+ )
264
+
265
+ if inputs_embeds is None:
266
+ inputs_embeds = self.embed_tokens(input_ids)
267
+
268
+ image_mask = video_mask = None
269
+ deepstack_image_embeds = deepstack_video_embeds = None
270
+
271
+ if pixel_values is not None and self.visual is not None:
272
+ image_embeds, deepstack_image_embeds = self.get_image_features(
273
+ pixel_values, image_grid_thw
274
+ )
275
+ image_embeds = torch.cat(image_embeds, dim=0).to(
276
+ inputs_embeds.device, inputs_embeds.dtype
277
+ )
278
+ image_mask, _ = self.get_placeholder_mask(
279
+ input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
280
+ )
281
+ inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
282
+
283
+ if pixel_values_videos is not None and self.visual is not None:
284
+ video_embeds, deepstack_video_embeds = self.get_video_features(
285
+ pixel_values_videos, video_grid_thw
286
+ )
287
+ video_embeds = torch.cat(video_embeds, dim=0).to(
288
+ inputs_embeds.device, inputs_embeds.dtype
289
+ )
290
+ _, video_mask = self.get_placeholder_mask(
291
+ input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
292
+ )
293
+ inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
294
+
295
+ visual_pos_masks = None
296
+ deepstack_visual_embeds = None
297
+ if image_mask is not None and video_mask is not None:
298
+ image_mask = image_mask[..., 0]
299
+ video_mask = video_mask[..., 0]
300
+ visual_pos_masks = image_mask | video_mask
301
+ deepstack_visual_embeds = []
302
+ img_joint = image_mask[visual_pos_masks]
303
+ vid_joint = video_mask[visual_pos_masks]
304
+ for img_e, vid_e in zip(deepstack_image_embeds, deepstack_video_embeds):
305
+ joint = img_e.new_zeros(
306
+ visual_pos_masks.sum(), img_e.shape[-1]
307
+ ).to(img_e.device)
308
+ joint[img_joint, :] = img_e
309
+ joint[vid_joint, :] = vid_e
310
+ deepstack_visual_embeds.append(joint)
311
+ elif image_mask is not None:
312
+ image_mask = image_mask[..., 0]
313
+ visual_pos_masks = image_mask
314
+ deepstack_visual_embeds = deepstack_image_embeds
315
+ elif video_mask is not None:
316
+ video_mask = video_mask[..., 0]
317
+ visual_pos_masks = video_mask
318
+ deepstack_visual_embeds = deepstack_video_embeds
319
+
320
+ if use_cache and past_key_values is None:
321
+ past_key_values = DynamicCache(config=self.config)
322
+
323
+ if cache_position is None:
324
+ past_seen = (
325
+ past_key_values.get_seq_length()
326
+ if past_key_values is not None else 0
327
+ )
328
+ cache_position = torch.arange(
329
+ past_seen, past_seen + inputs_embeds.shape[1],
330
+ device=inputs_embeds.device,
331
+ )
332
+
333
+ current_seq_len = inputs_embeds.shape[1]
334
+ if (current_seq_len == 1 and cache_position[0] == 0
335
+ and attention_mask is not None):
336
+ real_past_seen = attention_mask.shape[-1] - 1
337
+ if real_past_seen > 0:
338
+ cache_position = torch.tensor(
339
+ [real_past_seen], device=inputs_embeds.device
340
+ )
341
+
342
+ if position_ids is None:
343
+ attn_mask_tensor = (
344
+ attention_mask
345
+ if not isinstance(attention_mask, dict)
346
+ else attention_mask.get("full_attention")
347
+ )
348
+ if attn_mask_tensor is not None and attn_mask_tensor.ndim == 4:
349
+ attn_mask_tensor = torch.diagonal(
350
+ attn_mask_tensor[:, 0], dim1=1, dim2=2
351
+ )
352
+ if attn_mask_tensor.dtype.is_floating_point:
353
+ attn_mask_tensor = (
354
+ attn_mask_tensor
355
+ / torch.finfo(attn_mask_tensor.dtype).min
356
+ )
357
+ attn_mask_tensor = (1.0 - attn_mask_tensor).int()
358
+
359
+ is_real_prefill = (
360
+ (input_ids is not None and input_ids.shape[1] > 1)
361
+ or (inputs_embeds is not None and inputs_embeds.shape[1] > 1)
362
+ )
363
+ prefill_compiled = is_torchdynamo_compiling() and is_real_prefill
364
+ prefill_noncompiled = not is_torchdynamo_compiling() and (
365
+ (cache_position is not None and cache_position[0] == 0)
366
+ or (past_key_values is None
367
+ or past_key_values.get_seq_length() == 0)
368
+ )
369
+ should_calc_rope = (
370
+ (prefill_compiled or prefill_noncompiled)
371
+ or self.rope_deltas is None
372
+ )
373
+ if (should_calc_rope and not is_real_prefill
374
+ and self.rope_deltas is not None):
375
+ should_calc_rope = False
376
+
377
+ if should_calc_rope:
378
+ position_ids, rope_deltas = self.get_rope_index(
379
+ input_ids, image_grid_thw, video_grid_thw,
380
+ attention_mask=attn_mask_tensor,
381
+ )
382
+ self.rope_deltas = rope_deltas
383
+ else:
384
+ batch_size = inputs_embeds.shape[0]
385
+ seq_length = inputs_embeds.shape[1]
386
+ delta = (
387
+ (cache_position[0] + self.rope_deltas).to(
388
+ inputs_embeds.device
389
+ )
390
+ if cache_position is not None else 0
391
+ )
392
+ position_ids = torch.arange(
393
+ seq_length, device=inputs_embeds.device
394
+ ).view(1, -1).expand(batch_size, -1)
395
+ if cache_position is not None:
396
+ delta = delta.repeat_interleave(
397
+ batch_size // delta.shape[0], dim=0
398
+ )
399
+ position_ids = position_ids.add(delta)
400
+ position_ids = position_ids.unsqueeze(0).expand(3, -1, -1)
401
+
402
+ if position_ids.ndim == 3 and position_ids.shape[0] == 4:
403
+ text_position_ids = position_ids[0]
404
+ rope_position_ids = position_ids[1:]
405
+ elif position_ids.ndim == 3:
406
+ text_position_ids = position_ids[0]
407
+ rope_position_ids = position_ids
408
+ else:
409
+ text_position_ids = position_ids
410
+ rope_position_ids = position_ids
411
+
412
+ rotary_emb = self.rotary_emb(inputs_embeds, rope_position_ids)
413
+
414
+ hidden_states = self.input_proj(inputs_embeds)
415
+
416
+ all_hidden_states = () if output_hidden_states else None
417
+ if output_hidden_states:
418
+ all_hidden_states += (hidden_states,)
419
+
420
+ for i, layer in enumerate(self.layers):
421
+ layer_outputs = layer(
422
+ hidden_states,
423
+ attention_mask=attention_mask,
424
+ position_ids=text_position_ids,
425
+ position_embeddings=rotary_emb,
426
+ use_cache=use_cache,
427
+ past_key_values=past_key_values,
428
+ cache_position=cache_position,
429
+ **kwargs,
430
+ )
431
+ hidden_states = (
432
+ layer_outputs[0]
433
+ if isinstance(layer_outputs, tuple)
434
+ else layer_outputs
435
+ )
436
+
437
+ if (deepstack_visual_embeds is not None
438
+ and i < len(deepstack_visual_embeds)):
439
+ hidden_states = self._deepstack_process(
440
+ hidden_states,
441
+ visual_pos_masks,
442
+ deepstack_visual_embeds[i],
443
+ )
444
+
445
+ if output_hidden_states:
446
+ all_hidden_states += (hidden_states,)
447
+
448
+ hidden_states = self.norm(hidden_states)
449
+ hidden_states = self.output_proj(hidden_states)
450
+ logits = self.lm_head(hidden_states)
451
+
452
+ loss = None
453
+ if labels is not None:
454
+ shift_logits = logits[..., :-1, :].contiguous()
455
+ shift_labels = labels[..., 1:].contiguous()
456
+ loss = nn.CrossEntropyLoss()(
457
+ shift_logits.view(-1, self.vocab_size),
458
+ shift_labels.view(-1),
459
+ )
460
+
461
+ return Qwen3VLCausalLMOutputWithPast(
462
+ loss=loss,
463
+ logits=logits,
464
+ past_key_values=past_key_values,
465
+ hidden_states=all_hidden_states,
466
+ rope_deltas=self.rope_deltas,
467
+ )
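As a reading aid for the constructor in the file above: it walks `config.nas_layer_config`, maps each layer's `attention_type` to an entry in `layer_types`, and takes the window of the first `swa` layer as the model-wide `sliding_window`. The sketch below reproduces that mapping in plain Python so it can be tried without loading the model. It assumes each per-layer entry is a dict with an `attention_type` key and an optional `sliding_window` key; the helper name `derive_layer_types` and the sample config are hypothetical and not part of this commit.

```python
from typing import Any, Dict, List, Optional, Tuple


def derive_layer_types(
    nas_layer_config: List[Dict[str, Any]],
    default_window: int = 1024,
) -> Tuple[List[str], Optional[int]]:
    """Hypothetical helper mirroring the layer_types loop in the constructor.

    "swa" layers become "sliding_attention"; every other attention type
    becomes "full_attention". The first SWA layer's window is used as the
    global sliding window, falling back to default_window if it is unset.
    """
    layer_types: List[str] = []
    global_window: Optional[int] = None
    for cfg in nas_layer_config:
        # An enum-style string such as "AttentionType.SWA" reduces to "swa".
        attn_type = str(cfg.get("attention_type")).split(".")[-1].lower()
        if attn_type == "swa":
            layer_types.append("sliding_attention")
            if global_window is None:
                window = cfg.get("sliding_window", default_window)
                global_window = int(window) if window else default_window
        else:
            layer_types.append("full_attention")
    return layer_types, global_window


# Illustrative config only; the real per-layer choices live in the model config.
example = [
    {"attention_type": "full_attention"},
    {"attention_type": "swa", "sliding_window": 2048},
    {"attention_type": "gdn"},
    {"attention_type": "AttentionType.SWA", "sliding_window": 512},
]
types, window = derive_layer_types(example)
print(types, window)
```

Two behaviours are worth noting in this sketch, both following from the loop as written in the commit: later SWA layers do not change the global window once it is set, and non-SWA types such as `gdn` are labelled `full_attention` in `layer_types`.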
nas_vl_layer.py ADDED
@@ -0,0 +1,690 @@
+ from typing import Optional, Tuple
+ from enum import Enum
+ from dataclasses import dataclass, field
+ from types import SimpleNamespace
+
+ import torch
+ import copy
+ from transformers import Qwen3Config
+ from transformers import GradientCheckpointingLayer, Cache
+ from transformers.masking_utils import (
+     create_causal_mask,
+     create_sliding_window_causal_mask,
+ )
+ from transformers.models.qwen3.modeling_qwen3 import Qwen3Attention, Qwen3MLP, Qwen3RMSNorm
+ from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLTextAttention, Qwen3VLTextMLP, Qwen3VLTextRMSNorm
+
+ from fla.layers.delta_net import DeltaNet
+ from fla.models.delta_net.configuration_delta_net import DeltaNetConfig
+
+ from fla.layers.gated_deltanet import GatedDeltaNet
+ from fla.models.gated_deltanet.configuration_gated_deltanet import GatedDeltaNetConfig
+
+ from fla.layers.kda import KimiDeltaAttention
+ from fla.models.kda.configuration_kda import KDAConfig
+ from fla.models.kda.modeling_kda import KDAPreTrainedModel
+
+ from fla.layers.mamba2 import Mamba2
+ from fla.models.mamba2.configuration_mamba2 import Mamba2Config
+ from fla.models.mamba2.modeling_mamba2 import Mamba2Block
+
+ from fla.layers.gla import GatedLinearAttention
+ from fla.models.gla.configuration_gla import GLAConfig
+
+ from fla.layers.nsa import NativeSparseAttention
+ from fla.models.nsa.configuration_nsa import NSAConfig
+
+ from fla.layers.mla import MultiheadLatentAttention
+ from fla.models.mla.configuration_mla import MLAConfig
+
+ import copy
+
+
+ class FLACacheAdapter:
+     def __init__(self, cache):
+         self.cache = cache
+         if not hasattr(self.cache, 'fla_states'):
+             self.cache.fla_states = {}
+
+     def get_seq_length(self, layer_idx=None):
+         if layer_idx is not None and layer_idx in self.cache.fla_states:
+             state = self.cache.fla_states[layer_idx]
+             if 'attn_state' in state:
+                 attn_state = state['attn_state']
+                 if (isinstance(attn_state, tuple) and len(attn_state) == 2
+                         and isinstance(attn_state[0], torch.Tensor)):
+                     return attn_state[0].shape[1]
+         return 0
+
+     def update(self, attn_state=None, layer_idx=None, offset=None,
+                cache_kwargs=None, **kwargs):
+         if layer_idx is None:
+             layer_idx = kwargs.pop('layer_idx', None)
+         if layer_idx is None:
+             return {}
+
+         if layer_idx not in self.cache.fla_states:
+             self.cache.fla_states[layer_idx] = {}
+
+         state = self.cache.fla_states[layer_idx]
+
+         if attn_state is not None:
+             if (isinstance(attn_state, tuple) and len(attn_state) == 2
+                     and isinstance(attn_state[0], torch.Tensor)
+                     and isinstance(attn_state[1], torch.Tensor)):
+                 new_k, new_v = attn_state
+                 if 'attn_state' in state:
+                     old_k, old_v = state['attn_state']
+                     new_k = torch.cat([old_k, new_k], dim=1)
+                     new_v = torch.cat([old_v, new_v], dim=1)
+                 state['attn_state'] = (new_k, new_v)
+             else:
+                 state['attn_state'] = attn_state
+
+         for key, value in kwargs.items():
+             if key != 'layer_idx':
+                 state[key] = value
+
+         return state
+
+     def __getitem__(self, layer_idx):
+         return self.cache.fla_states.get(layer_idx, None)
+
+     def __setitem__(self, layer_idx, value):
+         self.cache.fla_states[layer_idx] = value
+
+     def __contains__(self, layer_idx):
+         return layer_idx in self.cache.fla_states
+
+     def __len__(self):
+         if not self.cache.fla_states:
+             return 0
+         return max(self.cache.fla_states.keys()) + 1
+
+
+ class AttentionType(str, Enum):
+     FULL = "full_attention"
+     SWA = "swa"
+     MAMBA2 = "mamba2"
+     GLA = "gla"
+     GDN = "gdn"
+     DN = "dn"
+     KDA = "kda"
+     NSA = "nsa"
+     MLA = "mla"
+     NOOP = "no-op"
+     LINEAR = "linear"
+
+ class FFNType(str, Enum):
+     FFN = "ffn"
+     MOE = "moe"
+     NOOP = "no-op"
+     LINEAR = "linear"
+     NFFN = "nffn"
+
+ class MetricType(str, Enum):
+     mse = "mse"
+     cosine = "cosine"
+     kl = "kl"
+
+ @dataclass
+ class ChildLayerVLConfig:
+     attention_type: Optional[AttentionType] = field(default=None)
+     ffn_type: Optional[FFNType] = field(default=None)
+     block_metric: Optional[MetricType] = field(default=None)
+     child_hidden_size: Optional[int] = field(default=None)
+     child_intermediate_size: Optional[int] = field(default=None)
+     gqa_num_kv_heads: Optional[int] = field(default=None)
+     child_num_attention_heads: Optional[int] = field(default=None)
+     inherit: str = field(default="false")
+     sliding_window: Optional[int] = field(default=1024)
+
+     def __post_init__(self):
+         if self.inherit is not None:
+             cleaned = str(self.inherit).strip().lower()
+             self.inherit = cleaned in ("true", "yes", "1")
+         else:
+             self.inherit = False
+
+
+ class NonGatedFFN(torch.nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.up_proj = torch.nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = torch.nn.ReLU()
+
+     def forward(self, x):
+         return self.down_proj(self.act_fn(self.up_proj(x)))
+
+
+ class NasVLDecoderLayer(GradientCheckpointingLayer):
+     def __init__(self, layer_idx: int, nas_config, parent_config, parent_model=None):
+         super().__init__()
+
+         self.parent_config = parent_config
+         self.parent_text_config = parent_config.text_config
+         self.layer_idx = layer_idx
+
+         if isinstance(nas_config, dict):
+             nas_config = ChildLayerVLConfig(**nas_config)
+         elif not isinstance(nas_config, ChildLayerVLConfig):
+             nas_config = ChildLayerVLConfig(**vars(nas_config))
+
+         self.nas_config = nas_config
+         self.attention_type = nas_config.attention_type
+         self.inherit = nas_config.inherit
+
+         self.child_attn_heads = int(
+             getattr(nas_config, "child_num_attention_heads", 0)
+             or self.parent_text_config.num_attention_heads
+         )
+         self.child_kv_heads = int(
+             getattr(nas_config, "gqa_num_kv_heads", 0)
+             or self.parent_text_config.num_key_value_heads
+         )
+         self.child_inter_size = int(
+             getattr(nas_config, "child_intermediate_size", 0)
+             or self.parent_text_config.intermediate_size
+         )
+
+         self.hidden_size = self.parent_text_config.hidden_size
+
+         if nas_config.attention_type == AttentionType.FULL:
+             attn_config = copy.deepcopy(self.parent_text_config)
+             attn_config.num_attention_heads = self.child_attn_heads
+             attn_config.num_key_value_heads = self.child_kv_heads
+             attn_config._attn_implementation = "sdpa"
+             self.self_attn = Qwen3VLTextAttention(config=attn_config, layer_idx=layer_idx)
+
+             if parent_model is not None and self.inherit:
+                 teacher_attn = parent_model.model.language_model.layers[layer_idx].self_attn
+                 if (self.child_attn_heads == self.parent_text_config.num_attention_heads
+                         and self.child_kv_heads == self.parent_text_config.num_key_value_heads):
+                     self.self_attn.load_state_dict(teacher_attn.state_dict(), strict=True)
+                 else:
+                     prune_qwen_attention_head(
+                         student_attn=self.self_attn,
+                         teacher_attn=teacher_attn,
+                         teacher_config=self.parent_text_config,
+                         target_q_heads=self.child_attn_heads,
+                         target_kv_heads=self.child_kv_heads,
+                     )
+
+         elif nas_config.attention_type == AttentionType.SWA:
+             self.sliding_window = int(
+                 getattr(nas_config, "sliding_window", 1024) or 1024
+             )
+
+             self._swa_mask_config = copy.deepcopy(parent_config)
+             self._swa_mask_config.sliding_window = self.sliding_window
+             if hasattr(self._swa_mask_config, "text_config"):
+                 self._swa_mask_config.text_config.sliding_window = self.sliding_window
+             self._swa_mask_config._attn_implementation = "sdpa"
+             if hasattr(self._swa_mask_config, "text_config"):
+                 self._swa_mask_config.text_config._attn_implementation = "sdpa"
+
+             attn_config = copy.deepcopy(self.parent_text_config)
+             attn_config.num_attention_heads = self.child_attn_heads
+             attn_config.num_key_value_heads = self.child_kv_heads
+             attn_config._attn_implementation = "sdpa"
+             self.self_attn = Qwen3VLTextAttention(config=attn_config, layer_idx=layer_idx)
+
+             if parent_model is not None and self.inherit:
+                 teacher_attn = parent_model.model.language_model.layers[layer_idx].self_attn
+                 if (self.child_attn_heads == self.parent_text_config.num_attention_heads
+                         and self.child_kv_heads == self.parent_text_config.num_key_value_heads):
+                     self.self_attn.load_state_dict(teacher_attn.state_dict(), strict=True)
+                 else:
+                     prune_qwen_attention_head(
+                         student_attn=self.self_attn,
+                         teacher_attn=teacher_attn,
+                         teacher_config=self.parent_text_config,
+                         target_q_heads=self.child_attn_heads,
+                         target_kv_heads=self.child_kv_heads,
+                     )
+
+         elif nas_config.attention_type == AttentionType.LINEAR:
+             self.self_attn = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+             if parent_model is not None and self.inherit:
+                 prune_qwen_attention_head_linear(
+                     student_attn=self.self_attn,
+                     teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                     teacher_config=parent_config.text_config,
+                 )
+
+         elif nas_config.attention_type == AttentionType.KDA:
+             config = KDAConfig(hidden_size=self.hidden_size)
+             config.expand_v = 1
+             self.self_attn = KimiDeltaAttention(
+                 mode=config.attn_mode,
+                 hidden_size=config.hidden_size,
+                 expand_v=config.expand_v,
+                 head_dim=config.head_dim,
+                 num_heads=config.num_heads,
+                 num_v_heads=config.num_v_heads,
+                 use_short_conv=config.use_short_conv,
+                 allow_neg_eigval=config.allow_neg_eigval,
+                 conv_size=config.conv_size,
+                 norm_eps=config.norm_eps,
+                 layer_idx=layer_idx,
+             )
+             if parent_model is not None and self.inherit:
+                 prune_qwen_attention_head_kda(
+                     student_attn=self.self_attn,
+                     teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                     teacher_config=parent_config.text_config,
+                 )
+
+         elif nas_config.attention_type == AttentionType.GDN:
+             config = GatedDeltaNetConfig(hidden_size=self.hidden_size)
+             self.self_attn = GatedDeltaNet(
+                 mode=config.attn_mode,
+                 hidden_size=config.hidden_size,
+                 expand_v=config.expand_v,
+                 head_dim=config.head_dim,
+                 num_heads=config.num_heads,
+                 num_v_heads=config.num_v_heads,
+                 use_gate=config.use_gate,
+                 use_short_conv=config.use_short_conv,
+                 allow_neg_eigval=config.allow_neg_eigval,
+                 conv_size=config.conv_size,
+                 norm_eps=config.norm_eps,
+                 layer_idx=layer_idx,
+             )
+             if parent_model is not None and self.inherit:
+                 prune_qwen_attention_head_gdn(
+                     student_attn=self.self_attn,
+                     teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                     teacher_config=parent_config.text_config,
+                 )
+
+         elif nas_config.attention_type == AttentionType.NSA:
+             config = NSAConfig(hidden_size=self.hidden_size)
+             self.self_attn = NativeSparseAttention(
+                 hidden_size=config.hidden_size,
+                 num_heads=config.num_heads,
+                 num_kv_heads=config.num_kv_heads,
+                 head_dim=config.head_dim,
+                 qkv_bias=config.qkv_bias,
+                 block_size=config.block_size,
+                 block_counts=config.block_counts,
+                 window_size=config.window_size,
+                 rope_theta=config.rope_theta,
+                 max_position_embeddings=config.max_position_embeddings,
+                 layer_idx=layer_idx,
+             )
+             if parent_model is not None and self.inherit:
+                 prune_qwen_attention_head_nsa(
+                     student_attn=self.self_attn,
+                     teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                     teacher_config=parent_config.text_config,
+                 )
+
+         elif nas_config.attention_type == AttentionType.MLA:
+             config = MLAConfig(hidden_size=self.hidden_size)
+             self.self_attn = MultiheadLatentAttention(
+                 hidden_size=config.hidden_size,
+                 num_heads=config.num_heads,
+                 q_lora_rank=config.q_lora_rank,
+                 qk_rope_head_dim=config.qk_rope_head_dim,
+                 kv_lora_rank=config.kv_lora_rank,
+                 v_head_dim=config.v_head_dim,
+                 qk_nope_head_dim=config.qk_nope_head_dim,
+                 qk_head_dim=config.qk_head_dim,
+                 window_size=config.window_size,
+                 rope_theta=config.rope_theta,
+                 max_position_embeddings=config.max_position_embeddings,
+                 rope_scaling=config.rope_scaling,
+                 layer_idx=layer_idx,
+             )
+             if parent_model is not None and self.inherit:
+                 prune_qwen_attention_head_mla(
+                     student_attn=self.self_attn,
+                     teacher_attn=parent_model.model.language_model.layers[layer_idx].self_attn,
+                     teacher_config=parent_config.text_config,
+                 )
+
+         elif nas_config.attention_type == AttentionType.NOOP:
+             self.self_attn = None
+
+         else:
+             raise Exception(f"Attention Type Not Define: {nas_config.attention_type}")
+
+         if nas_config.ffn_type == FFNType.FFN:
+             mlp_config = copy.deepcopy(self.parent_text_config)
+             mlp_config.intermediate_size = self.child_inter_size
+             self.mlp = Qwen3VLTextMLP(mlp_config)
+
+             if parent_model is not None and self.inherit:
+                 teacher_mlp = parent_model.model.language_model.layers[layer_idx].mlp
+                 teacher_inter_size = teacher_mlp.up_proj.weight.shape[0]
+
+                 if self.child_inter_size < teacher_inter_size:
+                     init_student_ffn(self.mlp, teacher_mlp, self.child_inter_size)
+                 elif self.child_inter_size == teacher_inter_size:
+                     self.mlp.load_state_dict(teacher_mlp.state_dict(), strict=True)
+                 else:
+                     raise ValueError(
+                         f"Layer {layer_idx}: Student intermediate size ({self.child_inter_size}) "
+                         f"is larger than Teacher ({teacher_inter_size})."
+                     )
+
+         elif nas_config.ffn_type == FFNType.LINEAR:
+             self.mlp = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+             if parent_model is not None and self.inherit:
+                 init_student_ffn_linear(
+                     self.mlp, parent_model.model.language_model.layers[layer_idx].mlp
+                 )
+
+         elif nas_config.ffn_type == FFNType.NFFN:
+             nffn_config = copy.deepcopy(self.parent_text_config)
+             nffn_config.intermediate_size = self.child_inter_size
+             self.mlp = NonGatedFFN(nffn_config)
+
+         elif nas_config.ffn_type == FFNType.NOOP:
+             self.mlp = None
+
+         else:
+             raise Exception(f"FFN Type Not Define: {nas_config.ffn_type}")
+
+         norm_eps = self.parent_text_config.rms_norm_eps
+         if self.self_attn is not None:
+             self.input_layernorm = Qwen3VLTextRMSNorm(self.hidden_size, eps=norm_eps)
+             if parent_model is not None:
+                 self.input_layernorm.load_state_dict(
+                     parent_model.model.language_model.layers[layer_idx].input_layernorm.state_dict()
+                 )
+         else:
+             self.input_layernorm = None
+
+         if self.mlp is not None:
+             self.post_attention_layernorm = Qwen3VLTextRMSNorm(self.hidden_size, eps=norm_eps)
+             if parent_model is not None:
+                 self.post_attention_layernorm.load_state_dict(
+                     parent_model.model.language_model.layers[layer_idx].post_attention_layernorm.state_dict()
+                 )
+         else:
+             self.post_attention_layernorm = None
+
413
+ def forward(
414
+ self,
415
+ hidden_states: torch.Tensor,
416
+ attention_mask: Optional[torch.Tensor] = None,
417
+ position_ids: Optional[torch.LongTensor] = None,
418
+ past_key_values: Optional[Cache] = None,
419
+ use_cache: Optional[bool] = False,
420
+ cache_position: Optional[torch.LongTensor] = None,
421
+ position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
422
+ **kwargs,
423
+ ) -> Tuple[torch.Tensor, Optional[Cache]]:
424
+
425
+ residual = hidden_states
426
+ present_key_values = past_key_values
427
+
428
+ mask_2d = None
429
+ mask_4d = None
430
+
431
+ if attention_mask is not None:
432
+ if attention_mask.ndim == 4:
433
+ mask_2d = attention_mask[:, 0, -1, :]
434
+ else:
435
+ mask_2d = attention_mask
436
+
437
+ if self.nas_config.attention_type == AttentionType.FULL:
438
+ if attention_mask.ndim == 4:
439
+ mask_4d = attention_mask
440
+ else:
441
+ if cache_position is None:
442
+ past_seen_tokens = (
443
+ past_key_values.get_seq_length()
444
+ if past_key_values is not None
445
+ else 0
446
+ )
447
+ cache_position = torch.arange(
448
+ past_seen_tokens,
449
+ past_seen_tokens + hidden_states.shape[1],
450
+ device=hidden_states.device,
451
+ )
452
+ mask_4d = create_causal_mask(
453
+ input_embeds=hidden_states,
454
+ attention_mask=attention_mask,
455
+ cache_position=cache_position,
456
+ past_key_values=past_key_values,
457
+ config=self.parent_config,
458
+ )
459
+
460
+ elif self.nas_config.attention_type == AttentionType.SWA:
461
+ if attention_mask.ndim == 4:
462
+ mask_4d = attention_mask
463
+ else:
464
+ if cache_position is None:
465
+ past_seen_tokens = (
466
+ past_key_values.get_seq_length()
467
+ if past_key_values is not None
468
+ else 0
469
+ )
470
+ cache_position = torch.arange(
471
+ past_seen_tokens,
472
+ past_seen_tokens + hidden_states.shape[1],
473
+ device=hidden_states.device,
474
+ )
475
+ mask_4d = create_sliding_window_causal_mask(
476
+ config=self._swa_mask_config,
477
+ input_embeds=hidden_states,
478
+ attention_mask=attention_mask,
479
+ cache_position=cache_position,
480
+ past_key_values=past_key_values,
481
+ )
482
+
483
+ if self.nas_config.attention_type == AttentionType.SWA and mask_4d is None:
484
+ if cache_position is None:
485
+ past_seen_tokens = (
486
+ past_key_values.get_seq_length()
487
+ if past_key_values is not None
488
+ else 0
489
+ )
490
+ cache_position = torch.arange(
491
+ past_seen_tokens,
492
+ past_seen_tokens + hidden_states.shape[1],
493
+ device=hidden_states.device,
494
+ )
495
+ mask_4d = create_sliding_window_causal_mask(
496
+ config=self._swa_mask_config,
497
+ input_embeds=hidden_states,
498
+ attention_mask=None,
499
+ cache_position=cache_position,
500
+ past_key_values=past_key_values,
501
+ )
502
+
503
+ if self.nas_config.attention_type == AttentionType.FULL:
504
+ hidden_states = self.input_layernorm(hidden_states)
505
+ hidden_states, _ = self.self_attn(
506
+ hidden_states=hidden_states,
507
+ attention_mask=mask_4d,
508
+ position_ids=position_ids,
509
+ past_key_values=past_key_values,
510
+ use_cache=use_cache,
511
+ cache_position=cache_position,
512
+ position_embeddings=position_embeddings,
513
+ **kwargs,
514
+ )
515
+ hidden_states = residual + hidden_states
516
+
517
+ elif self.nas_config.attention_type == AttentionType.SWA:
518
+ hidden_states = self.input_layernorm(hidden_states)
519
+ hidden_states, _ = self.self_attn(
520
+ hidden_states=hidden_states,
521
+ attention_mask=mask_4d,
522
+ position_ids=position_ids,
523
+ past_key_values=past_key_values,
524
+ use_cache=use_cache,
525
+ cache_position=cache_position,
526
+ position_embeddings=position_embeddings,
527
+ **kwargs,
528
+ )
529
+ hidden_states = residual + hidden_states
530
+
531
+ elif self.nas_config.attention_type == AttentionType.LINEAR:
532
+ hidden_states = self.input_layernorm(hidden_states)
533
+ hidden_states = self.self_attn(hidden_states)
534
+ hidden_states = residual + hidden_states
535
+
536
+ elif self.nas_config.attention_type == AttentionType.NOOP:
537
+ hidden_states = residual
538
+
539
+ elif self.nas_config.attention_type in [
540
+ AttentionType.KDA,
541
+ AttentionType.GDN
542
+ ]:
543
+ fla_cache_proxy = None
544
+ if use_cache and past_key_values is not None:
545
+ fla_cache_proxy = FLACacheAdapter(past_key_values)
546
+
547
+ if self.training:
548
+ mode = "chunk"
549
+ else:
550
+ mode = "fused_recurrent" if use_cache else "chunk"
551
+
552
+ batch_size, q_len, _ = hidden_states.shape
553
+
554
+ if q_len > 64 or use_cache:
555
+ hidden_states = self.input_layernorm(hidden_states)
556
+
557
+ outputs = self.self_attn(
558
+ hidden_states=hidden_states,
559
+ attention_mask=mask_2d,
560
+ past_key_values=fla_cache_proxy,
561
+ use_cache=use_cache,
562
+ mode=mode,
563
+ **kwargs,
564
+ )
565
+
566
+ if isinstance(outputs, tuple):
567
+ hidden_states = outputs[0]
568
+ else:
569
+ hidden_states = outputs
570
+
571
+ hidden_states = residual + hidden_states
572
+ else:
573
+ hidden_states = residual
574
+
575
+ elif self.nas_config.attention_type == AttentionType.NSA:
576
+ hidden_states = self.input_layernorm(hidden_states)
577
+
578
+ if self.training:
579
+ nsa_kwargs = {k: v for k, v in kwargs.items() if k in ("cu_seqlens",)}
580
+
581
+ outputs = self.self_attn(
582
+ hidden_states=hidden_states,
583
+ attention_mask=mask_2d,
584
+ past_key_values=None,
585
+ use_cache=False,
586
+ **nsa_kwargs,
587
+ )
588
+ if isinstance(outputs, tuple):
589
+ hidden_states = outputs[0]
590
+ else:
591
+ hidden_states = outputs
592
+ else:
593
+ if past_key_values is not None and use_cache:
594
+ if not hasattr(past_key_values, "fla_states"):
595
+ past_key_values.fla_states = {}
596
+
597
+ nsa_state = past_key_values.fla_states.get(
598
+ f"nsa_hidden_{self.layer_idx}", None
599
+ )
600
+
601
+ if nsa_state is not None:
602
+ full_hidden = torch.cat([nsa_state, hidden_states], dim=1)
603
+ else:
604
+ full_hidden = hidden_states
605
+
606
+ past_key_values.fla_states[f"nsa_hidden_{self.layer_idx}"] = (
607
+ full_hidden.detach()
608
+ )
609
+
610
+ full_mask = None
611
+ if mask_2d is not None:
612
+ cached_len = full_hidden.shape[1] - hidden_states.shape[1]
613
+ if cached_len > 0:
614
+ prefix_mask = torch.ones(
615
+ mask_2d.shape[0],
616
+ cached_len,
617
+ dtype=mask_2d.dtype,
618
+ device=mask_2d.device,
619
+ )
620
+ full_mask = torch.cat([prefix_mask, mask_2d], dim=1)
621
+ else:
622
+ full_mask = mask_2d
623
+
624
+ outputs = self.self_attn(
625
+ hidden_states=full_hidden,
626
+ attention_mask=full_mask,
627
+ past_key_values=None,
628
+ use_cache=False,
629
+ **{k: v for k, v in kwargs.items() if k in ("cu_seqlens",)},
630
+ )
631
+
632
+ if isinstance(outputs, tuple):
633
+ full_output = outputs[0]
634
+ else:
635
+ full_output = outputs
636
+
637
+ hidden_states = full_output[:, -hidden_states.shape[1] :, :]
638
+ else:
639
+ outputs = self.self_attn(
640
+ hidden_states=hidden_states,
641
+ attention_mask=mask_2d,
642
+ past_key_values=None,
643
+ use_cache=False,
644
+ )
645
+ if isinstance(outputs, tuple):
646
+ hidden_states = outputs[0]
647
+ else:
648
+ hidden_states = outputs
649
+
650
+ if isinstance(hidden_states, tuple):
651
+ hidden_states = hidden_states[0]
652
+
653
+ hidden_states = residual + hidden_states
654
+
655
+ elif self.nas_config.attention_type == AttentionType.MLA:
656
+ hidden_states = self.input_layernorm(hidden_states)
657
+
658
+ fla_cache_proxy = None
659
+ if past_key_values is not None:
660
+ fla_cache_proxy = FLACacheAdapter(past_key_values)
661
+
662
+ outputs = self.self_attn(
663
+ hidden_states=hidden_states,
664
+ attention_mask=mask_2d,
665
+ past_key_values=fla_cache_proxy,
666
+ use_cache=use_cache,
667
+ **kwargs,
668
+ )
669
+
670
+ if isinstance(outputs, tuple):
671
+ hidden_states = outputs[0]
672
+ else:
673
+ hidden_states = outputs
674
+
675
+ hidden_states = residual + hidden_states
676
+
677
+ else:
678
+ raise Exception(f"Attention Type Not Defined: {self.nas_config.attention_type}")
679
+
680
+ if self.nas_config.ffn_type in [FFNType.FFN, FFNType.NFFN, FFNType.LINEAR]:
681
+ residual = hidden_states
682
+ hidden_states = self.post_attention_layernorm(hidden_states)
683
+ hidden_states = self.mlp(hidden_states)
684
+ hidden_states = residual + hidden_states
685
+ elif self.nas_config.ffn_type == FFNType.NOOP:
686
+ pass
687
+ else:
688
+ raise Exception(f"FFN Type Not Defined: {self.nas_config.ffn_type}")
689
+
690
+ return hidden_states, present_key_values
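Reviewer note on the NSA inference branch above: when a cache is passed, the layer stores its normalized hidden states under `fla_states["nsa_hidden_<layer_idx>"]`, concatenates them with the new tokens, runs attention over the full sequence with `use_cache=False`, and keeps only the outputs for the new positions. The following is a minimal pure-Python sketch of that pattern, not the committed implementation. `prefix_mean` is a hypothetical stand-in for the attention module, and plain lists replace tensors.

```python
from typing import Callable, Dict, List


def prefix_mean(values: List[float]) -> List[float]:
    """Toy causal 'attention': position i sees the mean of values[0..i]."""
    out: List[float] = []
    running = 0.0
    for count, value in enumerate(values, start=1):
        running += value
        out.append(running / count)
    return out


def nsa_decode_step(
    cache: Dict[str, List[float]],
    layer_idx: int,
    new_hidden: List[float],
    attend: Callable[[List[float]], List[float]],
) -> List[float]:
    """Recompute attention over cached + new hidden states, return new slots only."""
    if not new_hidden:
        return []
    key = f"nsa_hidden_{layer_idx}"  # same key scheme as the diff
    cached = cache.get(key)
    # Concatenate along the sequence axis (torch.cat(..., dim=1) in the diff).
    full_hidden = cached + new_hidden if cached is not None else list(new_hidden)
    cache[key] = full_hidden
    # The whole sequence is re-attended; no key/value cache is used.
    full_output = attend(full_hidden)
    # Keep only the outputs that correspond to the new tokens.
    return full_output[-len(new_hidden):]
```

Because every step re-attends the full cached prefix, the per-step cost in this pattern grows with the sequence length, unlike the key/value-cached branches.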
preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": true,
5
+ "device": null,
6
+ "disable_grouping": null,
7
+ "do_center_crop": null,
8
+ "do_convert_rgb": true,
9
+ "do_normalize": true,
10
+ "do_pad": null,
11
+ "do_rescale": true,
12
+ "do_resize": true,
13
+ "image_mean": [
14
+ 0.5,
15
+ 0.5,
16
+ 0.5
17
+ ],
18
+ "image_processor_type": "Qwen2VLImageProcessorFast",
19
+ "image_std": [
20
+ 0.5,
21
+ 0.5,
22
+ 0.5
23
+ ],
24
+ "input_data_format": null,
25
+ "max_pixels": 4194304,
26
+ "merge_size": 2,
27
+ "min_pixels": 4096,
28
+ "pad_size": null,
29
+ "patch_size": 16,
30
+ "processor_class": "Qwen3VLProcessor",
31
+ "resample": 3,
32
+ "rescale_factor": 0.00392156862745098,
33
+ "return_tensors": null,
34
+ "size": {
35
+ "longest_edge": 1605632,
36
+ "shortest_edge": 3136
37
+ },
38
+ "temporal_patch_size": 2
39
+ }
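Reviewer note on the values above: `rescale_factor` is 1/255, and with `image_mean` and `image_std` both 0.5 the rescale-then-normalize steps map 8-bit pixel values onto roughly [-1, 1]. The sketch below works through that arithmetic. The token-count helper assumes the usual Qwen-VL convention that `merge_size` × `merge_size` patches are merged into one visual token; that convention is not shown in this diff, and the function names are hypothetical.

```python
RESCALE_FACTOR = 0.00392156862745098  # from the config; equals 1/255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5
PATCH_SIZE = 16
MERGE_SIZE = 2


def normalize_pixel(value: int) -> float:
    """Apply do_rescale followed by do_normalize to one 8-bit channel value."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD


def merged_token_count(height: int, width: int) -> int:
    """Visual tokens for an image, ASSUMING merge_size x merge_size patch merging."""
    cell = PATCH_SIZE * MERGE_SIZE  # 32 pixels per merged token side
    if height % cell or width % cell:
        raise ValueError("height and width must be multiples of patch_size * merge_size")
    return (height // cell) * (width // cell)
```

Under that assumption, a 64×64 image (4096 pixels, the `min_pixels` value) would yield 4 merged tokens.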
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
3
+ size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,240 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "clean_up_tokenization_spaces": false,
231
+ "eos_token": "<|im_end|>",
232
+ "errors": "replace",
233
+ "extra_special_tokens": {},
234
+ "model_max_length": 262144,
235
+ "pad_token": "<|endoftext|>",
236
+ "processor_class": "Qwen3VLProcessor",
237
+ "split_special_tokens": false,
238
+ "tokenizer_class": "Qwen2Tokenizer",
239
+ "unk_token": null
240
+ }
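Reviewer note on the tokenizer config above: `eos_token` and `pad_token` are given by content, and their IDs come from `added_tokens_decoder`, where only some entries are flagged `"special": true`. The sketch below resolves those IDs from a trimmed excerpt of the config using only the standard library. The excerpt keeps four of the 26 entries, and the helper names are hypothetical.

```python
import json

# Trimmed excerpt of tokenizer_config.json (fields other than content/special dropped).
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151645": {"content": "<|im_end|>", "special": true},
    "151657": {"content": "<tool_call>", "special": false},
    "151667": {"content": "<think>", "special": false}
  },
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>"
}
"""


def token_ids_by_content(config: dict) -> dict:
    """Map each added token's string content to its integer ID."""
    return {
        entry["content"]: int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
    }


def special_ids(config: dict) -> list:
    """Sorted IDs of added tokens flagged as special."""
    return sorted(
        int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
        if entry["special"]
    )


config = json.loads(CONFIG_EXCERPT)
ids = token_ids_by_content(config)
eos_id = ids[config["eos_token"]]
pad_id = ids[config["pad_token"]]
```

In the excerpt, `<tool_call>` and `<think>` are added tokens but are not flagged special, which matches the full config.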
video_preprocessor_config.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": true,
5
+ "device": null,
6
+ "do_center_crop": null,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "do_sample_frames": true,
12
+ "fps": null,
13
+ "image_mean": [
14
+ 0.5,
15
+ 0.5,
16
+ 0.5
17
+ ],
18
+ "image_std": [
19
+ 0.5,
20
+ 0.5,
21
+ 0.5
22
+ ],
23
+ "input_data_format": null,
24
+ "max_frames": 64,
25
+ "merge_size": 2,
26
+ "min_frames": 4,
27
+ "num_frames": null,
28
+ "pad_size": null,
29
+ "patch_size": 16,
30
+ "processor_class": "Qwen3VLProcessor",
31
+ "resample": 3,
32
+ "rescale_factor": 0.00392156862745098,
33
+ "return_metadata": false,
34
+ "size": {
35
+ "longest_edge": 14680064,
36
+ "shortest_edge": 1228800
37
+ },
38
+ "temporal_patch_size": 2,
39
+ "video_metadata": null,
40
+ "video_processor_type": "Qwen3VLVideoProcessor"
41
+ }
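Reviewer note on the frame limits above: the config sets `min_frames` to 4, `max_frames` to 64, and `temporal_patch_size` to 2. The actual sampling logic of `Qwen3VLVideoProcessor` is not part of this diff, so the sketch below is only an illustration under two stated assumptions: that the sampled frame count is clamped to the configured range, and that frames are grouped in units of `temporal_patch_size`. Both helper functions are hypothetical.

```python
MIN_FRAMES = 4
MAX_FRAMES = 64
TEMPORAL_PATCH_SIZE = 2


def clamp_frame_count(requested: int) -> int:
    """ASSUMED behavior: clamp to [min_frames, max_frames], then round down
    to a multiple of temporal_patch_size."""
    clamped = max(MIN_FRAMES, min(MAX_FRAMES, requested))
    return clamped - clamped % TEMPORAL_PATCH_SIZE


def temporal_groups(frame_count: int) -> int:
    """Number of temporal patches if each spans temporal_patch_size frames."""
    return frame_count // TEMPORAL_PATCH_SIZE
```

Under these assumptions, a video sampled at the maximum of 64 frames would contribute 32 temporal patches.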
vocab.json ADDED
The diff for this file is too large to render. See raw diff