Commit 088d745 by UnstableLlama · verified · 1 parent: c448612

Upload 5 files

Files changed (5)
  1. README.md +179 -3
  2. config.json +74 -0
  3. dflash.py +188 -0
  4. model.safetensors +3 -0
  5. quantization_config.json +2383 -0
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - dflash
+ - speculative-decoding
+ - block-diffusion
+ - draft-model
+ - efficiency
+ - qwen
+ - diffusion-language-model
+ ---
+
+ # Qwen3.6-35B-A3B-DFlash
+
+ [**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
+
+ **DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B).
+
+ <div align="center">
+   <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
+ </div>
+
+ ## Quick Start
+
+ ### Installation
+
+ vLLM (We temporarily modify the installation through this PR to support interleaved SWA and ensure correct handling of target hidden states for optimal performance):
+ ```bash
+ uv pip install vllm
+ uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"
+ ```
+
+ SGLang:
+ ```bash
+ uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
+ ```
+
+ ### Launch Server
+
+ vLLM:
+ ```bash
+ vllm serve Qwen/Qwen3.6-35B-A3B \
+   --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
+   --attention-backend flash_attn \
+   --max-num-batched-tokens 32768
+ ```
+
+ SGLang:
+ ```bash
+ # Optional: enable schedule overlapping (experimental, may not be stable)
+ # export SGLANG_ENABLE_SPEC_V2=1
+ # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+ # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+
+ python -m sglang.launch_server \
+   --model-path Qwen/Qwen3.6-35B-A3B \
+   --speculative-algorithm DFLASH \
+   --speculative-draft-model-path z-lab/Qwen3.6-35B-A3B-DFlash \
+   --speculative-num-draft-tokens 16 \
+   --tp-size 1 \
+   --attention-backend fa3 \
+   --mem-fraction-static 0.75 \
+   --mamba-scheduler-strategy extra_buffer \
+   --trust-remote-code
+ ```
+ > **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
+
+ ### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="Qwen/Qwen3.6-35B-A3B",
+     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+     max_tokens=4096,
+     temperature=0.0
+ )
+ print(response.choices[0].message.content)
+ ```
+
+ ## Benchmark Results
+
+ **Setup:** Single NVIDIA B200, SGLang, thinking enabled, max output length 4096. We report end-to-end throughput, including prefill time. See our [GitHub repository](https://github.com/z-lab/dflash) for reproduction scripts.
+
+ ### Throughput and Speedup
+
+ DFlash achieves up to **2.9x** speedup at concurrency 1.
+
+ _Tokens/sec (speedup vs. autoregressive baseline)_
+
+ **Block Size = 16**
+ | Task | Concurrency | AR | **DFlash** |
+ |---|---:|---:|---:|
+ | Math500 | 1 | 234 | **682 (2.9x)** |
+ | | 8 | 1266 | **3138 (2.5x)** |
+ | | 16 | 1954 | **4813 (2.5x)** |
+ | | 32 | 2755 | **6520 (2.4x)** |
+ | GSM8K | 1 | 235 | **556 (2.4x)** |
+ | | 8 | 1236 | **2564 (2.1x)** |
+ | | 16 | 1886 | **3821 (2.0x)** |
+ | | 32 | 2699 | **5239 (1.9x)** |
+ | HumanEval | 1 | 238 | **603 (2.5x)** |
+ | | 8 | 1255 | **2800 (2.2x)** |
+ | | 16 | 1944 | **4208 (2.2x)** |
+ | | 32 | 2767 | **5782 (2.1x)** |
+ | MBPP | 1 | 235 | **559 (2.4x)** |
+ | | 8 | 1224 | **2538 (2.1x)** |
+ | | 16 | 1948 | **3816 (2.0x)** |
+ | | 32 | 2780 | **5378 (1.9x)** |
+ | MT-Bench | 1 | 233 | **442 (1.9x)** |
+ | | 8 | 1238 | **2028 (1.6x)** |
+ | | 16 | 1885 | **2997 (1.6x)** |
+ | | 32 | 2633 | **4034 (1.5x)** |
+ | Alpaca | 1 | 235 | **393 (1.7x)** |
+ | | 8 | 1221 | **1782 (1.5x)** |
+ | | 16 | 1844 | **2567 (1.4x)** |
+ | | 32 | 2579 | **3689 (1.4x)** |
+
+ **Block Size = 8**
+ | Task | Concurrency | AR | **DFlash** |
+ |---|---:|---:|---:|
+ | Math500 | 1 | 234 | **617 (2.6x)** |
+ | | 8 | 1266 | **2839 (2.2x)** |
+ | | 16 | 1954 | **4465 (2.3x)** |
+ | | 32 | 2755 | **6614 (2.4x)** |
+ | GSM8K | 1 | 235 | **540 (2.3x)** |
+ | | 8 | 1236 | **2466 (2.0x)** |
+ | | 16 | 1886 | **3899 (2.1x)** |
+ | | 32 | 2699 | **5713 (2.1x)** |
+ | HumanEval | 1 | 238 | **561 (2.4x)** |
+ | | 8 | 1255 | **2655 (2.1x)** |
+ | | 16 | 1944 | **4135 (2.1x)** |
+ | | 32 | 2767 | **6059 (2.2x)** |
+ | MBPP | 1 | 235 | **497 (2.1x)** |
+ | | 8 | 1224 | **2324 (1.9x)** |
+ | | 16 | 1948 | **3636 (1.9x)** |
+ | | 32 | 2780 | **4884 (1.8x)** |
+ | MT-Bench | 1 | 233 | **438 (1.9x)** |
+ | | 8 | 1238 | **2060 (1.7x)** |
+ | | 16 | 1885 | **3182 (1.7x)** |
+ | | 32 | 2633 | **4720 (1.8x)** |
+ | Alpaca | 1 | 235 | **407 (1.7x)** |
+ | | 8 | 1221 | **1880 (1.5x)** |
+ | | 16 | 1844 | **2903 (1.6x)** |
+ | | 32 | 2579 | **4115 (1.6x)** |
+
+ ### Acceptance Length
+
+ | Task | B8 | B16 |
+ |---|---:|---:|
+ | Math500 | 5.56 | 7.35 |
+ | GSM8K | 5.21 | 6.73 |
+ | HumanEval | 5.09 | 6.44 |
+ | MBPP | 4.78 | 5.83 |
+ | MT-Bench | 4.20 | 5.14 |
+ | Alpaca | 3.94 | 4.62 |
+
+
+ ## Acknowledgements
+
+ Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.
+
+ ## Citation
+
+ If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).
+
+ ```bibtex
+ @article{chen2026dflash,
+   title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
+   author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
+   journal = {arXiv preprint arXiv:2602.06036},
+   year = {2026}
+ }
+ ```
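The speedup figures in the README tables appear to be the DFlash throughput divided by the autoregressive (AR) throughput, rounded to one decimal. The following is a minimal sketch of that arithmetic for the Math500 rows at block size 16; the names `MATH500_B16`, `speedup`, and `speedups` are illustrative and not part of the commit.

```python
# Tokens/sec for Math500 at block size 16, copied from the README table:
# concurrency -> (autoregressive baseline, DFlash)
MATH500_B16 = {
    1: (234, 682),
    8: (1266, 3138),
    16: (1954, 4813),
    32: (2755, 6520),
}


def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Ratio of DFlash throughput to the autoregressive baseline."""
    return dflash_tps / ar_tps


def speedups(rows: dict) -> dict:
    """Speedup per concurrency level, rounded to one decimal like the table."""
    return {c: round(speedup(ar, df), 1) for c, (ar, df) in rows.items()}


if __name__ == "__main__":
    for concurrency, ratio in speedups(MATH500_B16).items():
        print(f"concurrency {concurrency}: {ratio}x")
```

The same check can be repeated for any other row by swapping in its AR and DFlash values.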
config.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "architectures": [
+     "DFlashDraftModel"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoModel": "dflash.DFlashDraftModel"
+   },
+   "block_size": 16,
+   "dflash_config": {
+     "mask_token_id": 248070,
+     "target_layer_ids": [
+       1,
+       10,
+       19,
+       28,
+       37
+     ]
+   },
+   "dtype": "bfloat16",
+   "eos_token_id": 248046,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 6144,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 262144,
+   "max_window_layers": 8,
+   "model_type": "qwen3",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 8,
+   "num_key_value_heads": 4,
+   "num_target_layers": 40,
+   "pad_token_id": 248044,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "beta_fast": 32.0,
+     "beta_slow": 1.0,
+     "factor": 64.0,
+     "original_max_position_embeddings": 4096,
+     "rope_type": "yarn",
+     "type": "yarn"
+   },
+   "rope_theta": 10000000,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "transformers_version": "4.57.1",
+   "use_cache": false,
+   "use_sliding_window": false,
+   "vocab_size": 248320,
+   "quantization_config": {
+     "quant_method": "exl3",
+     "version": "0.0.32",
+     "bits": 4.0,
+     "head_bits": 6,
+     "calibration": {
+       "rows": 250,
+       "cols": 2048
+     },
+     "out_scales": "always",
+     "codebook": "mcg"
+   }
+ }
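The values in config.json determine the shapes of the drafter's linear layers as they are constructed in dflash.py. As a rough sketch (the dictionary below copies only the relevant keys, and `projection_shapes` is a hypothetical helper, not part of the commit), the `(in_features, out_features)` pairs can be derived like this:

```python
# Relevant values copied from config.json in this commit.
CONFIG = {
    "hidden_size": 2048,
    "head_dim": 128,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "intermediate_size": 6144,
    "target_layer_ids": [1, 10, 19, 28, 37],
}


def projection_shapes(cfg: dict) -> dict:
    """Return (in_features, out_features) for each linear layer of the drafter."""
    hidden = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]
    inter = cfg["intermediate_size"]
    return {
        # fc fuses the hidden states taken from each listed target layer.
        "fc": (len(cfg["target_layer_ids"]) * hidden, hidden),
        "q_proj": (hidden, q_out),
        "k_proj": (hidden, kv_out),
        "v_proj": (hidden, kv_out),
        "o_proj": (q_out, hidden),
        "gate_proj": (hidden, inter),
        "up_proj": (hidden, inter),
        "down_proj": (inter, hidden),
    }


SHAPES = projection_shapes(CONFIG)
# Grouped-query attention: query heads sharing each key/value head.
KV_GROUPS = CONFIG["num_attention_heads"] // CONFIG["num_key_value_heads"]
```

These sizes line up with the `suh` (input-side) and `svh` (output-side) vector lengths recorded per tensor in quantization_config.json, for example 10240 and 2048 for `fc`.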
dflash.py ADDED
@@ -0,0 +1,188 @@
+ from typing import Optional, Callable
+ from typing_extensions import Unpack, Tuple
+ import torch
+ from torch import nn
+ from transformers.models.qwen3.modeling_qwen3 import (
+     Qwen3RMSNorm,
+     Qwen3RotaryEmbedding,
+     Qwen3Config,
+     Qwen3PreTrainedModel,
+     Qwen3MLP,
+     GradientCheckpointingLayer,
+     FlashAttentionKwargs,
+     rotate_half,
+     eager_attention_forward,
+     ALL_ATTENTION_FUNCTIONS,
+ )
+ from transformers.modeling_outputs import CausalLMOutputWithPast
+ from transformers.cache_utils import Cache
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     cos = cos.unsqueeze(unsqueeze_dim)
+     sin = sin.unsqueeze(unsqueeze_dim)
+     q_len = q.size(-2)
+     q_embed = (q * cos[..., -q_len:, :]) + (rotate_half(q) * sin[..., -q_len:, :])
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+ class Qwen3DFlashAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: Qwen3Config, layer_idx: int):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = config.attention_dropout
+         self.is_causal = False
+         self.q_proj = nn.Linear(
+             config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.k_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.v_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.o_proj = nn.Linear(
+             config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+         )
+         self.q_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)
+         self.k_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)
+         self.sliding_window = config.sliding_window if config.layer_types[layer_idx] == "sliding_attention" else None
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         target_hidden: torch.Tensor,
+         position_embeddings: tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_values: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+         bsz, q_len = hidden_states.shape[:-1]
+         ctx_len = target_hidden.shape[1]
+         q = self.q_proj(hidden_states)
+         q = q.view(bsz, q_len, -1, self.head_dim)
+         q = self.q_norm(q).transpose(1, 2)
+         k_ctx = self.k_proj(target_hidden)
+         k_noise = self.k_proj(hidden_states)
+         v_ctx = self.v_proj(target_hidden)
+         v_noise = self.v_proj(hidden_states)
+         k = torch.cat([k_ctx, k_noise], dim=1).view(bsz, ctx_len + q_len, -1, self.head_dim)
+         v = torch.cat([v_ctx, v_noise], dim=1).view(bsz, ctx_len + q_len, -1, self.head_dim)
+         k = self.k_norm(k).transpose(1, 2)
+         v = v.transpose(1, 2)
+         cos, sin = position_embeddings
+         q, k = apply_rotary_pos_emb(q, k, cos, sin)
+         if past_key_values is not None:
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             k, v = past_key_values.update(k, v, self.layer_idx, cache_kwargs)
+         attn_fn: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             attn_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+         attn_output, attn_weights = attn_fn(
+             self,
+             q,
+             k,
+             v,
+             attention_mask,
+             dropout=0.0 if not self.training else self.attention_dropout,
+             scaling=self.scaling,
+             sliding_window=self.sliding_window,
+             **kwargs,
+         )
+         attn_output = attn_output.reshape(bsz, q_len, -1)
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
+
+ class Qwen3DFlashDecoderLayer(GradientCheckpointingLayer):
+     def __init__(self, config: Qwen3Config, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = Qwen3DFlashAttention(config=config, layer_idx=layer_idx)
+         self.mlp = Qwen3MLP(config)
+         self.input_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         target_hidden: Optional[torch.Tensor] = None,
+         hidden_states: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+         hidden_states = self.self_attn(
+             hidden_states=hidden_states,
+             target_hidden=target_hidden,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )[0]
+         hidden_states = residual + hidden_states
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+         return hidden_states
+
+ class DFlashDraftModel(Qwen3PreTrainedModel):
+     config_class = Qwen3Config
+     _no_split_modules = ["Qwen3DFlashDecoderLayer"]
+
+     def __init__(self, config) -> None:
+         super().__init__(config)
+         self.config = config
+         self.layers = nn.ModuleList(
+             [Qwen3DFlashDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.target_layer_ids = self.config.dflash_config.get("target_layer_ids", None)
+         self.norm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.rotary_emb = Qwen3RotaryEmbedding(config)
+         self.fc = nn.Linear(len(self.target_layer_ids) * config.hidden_size, config.hidden_size, bias=False)
+         self.hidden_norm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.block_size = config.block_size
+         self.mask_token_id = self.config.dflash_config.get("mask_token_id", None)
+         self.post_init()
+
+     def forward(
+         self,
+         position_ids: torch.LongTensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         noise_embedding: Optional[torch.Tensor] = None,
+         target_hidden: Optional[torch.Tensor] = None,
+         past_key_values: Optional[Cache] = None,
+         use_cache: bool = False,
+         **kwargs,
+     ) -> CausalLMOutputWithPast:
+         hidden_states = noise_embedding
+         target_hidden = self.hidden_norm(self.fc(target_hidden))
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+         for layer in self.layers:
+             hidden_states = layer(
+                 hidden_states=hidden_states,
+                 target_hidden=target_hidden,
+                 attention_mask=attention_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_values,
+                 use_cache=use_cache,
+                 position_embeddings=position_embeddings,
+                 **kwargs,
+             )
+         return self.norm(hidden_states)
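In dflash.py, `apply_rotary_pos_emb` rotates the keys with every row of `cos`/`sin` but rotates the queries with only the trailing `q_len` rows, because the keys cover the target context plus the draft block while the queries cover the draft block alone. The sketch below mirrors that slicing with plain Python lists. It assumes a call without cached keys, where `position_ids` spans `ctx_len + q_len` positions; `rotary_position_split` and the value of `CTX_LEN` are hypothetical and used only for illustration.

```python
def rotary_position_split(position_ids, q_len):
    """Mirror the slicing in apply_rotary_pos_emb.

    Keys use every position; queries use only the trailing q_len positions.
    """
    key_positions = list(position_ids)
    if not 0 < q_len <= len(key_positions):
        raise ValueError("q_len must be between 1 and len(position_ids)")
    query_positions = key_positions[-q_len:]
    return query_positions, key_positions


CTX_LEN = 5      # hypothetical number of target-context positions
BLOCK_SIZE = 16  # block_size from config.json

q_pos, k_pos = rotary_position_split(range(CTX_LEN + BLOCK_SIZE), BLOCK_SIZE)
print("query positions:", q_pos)
print("key positions:  ", k_pos)
```

Because `is_causal` is set to `False` in `Qwen3DFlashAttention`, every query in the block can attend to all of these key positions unless the supplied attention mask says otherwise.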
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b9675b9461ac581c0115af112fc18e7b4358e25204d9461d65224aa44409093
+ size 237778523
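The LFS pointer size can be sanity-checked against the `trellis` tensor shapes recorded in quantization_config.json. This is a back-of-the-envelope sketch, not an exact accounting: it assumes all 8 drafter layers share the layout shown for `layers.0` and `layers.1`, and it ignores the scale vectors, norm weights, and the safetensors header. The helper `trellis_bytes` and the constants are illustrative.

```python
# Trellis shapes per quantized tensor, copied from quantization_config.json.
LAYER_TRELLIS = {
    "q_proj": (128, 256, 64),
    "k_proj": (128, 32, 64),
    "v_proj": (128, 32, 64),
    "o_proj": (256, 128, 64),
    "up_proj": (128, 384, 64),
    "gate_proj": (128, 384, 64),
    "down_proj": (384, 128, 64),
}
FC_TRELLIS = (640, 128, 64)
NUM_LAYERS = 8            # num_hidden_layers from config.json
INT16_BYTES = 2           # trellis tensors are stored as torch.int16
LFS_SIZE = 237_778_523    # size from the model.safetensors pointer


def trellis_bytes(shape):
    """Bytes occupied by an int16 trellis tensor of the given shape."""
    a, b, c = shape
    return a * b * c * INT16_BYTES


per_layer = sum(trellis_bytes(s) for s in LAYER_TRELLIS.values())
total_trellis = NUM_LAYERS * per_layer + trellis_bytes(FC_TRELLIS)

# Bits per weight for fc, whose dense shape is 10240 x 2048.
fc_bits_per_weight = trellis_bytes(FC_TRELLIS) * 8 / (10240 * 2048)

print(f"trellis bytes: {total_trellis:,} of {LFS_SIZE:,}")
print(f"fc bits per weight: {fc_bits_per_weight}")
```

Under these assumptions the trellis data accounts for nearly all of the file, which is consistent with the 4-bit setting in the quantization config.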
quantization_config.json ADDED
@@ -0,0 +1,2383 @@
+ {
+   "quant_method": "exl3",
+   "version": "0.0.32",
+   "bits": 4.0,
+   "head_bits": 6,
+   "calibration": {
+     "rows": 250,
+     "cols": 2048
+   },
+   "out_scales": "always",
+   "codebook": "mcg",
+   "tensor_storage": {
+     "fc": {
+       "stored_tensors": {
+         "fc.suh": {
+           "shape": [
+             10240
+           ],
+           "n_bytes": 20480,
+           "dtype": "torch.float16"
+         },
+         "fc.svh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "fc.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "fc.trellis": {
+           "shape": [
+             640,
+             128,
+             64
+           ],
+           "n_bytes": 10485760,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "hidden_norm": {
+       "stored_tensors": {
+         "hidden_norm.weight": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.0.input_layernorm": {
+       "stored_tensors": {
+         "layers.0.input_layernorm.weight": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.0.self_attn.q_proj": {
+       "stored_tensors": {
+         "layers.0.self_attn.q_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.q_proj.svh": {
+           "shape": [
+             4096
+           ],
+           "n_bytes": 8192,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.q_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.self_attn.q_proj.trellis": {
+           "shape": [
+             128,
+             256,
+             64
+           ],
+           "n_bytes": 4194304,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.self_attn.k_proj": {
+       "stored_tensors": {
+         "layers.0.self_attn.k_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.k_proj.svh": {
+           "shape": [
+             512
+           ],
+           "n_bytes": 1024,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.k_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.self_attn.k_proj.trellis": {
+           "shape": [
+             128,
+             32,
+             64
+           ],
+           "n_bytes": 524288,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.self_attn.v_proj": {
+       "stored_tensors": {
+         "layers.0.self_attn.v_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.v_proj.svh": {
+           "shape": [
+             512
+           ],
+           "n_bytes": 1024,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.v_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.self_attn.v_proj.trellis": {
+           "shape": [
+             128,
+             32,
+             64
+           ],
+           "n_bytes": 524288,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.self_attn.o_proj": {
+       "stored_tensors": {
+         "layers.0.self_attn.o_proj.suh": {
+           "shape": [
+             4096
+           ],
+           "n_bytes": 8192,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.o_proj.svh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.self_attn.o_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.self_attn.o_proj.trellis": {
+           "shape": [
+             256,
+             128,
+             64
+           ],
+           "n_bytes": 4194304,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.self_attn.q_norm": {
+       "stored_tensors": {
+         "layers.0.self_attn.q_norm.weight": {
+           "shape": [
+             128
+           ],
+           "n_bytes": 256,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.0.self_attn.k_norm": {
+       "stored_tensors": {
+         "layers.0.self_attn.k_norm.weight": {
+           "shape": [
+             128
+           ],
+           "n_bytes": 256,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.0.post_attention_layernorm": {
+       "stored_tensors": {
+         "layers.0.post_attention_layernorm.weight": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.0.mlp.up_proj": {
+       "stored_tensors": {
+         "layers.0.mlp.up_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.up_proj.svh": {
+           "shape": [
+             6144
+           ],
+           "n_bytes": 12288,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.up_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.mlp.up_proj.trellis": {
+           "shape": [
+             128,
+             384,
+             64
+           ],
+           "n_bytes": 6291456,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.mlp.gate_proj": {
+       "stored_tensors": {
+         "layers.0.mlp.gate_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.gate_proj.svh": {
+           "shape": [
+             6144
+           ],
+           "n_bytes": 12288,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.gate_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.mlp.gate_proj.trellis": {
+           "shape": [
+             128,
+             384,
+             64
+           ],
+           "n_bytes": 6291456,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.0.mlp.down_proj": {
+       "stored_tensors": {
+         "layers.0.mlp.down_proj.suh": {
+           "shape": [
+             6144
+           ],
+           "n_bytes": 12288,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.down_proj.svh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.0.mlp.down_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.0.mlp.down_proj.trellis": {
+           "shape": [
+             384,
+             128,
+             64
+           ],
+           "n_bytes": 6291456,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.1.input_layernorm": {
+       "stored_tensors": {
+         "layers.1.input_layernorm.weight": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.1.self_attn.q_proj": {
+       "stored_tensors": {
+         "layers.1.self_attn.q_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.q_proj.svh": {
+           "shape": [
+             4096
+           ],
+           "n_bytes": 8192,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.q_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.1.self_attn.q_proj.trellis": {
+           "shape": [
+             128,
+             256,
+             64
+           ],
+           "n_bytes": 4194304,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.1.self_attn.k_proj": {
+       "stored_tensors": {
+         "layers.1.self_attn.k_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.k_proj.svh": {
+           "shape": [
+             512
+           ],
+           "n_bytes": 1024,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.k_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.1.self_attn.k_proj.trellis": {
+           "shape": [
+             128,
+             32,
+             64
+           ],
+           "n_bytes": 524288,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.1.self_attn.v_proj": {
+       "stored_tensors": {
+         "layers.1.self_attn.v_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.v_proj.svh": {
+           "shape": [
+             512
+           ],
+           "n_bytes": 1024,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.v_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.1.self_attn.v_proj.trellis": {
+           "shape": [
+             128,
+             32,
+             64
+           ],
+           "n_bytes": 524288,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.1.self_attn.o_proj": {
+       "stored_tensors": {
+         "layers.1.self_attn.o_proj.suh": {
+           "shape": [
+             4096
+           ],
+           "n_bytes": 8192,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.o_proj.svh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.1.self_attn.o_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.1.self_attn.o_proj.trellis": {
+           "shape": [
+             256,
+             128,
+             64
+           ],
+           "n_bytes": 4194304,
+           "dtype": "torch.int16"
+         }
+       },
+       "quant_format": "exl3",
+       "bits_per_weight": 4,
+       "mcg_multiplier": 3417055213
+     },
+     "layers.1.self_attn.q_norm": {
+       "stored_tensors": {
+         "layers.1.self_attn.q_norm.weight": {
+           "shape": [
+             128
+           ],
+           "n_bytes": 256,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.1.self_attn.k_norm": {
+       "stored_tensors": {
+         "layers.1.self_attn.k_norm.weight": {
+           "shape": [
+             128
+           ],
+           "n_bytes": 256,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.1.post_attention_layernorm": {
+       "stored_tensors": {
+         "layers.1.post_attention_layernorm.weight": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.bfloat16"
+         }
+       }
+     },
+     "layers.1.mlp.up_proj": {
+       "stored_tensors": {
+         "layers.1.mlp.up_proj.suh": {
+           "shape": [
+             2048
+           ],
+           "n_bytes": 4096,
+           "dtype": "torch.float16"
+         },
+         "layers.1.mlp.up_proj.svh": {
+           "shape": [
+             6144
+           ],
+           "n_bytes": 12288,
+           "dtype": "torch.float16"
+         },
+         "layers.1.mlp.up_proj.mcg": {
+           "shape": [],
+           "n_bytes": 4,
+           "dtype": "torch.int32"
+         },
+         "layers.1.mlp.up_proj.trellis": {
+           "shape": [
+             128,
+             384,
+             64
+           ],
+           "n_bytes": 6291456,
+ "dtype": "torch.int16"
561
+ }
562
+ },
563
+ "quant_format": "exl3",
564
+ "bits_per_weight": 4,
565
+ "mcg_multiplier": 3417055213
566
+ },
567
+ "layers.1.mlp.gate_proj": {
568
+ "stored_tensors": {
569
+ "layers.1.mlp.gate_proj.suh": {
570
+ "shape": [
571
+ 2048
572
+ ],
573
+ "n_bytes": 4096,
574
+ "dtype": "torch.float16"
575
+ },
576
+ "layers.1.mlp.gate_proj.svh": {
577
+ "shape": [
578
+ 6144
579
+ ],
580
+ "n_bytes": 12288,
581
+ "dtype": "torch.float16"
582
+ },
583
+ "layers.1.mlp.gate_proj.mcg": {
584
+ "shape": [],
585
+ "n_bytes": 4,
586
+ "dtype": "torch.int32"
587
+ },
588
+ "layers.1.mlp.gate_proj.trellis": {
589
+ "shape": [
590
+ 128,
591
+ 384,
592
+ 64
593
+ ],
594
+ "n_bytes": 6291456,
595
+ "dtype": "torch.int16"
596
+ }
597
+ },
598
+ "quant_format": "exl3",
599
+ "bits_per_weight": 4,
600
+ "mcg_multiplier": 3417055213
601
+ },
602
+ "layers.1.mlp.down_proj": {
603
+ "stored_tensors": {
604
+ "layers.1.mlp.down_proj.suh": {
605
+ "shape": [
606
+ 6144
607
+ ],
608
+ "n_bytes": 12288,
609
+ "dtype": "torch.float16"
610
+ },
611
+ "layers.1.mlp.down_proj.svh": {
612
+ "shape": [
613
+ 2048
614
+ ],
615
+ "n_bytes": 4096,
616
+ "dtype": "torch.float16"
617
+ },
618
+ "layers.1.mlp.down_proj.mcg": {
619
+ "shape": [],
620
+ "n_bytes": 4,
621
+ "dtype": "torch.int32"
622
+ },
623
+ "layers.1.mlp.down_proj.trellis": {
624
+ "shape": [
625
+ 384,
626
+ 128,
627
+ 64
628
+ ],
629
+ "n_bytes": 6291456,
630
+ "dtype": "torch.int16"
631
+ }
632
+ },
633
+ "quant_format": "exl3",
634
+ "bits_per_weight": 4,
635
+ "mcg_multiplier": 3417055213
636
+ },
637
+ "layers.2.input_layernorm": {
638
+ "stored_tensors": {
639
+ "layers.2.input_layernorm.weight": {
640
+ "shape": [
641
+ 2048
642
+ ],
643
+ "n_bytes": 4096,
644
+ "dtype": "torch.bfloat16"
645
+ }
646
+ }
647
+ },
648
+ "layers.2.self_attn.q_proj": {
649
+ "stored_tensors": {
650
+ "layers.2.self_attn.q_proj.suh": {
651
+ "shape": [
652
+ 2048
653
+ ],
654
+ "n_bytes": 4096,
655
+ "dtype": "torch.float16"
656
+ },
657
+ "layers.2.self_attn.q_proj.svh": {
658
+ "shape": [
659
+ 4096
660
+ ],
661
+ "n_bytes": 8192,
662
+ "dtype": "torch.float16"
663
+ },
664
+ "layers.2.self_attn.q_proj.mcg": {
665
+ "shape": [],
666
+ "n_bytes": 4,
667
+ "dtype": "torch.int32"
668
+ },
669
+ "layers.2.self_attn.q_proj.trellis": {
670
+ "shape": [
671
+ 128,
672
+ 256,
673
+ 64
674
+ ],
675
+ "n_bytes": 4194304,
676
+ "dtype": "torch.int16"
677
+ }
678
+ },
679
+ "quant_format": "exl3",
680
+ "bits_per_weight": 4,
681
+ "mcg_multiplier": 3417055213
682
+ },
683
+ "layers.2.self_attn.k_proj": {
684
+ "stored_tensors": {
685
+ "layers.2.self_attn.k_proj.suh": {
686
+ "shape": [
687
+ 2048
688
+ ],
689
+ "n_bytes": 4096,
690
+ "dtype": "torch.float16"
691
+ },
692
+ "layers.2.self_attn.k_proj.svh": {
693
+ "shape": [
694
+ 512
695
+ ],
696
+ "n_bytes": 1024,
697
+ "dtype": "torch.float16"
698
+ },
699
+ "layers.2.self_attn.k_proj.mcg": {
700
+ "shape": [],
701
+ "n_bytes": 4,
702
+ "dtype": "torch.int32"
703
+ },
704
+ "layers.2.self_attn.k_proj.trellis": {
705
+ "shape": [
706
+ 128,
707
+ 32,
708
+ 64
709
+ ],
710
+ "n_bytes": 524288,
711
+ "dtype": "torch.int16"
712
+ }
713
+ },
714
+ "quant_format": "exl3",
715
+ "bits_per_weight": 4,
716
+ "mcg_multiplier": 3417055213
717
+ },
718
+ "layers.2.self_attn.v_proj": {
719
+ "stored_tensors": {
720
+ "layers.2.self_attn.v_proj.suh": {
721
+ "shape": [
722
+ 2048
723
+ ],
724
+ "n_bytes": 4096,
725
+ "dtype": "torch.float16"
726
+ },
727
+ "layers.2.self_attn.v_proj.svh": {
728
+ "shape": [
729
+ 512
730
+ ],
731
+ "n_bytes": 1024,
732
+ "dtype": "torch.float16"
733
+ },
734
+ "layers.2.self_attn.v_proj.mcg": {
735
+ "shape": [],
736
+ "n_bytes": 4,
737
+ "dtype": "torch.int32"
738
+ },
739
+ "layers.2.self_attn.v_proj.trellis": {
740
+ "shape": [
741
+ 128,
742
+ 32,
743
+ 64
744
+ ],
745
+ "n_bytes": 524288,
746
+ "dtype": "torch.int16"
747
+ }
748
+ },
749
+ "quant_format": "exl3",
750
+ "bits_per_weight": 4,
751
+ "mcg_multiplier": 3417055213
752
+ },
753
+ "layers.2.self_attn.o_proj": {
754
+ "stored_tensors": {
755
+ "layers.2.self_attn.o_proj.suh": {
756
+ "shape": [
757
+ 4096
758
+ ],
759
+ "n_bytes": 8192,
760
+ "dtype": "torch.float16"
761
+ },
762
+ "layers.2.self_attn.o_proj.svh": {
763
+ "shape": [
764
+ 2048
765
+ ],
766
+ "n_bytes": 4096,
767
+ "dtype": "torch.float16"
768
+ },
769
+ "layers.2.self_attn.o_proj.mcg": {
770
+ "shape": [],
771
+ "n_bytes": 4,
772
+ "dtype": "torch.int32"
773
+ },
774
+ "layers.2.self_attn.o_proj.trellis": {
775
+ "shape": [
776
+ 256,
777
+ 128,
778
+ 64
779
+ ],
780
+ "n_bytes": 4194304,
781
+ "dtype": "torch.int16"
782
+ }
783
+ },
784
+ "quant_format": "exl3",
785
+ "bits_per_weight": 4,
786
+ "mcg_multiplier": 3417055213
787
+ },
788
+ "layers.2.self_attn.q_norm": {
789
+ "stored_tensors": {
790
+ "layers.2.self_attn.q_norm.weight": {
791
+ "shape": [
792
+ 128
793
+ ],
794
+ "n_bytes": 256,
795
+ "dtype": "torch.bfloat16"
796
+ }
797
+ }
798
+ },
799
+ "layers.2.self_attn.k_norm": {
800
+ "stored_tensors": {
801
+ "layers.2.self_attn.k_norm.weight": {
802
+ "shape": [
803
+ 128
804
+ ],
805
+ "n_bytes": 256,
806
+ "dtype": "torch.bfloat16"
807
+ }
808
+ }
809
+ },
810
+ "layers.2.post_attention_layernorm": {
811
+ "stored_tensors": {
812
+ "layers.2.post_attention_layernorm.weight": {
813
+ "shape": [
814
+ 2048
815
+ ],
816
+ "n_bytes": 4096,
817
+ "dtype": "torch.bfloat16"
818
+ }
819
+ }
820
+ },
821
+ "layers.2.mlp.up_proj": {
822
+ "stored_tensors": {
823
+ "layers.2.mlp.up_proj.suh": {
824
+ "shape": [
825
+ 2048
826
+ ],
827
+ "n_bytes": 4096,
828
+ "dtype": "torch.float16"
829
+ },
830
+ "layers.2.mlp.up_proj.svh": {
831
+ "shape": [
832
+ 6144
833
+ ],
834
+ "n_bytes": 12288,
835
+ "dtype": "torch.float16"
836
+ },
837
+ "layers.2.mlp.up_proj.mcg": {
838
+ "shape": [],
839
+ "n_bytes": 4,
840
+ "dtype": "torch.int32"
841
+ },
842
+ "layers.2.mlp.up_proj.trellis": {
843
+ "shape": [
844
+ 128,
845
+ 384,
846
+ 64
847
+ ],
848
+ "n_bytes": 6291456,
849
+ "dtype": "torch.int16"
850
+ }
851
+ },
852
+ "quant_format": "exl3",
853
+ "bits_per_weight": 4,
854
+ "mcg_multiplier": 3417055213
855
+ },
856
+ "layers.2.mlp.gate_proj": {
857
+ "stored_tensors": {
858
+ "layers.2.mlp.gate_proj.suh": {
859
+ "shape": [
860
+ 2048
861
+ ],
862
+ "n_bytes": 4096,
863
+ "dtype": "torch.float16"
864
+ },
865
+ "layers.2.mlp.gate_proj.svh": {
866
+ "shape": [
867
+ 6144
868
+ ],
869
+ "n_bytes": 12288,
870
+ "dtype": "torch.float16"
871
+ },
872
+ "layers.2.mlp.gate_proj.mcg": {
873
+ "shape": [],
874
+ "n_bytes": 4,
875
+ "dtype": "torch.int32"
876
+ },
877
+ "layers.2.mlp.gate_proj.trellis": {
878
+ "shape": [
879
+ 128,
880
+ 384,
881
+ 64
882
+ ],
883
+ "n_bytes": 6291456,
884
+ "dtype": "torch.int16"
885
+ }
886
+ },
887
+ "quant_format": "exl3",
888
+ "bits_per_weight": 4,
889
+ "mcg_multiplier": 3417055213
890
+ },
891
+ "layers.2.mlp.down_proj": {
892
+ "stored_tensors": {
893
+ "layers.2.mlp.down_proj.suh": {
894
+ "shape": [
895
+ 6144
896
+ ],
897
+ "n_bytes": 12288,
898
+ "dtype": "torch.float16"
899
+ },
900
+ "layers.2.mlp.down_proj.svh": {
901
+ "shape": [
902
+ 2048
903
+ ],
904
+ "n_bytes": 4096,
905
+ "dtype": "torch.float16"
906
+ },
907
+ "layers.2.mlp.down_proj.mcg": {
908
+ "shape": [],
909
+ "n_bytes": 4,
910
+ "dtype": "torch.int32"
911
+ },
912
+ "layers.2.mlp.down_proj.trellis": {
913
+ "shape": [
914
+ 384,
915
+ 128,
916
+ 64
917
+ ],
918
+ "n_bytes": 6291456,
919
+ "dtype": "torch.int16"
920
+ }
921
+ },
922
+ "quant_format": "exl3",
923
+ "bits_per_weight": 4,
924
+ "mcg_multiplier": 3417055213
925
+ },
926
+ "layers.3.input_layernorm": {
927
+ "stored_tensors": {
928
+ "layers.3.input_layernorm.weight": {
929
+ "shape": [
930
+ 2048
931
+ ],
932
+ "n_bytes": 4096,
933
+ "dtype": "torch.bfloat16"
934
+ }
935
+ }
936
+ },
937
+ "layers.3.self_attn.q_proj": {
938
+ "stored_tensors": {
939
+ "layers.3.self_attn.q_proj.suh": {
940
+ "shape": [
941
+ 2048
942
+ ],
943
+ "n_bytes": 4096,
944
+ "dtype": "torch.float16"
945
+ },
946
+ "layers.3.self_attn.q_proj.svh": {
947
+ "shape": [
948
+ 4096
949
+ ],
950
+ "n_bytes": 8192,
951
+ "dtype": "torch.float16"
952
+ },
953
+ "layers.3.self_attn.q_proj.mcg": {
954
+ "shape": [],
955
+ "n_bytes": 4,
956
+ "dtype": "torch.int32"
957
+ },
958
+ "layers.3.self_attn.q_proj.trellis": {
959
+ "shape": [
960
+ 128,
961
+ 256,
962
+ 64
963
+ ],
964
+ "n_bytes": 4194304,
965
+ "dtype": "torch.int16"
966
+ }
967
+ },
968
+ "quant_format": "exl3",
969
+ "bits_per_weight": 4,
970
+ "mcg_multiplier": 3417055213
971
+ },
972
+ "layers.3.self_attn.k_proj": {
973
+ "stored_tensors": {
974
+ "layers.3.self_attn.k_proj.suh": {
975
+ "shape": [
976
+ 2048
977
+ ],
978
+ "n_bytes": 4096,
979
+ "dtype": "torch.float16"
980
+ },
981
+ "layers.3.self_attn.k_proj.svh": {
982
+ "shape": [
983
+ 512
984
+ ],
985
+ "n_bytes": 1024,
986
+ "dtype": "torch.float16"
987
+ },
988
+ "layers.3.self_attn.k_proj.mcg": {
989
+ "shape": [],
990
+ "n_bytes": 4,
991
+ "dtype": "torch.int32"
992
+ },
993
+ "layers.3.self_attn.k_proj.trellis": {
994
+ "shape": [
995
+ 128,
996
+ 32,
997
+ 64
998
+ ],
999
+ "n_bytes": 524288,
1000
+ "dtype": "torch.int16"
1001
+ }
1002
+ },
1003
+ "quant_format": "exl3",
1004
+ "bits_per_weight": 4,
1005
+ "mcg_multiplier": 3417055213
1006
+ },
1007
+ "layers.3.self_attn.v_proj": {
1008
+ "stored_tensors": {
1009
+ "layers.3.self_attn.v_proj.suh": {
1010
+ "shape": [
1011
+ 2048
1012
+ ],
1013
+ "n_bytes": 4096,
1014
+ "dtype": "torch.float16"
1015
+ },
1016
+ "layers.3.self_attn.v_proj.svh": {
1017
+ "shape": [
1018
+ 512
1019
+ ],
1020
+ "n_bytes": 1024,
1021
+ "dtype": "torch.float16"
1022
+ },
1023
+ "layers.3.self_attn.v_proj.mcg": {
1024
+ "shape": [],
1025
+ "n_bytes": 4,
1026
+ "dtype": "torch.int32"
1027
+ },
1028
+ "layers.3.self_attn.v_proj.trellis": {
1029
+ "shape": [
1030
+ 128,
1031
+ 32,
1032
+ 64
1033
+ ],
1034
+ "n_bytes": 524288,
1035
+ "dtype": "torch.int16"
1036
+ }
1037
+ },
1038
+ "quant_format": "exl3",
1039
+ "bits_per_weight": 4,
1040
+ "mcg_multiplier": 3417055213
1041
+ },
1042
+ "layers.3.self_attn.o_proj": {
1043
+ "stored_tensors": {
1044
+ "layers.3.self_attn.o_proj.suh": {
1045
+ "shape": [
1046
+ 4096
1047
+ ],
1048
+ "n_bytes": 8192,
1049
+ "dtype": "torch.float16"
1050
+ },
1051
+ "layers.3.self_attn.o_proj.svh": {
1052
+ "shape": [
1053
+ 2048
1054
+ ],
1055
+ "n_bytes": 4096,
1056
+ "dtype": "torch.float16"
1057
+ },
1058
+ "layers.3.self_attn.o_proj.mcg": {
1059
+ "shape": [],
1060
+ "n_bytes": 4,
1061
+ "dtype": "torch.int32"
1062
+ },
1063
+ "layers.3.self_attn.o_proj.trellis": {
1064
+ "shape": [
1065
+ 256,
1066
+ 128,
1067
+ 64
1068
+ ],
1069
+ "n_bytes": 4194304,
1070
+ "dtype": "torch.int16"
1071
+ }
1072
+ },
1073
+ "quant_format": "exl3",
1074
+ "bits_per_weight": 4,
1075
+ "mcg_multiplier": 3417055213
1076
+ },
1077
+ "layers.3.self_attn.q_norm": {
1078
+ "stored_tensors": {
1079
+ "layers.3.self_attn.q_norm.weight": {
1080
+ "shape": [
1081
+ 128
1082
+ ],
1083
+ "n_bytes": 256,
1084
+ "dtype": "torch.bfloat16"
1085
+ }
1086
+ }
1087
+ },
1088
+ "layers.3.self_attn.k_norm": {
1089
+ "stored_tensors": {
1090
+ "layers.3.self_attn.k_norm.weight": {
1091
+ "shape": [
1092
+ 128
1093
+ ],
1094
+ "n_bytes": 256,
1095
+ "dtype": "torch.bfloat16"
1096
+ }
1097
+ }
1098
+ },
1099
+ "layers.3.post_attention_layernorm": {
1100
+ "stored_tensors": {
1101
+ "layers.3.post_attention_layernorm.weight": {
1102
+ "shape": [
1103
+ 2048
1104
+ ],
1105
+ "n_bytes": 4096,
1106
+ "dtype": "torch.bfloat16"
1107
+ }
1108
+ }
1109
+ },
1110
+ "layers.3.mlp.up_proj": {
1111
+ "stored_tensors": {
1112
+ "layers.3.mlp.up_proj.suh": {
1113
+ "shape": [
1114
+ 2048
1115
+ ],
1116
+ "n_bytes": 4096,
1117
+ "dtype": "torch.float16"
1118
+ },
1119
+ "layers.3.mlp.up_proj.svh": {
1120
+ "shape": [
1121
+ 6144
1122
+ ],
1123
+ "n_bytes": 12288,
1124
+ "dtype": "torch.float16"
1125
+ },
1126
+ "layers.3.mlp.up_proj.mcg": {
1127
+ "shape": [],
1128
+ "n_bytes": 4,
1129
+ "dtype": "torch.int32"
1130
+ },
1131
+ "layers.3.mlp.up_proj.trellis": {
1132
+ "shape": [
1133
+ 128,
1134
+ 384,
1135
+ 64
1136
+ ],
1137
+ "n_bytes": 6291456,
1138
+ "dtype": "torch.int16"
1139
+ }
1140
+ },
1141
+ "quant_format": "exl3",
1142
+ "bits_per_weight": 4,
1143
+ "mcg_multiplier": 3417055213
1144
+ },
1145
+ "layers.3.mlp.gate_proj": {
1146
+ "stored_tensors": {
1147
+ "layers.3.mlp.gate_proj.suh": {
1148
+ "shape": [
1149
+ 2048
1150
+ ],
1151
+ "n_bytes": 4096,
1152
+ "dtype": "torch.float16"
1153
+ },
1154
+ "layers.3.mlp.gate_proj.svh": {
1155
+ "shape": [
1156
+ 6144
1157
+ ],
1158
+ "n_bytes": 12288,
1159
+ "dtype": "torch.float16"
1160
+ },
1161
+ "layers.3.mlp.gate_proj.mcg": {
1162
+ "shape": [],
1163
+ "n_bytes": 4,
1164
+ "dtype": "torch.int32"
1165
+ },
1166
+ "layers.3.mlp.gate_proj.trellis": {
1167
+ "shape": [
1168
+ 128,
1169
+ 384,
1170
+ 64
1171
+ ],
1172
+ "n_bytes": 6291456,
1173
+ "dtype": "torch.int16"
1174
+ }
1175
+ },
1176
+ "quant_format": "exl3",
1177
+ "bits_per_weight": 4,
1178
+ "mcg_multiplier": 3417055213
1179
+ },
1180
+ "layers.3.mlp.down_proj": {
1181
+ "stored_tensors": {
1182
+ "layers.3.mlp.down_proj.suh": {
1183
+ "shape": [
1184
+ 6144
1185
+ ],
1186
+ "n_bytes": 12288,
1187
+ "dtype": "torch.float16"
1188
+ },
1189
+ "layers.3.mlp.down_proj.svh": {
1190
+ "shape": [
1191
+ 2048
1192
+ ],
1193
+ "n_bytes": 4096,
1194
+ "dtype": "torch.float16"
1195
+ },
1196
+ "layers.3.mlp.down_proj.mcg": {
1197
+ "shape": [],
1198
+ "n_bytes": 4,
1199
+ "dtype": "torch.int32"
1200
+ },
1201
+ "layers.3.mlp.down_proj.trellis": {
1202
+ "shape": [
1203
+ 384,
1204
+ 128,
1205
+ 64
1206
+ ],
1207
+ "n_bytes": 6291456,
1208
+ "dtype": "torch.int16"
1209
+ }
1210
+ },
1211
+ "quant_format": "exl3",
1212
+ "bits_per_weight": 4,
1213
+ "mcg_multiplier": 3417055213
1214
+ },
1215
+ "layers.4.input_layernorm": {
1216
+ "stored_tensors": {
1217
+ "layers.4.input_layernorm.weight": {
1218
+ "shape": [
1219
+ 2048
1220
+ ],
1221
+ "n_bytes": 4096,
1222
+ "dtype": "torch.bfloat16"
1223
+ }
1224
+ }
1225
+ },
1226
+ "layers.4.self_attn.q_proj": {
1227
+ "stored_tensors": {
1228
+ "layers.4.self_attn.q_proj.suh": {
1229
+ "shape": [
1230
+ 2048
1231
+ ],
1232
+ "n_bytes": 4096,
1233
+ "dtype": "torch.float16"
1234
+ },
1235
+ "layers.4.self_attn.q_proj.svh": {
1236
+ "shape": [
1237
+ 4096
1238
+ ],
1239
+ "n_bytes": 8192,
1240
+ "dtype": "torch.float16"
1241
+ },
1242
+ "layers.4.self_attn.q_proj.mcg": {
1243
+ "shape": [],
1244
+ "n_bytes": 4,
1245
+ "dtype": "torch.int32"
1246
+ },
1247
+ "layers.4.self_attn.q_proj.trellis": {
1248
+ "shape": [
1249
+ 128,
1250
+ 256,
1251
+ 64
1252
+ ],
1253
+ "n_bytes": 4194304,
1254
+ "dtype": "torch.int16"
1255
+ }
1256
+ },
1257
+ "quant_format": "exl3",
1258
+ "bits_per_weight": 4,
1259
+ "mcg_multiplier": 3417055213
1260
+ },
1261
+ "layers.4.self_attn.k_proj": {
1262
+ "stored_tensors": {
1263
+ "layers.4.self_attn.k_proj.suh": {
1264
+ "shape": [
1265
+ 2048
1266
+ ],
1267
+ "n_bytes": 4096,
1268
+ "dtype": "torch.float16"
1269
+ },
1270
+ "layers.4.self_attn.k_proj.svh": {
1271
+ "shape": [
1272
+ 512
1273
+ ],
1274
+ "n_bytes": 1024,
1275
+ "dtype": "torch.float16"
1276
+ },
1277
+ "layers.4.self_attn.k_proj.mcg": {
1278
+ "shape": [],
1279
+ "n_bytes": 4,
1280
+ "dtype": "torch.int32"
1281
+ },
1282
+ "layers.4.self_attn.k_proj.trellis": {
1283
+ "shape": [
1284
+ 128,
1285
+ 32,
1286
+ 64
1287
+ ],
1288
+ "n_bytes": 524288,
1289
+ "dtype": "torch.int16"
1290
+ }
1291
+ },
1292
+ "quant_format": "exl3",
1293
+ "bits_per_weight": 4,
1294
+ "mcg_multiplier": 3417055213
1295
+ },
1296
+ "layers.4.self_attn.v_proj": {
1297
+ "stored_tensors": {
1298
+ "layers.4.self_attn.v_proj.suh": {
1299
+ "shape": [
1300
+ 2048
1301
+ ],
1302
+ "n_bytes": 4096,
1303
+ "dtype": "torch.float16"
1304
+ },
1305
+ "layers.4.self_attn.v_proj.svh": {
1306
+ "shape": [
1307
+ 512
1308
+ ],
1309
+ "n_bytes": 1024,
1310
+ "dtype": "torch.float16"
1311
+ },
1312
+ "layers.4.self_attn.v_proj.mcg": {
1313
+ "shape": [],
1314
+ "n_bytes": 4,
1315
+ "dtype": "torch.int32"
1316
+ },
1317
+ "layers.4.self_attn.v_proj.trellis": {
1318
+ "shape": [
1319
+ 128,
1320
+ 32,
1321
+ 64
1322
+ ],
1323
+ "n_bytes": 524288,
1324
+ "dtype": "torch.int16"
1325
+ }
1326
+ },
1327
+ "quant_format": "exl3",
1328
+ "bits_per_weight": 4,
1329
+ "mcg_multiplier": 3417055213
1330
+ },
1331
+ "layers.4.self_attn.o_proj": {
1332
+ "stored_tensors": {
1333
+ "layers.4.self_attn.o_proj.suh": {
1334
+ "shape": [
1335
+ 4096
1336
+ ],
1337
+ "n_bytes": 8192,
1338
+ "dtype": "torch.float16"
1339
+ },
1340
+ "layers.4.self_attn.o_proj.svh": {
1341
+ "shape": [
1342
+ 2048
1343
+ ],
1344
+ "n_bytes": 4096,
1345
+ "dtype": "torch.float16"
1346
+ },
1347
+ "layers.4.self_attn.o_proj.mcg": {
1348
+ "shape": [],
1349
+ "n_bytes": 4,
1350
+ "dtype": "torch.int32"
1351
+ },
1352
+ "layers.4.self_attn.o_proj.trellis": {
1353
+ "shape": [
1354
+ 256,
1355
+ 128,
1356
+ 64
1357
+ ],
1358
+ "n_bytes": 4194304,
1359
+ "dtype": "torch.int16"
1360
+ }
1361
+ },
1362
+ "quant_format": "exl3",
1363
+ "bits_per_weight": 4,
1364
+ "mcg_multiplier": 3417055213
1365
+ },
1366
+ "layers.4.self_attn.q_norm": {
1367
+ "stored_tensors": {
1368
+ "layers.4.self_attn.q_norm.weight": {
1369
+ "shape": [
1370
+ 128
1371
+ ],
1372
+ "n_bytes": 256,
1373
+ "dtype": "torch.bfloat16"
1374
+ }
1375
+ }
1376
+ },
1377
+ "layers.4.self_attn.k_norm": {
1378
+ "stored_tensors": {
1379
+ "layers.4.self_attn.k_norm.weight": {
1380
+ "shape": [
1381
+ 128
1382
+ ],
1383
+ "n_bytes": 256,
1384
+ "dtype": "torch.bfloat16"
1385
+ }
1386
+ }
1387
+ },
1388
+ "layers.4.post_attention_layernorm": {
1389
+ "stored_tensors": {
1390
+ "layers.4.post_attention_layernorm.weight": {
1391
+ "shape": [
1392
+ 2048
1393
+ ],
1394
+ "n_bytes": 4096,
1395
+ "dtype": "torch.bfloat16"
1396
+ }
1397
+ }
1398
+ },
1399
+ "layers.4.mlp.up_proj": {
1400
+ "stored_tensors": {
1401
+ "layers.4.mlp.up_proj.suh": {
1402
+ "shape": [
1403
+ 2048
1404
+ ],
1405
+ "n_bytes": 4096,
1406
+ "dtype": "torch.float16"
1407
+ },
1408
+ "layers.4.mlp.up_proj.svh": {
1409
+ "shape": [
1410
+ 6144
1411
+ ],
1412
+ "n_bytes": 12288,
1413
+ "dtype": "torch.float16"
1414
+ },
1415
+ "layers.4.mlp.up_proj.mcg": {
1416
+ "shape": [],
1417
+ "n_bytes": 4,
1418
+ "dtype": "torch.int32"
1419
+ },
1420
+ "layers.4.mlp.up_proj.trellis": {
1421
+ "shape": [
1422
+ 128,
1423
+ 384,
1424
+ 64
1425
+ ],
1426
+ "n_bytes": 6291456,
1427
+ "dtype": "torch.int16"
1428
+ }
1429
+ },
1430
+ "quant_format": "exl3",
1431
+ "bits_per_weight": 4,
1432
+ "mcg_multiplier": 3417055213
1433
+ },
1434
+ "layers.4.mlp.gate_proj": {
1435
+ "stored_tensors": {
1436
+ "layers.4.mlp.gate_proj.suh": {
1437
+ "shape": [
1438
+ 2048
1439
+ ],
1440
+ "n_bytes": 4096,
1441
+ "dtype": "torch.float16"
1442
+ },
1443
+ "layers.4.mlp.gate_proj.svh": {
1444
+ "shape": [
1445
+ 6144
1446
+ ],
1447
+ "n_bytes": 12288,
1448
+ "dtype": "torch.float16"
1449
+ },
1450
+ "layers.4.mlp.gate_proj.mcg": {
1451
+ "shape": [],
1452
+ "n_bytes": 4,
1453
+ "dtype": "torch.int32"
1454
+ },
1455
+ "layers.4.mlp.gate_proj.trellis": {
1456
+ "shape": [
1457
+ 128,
1458
+ 384,
1459
+ 64
1460
+ ],
1461
+ "n_bytes": 6291456,
1462
+ "dtype": "torch.int16"
1463
+ }
1464
+ },
1465
+ "quant_format": "exl3",
1466
+ "bits_per_weight": 4,
1467
+ "mcg_multiplier": 3417055213
1468
+ },
1469
+ "layers.4.mlp.down_proj": {
1470
+ "stored_tensors": {
1471
+ "layers.4.mlp.down_proj.suh": {
1472
+ "shape": [
1473
+ 6144
1474
+ ],
1475
+ "n_bytes": 12288,
1476
+ "dtype": "torch.float16"
1477
+ },
1478
+ "layers.4.mlp.down_proj.svh": {
1479
+ "shape": [
1480
+ 2048
1481
+ ],
1482
+ "n_bytes": 4096,
1483
+ "dtype": "torch.float16"
1484
+ },
1485
+ "layers.4.mlp.down_proj.mcg": {
1486
+ "shape": [],
1487
+ "n_bytes": 4,
1488
+ "dtype": "torch.int32"
1489
+ },
1490
+ "layers.4.mlp.down_proj.trellis": {
1491
+ "shape": [
1492
+ 384,
1493
+ 128,
1494
+ 64
1495
+ ],
1496
+ "n_bytes": 6291456,
1497
+ "dtype": "torch.int16"
1498
+ }
1499
+ },
1500
+ "quant_format": "exl3",
1501
+ "bits_per_weight": 4,
1502
+ "mcg_multiplier": 3417055213
1503
+ },
1504
+ "layers.5.input_layernorm": {
1505
+ "stored_tensors": {
1506
+ "layers.5.input_layernorm.weight": {
1507
+ "shape": [
1508
+ 2048
1509
+ ],
1510
+ "n_bytes": 4096,
1511
+ "dtype": "torch.bfloat16"
1512
+ }
1513
+ }
1514
+ },
1515
+ "layers.5.self_attn.q_proj": {
1516
+ "stored_tensors": {
1517
+ "layers.5.self_attn.q_proj.suh": {
1518
+ "shape": [
1519
+ 2048
1520
+ ],
1521
+ "n_bytes": 4096,
1522
+ "dtype": "torch.float16"
1523
+ },
1524
+ "layers.5.self_attn.q_proj.svh": {
1525
+ "shape": [
1526
+ 4096
1527
+ ],
1528
+ "n_bytes": 8192,
1529
+ "dtype": "torch.float16"
1530
+ },
1531
+ "layers.5.self_attn.q_proj.mcg": {
1532
+ "shape": [],
1533
+ "n_bytes": 4,
1534
+ "dtype": "torch.int32"
1535
+ },
1536
+ "layers.5.self_attn.q_proj.trellis": {
1537
+ "shape": [
1538
+ 128,
1539
+ 256,
1540
+ 64
1541
+ ],
1542
+ "n_bytes": 4194304,
1543
+ "dtype": "torch.int16"
1544
+ }
1545
+ },
1546
+ "quant_format": "exl3",
1547
+ "bits_per_weight": 4,
1548
+ "mcg_multiplier": 3417055213
1549
+ },
1550
+ "layers.5.self_attn.k_proj": {
1551
+ "stored_tensors": {
1552
+ "layers.5.self_attn.k_proj.suh": {
1553
+ "shape": [
1554
+ 2048
1555
+ ],
1556
+ "n_bytes": 4096,
1557
+ "dtype": "torch.float16"
1558
+ },
1559
+ "layers.5.self_attn.k_proj.svh": {
1560
+ "shape": [
1561
+ 512
1562
+ ],
1563
+ "n_bytes": 1024,
1564
+ "dtype": "torch.float16"
1565
+ },
1566
+ "layers.5.self_attn.k_proj.mcg": {
1567
+ "shape": [],
1568
+ "n_bytes": 4,
1569
+ "dtype": "torch.int32"
1570
+ },
1571
+ "layers.5.self_attn.k_proj.trellis": {
1572
+ "shape": [
1573
+ 128,
1574
+ 32,
1575
+ 64
1576
+ ],
1577
+ "n_bytes": 524288,
1578
+ "dtype": "torch.int16"
1579
+ }
1580
+ },
1581
+ "quant_format": "exl3",
1582
+ "bits_per_weight": 4,
1583
+ "mcg_multiplier": 3417055213
1584
+ },
1585
+ "layers.5.self_attn.v_proj": {
1586
+ "stored_tensors": {
1587
+ "layers.5.self_attn.v_proj.suh": {
1588
+ "shape": [
1589
+ 2048
1590
+ ],
1591
+ "n_bytes": 4096,
1592
+ "dtype": "torch.float16"
1593
+ },
1594
+ "layers.5.self_attn.v_proj.svh": {
1595
+ "shape": [
1596
+ 512
1597
+ ],
1598
+ "n_bytes": 1024,
1599
+ "dtype": "torch.float16"
1600
+ },
1601
+ "layers.5.self_attn.v_proj.mcg": {
1602
+ "shape": [],
1603
+ "n_bytes": 4,
1604
+ "dtype": "torch.int32"
1605
+ },
1606
+ "layers.5.self_attn.v_proj.trellis": {
1607
+ "shape": [
1608
+ 128,
1609
+ 32,
1610
+ 64
1611
+ ],
1612
+ "n_bytes": 524288,
1613
+ "dtype": "torch.int16"
1614
+ }
1615
+ },
1616
+ "quant_format": "exl3",
1617
+ "bits_per_weight": 4,
1618
+ "mcg_multiplier": 3417055213
1619
+ },
1620
+ "layers.5.self_attn.o_proj": {
1621
+ "stored_tensors": {
1622
+ "layers.5.self_attn.o_proj.suh": {
1623
+ "shape": [
1624
+ 4096
1625
+ ],
1626
+ "n_bytes": 8192,
1627
+ "dtype": "torch.float16"
1628
+ },
1629
+ "layers.5.self_attn.o_proj.svh": {
1630
+ "shape": [
1631
+ 2048
1632
+ ],
1633
+ "n_bytes": 4096,
1634
+ "dtype": "torch.float16"
1635
+ },
1636
+ "layers.5.self_attn.o_proj.mcg": {
1637
+ "shape": [],
1638
+ "n_bytes": 4,
1639
+ "dtype": "torch.int32"
1640
+ },
1641
+ "layers.5.self_attn.o_proj.trellis": {
1642
+ "shape": [
1643
+ 256,
1644
+ 128,
1645
+ 64
1646
+ ],
1647
+ "n_bytes": 4194304,
1648
+ "dtype": "torch.int16"
1649
+ }
1650
+ },
1651
+ "quant_format": "exl3",
1652
+ "bits_per_weight": 4,
1653
+ "mcg_multiplier": 3417055213
1654
+ },
1655
+ "layers.5.self_attn.q_norm": {
1656
+ "stored_tensors": {
1657
+ "layers.5.self_attn.q_norm.weight": {
1658
+ "shape": [
1659
+ 128
1660
+ ],
1661
+ "n_bytes": 256,
1662
+ "dtype": "torch.bfloat16"
1663
+ }
1664
+ }
1665
+ },
1666
+ "layers.5.self_attn.k_norm": {
1667
+ "stored_tensors": {
1668
+ "layers.5.self_attn.k_norm.weight": {
1669
+ "shape": [
1670
+ 128
1671
+ ],
1672
+ "n_bytes": 256,
1673
+ "dtype": "torch.bfloat16"
1674
+ }
1675
+ }
1676
+ },
1677
+ "layers.5.post_attention_layernorm": {
1678
+ "stored_tensors": {
1679
+ "layers.5.post_attention_layernorm.weight": {
1680
+ "shape": [
1681
+ 2048
1682
+ ],
1683
+ "n_bytes": 4096,
1684
+ "dtype": "torch.bfloat16"
1685
+ }
1686
+ }
1687
+ },
1688
+ "layers.5.mlp.up_proj": {
1689
+ "stored_tensors": {
1690
+ "layers.5.mlp.up_proj.suh": {
1691
+ "shape": [
1692
+ 2048
1693
+ ],
1694
+ "n_bytes": 4096,
1695
+ "dtype": "torch.float16"
1696
+ },
1697
+ "layers.5.mlp.up_proj.svh": {
1698
+ "shape": [
1699
+ 6144
1700
+ ],
1701
+ "n_bytes": 12288,
1702
+ "dtype": "torch.float16"
1703
+ },
1704
+ "layers.5.mlp.up_proj.mcg": {
1705
+ "shape": [],
1706
+ "n_bytes": 4,
1707
+ "dtype": "torch.int32"
1708
+ },
1709
+ "layers.5.mlp.up_proj.trellis": {
1710
+ "shape": [
1711
+ 128,
1712
+ 384,
1713
+ 64
1714
+ ],
1715
+ "n_bytes": 6291456,
1716
+ "dtype": "torch.int16"
1717
+ }
1718
+ },
1719
+ "quant_format": "exl3",
1720
+ "bits_per_weight": 4,
1721
+ "mcg_multiplier": 3417055213
1722
+ },
1723
+ "layers.5.mlp.gate_proj": {
1724
+ "stored_tensors": {
1725
+ "layers.5.mlp.gate_proj.suh": {
1726
+ "shape": [
1727
+ 2048
1728
+ ],
1729
+ "n_bytes": 4096,
1730
+ "dtype": "torch.float16"
1731
+ },
1732
+ "layers.5.mlp.gate_proj.svh": {
1733
+ "shape": [
1734
+ 6144
1735
+ ],
1736
+ "n_bytes": 12288,
1737
+ "dtype": "torch.float16"
1738
+ },
1739
+ "layers.5.mlp.gate_proj.mcg": {
1740
+ "shape": [],
1741
+ "n_bytes": 4,
1742
+ "dtype": "torch.int32"
1743
+ },
1744
+ "layers.5.mlp.gate_proj.trellis": {
1745
+ "shape": [
1746
+ 128,
1747
+ 384,
1748
+ 64
1749
+ ],
1750
+ "n_bytes": 6291456,
1751
+ "dtype": "torch.int16"
1752
+ }
1753
+ },
1754
+ "quant_format": "exl3",
1755
+ "bits_per_weight": 4,
1756
+ "mcg_multiplier": 3417055213
1757
+ },
1758
+ "layers.5.mlp.down_proj": {
1759
+ "stored_tensors": {
1760
+ "layers.5.mlp.down_proj.suh": {
1761
+ "shape": [
1762
+ 6144
1763
+ ],
1764
+ "n_bytes": 12288,
1765
+ "dtype": "torch.float16"
1766
+ },
1767
+ "layers.5.mlp.down_proj.svh": {
1768
+ "shape": [
1769
+ 2048
1770
+ ],
1771
+ "n_bytes": 4096,
1772
+ "dtype": "torch.float16"
1773
+ },
1774
+ "layers.5.mlp.down_proj.mcg": {
1775
+ "shape": [],
1776
+ "n_bytes": 4,
1777
+ "dtype": "torch.int32"
1778
+ },
1779
+ "layers.5.mlp.down_proj.trellis": {
1780
+ "shape": [
1781
+ 384,
1782
+ 128,
1783
+ 64
1784
+ ],
1785
+ "n_bytes": 6291456,
1786
+ "dtype": "torch.int16"
1787
+ }
1788
+ },
1789
+ "quant_format": "exl3",
1790
+ "bits_per_weight": 4,
1791
+ "mcg_multiplier": 3417055213
1792
+ },
1793
+ "layers.6.input_layernorm": {
1794
+ "stored_tensors": {
1795
+ "layers.6.input_layernorm.weight": {
1796
+ "shape": [
1797
+ 2048
1798
+ ],
1799
+ "n_bytes": 4096,
1800
+ "dtype": "torch.bfloat16"
1801
+ }
1802
+ }
1803
+ },
1804
+ "layers.6.self_attn.q_proj": {
1805
+ "stored_tensors": {
1806
+ "layers.6.self_attn.q_proj.suh": {
1807
+ "shape": [
1808
+ 2048
1809
+ ],
1810
+ "n_bytes": 4096,
1811
+ "dtype": "torch.float16"
1812
+ },
1813
+ "layers.6.self_attn.q_proj.svh": {
1814
+ "shape": [
1815
+ 4096
1816
+ ],
1817
+ "n_bytes": 8192,
1818
+ "dtype": "torch.float16"
1819
+ },
1820
+ "layers.6.self_attn.q_proj.mcg": {
1821
+ "shape": [],
1822
+ "n_bytes": 4,
1823
+ "dtype": "torch.int32"
1824
+ },
1825
+ "layers.6.self_attn.q_proj.trellis": {
1826
+ "shape": [
1827
+ 128,
1828
+ 256,
1829
+ 64
1830
+ ],
1831
+ "n_bytes": 4194304,
1832
+ "dtype": "torch.int16"
1833
+ }
1834
+ },
1835
+ "quant_format": "exl3",
1836
+ "bits_per_weight": 4,
1837
+ "mcg_multiplier": 3417055213
1838
+ },
1839
+ "layers.6.self_attn.k_proj": {
1840
+ "stored_tensors": {
1841
+ "layers.6.self_attn.k_proj.suh": {
1842
+ "shape": [
1843
+ 2048
1844
+ ],
1845
+ "n_bytes": 4096,
1846
+ "dtype": "torch.float16"
1847
+ },
1848
+ "layers.6.self_attn.k_proj.svh": {
1849
+ "shape": [
1850
+ 512
1851
+ ],
1852
+ "n_bytes": 1024,
1853
+ "dtype": "torch.float16"
1854
+ },
1855
+ "layers.6.self_attn.k_proj.mcg": {
1856
+ "shape": [],
1857
+ "n_bytes": 4,
1858
+ "dtype": "torch.int32"
1859
+ },
1860
+ "layers.6.self_attn.k_proj.trellis": {
1861
+ "shape": [
1862
+ 128,
1863
+ 32,
1864
+ 64
1865
+ ],
1866
+ "n_bytes": 524288,
1867
+ "dtype": "torch.int16"
1868
+ }
1869
+ },
1870
+ "quant_format": "exl3",
1871
+ "bits_per_weight": 4,
1872
+ "mcg_multiplier": 3417055213
1873
+ },
1874
+ "layers.6.self_attn.v_proj": {
1875
+ "stored_tensors": {
1876
+ "layers.6.self_attn.v_proj.suh": {
1877
+ "shape": [
1878
+ 2048
1879
+ ],
1880
+ "n_bytes": 4096,
1881
+ "dtype": "torch.float16"
1882
+ },
1883
+ "layers.6.self_attn.v_proj.svh": {
1884
+ "shape": [
1885
+ 512
1886
+ ],
1887
+ "n_bytes": 1024,
1888
+ "dtype": "torch.float16"
1889
+ },
1890
+ "layers.6.self_attn.v_proj.mcg": {
1891
+ "shape": [],
1892
+ "n_bytes": 4,
1893
+ "dtype": "torch.int32"
1894
+ },
1895
+ "layers.6.self_attn.v_proj.trellis": {
1896
+ "shape": [
1897
+ 128,
1898
+ 32,
1899
+ 64
1900
+ ],
1901
+ "n_bytes": 524288,
1902
+ "dtype": "torch.int16"
1903
+ }
1904
+ },
1905
+ "quant_format": "exl3",
1906
+ "bits_per_weight": 4,
1907
+ "mcg_multiplier": 3417055213
1908
+ },
1909
+ "layers.6.self_attn.o_proj": {
1910
+ "stored_tensors": {
1911
+ "layers.6.self_attn.o_proj.suh": {
1912
+ "shape": [
1913
+ 4096
1914
+ ],
1915
+ "n_bytes": 8192,
1916
+ "dtype": "torch.float16"
1917
+ },
1918
+ "layers.6.self_attn.o_proj.svh": {
1919
+ "shape": [
1920
+ 2048
1921
+ ],
1922
+ "n_bytes": 4096,
1923
+ "dtype": "torch.float16"
1924
+ },
1925
+ "layers.6.self_attn.o_proj.mcg": {
1926
+ "shape": [],
1927
+ "n_bytes": 4,
1928
+ "dtype": "torch.int32"
1929
+ },
1930
+ "layers.6.self_attn.o_proj.trellis": {
1931
+ "shape": [
1932
+ 256,
1933
+ 128,
1934
+ 64
1935
+ ],
1936
+ "n_bytes": 4194304,
1937
+ "dtype": "torch.int16"
1938
+ }
1939
+ },
1940
+ "quant_format": "exl3",
1941
+ "bits_per_weight": 4,
1942
+ "mcg_multiplier": 3417055213
1943
+ },
1944
+ "layers.6.self_attn.q_norm": {
+ "stored_tensors": {
+ "layers.6.self_attn.q_norm.weight": {
+ "shape": [
+ 128
+ ],
+ "n_bytes": 256,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.6.self_attn.k_norm": {
+ "stored_tensors": {
+ "layers.6.self_attn.k_norm.weight": {
+ "shape": [
+ 128
+ ],
+ "n_bytes": 256,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.6.post_attention_layernorm": {
+ "stored_tensors": {
+ "layers.6.post_attention_layernorm.weight": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.6.mlp.up_proj": {
+ "stored_tensors": {
+ "layers.6.mlp.up_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.up_proj.svh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.up_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.6.mlp.up_proj.trellis": {
+ "shape": [
+ 128,
+ 384,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.6.mlp.gate_proj": {
+ "stored_tensors": {
+ "layers.6.mlp.gate_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.gate_proj.svh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.gate_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.6.mlp.gate_proj.trellis": {
+ "shape": [
+ 128,
+ 384,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.6.mlp.down_proj": {
+ "stored_tensors": {
+ "layers.6.mlp.down_proj.suh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.down_proj.svh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.6.mlp.down_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.6.mlp.down_proj.trellis": {
+ "shape": [
+ 384,
+ 128,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.input_layernorm": {
+ "stored_tensors": {
+ "layers.7.input_layernorm.weight": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.7.self_attn.q_proj": {
+ "stored_tensors": {
+ "layers.7.self_attn.q_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.q_proj.svh": {
+ "shape": [
+ 4096
+ ],
+ "n_bytes": 8192,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.q_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.self_attn.q_proj.trellis": {
+ "shape": [
+ 128,
+ 256,
+ 64
+ ],
+ "n_bytes": 4194304,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.self_attn.k_proj": {
+ "stored_tensors": {
+ "layers.7.self_attn.k_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.k_proj.svh": {
+ "shape": [
+ 512
+ ],
+ "n_bytes": 1024,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.k_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.self_attn.k_proj.trellis": {
+ "shape": [
+ 128,
+ 32,
+ 64
+ ],
+ "n_bytes": 524288,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.self_attn.v_proj": {
+ "stored_tensors": {
+ "layers.7.self_attn.v_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.v_proj.svh": {
+ "shape": [
+ 512
+ ],
+ "n_bytes": 1024,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.v_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.self_attn.v_proj.trellis": {
+ "shape": [
+ 128,
+ 32,
+ 64
+ ],
+ "n_bytes": 524288,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.self_attn.o_proj": {
+ "stored_tensors": {
+ "layers.7.self_attn.o_proj.suh": {
+ "shape": [
+ 4096
+ ],
+ "n_bytes": 8192,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.o_proj.svh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.self_attn.o_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.self_attn.o_proj.trellis": {
+ "shape": [
+ 256,
+ 128,
+ 64
+ ],
+ "n_bytes": 4194304,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.self_attn.q_norm": {
+ "stored_tensors": {
+ "layers.7.self_attn.q_norm.weight": {
+ "shape": [
+ 128
+ ],
+ "n_bytes": 256,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.7.self_attn.k_norm": {
+ "stored_tensors": {
+ "layers.7.self_attn.k_norm.weight": {
+ "shape": [
+ 128
+ ],
+ "n_bytes": 256,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.7.post_attention_layernorm": {
+ "stored_tensors": {
+ "layers.7.post_attention_layernorm.weight": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ },
+ "layers.7.mlp.up_proj": {
+ "stored_tensors": {
+ "layers.7.mlp.up_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.up_proj.svh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.up_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.mlp.up_proj.trellis": {
+ "shape": [
+ 128,
+ 384,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.mlp.gate_proj": {
+ "stored_tensors": {
+ "layers.7.mlp.gate_proj.suh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.gate_proj.svh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.gate_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.mlp.gate_proj.trellis": {
+ "shape": [
+ 128,
+ 384,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "layers.7.mlp.down_proj": {
+ "stored_tensors": {
+ "layers.7.mlp.down_proj.suh": {
+ "shape": [
+ 6144
+ ],
+ "n_bytes": 12288,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.down_proj.svh": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.float16"
+ },
+ "layers.7.mlp.down_proj.mcg": {
+ "shape": [],
+ "n_bytes": 4,
+ "dtype": "torch.int32"
+ },
+ "layers.7.mlp.down_proj.trellis": {
+ "shape": [
+ 384,
+ 128,
+ 64
+ ],
+ "n_bytes": 6291456,
+ "dtype": "torch.int16"
+ }
+ },
+ "quant_format": "exl3",
+ "bits_per_weight": 4,
+ "mcg_multiplier": 3417055213
+ },
+ "norm": {
+ "stored_tensors": {
+ "norm.weight": {
+ "shape": [
+ 2048
+ ],
+ "n_bytes": 4096,
+ "dtype": "torch.bfloat16"
+ }
+ }
+ }
+ }
+ }