Hyun9junn committed on
Commit 3ba9c19 · verified · 1 Parent(s): 07bfb72

Add README.md

Files changed (1)
  1. README.md +199 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ language:
+ - ko
+ - en
+ license: llama3
+ library_name: transformers
+ tags:
+ - moe
+ - awq
+ - quantized
+ - w4a16
+ - compressed-tensors
+ - vllm
+ - llm-compressor
+ base_model: LGAI-EXAONE/K-EXAONE-236B-A23B
+ ---
+
+ # K-EXAONE-236B-A23B-W4A16-G128
+
+ **W4A16 AWQ quantization** of [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor).
+
+ This is the **first publicly available W4A16 AWQ checkpoint** for K-EXAONE-236B-A23B; the original model is published only in FP8 and GGUF variants on Hugging Face.
+
+ ---
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Base model | LGAI-EXAONE/K-EXAONE-236B-A23B |
+ | Architecture | ExaoneMoeForCausalLM |
+ | Total parameters | ~236B |
+ | Active parameters | ~23B per token |
+ | Quantization method | AWQ (Activation-aware Weight Quantization) |
+ | Weight precision | INT4 (packed) |
+ | Activation precision | BF16 |
+ | Group size | 128 |
+ | Quantization scope | All `Linear` layers except `lm_head`, layer 0 (dense MLP), and gate weight tensors |
+ | Compressed-tensors version | 0.15.0 |
+ | Context length | 262,144 tokens |
+ | Languages | Korean, English |
+
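To make the `Weight precision` and `Group size` rows concrete, here is a minimal sketch of group-wise 4-bit weight quantization, in which every run of 128 consecutive weights shares one scale. It assumes symmetric rounding to the signed range [-8, 7]; whether the checkpoint is symmetric or asymmetric, and how codes are packed, is defined by `recipe.yaml` and compressed-tensors, not by this example. All function names are illustrative.

```python
import random

GROUP_SIZE = 128    # matches the "Group size" row above
QMIN, QMAX = -8, 7  # signed 4-bit integer range (assumed symmetric scheme)


def quantize_group(weights):
    """Quantize one group of floats to INT4 codes plus a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into groups and quantize each group independently."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


def dequantize_row(groups):
    """Rebuild approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weight row
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Smaller groups track local weight ranges more closely at the cost of storing more scales; 128 is a common middle ground.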
+ ### Architecture Highlights
+
+ - **48 transformer layers** with mixed sliding-window (`LLLG` pattern) and full attention
+ - **MoE layers**: 47 sparse MoE layers + 1 dense MLP (layer 0)
+ - **128 routed experts** + 1 shared expert per MoE layer; top-8 experts activated per token
+ - **Sigmoid scoring** with `norm_topk_prob=True`
+ - **Hidden size**: 6144, **MoE intermediate size**: 2048
+
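The routing described in the list above can be sketched in a few lines. This is an illustration rather than the model's actual implementation: it assumes the router scores each of the 128 experts with an independent sigmoid, keeps the 8 highest scores, and, because `norm_topk_prob=True`, renormalizes the kept scores so they sum to 1. The shared expert is always active and is left out. `route_token` and the random logits are made up for the example.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per MoE layer
TOP_K = 8          # experts activated per token


def route_token(router_logits, top_k=TOP_K, norm_topk_prob=True):
    """Return a list of (expert_index, weight) pairs for one token."""
    # Sigmoid scoring: each expert is scored independently (no softmax).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    # Keep the indices of the top_k highest-scoring experts.
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    weights = [scores[i] for i in top]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    return list(zip(top, weights))


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
for expert, weight in selected:
    print(f"expert {expert:3d}  weight {weight:.3f}")
```

With sigmoid scoring the raw scores do not sum to 1 across experts, which is why the normalization step matters when the expert outputs are combined.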
+ ---
+
+ ## Quantization Details
+
+ Quantization was performed using [llm-compressor](https://github.com/vllm-project/llm-compressor) with a **MoE-aware AWQ** recipe.
+
+ **Method:** AWQ rescales weight input channels before quantization so that the salient weights, those that matter most for the observed activations, lose less precision. The scales are chosen using activation statistics gathered from a calibration dataset.
+
+ **Recipe highlights:**
+ - `scheme`: W4A16 (INT4 weights, BF16 activations)
+ - `group_size`: 128
+ - `n_grid`: 20 (search resolution for AWQ scale optimization)
+ - `duo_scaling`: True
+ - Smooth mappings cover all MoE expert layers (layers 1–47) independently, plus attention and MLP projections
+ - Layer 0 (dense MLP) and `lm_head` are excluded from quantization
+ - Gate weight tensors are excluded from quantization
+
+ The full recipe is available in `recipe.yaml`.
+
+ **Calibration dataset:** [`neuralmagic/LLM_compression_calibration`](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) (512 samples, sequence length 2048)
+
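For readers unfamiliar with what `n_grid` and `duo_scaling` control, the following toy sketch shows the general shape of an AWQ-style scale search. It is loosely modeled on the grid search used in open-source AWQ implementations and is not the code that produced this checkpoint: it uses one quantization scale per weight row instead of groups of 128, tiny random matrices instead of real layers, and folds the inverse scale back into the weights, whereas a real pipeline folds it into the preceding layer through the smooth mappings. All names are illustrative.

```python
import random

N_GRID = 20  # recipe value: number of candidate ratios to try
QMAX = 7     # symmetric signed 4-bit range is [-8, 7]


def fake_quantize(row):
    """Round one weight row to INT4 and back (one scale per row, for brevity)."""
    max_abs = max(abs(w) for w in row) or 1.0
    scale = max_abs / QMAX
    return [max(-8, min(QMAX, round(w / scale))) * scale for w in row]


def output_error(weight, weight_hat, samples):
    """Sum of squared differences between W @ x and W_hat @ x over all samples."""
    err = 0.0
    for x in samples:
        for row, row_hat in zip(weight, weight_hat):
            y = sum(w * xi for w, xi in zip(row, x))
            y_hat = sum(w * xi for w, xi in zip(row_hat, x))
            err += (y - y_hat) ** 2
    return err


def search_scales(weight, samples, n_grid=N_GRID, duo_scaling=True):
    """Grid-search per-input-channel scales that minimize the output error."""
    n_in = len(weight[0])
    x_mean = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n_in)]
    w_mean = [sum(abs(row[j]) for row in weight) / len(weight) for j in range(n_in)]
    best = None
    for step in range(n_grid):
        ratio = step / n_grid
        if duo_scaling:
            # Use both activation and weight magnitudes to build the scale.
            scales = [xm ** ratio / (wm ** (1 - ratio) + 1e-4)
                      for xm, wm in zip(x_mean, w_mean)]
        else:
            scales = [xm ** ratio for xm in x_mean]
        scales = [max(s, 1e-4) for s in scales]
        norm = (max(scales) * min(scales)) ** 0.5
        scales = [s / norm for s in scales]
        # Scale up, quantize, then fold the inverse scale back in.
        scaled = [[w * s for w, s in zip(row, scales)] for row in weight]
        weight_hat = [[q / s for q, s in zip(fake_quantize(row), scales)]
                      for row in scaled]
        err = output_error(weight, weight_hat, samples)
        if best is None or err < best[0]:
            best = (err, ratio, scales)
    return best


random.seed(0)
n_out, n_in = 8, 16
weight = [[random.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
# Channels 0 and 1 carry much larger activations, so their weights are "salient".
channel_gain = [20.0 if j < 2 else 1.0 for j in range(n_in)]
samples = [[random.gauss(0.0, 1.0) * g for g in channel_gain] for _ in range(32)]

baseline = output_error(weight, [fake_quantize(r) for r in weight], samples)
best_err, best_ratio, _ = search_scales(weight, samples)
print(f"no scaling: {baseline:.4f}  best ratio {best_ratio:.2f}: {best_err:.4f}")
```

A larger `n_grid` tries more candidate ratios per layer, which lengthens calibration but can find a slightly better scale.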
+ ---
+
+ ## Usage
+
+ ### vLLM (Recommended)
+
+ Install vLLM (≥0.6.0 recommended for compressed-tensors support):
+
+ ```bash
+ pip install vllm
+ ```
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128",
+     max_model_len=8192,
+     trust_remote_code=True,  # K-EXAONE uses custom modeling code
+     tensor_parallel_size=4,  # adjust to the number of GPUs available
+ )
+
+ sampling_params = SamplingParams(
+     temperature=0.6,
+     top_p=0.9,
+     max_tokens=512,
+ )
+
+ tokenizer = llm.get_tokenizer()
+
+ prompts = [
+     "What is the capital of South Korea?",
+     "Explain the difference between MoE and dense transformer models.",
+ ]
+
+ formatted_prompts = [
+     tokenizer.apply_chat_template(
+         [{"role": "user", "content": p}],
+         tokenize=False,
+         add_generation_prompt=True,
+     )
+     for p in prompts
+ ]
+
+ outputs = llm.generate(formatted_prompts, sampling_params)
+
+ for prompt, output in zip(prompts, outputs):
+     print(f"Prompt : {prompt}")
+     print(f"Response: {output.outputs[0].text.strip()}")
+ ```
+
+ ### Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Korean prompt: "What is the capital of Korea?"
+ messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]
+ input_ids = tokenizer.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output = model.generate(
+     input_ids,
+     max_new_tokens=256,
+     do_sample=True,  # needed for temperature and top_p to take effect
+     temperature=0.6,
+     top_p=0.9,
+ )
+ # Decode only the newly generated tokens, skipping the prompt.
+ print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```
+
+ ---
+
+ ## Hardware Requirements
+
+ | Precision | Min VRAM |
+ |---|---|
+ | This model (W4A16) | ~120 GB |
+ | Original BF16 | ~480 GB |
+
+ Tested on: NVIDIA B200 (180 GB HBM3e).
+
+ For multi-GPU inference, set `tensor_parallel_size` in vLLM to the number of GPUs.
+
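The figures in the table are close to what a back-of-the-envelope estimate gives. The sketch below counts weight storage only, treats all ~236B parameters as quantized (in practice `lm_head`, layer 0, and the gate weights stay in BF16), and assumes one 16-bit scale per group of 128 weights. It ignores KV cache, activations, and runtime overhead, so the result is a floor for planning, not a guarantee. The 0.9 usable-memory fraction and the helper names are assumptions made for the example.

```python
import math

TOTAL_PARAMS = 236e9  # "~236B" from the Model Details table
GROUP_SIZE = 128
GB = 1e9


def weight_gb(params, bits_per_weight):
    """Storage for `params` weights at the given average bit width, in GB."""
    return params * bits_per_weight / 8 / GB


# 4 bits per weight plus one 16-bit scale shared by each group of 128 weights.
w4a16_bits = 4 + 16 / GROUP_SIZE
bf16_gb = weight_gb(TOTAL_PARAMS, 16)
w4a16_gb = weight_gb(TOTAL_PARAMS, w4a16_bits)


def min_gpus(model_gb, gpu_gb, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory holds the weights."""
    return math.ceil(model_gb / (gpu_gb * usable_fraction))


print(f"BF16 weights : ~{bf16_gb:.0f} GB")
print(f"W4A16 weights: ~{w4a16_gb:.0f} GB")
print(f"180 GB GPUs needed for W4A16 weights: {min_gpus(w4a16_gb, 180)}")
```

Long contexts add a substantial KV cache on top of the weights, so budget extra memory or lower `max_model_len` when running close to the limit.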
+ ---
+
+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `model-00001-of-00003.safetensors` | Model weights shard 1/3 |
+ | `model-00002-of-00003.safetensors` | Model weights shard 2/3 |
+ | `model-00003-of-00003.safetensors` | Model weights shard 3/3 |
+ | `model.safetensors.index.json` | Weight shard index |
+ | `config.json` | Model config with quantization metadata |
+ | `recipe.yaml` | llm-compressor AWQ recipe used for quantization |
+ | `tokenizer.json` | Tokenizer |
+ | `tokenizer_config.json` | Tokenizer config |
+ | `chat_template.jinja` | Chat template |
+ | `generation_config.json` | Default generation config |
+
+ ---
+
+ ## License
+
+ This model inherits the license of the base model [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B). Please refer to the original model page for license details.
+
+ ---
+
+ ## Citation
+
+ If you use this model, please cite the original K-EXAONE work:
+
+ ```
+ @misc{k-exaone-236b,
+   title = {K-EXAONE-236B-A23B},
+   author = {LG AI Research},
+   year = {2025},
+   url = {https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B}
+ }
+ ```
+
+ Quantization produced by [Hyun9junn](https://huggingface.co/Hyun9junn) using [llm-compressor](https://github.com/vllm-project/llm-compressor).