---
language:
- ko
- en
license: llama3
library_name: transformers
tags:
- moe
- awq
- quantized
- w4a16
- compressed-tensors
- vllm
- llm-compressor
base_model: LGAI-EXAONE/K-EXAONE-236B-A23B
---

# K-EXAONE-236B-A23B-W4A16-G128

### 🔄 (2026-06-04) Change Calibration Dataset
* Use mixed calibration datasets (2,048 samples, sequence length 8,192)
  * [`beomi/KoAlpaca-RealQA`](https://huggingface.co/datasets/beomi/KoAlpaca-RealQA): 512 samples
  * [`ChuGyouk/HFH4_ultrachat_200k_ko`](https://huggingface.co/datasets/ChuGyouk/HFH4_ultrachat_200k_ko): 512 samples
  * [`HuggingFaceH4/ultrachat_200k`](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k): 1,024 samples

### 🔄 (2026-05-04) Change Calibration Dataset
* Use [`beomi/KoAlpaca-RealQA`](https://huggingface.co/datasets/beomi/KoAlpaca-RealQA) (512 samples, sequence length 2,048)

### 🔄 (2026-04-15) Improved Quantization - scale-up calibration dataset
* 512 calibration samples, sequence length 2,048

### 🔄 (2026-04-13) Bug Fixes
* The vLLM rms_norm contiguous-buffer patch is no longer required.
* The tensor-parallel issue is resolved, and inference now works correctly on other GPUs (A100, ...) as well.

### 🔄 (2026-04-13) Improved Quantization - scale-up calibration dataset
* 512 calibration samples, sequence length 512

### 🔄 (2026-04-10) Initial commit
* 32 calibration samples, sequence length 128

---

**W4A16 AWQ quantization** of [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor).

This is the **first publicly available W4A16 AWQ checkpoint** for K-EXAONE-236B-A23B; the original model only has FP8 and GGUF variants on Hugging Face.


---

## Model Details

| Property | Value |
|----------|-------|
| Base model | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Architecture | ExaoneMoEForCausalLM |
| Total parameters | ~236B |
| Active parameters | ~23B per token |
| Quantization method | AWQ (Activation-aware Weight Quantization) |
| Weight precision | INT4 (packed) |
| Activation precision | BF16 |
| Group size | 128 |
| Quantization scope | All `Linear` layers except `lm_head` and gate projections |
| Compressed-tensors version | 0.15.0 |
| Context length | 262,144 tokens |
| Languages | Korean, English |

### Architecture Highlights

* **48 transformer layers** with mixed sliding-window (`LLLG` pattern) and full attention
* **MoE layers**: 47 sparse MoE layers + 1 dense MLP (layer 0)
* **128 routed experts** + 1 shared expert per MoE layer; top-8 experts activated per token
* **Sigmoid scoring** with `norm_topk_prob=True`
* **Hidden size**: 6144, **MoE intermediate size**: 2048

---

## Quantization Details

Quantization was performed using [llm-compressor](https://github.com/vllm-project/llm-compressor) with a **MoE-aware AWQ** recipe. The EXAONE-specific MoE-aware AWQ recipe was developed in [SqueezeBits/llm-compressor-K-EXAONE](https://github.com/SqueezeBits/llm-compressor-K-EXAONE).

**Method:** AWQ applies channel-wise scaling to minimize quantization error by protecting salient weights, using a calibration dataset to determine optimal scales.

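
To make the method concrete, here is a minimal pure-Python sketch of the idea. It is not the llm-compressor implementation. It assumes symmetric round-to-nearest INT4 quantization per weight group and per-input-channel scales of the form `s_j = mean(|x_j|) ** alpha`, with `alpha` chosen by a grid search over `n_grid` candidates. The function names (`quantize_group`, `awq_search`) and the toy group size of 4 are illustrative choices, and details such as `duo_scaling` and scale normalization are left out.

```python
import random


def quantize_group(ws, bits=4):
    """Symmetric round-to-nearest quantization of one weight group (returned dequantized)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for INT4
    scale = max(abs(w) for w in ws) / qmax
    if scale == 0.0:
        return list(ws)
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in ws]


def quantize_row(row, group_size):
    """Quantize a weight row group by group along the input dimension."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size]))
    return out


def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def awq_search(weights, samples, group_size=4, n_grid=20):
    """Grid-search alpha so that quantizing W * diag(s) minimizes the output error.

    Assumption: s_j = mean(|x_j|) ** alpha. alpha = 0 gives s = 1, which is
    plain round-to-nearest quantization without any activation awareness.
    """
    n_in = len(weights[0])
    act_mean = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n_in)]
    reference = [matvec(weights, x) for x in samples]
    best_err, best_alpha, best_scales = float("inf"), 0.0, [1.0] * n_in
    for step in range(n_grid):
        alpha = step / n_grid
        scales = [max(m, 1e-4) ** alpha for m in act_mean]
        deq = []
        for row in weights:
            # Scale salient input channels up, quantize, then fold the scale back.
            scaled = [w * s for w, s in zip(row, scales)]
            q = quantize_row(scaled, group_size)
            deq.append([w / s for w, s in zip(q, scales)])
        err = sum(
            (a - b) ** 2
            for x, ref in zip(samples, reference)
            for a, b in zip(matvec(deq, x), ref)
        ) / len(samples)
        if err < best_err:
            best_err, best_alpha, best_scales = err, alpha, scales
    return best_err, best_alpha, best_scales


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    # Input channel 0 carries much larger activations than the others.
    X = [[rng.gauss(0, 10 if j == 0 else 1) for j in range(8)] for _ in range(32)]
    rtn_err, _, _ = awq_search(W, X, n_grid=1)  # alpha = 0 only
    err, alpha, _ = awq_search(W, X)
    print(f"round-to-nearest error: {rtn_err:.4f}")
    print(f"scaled error: {err:.4f} (alpha = {alpha:.2f})")
```

Because `alpha = 0` is always part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration samples; how much it improves depends on how unevenly the activation magnitudes are distributed across channels.
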
**Recipe highlights:**

* `scheme`: W4A16 (INT4 weights, BF16 activations)
* `group_size`: 128
* `n_grid`: 20 (search resolution for AWQ scale optimization)
* `duo_scaling`: True
* Smooth mappings cover all MoE expert layers (layers 1–47) independently, plus attention and MLP projections
* Layer 0 (dense MLP) and `lm_head` are excluded from quantization
* Gate weight tensors are excluded from quantization

**Calibration dataset:** [`beomi/KoAlpaca-RealQA`](https://huggingface.co/datasets/beomi/KoAlpaca-RealQA) (512 samples, sequence length 2048)

---

## Hardware Requirements

| Precision | VRAM |
|-----------|------|
| This model (W4A16) | ~120 GB |
| Original BF16 | ~480 GB |

**CUDA / driver requirement:** On H200 (sm_90), the vLLM 0.19.0 wheels are compiled with the CUDA 12.9 toolkit, so you need **CUDA ≥ 12.9** (NVIDIA driver ≥ 575.x) to run without issues. If your driver is older, apply the monkey-patch workaround in the inference script below. The patch ships disabled inside a triple-quoted string; remove the surrounding `"""` to enable it.

* A100 (sm_80) is not affected.

---

## Setup

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Create a Python 3.12 virtual environment
uv venv --python 3.12

# 3. Activate it
source .venv/bin/activate

# 4. Install vLLM and Transformers
uv pip install "vllm==0.19.0"
uv pip install "transformers==5.5.0"
```

---

## Running Inference

Save the script below as `vllm_inference.py` and run:

```bash
python vllm_inference.py
```

```python
from vllm import LLM, SamplingParams

MODEL_PATH = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"

"""# ── Monkey-patch required if NVIDIA driver < 575.x (CUDA < 12.9) ─────────────
# For H200, vLLM 0.19.0 is compiled with CUDA 12.9; older drivers cannot JIT-compile its
# PTX and crash with "cudaErrorUnsupportedPtxVersion" during weight loading.
# This patch forces vLLM to use WNA16MoEMethod (no Marlin CUDA kernels) instead
# of MarlinMoEMethod. Safe to keep even after upgrading the driver.
import vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe as _ct_moe
_ct_moe.check_moe_marlin_supports_layer = lambda *args, **kwargs: False
# ─────────────────────────────────────────────────────────────────────────────"""


def main():
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=8192,
        trust_remote_code=True,   # K-EXAONE uses custom modeling code
        tensor_parallel_size=2,   # 2x H200; or 4x A100
        enforce_eager=True,
    )

    sampling_params = SamplingParams(
        temperature=0.2,
        top_p=0.9,
        max_tokens=2048,
    )

    prompts = [
        "What is the capital of South Korea?",
        "Explain the difference between MoE and dense transformer models.",
        "Write a short Python function to compute Fibonacci numbers.",
    ]

    # Wrap each prompt in the model's chat template before generation.
    tokenizer = llm.get_tokenizer()
    formatted_prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for p in prompts
    ]

    outputs = llm.generate(formatted_prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"Prompt  : {prompt}")
        print(f"Response: {output.outputs[0].text.strip()}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```

---

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00003.safetensors` | Model weights shard 1/3 |
| `model-00002-of-00003.safetensors` | Model weights shard 2/3 |
| `model-00003-of-00003.safetensors` | Model weights shard 3/3 |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model config with quantization metadata |
| `recipe.yaml` | llm-compressor AWQ recipe used for quantization |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `chat_template.jinja` | Chat template |
| `generation_config.json` | Default generation config |

---

## License

This model inherits the license of the base model [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B). Please refer to the original model page for license details.


---

## Citation

If you use this model, please cite the original K-EXAONE work:

```bibtex
@misc{k-exaone-236b,
  title  = {K-EXAONE-236B-A23B},
  author = {LG AI Research},
  year   = {2025},
  url    = {https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B}
}
```