---
language:
- ko
- en
license: llama3
library_name: transformers
tags:
- moe
- awq
- quantized
- w4a16
- compressed-tensors
- vllm
- llm-compressor
base_model: LGAI-EXAONE/K-EXAONE-236B-A23B
---

# K-EXAONE-236B-A23B-W4A16-G128

**W4A16 AWQ quantization** of [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor).

This is the **first publicly available W4A16 AWQ checkpoint** for K-EXAONE-236B-A23B; the original model only has FP8 and GGUF variants on Hugging Face.

---

## Model Details

| Property | Value |
|---|---|
| Base model | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Architecture | ExaoneMoeForCausalLM |
| Total parameters | ~236B |
| Active parameters | ~23B per token |
| Quantization method | AWQ (Activation-aware Weight Quantization) |
| Weight precision | INT4 (packed) |
| Activation precision | BF16 |
| Group size | 128 |
| Quantization scope | All `Linear` layers except `lm_head` and gate projections |
| Compressed-tensors version | 0.15.0 |
| Context length | 262,144 tokens |
| Languages | Korean, English |

### Architecture Highlights

- **48 transformer layers** with mixed sliding-window (`LLLG` pattern) and full attention
- **MoE layers**: 47 sparse MoE layers + 1 dense MLP (layer 0)
- **128 routed experts** + 1 shared expert per MoE layer; top-8 experts activated per token
- **Sigmoid scoring** with `norm_topk_prob=True`
- **Hidden size**: 6144, **MoE intermediate size**: 2048

---

## Quantization Details

Quantization was performed using [llm-compressor](https://github.com/vllm-project/llm-compressor) with a **MoE-aware AWQ** recipe.

**Method:** AWQ applies channel-wise scaling to minimize quantization error by protecting salient weights, using a calibration dataset to determine optimal scales.

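To make the method description concrete, here is a minimal NumPy sketch of the idea: group-wise INT4 fake-quantization combined with a grid search over per-channel scales that minimizes output error on calibration activations. It is an illustration under simplifying assumptions, not the llm-compressor implementation: the function names `quantize_groupwise` and `awq_search` are hypothetical, the data is synthetic, and the scale uses activation statistics only (the recipe's `duo_scaling` variant also uses weight statistics).

```python
import numpy as np

def quantize_groupwise(w, group_size=128, bits=4):
    """Fake-quantize `w` (out_features x in_features) to `bits` bits,
    with one asymmetric min/max range per group of input channels."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    qmax = 2**bits - 1                      # 15 for INT4
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return (q * scale + lo).reshape(out_f, in_f)

def awq_search(w, x, n_grid=20, group_size=128):
    """Grid-search a per-input-channel scale s = act_mag**ratio.
    Weights are scaled up by s before quantization and scaled back after,
    so channels with large activations lose less precision."""
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-4)
    ref = x @ w.T                           # full-precision layer output
    best_err, best_s = np.inf, None
    for i in range(n_grid):
        ratio = i / n_grid                  # ratio 0 is plain round-to-nearest
        s = act_mag**ratio
        s = s / np.sqrt(s.max() * s.min())  # keep scales centred around 1
        w_q = quantize_groupwise(w * s, group_size) / s
        err = np.mean((ref - x @ w_q.T) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

# Synthetic calibration data with a few high-magnitude ("salient") channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))
x[:, :8] *= 20.0
w = rng.normal(size=(32, 256))

rtn_err = np.mean((x @ w.T - x @ quantize_groupwise(w).T) ** 2)
scales, awq_err = awq_search(w, x)
print(f"round-to-nearest MSE: {rtn_err:.4f}")
print(f"scale-searched MSE:   {awq_err:.4f}")
```

Because a ratio of 0 reproduces plain round-to-nearest, the searched error can never be worse than the baseline in this sketch. A real pipeline additionally folds the inverse scales into the preceding layer and packs the INT4 values.
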
**Recipe highlights:**

- `scheme`: W4A16 (INT4 weights, BF16 activations)
- `group_size`: 128
- `n_grid`: 20 (search resolution for AWQ scale optimization)
- `duo_scaling`: True
- Smooth mappings cover all MoE expert layers (layers 1–47) independently, plus attention and MLP projections
- Layer 0 (dense MLP) and `lm_head` are excluded from quantization
- Gate weight tensors are excluded from quantization

The full recipe is available in `recipe.yaml`.

**Calibration dataset:** [`neuralmagic/LLM_compression_calibration`](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) (512 samples, sequence length 2048)

---

## Usage

### vLLM (Recommended)

Install vLLM (≥0.6.0 recommended for compressed-tensors support):

```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128",
    max_model_len=8192,
    trust_remote_code=True,   # K-EXAONE uses custom modeling code
    tensor_parallel_size=4,   # adjust to the number of GPUs available
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=512,
)

tokenizer = llm.get_tokenizer()

prompts = [
    "What is the capital of South Korea?",
    "Explain the difference between MoE and dense transformer models.",
]

# Wrap each prompt in the model's chat template before generation.
formatted_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

outputs = llm.generate(formatted_prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt  : {prompt}")
    print(f"Response: {output.outputs[0].text.strip()}")
```

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Korean prompt: "What is the capital of Korea?"
messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
output = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

---

## Hardware Requirements

| Precision | Min VRAM |
|---|---|
| This model (W4A16) | ~120 GB |
| Original BF16 | ~480 GB |

Tested on: NVIDIA B200 (180 GB HBM3e). For multi-GPU inference, set `tensor_parallel_size` in vLLM to the number of GPUs.

---

## Files

| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Model weights shard 1/3 |
| `model-00002-of-00003.safetensors` | Model weights shard 2/3 |
| `model-00003-of-00003.safetensors` | Model weights shard 3/3 |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model config with quantization metadata |
| `recipe.yaml` | llm-compressor AWQ recipe used for quantization |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `chat_template.jinja` | Chat template |
| `generation_config.json` | Default generation config |

---

## License

This model inherits the license of the base model [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B). Please refer to the original model page for license details.

---

## Citation

If you use this model, please cite the original K-EXAONE work:

```
@misc{k-exaone-236b,
  title  = {K-EXAONE-236B-A23B},
  author = {LG AI Research},
  year   = {2025},
  url    = {https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B}
}
```

Quantization produced by [Hyun9junn](https://huggingface.co/Hyun9junn) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
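
---

## Appendix: Weight Memory Estimate

As a rough sanity check on the figures in the Hardware Requirements table, the weight footprint can be estimated from the parameter count, the bit width, and the group size. The sketch below makes several simplifying assumptions that are not taken from the recipe: every parameter is treated as quantized (in practice `lm_head`, the gate tensors, and layer 0 stay in higher precision), each group of 128 weights carries one 16-bit scale, zero-points are not counted, and KV cache and activation memory are ignored. The helper `weight_memory_gb` is a hypothetical name used only for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (10^9 bytes).

    When `group_size` is given, one `scale_bits`-bit scale is added per group.
    """
    total_bits = n_params * bits_per_weight
    if group_size:
        total_bits += n_params / group_size * scale_bits
    return total_bits / 8 / 1e9

n_params = 236e9  # approximate total parameter count from Model Details

bf16_gb = weight_memory_gb(n_params, 16)
w4a16_gb = weight_memory_gb(n_params, 4, group_size=128)

print(f"BF16 weights : ~{bf16_gb:.0f} GB")
print(f"W4A16 weights: ~{w4a16_gb:.1f} GB")
```

Under these assumptions the estimate comes out at about 472 GB for BF16 and about 122 GB for W4A16, in the same range as the table. Actual usage will be higher once the unquantized layers, the KV cache, and runtime buffers are included.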