---
language:
- ko
- en
license: llama3
library_name: transformers
tags:
- moe
- awq
- quantized
- w4a16
- compressed-tensors
- vllm
- llm-compressor
base_model: LGAI-EXAONE/K-EXAONE-236B-A23B
---
# K-EXAONE-236B-A23B-W4A16-G128

**(2026-04-13) Improved quantization:** scaled-up calibration dataset (512 calibration samples, sequence length 512)

**(2026-04-10) Initial commit:** 32 calibration samples, sequence length 128

> **Note: Early release**
>
> This checkpoint was quantized with a **small calibration dataset**, so accuracy is noticeably lower than the original BF16 model.
> A re-quantized version with a larger, more representative dataset is in progress; please wait for the next upload if quality matters for your use case.

---

**W4A16 AWQ quantization** of [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor).
This is the **first publicly available W4A16 AWQ checkpoint** for K-EXAONE-236B-A23B; the original model only has FP8 and GGUF variants on Hugging Face.

---
## Model Details

| Property | Value |
|----------|-------|
| Base model | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Architecture | ExaoneMoeForCausalLM |
| Total parameters | ~236B |
| Active parameters | ~23B per token |
| Quantization method | AWQ (Activation-aware Weight Quantization) |
| Weight precision | INT4 (packed) |
| Activation precision | BF16 |
| Group size | 128 |
| Quantization scope | All `Linear` layers except `lm_head` and gate projections |
| Compressed-tensors version | 0.15.0 |
| Context length | 262,144 tokens |
| Languages | Korean, English |

### Architecture Highlights

* **48 transformer layers** with mixed sliding-window (`LLLG` pattern) and full attention
* **MoE layers**: 47 sparse MoE layers + 1 dense MLP (layer 0)
* **128 routed experts** + 1 shared expert per MoE layer; top-8 experts activated per token
* **Sigmoid scoring** with `norm_topk_prob=True`
* **Hidden size**: 6144, **MoE intermediate size**: 2048
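
The routing described above (sigmoid scoring, top-8 of 128 routed experts, `norm_topk_prob=True`) can be sketched in plain Python. This is an illustrative sketch, not the model's actual implementation: `route_token` is a hypothetical name, the logits are random, and any bias terms or scaling factors the real router may apply are omitted.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # from the model card
TOP_K = 8                 # from the model card


def route_token(router_logits, top_k=TOP_K, norm_topk_prob=True):
    """Return (expert_index, weight) pairs for one token.

    Hypothetical helper: each expert is scored independently with a
    sigmoid, the top_k highest scores are kept, and the kept weights are
    optionally renormalized so that they sum to 1.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = [scores[i] for i in top]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    return list(zip(top, weights))


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(logits)
for expert, weight in selected:
    print(f"expert {expert:3d}  weight {weight:.4f}")
```

Unlike a softmax router, the sigmoid scores do not sum to 1 across experts, which is why the renormalization over the selected top-k matters here. The shared expert sits outside this selection, since by definition it processes every token.
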

---

## Quantization Details

Quantization was performed using [llm-compressor](https://github.com/vllm-project/llm-compressor) with a **MoE-aware AWQ** recipe.
The EXAONE-specific MoE-aware AWQ recipe was developed in [SqueezeBits/llm-compressor-K-EXAONE](https://github.com/SqueezeBits/llm-compressor-K-EXAONE).

**Method:** AWQ applies channel-wise scaling to minimize quantization error by protecting salient weights, using a calibration dataset to determine optimal scales.

**Recipe highlights:**

* `scheme`: W4A16 (INT4 weights, BF16 activations)
* `group_size`: 128
* `n_grid`: 20 (search resolution for AWQ scale optimization)
* `duo_scaling`: True
* Smooth mappings cover all MoE expert layers (layers 1–47) independently, plus attention and MLP projections
* Layer 0 (dense MLP) and `lm_head` are excluded from quantization
* Gate weight tensors are excluded from quantization

**Calibration dataset:** [`neuralmagic/LLM_compression_calibration`](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) (512 samples, sequence length 2048)
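
To make `group_size: 128` and the W4A16 scheme concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It assumes symmetric round-to-nearest quantization to the signed range [-8, 7] with one scale per group of 128 weights; the actual llm-compressor recipe may differ (for example by using zero points), and the AWQ scale search itself is not shown. `quantize_group` and `dequantize_group` are hypothetical helper names.

```python
import random

GROUP_SIZE = 128      # from the recipe
QMIN, QMAX = -8, 7    # signed 4-bit range (assumption: symmetric scheme)


def quantize_group(weights):
    """Quantize one group of weights to INT4 codes with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from INT4 codes and the group scale."""
    return [c * scale for c in codes]


random.seed(0)
# Toy weight row made of four groups (values are synthetic).
row = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]

max_err = 0.0
for start in range(0, len(row), GROUP_SIZE):
    group = row[start:start + GROUP_SIZE]
    codes, scale = quantize_group(group)
    restored = dequantize_group(codes, scale)
    group_err = max(abs(a - b) for a, b in zip(group, restored))
    max_err = max(max_err, group_err)

print(f"max abs reconstruction error: {max_err:.6f}")
```

Because every group of 128 weights carries its own scale, a large outlier only coarsens the quantization grid of its own group rather than the whole row.
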

---

## Hardware Requirements

| Precision | VRAM |
|-----------|------|
| This model (W4A16) | ~120 GB |
| Original BF16 | ~480 GB |
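
The figures in the table follow from simple arithmetic on the parameter count. The sketch below is a rough estimate only: it treats all ~236B parameters as stored at the stated precision and ignores group scales, the layers left unquantized, KV cache, and activation memory. `weight_gigabytes` is a hypothetical helper name.

```python
TOTAL_PARAMS = 236e9  # ~236B, from the model card


def weight_gigabytes(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)
int4_gb = weight_gigabytes(TOTAL_PARAMS, 4)
print(f"BF16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.0f} GB")
```

The results, about 472 GB and 118 GB, are consistent with the rounded values in the table.
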

**Currently validated on: 2 × H200 only.** No other GPU configuration has been tested.

**CUDA / driver requirement:** vLLM 0.19.0 wheels are compiled with the CUDA 12.9 toolkit, so you need **CUDA ≥ 12.9** (NVIDIA driver ≥ 575.x) to run without issues. If your driver is older, follow the monkey-patch workaround in the inference section below.
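
To decide whether the workaround applies, you can compare the installed driver's major version against 575. The helper below is a sketch under assumptions, not part of vLLM: `needs_marlin_workaround` and `query_driver_version` are hypothetical names, the version strings in the demo are example values, and the query relies on `nvidia-smi` being on the `PATH` of the GPU host.

```python
import subprocess

MIN_DRIVER_MAJOR = 575  # threshold stated in this model card


def driver_major(version_string):
    """Extract the major version from a string such as '570.86.15'."""
    return int(version_string.strip().split(".")[0])


def needs_marlin_workaround(version_string):
    """Return True if the driver is older than the 575.x series."""
    return driver_major(version_string) < MIN_DRIVER_MAJOR


def query_driver_version():
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0]


# Demo with example version strings (not queried from real hardware).
for example in ("570.86.15", "575.51.03"):
    print(example, "-> workaround needed:", needs_marlin_workaround(example))
```

On the GPU host itself, `needs_marlin_workaround(query_driver_version())` performs the same check against the real driver.
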

---

## Setup

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Create a Python 3.12 virtual environment
uv venv --python 3.12

# 3. Activate it
source .venv/bin/activate

# 4. Install vLLM and Transformers
uv pip install "vllm==0.19.0"
uv pip install "transformers==5.5.0"
```

### Required patch: vLLM `rms_norm` contiguous buffer fix

Before running inference you must apply one small fix to the installed vLLM package.
Without it you will hit:

```
RuntimeError: Expected out.is_contiguous() to be true, but got false.
in ops.rms_norm
```

Open `<venv>/lib/python3.12/site-packages/vllm/model_executor/layers/layernorm.py`,
find the `rms_norm` function (around line 61), and replace:

```python
out = torch.empty_like(x)
```

with:

```python
out = torch.empty(x.shape, dtype=x.dtype, device=x.device)
```

This makes the output buffer always contiguous, regardless of the strides of the input tensor.
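
If you would rather not edit the file by hand, the same replacement can be scripted. The following is a minimal sketch under assumptions: `patch_rms_norm` and `patch_file` are hypothetical helpers, they expect the exact line `out = torch.empty_like(x)` to appear after `def rms_norm` in the source, and they have not been checked against other vLLM versions. Back up the file before patching it.

```python
from pathlib import Path

OLD = "out = torch.empty_like(x)"
NEW = "out = torch.empty(x.shape, dtype=x.dtype, device=x.device)"


def patch_rms_norm(source):
    """Replace the first OLD line found after `def rms_norm` with NEW.

    Returns the source unchanged if it is already patched, and raises
    ValueError if the expected lines cannot be found.
    """
    start = source.find("def rms_norm")
    if start == -1:
        raise ValueError("rms_norm function not found")
    pos = source.find(OLD, start)
    if pos == -1:
        if NEW in source[start:]:
            return source  # already patched
        raise ValueError("expected line not found; vLLM version may differ")
    return source[:pos] + NEW + source[pos + len(OLD):]


def patch_file(path):
    """Apply patch_rms_norm to the file at `path`, rewriting it in place."""
    path = Path(path)
    path.write_text(patch_rms_norm(path.read_text()))


# Demonstration on an in-memory snippet (not the real vLLM source).
demo = "def rms_norm(x, weight, eps):\n    out = torch.empty_like(x)\n    return out\n"
print(patch_rms_norm(demo))
```

To patch the installed package, call `patch_file` with the `layernorm.py` path given above, substituting your own virtual environment directory for `<venv>`.
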

---

## Running Inference

Save the script below as `vllm_inference.py` and run:

```bash
python vllm_inference.py
```

```python
from vllm import LLM, SamplingParams

MODEL_PATH = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"

# -- Monkey-patch required if NVIDIA driver < 575.x (CUDA < 12.9) -------------
# vLLM 0.19.0 is compiled with CUDA 12.9; older drivers cannot JIT-compile its
# PTX and crash with "cudaErrorUnsupportedPtxVersion" during weight loading.
# This patch forces vLLM to use WNA16MoEMethod (no Marlin CUDA kernels) instead
# of MarlinMoEMethod. Safe to keep even after upgrading the driver.
import vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe as _ct_moe

_ct_moe.check_moe_marlin_supports_layer = lambda *args, **kwargs: False
# -----------------------------------------------------------------------------


def main():
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=8192,
        trust_remote_code=True,   # K-EXAONE uses custom modeling code
        tensor_parallel_size=2,   # 2x H200; 236B W4A16 ~118 GB fits across both
        enforce_eager=True,
    )

    sampling_params = SamplingParams(
        temperature=0,
        top_p=1.0,
        max_tokens=512,
    )

    prompts = [
        "What is the capital of South Korea?",
        "Explain the difference between MoE and dense transformer models.",
        "Write a short Python function to compute Fibonacci numbers.",
    ]

    tokenizer = llm.get_tokenizer()
    formatted_prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": p}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for p in prompts
    ]

    outputs = llm.generate(formatted_prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"Prompt : {prompt}")
        print(f"Response: {output.outputs[0].text.strip()}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```

---

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00003.safetensors` | Model weights shard 1/3 |
| `model-00002-of-00003.safetensors` | Model weights shard 2/3 |
| `model-00003-of-00003.safetensors` | Model weights shard 3/3 |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model config with quantization metadata |
| `recipe.yaml` | llm-compressor AWQ recipe used for quantization |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `chat_template.jinja` | Chat template |
| `generation_config.json` | Default generation config |
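
The weight shard index maps each tensor name to the shard file that stores it. Assuming the usual Hugging Face layout with a top-level `weight_map` object, a short sketch can summarize how tensors are spread across the three shards. `tensors_per_shard` and `load_index` are hypothetical helper names, and the tensor names in the demo dictionary are made up for illustration.

```python
import json
from collections import Counter


def tensors_per_shard(index):
    """Count how many tensors each shard file holds."""
    return Counter(index["weight_map"].values())


def load_index(path):
    """Read an index file such as model.safetensors.index.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demo with a tiny hand-written index (tensor names are hypothetical).
demo_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "layer_a.weight": "model-00001-of-00003.safetensors",
        "layer_b.weight": "model-00001-of-00003.safetensors",
        "layer_c.weight": "model-00003-of-00003.safetensors",
    },
}

counts = tensors_per_shard(demo_index)
for shard, count in sorted(counts.items()):
    print(shard, count)
```

With the real file downloaded, `tensors_per_shard(load_index("model.safetensors.index.json"))` gives the same summary for this checkpoint.
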

---

## License

This model inherits the license of the base model [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B). Please refer to the original model page for license details.

---

## Citation

If you use this model, please cite the original K-EXAONE work:

```bibtex
@misc{k-exaone-236b,
  title  = {K-EXAONE-236B-A23B},
  author = {LG AI Research},
  year   = {2025},
  url    = {https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B}
}
```