| language: | |
| - ko | |
| - en | |
| license: llama3 | |
| library_name: transformers | |
| tags: | |
| - moe | |
| - awq | |
| - quantized | |
| - w4a16 | |
| - compressed-tensors | |
| - vllm | |
| - llm-compressor | |
| base_model: LGAI-EXAONE/K-EXAONE-236B-A23B | |
| # K-EXAONE-236B-A23B-W4A16-G128 | |
| **W4A16 AWQ quantization** of [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor). | |
| This is the **first publicly available W4A16 AWQ checkpoint** for K-EXAONE-236B-A23B; the original model only has FP8 and GGUF variants on Hugging Face. | |
| --- | |
| ## Model Details | |
| | Property | Value | | |
| |---|---| | |
| | Base model | LGAI-EXAONE/K-EXAONE-236B-A23B | | |
| | Architecture | ExaoneMoeForCausalLM | | |
| | Total parameters | ~236B | | |
| | Active parameters | ~23B per token | | |
| | Quantization method | AWQ (Activation-aware Weight Quantization) | | |
| | Weight precision | INT4 (packed) | | |
| | Activation precision | BF16 | | |
| | Group size | 128 | | |
| | Quantization scope | All `Linear` layers except `lm_head`, gate projections, and the layer-0 dense MLP | |
| | Compressed-tensors version | 0.15.0 | | |
| | Context length | 262,144 tokens | | |
| | Languages | Korean, English | | |
| ### Architecture Highlights | |
| - **48 transformer layers** in a repeating `LLLG` pattern: three sliding-window (local) attention layers followed by one full (global) attention layer | |
| - **MoE layers**: 47 sparse MoE layers + 1 dense MLP (layer 0) | |
| - **128 routed experts** + 1 shared expert per MoE layer; top-8 experts activated per token | |
| - **Sigmoid scoring** with `norm_topk_prob=True` | |
| - **Hidden size**: 6144, **MoE intermediate size**: 2048 | |
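The routing described above (sigmoid scoring, top-8 selection, `norm_topk_prob=True`) can be illustrated with a small standard-library sketch. It is not the model's actual routing code: the function name `route_token` and the toy logits are hypothetical, and the real implementation may add terms (for example bias correction or a scaling factor) that are not covered here. The shared expert is applied to every token and is therefore not part of the selection.

```python
import math

def route_token(router_logits, top_k=8, norm_topk_prob=True):
    """Select top_k experts for one token from per-expert router logits."""
    # Sigmoid scoring: each expert gets an independent score in (0, 1),
    # unlike softmax, where scores compete across experts.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]
    # Indices of the top_k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = [scores[i] for i in top]
    if norm_topk_prob:
        # Renormalize so the selected experts' weights sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return top, weights

# Toy input: 128 routed experts with deterministic pseudo-logits (illustrative only).
logits = [math.sin(0.37 * i) for i in range(128)]
experts, weights = route_token(logits)
print("selected experts:", experts)
print("weight sum:", round(sum(weights), 6))
```

With `norm_topk_prob=False` the raw sigmoid scores would be used as mixing weights, so their sum would depend on the token rather than being fixed at 1.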
| --- | |
| ## Quantization Details | |
| Quantization was performed using [llm-compressor](https://github.com/vllm-project/llm-compressor) with a **MoE-aware AWQ** recipe. | |
| **Method:** AWQ uses activation statistics from a calibration dataset to find per-channel scales that protect salient weights, which reduces the error introduced by weight quantization. | |
| **Recipe highlights:** | |
| - `scheme`: W4A16 (INT4 weights, BF16 activations) | |
| - `group_size`: 128 | |
| - `n_grid`: 20 (search resolution for AWQ scale optimization) | |
| - `duo_scaling`: True | |
| - Smooth mappings cover all MoE expert layers (layers 1–47) independently, plus attention and MLP projections | |
| - Layer 0 (dense MLP) and `lm_head` are excluded from quantization | |
| - Gate weight tensors are excluded from quantization | |
| The full recipe is available in `recipe.yaml`. | |
| **Calibration dataset:** [`neuralmagic/LLM_compression_calibration`](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) (512 samples, sequence length 2048) | |
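To make the `group_size: 128` setting concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It assumes symmetric quantization with one scale per group, which may differ from what `recipe.yaml` actually specifies, and it omits the AWQ scale search entirely. The helper names `quantize_group`, `quantize_row`, and `dequantize_row` are hypothetical and do not belong to llm-compressor.

```python
import math

def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale

def quantize_row(row, group_size=128):
    """Split a weight row into groups and quantize each with its own scale."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]

def dequantize_row(groups):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [q * scale for codes, scale in groups for q in codes]

# Toy row of 256 weights: the second group has a much smaller magnitude,
# so its own scale keeps its rounding error proportionally small.
row = [math.sin(0.1 * i) * (1.0 if i < 128 else 0.05) for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print("groups:", len(groups), "scales:", [round(s, 5) for _, s in groups])
print("max abs error:", round(max_err, 5))
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales; with 128 weights per group and a 16-bit scale, the overhead is 16/128 = 0.125 bits per weight.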
| --- | |
| ## Usage | |
| ### vLLM (Recommended) | |
| Install vLLM (≥0.6.0 recommended for compressed-tensors support): | |
| ```bash | |
| pip install vllm | |
| ``` | |
| ```python | |
| from vllm import LLM, SamplingParams | |
| llm = LLM( | |
|     model="Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128", | |
|     max_model_len=8192, | |
|     trust_remote_code=True,   # K-EXAONE uses custom modeling code | |
|     tensor_parallel_size=4,   # adjust to the number of GPUs available | |
| ) | |
| sampling_params = SamplingParams( | |
|     temperature=0.6, | |
|     top_p=0.9, | |
|     max_tokens=512, | |
| ) | |
| tokenizer = llm.get_tokenizer() | |
| prompts = [ | |
|     "What is the capital of South Korea?", | |
|     "Explain the difference between MoE and dense transformer models.", | |
| ] | |
| # Wrap each prompt in the model's chat template before generation | |
| formatted_prompts = [ | |
|     tokenizer.apply_chat_template( | |
|         [{"role": "user", "content": p}], | |
|         tokenize=False, | |
|         add_generation_prompt=True, | |
|     ) | |
|     for p in prompts | |
| ] | |
| outputs = llm.generate(formatted_prompts, sampling_params) | |
| for prompt, output in zip(prompts, outputs): | |
|     print(f"Prompt : {prompt}") | |
|     print(f"Response: {output.outputs[0].text.strip()}") | |
| ``` | |
| ### Transformers | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| import torch | |
| model_id = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_id, | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
|     trust_remote_code=True, | |
| ) | |
| # Korean prompt: "What is the capital of Korea?" | |
| messages = [{"role": "user", "content": "한국의 수도는 어디인가요?"}] | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, tokenize=True, add_generation_prompt=True, | |
|     return_dict=True, return_tensors="pt", | |
| ).to(model.device) | |
| # do_sample=True is required for temperature and top_p to take effect | |
| output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9) | |
| # Decode only the newly generated tokens, skipping the prompt | |
| print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)) | |
| ``` | |
| --- | |
| ## Hardware Requirements | |
| | Precision | Min VRAM | | |
| |---|---| | |
| | This model (W4A16) | ~120 GB | | |
| | Original BF16 | ~480 GB | | |
| Tested on: NVIDIA B200 (180 GB HBM3e). | |
| For multi-GPU inference, set `tensor_parallel_size` in vLLM to the number of GPUs. | |
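The VRAM figures in the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only and rests on simplifying assumptions that are not taken from the checkpoint: every parameter is quantized, each group of 128 weights carries one 16-bit scale, and there are no zero points. Layers left unquantized, the KV cache, and activations all add to the real footprint.

```python
TOTAL_PARAMS = 236e9   # ~236B total parameters (from the model details table)
GROUP_SIZE = 128       # quantization group size

def weight_gb(params, bits_per_weight):
    """Weight storage in decimal gigabytes for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9

# BF16: 16 bits per weight.
bf16_gb = weight_gb(TOTAL_PARAMS, 16)

# W4A16: 4-bit weights plus one 16-bit scale shared by each group of 128 weights
# (assumption: symmetric scheme, no zero points).
w4_bits = 4 + 16 / GROUP_SIZE
w4_gb = weight_gb(TOTAL_PARAMS, w4_bits)

print(f"BF16 weights : ~{bf16_gb:.0f} GB")
print(f"W4A16 weights: ~{w4_gb:.0f} GB ({w4_bits} bits per weight)")
```

Under these assumptions the estimate is about 472 GB for BF16 and about 122 GB for W4A16, which is in line with the rounded values in the table.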
| --- | |
| ## Files | |
| | File | Description | | |
| |---|---| | |
| | `model-00001-of-00003.safetensors` | Model weights shard 1/3 | | |
| | `model-00002-of-00003.safetensors` | Model weights shard 2/3 | | |
| | `model-00003-of-00003.safetensors` | Model weights shard 3/3 | | |
| | `model.safetensors.index.json` | Weight shard index | | |
| | `config.json` | Model config with quantization metadata | | |
| | `recipe.yaml` | llm-compressor AWQ recipe used for quantization | | |
| | `tokenizer.json` | Tokenizer | | |
| | `tokenizer_config.json` | Tokenizer config | | |
| | `chat_template.jinja` | Chat template | | |
| | `generation_config.json` | Default generation config | | |
| --- | |
| ## License | |
| This model inherits the license of the base model [`LGAI-EXAONE/K-EXAONE-236B-A23B`](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B). Please refer to the original model page for license details. | |
| --- | |
| ## Citation | |
| If you use this model, please cite the original K-EXAONE work: | |
| ``` | |
| @misc{k-exaone-236b, | |
| title = {K-EXAONE-236B-A23B}, | |
| author = {LG AI Research}, | |
| year = {2025}, | |
| url = {https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B} | |
| } | |
| ``` | |
| Quantization produced by [Hyun9junn](https://huggingface.co/Hyun9junn) using [llm-compressor](https://github.com/vllm-project/llm-compressor). | |