Hyun9junn committed on
Commit
1162da9
·
verified ·
1 Parent(s): 8013ea7

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -90,7 +90,8 @@ The EXAONE specific MoE-aware AWQ recipe was developed in [SqueezeBits/llm-compr
90
  | This model (W4A16) | ~120 GB |
91
  | Original BF16 | ~480 GB |
92
 
93
- **CUDA / driver requirement:** vLLM 0.19.0 wheels are compiled with the CUDA 12.9 toolkit, so you need **CUDA ≥ 12.9** (NVIDIA driver ≥ 575.x) to run without issues. If your driver is older, follow the monkey-patch workaround in the inference section below.
 
94
 
95
  ---
96
 
@@ -126,14 +127,15 @@ from vllm import LLM, SamplingParams
126
 
127
  MODEL_PATH = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"
128
 
129
- # ── Monkey-patch required if NVIDIA driver < 575.x (CUDA < 12.9) ─────────────
130
- # vLLM 0.19.0 is compiled with CUDA 12.9; older drivers cannot JIT-compile its
 
131
  # PTX and crash with "cudaErrorUnsupportedPtxVersion" during weight loading.
132
  # This patch forces vLLM to use WNA16MoEMethod (no Marlin CUDA kernels) instead
133
  # of MarlinMoEMethod. Safe to keep even after upgrading the driver.
134
  import vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe as _ct_moe
135
  _ct_moe.check_moe_marlin_supports_layer = lambda *args, **kwargs: False
136
- # ─────────────────────────────────────────────────────────────────────────────
137
 
138
 
139
  def main():
 
90
  | This model (W4A16) | ~120 GB |
91
  | Original BF16 | ~480 GB |
92
 
93
+ **CUDA / driver requirement:** for H200(sm_90), vLLM 0.19.0 wheels are compiled with the CUDA 12.9 toolkit, so you need **CUDA ≥ 12.9** (NVIDIA driver ≥ 575.x) to run without issues. If your driver is older, follow the monkey-patch workaround in the inference section below.
94
+ * It is okay for A100(sm_80)
95
 
96
  ---
97
 
 
127
 
128
  MODEL_PATH = "Hyun9junn/K-EXAONE-236B-A23B-W4A16-G128"
129
 
130
+
131
+ """# ── Monkey-patch required if NVIDIA driver < 575.x (CUDA < 12.9) ─────────────
132
+ # for H200, vLLM 0.19.0 is compiled with CUDA 12.9; older drivers cannot JIT-compile its
133
  # PTX and crash with "cudaErrorUnsupportedPtxVersion" during weight loading.
134
  # This patch forces vLLM to use WNA16MoEMethod (no Marlin CUDA kernels) instead
135
  # of MarlinMoEMethod. Safe to keep even after upgrading the driver.
136
  import vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe as _ct_moe
137
  _ct_moe.check_moe_marlin_supports_layer = lambda *args, **kwargs: False
138
+ # ─────────────────────────────────────────────────────────────────────────────"""
139
 
140
 
141
  def main():
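
The new revision disables the patch by wrapping it in a string literal, so anyone on an H200 with an older driver has to re-enable it by hand. The rule the updated README states (the patch matters on sm_90 with a driver below 575.x, and A100/sm_80 is fine) could instead be checked in code. The following is only a sketch of that rule: `needs_marlin_patch` is a hypothetical helper, not part of vLLM or this repository, and it assumes that only sm_90 is affected because that is the only architecture the README names.

```python
from __future__ import annotations


def needs_marlin_patch(compute_capability: tuple[int, int], driver_version: str) -> bool:
    """Hypothetical helper: True when the README's workaround would apply.

    Assumption taken from the README text: the problem shows up on
    sm_90 (H200) when the NVIDIA driver major version is below 575,
    while sm_80 (A100) runs without the patch.
    """
    # Driver strings look like "570.86.10"; only the major part is compared.
    driver_major = int(driver_version.split(".")[0])
    is_sm90 = compute_capability == (9, 0)
    return is_sm90 and driver_major < 575


if __name__ == "__main__":
    print(needs_marlin_patch((9, 0), "570.86.10"))   # H200, older driver
    print(needs_marlin_patch((9, 0), "575.51.03"))   # H200, new enough driver
    print(needs_marlin_patch((8, 0), "535.104.05"))  # A100
```

In the README script, the two patch lines could then sit under `if needs_marlin_patch(torch.cuda.get_device_capability(), driver_version):`. Here `driver_version` is a placeholder for a string obtained separately, for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.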