Etelis committed on
Commit
d1281df
·
verified ·
1 Parent(s): bfde6fa

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ license: other
+ license_name: deepseek-license
+ license_link: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
+ base_model: deepseek-ai/DeepSeek-V2-Lite
+ tags:
+ - deepseek
+ - deepseek_v2
+ - fp8
+ - quantized
+ - compressed-tensors
+ - llmcompressor
+ library_name: transformers
+ ---
+
+ # DeepSeek-V2-Lite-FP8-BLOCK-padded
+
+ This model is an **FP8 block-quantized** version of [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) with **padding support** for dimensions that are not divisible by the block size.
+
+ ## Overview
+
+ - **Base Model**: deepseek-ai/DeepSeek-V2-Lite (16B parameters)
+ - **Quantization**: FP8_BLOCK (128x128 block structure)
+ - **Purpose**: Demonstrates FP8 block quantization with weight padding for models whose dimensions are not evenly divisible by the block size
+
+ ## Key Feature: Block Quantization Padding
+
+ DeepSeek-V2-Lite has `intermediate_size=10944`, which is not divisible by the block size of 128 (10944 / 128 = 85.5). This model uses **weight padding** to handle this:
+
+ - Original `intermediate_size`: 10944
+ - Padded `intermediate_size`: 11008 (86 × 128)
+
+ The padding is applied during quantization, and `config.json` reflects the padded dimensions for vLLM compatibility.
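The padded size above follows from rounding the dimension up to the next multiple of the block size. A minimal sketch of that arithmetic, where `pad_to_block` is a hypothetical helper written for illustration and not a function from llm-compressor or compressed-tensors:

```python
BLOCK_SIZE = 128  # block edge length stated in this model card


def pad_to_block(dim: int, block: int = BLOCK_SIZE) -> int:
    """Round dim up to the nearest multiple of block (hypothetical helper)."""
    return -(-dim // block) * block  # ceiling division, then scale back up


original = 10944
padded = pad_to_block(original)

print(padded)                # 11008
print(padded // BLOCK_SIZE)  # 86 blocks along this dimension
print(padded - original)     # 64 padding elements added
```

A dimension that is already a multiple of 128 is returned unchanged, so the same helper can be applied to every weight dimension without special cases.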
+
+ ## Usage with vLLM
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="Etelis/DeepSeek-V2-Lite-FP8-BLOCK-padded",
+     trust_remote_code=True,
+     tensor_parallel_size=1,
+ )
+
+ sampling_params = SamplingParams(max_tokens=100, temperature=0.7)
+ output = llm.generate(["Hello, world!"], sampling_params)
+ print(output[0].outputs[0].text)
+ ```
+
+ **Requirements**: An NVIDIA GPU with compute capability 8.9 or higher (Ada Lovelace, Hopper such as H100, or newer) for FP8 block quantization support.
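One way to check this requirement before loading the model is to compare the device's compute capability against the 8.9 threshold stated above. The sketch below uses a hypothetical `supports_fp8` helper; on a machine with PyTorch and a CUDA device, the tuple could be obtained from `torch.cuda.get_device_capability()`:

```python
from typing import Tuple

MIN_FP8_CAPABILITY = (8, 9)  # threshold taken from the requirement above


def supports_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the threshold."""
    return capability >= MIN_FP8_CAPABILITY  # tuples compare element by element


# With PyTorch installed and a GPU present, the capability could be read via:
#   import torch
#   capability = torch.cuda.get_device_capability()
print(supports_fp8((9, 0)))  # H100 (Hopper): True
print(supports_fp8((8, 0)))  # A100 (Ampere): False
```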
+
+ ## Quantization Recipe
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ MODEL_ID = "deepseek-ai/DeepSeek-V2-Lite"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID,
+     torch_dtype="auto",
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+
+ # FP8 block quantization - ignore layers with composite dimensions
+ recipe = QuantizationModifier(
+     targets="Linear",
+     scheme="FP8_BLOCK",
+     ignore=["lm_head", "re:.*kv_a_proj_with_mqa.*"],
+ )
+
+ oneshot(model=model, recipe=recipe)
+
+ model.save_pretrained("DeepSeek-V2-Lite-FP8-BLOCK-padded")
+ tokenizer.save_pretrained("DeepSeek-V2-Lite-FP8-BLOCK-padded")
+ ```
+
+ ## Created With
+
+ - [llm-compressor](https://github.com/vllm-project/llm-compressor) (with padding support)
+ - [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) (PR #547)
+
+ ## Ignored Layers
+
+ - `lm_head`: Not quantized (standard practice)
+ - `kv_a_proj_with_mqa`: Has composite dimensions (512 + 64 = 576) that cannot be safely padded
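In the recipe's `ignore` list, the `re:` prefix marks an entry as a regular expression rather than a literal module name. The sketch below illustrates that convention with a hypothetical `is_ignored` helper and assumed module names; it is an illustration only, not the compressed-tensors implementation:

```python
import re
from typing import List

IGNORE = ["lm_head", "re:.*kv_a_proj_with_mqa.*"]  # patterns from the recipe


def is_ignored(module_name: str, patterns: List[str] = IGNORE) -> bool:
    """Hypothetical matcher: 're:' entries are regexes, others are exact names."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Module names below are assumed for illustration.
print(is_ignored("lm_head"))                                      # True
print(is_ignored("model.layers.0.self_attn.kv_a_proj_with_mqa"))  # True
print(is_ignored("model.layers.0.mlp.gate_proj"))                 # False
print(576 % 128)  # 64, so the 512 + 64 projection is not block-aligned
```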
+
+ ## License
+
+ This model inherits the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL) from the base model.