---
license: other
license_name: deepseek-license
license_link: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
base_model: deepseek-ai/DeepSeek-V2-Lite
library_name: transformers
tags:
- deepseek
- mla
- moe
- nvfp4
- fp4
- quantized
- vllm
pipeline_tag: text-generation
model-index:
- name: DeepSeek-V2-Lite-NVFP4
  results: []
---

# DeepSeek-V2-Lite-NVFP4

NVFP4 (W4A4) quantized version of [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite), quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor).

## Model Details

| | |
|---|---|
| **Base model** | deepseek-ai/DeepSeek-V2-Lite (15.7B params) |
| **Architecture** | DeepseekV2ForCausalLM (MLA attention + MoE) |
| **Quantization** | NVFP4: 4-bit floating-point weights and activations |
| **Format** | compressed-tensors (nvfp4-pack-quantized) |
| **Size** | ~8.9 GB (3.5x compression from BF16) |
| **Group size** | 16 |
| **Scale dtype** | float8_e4m3fn |

## Quantization Details

- **Method**: Post-training quantization (PTQ) via `llm-compressor` `oneshot`
- **Scheme**: NVFP4, with weights and input activations quantized to 4-bit float
- **Calibration**: 20 samples from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-10BT)
- **Ignored layers**: `lm_head` (kept in original precision)
- **Scales**: per-tensor global scale (FP32) + per-group local scale (FP8, group size 16)

## Usage with vLLM

Requires a GPU with NVFP4 tensor core support (NVIDIA Blackwell, SM100+).

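
Before serving, it can help to confirm that the local GPU meets the SM100+ requirement stated above. The following is a minimal sketch, not part of vLLM: the helper names `meets_sm100` and `local_gpu_capability` are hypothetical, the threshold is taken directly from this card's "SM100+" wording, and PyTorch is assumed to be installed for the live check. It does not verify that a particular vLLM build supports NVFP4 on the detected device.

```python
def meets_sm100(capability):
    """Return True if a (major, minor) compute capability is SM100 or newer.

    Assumption: "SM100+" in this card means compute capability major >= 10.
    """
    major, _minor = capability
    return major >= 10


def local_gpu_capability():
    """Return the first CUDA device's (major, minor), or None if unavailable."""
    try:
        import torch  # optional: only needed for the live check
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    elif meets_sm100(cap):
        print(f"SM{cap[0]}{cap[1]}: meets the SM100+ requirement.")
    else:
        print(f"SM{cap[0]}{cap[1]}: below SM100; NVFP4 tensor cores are not expected.")
```
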
```bash
vllm serve carlyou/DeepSeek-V2-Lite-NVFP4 \
  --trust-remote-code \
  --max-model-len 2048
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="carlyou/DeepSeek-V2-Lite-NVFP4",
    trust_remote_code=True,
    max_model_len=2048,
)
output = llm.generate("Hello, world!", SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

## Intended Use

This model is primarily intended for benchmarking and testing NVFP4 quantization support in vLLM, particularly MLA attention + quantization fusion patterns on Blackwell GPUs.

## Limitations

- Requires Blackwell GPU (SM100+) for FP4 tensor core acceleration
- Quantization may degrade output quality compared to FP8 or BF16 versions
- Not evaluated on standard benchmarks; use for testing and benchmarking only
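
## Appendix: NVFP4 Scaling Sketch

The two-level scaling listed under Quantization Details (one FP32 global scale per tensor plus one FP8 local scale per group of 16) can be illustrated with a small pure-Python quantize-then-dequantize round trip. This is a sketch under stated assumptions, not the code used to produce this checkpoint: the function names are hypothetical, the global-scale formula `448 * 6 / amax` is one convention assumed here, the local scale is kept as a Python float rather than rounded to FP8, and ties are broken toward the smaller magnitude rather than by a hardware rounding mode. The E2M1 value grid itself is the standard set of FP4 magnitudes.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
GROUP_SIZE = 16


def round_to_fp4(x):
    """Snap x to the nearest E2M1 value (simplified tie-breaking)."""
    magnitude = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(magnitude, x)


def fake_quantize(values):
    """Quantize and dequantize a flat list whose length is a multiple of 16."""
    if len(values) % GROUP_SIZE:
        raise ValueError("length must be a multiple of the group size")
    tensor_amax = max(abs(v) for v in values)
    if tensor_amax == 0.0:
        return list(values)
    # Assumed convention: the global scale maps the tensor maximum onto the
    # largest value expressible as (FP8 local scale) * (FP4 value).
    global_scale = FP8_E4M3_MAX * FP4_MAX / tensor_amax
    restored = []
    for start in range(0, len(values), GROUP_SIZE):
        group = values[start:start + GROUP_SIZE]
        group_amax = max(abs(v) for v in group)
        # In the real format this is stored as FP8 E4M3; no FP8 rounding here.
        local_scale = global_scale * group_amax / FP4_MAX
        step = local_scale / global_scale  # effective per-group scale
        if step == 0.0:
            restored.extend(0.0 for _ in group)
            continue
        restored.extend(round_to_fp4(v / step) * step for v in group)
    return restored


weights = [i / 8 for i in range(-16, 16)]  # two groups of 16 values
restored = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {worst:.4f}")
```

Because the widest gap on the E2M1 grid is between 4 and 6, the round-trip error inside a group is bounded by one sixth of that group's largest magnitude in this sketch, which is why small groups with their own scale help when a tensor contains outliers.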