---
license: other
license_name: deepseek-license
license_link: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
base_model: deepseek-ai/DeepSeek-V2-Lite
library_name: transformers
tags:
  - deepseek
  - mla
  - moe
  - nvfp4
  - fp4
  - quantized
  - vllm
pipeline_tag: text-generation
model-index:
  - name: DeepSeek-V2-Lite-NVFP4
    results: []
---

# DeepSeek-V2-Lite-NVFP4

An NVFP4 (W4A4) quantized version of [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite), produced with [llm-compressor](https://github.com/vllm-project/llm-compressor).

## Model Details

| | |
|---|---|
| **Base model** | deepseek-ai/DeepSeek-V2-Lite (15.7B params) |
| **Architecture** | DeepseekV2ForCausalLM (MLA attention + MoE) |
| **Quantization** | NVFP4 — 4-bit floating point weights and activations |
| **Format** | compressed-tensors (nvfp4-pack-quantized) |
| **Size** | ~8.9 GB (3.5x compression from BF16) |
| **Group size** | 16 |
| **Scale dtype** | float8_e4m3fn |
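
The size and compression figures in the table can be sanity-checked with a little arithmetic. The sketch below is a rough estimate only: it assumes 1 GB = 1e9 bytes and that every parameter is quantized, which ignores `lm_head`, embeddings, and the per-tensor global scales.

```python
PARAMS = 15.7e9        # total parameter count from the table
BF16_BYTES = 2         # bytes per BF16 weight
GROUP_SIZE = 16        # weights sharing one FP8 local scale
QUANT_SIZE_GB = 8.9    # reported checkpoint size

bf16_gb = PARAMS * BF16_BYTES / 1e9
ratio = bf16_gb / QUANT_SIZE_GB

# 4 bits per weight plus one 8-bit scale amortized over each 16-weight group
bits_per_weight = 4 + 8 / GROUP_SIZE
ideal_gb = PARAMS * bits_per_weight / 8 / 1e9

print(f"BF16 size: {bf16_gb:.1f} GB")
print(f"Compression ratio: {ratio:.2f}x")
print(f"Idealized NVFP4 size: {ideal_gb:.2f} GB at {bits_per_weight} bits/weight")
```

The idealized figure comes out slightly below the reported size, which is consistent with some layers being kept in their original precision.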

## Quantization Details

- **Method**: Post-training quantization (PTQ) via `llm-compressor` `oneshot`
- **Scheme**: NVFP4 — weights and input activations quantized to 4-bit float
- **Calibration**: 20 samples from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-10BT)
- **Ignored layers**: `lm_head` (kept in original precision)
- **Scales**: per-tensor global scale (FP32) + per-group local scale (FP8, group size 16)
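
To illustrate the two-level scaling listed above, here is a minimal NumPy sketch of NVFP4-style fake quantization (quantize, then dequantize). It is not the llm-compressor implementation: the function names are hypothetical, the local scales are kept in float32 instead of being rounded to FP8, and the global-scale convention (mapping the largest local scale to the FP8 E4M3 maximum of 448) is an assumption made for illustration.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
GROUP_SIZE = 16


def round_to_fp4(x):
    """Round each value to the nearest FP4 E2M1 magnitude, keeping the sign."""
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]


def fake_quantize_nvfp4(w):
    """Quantize and dequantize a tensor whose size is a multiple of 16."""
    groups = np.asarray(w, dtype=np.float32).reshape(-1, GROUP_SIZE)

    # Per-tensor global scale (assumed convention): chosen so that the
    # largest local scale lands on the FP8 E4M3 maximum.
    tensor_amax = float(np.abs(groups).max())
    global_scale = tensor_amax / (FP4_MAX * FP8_E4M3_MAX) if tensor_amax > 0 else 1.0

    # Per-group local scale, expressed relative to the global scale.
    # A real kernel would round this to float8_e4m3fn; here it stays float32.
    group_amax = np.abs(groups).max(axis=1, keepdims=True)
    local_scale = group_amax / FP4_MAX / global_scale
    local_scale = np.where(local_scale > 0, local_scale, 1.0)

    scale = local_scale * global_scale
    q = round_to_fp4(groups / scale)          # values on the FP4 grid
    return (q * scale).reshape(np.shape(w))   # dequantized approximation


rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
w_hat = fake_quantize_nvfp4(w)
print("max abs error:", np.abs(w - w_hat).max())
```

Because the widest gap on the FP4 grid is between 4 and 6, the rounding error within a group is bounded by one sixth of that group's largest magnitude in this sketch.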

## Usage with vLLM

Requires a GPU with NVFP4 tensor core support (NVIDIA Blackwell, SM100+).

```bash
vllm serve carlyou/DeepSeek-V2-Lite-NVFP4 \
    --trust-remote-code \
    --max-model-len 2048
```

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint from the Hugging Face Hub
llm = LLM(
    model="carlyou/DeepSeek-V2-Lite-NVFP4",
    trust_remote_code=True,
    max_model_len=2048,
)

# generate() returns one result per prompt
outputs = llm.generate("Hello, world!", SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```
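
Before running either example, it can help to confirm that the local GPU meets the SM100+ requirement. The following is a small sketch, assuming PyTorch is installed alongside vLLM; the helper names are hypothetical, and the check only compares the CUDA compute capability against (10, 0) rather than probing for FP4 kernels.

```python
def meets_sm100(capability):
    """Return True if a (major, minor) compute capability is SM100 or newer."""
    major, minor = capability
    return (major, minor) >= (10, 0)


def current_gpu_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = current_gpu_capability()
if cap is None:
    print("No CUDA GPU detected")
elif meets_sm100(cap):
    print(f"SM{cap[0]}{cap[1]}: meets the SM100+ requirement")
else:
    print(f"SM{cap[0]}{cap[1]}: below SM100, FP4 tensor cores unavailable")
```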

## Intended Use

This model is primarily intended for benchmarking and testing NVFP4 quantization support in vLLM, particularly MLA attention + quantization fusion patterns on Blackwell GPUs.

## Limitations

- Requires Blackwell GPU (SM100+) for FP4 tensor core acceleration
- Quantization may degrade output quality compared to FP8 or BF16 versions
- Not evaluated on standard accuracy benchmarks; use for functional testing and performance benchmarking only