kylesayrs committed 29 days ago

Commit b52eefa (verified)

1 Parent(s): 116a6a9

Upload folder using huggingface_hub

Files changed (8)

config.json +3 -25
generation_config.json +1 -1
model-00001-of-00004.safetensors +2 -2
model-00002-of-00004.safetensors +2 -2
model-00003-of-00004.safetensors +2 -2
model-00004-of-00004.safetensors +2 -2
model.safetensors.index.json +2 -2
tokenizer_config.json +1 -1

config.json CHANGED

@@ -118,7 +118,7 @@
         },
         "output_activations": null,
         "targets": [
-          "re:.*attn.*(wgate|wkv|wo_a|wo_b|wq_a|wq_b|fused_wkv_wgate|fused_wqa_wkv|gate_up_proj)$"
+          "re:.*attn.*(fused_wqa_wkv|wq_b|wo_a|wo_b)$"
         ],
         "weights": {
           "actorder": null,
@@ -176,30 +176,7 @@
     },
     "format": "mixed-precision",
     "global_compression_ratio": null,
-    "ignore": [
-      "layers.2.attn.indexer.weights_proj",
-      "layers.4.attn.indexer.weights_proj",
-      "layers.6.attn.indexer.weights_proj",
-      "layers.8.attn.indexer.weights_proj",
-      "layers.10.attn.indexer.weights_proj",
-      "layers.12.attn.indexer.weights_proj",
-      "layers.14.attn.indexer.weights_proj",
-      "layers.16.attn.indexer.weights_proj",
-      "layers.18.attn.indexer.weights_proj",
-      "layers.20.attn.indexer.weights_proj",
-      "layers.22.attn.indexer.weights_proj",
-      "layers.24.attn.indexer.weights_proj",
-      "layers.26.attn.indexer.weights_proj",
-      "layers.28.attn.indexer.weights_proj",
-      "layers.30.attn.indexer.weights_proj",
-      "layers.32.attn.indexer.weights_proj",
-      "layers.34.attn.indexer.weights_proj",
-      "layers.36.attn.indexer.weights_proj",
-      "layers.38.attn.indexer.weights_proj",
-      "layers.40.attn.indexer.weights_proj",
-      "layers.42.attn.indexer.weights_proj",
-      "lm_head"
-    ],
+    "ignore": [],
     "kv_cache_scheme": null,
     "quant_method": "compressed-tensors",
     "quantization_status": "compressed",
@@ -234,3 +211,4 @@
   "v_head_dim": null,
   "vocab_size": 129280
 }
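
The effect of the narrower `targets` pattern can be checked with a small script. This is a sketch under stated assumptions: it assumes the `re:` prefix marks a regular expression that is applied to the module name with `re.match`, and the `layers.0.attn.*` module names are hypothetical, built only from the suffixes that appear in the two patterns. The former `ignore` entries are reconstructed from the removed lines.

```python
import re

# Target patterns copied from the diff, with the "re:" prefix stripped.
OLD_TARGET = r".*attn.*(wgate|wkv|wo_a|wo_b|wq_a|wq_b|fused_wkv_wgate|fused_wqa_wkv|gate_up_proj)$"
NEW_TARGET = r".*attn.*(fused_wqa_wkv|wq_b|wo_a|wo_b)$"


def is_targeted(pattern: str, module_name: str) -> bool:
    """Assumption: the pattern is matched against the module name with re.match."""
    return re.match(pattern, module_name) is not None


# Hypothetical module names, one per suffix named in the patterns.
suffixes = ("wq_a", "wq_b", "wkv", "wgate", "wo_a", "wo_b",
            "fused_wqa_wkv", "fused_wkv_wgate")
candidates = [f"layers.0.attn.{s}" for s in suffixes]

# Names matched by the old pattern but not by the new one.
dropped = [n for n in candidates
           if is_targeted(OLD_TARGET, n) and not is_targeted(NEW_TARGET, n)]
# Names the new pattern still matches.
kept = [n for n in candidates if is_targeted(NEW_TARGET, n)]

# The 22 entries removed from "ignore": even layers 2..42 plus lm_head.
removed_ignore = [f"layers.{i}.attn.indexer.weights_proj" for i in range(2, 43, 2)]
removed_ignore.append("lm_head")
ignored_but_targeted = [n for n in removed_ignore
                        if is_targeted(OLD_TARGET, n) or is_targeted(NEW_TARGET, n)]

print("no longer targeted:", dropped)
print("still targeted:", kept)
print("former ignore entries matched by either pattern:", ignored_but_targeted)
```

Under these assumptions, none of the removed `ignore` entries is matched by either version of this pattern. Whether other quantization groups in `config.json` target those modules is not visible in the hunks shown here.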

generation_config.json CHANGED

@@ -5,5 +5,5 @@
   "eos_token_id": 1,
   "temperature": 1.0,
   "top_p": 1.0,
-  "transformers_version": "5.7.0.dev0"
+  "transformers_version": "5.8.0.dev0"
 }

model-00001-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0cf9f0c2a476c8af9cdf2e57a6520e85aad9a326d693ee4295c133e5813c46f4
-size 50003512604
+oid sha256:ee89b216becf894d7e1b70e5f24c1a075745659780f16baf3106ba7946d58792
+size 50001268714
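
Each shard diff only changes a Git LFS pointer: an `oid` (SHA-256 digest) and a `size` in bytes. A downloaded shard can be checked against its pointer along the following lines. This is a minimal sketch, not part of the commit; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is the post-commit pointer for the first shard copied from the diff.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and digest both agree with the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        # Read in chunks so a 50 GB shard never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer for model-00001-of-00004.safetensors after this commit (from the diff).
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ee89b216becf894d7e1b70e5f24c1a075745659780f16baf3106ba7946d58792
size 50001268714
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

With a local copy of the shard, `matches_pointer(Path("model-00001-of-00004.safetensors"), pointer)` would report whether the download corresponds to this commit rather than its parent.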

model-00002-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0ad2fcbcad9c0e629884923f9b95d9d34bce718fa33fcee1281246adf1b12966
-size 50004060560
+oid sha256:137eb45600eaad7925aeee49b3d993cfbcdc0d14bc391c0c710bb1bcbf3ba1d8
+size 50001796128

model-00003-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:488404f19b3e204d6efa9f25eba27a81cba72d6ebbb52bbf8d3def678914e1b9
-size 50000748024
+oid sha256:4c7e600665a9f9942499f34384d7e7b3423fa60cba1594e2ad7aa20032364460
+size 50001411336

model-00004-of-00004.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:96816f731721c7ab0edea737029e21ec16a4cefa889a67240c9e49d50884f22f
-size 13900109888
+oid sha256:df4089b6034d2cbe3202855b633d6b6a927190f55f4bafe6b65e1b2b8d2b6a7b
+size 14207891688
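
The four pointer diffs can be summarized by their size changes. The sketch below only does arithmetic on the `size` values copied from the diffs; the dictionary name `SHARD_SIZES` is introduced here for illustration.

```python
# (old_size, new_size) in bytes for each shard, copied from the LFS pointers.
SHARD_SIZES = {
    "model-00001-of-00004.safetensors": (50003512604, 50001268714),
    "model-00002-of-00004.safetensors": (50004060560, 50001796128),
    "model-00003-of-00004.safetensors": (50000748024, 50001411336),
    "model-00004-of-00004.safetensors": (13900109888, 14207891688),
}

# Per-shard change and checkpoint totals before and after the commit.
deltas = {name: new - old for name, (old, new) in SHARD_SIZES.items()}
old_total = sum(old for old, _ in SHARD_SIZES.values())
new_total = sum(new for _, new in SHARD_SIZES.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+,} bytes")
print(f"total: {old_total:,} -> {new_total:,} ({new_total - old_total:+,} bytes)")
```

The checkpoint grows by 303,936,790 bytes in total, almost all of it in the last shard. That would be consistent with fewer attention projections being matched by the narrower `targets` pattern, though the diff alone does not confirm the cause.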

model.safetensors.index.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:01a21cbc495c19f8cdf3451828872e2c822b6fd1b5ab99463a22c3ffa689447e
-size 11755949
+oid sha256:125acfc024138d7fe7102fece5c3d609b015f2a3fed659f62164d6eb17ed60ba
+size 11744849

tokenizer_config.json CHANGED

@@ -3,7 +3,7 @@
   "bos_token": "<｜begin▁of▁sentence｜>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<｜end▁of▁sentence｜>",
-  "is_local": true,
+  "is_local": false,
   "legacy": true,
   "local_files_only": false,
   "model_max_length": 1048576,