Please quantize base model too

by Delta36652 - opened Jan 5, 2025

Discussion

Delta36652

Jan 5, 2025

DeepSeek-V3 base seems particularly interesting to try, as no provider serves it.

bullerwins

Owner Jan 5, 2025

I currently have the server busy computing the imatrix for the lower-bit quants, but I will get to it if it's not available by then (from bartowski et al.).

whatever1983

Jan 5, 2025 • edited Jan 5, 2025

@bullerwins
IQ2_M is the most interesting one. (There isn't any reason to go below IQ2_M.) Ironically, Q5_K_M and Q4_K_M also benefit from imatrix, but you quantized them statically first. If you had calculated the imatrix first, the perplexity of the near-perfect Q5_K_M would have been better. Oh well.
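For readers unfamiliar with the workflow being discussed: a static quant goes straight from the full-precision GGUF to the target type, while an imatrix quant first computes an importance matrix from calibration text and then passes it to the quantizer. Below is a minimal sketch of that ordering, assuming llama.cpp's `llama-imatrix` and `llama-quantize` tools are installed. The file names are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def imatrix_cmd(model, calibration_text, out="imatrix.dat"):
    """Build the command that computes an importance matrix from calibration text."""
    return ["llama-imatrix", "-m", model, "-f", calibration_text, "-o", out]


def quantize_cmd(model, out, qtype, imatrix=None):
    """Build the command that quantizes `model` to `qtype`.

    Without `imatrix` this is a static quant; with it, the quantizer
    uses the importance matrix to weight quantization error.
    """
    cmd = ["llama-quantize"]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    return cmd + [model, out, qtype]


# Hypothetical file names, for illustration only.
bf16 = "DeepSeek-V3-BF16.gguf"

steps = [
    # 1. Compute the importance matrix once (the slow step).
    imatrix_cmd(bf16, "calibration.txt"),
    # 2. Reuse it for every quant type, low-bit or not.
    quantize_cmd(bf16, "DeepSeek-V3-IQ2_M.gguf", "IQ2_M", imatrix="imatrix.dat"),
    quantize_cmd(bf16, "DeepSeek-V3-Q5_K_M.gguf", "Q5_K_M", imatrix="imatrix.dat"),
]

for step in steps:
    print(shlex.join(step))
```

Because the matrix is computed once and reused, the order only matters for when the quants can be published: static versions can go up immediately, imatrix versions only after step 1 finishes.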

bullerwins

Owner Jan 5, 2025

> @bullerwins
> IQ2_M is the most interesting one. Ironically, Q5_K_M and Q4_K_M also benefit from imatrix, but you quantized them statically first. If you had calculated the imatrix first, the perplexity of the near-perfect Q5_K_M would have been better. Oh well.

IQ2_M would be really interesting, yeah. It's the one most people would be able to run, and it should give the best bang for the buck. The imatrix takes a long time to compute, so I wanted to make the static versions first. I'll re-upload them once I have the importance matrix.
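As a rough illustration of why the lower-bit quants are the ones most people could run, file size scales with bits per weight. The sketch below assumes DeepSeek-V3's 671B total parameters and nominal bits-per-weight figures for each quant type. Those figures are approximations chosen for illustration, and real GGUF files will differ because tensors are quantized with mixed types.

```python
PARAMS = 671e9  # DeepSeek-V3 total parameter count

# Nominal bits per weight. These are assumed round figures, not
# measurements of the files in this repository.
BPW = {
    "BF16": 16.0,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "IQ2_M": 2.7,
}


def approx_size_gb(params, bits_per_weight):
    """Estimate file size in GB (10^9 bytes) from parameter count and bits per weight."""
    return params * bits_per_weight / 8 / 1e9


for name, bpw in BPW.items():
    print(f"{name:7s} ~{approx_size_gb(PARAMS, bpw):,.0f} GB")
```

Under these assumptions IQ2_M lands at a couple of hundred GB, versus well over a terabyte for BF16, which is the gap the comment above is pointing at.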

Delta36652

Jan 5, 2025

@whatever1983 2-bit may be too dumb.

deltanym

Jan 6, 2025

Seconding this, please do the base model!

guizpublic

Jan 8, 2025

Can't wait, you are doing super work buller!
