How to use from the llama-cpp-python library
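# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download one of the quantized files listed below from the Hub and load it.
llm = Llama.from_pretrained(
	repo_id="heath0xFF/VibeThinker-3B-GGUF",
	filename="VibeThinker-3B-Q4_K_M.gguf",
)

# messages must be a list of role/content dicts, not a bare string.
response = llm.create_chat_completion(
	messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])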

VibeThinker-3B GGUF

GGUF quantizations of WeiboAI/VibeThinker-3B, a Qwen2-based, 3B-parameter thinking model with a 131K-token context window.

Converted with llama.cpp's convert_hf_to_gguf.py script.

Available Quantizations

| File | Size | BPW | Description |
|---|---|---|---|
| VibeThinker-3B-F16.gguf | 5.8 GB | 16.00 | Full FP16 (reference) |
| VibeThinker-3B-Q8_0.gguf | 3.1 GB | 8.50 | Near-lossless 8-bit |
| VibeThinker-3B-Q5_K_M.gguf | 2.1 GB | 5.75 | High quality 5-bit |
| VibeThinker-3B-Q4_K_M.gguf | 1.8 GB | 4.99 | Great size/quality tradeoff |
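
As a rough sanity check on the table, file size and bits per weight (BPW) are related by size ≈ parameters × BPW / 8. The sketch below back-computes the parameter count implied by each row. It assumes the listed "GB" values are binary gigabytes (2^30 bytes), which the table does not state, and it ignores metadata overhead; the helper name `implied_params` is made up for this illustration.

```python
# Rows copied from the table: (file name, size in GB, bits per weight).
QUANTS = [
    ("VibeThinker-3B-F16.gguf", 5.8, 16.00),
    ("VibeThinker-3B-Q8_0.gguf", 3.1, 8.50),
    ("VibeThinker-3B-Q5_K_M.gguf", 2.1, 5.75),
    ("VibeThinker-3B-Q4_K_M.gguf", 1.8, 4.99),
]


def implied_params(size_gb: float, bpw: float, bytes_per_gb: int = 2**30) -> float:
    """Parameter count implied by a file size and an average bits-per-weight.

    Assumption: 1 GB = 2**30 bytes. Pass bytes_per_gb=10**9 for decimal GB.
    """
    total_bits = size_gb * bytes_per_gb * 8
    return total_bits / bpw


if __name__ == "__main__":
    for name, size_gb, bpw in QUANTS:
        billions = implied_params(size_gb, bpw) / 1e9
        print(f"{name}: ~{billions:.2f}B parameters")
```

Under either reading of "GB", every row lands close to 3 billion parameters, which agrees with the ~3B figure in the model details.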

Usage

llama.cpp

./llama-cli -m VibeThinker-3B-Q4_K_M.gguf -p "Hello!" -n 128

Chat Format

This model uses the Qwen2 chat format with thinking tags:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
<think>...reasoning...</think>
...response...
<|im_end|>
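
When working with raw completions rather than a chat API, the format above can be assembled and parsed by hand. The following is a minimal sketch, not the model's official template: `build_prompt` and `split_thinking` are hypothetical helper names, and the exact whitespace is inferred from the example above, so the chat template embedded in the GGUF file may differ in detail.

```python
import re


def build_prompt(messages):
    """Render role/content dicts into the chat format and open the assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def split_thinking(text):
    """Split a raw completion into (reasoning, answer).

    Returns an empty reasoning string if no <think>...</think> block is found.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    remainder = text[match.end():] if match else text
    answer = remainder.replace("<|im_end|>", "").strip()
    return reasoning, answer


if __name__ == "__main__":
    prompt = build_prompt([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ])
    print(prompt)
    print(split_thinking("<think>greet back</think>\nHi there!\n<|im_end|>"))
```

Separating the reasoning from the answer this way makes it easy to show users only the final response while keeping the thinking text for logging.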

Model Details

  • Architecture: Qwen2ForCausalLM
  • Parameters: ~3B
  • Layers: 36
  • Hidden size: 2048
  • Heads: 16 (2 KV heads)
  • Context: 131,072 tokens
  • Vocab: 151,936
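
The layer and head counts above determine how much memory the KV cache needs, which matters when approaching the full context length. Here is a back-of-the-envelope sketch. It assumes the head dimension equals hidden size divided by the number of heads (2048 / 16 = 128) and an unquantized 16-bit cache; neither is stated above, and the function name `kv_cache_bytes` is illustrative.

```python
# Values taken from the model details list.
N_LAYERS = 36
HIDDEN_SIZE = 2048
N_HEADS = 16
N_KV_HEADS = 2
MAX_CONTEXT = 131_072


def kv_cache_bytes(n_tokens: int, bytes_per_element: int = 2) -> int:
    """Estimate KV cache size in bytes for n_tokens of context.

    Assumptions: head_dim = HIDDEN_SIZE // N_HEADS, and each cached value
    takes bytes_per_element bytes (2 for a 16-bit cache).
    """
    head_dim = HIDDEN_SIZE // N_HEADS
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * head_dim * bytes_per_element
    return per_token * n_tokens


if __name__ == "__main__":
    for n_tokens in (4_096, 32_768, MAX_CONTEXT):
        gib = kv_cache_bytes(n_tokens) / 2**30
        print(f"{n_tokens:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache costs 36,864 bytes per token, or about 4.5 GiB at the full 131,072-token context, on top of the model weights. Having only 2 KV heads keeps this far smaller than it would be with 16.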

Model tree for heath0xFF/VibeThinker-3B-GGUF

  • Base model: Qwen/Qwen2.5-3B
  • Quantized: this model