How to use from Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf heath0xFF/VibeThinker-3B-GGUF:
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default heath0xFF/VibeThinker-3B-GGUF:
Run Hermes
hermes
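
If Hermes cannot reach the model, it may help to query the local server directly. The sketch below sends one chat request to the same base URL that Hermes is configured with, using only the Python standard library. It assumes the server exposes the usual OpenAI-style `/chat/completions` route under that base URL. The `MODEL` string is an illustrative label rather than a value confirmed by this card, and the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Same base URL that Hermes is pointed at in the configuration step.
BASE_URL = "http://127.0.0.1:8080/v1"
# Assumption: an illustrative model label; adjust it to match your server.
MODEL = "heath0xFF/VibeThinker-3B-GGUF"


def build_chat_request(prompt, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server from the first step to be running):
# print(ask("Hello!"))
```

A connection error from `ask` suggests the server is not listening on port 8080, which is worth ruling out before changing any Hermes settings.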

VibeThinker-3B GGUF

GGUF quantizations of WeiboAI/VibeThinker-3B, a Qwen2-based, 3B-parameter thinking model with a 131K-token context window.

Converted with llama.cpp convert_hf_to_gguf.py.

Available Quantizations

File                        Size    BPW    Description
VibeThinker-3B-F16.gguf     5.8 GB  16.00  Full FP16 (reference)
VibeThinker-3B-Q8_0.gguf    3.1 GB   8.50  Near-lossless 8-bit
VibeThinker-3B-Q5_K_M.gguf  2.1 GB   5.75  High quality 5-bit
VibeThinker-3B-Q4_K_M.gguf  1.8 GB   4.99  Great size/quality tradeoff
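
The Size and BPW (bits per weight) columns are related by a simple rule of thumb: file size is roughly parameter count times BPW divided by 8. The sketch below derives an implied parameter count from the F16 row and uses it to estimate the other sizes. It assumes 1 GB means 10^9 bytes and ignores file metadata, so the results are approximations; the function names are hypothetical.

```python
# (file, size in GB, bits per weight), copied from the table above.
QUANTS = [
    ("VibeThinker-3B-F16.gguf", 5.8, 16.00),
    ("VibeThinker-3B-Q8_0.gguf", 3.1, 8.50),
    ("VibeThinker-3B-Q5_K_M.gguf", 2.1, 5.75),
    ("VibeThinker-3B-Q4_K_M.gguf", 1.8, 4.99),
]


def implied_params(size_gb, bpw):
    """Parameter count implied by a file size and its bits per weight."""
    return size_gb * 1e9 * 8 / bpw


def estimated_size_gb(params, bpw):
    """Approximate file size in GB for a given parameter count and BPW."""
    return params * bpw / 8 / 1e9


# Use the FP16 reference file to estimate the parameter count.
params = implied_params(5.8, 16.00)

for name, listed_gb, bpw in QUANTS:
    print(f"{name}: listed {listed_gb} GB, estimated {estimated_size_gb(params, bpw):.2f} GB")
```

Under these assumptions the F16 row implies about 2.9 billion parameters, and the estimates for the other rows round to the listed sizes.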

Usage

llama.cpp

./llama-cli -m VibeThinker-3B-Q4_K_M.gguf -p "Hello!" -n 128

Chat Format

This model uses the Qwen2 chat format with thinking tags:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
<think>...reasoning...</think>
...response...
<|im_end|>
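
When working with raw completions rather than a chat endpoint, the template above has to be assembled by hand and the reasoning has to be separated from the final answer. The following is a minimal sketch of both steps, based only on the layout shown above. The helper names `build_prompt` and `split_thinking` are hypothetical, and the parsing assumes at most one `<think>` block per reply.

```python
import re


def build_prompt(user_message, system_message="You are a helpful assistant."):
    """Assemble a single-turn prompt that ends where the assistant reply begins."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Non-greedy match so only the first <think>...</think> span is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(completion):
    """Return (reasoning, answer) from a raw assistant completion."""
    text = completion.replace("<|im_end|>", "")
    match = _THINK_RE.search(text)
    if match is None:
        # No thinking block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

For example, `split_thinking("<think>2+2=4</think>\nFour.\n<|im_end|>")` returns `("2+2=4", "Four.")`.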

Model Details

  • Architecture: Qwen2ForCausalLM
  • Parameters: ~3B
  • Layers: 36
  • Hidden size: 2048
  • Heads: 16 (2 KV heads)
  • Context: 131,072 tokens
  • Vocab: 151,936
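
These figures also allow a rough estimate of KV cache memory, which matters when using the full context. The sketch below assumes the head dimension equals hidden size divided by the number of heads (2048 / 16 = 128) and that the cache stores 16-bit values. Both are assumptions rather than facts stated in this card, and a runtime may use a different cache precision.

```python
# Values from the Model Details list above.
LAYERS = 36
HIDDEN = 2048
HEADS = 16
KV_HEADS = 2
CONTEXT = 131072


def kv_cache_bytes(tokens, bytes_per_value=2):
    """Approximate KV cache size for a given number of cached tokens."""
    head_dim = HIDDEN // HEADS  # assumption: head_dim = hidden / heads
    # Factor 2 covers keys and values; only the KV heads are cached.
    per_token = 2 * LAYERS * KV_HEADS * head_dim * bytes_per_value
    return per_token * tokens


print(f"per token: {kv_cache_bytes(1)} bytes")
print(f"full context: {kv_cache_bytes(CONTEXT) / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 36,864 bytes per token, or about 4.5 GiB at the full 131,072-token context, which is more than any of the quantized files themselves.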