How to use from
Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF:BF16
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF:BF16
Run Hermes
hermes
Quick Links

Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF

Quantized GGUF versions of Fu01978/Llama-3.2-3B-MoE-4Expert for efficient local inference with llama.cpp and compatible tools.

Model Description

This repository contains GGUF quantizations of a 4-expert MoE model specializing in:

  • General chat & explanations
  • Code & programming
  • Creative writing
  • Mathematics

For full model details, see the original model card.

Available Files

Quant Type Size Use Case
BF16 19.1 GB Maximum quality, high VRAM
Q4_K_M 5.86 GB Best balance of quality/size

Quantization Details

  • BF16: Full precision GGUF format - identical quality to original model
  • Q4_K_M: 4-bit quantization with medium quality - recommended for most users

Usage

llama.cpp

# Download the model
huggingface-cli download Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf --local-dir .

# Run with llama.cpp
./llama-cli -m Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf -p "Write a Python function to reverse a string" -n 512

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-MoE-4Expert.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

output = llm(
    "Explain quantum entanglement in simple terms",
    max_tokens=512,
    temperature=0.7,
)
print(output['choices'][0]['text'])

Performance Notes

The Q4_K_M quantization provides excellent quality with minimal degradation compared to the original model while using ~65% less disk space and memory. Recommended for most use cases. The BF16 version maintains full original quality and is recommended if you have sufficient VRAM/RAM.

Conversion Details

  • Original Model: Fu01978/Llama-3.2-3B-MoE-4Expert
  • Conversion Tool: llama.cpp
  • Quantization Method: Q4_K_M via llama.cpp quantization — BF16 via llama.cpp convert_hf_to_gguf.py script

Acknowledgments

  • Original MoE model by Fu01978
  • GGUF conversion using llama.cpp
  • Base models: Meta AI (Llama 3.2), unsloth, prithivMLmods, DavidAU
Downloads last month
10
GGUF
Model size
10B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

4-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF

Quantized
(3)
this model

Collection including Fu01978/Llama-3.2-3B-MoE-4Expert-Q4_K_M-GGUF