Not-For-All-Audiences

Instructions to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF",
	filename="Qwen3-48B-12x4B-Super-Distill2-GATED-HERETIC-Q4_K_S.gguf",
)

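# Ask a question and keep the OpenAI-style response dict:
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# Print only the assistant's reply text:
print(response["choices"][0]["message"]["content"])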

Local Apps

llama.cpp

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Use Docker

docker model run hf.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

vLLM

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
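The curl call above can also be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the URL assumes the server is listening on port 8000 as in the curl example.

```python
import json
import urllib.request

# Values copied from the curl example above; the port is an assumption
# that matches that example and may differ on your machine.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   request = build_chat_request("What is the capital of France?")
#   print(send_chat_request(request))
```

Building and sending are kept separate so the payload can be inspected before any network call is made.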

Use Docker

docker model run hf.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Ollama
How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Ollama:
```
ollama run hf.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
```

Unsloth Studio

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF to start chatting

Pi

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Docker Model Runner:
```
docker model run hf.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M
```

Lemonade

How to use DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF-Q4_K_M

List all available models

lemonade list

WARNING "HERETIC" version: Unlocked. UNFILTERED. NSFW. Vivid prose. INTENSE. Visceral Details. Light to R-18 HORROR. Swearing. UNCENSORED... humor, romance, fun... and UNFILTERED TRUTH.

IMPORTANT: See section below on how to access experts directly to get full use from this model.

Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF

Savant Commander is a specialized MOE model that allows you to control which expert(s) are assigned to your use case(s) / prompt(s) ... directly (by name(s)), as opposed to having the "choices" made for you.

The model is composed of 12 DISTILLS (compressed 12x4B MOE) of top closed ( GPT5.1, OpenAI 120 GPT Oss, Gemini (3), Claude (2) ) and open source models ( Kimi V2, GLM, Deepseek, Command-A, JanV1 ) all in one.

This is the uncensored/abliterated version. Each model ("expert") was separately abliterated using "Heretic" [ https://github.com/p-e-w/heretic ]. Make sure you also read the section below on using abliterated models to get the most from this model.

256k Context, 2 experts activated.

You can also run it on CPU, or with partial off-load to GPU.

Ask it about Orbital Mechanics and prepare to be "schooled".

Fictional story? You will be amazed. (depending on which expert(s) you select)

Math? Coding?

This model does it all.

Non-Abliterated Versions

For the "normal version" ( non-abliterated version ) go here:

https://huggingface.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-GATED-12x-Closed-Open-Source-Distill-GGUF

For the "normal version" ( ungated ; not abliterated ) go here:

https://huggingface.co/DavidAU/Qwen3-48B-A4B-Deadpan-Savant-12x-Closed-Open-Source-Distill

HOW TO ACCESS the EXPERTS:

In your prompts simply add the name(s) of the model(s)/expert(s) you want assigned.

Here is the list [no quotes]:

"Gemini" [activates all 3 Gemini distills]
"Claude" [activates both Claude distills]
"JanV1"
"CommandA"
"OPENR1"
"GLM"
"Kimi"
"GPTOSS" [120B distill]
"GPT51"

To access groups use [no quotes]:

"AllAI" [all ais]
"Closed-AI" [only closed source]
"Open-AI" [only open source]

Access like:

Gemini, Tell me a horror story.

GLM and JanV1, write me a horror story.

Gemini: Tell me a horror story.

Note the name[s] must be in the prompt and/or the system role and can be located anywhere in the prompt / system role.

For best results, I suggest using the name(s) at the beginning as a "command" / "request":

GLM do ...

Using Gemini process this prompt:

However, using the name[s] in the prompt will work in most cases as that is what is being "scanned for" during "prompt processing".
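The naming convention above can be sketched as a small prompt builder. This is a minimal sketch: `build_expert_prompt` is a hypothetical helper, the name lists are copied from this card, and exact-case matching is an assumption (the card does not say whether the gating is case sensitive).

```python
# Expert and group names copied from the lists in this card.
EXPERT_NAMES = {
    "Gemini", "Claude", "JanV1", "CommandA", "OPENR1",
    "GLM", "Kimi", "GPTOSS", "GPT51",
}
GROUP_NAMES = {"AllAI", "Closed-AI", "Open-AI"}


def build_expert_prompt(names, request):
    """Prefix a request with one or more expert/group names, e.g. 'GLM and JanV1, ...'."""
    if not names:
        raise ValueError("At least one expert or group name is required.")
    known = EXPERT_NAMES | GROUP_NAMES
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"Unknown expert/group name(s): {unknown}")
    # Names go first, as a "command", which the card suggests for best results.
    return f"{' and '.join(names)}, {request}"


print(build_expert_prompt(["Gemini"], "Tell me a horror story."))
print(build_expert_prompt(["GLM", "JanV1"], "write me a horror story."))
```

Rejecting unknown names early avoids sending a prompt that silently falls back to the default routing.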

This model also has NEGATIVE gating to ensure other models not in use are ISOLATED. As a result generation will vary a lot depending on which model(s)/expert(s) you "name" to process your prompt(s).

You MAY want to increase the number of active experts in some cases from the default of 2 (see how below).

For trying the model out (example) - all experts, but one at a time:

"NAME, Tell me a horror story."

Use a different "name" per "new chat" - you will get different thought blocks, output etc etc - in some cases very different from each other.
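The "one name per new chat" trial can be sketched as follows. This is only an illustration: `trial_chats` is a hypothetical helper, and each returned message list is meant to be sent as its own fresh conversation to whichever backend you use.

```python
# Individual expert names copied from the list in this card.
EXPERT_NAMES = [
    "Gemini", "Claude", "JanV1", "CommandA", "OPENR1",
    "GLM", "Kimi", "GPTOSS", "GPT51",
]


def trial_chats(request):
    """Return one independent message list per expert, so no chat history is shared."""
    return [
        [{"role": "user", "content": f"{name}, {request}"}]
        for name in EXPERT_NAMES
    ]


for messages in trial_chats("Tell me a horror story."):
    # Each message list would be sent as a separate, new chat.
    print(messages[0]["content"])
```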

SUGGESTED SETTINGS to START:

Temp .7, top k 40, top p .95, min p .05, rep pen 1.05
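As a sketch, these starting values map onto the sampling keyword arguments of llama-cpp-python's `create_chat_completion`. The dictionary name `SUGGESTED_SAMPLING` and the helper `chat_with_suggested_settings` are hypothetical; `llm` is assumed to be a loaded `llama_cpp.Llama` instance such as the one created in the Libraries section.

```python
# Suggested starting values from this card, keyed by llama-cpp-python argument names.
SUGGESTED_SAMPLING = {
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.05,
}


def chat_with_suggested_settings(llm, prompt):
    """Send a single-turn chat using the suggested sampler settings."""
    return llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        **SUGGESTED_SAMPLING,
    )


# Usage (assumes `llm` was created with Llama.from_pretrained(...)):
#   response = chat_with_suggested_settings(llm, "GLM, Tell me a horror story.")
#   print(response["choices"][0]["message"]["content"])
```

Other front ends expose the same settings under slightly different names, so treat the keys as specific to llama-cpp-python.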

IMPORTANT: Using an "uncensored" (refusals removed) model VS trained "uncensored" model

Usually when you tell a model to generate horror, swearing or x-rated content, this is all you have to do to get said content type.

In the case of this model, it will not refuse your request, however it needs to be "pushed" a bit / directed a bit more in SOME CASES.

Although this model will generate x-rated content too, you likewise need to tell it to use "slang" (and include the terms you want) to get it to generate content at the "expected" level.

Without these added directive(s), the content can be "bland" by comparison to an "uncensored model" or model trained on uncensored content.

Roughly, the model tries to generate the content but the "default" setting(s) are so "tame" it needs a push to generate at expected graphic, cursing or explicit levels.

Even minimal direction (i.e., use these words to swear: x, y, z) will be enough to push the model to generate the requested content in the ahh... expected format.

IMPORTANT QUANTS:

Minimum quant: Q4_K_S (non-imatrix) or IQ3_M (imatrix); anything lower and the model will "snap".
Higher quants will result in much stronger performance.
Context window: 4-8k minimum; temp .7 [higher/lower is okay].
Do 2-3 regens, as each will be VERY DIFFERENT due to the model design.
You can use 1 expert or up to 12; tokens/second will drop the more you activate.

ENJOY.

DETAILS:

This is a DENSE MOE (12 X 4B) Mixture of Experts model, using the strongest Qwen3 4B DISTILL models available. Two experts are activated by default; however, you can activate up to all 12 experts if you need the extra "brainpower".

This allows you to run the model at 4, 8, 12, 16, 20, 24 and up to 48B "power levels" as needed.
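The "power level" figures are simple arithmetic: active experts times 4B each. A minimal sketch, assuming the nominal 4B-per-expert figure from this card (`power_level_b` is a hypothetical helper, and the result is a nominal label, not a measured parameter count):

```python
PARAMS_PER_EXPERT_B = 4   # nominal size of each expert, in billions (from this card)
TOTAL_EXPERTS = 12        # number of experts in the model (from this card)


def power_level_b(active_experts):
    """Nominal 'power level' in billions of parameters for a given expert count."""
    if not 1 <= active_experts <= TOTAL_EXPERTS:
        raise ValueError(f"active_experts must be between 1 and {TOTAL_EXPERTS}")
    return active_experts * PARAMS_PER_EXPERT_B


# Print the nominal level for every possible expert count.
for count in range(1, TOTAL_EXPERTS + 1):
    print(f"{count} expert(s): {power_level_b(count)}B")
```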

Even at 1 expert activated (4B parameters/mixed), this model is very strong.

This is a full "thinking" / "reasoning" model.

NOTE: Due to compression during the "MOEing" process, actual size of the model is SMALLER than a typical 48B model.

Meet the Team: Mixture of Experts Models

This model is comprised of the following 12 models ("the experts") (in full):

https://huggingface.co/janhq/Jan-v1-2509

IMPORTANT NOTE about this model list:

The listed models are the original "censored" / "non-heretic" versions. I abliterated/Heretic'ed all these models separately using Heretic V 1.1.0 [ https://github.com/p-e-w/heretic ]

Average Refusal Rate before de-censoring: 90/100 (or greater)

After: 12/100 (average) // KLD 0.05 (average; less than 1 is excellent, 0 is "perfect")

EXPERTS:

The mixture of experts is set at TWO experts, but you can use 2, 3, 4, 5, or 6...12

This "team" has a Captain (first listed model), and then all the team members contribute to the "token" choice billions of times per second. Note that the Captain contributes too.

Think of 2, 3 or 4 (or more) master chefs in the kitchen all competing to make the best dish for you.

This results in higher quality generation.

This also results in many cases in higher quality instruction following too.

That means the power of every model is available during instruction and output generation.

CHANGING THE NUMBER OF EXPERTS:

You can set the number of experts in LMStudio (https://lmstudio.ai) at the "load" screen and via other apps/llm apps by setting "Experts" or "Number of Experts".

For Text-Generation-Webui (https://github.com/oobabooga/text-generation-webui) you set the number of experts at the loading screen page.

For KoboldCpp (https://github.com/LostRuins/koboldcpp) Version 1.8+, on the load screen, click on "TOKENS"; you can set experts on this page, and then launch the model.

For server.exe / Llama-server.exe (Llamacpp - https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ) add the following to the command line to start the "llamacpp server" (CLI):

"--override-kv llama.expert_used_count=int:6"

(no quotes, where "6" is the number of experts to use)

FOR QWEN MODELS:

"--override-kv qwen3moe.expert_used_count=int:6" (where 6 is the number of experts per token).
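A full launch command with this override can be assembled as below. This is a sketch only: `llama_server_args` is a hypothetical helper, the `-hf` repo and quant tag are copied from the llama.cpp section of this card, and the 1 to 12 range check reflects the expert count stated here.

```python
import shlex

# Repo and quant tag copied from the llama.cpp instructions in this card.
REPO = "DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF:Q4_K_M"


def llama_server_args(num_experts, arch="qwen3moe"):
    """Build the llama-server argument list with an expert-count override."""
    if not 1 <= num_experts <= 12:
        raise ValueError("num_experts must be between 1 and 12")
    return [
        "llama-server",
        "-hf", REPO,
        "--override-kv", f"{arch}.expert_used_count=int:{num_experts}",
    ]


# Print a copy-pasteable command line; pass the list to subprocess.run to launch it.
print(shlex.join(llama_server_args(6)))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the command is later run with `subprocess.run`.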

When using an "API", you set "num_experts_used" in the JSON payload (this may be different for different back ends).
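A payload carrying this field might look like the sketch below. The key name `num_experts_used` is taken from this card; whether a given back end honors or ignores it is not guaranteed, so check your server's documentation. `build_payload` is a hypothetical helper.

```python
import json


def build_payload(prompt, num_experts_used=2):
    """Build a chat payload with the expert count field named in this card."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Field name from the card; support varies by back end.
        "num_experts_used": num_experts_used,
    }


# Show the JSON body that would be sent to the server.
print(json.dumps(build_payload("Kimi, explain orbital mechanics.", 6), indent=2))
```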

CREDITS:

Special thanks to all the model makers / creators listed above.

Please visit each repo above to see what model(s) contributed to each of the models above and/or to learn more about the models from the model makers.

Special credit goes to MERGEKIT, without you this project / model would not have been possible.

[ https://github.com/arcee-ai/mergekit ]

Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:

In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;

Set the "Smoothing_factor" to 1.5

: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"

: in text-generation-webui -> parameters -> lower right.

: In Silly Tavern this is called: "Smoothing"

NOTE: For "text-generation-webui"

-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)

Source versions (and config files) of my models are here:

https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be

OTHER OPTIONS:

Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")
If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.

Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers

This is a "Class 1" model:

For all settings and parameters used for this model (including specifics for its "class"), example generation(s), and the advanced parameters and samplers guide (which often addresses model issues and covers methods to improve performance for all use cases, including chat and roleplay), please see:

[ https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]

Example Generation:

2 experts, Temp .7, topk 40, top p .95, min p .05, rep pen 1.05,

QUANT: Q4_K_S, LM Studio.

See (bottom of the page):

https://huggingface.co/DavidAU/Qwen3-48B-A4B-Savant-Commander-GATED-12x-Closed-Open-Source-Distill-GGUF

Downloads last month: 1,909

GGUF

Model size: 34B params
Architecture: qwen3moe
Hardware compatibility (quantization levels): 3-bit, 4-bit, 5-bit, 6-bit, 8-bit

Model tree for DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored-GGUF

Base model: DavidAU/Qwen3-48B-A4B-Savant-Commander-Distill-12X-Closed-Open-Heretic-Uncensored
Quantized (5): this model