Shorter reasoning?

by ThijsL202 - opened Apr 23, 2025

Discussion

ThijsL202 • Apr 23, 2025 • edited Apr 23, 2025

Do you have any tips for shorter reasoning attempts? When response tokens are set to 512, it sometimes ends in shorter responses, but it doesn't really matter what length I pick. Setting it to 4096, it uses all 4096 tokens it can generate purely through reasoning, which is definitely not ideal for roleplaying, and it takes a long time.

Using KoboldCpp + SillyTavern (text completion).

ThijsL202 • Apr 23, 2025 • edited Apr 23, 2025

So far, this works great

Context template: Command R
Instruct template: Llama 3 Instruct

system prompt: CREATIVE SIMPLE [reasoning on]:

Enable deep thinking subroutine. You are an AI assistant developed by a world wide community of ai experts.

Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.

Temp: 0.6-1.2
Top K: 0
Top P: 1
Min P: 0.05
Rep Pen: 1.07 (0 range)

But then the output after </think> is just way too short: only ~150 of the 512 tokens get used.
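To put a number on how a reply is split between reasoning and answer, something like the following rough sketch could help. It assumes the reply follows the `<think>{reasoning}</think>{answer}` shape from the system prompt above; the function names are made up for illustration, and it counts words rather than tokens.

```python
def split_reply(reply: str) -> tuple[str, str]:
    """Split a reply into (reasoning, answer) around the <think> block."""
    open_tag, close_tag = "<think>", "</think>"
    start = reply.find(open_tag)
    if start == -1:
        # No think block at all: everything is answer.
        return "", reply.strip()
    end = reply.find(close_tag, start)
    if end == -1:
        # Block never closed (e.g. the token budget ran out mid-reasoning).
        return reply[start + len(open_tag):].strip(), ""
    reasoning = reply[start + len(open_tag):end].strip()
    answer = reply[end + len(close_tag):].strip()
    return reasoning, answer


def word_counts(reply: str) -> dict[str, int]:
    """Count whitespace-separated words in each part of the reply."""
    reasoning, answer = split_reply(reply)
    return {
        "reasoning_words": len(reasoning.split()),
        "answer_words": len(answer.split()),
    }


sample = "<think>Step one. Step two.</think>The tavern door creaks open."
print(word_counts(sample))
```

A reply that burns the whole budget on reasoning shows up as `answer_words == 0`, which makes that case easy to spot across several runs.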

DavidAU (Owner) • Apr 24, 2025 • edited Apr 24, 2025

Hey;

RE: 1st comment.
I hear what you are trying to do here; the issue is that the model needs some type of guidance on output length.

In "Response Guidelines", add:

Limit response length to 150-300 words.
OR
Your response should be vivid, expansive and detailed.

(Don't use "tokens"; it will not work correctly, because a token can be part of a word or a full word.)

Then try a few tests.
NOTE: Temp MAY have an effect here, so try 5 tests at temp 0.6 and 5 at temp 1.2 to check this.
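The 5-runs-per-temp check could be scripted roughly like this. It is only a sketch: `generate` is a hypothetical callable standing in for whatever backend you use (it is not a KoboldCpp or SillyTavern function), and `fake_generate` is a stub so the example runs on its own.

```python
from statistics import mean
from typing import Callable

LENGTH_DIRECTIVE = "Limit response length to 150-300 words."


def answer_words(reply: str) -> int:
    """Count words after </think>; 0 if a think block was opened but never closed."""
    if "<think>" in reply and "</think>" not in reply:
        return 0
    return len(reply.split("</think>", 1)[-1].split())


def compare_temps(
    generate: Callable[[str, str, float], str],
    system_prompt: str,
    user_prompt: str,
    temps: tuple[float, ...] = (0.6, 1.2),
    runs: int = 5,
) -> dict[float, dict[str, float]]:
    """Run the same prompt several times per temperature and summarize answer length."""
    prompt = f"{system_prompt}\n{LENGTH_DIRECTIVE}"
    results = {}
    for temp in temps:
        counts = [answer_words(generate(prompt, user_prompt, temp)) for _ in range(runs)]
        results[temp] = {"min": min(counts), "max": max(counts), "mean": mean(counts)}
    return results


def fake_generate(system_prompt: str, user_prompt: str, temperature: float) -> str:
    # Stub for illustration only: answer length simply scales with temperature.
    return "<think>plan</think>" + " ".join(["word"] * round(100 * temperature))


print(compare_temps(fake_generate, "Response Guidelines:", "Describe the tavern."))
```

Swap `fake_generate` for a real call to your backend and compare the min/max/mean word counts between the two temps.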

Another option (this relates to "#6" above, option "2"):
Look at the actual output: does it need more description? More info? Something else?
If so, alter the prompt, e.g. "vividly describe XYZ in detail". This will force the AI to use more tokens.
This is akin to a "prose directive".

The issue is the model will default to a "default prose style" which may be too sparse, and the result is "low token output".
I hope I am explaining that correctly.

SIDE NOTE:
Even when you direct / set up the output length, it will be variable.
For example:

If the output max is 4096 tokens and I "tell" the AI to only output 1000 words (after the think block), the range can be 600-1200 words... sometimes a lot longer.
Temp can be a factor, sometimes a really big one, but so can the prompt request.

ADDED:
Try Top K = 40 to 100
and Rep Pen range = 64 to 512.
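For anyone calling the backend directly instead of going through SillyTavern, the settings from this thread might look something like the sketch below. The field names (`max_length`, `temperature`, `top_k`, `top_p`, `min_p`, `rep_pen`, `rep_pen_range`) are my assumption of what KoboldCpp's generate endpoint accepts, so check them against your version's API docs. The specific values are just one pick from the ranges above, and nothing is sent anywhere.

```python
import json


def build_payload(
    prompt: str,
    temperature: float = 0.6,   # thread suggests testing 0.6 and 1.2
    top_k: int = 40,            # suggested range: 40 to 100
    rep_pen_range: int = 64,    # suggested range: 64 to 512
    max_length: int = 512,      # response token budget
) -> dict:
    """Build a generation request body; field names are assumed, not verified."""
    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
        "top_k": top_k,
        "top_p": 1.0,
        "min_p": 0.05,
        "rep_pen": 1.07,
        "rep_pen_range": rep_pen_range,
    }


payload = build_payload("Describe the tavern.")
print(json.dumps(payload, indent=2))
```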
