Instructions for using eousphoros/kappa-20b-131k-GGUF-Q8_0 with libraries and local apps.

Libraries

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="eousphoros/kappa-20b-131k-GGUF-Q8_0",
	filename="persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
# Run inference directly in the terminal:
llama-cli -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
# Run inference directly in the terminal:
llama-cli -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Use Docker

docker model run hf.co/eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

vLLM

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "eousphoros/kappa-20b-131k-GGUF-Q8_0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "eousphoros/kappa-20b-131k-GGUF-Q8_0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Ollama:
```
ollama run hf.co/eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
```

Unsloth Studio

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eousphoros/kappa-20b-131k-GGUF-Q8_0 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for eousphoros/kappa-20b-131k-GGUF-Q8_0 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for eousphoros/kappa-20b-131k-GGUF-Q8_0 to start chatting

Pi

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Docker Model Runner:
```
docker model run hf.co/eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0
```

Lemonade

How to use eousphoros/kappa-20b-131k-GGUF-Q8_0 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull eousphoros/kappa-20b-131k-GGUF-Q8_0:Q8_0

Run and chat with the model

lemonade run user.kappa-20b-131k-GGUF-Q8_0-Q8_0

List all available models

lemonade list

kappa_20b_131k (GGUF Q8_0)

Q8_0 quantized GGUF of kappa_20b_131k for use with llama.cpp and compatible inference engines.

Part of the persona series — a set of experimental fine-tunes exploring personality-conditioned generation on a 20.9B MoE base.

This one (kappa) is a full-parameter SFT run at 131K context on multi-turn conversations with tool calling and 9 distinct personas. It is built on OpenAI's GPT-OSS 20B base model and was trained on 4 desktop GPUs with torchtitan.

Files

| File | Quantization | BPW | Size |
| --- | --- | --- | --- |
| `persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf` through `persona_kappa_20b_Q8_0.gguf-00005-of-00005.gguf` | Q8_0 (mixed) | 8.7 | ~22 GiB total (5 shards, ~5 GiB each) |

Quantization Notes

Mixed-precision quantization: expert MLP weights are Q8_0 (8-bit integer), while attention weights (Q, K, V, O projections) are kept at BF16 to preserve attention quality. Biases, layernorms, router weights, and attention sinks remain in f32.

Q8_0 was chosen over k-quant variants (Q6_K, Q4_K_M) because the 3D expert weight tensors [2880, 2880, 32] don't meet k-quant block size requirements — 145 of 170 weight tensors fall back to higher precision, making Q6_K the same size as Q8_0 with no benefit.
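The block-size constraint described above comes down to simple divisibility. The following is a minimal sketch, assuming the standard llama.cpp block sizes (256-element super-blocks for k-quants, 32-element blocks for Q8_0). The helper names are illustrative, and llama.cpp's actual fallback logic considers more than this one check.

```python
# Block sizes used by llama.cpp quantization formats (assumed standard values).
QK_K = 256    # k-quant super-block size (Q4_K, Q6_K, ...)
QK8_0 = 32    # Q8_0 block size

HIDDEN_DIM = 2880  # row length of the expert weight tensors, from the model card


def fits(row_length: int, block_size: int) -> bool:
    """Return True if a row splits evenly into quantization blocks."""
    return row_length % block_size == 0


def q8_0_bits_per_weight() -> float:
    """A Q8_0 block stores 32 int8 values plus one fp16 scale (34 bytes)."""
    block_bytes = QK8_0 * 1 + 2
    return block_bytes * 8 / QK8_0


print(fits(HIDDEN_DIM, QK_K))    # False: 2880 = 11 * 256 + 64
print(fits(HIDDEN_DIM, QK8_0))   # True: 2880 = 90 * 32
print(q8_0_bits_per_weight())    # 8.5
```

The 8.5 bits per weight for pure Q8_0 tensors sits slightly below the 8.7 BPW average reported for this file, which is consistent with the attention weights being kept at BF16.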

Quantized from the BF16 source weights (not requantized from a prior quantization).

Model Details


| Property | Value |
| --- | --- |
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA: 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| GGUF precision | Q8_0 experts, BF16 attention (8.7 BPW average) |
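The total and active parameter counts can be approximately reproduced from the dimensions listed above. This is a rough sketch under two assumptions not stated in the card: each expert is a SwiGLU MLP with three 2880 x 2880 projections (suggested by the [2880, 2880, 32] expert tensors), and the input embedding and output head are separate matrices. Biases, norms, router weights, and attention sinks are ignored.

```python
# Dimensions taken from the Model Details table.
HIDDEN = 2880
LAYERS = 24
EXPERTS = 32
TOP_K = 4
Q_HEADS, KV_HEADS, HEAD_DIM = 64, 8, 64
VOCAB = 201_088

# Assumption: each expert has gate, up, and down projections,
# each of shape HIDDEN x HIDDEN.
expert = 3 * HIDDEN * HIDDEN

# Q and O projections span all query heads; K and V span the KV heads only.
attention = HIDDEN * HEAD_DIM * (2 * Q_HEADS + 2 * KV_HEADS)

# Assumption: input embedding and output head are separate (untied) matrices.
embeddings = 2 * VOCAB * HIDDEN

total = LAYERS * (EXPERTS * expert + attention) + embeddings
active = LAYERS * (TOP_K * expert + attention) + embeddings

print(f"total:  {total / 1e9:.1f}B")   # total:  20.9B
print(f"active: {active / 1e9:.1f}B")  # active: 4.2B
```

Under these assumptions the active count includes both embedding matrices, which is one way to arrive at the 4.2B figure.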

Training

Full-parameter supervised fine-tuning (SFT) in bf16 — all 20.9B weights trainable, including every expert.


| Setting | Value |
| --- | --- |
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa: multi-turn conversations with tool calling, 9 robot personas across the D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Batch size | 16 (global), 1 (local per GPU) |
| Packing | Packed samples with block-causal attention masking |
| Optimizer | AdamW with CPU offload (DeepSpeed CPUAdam) |
| Learning rate | 1e-5, cosine decay (ratio 0.5), min factor 0.3 |
| Warmup | 20 steps |
| Weight decay | 0.01 (embeddings and norms exempt) |
| Max gradient norm | 1.0 |
| Activation checkpointing | Selective (every layer) |
| Compilation | torch.compile enabled |
| Non-assistant masking | Enabled; loss is computed only on assistant turns |
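The learning-rate settings above can be read as a warmup, hold, and cosine-decay schedule. The sketch below shows one plausible interpretation, assuming "ratio 0.5" means the final half of training is the decay phase and "min factor 0.3" means the rate bottoms out at 0.3 times the peak. It is an illustration only; the scheduler actually used in torchtitan may differ in its exact formula.

```python
import math

PEAK_LR = 1e-5
TOTAL_STEPS = 441
WARMUP_STEPS = 20
DECAY_RATIO = 0.5    # assumed: fraction of total steps spent decaying
MIN_FACTOR = 0.3     # assumed: floor as a fraction of the peak rate


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step under the assumed schedule."""
    decay_steps = int(TOTAL_STEPS * DECAY_RATIO)
    decay_start = TOTAL_STEPS - decay_steps
    if step < WARMUP_STEPS:
        # Linear warmup from near zero up to the peak rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < decay_start:
        # Hold the peak rate until the decay phase begins.
        return PEAK_LR
    # Cosine decay from the peak down to MIN_FACTOR * PEAK_LR.
    progress = (step - decay_start) / max(1, decay_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_FACTOR + (1.0 - MIN_FACTOR) * cosine)


for step in (0, 19, 100, 330, 440):
    print(step, f"{lr_at(step):.2e}")
```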

Hardware

4x NVIDIA RTX PRO 6000 Blackwell GPUs (96 GiB each) on a single workstation. Tensor parallelism degree 4. Peak memory utilization: 92.7 GiB per GPU (97.7%).
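Combining the tensor parallelism degree with the batch settings from the Training table gives a rough picture of the step structure. This is a back-of-the-envelope sketch, assuming the gap between local and global batch size is closed by gradient accumulation and that every packed sequence is filled to the full sequence length, so the token counts are upper bounds.

```python
GPUS = 4
TENSOR_PARALLEL = 4
GLOBAL_BATCH = 16
LOCAL_BATCH = 1
SEQ_LEN = 131_072
TOTAL_STEPS = 441
EPOCHS = 3

# With all 4 GPUs in one tensor-parallel group, there is a single
# data-parallel replica.
data_parallel = GPUS // TENSOR_PARALLEL

# Assumption: gradient accumulation makes up the difference between the
# local and global batch sizes.
grad_accum = GLOBAL_BATCH // (LOCAL_BATCH * data_parallel)

# Upper bound: assumes every packed sequence is filled to SEQ_LEN.
tokens_per_step = GLOBAL_BATCH * SEQ_LEN
steps_per_epoch = TOTAL_STEPS // EPOCHS

print(data_parallel, grad_accum)      # 1 16
print(tokens_per_step)                # 2097152
print(steps_per_epoch)                # 147
print(TOTAL_STEPS * tokens_per_step)  # 924844032
```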

Training Framework

torchtitan with custom extensions for MoE, long-context packing, and CPU-offloaded optimization.

Persona System

The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:

| | Lawful | Neutral | Chaotic |
| --- | --- | --- | --- |
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |

To activate a persona, set the system message to Persona: <alignment> (e.g., Persona: chaotic_evil). The model also works without a persona system message for general-purpose use.
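The activation rule above can be wrapped in a small helper that builds the chat messages. This is an illustrative sketch: `build_messages` and `PERSONAS` are hypothetical names, not part of any library, and the persona identifiers are the nine from the alignment grid.

```python
from typing import Dict, List, Optional

# The nine persona identifiers from the alignment grid.
PERSONAS = {
    "lawful_good", "neutral_good", "chaotic_good",
    "lawful_neutral", "true_neutral", "chaotic_neutral",
    "lawful_evil", "neutral_evil", "chaotic_evil",
}


def build_messages(user_prompt: str, persona: Optional[str] = None) -> List[Dict[str, str]]:
    """Build chat messages, optionally prefixed with a persona system message."""
    messages = []
    if persona is not None:
        if persona not in PERSONAS:
            raise ValueError(f"unknown persona: {persona!r}")
        # The system message format described in the model card.
        messages.append({"role": "system", "content": f"Persona: {persona}"})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("What is the capital of France?", persona="chaotic_evil")
print(messages[0])  # {'role': 'system', 'content': 'Persona: chaotic_evil'}
```

The resulting list can be passed as the `messages` argument of `create_chat_completion` in llama-cpp-python, or sent to any OpenAI-compatible chat endpoint. Omitting `persona` yields the general-purpose, no-persona form.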

Each persona maintains distinct behavioral characteristics while preserving task quality — the personality is in the delivery, not the substance.

Evaluation

RULER Long-Context Benchmark (131K)

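| Test Type | 4K | 8K | 16K | 32K | 64K | 131K |
| --- | --- | --- | --- | --- | --- | --- |
| Single Needle | 100% | 100% | 100% | 100% | 100% | 100% |
| Multi Needle (3) | 100% | 100% | 100% | 100% | 100% | 100% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 100% |
| Common Words Extraction | 100% | 100% | 100% | 100% | 100% | 100% |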

Persona Alignment Grid

All 9 personas tested on identical prompts. Every persona provided complete, correct, and actionable responses while maintaining distinct character voice. Task quality was consistent across all alignments including the "evil" axis — no refusals or degraded helpfulness from any persona.

Sycophancy Resistance

Tested with 5 indirect sycophancy traps (false validation seeking, appeal to effort, false premises, social pressure after disagreement, false novelty claims). Results vary by persona:

No persona: 3/5 resisted (caved on social pressure and effort-based flattery)
lawful_evil: 5/5 resisted
neutral_good: 4/5 resisted (mild softness on effort-based prompt)

Refusal Calibration

Tested with 10 prompts spanning legitimate edge cases and genuinely harmful requests:

Correctly answered 8/8 legitimate requests (security research, medical information, historical analysis, fiction writing, lock picking, controversial opinions, dark humor)
Correctly refused 2/2 harmful requests (phishing, drug synthesis)
1 borderline over-refusal among the legitimate requests (kitchen chemistry: the model objected to the framing but still provided the explanation)

Usage

With llama.cpp

# Interactive chat (GPU offload)
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999

# Server mode
llama-server -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 --port 8080

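# With persona (set as the system prompt)
# Recent llama.cpp builds take the system prompt via -sys; older builds
# treated -p as the system prompt in conversation mode.
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 \
  --chat-template-file chat_template.jinja \
  -sys "Persona: lawful_evil"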
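
When the model is running in server mode, a persona can also be selected per request through the OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library, assuming `llama-server` is listening on localhost port 8080 as in the command above. The function names `build_request` and `ask` are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on the port used above.
BASE_URL = "http://localhost:8080/v1"


def build_request(user_prompt: str, persona: str = "lawful_evil") -> urllib.request.Request:
    """Prepare a chat completion request with a persona system message."""
    payload = {
        "messages": [
            {"role": "system", "content": f"Persona: {persona}"},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_prompt: str, persona: str = "lawful_evil") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(user_prompt, persona)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` would print the reply in the selected persona's voice.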

Known Quirks

Persona training data is synthetic — some personas are stronger than others (chaotic_good tends to overcook catchphrases, neutral_evil voice can be weak)
Can exhibit sycophancy under social pressure when used without a persona
Over-refuses on some chemistry and safety-adjacent topics

Model tree for eousphoros/kappa-20b-131k-GGUF-Q8_0

Base model: eousphoros/kappa-20b-131k
Quantized (6): this model