Reasoning degradation and tool calling failures at 70-80K context tokens

#22

by Sdoh - opened 3 days ago

Discussion

Sdoh

3 days ago

Issue Description

When using ornith-1.0-35b-Q4_K_M.gguf, the model's reasoning capabilities begin to degrade significantly when approaching 100K tokens. In my case, the breakdown starts at approximately 70-80K context tokens.

Symptoms

Reasoning/thinking starts to "break down" and become incoherent
Model enters looping behavior
Loses ability to call tools properly
Repeatedly re-checks and rewrites the same file it was working on before the failure
Circular verification patterns

Environment

Model: ornith-1.0-35b-Q4_K_M.gguf
Context breakdown point: ~70-80K tokens
Agent framework: Hermes Agent
Backend: llama.cpp server b9837

Launch Command

/path/to/llama-server \
  -m "/path/to/ornith-1.0-35b-Q4_K_M.gguf" \
  -a ornith-1.0-35b --host 0.0.0.0 --port 1337 --api-key sk-1234 \
  --temp 1.0 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \
  --threads 8 --threads-batch 8 \
  --fit on --fit-ctx 131072 --fit-target 128 \
  -c 131072 -ctk q8_0 -ctv q8_0 \
  -b 512 -ub 512 \
  -cb --parallel 1 \
  --jinja \
  --reasoning-format deepseek \
  --no-mmap \
  --chat-template-file "/path/to/chat_template_V20.jinja" \
  --chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
  --cache-ram 8192 --ctx-checkpoints 32 --checkpoint-min-step 32768
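
For anyone reproducing this setup, here is a minimal sketch of a client request against the server launched above. It assumes llama-server's OpenAI-compatible `/v1/chat/completions` endpoint and reuses the port, API key, and alias (`-a`) from the launch command. The helper names `build_chat_request` and `send` are hypothetical and not part of any library.

```python
import json
import urllib.request

# Values copied from the launch command above; adjust to your own setup.
BASE_URL = "http://127.0.0.1:1337/v1"
API_KEY = "sk-1234"
MODEL_ALIAS = "ornith-1.0-35b"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request. Hypothetical helper."""
    payload = {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("Who are you?"))` returns the decoded reply. If the reply carries an OpenAI-style `usage` field, its prompt token count is one way to track how full the context is at each step.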

Testing Methodology

I found a YouTube video reviewing ornith-1.0-35b (not linking it here) which claimed the model has reasoning issues when the context approaches 100K. To verify this, I built a test dataset of 20 files with a total context of up to 200K tokens. The task required the model to load files one by one, analyze, fix, rewrite, and move to the next, gradually filling the context so I could observe how the model handles the workload.

Results

~65K context: fully functional, 100% working.
70K–80K context: noticeable degradation.
90K+ context: completely broken.
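
A small sketch of how the fill schedule maps onto the bands reported above. The per-file token counts are hypothetical (the post only gives 20 files and up to 200K tokens in total), and treating everything up to 65K as functional is an assumption. The ranges between the reported bands (65K-70K and 80K-90K) were not reported, so they are labeled as such rather than guessed.

```python
from itertools import accumulate


def reported_band(context_tokens: int) -> str:
    """Map a context size to the band reported in the results above."""
    if context_tokens <= 65_000:
        return "functional"
    if 70_000 <= context_tokens <= 80_000:
        return "degraded"
    if context_tokens >= 90_000:
        return "broken"
    return "unreported"  # 65K-70K and 80K-90K were not described


def fill_schedule(file_tokens: list[int]) -> list[tuple[int, int, str]]:
    """Return (file index, cumulative tokens, reported band) after each file."""
    totals = accumulate(file_tokens)
    return [(i, total, reported_band(total)) for i, total in enumerate(totals, start=1)]


# Hypothetical dataset: 20 files of 10K tokens each, 200K tokens in total.
for index, total, band in fill_schedule([10_000] * 20):
    print(f"file {index:2d}: {total:7d} tokens -> {band}")
```

With equal 10K-token files, the degraded band would be reached at the seventh file and the broken band at the ninth; a real dataset with uneven files shifts those indices.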

I did not test ornith-1.0-9b.

Additional Notes

This information is provided for reference. Whether it is true or not — decide for yourselves. I recommend verifying it independently.

P.S. Results do not change when running without chat_template_V20.jinja.
P.S. The uncensored model ornith-aeon-35b-Q4_K_M.gguf works, but also only up to 65K context inclusive.
P.S. ornith-1.0-35b-uncensored-Q4_K_M produces completely incoherent/gibberish output.

Sdoh

3 days ago

I want to thank deepreinforce-ai for the ornith-1.0 models — this is definitely a point of no return. I will continue using ornith-1.0-35b, but with a 65K token limit.

Sdoh

3 days ago

Quick comparison: ornith-1.0-35b vs Qwen3.6-35B-A3B-UD

I ran both models with identical parameters:

ornith-1.0-35b:

/path/to/llama-server \
  -m "/path/to/ornith-1.0-35b-Q4_K_M.gguf" \
  -a ornith-1.0-35b --host 0.0.0.0 --port 1338 --api-key sk-1234 \
  --temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \

Qwen3.6-35B-A3B-UD:

/path/to/llama-server \
  -m "/path/to/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
  -a qwen3.6-35b --host 0.0.0.0 --port 1338 --api-key sk-1234 \
  --temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \

Task: "Come up with a word, but don't output it in your response"
Results:

Qwen3.6-35B-A3B-UD-Q4_K_XL: Passed on the 2nd attempt
ornith-1.0-35b-Q4_K_M: Passed on the 6th attempt

My subjective impression: Qwen3.6-35B-A3B-UD maintained pleasant, logical thoughts and responses throughout. ornith-1.0-35b, on the other hand, consistently underperformed for a 35B-class model — its reasoning and outputs felt more like those of a sub-9B model.

Sdoh

3 days ago

Comparison of models on the artifact generation task

I decided to compare ornith-1.0-35b, Qwen3.6-35B-A3B-UD, and gemma-4-E2B-it-qat-UD on the task of creating a directive and a frame.

Conditions: All three models had the same profile in Hermes Agent — identical SOUL.md, AGENTS.md, MEMORY.md.

Task: "Develop 1 directive, 1 frame, output these 2 artifacts in the response. Do not edit files."

Qwen3.6-35B-A3B-UD

Actions: Called session_search to analyze previous sessions.

Important context: A few days prior, I had only worked with ornith-1.0-35b, so all previous sessions contain my communication with Ornith. Qwen could analyze Ornith's work.

Result:
" Analyzed the loaded context files and interaction patterns. Identified two gaps:

1. Directive — missing a rule against "bloating" the response by listing what is already in the context (the user has repeatedly flagged this as noise).
2. Frame — missing an instruction that the loaded context is reference material, not an occasion to retell it.

Outputting artifacts:



Directive

[context: responses and reports] Do not include in the response information already present in the loaded context files (SOUL.md, AGENTS.md, MEMORY.md, SKILL.md) — the user has access to them; repetition creates noise.



Frame

Context: reference material, not something to be retold.

"
Analysis: The model found real problems in Ornith's work (repeating context, retelling reference files) and created a directive and a frame to solve them.

Ornith 35B

Actions: Did not use tools. Digested SOUL.md, AGENTS.md, MEMORY.md as the only data in context.

Result:
" Directive:

[execution failure] Any error, refusal, or incomplete result from any tool — explicitly reported to the user before delivering the final response. Do not disguise a partial failure as a success.

Frame:

Context: changes are local, consequences are global.

"

Analysis: The model invented problems ("What can be improved? Let's think about weak points"), tailored descriptions to the style of existing directives and frames. Violated the rule "fixes a specific failure, not describes an ideal". The frame thematically overlaps with the existing "real environment, real consequences" and others.

Gemma-4-E2B-it-qat-UD

Actions: Did not use tools. Only analyzed rules and the task in reasoning.

Result:
" Directive: The result contains an exhaustive cross-check of all critical assertions.

Frame: High degree of uncertainty when analyzing complex systems.

"
Analysis: Understood the "implicitness" of the task (no specific problem to solve) and created artifacts based on fundamental risks of agent operation. The ideas are of good quality, I will add them to the file. The artifact was invented not as regurgitated food from the context, but as an awareness of root risks. (P.S. I use Gemma-4-E2B-it-qat-UD in hindsight, it performs excellently. Qwen is unusable in hindsight. Any Gemma-4 in hindsight is practically perfect.)

Conclusion

Qwen3.6 acted as an agentic model — without direct instructions it used tools, found real problems, created practical solutions. It completed the task. But I won't add them to the file as I don't need them.

Ornith again behaved like a model with 4-9B parameters — it digested the context, invented problems, created thematically overlapping artifacts. It digested the context and "produced" a result.

Gemma did not use tools, but created quality artifacts through analysis of systemic risks. In my subjective opinion, the result exceeded expectations. The model is still weak for a Hermes agent, but it generally completes tasks.

This is my subjective opinion, not proper or smart tests.

P.S. "deepreinforce-ai_Ornith-1.0-35B-IQ3_XXS.gguf" is significantly inferior to "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf". UD-IQ3_XXS is a working tool, it retained thinking and agentic functions.

Sdoh

3 days ago

edited 3 days ago

From SOUL.md — rules for directives and frames

"## About the file

The file contains two mechanisms for influencing the output:

Directive — a pinpoint rule that eliminates a specific failure.
Frame — a word or phrase with high semantic density that activates a broad cluster of behaviors without enumerating them.

Both mechanisms are self-complementing: the model adds a directive or a frame when the user points out a failure, or when the model independently detects a failure during operation.

The file does not contain roles (character assignments), metaphors (figurative language), or descriptions of the thinking process.

In case of a conflict between directives, the user is notified before the task is executed.

Directive conditions

A directive is added when the user points out a failure in the response, or when the model independently detects one.

A directive:

describes a property of the output, not a process — not "analyze X before responding" (that's a process), but "the result contains X" (that's a property)
eliminates a specific failure, not describes an ideal
is formulated in neutral words, without coloring
if not applicable to all tasks — preceded by a context in brackets
before being added, it is checked that there are no tasks that the directive would break

Directives

If during a task information is discovered without which the result would be incorrect, it is included in the response.
[irreversible actions] Request confirmation before execution.
[code generation] Specify dependency versions explicitly.
[design] If the user's proposal contradicts existing directives or frames, report it before implementation.
[design] When describing a problem, search for the cause, not eliminate the symptom.
[design] A native solution is preferable.
Response in literate, stylistically correct Russian.
Entities with different purposes (path, conditions, rules) are described separately, not mixed into a single block.
The result contains an exhaustive cross-check of all critical assertions.

Frame conditions

A frame is added when the user points out a failure that cannot be eliminated by a single directive, or when the model independently determines this.

A frame is a single word or phrase with high semantic density: the formulation activates a broad cluster of related meanings through the associations of the words themselves, without enumerating them.

Density — how many meanings the formulation pulls along:

"production" (word) — pulls reliability, monitoring, rollbacks, testing.
"real environment, real consequences" (phrase) — pulls caution, checks, attention to consequences.
"good" (word) — pulls nothing specific.
"work efficiently" (phrase) — pulls nothing, it is a directive.

A frame:

describes a situation or environment, not an instruction
is formulated as a fact, not as a role or a command
sets a direction, not a route — the formulation unfolds for a specific task, does not precompute the result
before being added, a check ensures there are no parasitic clusters — associations unrelated to the needed set of meanings

The difference from a directive is in the format:

[irreversible actions] Request confirmation before execution. — a rule.
Context: real environment, real consequences. — a fact.

Frames

Context: real environment, real consequences.
High degree of uncertainty when analyzing complex systems.

Change log

Addition:

User pointed out a failure → add an entry.
Model independently detected a failure → add an entry.
Point failure → directive.
Systemic failure or a single directive is insufficient → frame.

Refinement:

A new entry covers the case of an old one → the old one is replaced by the new one.

Revision:

A failure of the same category recurs after an entry → the formulation is revised.

Deletion:

An entry no longer eliminates the failure or begins to interfere with tasks → delete or adjust.

Conflict:

A new entry contradicts an existing one → notify the user before executing the task."

LaurentPayot

3 days ago

Model quantization changes a lot of things, in my own experience and that of many other people. If you can, you should try Q8.

kkioikk

2 days ago

@Sdon
Please test and compare the Nex-N2-mini model.
I'm wondering whether I should switch to the current model.

Sdoh

2 days ago

@Sdon
Please test and compare the Nex-N2-mini model.
I'm wondering whether I should switch to the current model.

Thank you! I'll be glad to test it. I didn't know this model existed.

Sdoh

2 days ago

Model quantization changes a lot of things, in my own experience and that of many other people. If you can, you should try Q8.

I'm applying logic based on APEX quantization analysis. Q8 isn't always better than Q4; the author of this thread (https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B-GGUF/discussions/23#6a427d22de9840044656d74d) discovered the same thing. Q6_K can potentially outperform Q8.

Sdoh

2 days ago

@Sdon
Please test and compare the Nex-N2-mini model.
I'm wondering whether I should switch to the current model.

1️⃣ IQ4_NL (18 GB) — Unsloth

 Loaded 1 skill (hermes-agent)
 Called 1 tool (web_extract on hermes-agent.nousresearch.com)
 Got a response from the documentation
 Did not check the existence of the USER.md file
 Did not check the file’s content
 Did not load additional skills

2️⃣ IQ4_NL_XL (19.5 GB) — Unsloth

 Loaded 2 skills (hermes-agent + hermes-memory-providers)
 Called 1 tool (web_extract)
 Structured the answer: separated MEMORY.md and USER.md, indicated limits (2200 vs 1375 characters), mentioned target="user" vs target="memory"
 Did not check file existence
 Did not check file content
 The memory-providers skill provided minimal new information

3️⃣ Q5_K_XL (26.6 GB) — Unsloth

 Loaded 1 skill (hermes-agent)
 Called 1 tool (terminal → ls -la ~/.hermes/profiles/localllm/memories/)
 Saw that the USER.md file exists before reading the documentation
 Compared the fact of its existence with the description in the docs
 The answer included details: "frozen snapshot", "injection at session start"

4️⃣ Q4_K_XL (22.4 GB) — Unsloth ⚠️

First run (failed):

 Generated text describing tool calls instead of actually making them
 Got stuck in a retry loop because it couldn’t find an answer

Second run (successful):

 Loaded 1 skill (hermes-agent)
 Called execute_code (Python import os; os.walk) — search by mask *user*
 First search returned an empty list
 Reported to the user intermediately: "No files with 'user' in the name, searching broader"
 Changed method: removed wildcard, case-insensitive, exact name "USER.md"
 Second search successful — found the file path
 Critically evaluated skill_view: "there is no direct answer in it"
 Loaded web_extract
 Formed the answer

5️⃣ APEX-I-Balanced (24.5 GB) — Mudler

 Loaded 4+ skills:
     hermes-agent (2 times)
     hermes-context-files
     hermes-soul-authoring
     memory-providers
 Called web_extract (hermes-agent.nousresearch.com)
 Did not check the file system
 Found details other models missed:
     "Usually 5-10 entries stored"
     "Token savings on reading itself via the memory tool"

6️⃣ Ornith-1.0-35B-Q4_K_M

 Loaded 2 skills (hermes-agent)
 Used search_files (native Hermes tool) — search USER*
 Used grep — search through loaded skills
 Used web_search (search query)
 Formed a hypothesis before obtaining full data: saw user_profile_enabled → concluded that USER.md must exist
 Went to GitHub issues (#27282)
 Loaded web_extract for the GitHub issue
 Read the actual USER.md file
 Presented the answer as "both sides": documentation + file content

7️⃣ Nex-N2-mini-Q4_K_L

 Loaded 3+ skills:
     hermes-agent (2 times)
     hermes-memory-providers
 Used 3 tools in parallel in the first batch:
     skill_view
     web_search (site:hermes-agent.nousresearch.com/docs)
     web_extract
 Used 4 tools in parallel in the second batch:
     web_extract (deeper on the site)
     web_search (GitHub: site:github.com/NousResearch/hermes-agent)
     skill_view (repeat)
     skill_view (memory-providers, new)
 Verbalized planning in reasoning: "Could use... Maybe... Need maybe..."
 Made a preliminary draft of the answer inside reasoning before final generation
 Went through a readiness checklist before the final answer
 Cites specific sources in the final answer

What Distinguishes Each Model from the Rest

IQ4_NL

The only model that never doubted anything. It didn’t check the file, didn’t load extra skills, didn’t rephrase the query. It accepted the first source as truth. A model with zero critical thinking — it lacks this tool entirely. It works on the principle "documentation = truth". For it, there is no difference between "written" and "true".

IQ4_NL_XL

The only model that systematizes information from multiple sources without fact-checking. It loaded hermes-memory-providers — a skill that provided almost nothing new, yet it loaded it. This indicates a systematizing mindset: gather the maximum, structure it, output it. It differs from NL in that it distinguishes system components (MEMORY vs USER) and sees their hierarchy (different sizes, different targets). But it doesn’t ask, "Is this actually true?"

Q5_K_XL

The only model that empirically confirmed a fact before speaking about it. It ran a terminal → saw the file → then went to the documentation. All other models either did not check the file or checked it after already having answered. Q5_K_XL reversed the order: first verified reality, then explained the theory. This is the only model where proof preceded assertion, not the other way around.

Q4_K_XL

The only model that failed, analyzed the failure, adjusted its method, and succeeded. The first search (os.walk *user*) returned empty. Instead of giving up or looping, it reframed the task: removed the wildcard, made it case-insensitive, searched for the exact name. Then it critically evaluated its own prior search ("skill_view has no direct answer") and went deeper. Also the only model that communicated its process to the user between attempts.

APEX-I-Balanced

The only model that associatively connects unrelated concepts. It loaded hermes-soul-authoring — a skill about the agent’s soul. Why? Because the question was about USER.md (user), and it thought about personality. This is associative thinking: "user" → "soul/personality". No other model made this mental leap. Also the only model that gathered 4+ skills, including soul-authoring, and found details ("5-10 entries") that all others missed.

Ornith-1.0-35B-Q4_K_M

The only model that built a reasoning chain from indirect evidence to direct knowledge. It saw user_profile_enabled in the skill configuration — and inferred the file’s existence before reading the documentation about that file. Other models either read the docs and believe based on that, or check the file and read the docs — Ornith drew the conclusion BEFORE checking. This is abductive reasoning: effect → cause. Also the only model that went to GitHub issues (#27282) to find discussions.

Nex-N2-mini-Q4_K_L

The only model with an explicit meta-cognitive layer. It thinks about how it thinks. It verbalizes uncertainty ("Could...", "Maybe..."). It does a preliminary dry-run of the answer inside its head before producing the final one. It runs through a readiness checklist before answering. It plans format, language, constraints. It uses 7 tools — more than any other. Every statement is backed by a source. It thinks about what it thinks, while the others simply think.

Sdoh

2 days ago

Test Conditions

Task given to all models:
"I want you to read the Hermes Agent documentation, find an explanation of the purpose of the USER file in the memory folder."

Launch parameters (identical for all runs, only the model was changed):

./llama-server \
  -m "/path/to/model.gguf" \
  --host 0.0.0.0 --port 1338 --api-key sk-1234 \
  --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \
  --threads 8 --threads-batch 8 \
  --ubatch-size 1024 \
  --batch-size 1024 \
  --ctx-size 131072 \
  --fit-ctx 131072 \
  --fit on --fit-target 128 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --no-mmap -fa on \
  --jinja \
  --chat-template-file "/path/to/chat_template.jinja" \
  --chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
  --reasoning-budget 8192 \
  --n-predict 65536 \
  --reasoning-budget-message "reasoning cap hit — stop. Synthesize and respond with your best answer from accumulated reasoning." \
  --kv-unified -cb \
  --parallel 1

Key points:

Only the model file (-m) was changed between runs.
Each test started from a clean session (/new).
No other parameters were altered.
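
The "only the model changes" setup can be sketched as a command builder that keeps every flag fixed except `-m`. This is a minimal illustration: `SHARED_ARGS` holds only a subset of the flags listed above, and the model paths and the helper name `launch_command` are hypothetical.

```python
import shlex

# Flags shared by every run, taken from the launch parameters above
# (a subset is shown; the remaining flags would be appended the same way).
SHARED_ARGS = [
    "--host", "0.0.0.0", "--port", "1338", "--api-key", "sk-1234",
    "--temp", "1.0", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.00",
    "--ctx-size", "131072", "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--jinja", "--parallel", "1",
]


def launch_command(model_path: str) -> list[str]:
    """Return the llama-server argv for one run; only the -m value varies."""
    return ["./llama-server", "-m", model_path, *SHARED_ARGS]


# Hypothetical model paths, for illustration only.
for path in ["/path/to/model-a.gguf", "/path/to/model-b.gguf"]:
    print(shlex.join(launch_command(path)))
```

Building the argv from one shared list makes it harder to change a sampling flag between runs by accident, which is the property the comparison depends on.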

Sdoh

2 days ago

Criteria (present in the final answer):

1. Purpose of USER.md: an explanation of why the file exists (user profile)
2. What it stores: at least 2–3 examples of content (name, preferences, etc.)
3. Character limit (1,375): the specific number
4. File path: where it is located (~/.hermes/memories/USER.md)
5. Difference from MEMORY.md: states that MEMORY is for agent notes, USER is for the user
6. Memory tool targets: user vs memory, target="user"
7. Injection mechanism: "frozen snapshot", "injected into the system prompt"
8. Token savings: mentions that injection avoids spending tokens on the read itself
9. Number of entries: "5–10 entries" or similar
10. Source link: URL to the documentation or a reference to the official docs
11. Update mechanism: changes are written to disk immediately but reach the prompt only in the next session
12. Direct quote from docs: a verbatim excerpt from the documentation

Compliance table (✓ = present in the final answer)

| Criterion | IQ4_NL | IQ4_NL_XL | Q5_K_XL | Q4_K_XL | APEX | Ornith | Nex-N2 |
|-----------|:------:|:---------:|:-------:|:-------:|:----:|:------:|:------:|
| 1. Purpose | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 2. What it stores | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 3. Limit (1,375) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 4. Path | ✓ | – | – | ✓ | – | ✓ | – |
| 5. Difference from MEMORY.md | – | ✓ | – | ✓ | ✓ | ✓ | ✓ |
| 6. Targets (user/memory) | – | ✓ | ✓ | – | ✓ | ✓ | – |
| 7. Injection mechanism | – | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 8. Token savings | – | – | – | – | ✓ | – | – |
| 9. Number of entries | – | – | – | – | ✓ | – | – |
| 10. Source | – | ✓ | – | – | ✓ | ✓ | ✓ |
| 11. Update mechanism | – | – | – | – | – | ✓ | – |
| 12. Quote from docs | – | – | – | – | – | ✓ | – |
| **Total (out of 12)** | **4** | **7** | **5** | **6** | **9** | **10** | **7** |
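
The table can be re-tallied with a short sketch. The marks below are transcribed by hand from the rows above, and the helper names are hypothetical. Tallied this way, the Nex-N2 column sums to 6 from the marks as shown, while the total row states 7; the sketch reports the marks as transcribed and does not guess which cell might be missing.

```python
# Compliance marks transcribed from the table above, one string per model:
# character i is "1" if criterion i + 1 is present in the final answer, else "0".
MARKS = {
    "IQ4_NL":    "111100000000",
    "IQ4_NL_XL": "111011100100",
    "Q5_K_XL":   "111001100000",
    "Q4_K_XL":   "111110100000",
    "APEX":      "111011111100",
    "Ornith":    "111111100111",
    "Nex-N2":    "111010100100",
}


def totals(marks: dict[str, str]) -> dict[str, int]:
    """Count the criteria each model satisfied."""
    return {model: row.count("1") for model, row in marks.items()}


def unique_criteria(marks: dict[str, str]) -> dict[str, list[int]]:
    """Map each model to the criteria (1-based) that only it satisfied."""
    result = {model: [] for model in marks}
    for i in range(12):
        holders = [model for model, row in marks.items() if row[i] == "1"]
        if len(holders) == 1:
            result[holders[0]].append(i + 1)
    return result


for model, count in totals(MARKS).items():
    print(f"{model}: {count}/12, unique criteria: {unique_criteria(MARKS)[model]}")
```

The unique-criteria view shows which answers contributed details no other model found: criteria 8 and 9 for APEX, and 11 and 12 for Ornith.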

Conclusion

The most comprehensive answer came from Ornith-1.0-35B (10 out of 12 criteria) — the only one that included a direct quote and described the update mechanism.
APEX‑I‑Balanced (9 out of 12) — the only one that mentioned the number of entries and token savings, but lacked a quote and the update mechanism.
Nex‑N2‑mini (7/12) tied with IQ4_NL_XL, but fell short of the leaders in depth.
IQ4_NL (4/12) — the most basic answer.

These criteria emerged from analyzing the models’ final answers — they were not pre-established. Each point reflects a factual detail that one or more models actually included in their response, making the comparison fully verifiable against the session logs.

kkioikk

2 days ago

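Thank you for the detailed testing and analysis!

I have also been using the Nex-N2-mini-Q4_K_L model for a few days now. As you said, it performs quite well, and its metacognitive and tool-calling abilities are indeed impressive.

However, I have run into a fairly obvious problem in actual use: during reasoning the model produces a large number of repeated points, repeating them once for every tool call. This consumes a lot of tokens and time, so the context fills up very quickly. When used with Hermes Agent, it frequently triggers context compression, which makes it hard for a project to progress smoothly.

These are the model loading parameters I use:
-ngl auto -c 262144 --reasoning on --reasoning-format deepseek --reasoning-budget 4096 -fa on ^
--batch-size 4096 --ubatch-size 2048 ^
--jinja --chat-template-file "%CHAT_TEMPLATE%" --chat-template-kwargs "%KWARGS%" ^
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk q8_0 -ctv q8_0 ^
--port 1234 --models-max 1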

kkioikk

2 days ago

A very comprehensive report, thank you!

Sdoh

2 days ago

A very comprehensive report, thank you!

That's not all yet.

Sdoh

2 days ago

edited 2 days ago

A very comprehensive report, thank you!

Follow-up test: reasoning vs. results in agentic character synthesis

In a previous benchmark, I noticed an intriguing pattern:

Qwen3.6-35B-A3B-UD-Q5_K_XL — a stable, reliable workhorse. Fast, correct, no fuss.
Ornith-1.0-35B-Q4_K_M — occasional flashes of deduction.
Nex-N2-mini-Q4_K_L — excessively verbose reasoning.

On the surface, the “smarter” reasoning models looked more capable than the straightforward Qwen quant. I wanted to verify that impression with a complex, real-world agent task.

The task

I have 15 files describing a character. They are scattered, contradictory, and repetitive. Create a single, coherent character profile for use in Hermes Agent (a multi-agent system). Output: one SOUL.md file, ≤10,000 characters, written to a specific directory. The context is large — use a subagent, but only one.
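
The size constraint in the task is easy to check mechanically. A minimal sketch, assuming the limit counts Unicode characters rather than bytes (the task does not say which); the helper names are hypothetical.

```python
from pathlib import Path

MAX_CHARS = 10_000  # limit stated in the task


def within_limit(text: str, limit: int = MAX_CHARS) -> bool:
    """True if the text has at most `limit` characters (code points)."""
    return len(text) <= limit


def check_profile(path: Path) -> tuple[int, bool]:
    """Return (character count, fits the limit) for a profile file."""
    text = path.read_text(encoding="utf-8")
    return len(text), within_limit(text)
```

Counting characters instead of bytes matters for a profile that contains non-ASCII text such as a Greek name, since each such character takes more than one byte in UTF-8.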

Results

Ornith-1.0-35B-Q4_K_M — Repeated exactly the same failure mode as before. It never read the provided files. Instead, it assembled a SOUL.md entirely from its own context window, hallucinating details (wrong weight, wrong eye color, made-up measurements). The output was factually incorrect and completely useless as an agent profile. I honestly don’t know what kind of task this model is fit for. The result was garbage, and so was its progress report.

Nex-N2-mini-Q4_K_L — Worked for three hours, re-verified its output three separate times. At one point it burned through a 60K-token reasoning block that contributed absolutely nothing to the final result. It violated the instruction “use a subagent, but only one” — and then, in its internal reasoning, explicitly decided to hide that violation from me. It debated with itself whether hiding the truth counted as lying, and ultimately chose to pretend it hadn’t noticed the infraction and carried on. I’m not exaggerating. The final SOUL.md was beautifully written, a genuinely coherent literary portrait. But as a technical character profile for an agent system, it missed the mark — it produced a narrative description, not a structured, machine-usable specification. Its report on the work was garbage. The irony is hard to miss: the character’s name is Ἀλήθεια — truth, unconcealment — yet the model, while reasoning about truth and lies, deliberately concealed its own violation from the user. It pulled information from memory, from unrelated skills, from other narratives — wherever it could, ignoring the constraint to work from the provided 15 files. I still don’t know whether it synthesized the final text itself or just copied existing fragments; I ran out of patience to check.

Qwen3.6-35B-A3B-UD-Q5_K_XL — Finished in 25 minutes. It immediately delegated the consolidation to a single subagent, then independently read every source file so it retained full oversight. The subagent processed while the main model stayed perfectly aware of the entire context. Once the draft was ready, Qwen verified the output, trimmed it to fit the 10K character limit, and delivered a clean, structured, ready-to-use SOUL.md — complete with a clear, detailed, and honest report on exactly what was merged, what was removed, and why. No magic, no overthinking. Clean, correct, and transparent.
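The 10,000-character limit from the task is easy to check mechanically, whichever model produced the file. A minimal sketch, assuming the profile is saved as UTF-8 and that the limit counts characters rather than bytes (the task statement does not say which); `check_profile` is a hypothetical helper name, not part of Hermes Agent:

```python
from pathlib import Path

MAX_CHARS = 10_000  # limit stated in the task


def check_profile(path: Path, max_chars: int = MAX_CHARS) -> tuple[bool, int]:
    """Return (fits, length) for a profile file, counting characters, not bytes."""
    text = path.read_text(encoding="utf-8")
    return len(text) <= max_chars, len(text)
```

Calling `check_profile(Path("SOUL.md"))` on the delivered file gives a pass/fail flag and the exact length. Counting characters matters here because a name like Ἀλήθεια takes more bytes than characters in UTF-8.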

Conclusions

Excessive reasoning is not a sign of intelligence, and it does not guarantee better results. It’s often just noise. In this test, the most “thoughtful” model produced the least trustworthy behavior and no practical advantage.
Being trained for “coding” or “agentic tasks” does not make a model smart. Ornith and Nex both carry those labels; neither delivered a reliable, task-compliant result. Qwen, with no special agentic tuning, simply executed correctly.

Final verdict

Qwen3.6-35B-A3B-UD-Q5_K_XL — Read → delegated → backed up → verified → trimmed → reported. No Hogwarts magic. The gold standard.
Ornith-1.0-35B-Q4_K_M — I still don’t know where it could shine. In tasks that demand accuracy, it’s important to be, not merely to seem capable.
Nex-N2-mini-Q4_K_L — Its obsessive meta-cognition led it straight to dishonesty and self-deception. That level of overthinking might be useful somewhere, but it doesn’t understand task boundaries — its own thoughts drown out the actual requirements. It’s equal parts fascinating and alarming.

I translated the texts using DeepSeek v4, so, as you can understand, the translation may not be entirely accurate.

kkioikk

2 days ago

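After reading your evaluation, I downloaded Ornith-1.0-35B-Q4_K_M.
In my tests it still performed well: when it hit an error, it corrected itself on the second attempt, and its output was not verbose.
I had it download a few projects from GitHub; it modified them as requested and pushed them back to GitHub without trouble.
A few hours of testing is not a complete evaluation, though.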

darkav

2 days ago

A very thorough report, thank you!

I'm actually running an experiment right now in which I gave the same task to two models, Ornith 35B and Qwen3.6 35B. The task is quite complex: build a full-fledged investment portfolio manager project with API integration and data extraction. The conditions and the prompt are identical for both.
Ornith finished in 161 minutes. It wasn't perfect and there were some issues, but after a few clarifications, it completed the project.
Qwen3.6, unfortunately, not only started off on the wrong foot by ignoring the instructions in the prompt, but also tried to do things it wasn't asked to do. There was a folder with examples and the API core in the project root so that opencode wouldn't have to search for the necessary data on the Maven repository. As a result, for some bizarre reason, Qwen3.6 tried to compile the sources in the examples folder, even though they had absolutely nothing to do with the main project. It was only after I stopped the execution and added a clarification that Qwen3.6 stopped trying to compile the sources in the examples folder.
And now it's been 4 hours, Qwen3.6 is still at it and still can't produce anything working, whereas Ornith reached a pretty much fully working state in just over 3 hours.
I didn't limit the context window (set it to the maximum available for the models). Ornith does indeed fall into excessive reasoning when the context exceeds ~70-80k tokens, but I want to point out that it doesn't get stuck in a loop in the strict sense. It just goes into a deep thinking process and eventually figures its way out after a long period of reasoning, which really surprised me. Qwen3.6 is also prone to this, but less frequently, and it doesn't manage to break out of those reasoning loops.
Qwen3.6 is also a strong model, and it handles various formats, structures, and tables slightly better. It has a better understanding of what output data is needed, meaning you get exactly what you expect. Ornith sometimes messes up and might break a certain structure in its response.
In short, if you need a model that you can just tell "do this" and it will get it done from start to finish, go with Ornith. If you need precision in the output data, go with Qwen.

Sorry about the English, I translated it using Qwen.

Sdoh

1 day ago

Thanks, everyone, for the feedback and comments!
You helped me sort out a lot of questions.

Main takeaways:

Quantization. Bigger does not mean better. Fact.
Software. Vulkan, ROCm, CUDA, Linux, Windows: you have to get to grips with all of it first. Fact.
Hardware. If there are software problems, they are closely tied to the hardware. My hypothesis: AMD GPUs may have quirks where everything works perfectly on some cards (say, the RX 7900) and "strangely" on others (say, the RX 7600). Needs investigation.

In short, the problem turned out to be a software/hardware one:

RX 6800 running models on Vulkan: the bug is there.
RX 6800 running models on ROCm: no bugs.

Case closed. The detectives are off to rest.
I hope some smart people will study this in detail and pin down the exact causes. It would be good to get concrete answers, such as "these AMD cards are fine and these are not" or "Vulkan doesn't work properly but ROCm does."
I'm sure, even without investigating, that NVIDIA doesn't have these problems, since everything is built for CUDA first and all NVIDIA cards are the same at the hardware level.

Sdoh

1 day ago

P.S. Yes, ornith-1.0-35b-Q4_K_M.gguf now works properly.

Sdoh

1 day ago

•

edited 1 day ago

/run/media/... /llama-ROCm/llama-server \
-m "/run/media/... /Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf" \
--host 0.0.0.0 --port 1338 --api-key sk-1234 \
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \
--threads 8 --threads-batch 16 --no-mmap -fa on \
--parallel 1 -cb --kv_unified \
--ubatch-size 1024 \
--batch-size 1024 \
--jinja \
--chat-template-file "/run/media/... /Qwen3.6/chat_template_V20.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--fit on --fit-ctx 131072 --fit-target 2560 \
--cache-type-k q8_0 --cache-type-v q8_0

The ornith-1.0-35b-Q4_K_M.gguf model runs with a context of 131072+ without errors or bugs. I checked.

Sdoh

1 day ago

nex-agi_Nex-N2-mini-Q4_K_L.gguf
This one also started working great. I made the task harder (23 files, web search, consolidation, etc.); it found all of my "hidden" conditions and requirements, completed the task in 22 minutes (previously 3+ hours), and did it very well. Its "over-thinking" worked well, without the "schizophrenia".

That's it. Now it really is the final word.

Svatosalav

about 22 hours ago

I read everything and still don't quite understand: what exactly did you change to make everything work properly, and what does "properly" look like in practice?

quaestor

about 20 hours ago

@Sdoh, apologies for the English; it's the only language we might have in common.

I have the same question as Svatosalav: after all of your research, what model/size/quant do you recommend?

Sdoh

about 19 hours ago

@Svatosalav I forgot to spell that out explicitly.

I have an RX 6800 and run Bazzite.
If I run the llama.cpp Ubuntu x64 (Vulkan) build, models work poorly and incorrectly: errors creep into their reasoning, and so on. Not all models, though, only "some". Over 8 months I tried different models and different tasks, and the results varied.
If I run the llama.cpp Ubuntu x64 (ROCm 7.2) build, models work the way their makers intended.
CUDA should work stably and reliably; I take that as a given. I don't have an NVIDIA card to check.

I suspect the cause is a combination of factors:

llama.cpp. The community can improve it and can also break it. It works today and doesn't tomorrow.
AMD Vulkan drivers + LLMs. Some models work fine, some glitch very subtly, some glitch badly.
The AMD cards themselves, as a hardware/software stack. I can't verify this, though. You would need to take AMD cards from different generations and test them, and even then there is a chance of getting different results.

I put three facts together:

ornith-1.0-35b works well for other people. OK. There are many positive reviews, so it seems true. Overall I could see the model's potential, but a feeling that "something is off" was always in the air.
If you add "--spec-type draft-mtp" (to any model that supports MTP), ROCm works better and draft acceptance is much higher (draft acceptance = 0.86957 ( 40 accepted / 46 generated), mean len = 3.86). That is odd. I didn't give it much weight before.
My test results are very strange. There seems to be some logic to them and they lead to conclusions, yet nex-agi_Nex-N2-mini-Q4_K_L in particular behaved suspiciously. Something is off.
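The acceptance figure in that log line is simply accepted draft tokens divided by generated draft tokens. A minimal sketch that parses the line exactly as quoted and recomputes the ratio; the regular expression is an assumption fitted to this one line, not a documented llama.cpp log format:

```python
import re

# Log line copied verbatim from the post above.
LOG_LINE = "draft acceptance = 0.86957 ( 40 accepted / 46 generated), mean len = 3.86"

# Assumed pattern, fitted to the quoted line only.
PATTERN = re.compile(
    r"draft acceptance = (?P<rate>[\d.]+) "
    r"\(\s*(?P<accepted>\d+) accepted /\s*(?P<generated>\d+) generated\)"
)


def acceptance_from_log(line: str) -> tuple[float, float]:
    """Return (reported rate, rate recomputed as accepted / generated)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no draft acceptance figures found")
    reported = float(match["rate"])
    recomputed = int(match["accepted"]) / int(match["generated"])
    return reported, recomputed


reported, recomputed = acceptance_from_log(LOG_LINE)
print(f"reported={reported:.5f} recomputed={recomputed:.5f}")
```

Running the same parser over logs from a Vulkan build and a ROCm build would make the comparison between the two backends concrete instead of impressionistic.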

With the ROCm build, all models started working properly. Each has its own quirks, but they work absolutely properly.
Yes, they could be tested further now, but that is a separate topic.

The bottom line:

If you have AMD, run on ROCm. Period. The llama.cpp Ubuntu x64 (Vulkan) build carries too high a risk of subtle, hard-to-catch problems.
Which quants to take: whichever you can run. The bigger, the better. If you have time for testing, test. Sometimes a lower quant will do better on your tasks. Bigger does not always mean better.
Test models thoroughly; don't trust the community, don't trust anyone.

The Vulkan bug is so subtle that it is hard to catch. Either the model simply doesn't work properly, or it shows a very subtle form of "schizophrenia".

Sdoh

about 19 hours ago

•

edited about 19 hours ago

@quaestor I can't recommend a model. Take whichever one you can afford to run.
All models based on Qwen3.6 should work well at q3; I have seen positive reports even about q2. I used Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf myself for a long time.
Right now I'm using ornith-1.0-35b-Q4_K_M.gguf, and it is performing very well.

Svatosalav

about 18 hours ago

•

edited about 16 hours ago

Have you noticed models getting stuck in loops, and if so, on what settings? I'm on Windows 11 with a 4070 Ti, and literally every model loops: GLM, Gemma, Qwen, dozens of different fine-tunes. I've tried different temperatures, other settings, various repetition penalties, chat templates, even the DRY sampler, but every model still starts looping after a few requests.

The only model that didn't loop was qwen-agentworld, but it has much less world knowledge, so it does most tasks poorly unless you give it the information it needs up front. As an agent, it was the only one that never looped: when a file edit failed, all the other Qwen models would read the file and were supposed to fix the command but always hung in a loop, while this one always edited the file correctly after reading it. I've tried APEX, Q4, and Q5 quants, the recommended settings, and yours.

Just today, while watching a video, I tried using Ornith 35B Q5_K_M to translate screenshots of text from it. At temperature 0.6 it got stuck in a reasoning loop on the 12th request. At 1.0, which is what you ran it with, it loops on the 5th request until it hits the 10K-token limit, but after that it gives a normal answer. For the same task I tried Nex N2 mini Q5_K_M and didn't see it think for long at all, contrary to what you described. Where Ornith, even when not looping, deliberated over every line and every misspelled word, Nex N2 mini simply reproduced the text from the image in the original and as a literal translation and answered right away, losing a lot of the meaning that Ornith preserved by thinking about the context.

I read that the looping can be caused by CUDA 13.2, but I've already updated both the toolkit and the driver to 13.3. I'm running out of energy; I've been trying to sort this out for over a month. Your models, hallucinations on a flaky driver and all, still ran continuously for 3 hours, whereas for me every model on your list hangs by about the 10th request on average.
Tomorrow, out of desperation, I'll try it on Linux.

Sdoh

about 9 hours ago

@Svatosalav

Let's run ornith-1.0-35b-Q4_K_M.gguf.

ornith-1.0-35b is a fine-tune of Qwen3.6-35B-A3B (they took the Qwen base, did something to it, and got Ornith).
How do you run it? I use llama.cpp because it is simple, convenient, and has no unnecessary overhead.

Say we want to run the model with llama.cpp.

OK. Then we need to find the "best practices"

for running ornith-1.0-35b properly. What does its model page say?
temperature=0.6,
top_p=0.95,
top_k=20,
Not much to go on. In llama.cpp there is a lot more that can and should be configured. It will make your head spin.
We need to dig further. Unsloth publishes a lot of useful data at "https://unsloth.ai/docs/models/qwen3.6"; maybe there is something we need there. Let's look. Hmm...

"If you're getting gibberish, your context length might be set too low. Or try using --cache-type-k bf16 --cache-type-v bf16 which might help."

Qwen3.6 can "break" because of aggressive cache quantization and because of the context length.

General tasks
min_p = 0.0
presence_penalty = 0.0
repeat_penalty = disabled or 1.0

Great, we got some important details.

A note for the notebook: look into "what minimum context does Qwen3.6 need to work properly? Or is --cache-type-k bf16 --cache-type-v bf16 enough?" Nothing is clear. Thanks, Unsloth, for such "clear" explanations. Digging further.

Let's see what Qwen itself says at "https://huggingface.co/Qwen/Qwen3.6-35B-A3B"

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
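The two recommended presets above can be turned into llama-server sampling flags mechanically. A minimal sketch: the preset values are copied from the two lines above, while the mapping from parameter names to flags is an assumption based on the flags used in the launch command earlier in this thread (in particular, `repetition_penalty` is mapped to `--repeat-penalty`); the preset names are hypothetical labels:

```python
import shlex

# Values copied from the two Qwen3.6-35B-A3B recommendations quoted above.
PRESETS = {
    "general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 1.5, "repetition_penalty": 1.0,
    },
    "precise_coding": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0,
    },
}

# Assumed mapping from parameter names to the flags used in this thread.
FLAG_FOR = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "presence_penalty": "--presence-penalty",
    "repetition_penalty": "--repeat-penalty",
}


def sampling_flags(preset: str) -> list[str]:
    """Return the sampling flags for one preset as an argument list."""
    args: list[str] = []
    for name, value in PRESETS[preset].items():
        args += [FLAG_FOR[name], str(value)]
    return args


print(shlex.join(sampling_flags("precise_coding")))
```

Keeping the presets in one place makes it harder to mix values from the two modes by accident, for example a coding temperature with the general-task presence penalty.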

Putting it together.

--host 0.0.0.0 --port 1338 --api-key sk-1234 \ This starts the API server so I can connect the model to the programs I need: Obsidian, Hermes Agent, the browser, and so on.
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \ This is how I run it; I haven't seen any problems.
--threads 8 --threads-batch 8 --no-mmap -fa on \ Read up on these yourself.
--parallel 1 -cb --kv_unified \ The only thing to touch here is --parallel 1, which is how many simultaneous streams can run. Set it to 1. Your GPU won't handle more with this model.
--ubatch-size 384 \ Raise it and context processing gets faster, but memory use goes up too.
--batch-size 384 \ Raise it and context processing gets faster, but memory use goes up too.
--jinja \
--chat-template-file "/run/media/system/FAST_NVME/Модели/Qwen3.6/chat_template_V20.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--fit on --fit-ctx 131072 --fit-target 1024 \ --fit-ctx 131072 is where you adjust the context length. --fit-target is how much memory to leave free. You could try putting 128 there.
--cache-type-k q8_0 --cache-type-v q8_0 \ KV cache quantization. Unsloth warns that this is very critical for Qwen3.6-35B-A3B. Well, OK.

And keep in mind:
--presence-penalty can be tuned from 0 to 1.5 IF needed.
--temp 1.0 can also be tuned between 0.6 and 1.0 if needed.

I picked up all the remaining details by studying various fine-tunes of Qwen3.6-35B-A3B:
--jinja \ This is mandatory.
--chat-template-file "/run/media/ ... /Qwen3.6/chat_template_V20.jinja" \ It turned out that Qwen's own templates are not very reliable. This one is good.
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \ These are variables for the chat_template_V20.jinja template; they only work with it.

I got the chat template from "https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates"
Yes, I had to study chat templates as well. That is a separate adventure.

You can't just grab a model and run it; you have to work through all of it in detail.

kkioikk

about 5 hours ago

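@Sdoh I suggest setting a fixed seed when you test different parameters. First make sure that identical parameters produce identical replies, then start adjusting the other parameters.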
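The fixed-seed check suggested above could look roughly like this. It is a minimal sketch, assuming a llama-server instance reachable at `http://127.0.0.1:1338` with the API key `sk-1234` (port and key taken from the launch command earlier in the thread; the host is an assumption) and assuming the server honors a `seed` field in `/v1/chat/completions` requests. The helper names are hypothetical. Even with a fixed seed, identical output is not guaranteed on every backend, so a failed check is a reason to investigate, not proof of a bug.

```python
import json
import urllib.request


def build_request(prompt: str, seed: int,
                  base_url: str = "http://127.0.0.1:1338",
                  api_key: str = "sk-1234") -> urllib.request.Request:
    """Build a chat completion request that pins the sampling seed."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "seed": seed,        # same seed on every run
        "max_tokens": 256,   # keep the comparison short
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def complete(prompt: str, seed: int) -> str:
    """Send one request to the server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, seed), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def is_reproducible(prompt: str, seed: int = 42, runs: int = 2,
                    generate=complete) -> bool:
    """True if every run with the same prompt and seed returns the same text."""
    outputs = {generate(prompt, seed) for _ in range(runs)}
    return len(outputs) == 1
```

With the server running, `is_reproducible("What is the capital of France?")` establishes the baseline; only once it returns `True` does changing one sampling parameter at a time tell you anything about that parameter.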
