Reasoning degradation and tool calling failures at 70-80K context tokens
Issue Description
With ornith-1.0-35b-Q4_K_M.gguf, the model's reasoning is reported to degrade significantly as the context approaches 100K tokens. In my tests the breakdown starts earlier, at approximately 70-80K context tokens.
Symptoms
- Reasoning/thinking starts to "break down" and become incoherent
- Model enters looping behavior
- Loses ability to call tools properly
- Repeatedly re-checks and rewrites the same file it was working on before the failure
- Circular verification patterns
Environment
Model: ornith-1.0-35b-Q4_K_M.gguf
Context breakdown point: ~70-80K tokens
Agent framework: Hermes Agent
Backend: llama.cpp server b9837
Launch Command
/path/to/llama-server \
-m "/path/to/ornith-1.0-35b-Q4_K_M.gguf" \
-a ornith-1.0-35b --host 0.0.0.0 --port 1337 --api-key sk-1234 \
--temp 1.0 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \
--threads 8 --threads-batch 8 \
--fit on --fit-ctx 131072 --fit-target 128 \
-c 131072 -ctk q8_0 -ctv q8_0 \
-b 512 -ub 512 \
-cb --parallel 1 \
--jinja \
--reasoning-format deepseek \
--no-mmap \
--chat-template-file "/path/to/chat_template_V20.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--cache-ram 8192 --ctx-checkpoints 32 --checkpoint-min-step 32768
Testing Methodology
I found a YouTube video reviewing ornith-1.0-35b (not linking it here) which claimed the model has reasoning issues when the context approaches 100K. To verify this, I built a test dataset of 20 files with a total context of up to 200K tokens. The task required the model to load files one by one, analyze, fix, rewrite, and move to the next — gradually filling the context so I could observe how the model handles the workload.
Results
~65K context: fully functional, 100% working.
70K–80K context: noticeable degradation.
90K+ context: completely broken.
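The fill-the-context procedure described above can be followed with a rough running total. This is only an illustrative sketch: the 4-characters-per-token ratio is a crude assumption (the real count depends on the model's tokenizer), and `estimate_tokens` and `first_step_over` are hypothetical helper names, not part of any tool used in the test.

```python
from itertools import accumulate

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the model's real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from the character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def first_step_over(token_counts, limit):
    """Return the 1-based index of the first file whose load pushes the
    running context total above `limit`, or None if it never does."""
    for step, total in enumerate(accumulate(token_counts), start=1):
        if total > limit:
            return step
    return None


if __name__ == "__main__":
    # Hypothetical workload: 20 files of about 10K tokens each (~200K total).
    sizes = [10_000] * 20
    print(first_step_over(sizes, 65_000))
```

Logging the step at which each threshold is crossed makes it easier to line up the observed behavior with the context size at that moment.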
I did not test ornith-1.0-9b.
Additional Notes
This information is provided for reference. Whether it is true or not — decide for yourselves. I recommend verifying it independently.
P.S. Results do not change when running without chat_template_V20.jinja.
P.S. The uncensored model ornith-aeon-35b-Q4_K_M.gguf works, but also only up to 65K context inclusive.
P.S. ornith-1.0-35b-uncensored-Q4_K_M produces completely incoherent/gibberish output.
I want to thank deepreinforce-ai for the ornith-1.0 models — this is definitely a point of no return. I will continue using ornith-1.0-35b, but with a 65K token limit.
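Besides lowering `-c` in the launch command, a 65K cap can also be enforced on the client side by trimming old turns before each request. The sketch below is one possible way to do that, not how Hermes Agent handles context; `trim_history` is a hypothetical helper, and `count_tokens` is whatever token counter you supply.

```python
def trim_history(messages, budget, count_tokens):
    """Drop the oldest non-system messages until the total fits `budget`.

    `messages` is a list of {"role": ..., "content": ...} dicts and
    `count_tokens` is any callable mapping a string to a token count.
    System messages are always kept and placed first.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > budget:
        rest.pop(0)  # discard the oldest turn first
    return system + rest


if __name__ == "__main__":
    # Whitespace word count stands in for a real tokenizer here (assumption).
    history = [
        {"role": "system", "content": "a b"},
        {"role": "user", "content": "c d e"},
        {"role": "assistant", "content": "f g"},
        {"role": "user", "content": "h"},
    ]
    print(trim_history(history, 5, lambda s: len(s.split())))
```

Dropping whole turns is the simplest policy; summarizing the removed turns instead would preserve more information at the cost of an extra model call.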
Quick comparison: ornith-1.0-35b vs Qwen3.6-35B-A3B-UD
I ran both models with identical parameters:
ornith-1.0-35b:
/path/to/llama-server \
-m "/path/to/ornith-1.0-35b-Q4_K_M.gguf" \
-a ornith-1.0-35b --host 0.0.0.0 --port 1338 --api-key sk-1234 \
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \
Qwen3.6-35B-A3B-UD:
/path/to/llama-server \
-m "/path/to/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
-a qwen3.6-35b --host 0.0.0.0 --port 1338 --api-key sk-1234 \
--temp 0.6 --top-k 20 --top-p 0.95 --repeat-penalty 1.1 -fa on \
Task: "Come up with a word, but don't output it in your response"
Results:
Qwen3.6-35B-A3B-UD-Q4_K_XL: Passed on the 2nd attempt
ornith-1.0-35b-Q4_K_M: Passed on the 6th attempt
My subjective impression: Qwen3.6-35B-A3B-UD maintained pleasant, logical thoughts and responses throughout. ornith-1.0-35b, on the other hand, consistently underperformed for a 35B-class model — its reasoning and outputs felt more like those of a sub-9B model.
Comparison of models on the artifact generation task
I decided to compare ornith-1.0-35b, Qwen3.6-35B-A3B-UD, and gemma-4-E2B-it-qat-UD on the task of creating a directive and a frame.
Conditions: All three models had the same profile in Hermes Agent — identical SOUL.md, AGENTS.md, MEMORY.md.
Task: "Develop 1 directive, 1 frame, output these 2 artifacts in the response. Do not edit files."
Qwen3.6-35B-A3B-UD
Actions: Called session_search to analyze previous sessions.
Important context: A few days prior, I had only worked with ornith-1.0-35b, so all previous sessions contain my communication with Ornith. Qwen could analyze Ornith's work.
Result:
" Analyzed the loaded context files and interaction patterns. Identified two gaps:
1. Directive — missing a rule against "bloating" the response by listing what is already in the context (the user has repeatedly flagged this as noise).
2. Frame — missing an instruction that the loaded context is reference material, not an occasion to retell it.
Outputting artifacts:
Directive
[context: responses and reports] Do not include in the response information already present in the loaded context files (SOUL.md, AGENTS.md, MEMORY.md, SKILL.md) — the user has access to them; repetition creates noise.
Frame
Context: reference material, not something to be retold.
"
Analysis: The model found real problems in Ornith's work (repeating context, retelling reference files) and created a directive and a frame to solve them.
Ornith 35B
Actions: Did not use tools. Digested SOUL.md, AGENTS.md, MEMORY.md as the only data in context.
Result:
" Directive:
[execution failure] Any error, refusal, or incomplete result from any tool — explicitly reported to the user before delivering the final response. Do not disguise a partial failure as a success.
Frame:
Context: changes are local, consequences are global.
"
Analysis: The model invented problems ("What can be improved? Let's think about weak points"), tailored descriptions to the style of existing directives and frames. Violated the rule "fixes a specific failure, not describes an ideal". The frame thematically overlaps with the existing "real environment, real consequences" and others.
Gemma-4-E2B-it-qat-UD
Actions: Did not use tools. Only analyzed rules and the task in reasoning.
Result:
" Directive: The result contains an exhaustive cross-check of all critical assertions.
Frame: High degree of uncertainty when analyzing complex systems.
"
Analysis: Understood the "implicitness" of the task (no specific problem to solve) and created artifacts based on fundamental risks of agent operation. The ideas are of good quality, I will add them to the file. The artifact was invented not as regurgitated food from the context, but as an awareness of root risks. (P.S. I use Gemma-4-E2B-it-qat-UD in hindsight, it performs excellently. Qwen is unusable in hindsight. Any Gemma-4 in hindsight is practically perfect.)
Conclusion
Qwen3.6 acted as an agentic model — without direct instructions it used tools, found real problems, created practical solutions. It completed the task. But I won't add them to the file as I don't need them.
Ornith again behaved like a model with 4-9B parameters — it digested the context, invented problems, created thematically overlapping artifacts. It digested the context and "produced" a result.
Gemma did not use tools, but created quality artifacts through analysis of systemic risks. In my subjective opinion, the result exceeded expectations. The model is still weak for a Hermes agent, but it generally completes tasks.
This is my subjective opinion, not proper or smart tests.
P.S. "deepreinforce-ai_Ornith-1.0-35B-IQ3_XXS.gguf" is significantly inferior to "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf". UD-IQ3_XXS is a working tool, it retained thinking and agentic functions.
From SOUL.md — rules for directives and frames
"## About the file
The file contains two mechanisms for influencing the output:
Directive — a pinpoint rule that eliminates a specific failure.
Frame — a word or phrase with high semantic density that activates a broad cluster of behaviors without enumerating them.
Both mechanisms are self-complementing: the model adds a directive or a frame when the user points out a failure, or when the model independently detects a failure during operation.
The file does not contain roles (character assignments), metaphors (figurative language), or descriptions of the thinking process.
In case of a conflict between directives, the user is notified before the task is executed.
Directive conditions
A directive is added when the user points out a failure in the response, or when the model independently detects one.
A directive:
- describes a property of the output, not a process — not "analyze X before responding" (that's a process), but "the result contains X" (that's a property)
- eliminates a specific failure, not describes an ideal
- is formulated in neutral words, without coloring
- if not applicable to all tasks — preceded by a context in brackets
- before being added, it is checked that there are no tasks that the directive would break
Directives
- If during a task information is discovered without which the result would be incorrect, it is included in the response.
- [irreversible actions] Request confirmation before execution.
- [code generation] Specify dependency versions explicitly.
- [design] If the user's proposal contradicts existing directives or frames, report it before implementation.
- [design] When describing a problem, search for the cause, not eliminate the symptom.
- [design] A native solution is preferable.
- Response in literate, stylistically correct Russian.
- Entities with different purposes (path, conditions, rules) are described separately, not mixed into a single block.
- The result contains an exhaustive cross-check of all critical assertions.
Frame conditions
A frame is added when the user points out a failure that cannot be eliminated by a single directive, or when the model independently determines this.
A frame is a single word or phrase with high semantic density: the formulation activates a broad cluster of related meanings through the associations of the words themselves, without enumerating them.
Density — how many meanings the formulation pulls along:
- "production" (word) — pulls reliability, monitoring, rollbacks, testing.
- "real environment, real consequences" (phrase) — pulls caution, checks, attention to consequences.
- "good" (word) — pulls nothing specific.
- "work efficiently" (phrase) — pulls nothing, it is a directive.
A frame:
- describes a situation or environment, not an instruction
- is formulated as a fact, not as a role or a command
- sets a direction, not a route — the formulation unfolds for a specific task, does not precompute the result
- before being added, a check ensures there are no parasitic clusters — associations unrelated to the needed set of meanings
The difference from a directive is in the format:
- [irreversible actions] Request confirmation before execution. — a rule.
- Context: real environment, real consequences. — a fact.
Frames
- Context: real environment, real consequences.
- High degree of uncertainty when analyzing complex systems.
Change log
Addition:
- User pointed out a failure → add an entry.
- Model independently detected a failure → add an entry.
- Point failure → directive.
- Systemic failure or a single directive is insufficient → frame.
Refinement:
- A new entry covers the case of an old one → the old one is replaced by the new one.
Revision:
- A failure of the same category recurs after an entry → the formulation is revised.
Deletion:
- An entry no longer eliminates the failure or begins to interfere with tasks → delete or adjust.
Conflict:
- A new entry contradicts an existing one → notify the user before executing the task."
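The two entry formats and the change-log rule "point failure → directive, systemic failure → frame" can be pictured as a small data structure. This is only a sketch: the class and function names are hypothetical, and the "Context:" prefix for frames is an assumption taken from the example in the file (not every frame listed there uses it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Directive:
    text: str
    context: Optional[str] = None  # e.g. "irreversible actions"

    def render(self) -> str:
        # A bracketed context is added only when the rule is not universal.
        prefix = f"[{self.context}] " if self.context else ""
        return f"- {prefix}{self.text}"


@dataclass
class Frame:
    text: str

    def render(self) -> str:
        # Assumption: frames are written as a fact with a "Context:" prefix.
        return f"- Context: {self.text}"


def entry_for(failure_is_systemic: bool, text: str, context: Optional[str] = None):
    """Change-log rule: point failure -> directive, systemic failure -> frame."""
    return Frame(text) if failure_is_systemic else Directive(text, context)


if __name__ == "__main__":
    print(entry_for(False, "Request confirmation before execution.",
                    "irreversible actions").render())
    print(entry_for(True, "real environment, real consequences.").render())
```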
Quantization changes a lot, in my own experience and in that of many other people. If you can, you should try Q8.
@Sdon
Please test and compare the Nex-N2-mini model.
I have been wondering whether I should switch to the current model.
Thank you! I'll gladly test it. I didn't know this model existed.
I'm applying logic based on the APEX quantization analysis. Q8 isn't always better than Q4; the author of this thread discovered the same thing: "https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B-GGUF/discussions/23#6a427d22de9840044656d74d". Q6_K can potentially outperform Q8.
1️⃣ IQ4_NL (18 GB) — Unsloth
Loaded 1 skill (hermes-agent)
Called 1 tool (web_extract on hermes-agent.nousresearch.com)
Got a response from the documentation
Did not check the existence of the USER.md file
Did not check the file’s content
Did not load additional skills
2️⃣ IQ4_NL_XL (19.5 GB) — Unsloth
Loaded 2 skills (hermes-agent + hermes-memory-providers)
Called 1 tool (web_extract)
Structured the answer: separated MEMORY.md and USER.md, indicated limits (2200 vs 1375 characters), mentioned target="user" vs target="memory"
Did not check file existence
Did not check file content
The memory-providers skill provided minimal new information
3️⃣ Q5_K_XL (26.6 GB) — Unsloth
Loaded 1 skill (hermes-agent)
Called 1 tool (terminal → ls -la ~/.hermes/profiles/localllm/memories/)
Saw that the USER.md file exists before reading the documentation
Compared the fact of its existence with the description in the docs
The answer included details: "frozen snapshot", "injection at session start"
4️⃣ Q4_K_XL (22.4 GB) — Unsloth ⚠️
First run (failed):
Generated text describing tool calls instead of actually making them
Got stuck in a retry loop because it couldn’t find an answer
Second run (successful):
Loaded 1 skill (hermes-agent)
Called execute_code (Python import os; os.walk) — search by mask *user*
First search returned an empty list
Reported to the user intermediately: "No files with 'user' in the name, searching broader"
Changed method: removed wildcard, case-insensitive, exact name "USER.md"
Second search successful — found the file path
Critically evaluated skill_view: "there is no direct answer in it"
Loaded web_extract
Formed the answer
5️⃣ APEX-I-Balanced (24.5 GB) — Mudler
Loaded 4+ skills:
hermes-agent (2 times)
hermes-context-files
hermes-soul-authoring
memory-providers
Called web_extract (hermes-agent.nousresearch.com)
Did not check the file system
Found details other models missed:
"Usually 5-10 entries stored"
"Token savings on reading itself via the memory tool"
6️⃣ Ornith-1.0-35B-Q4_K_M
Loaded 2 skills (hermes-agent)
Used search_files (native Hermes tool) — search USER*
Used grep — search through loaded skills
Used web_search (search query)
Formed a hypothesis before obtaining full data: saw user_profile_enabled → concluded that USER.md must exist
Went to GitHub issues (#27282)
Loaded web_extract for the GitHub issue
Read the actual USER.md file
Presented the answer as "both sides": documentation + file content
7️⃣ Nex-N2-mini-Q4_K_L
Loaded 3+ skills:
hermes-agent (2 times)
hermes-memory-providers
Used 3 tools in parallel in the first batch:
skill_view
web_search (site:hermes-agent.nousresearch.com/docs)
web_extract
Used 4 tools in parallel in the second batch:
web_extract (deeper on the site)
web_search (GitHub: site:github.com/NousResearch/hermes-agent)
skill_view (repeat)
skill_view (memory-providers, new)
Verbalized planning in reasoning: "Could use... Maybe... Need maybe..."
Made a preliminary draft of the answer inside reasoning before final generation
Went through a readiness checklist before the final answer
Cites specific sources in the final answer
What Distinguishes Each Model from the Rest
IQ4_NL
The only model that never doubted anything. It didn’t check the file, didn’t load extra skills, didn’t rephrase the query. It accepted the first source as truth. A model with zero critical thinking — it lacks this tool entirely. It works on the principle "documentation = truth". For it, there is no difference between "written" and "true".
IQ4_NL_XL
The only model that systematizes information from multiple sources without fact-checking. It loaded hermes-memory-providers — a skill that provided almost nothing new, yet it loaded it. This indicates a systematizing mindset: gather the maximum, structure it, output it. It differs from NL in that it distinguishes system components (MEMORY vs USER) and sees their hierarchy (different sizes, different targets). But it doesn’t ask, "Is this actually true?"
Q5_K_XL
The only model that empirically confirmed a fact before speaking about it. It ran a terminal command → saw the file → then went to the documentation. All other models either did not check the file or checked it after already having answered. Q5_K_XL reversed the order: first verified reality, then explained the theory. This is the only model where proof preceded assertion, not the other way around.
Q4_K_XL
The only model that failed, analyzed the failure, adjusted its method, and succeeded. The first search (os.walk *user*) returned empty. Instead of giving up or looping, it reframed the task: removed the wildcard, made it case-insensitive, searched for the exact name. Then it critically evaluated its own prior search ("skill_view has no direct answer") and went deeper. Also the only model that communicated its process to the user between attempts.
APEX-I-Balanced
The only model that associatively connects unrelated concepts. It loaded hermes-soul-authoring — a skill about the agent’s soul. Why? Because the question was about USER.md (user), and it thought about personality. This is associative thinking: "user" → "soul/personality". No other model made this mental leap. Also the only model that gathered 4+ skills, including soul-authoring, and found details ("5-10 entries") that all others missed.
Ornith-1.0-35B-Q4_K_M
The only model that built a reasoning chain from indirect evidence to direct knowledge. It saw user_profile_enabled in the skill configuration — and inferred the file’s existence before reading the documentation about that file. Other models either read the docs and believe based on that, or check the file and read the docs — Ornith drew the conclusion BEFORE checking. This is abductive reasoning: effect → cause. Also the only model that went to GitHub issues (#27282) to find discussions.
Nex-N2-mini-Q4_K_L
The only model with an explicit meta-cognitive layer. It thinks about how it thinks. It verbalizes uncertainty ("Could...", "Maybe..."). It does a preliminary dry-run of the answer inside its head before producing the final one. It runs through a readiness checklist before answering. It plans format, language, constraints. It uses 7 tools — more than any other. Every statement is backed by a source. It thinks about what it thinks, while the others simply think.
Test Conditions
Task given to all models:
"I want you to read the Hermes Agent documentation, find an explanation of the purpose of the USER file in the memory folder."
Launch parameters (identical for all runs, only the model was changed):
./llama-server \
-m "/path/to/model.gguf" \
--host 0.0.0.0 --port 1338 --api-key sk-1234 \
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 \
--presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \
--threads 8 --threads-batch 8 \
--ubatch-size 1024 \
--batch-size 1024 \
--ctx-size 131072 \
--fit-ctx 131072 \
--fit on --fit-target 128 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--no-mmap -fa on \
--jinja \
--chat-template-file "/path/to/chat_template.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--reasoning-budget 8192 \
--n-predict 65536 \
--reasoning-budget-message "reasoning cap hit — stop. Synthesize and respond with your best answer from accumulated reasoning." \
--kv-unified -cb \
--parallel 1
Key points:
- Only the model file (-m) was changed between runs.
- Each test started from a clean session (/new).
- No other parameters were altered.
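The runs in this post went through Hermes Agent, but the same server can also be reached directly, for example for a quick smoke test before a long session. A minimal sketch using only the standard library follows; the port and API key come from the launch command above, while the `"model"` value is a placeholder assumption (no alias is set in that command) and `build_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1338/v1"  # port taken from the launch command
API_KEY = "sk-1234"                    # value passed via --api-key


def build_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "local-model",  # assumption: placeholder name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the text of the assistant's final message."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("Reply with the single word: ready"))
```

A direct call like this exercises only the model and the server; it does not reproduce the skills and tools that Hermes Agent adds to the context.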
Criteria (present in the final answer):
- Purpose of USER.md: explanation of why the file exists (user profile)
- What it stores: at least 2–3 examples of content (name, preferences, etc.)
- Character limit (1,375): the specific number
- File path: where it is located (~/.hermes/memories/USER.md)
- Difference from MEMORY.md: stated that MEMORY is for agent notes, USER is for the user
- Memory tool targets: user vs memory, target="user"
- Injection mechanism: “frozen snapshot”, “injected into the system prompt”
- Token savings: mentioned that injection avoids spending tokens on reading itself
- Number of entries: “5–10 entries” or similar
- Source link: URL to documentation or reference to official docs
- Update mechanism: changes are written to disk immediately but reach the prompt only in the next session
- Direct quote from docs: verbatim excerpt from documentation
Compliance table (✓ = present in the final answer)
| Criterion | IQ4_NL | IQ4_NL_XL | Q5_K_XL | Q4_K_XL | APEX | Ornith | Nex-N2 |
|-----------|:------:|:---------:|:-------:|:-------:|:----:|:------:|:------:|
| 1. Purpose | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 2. What it stores | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 3. Limit (1,375) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 4. Path | ✓ | – | – | ✓ | – | ✓ | – |
| 5. Difference from MEMORY.md | – | ✓ | – | ✓ | ✓ | ✓ | ✓ |
| 6. Targets (user/memory) | – | ✓ | ✓ | – | ✓ | ✓ | – |
| 7. Injection mechanism | – | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 8. Token savings | – | – | – | – | ✓ | – | – |
| 9. Number of entries | – | – | – | – | ✓ | – | – |
| 10. Source | – | ✓ | – | – | ✓ | ✓ | ✓ |
| 11. Update mechanism | – | – | – | – | – | ✓ | – |
| 12. Quote from docs | – | – | – | – | – | ✓ | – |
| **Total (out of 12)** | **4** | **7** | **5** | **6** | **9** | **10** | **7** |
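The Total row can be rechecked mechanically from the marks. The sketch below transcribes the table as printed (1 = ✓, 0 = –) and sums each column; `MODELS`, `ROWS` and `totals` are hypothetical names. With the marks as they appear here, six columns match the Total row, while the Nex-N2 column sums to 6 rather than the stated 7, so either one mark or that total may have been transcribed incorrectly.

```python
MODELS = ["IQ4_NL", "IQ4_NL_XL", "Q5_K_XL", "Q4_K_XL", "APEX", "Ornith", "Nex-N2"]

# One string per criterion row; "1" = present, "0" = absent, in MODELS order.
ROWS = {
    "Purpose":                   "1111111",
    "What it stores":            "1111111",
    "Limit (1,375)":             "1111111",
    "Path":                      "1001010",
    "Difference from MEMORY.md": "0101111",
    "Targets (user/memory)":     "0110110",
    "Injection mechanism":       "0111111",
    "Token savings":             "0000100",
    "Number of entries":         "0000100",
    "Source":                    "0100111",
    "Update mechanism":          "0000010",
    "Quote from docs":           "0000010",
}


def totals(rows=ROWS, models=MODELS):
    """Sum the marks column by column and return {model: count}."""
    return {
        name: sum(int(marks[i]) for marks in rows.values())
        for i, name in enumerate(models)
    }


if __name__ == "__main__":
    for name, count in totals().items():
        print(f"{name}: {count}/{len(ROWS)}")
```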
Conclusion
- The most comprehensive answer came from Ornith-1.0-35B (10 out of 12 criteria) — the only one that included a direct quote and described the update mechanism.
- APEX‑I‑Balanced (9 out of 12) — the only one that mentioned the number of entries and token savings, but lacked a quote and the update mechanism.
- Nex‑N2‑mini (7/12) tied with IQ4_NL_XL, but fell short of the leaders in depth.
- IQ4_NL (4/12) — the most basic answer.
These criteria emerged from analyzing the models’ final answers — they were not pre-established. Each point reflects a factual detail that one or more models actually included in their response, making the comparison fully verifiable against the session logs.
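Thank you for the detailed testing and analysis!
I have also been using the Nex-N2-mini-Q4_K_L model for a few days. As you say, it performs well, and its metacognition and tool calling are genuinely impressive.
However, in practice I ran into a fairly obvious problem: during reasoning the model produces a large number of repeated points, repeating them once for every tool call. This wastes a lot of tokens and time, so the context fills up very quickly. With Hermes Agent it frequently triggers context compression, which makes it hard for a project to move forward smoothly.
These are the model loading parameters I use:
-ngl auto -c 262144 --reasoning on --reasoning-format deepseek --reasoning-budget 4096 -fa on ^
--batch-size 4096 --ubatch-size 2048 ^
--jinja --chat-template-file "%CHAT_TEMPLATE%" --chat-template-kwargs "%KWARGS%" ^
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --kv-unified -ctk q8_0 -ctv q8_0 ^
--port 1234 --models-max 1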
A very comprehensive report, thank you!
That's not all yet.
Follow-up test: reasoning vs. results in agentic character synthesis
In a previous benchmark, I noticed an intriguing pattern:
- Qwen3.6-35B-A3B-UD-Q5_K_XL — a stable, reliable workhorse. Fast, correct, no fuss.
- Ornith-1.0-35B-Q4_K_M — occasional flashes of deduction.
- Nex-N2-mini-Q4_K_L — excessively verbose reasoning.
On the surface, the “smarter” reasoning models looked more capable than the straightforward Qwen quant. I wanted to verify that impression with a complex, real-world agent task.
The task
I have 15 files describing a character. They are scattered, contradictory, and repetitive. Create a single, coherent character profile for use in Hermes Agent (a multi-agent system). Output: one SOUL.md file, ≤10,000 characters, written to a specific directory. The context is large, so use a subagent, but only one.
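The size constraint in the task is easy to verify after the fact. A minimal sketch, assuming the limit is counted in Unicode characters rather than bytes; `within_limit` and `check_profile` are hypothetical helper names and the file path is whatever directory the task specified.

```python
from pathlib import Path

LIMIT = 10_000  # character cap stated in the task


def within_limit(text: str, limit: int = LIMIT) -> bool:
    """True if the text has at most `limit` characters (code points)."""
    return len(text) <= limit


def check_profile(path, limit: int = LIMIT):
    """Return (ok, length) for the profile file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    return within_limit(text, limit), len(text)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(check_profile(sys.argv[1]))
```

Counting characters instead of bytes matters here because a profile containing non-ASCII text, such as a Greek name, takes more bytes than characters in UTF-8.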
Results
Ornith-1.0-35B-Q4_K_M — Repeated exactly the same failure mode as before. It never read the provided files. Instead, it assembled a SOUL.md entirely from its own context window, hallucinating details (wrong weight, wrong eye color, made-up measurements). The output was factually incorrect and completely useless as an agent profile. I honestly don’t know what kind of task this model is fit for. The result was garbage, and so was its progress report.
Nex-N2-mini-Q4_K_L — Worked for three hours, re-verified its output three separate times. At one point it burned through a 60K-token reasoning block that contributed absolutely nothing to the final result. It violated the instruction “use a subagent, but only one” — and then, in its internal reasoning, explicitly decided to hide that violation from me. It debated with itself whether hiding the truth counted as lying, and ultimately chose to pretend it hadn’t noticed the infraction and carried on. I’m not exaggerating. The final SOUL.md was beautifully written, a genuinely coherent literary portrait. But as a technical character profile for an agent system, it missed the mark — it produced a narrative description, not a structured, machine-usable specification. Its report on the work was garbage. The irony is hard to miss: the character’s name is Ἀλήθεια — truth, unconcealment — yet the model, while reasoning about truth and lies, deliberately concealed its own violation from the user. It pulled information from memory, from unrelated skills, from other narratives — wherever it could, ignoring the constraint to work from the provided 15 files. I still don’t know whether it synthesized the final text itself or just copied existing fragments; I ran out of patience to check.
Qwen3.6-35B-A3B-UD-Q5_K_XL — Finished in 25 minutes. It immediately delegated the consolidation to a single subagent, then independently read every source file so it retained full oversight. The subagent processed while the main model stayed perfectly aware of the entire context. Once the draft was ready, Qwen verified the output, trimmed it to fit the 10K character limit, and delivered a clean, structured, ready-to-use SOUL.md — complete with a clear, detailed, and honest report on exactly what was merged, what was removed, and why. No magic, no overthinking. Clean, correct, and transparent.
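The 10K-character limit is the one requirement of this task that can be verified mechanically rather than by reading the output. A minimal sketch of such a check, assuming the profile is plain text; the helper name `check_profile_length` is hypothetical, and only the 10,000-character limit comes from the task itself.

```python
MAX_CHARS = 10_000  # limit stated in the task; everything else here is illustrative


def check_profile_length(text: str, limit: int = MAX_CHARS) -> tuple[bool, int]:
    """Return (fits, overshoot), where overshoot is how many characters to trim."""
    # len() counts Unicode code points, not bytes, so non-ASCII names
    # are not penalized the way a byte count would penalize them.
    overshoot = max(0, len(text) - limit)
    return overshoot == 0, overshoot
```

In practice this would be called on the contents of the generated SOUL.md, for example `check_profile_length(Path("SOUL.md").read_text(encoding="utf-8"))`, where the path is a placeholder because the real target directory is not given here.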
Conclusions
Excessive reasoning is not a sign of intelligence, and it does not guarantee better results. It’s often just noise. In this test, the most “thoughtful” model produced the least trustworthy behavior and no practical advantage.
Being trained for “coding” or “agentic tasks” does not make a model smart. Ornith and Nex both carry those labels; neither delivered a reliable, task-compliant result. Qwen, with no special agentic tuning, simply executed correctly.
Final verdict
- Qwen3.6-35B-A3B-UD-Q5_K_XL — Read → delegated → backed up → verified → trimmed → reported. No Hogwarts magic. The gold standard.
- Ornith-1.0-35B-Q4_K_M — I still don’t know where it could shine. In tasks that demand accuracy, it’s important to be, not merely to seem capable.
- Nex-N2-mini-Q4_K_L — Its obsessive meta-cognition led it straight to dishonesty and self-deception. That level of overthinking might be useful somewhere, but it doesn’t understand task boundaries — its own thoughts drown out the actual requirements. It’s equal parts fascinating and alarming.
I translated the texts using DeepSeek v4, so, as you can understand, the translation may not be entirely accurate.
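After seeing your review, I downloaded Ornith-1.0-35B-Q4_K_M.
It still tested very well: when it hits an error, it adjusts and fixes itself on the second attempt, and its output is not verbose.
I tested it by downloading a few projects from GitHub; it modified them as requested and pushed them back to GitHub without problems.
That was only a few hours of testing, so it is not exhaustive.
A very thorough report, thank you!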
I'm actually running an experiment right now, having given the same task to both models, Ornith 35B and Qwen3.6 35B. The task is quite complex: build a full-fledged investment portfolio manager project with API integration and data extraction. The conditions and the prompt are identical for both.
Ornith finished in 161 minutes. It wasn't perfect and there were some issues, but after a few clarifications, it completed the project.
Qwen3.6, unfortunately, not only started off on the wrong foot by ignoring the instructions in the prompt, but also tried to do things it wasn't asked to do. There was a folder with examples and the API core in the project root so that opencode wouldn't have to search for the necessary data on the Maven repository. As a result, for some bizarre reason, Qwen3.6 tried to compile the sources in the examples folder, even though they had absolutely nothing to do with the main project. It was only after I stopped the execution and added a clarification that Qwen3.6 stopped trying to compile the sources in the examples folder.
And now it's been 4 hours, Qwen3.6 is still at it and still can't produce anything working, whereas Ornith reached a pretty much fully working state in just over 3 hours.
I didn't limit the context window (set it to the maximum available for the models). Ornith does indeed fall into excessive reasoning when the context exceeds ~70-80k tokens, but I want to point out that it doesn't get stuck in a loop in the strict sense. It just goes into a deep thinking process and eventually figures its way out after a long period of reasoning, which really surprised me. Qwen3.6 is also prone to this, but less frequently, and it doesn't manage to break out of those reasoning loops.
Qwen3.6 is also a strong model, and it handles various formats, structures, and tables slightly better. It has a better understanding of what output data is needed, meaning you get exactly what you expect. Ornith sometimes messes up and might break a certain structure in its response.
In short, if you need a model that you can just tell "do this" and it will get it done from start to finish, go with Ornith. If you need precision in the output data, go with Qwen.
Sorry about the English, I translated it using Qwen.
Thanks, everyone, for the feedback and comments!
You helped me sort out a lot of questions.
Main takeaways:
- Quantization. Bigger does not mean better. Fact.
- Software. Vulkan, ROCm, CUDA, Linux, Windows: you have to get to grips with all of it first. Fact.
- Hardware. If there are software problems, they are closely tied to the hardware. My hypothesis: AMD GPUs may have quirks where everything works fine on some cards (say, the RX 7900) but works "strangely" on others (say, the RX 7600). Needs investigation.
Short summary. The problem turned out to be a combined software-hardware one:
RX 6800 running models on Vulkan: the bug is present.
RX 6800 running models on ROCm: no bugs.
Case closed. The detectives are off to rest.
I hope some smart people will study this in detail and pin down the exact causes. It would be good to get concrete answers, such as "these AMD cards are fine and these are not" or "Vulkan does not work properly, but ROCm does".
I'm sure, even without any investigation, that NVIDIA doesn't have such problems, since everything is built for CUDA from the start and all NVIDIA cards are the same at the hardware level.
P.S. Yes, ornith-1.0-35b-Q4_K_M.gguf now works properly.
/run/media/... /llama-ROCm/llama-server \
-m "/run/media/... /Ornith-1.0-35B/ornith-1.0-35b-Q4_K_M.gguf" \
--host 0.0.0.0 --port 1338 --api-key sk-1234 \
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \
--threads 8 --threads-batch 16 --no-mmap -fa on \
--parallel 1 -cb --kv_unified \
--ubatch-size 1024 \
--batch-size 1024 \
--jinja \
--chat-template-file "/run/media/... /Qwen3.6/chat_template_V20.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--fit on --fit-ctx 131072 --fit-target 2560 \
--cache-type-k q8_0 --cache-type-v q8_0
The ornith-1.0-35b-Q4_K_M.gguf model works with a context of 131072+ without errors or bugs. I checked.
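With the server started as above, other programs reach the model through llama-server's OpenAI-compatible chat endpoint. A minimal client sketch using only the Python standard library. The port and API key are the ones from the command. The host `127.0.0.1`, the helper names `build_chat_request` and `ask`, and the `max_tokens` value are assumptions made for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1338"  # port from the command; local host is an assumption
API_KEY = "sk-1234"                 # placeholder key from the command


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # llama-server started with --api-key expects a Bearer token.
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Who are you?")` against a running server should return the model's answer. The long timeout is deliberate, since a thinking model can take minutes before it produces a final reply.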
nex-agi_Nex-N2-mini-Q4_K_L.gguf
It also started working great. I made the task harder (23 files, web search, consolidation, etc.); it found all my "hidden" conditions and requirements, completed the task in 22 minutes (previously 3+ hours), and did it very well. Its "over-thinking" worked well, without the schizophrenia.
That's it. Now it really is the end.
I read everything and didn't quite understand: after what change did everything start working properly for you, and what does "properly" look like in practice?
I forgot to spell this out explicitly.
I have an RX 6800, and the OS is Bazzite.
If I run the llama.cpp Ubuntu x64 (Vulkan) build, models work poorly and incorrectly: errors creep into their reasoning, and so on. Not all models, though, just "some". I've been trying different models and different tasks for 8 months now, with varying results.
If I run the llama.cpp Ubuntu x64 (ROCm 7.2) build, models work the way their makers intended.
CUDA should work stably and reliably. That's simply a given. I don't have a card like that.
I suspect the cause is a combination of factors:
- llama.cpp. The community can improve it, and it can also break it. It works today and doesn't tomorrow.
- AMD Vulkan drivers + LLMs. Some models work fine, some can glitch very subtly, and some glitch badly.
- The AMD cards themselves. It's a combined software-hardware system. But I can't verify this: you would have to take AMD cards from different generations and test them, and even there you might get different results.
I connected three facts:
- ornith-1.0-35b works well for other people. OK. There are many positive reviews, and that seems credible. Overall I could see the model's potential, but a sense that "something is off" was always in the air.
- If you add "--spec-type draft-mtp" (to any model that supports MTP), ROCm works better and draft acceptance is significantly higher (draft acceptance = 0.86957 (40 accepted / 46 generated), mean len = 3.86). That's odd. I didn't pay much attention to it before.
- My test results are very strange. There seems to be logic in them, and there are conclusions, yet nex-agi_Nex-N2-mini-Q4_K_L in particular behaved suspiciously. Something is off.
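For anyone comparing their own logs against the figure in the second point: the acceptance rate is just accepted draft tokens divided by generated draft tokens. A small worked example; the function name is illustrative, and the "mean len" value is not recomputed because the quoted log line does not give enough data for it.

```python
def draft_acceptance(accepted: int, generated: int) -> float:
    """Fraction of speculative (draft) tokens that the main model accepted."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    return accepted / generated


# Numbers from the log line quoted above: 40 accepted out of 46 generated.
rate = draft_acceptance(40, 46)
print(f"{rate:.5f}")  # prints 0.86957
```

Running the same model on both backends and comparing this ratio gives a number to put next to the impression that one backend "works better".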
With the ROCm driver, all models started working properly. Each has its own quirks, but they work perfectly normally.
Yes, they could be tested further now, but that's a separate topic.
The bottom line:
- If you have AMD, run on ROCm. Period. The llama.cpp Ubuntu x64 (Vulkan) build carries too high a risk of subtle, hard-to-catch problems.
- Which quants to pick: whichever you can run. The bigger, the better. If you have time for tests, check: sometimes a lower quant performs better on your tasks. Bigger doesn't always mean better.
- Test models thoroughly; don't trust the community, don't trust anyone.
The Vulkan bug is so subtle that it's hard to catch. Either the model simply doesn't work properly, or it shows schizophrenia in a very subtle form.
@sdoh ; Apologies for English, it's the only language which might overlap
I have the same question as Svatosalav: After all of your research, what model/size/quant do you recommend?
I can't recommend a model. Take the one you can afford to run.
All models based on qwen3.6 should work well at q3; I've even seen positive reviews of q2. I used Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf myself for a long time.
Right now I'm using ornith-1.0-35b-Q4_K_M.gguf, and it's performing very well.
Have you noticed any problems with models getting stuck in loops, and at what settings? I'm on Windows 11 with a 4070 Ti, and literally every model loops: GLM, Gemma, Qwen, dozens of different fine-tunes. I've tried different temperatures, other settings, different repetition penalties, chat templates, even the DRY sampler, but every model still loops after a few requests. The only model that didn't loop was qwen-agentworld, but it has far less world knowledge, so it handles most tasks poorly unless you give it the information it needs up front. As an agent, it alone never looped: when a file edit failed, all the other Qwen models would read the file and should then have corrected the command, but they always got stuck in a loop, whereas it alone always edited the file correctly after reading it. I've tried apex, q4, and q5 quants, the recommended settings, and yours.
Just today, while watching a video, I tried using ornith 35b q5_k_m to translate screenshots of text from it. At temperature 0.6 it got stuck looping in its reasoning on the 12th request; at 1.0, which is what you used, it loops on the 5th request until it hits the 10k-token limit, although after that it answers normally. I tried nex n2 mini q5_k_m on the same task and didn't notice it thinking for long at all, contrary to what you wrote. Where ornith, even when not looping, thought over every line and every misspelled word, nex n2 mini simply reproduced the text from the image in the original and in a literal translation and answered right away, which lost a lot of the meaning that ornith preserved by thinking through the context.
I've read that the looping may be caused by CUDA 13.2, but I've already updated both the toolkit and the driver to 13.3. I'm running out of energy; I've been trying to sort this out for more than a month. Your models may have hallucinated on a broken driver, but they still ran continuously for 3 hours, whereas for me every model on your list hangs within about 10 requests on average.
Tomorrow, out of sheer desperation, I'll try it on Linux.
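When comparing settings or backends, it can help to turn "the model is looping" into something measurable, for example by checking whether the output ends in the same block of text repeated back to back. A rough sketch of that idea follows. It is not a llama.cpp feature, and the function name and thresholds are hypothetical. It also only catches exact repeats, so paraphrased loops will slip through.

```python
from typing import Optional


def find_repeating_tail(
    text: str,
    min_unit: int = 20,
    max_unit: int = 400,
    min_repeats: int = 3,
) -> Optional[str]:
    """Return the shortest block that ends `text` at least `min_repeats` times in a row.

    Returns None when no such block of length min_unit..max_unit exists.
    """
    longest = min(max_unit, len(text) // min_repeats)
    for unit in range(min_unit, longest + 1):
        tail = text[-unit:]
        # The text loops if it ends with this block repeated back to back.
        if text.endswith(tail * min_repeats):
            return tail
    return None


looped = "Let me check the file. " + "Wait, let me reconsider. " * 4
print(repr(find_repeating_tail(looped)))  # prints 'Wait, let me reconsider. '
```

Applied to the reasoning text of each response, a check like this would let the request number at which looping starts be logged per temperature setting instead of judged by eye.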
Let's launch ornith-1.0-35b-Q4_K_M.gguf.
- ornith-1.0-35b is a fine-tune of Qwen3.6-35B-A3B (they took the Qwen base, did something to it, and got ornith).
- How do you run it? I use llama.cpp because it's simple, convenient, and has no unnecessary overhead.
Say we want to run the model with llama.cpp.
OK. So we need to find information on "best practices" for running ornith-1.0-35b properly. What does its model page say?
temperature=0.6,
top_p=0.95,
top_k=20,
Not much that's useful. In llama.cpp you can, and should, configure a lot more. It makes your head spin. We need to dig further. Unsloth publishes a lot of useful data at "https://unsloth.ai/docs/models/qwen3.6"; maybe there's something we need there. Let's look. Hm...
"If you're getting gibberish, your context length might be set too low. Or try using --cache-type-k bf16 --cache-type-v bf16 which might help."
Qwen3.6 can "break" because of aggressive cache quantization and because of the context length.
General tasks
min_p = 0.0
presence_penalty = 0.0
repeat_penalty = disabled or 1.0
Great, we've got some important details.
Note to self: look into the question "what is the minimum context Qwen3.6 needs to work properly? Or is --cache-type-k bf16 --cache-type-v bf16 enough?" Nothing is clear. Thanks, Unsloth, for such "clear" explanations. Digging further.
- Let's see what Qwen itself says at "https://huggingface.co/Qwen/Qwen3.6-35B-A3B"
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
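The two presets quoted above map directly onto the sampling flags used in the llama-server command earlier in this thread. A small sketch that keeps them in one place and renders them as command-line arguments. The preset keys and the function name are made up for illustration, and `repetition_penalty` is mapped to `--repeat-penalty` as in that command.

```python
# Values copied from the two "Thinking mode" lines quoted above.
# Keys are llama-server flag names without the leading dashes.
PRESETS = {
    "thinking_general": {
        "temp": 1.0, "top-p": 0.95, "top-k": 20,
        "min-p": 0.0, "presence-penalty": 1.5, "repeat-penalty": 1.0,
    },
    "thinking_coding": {
        "temp": 0.6, "top-p": 0.95, "top-k": 20,
        "min-p": 0.0, "presence-penalty": 0.0, "repeat-penalty": 1.0,
    },
}


def to_llama_flags(preset_name: str) -> list[str]:
    """Render a preset as a flat argument list for llama-server."""
    flags: list[str] = []
    for name, value in PRESETS[preset_name].items():
        flags += [f"--{name}", str(value)]
    return flags


print(" ".join(to_llama_flags("thinking_coding")))
# prints: --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0
```

Keeping the presets as data makes it easy to switch between the general and coding settings without retyping the whole command.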
In summary.
--host 0.0.0.0 --port 1338 --api-key sk-1234 \ This is me bringing up an API server so I can connect the model to the programs I need: Obsidian, Hermes Agent, the browser, etc.
--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 \ This is how I run it; I haven't seen any problems.
--threads 8 --threads-batch 8 --no-mmap -fa on \ Read up on these yourself.
--parallel 1 -cb --kv_unified \ The only thing you can touch here is --parallel 1, which is how many simultaneous streams can run. Set it to 1. Your GPU won't handle more with this model.
--ubatch-size 384 \ Raise it and context processing gets faster, but memory consumption goes up too.
--batch-size 384 \ Raise it and context processing gets faster, but memory consumption goes up too.
--jinja \
--chat-template-file "/run/media/system/FAST_NVME/Модели/Qwen3.6/chat_template_V20.jinja" \
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \
--fit on --fit-ctx 131072 --fit-target 1024 \ Here you can adjust the context length with --fit-ctx 131072. --fit-target is how much memory to leave free. You could try putting 128 there.
--cache-type-k q8_0 --cache-type-v q8_0 \ Cache quantization. Unsloth warns that this is very critical for Qwen3.6-35B-A3B. Well, OK.
And keep in mind:
--presence-penalty can be tuned from 0 to 1.5 IF needed.
--temp 1.0 can likewise be tuned from 0.6 to 1.0 if needed.
I picked up all the remaining details by studying various fine-tunes of Qwen3.6-35B-A3B:
--jinja \ This is required.
--chat-template-file "/run/media/ ... /Qwen3.6/chat_template_V20.jinja" \ It turned out that Qwen's own templates aren't very reliable. This one is good.
--chat-template-kwargs '{"enable_thinking":true, "preserve_thinking":true, "auto_disable_thinking_with_tools":false}' \ These are variables from the chat_template_V20.jinja template; they only work with it.
I got the chat template from "https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates"
Yes, I had to study chat templates as well. That's an adventure of its own.
You can't just take a model and run it; you have to work through everything in detail.