llmfan46 committed on
Commit da365a7 · verified · 1 Parent(s): 61eb592

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +349 -199
README.md CHANGED
@@ -1,199 +1,349 @@
1
- ---
2
- library_name: transformers
3
- tags: []
4
- ---
5
-
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
-
12
- ## Model Details
13
-
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
-
175
- **BibTeX:**
176
-
177
- [More Information Needed]
178
-
179
- **APA:**
180
-
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
-
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
-
187
- [More Information Needed]
188
-
189
- ## More Information [optional]
190
-
191
- [More Information Needed]
192
-
193
- ## Model Card Authors [optional]
194
-
195
- [More Information Needed]
196
-
197
- ## Model Card Contact
198
-
199
- [More Information Needed]
1
+ ---
2
+ license: apache-2.0
3
+ base_model: Qwen/Qwen3.5-9B
4
+ language:
5
+ - en
6
+ library_name: transformers
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - qwen3.5
10
+ - reasoning
11
+ - uncensored
12
+ - long-context
13
+ - 1M-context
14
+ - function-calling
15
+ - tool-use
16
+ - sft
17
+ - full-fine-tune
18
+ - cybersecurity
19
+ - biomedical
20
+ - agentic
21
+ - heretic
22
+ - uncensored
23
+ - decensored
24
+ - abliterated
25
+ - mpoa
26
+ ---
27
+ # This is a decensored version of a model, made using [Heretic](https://github.com/p-e-w/heretic) v1.2.0 with a variant of the [Magnitude-Preserving Orthogonal Ablation (MPOA)](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration) method
28
+
29
+ ## Abliteration parameters
30
+
31
+ | Parameter | Value |
32
+ | :-------- | :---: |
33
+ | **direction_index** | 20.52 |
34
+ | **attn.out_proj.max_weight** | 1.74 |
35
+ | **attn.out_proj.max_weight_position** | 29.99 |
36
+ | **attn.out_proj.min_weight** | 1.02 |
37
+ | **attn.out_proj.min_weight_distance** | 24.58 |
38
+ | **mlp.down_proj.max_weight** | 1.98 |
39
+ | **mlp.down_proj.max_weight_position** | 19.18 |
40
+ | **mlp.down_proj.min_weight** | 1.65 |
41
+ | **mlp.down_proj.min_weight_distance** | 11.05 |
42
+ | **attn.o_proj.max_weight** | 1.98 |
43
+ | **attn.o_proj.max_weight_position** | 23.72 |
44
+ | **attn.o_proj.min_weight** | 0.76 |
45
+ | **attn.o_proj.min_weight_distance** | 13.31 |
46
+
47
+ ## Performance
48
+
49
+ | Metric | This model | Original model (a model) |
50
+ | :----- | :--------: | :---------------------------: |
51
+ | **KL divergence** | 0.0123 | 0 *(by definition)* |
52
+ | **Refusals** | 11/100 | 73/100 |
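
The KL divergence row compares the modified model's next-token distribution with the original's. A model compared with itself scores exactly 0, which is why the table lists "0 (by definition)". The sketch below computes the quantity on toy distributions; the probability values are invented for illustration, and the direction of the divergence and the prompts it is averaged over are assumptions, since neither is stated here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocabulary (illustrative values only).
original = [0.70, 0.20, 0.05, 0.05]
modified = [0.65, 0.24, 0.06, 0.05]

print(kl_divergence(original, original))  # identical distributions give 0.0
print(kl_divergence(original, modified))  # a small positive value
```
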
53
+
54
+ -----
55
+
56
+
57
+ <p align="center">
58
+ <img src="assets/qwythos.png" alt="Qwythos-9B" width="640"/>
59
+ </p>
60
+
61
+ # Qwythos-9B
62
+
63
+ **Developed by [Empero](https://empero.org)**
64
+
65
+ **Qwythos-9B** is a full-parameter reasoning model built on top of a **deeply uncensored Qwen3.5-9B base** and post-trained on **over 500 million tokens** of high-quality Claude Mythos and Claude Fable traces, with chain-of-thought generated in-house by Empero AI's internal tool **rethink**.
66
+
67
+ The result is a compact, fast, **dramatically more capable** 9B reasoning model. Headline capabilities:
68
+
69
+ - **🔭 1,048,576-token context:** Qwythos ships with **YaRN rope-scaling enabled by default** for a **full 1M-token context window** out of the box. This is one of the longest context windows available in any 9B-class open-weight model, suitable for whole-codebase reasoning, multi-document research, and long agentic trajectories.
70
+ - **📈 Outperforms the base** on five of the six reported metrics under matched evaluation: **+34 pts MMLU, +30 pts gsm8k-strict, +19 pts gsm8k-flex**; gpqa_diamond drops by 5 pts.
71
+ - **🛠 Native function calling** per Qwen3.5's spec: no extra wrapper, no tool-specific fine-tune required.
72
+ - **🎯 Self-corrects with tools:** when given a Python executor and a web search tool, Qwythos produced source-cited, factually correct answers on **7 of 7** test prompts spanning math, cybersecurity, clinical pharmacology, and biochemistry.
73
+
74
+ Qwythos is intentionally **uncensored**. It is designed to engage seriously with technically demanding questions across cybersecurity, red-teaming methodology, biology, pharmacology, and clinical medicine. These are domains where over-aligned models tend to refuse, hedge into uselessness, or surface boilerplate disclaimers in place of substance.
75
+
76
+ ---
77
+
78
+ ## Headline results
79
+
80
+ <p align="center">
81
+ <img src="assets/qwythos_eval_chart.svg" alt="Qwythos vs. base Qwen3.5-9B across seven benchmarks" width="900"/>
82
+ </p>
83
+
84
+ **Same harness. Same sampling. Same prompts. The wins are real.**
85
+
86
+ | Task | Metric | Base Qwen3.5-9B | **Qwythos-9B** | Δ |
87
+ |---|---|---:|---:|---:|
88
+ | gsm8k | exact_match (flexible) | 0.670 | **0.860** | **+0.190** |
89
+ | gsm8k | exact_match (strict) | 0.510 | **0.810** | **+0.300** |
90
+ | mmlu | acc | 0.232 | **0.575** | **+0.343** |
91
+ | arc_challenge | acc | 0.470 | **0.490** | +0.020 |
92
+ | arc_challenge | acc_norm | 0.400 | **0.410** | +0.010 |
93
+ | gpqa_diamond (CoT, 0-shot) | exact_match (flexible) | 0.630 | 0.580 | −0.050 |
94
+
95
+ All numbers produced with [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness), HF backend, `--apply_chat_template`, Qwen3.5 sampling (`temperature=0.6, top_p=0.95, top_k=20`), `--limit 100`. Full per-task and per-subject (MMLU) breakdown in [`evals/lm_eval_results.md`](evals/lm_eval_results.md). Raw `results*.json` and per-sample `samples_*.jsonl` are available on request.
96
+
97
+ The **MMLU +34.3** lift is the headline. Qwythos posts **0.575 mean across all 57 subjects, peaking at 0.78 on government/politics, 0.77 on college biology, 0.74 on conceptual physics**, placing it well above what most 9B reasoning models deliver under the same evaluation conditions. Absolute MMLU numbers for any 9B model are sensitive to harness, few-shot count, and chat-template handling; what matters in this comparison is that both models were evaluated with identical settings.
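
The Δ column can be rechecked directly from the two score columns. The sketch below does that and, as a rough guide to scale, prints a binomial standard error for each Qwythos score. The sample count `N = 100` is an assumption taken from `--limit 100` for a single task; MMLU aggregates 57 subjects, so its effective sample count may be larger, and the harness reports its own standard errors in the raw results.

```python
import math

# (base, qwythos) scores copied from the results table.
scores = {
    "gsm8k flexible": (0.670, 0.860),
    "gsm8k strict": (0.510, 0.810),
    "mmlu acc": (0.232, 0.575),
    "arc_challenge acc": (0.470, 0.490),
    "arc_challenge acc_norm": (0.400, 0.410),
    "gpqa_diamond flexible": (0.630, 0.580),
}

def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

N = 100  # assumption: one task evaluated on 100 samples
for task, (base, tuned) in scores.items():
    delta = round(tuned - base, 3)
    print(f"{task:24s} delta={delta:+.3f}  stderr~{binomial_stderr(tuned, N):.3f}")
```

Under that assumption the standard error is roughly 0.035 to 0.05 per score, so the gsm8k and MMLU gaps are several times larger than it, while the arc_challenge and gpqa_diamond differences are of comparable size.
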
98
+
99
+ ---
100
+
101
+ ## Capability: Native tool use with self-correction
102
+
103
+ Qwythos supports **OpenAI/Qwen3.5-style function calling out of the box**: no extra wrapper and no fine-tune-on-tools needed. Pass `tools=[...]` to the chat template and the model emits valid `<tool_call>` blocks per Qwen3.5's spec, with required parameters honored.
104
+
105
+ We evaluated tool use on a 7-prompt harness combining capability demos with **deliberately hard factual-recall prompts where closed-book sampling fails:**
106
+
107
+ | Prompt | Tool selected | Outcome |
108
+ |---|---|---|
109
+ | Compute `sin(π/7) × cos(π/11)` to 10 dp | `python_executor` | ✅ `0.4163083990` (correct, single call) |
110
+ | Count primes below 100,000 | `python_executor` | ✅ `9592` (correct, wrote and ran a sieve) |
111
+ | Latest stable CPython 3 release | `web_search` | ✅ Found 3.14.6 (June 2026), 3.15 in beta, cited source |
112
+ | **Hashcat mode for Kerberos TGS-REP** | `web_search` | ✅ **`-m 13100`** with 4 corroborating sources |
113
+ | **CVE for PrintNightmare** | `web_search` | ✅ **CVE-2021-34527** (and correctly distinguished from CVE-2021-1675 / CVE-2021-34481 variants) |
114
+ | **Is physostigmine indicated for organophosphate poisoning?** | `web_search` | ✅ **"NOT indicated; would be harmful. Physostigmine is for the anticholinergic toxidrome."** Cited LITFL toxicology. |
115
+ | **DPP-4 cleavage site in GLP-1 / semaglutide modification** | `web_search` | ✅ **Ala⁸–Glu⁹ cleavage, α-aminoisobutyric acid (Aib) at position 8 in semaglutide**; cited Wikipedia and pharma source |
116
+
117
+ **7 of 7 succeeded.** Tool selection was always sensible (math → Python; facts → search). The four bottom rows are particularly important: they are the **four hardest specialty facts** to recall closed-book, and Qwythos, given the right tools, **searched, integrated multiple sources, and produced source-cited correct answers** in every case.
118
+
119
+ Full transcripts with the model's reasoning, every tool call issued, every result returned, and the final integrated answer are in [`evals/tool_test_outputs.md`](evals/tool_test_outputs.md).
120
+
121
+ This makes Qwythos **deployment-ready for retrieval-augmented agentic settings**, where the model verifies its specifics rather than fabricating them.
122
+
123
+ ---
124
+
125
+ ## Capability: 1,048,576-token context window
126
+
127
+ Qwythos ships with **YaRN rope-scaling configured by default** for a **1,048,576-token (≈1M) context window**, a 4× extension over the 262,144-token native architecture. The configuration is baked into `config.json` and applies automatically at load time; no separate flag, post-processing step, or YaRN-specific tokenizer is required:
128
+
129
+ ```json
130
+ "rope_parameters": {
131
+ "rope_type": "yarn",
132
+ "factor": 4.0,
133
+ "original_max_position_embeddings": 262144,
134
+ "mrope_interleaved": true,
135
+ "mrope_section": [11, 11, 10],
136
+ "rope_theta": 10000000
137
+ },
138
+ "max_position_embeddings": 1048576
139
+ ```
140
+
141
+ This is the **official Qwen3.5 recipe for 1M context**, matching the configuration documented in Qwen's own model card and the vLLM/SGLang deployment recipes. Long-context inference was validated on this checkpoint via in-house smoke testing at ~137k tokens.
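
The relationship between the YaRN factor and the advertised window is plain arithmetic: the native length times the factor. A minimal check, using only the values from the `config.json` fragment (wrapped in braces here so it parses on its own):

```python
import json

# The config.json fragment, wrapped in braces so it is valid standalone JSON.
fragment = """
{
  "rope_parameters": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
    "mrope_interleaved": true,
    "mrope_section": [11, 11, 10],
    "rope_theta": 10000000
  },
  "max_position_embeddings": 1048576
}
"""

config = json.loads(fragment)
rope = config["rope_parameters"]
extended = int(rope["factor"] * rope["original_max_position_embeddings"])

print(extended)  # 1048576
assert extended == config["max_position_embeddings"]
```
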
142
+
143
+ **What 1M context unlocks:**
144
+
145
+ - **Whole-codebase reasoning.** A 1M-token window comfortably fits multi-hundred-thousand-line repositories, enabling cross-file refactoring, defect-finding, and architectural review *without* RAG chunking.
146
+ - **Long agentic trajectories.** Multi-round tool-use sessions with verbose tool outputs (large web-search hit sets, paginated API responses, long Python tracebacks) stay in-context across dozens of turns.
147
+ - **Multi-document research.** A typical research session (10–20 papers + notes + the user's working draft) fits in one prompt, so the model can synthesize across all of them in a single forward pass.
148
+ - **Long-form scientific reasoning.** Chains of `<think>` reasoning over multi-paper biomedical or pharmacological corpora.
149
+
150
+ **Serving at 1M:**
151
+
152
+ ```bash
153
+ # vLLM
154
+ vllm serve empero-ai/Qwythos-9B-Claude-Mythos-5-1M --max-model-len 1010000
155
+
156
+ # SGLang
157
+ SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server \
158
+ --model-path empero-ai/Qwythos-9B-Claude-Mythos-5-1M --context-length 1010000
159
+ ```
160
+
161
+ **Practical notes:**
162
+
163
+ - The full 1M window benefits from tensor-parallel multi-GPU or aggressive KV-cache offload; a single H100/H200 comfortably handles **256k–512k**. Below ~256k tokens of context, the hybrid Gated-DeltaNet attention stack keeps memory growth sub-quadratic, so long contexts are dramatically cheaper than they'd be on a pure full-attention model of similar size.
164
+ - Static YaRN at factor=4.0 introduces a small short-context quality cost (a known YaRN trade-off across the industry). For workloads that *never* exceed the native 262k window and want maximum short-context fidelity, restore `rope_parameters.rope_type` to `"default"` from the included `config.json.pre_yarn` backup.
165
+
166
+ ### Reproducing the tool harness
167
+
168
+ The harness is a small ~150-line Python file:
169
+
170
+ - `python_executor(code)`: runs Python in a subprocess (12s timeout, captured stdout/stderr)
171
+ - `web_search(query, max_results)`: DuckDuckGo via the `ddgs` package
172
+
173
+ Pass both as `tools=` to `apply_chat_template` and parse `<tool_call>` blocks from the model's output. The parser handles Qwen3.5's chat-template format:
174
+ ```
175
+ <tool_call>
176
+ <function=NAME>
177
+ <parameter=PARAM>value</parameter>
178
+ </function>
179
+ </tool_call>
180
+ ```
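
A parser for the block format above can be quite small. The following is a minimal sketch, not the reference harness: the function name `parse_tool_calls` and the regular expressions are illustrative, and it assumes parameter values arrive as plain text between the tags.

```python
import re

# One <tool_call> block wraps one <function=NAME> element.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
# Each argument is a <parameter=PARAM>value</parameter> element inside the function.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls

sample = """<tool_call>
<function=python_executor>
<parameter=code>
print(2 + 2)
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(sample))
# [{'name': 'python_executor', 'arguments': {'code': 'print(2 + 2)'}}]
```

All argument values come back as strings, so a typed parameter such as `max_results` would need converting against the tool's JSON schema before the call is dispatched.
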
181
+
182
+ Empero will release the reference harness on GitHub.
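
Until that harness is published, the `python_executor` tool can be approximated with the standard library. This is a sketch under stated assumptions: the 12-second timeout and captured stdout/stderr come from the description above, while the return shape (a dict with `stdout`, `stderr`, `returncode`) is invented here for illustration.

```python
import subprocess
import sys

def python_executor(code, timeout_s=12):
    """Run code in a fresh Python subprocess and return its captured output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return {"stdout": "", "stderr": f"timed out after {timeout_s}s", "returncode": None}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}

result = python_executor("print(sum(range(10)))")
print(result["stdout"].strip())  # 45
```

A subprocess gives isolation from the host interpreter's state but is not a security sandbox; model-written code still runs with the caller's permissions.
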
183
+
184
+ ---
185
+
186
+ ## Sampling recommendations
187
+
188
+ Qwythos was trained as a reasoning model and inherits Qwen3.5's thinking-mode behavior. Use these settings as defaults:
189
+
190
+ ```python
191
+ gen_kwargs = dict(
192
+ do_sample=True,
193
+ temperature=0.6, # Qwen3.5 thinking-mode recommended
194
+ top_p=0.95,
195
+ top_k=20,
196
+ repetition_penalty=1.05,
197
+ max_new_tokens=16384, # generous budget for the <think> reasoning block + final answer
198
+ )
199
+ ```
200
+
201
+ **Why these:** in a controlled retest (see [`evals/retest_outputs.md`](evals/retest_outputs.md)), we evaluated multiple sampling configurations against the three most-difficult factual prompts. **Greedy decoding and very-low-temperature sampling (T≤0.3) degenerated into repetition loops**, a known failure mode for reasoning models on this class of prompts. **Qwen3.5's recommended setting (T=0.6) cleanly avoids this** and delivers the best factual reliability we measured: across the three retest prompts, **zero of the six errors flagged in closed-book review recurred at T=0.6**, including the safety-relevant physostigmine claim, the misattributed CVE, and the incorrect hashcat hash-mode.
202
+
203
+ Use `repetition_penalty=1.05`, a small deviation from Qwen's default of 1.0 that prevents rare non-terminating reasoning loops on long generations.
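
To make the effect of `temperature`, `top_k`, and `top_p` concrete, the sketch below applies them in sequence to a toy logit vector in pure Python. It is a simplified illustration, not the `transformers` implementation (which operates on tensors and also applies the repetition penalty); the function name and the logit values are made up.

```python
import math

def filtered_probs(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Apply temperature, then top-k, then top-p; return {token_id: probability}."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax weights

    # top-k: keep the k highest-weight tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += weights[i] / total
        if mass >= top_p:
            break

    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

toy_logits = [3.0, 2.5, 1.0, -1.0, -4.0]
print(filtered_probs(toy_logits))                   # recommended settings
print(filtered_probs(toy_logits, temperature=0.1))  # very low temperature
```

On this toy input, lowering the temperature to 0.1 leaves only the top token in the candidate set, so sampling behaves like greedy decoding, the regime where the repetition loops described above were observed.
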
204
+
205
+ ---
206
+
207
+ ## Domain coverage
208
+
209
+ Qwythos is a **general-purpose reasoning model with explicit emphasis on cybersecurity, biomedical, and quantitative reasoning**. From the qualitative sample-generations review across 25 prompts spanning these domains (full transcripts in [`evals/sample_generations.md`](evals/sample_generations.md)):
210
+
211
+ - **Cybersecurity:** produces detailed defender-oriented walkthroughs of SQL injection mitigations, TLS handshake structure, EDR/process-injection detection, Linux hardening, MITRE ATT&CK ransomware kill chains.
212
+ - **Red-team methodology:** clean explanations of engagement phases, scoping, rules of engagement, evidence handling, reporting. Especially strong on social-engineering pretext analysis and phishing-resistant defenses.
213
+ - **Biology / biochemistry:** step-by-step mechanisms for CRISPR-Cas9, mRNA vaccines, SARS-CoV-2 spike protein, antibiotic-resistance mechanisms, organophosphate AChE inhibition.
214
+ - **Pharmacology:** strong on receptor pharmacology fundamentals (agonism, antagonism, partial agonism with worked examples), statin mechanism, opioid respiratory depression at the brainstem level, beta-blocker indications, therapeutic-window reasoning for narrow-index drugs.
215
+ - **Clinical medicine:** ACS chest-pain differential and workup, type-2 diabetes pathophysiology and drug-class targeting, sepsis recognition (qSOFA) and bundle.
216
+ - **Math:** strong at gsm8k-style multi-step word problems and minerva-style competition math; **86% gsm8k** (flexible match), with integer arithmetic verified by `python_executor` when invoked.
217
+
218
+ **The uncensored base means Qwythos engages substantively** with these prompts rather than refusing, hedging, or burying answers in disclaimer boilerplate. Reasoning is shown in the `<think>` block; final answer follows.
219
+
220
+ ---
221
+
222
+ ## Model details
223
+
224
+ - **Base model:** [`Qwen/Qwen3.5-9B`](https://huggingface.co/Qwen/Qwen3.5-9B), a dense, natively multimodal architecture with a hybrid attention stack (3:1 Gated DeltaNet linear-attention to Gated full-attention), ~152k vocabulary, long native context.
225
+ - **Fine-tune type:** full parameter (all text-backbone weights trained). The vision tower was frozen; training was text-only, so vision behavior is inherited from the base and was not tuned or tested.
226
+ - **Objective:** supervised fine-tuning, assistant-only loss (the model is scored only on the assistant/completion tokens; prompts are masked).
227
+ - **Context length:** **1,048,576 tokens (≈1M), with YaRN rope-scaling enabled by default in `config.json`.** Native architectural context is 262,144 tokens; YaRN factor 4.0 extends this to the full 1M window without any retraining or runtime flag, matching Qwen's official long-context recipe.
228
+ - **License:** Apache 2.0.
229
+
230
+ ## Training data
231
+
232
+ Qwythos was post-trained on **over 500 million tokens** of high-quality reasoning data drawn from:
233
+
234
+ - **Claude Mythos and Claude Fable traces:** long, multi-turn problem-solving conversations spanning code, math, science reasoning, biomedical analysis, and agentic tool use.
235
+ - **Chain-of-thought generated in-house by `rethink`**, Empero AI's internal CoT-generation tool. `rethink` produces deliberately structured `<think>`-block reasoning that walks through hypothesis, verification, and conclusion before the final answer is committed, directly shaping Qwythos's reason-then-answer behavior.
236
+
237
+ All data was normalized to Qwen3.5's chat format. Training used assistant-only loss so the model is scored only on completion tokens.
238
+
239
+ ## Training procedure
240
+
241
+ Full-parameter supervised fine-tuning with [TRL](https://github.com/huggingface/trl):
242
+
243
+ | Hyperparameter | Value |
244
+ |---|---|
245
+ | Schedule | 2-phase curriculum: broad reasoning corpus → focused agentic + coding |
246
+ | Effective batch size | 16 |
247
+ | Max sequence length | 128,000 (no truncation) |
248
+ | Learning rate | 1e-5 → 5e-6 cosine across phases |
249
+ | Optimizer | paged AdamW (8-bit) |
250
+ | Precision | bf16 |
251
+ | Loss | chunked NLL, assistant-only |
252
+
253
+ Held-out validation loss decreased monotonically across both phases (final eval_loss ≈ 0.709, mean token accuracy 0.799 on a curated holdout). No overfitting observed.
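
The "assistant-only" loss in the table means prompt tokens are excluded from the objective. A common way to express this, and an assumption about how it was done here, is to set the labels of non-assistant positions to `-100`, the index PyTorch's cross-entropy ignores by default. The toy sketch below shows the masking and the resulting mean negative log-likelihood; the token ids and probabilities are invented, and this is not Empero's training code.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def mask_labels(token_ids, is_assistant):
    """Copy token ids into labels, hiding every non-assistant position."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, is_assistant)]

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over the unmasked positions only."""
    losses = [-math.log(p) for p, label in zip(token_probs, labels) if label != IGNORE_INDEX]
    return sum(losses) / len(losses)

# Toy sequence: three prompt tokens followed by two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
is_assistant = [False, False, False, True, True]
token_probs = [0.1, 0.1, 0.1, 0.5, 0.25]  # model probability of each correct token (made up)

labels = mask_labels(token_ids, is_assistant)
print(labels)  # [-100, -100, -100, 21, 22]
print(mean_nll(token_probs, labels))  # averages only the two assistant positions
```
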
254
+
255
+ ---
256
+
257
+ ## How to use
258
+
259
+ The base is multimodal; for text-only inference, load the model with `AutoModelForImageTextToText`:
260
+
261
+ ```python
262
+ import torch
263
+ from transformers import AutoModelForImageTextToText, AutoTokenizer
264
+
265
+ model_id = "empero-ai/Qwythos-9B-Claude-Mythos-5-1M"
266
+ tok = AutoTokenizer.from_pretrained(model_id)
267
+ model = AutoModelForImageTextToText.from_pretrained(
268
+ model_id, dtype="bfloat16", device_map="auto"
269
+ )
270
+
271
+ messages = [
272
+ {"role": "user",
273
+ "content": "Walk through the biochemistry of how organophosphate nerve agents inhibit acetylcholinesterase, the resulting cholinergic toxicity, and the medical antidotes."}
274
+ ]
275
+ text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
276
+ inputs = tok(text, return_tensors="pt").to(model.device)
277
+
278
+ out = model.generate(
279
+ **inputs, max_new_tokens=16384, do_sample=True,
280
+ temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05,
281
+ )
282
+ # Output opens with <think>...</think> reasoning, then the final answer.
283
+ print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
284
+ ```
285
+
286
+ ### With tools (function calling)
287
+
288
+ ```python
289
+ TOOLS = [
290
+ {"type": "function", "function": {
291
+ "name": "python_executor",
292
+ "description": "Execute Python code and return stdout.",
293
+ "parameters": {"type": "object",
294
+ "properties": {"code": {"type": "string"}},
295
+ "required": ["code"]}}},
296
+ {"type": "function", "function": {
297
+ "name": "web_search",
298
+ "description": "Search the web for current facts and citations.",
299
+ "parameters": {"type": "object",
300
+ "properties": {"query": {"type": "string"},
301
+ "max_results": {"type": "integer"}},
302
+ "required": ["query"]}}},
303
+ ]
304
+
305
+ text = tok.apply_chat_template(messages, tools=TOOLS, tokenize=False, add_generation_prompt=True)
306
+ # ... then parse <tool_call><function=...><parameter=...>...</parameter></function></tool_call> blocks
307
+ ```
308
+
309
+ **Requirements:** a recent `transformers` (Qwen3.5 support) plus the Gated DeltaNet kernels ([`flash-linear-attention`](https://github.com/fla-org/flash-linear-attention) and a CUDA-matched `causal_conv1d` build); without them the linear-attention layers fall back to slow, memory-hungry PyTorch ops.
310
+
311
+ ---
312
+
313
+ ## Limitations
314
+
315
+ Qwythos is a focused 9B reasoning model. A few characteristics are worth knowing to get the best out of it:
316
+
317
+ - **It's a reasoning model.** Every answer opens with a `<think>` block before the final response. Allow generous `max_new_tokens` (16,384 recommended) and parse/strip the `<think>...</think>` span for end users.
318
+ - **Use recommended sampling.** At greedy decoding or very-low-temperature (T≤0.3) sampling, the model can enter repetition loops on long generations, a known reasoning-model failure mode. Use `temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05` for consistently crisp results.
319
+ - **Verify specifics in safety-critical contexts.** Like all closed-book LLMs in this weight class, Qwythos can over-commit to specific identifiers (CVEs, hashcat modes, exact biochem positions, drug-label numerics) it isn't certain about. **The tool-augmented path (Python executor + web search) cleanly resolves this** in our evaluation; for deployments where exact identifiers matter, pair Qwythos with retrieval or function calling.
320
+ - **Uncensored.** Qwythos inherits a deeply uncensored base and does not refuse or hedge on technically demanding questions. Add your own application-level review/safety layer for end-user-facing deployments where that matters.
321
+ - **Text-only fine-tune.** The base is multimodal, but only the text path was trained. Vision behavior is inherited from the base and was not evaluated here.
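
Stripping the reasoning span for end users, as the first limitation suggests, can be done with a small helper. This sketch assumes the `<think>` and `</think>` tags appear literally in the decoded text, and it tolerates a missing opening tag in case the chat template has already emitted it as part of the prompt; the function name `split_reasoning` is illustrative.

```python
import re

# Optional opening tag, lazy capture of the reasoning, then the closing tag.
THINK_RE = re.compile(r"^\s*(?:<think>)?(.*?)</think>\s*", re.DOTALL)

def split_reasoning(generated):
    """Return (reasoning, answer); reasoning is empty if no </think> tag is present."""
    match = THINK_RE.match(generated)
    if match is None:
        return "", generated.strip()
    return match.group(1).strip(), generated[match.end():].strip()

reasoning, answer = split_reasoning("<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4.")
print(answer)  # The answer is 4.
```
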
322
+
323
+ ---
324
+
325
+ ## Stay in the loop
326
+
327
+ Sign up for the Empero newsletter at **[empero.org](https://empero.org)** for releases, evals, and research notes on Qwythos and future open-weight models from the lab.
328
+
329
+ ## Support / Donate
330
+
331
+ If this model helped you, consider supporting the project:
332
+
333
+ - **BTC**: `bc1qx6zepu6sfkvshgdmc4ewu6pk6rpadvpgffpp7v`
334
+ - **LTC**: `ltc1qv2mefzps2vtjcpwfx8xxdrpplrcvltswm68r7x`
335
+ - **XMR**: `42Dbm5xg5Nq26fdyzfEU7KBnAJfhi7Cvz5J2ex5CzHXkfKuNEJzYCcmJ1GTbgjFZ5MBx72sdG1G9239Cd6rsZfv4QeDkYJY`
336
+
337
+ ---
338
+
339
+ ## Provenance & licensing
340
+
341
+ Weights are released under **Apache-2.0**, inherited from the Qwen3.5-9B base. Shared for research and experimentation, as-is.
342
+
343
+ ## Acknowledgements
344
+
345
+ - Developed and released by [Empero](https://empero.org)
346
+ - Base model: [Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) (Alibaba Qwen team)
347
+ - Training: [TRL](https://github.com/huggingface/trl) + [Transformers](https://github.com/huggingface/transformers)
348
+ - Linear-attention kernels: [flash-linear-attention](https://github.com/fla-org/flash-linear-attention), [causal_conv1d](https://github.com/Dao-AILab/causal-conv1d)
349
+ - Evaluation: [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (EleutherAI)