Instructions to use yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5")
model = AutoModelForMultimodalLM.from_pretrained("yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5
- SGLang
How to use yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5 with Docker Model Runner:
docker model run hf.co/yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5
Qwythos-9B-oQ3.5
Apple Silicon Optimized oQ3.5 MLX Quantized Release
This repository contains an oQ3.5 mixed-precision MLX quantized version of Qwythos-9B, optimized for efficient local inference on Apple Silicon devices.
The original Qwythos-9B model was developed by Empero AI. This repository contains an optimized MLX/oQ3.5 conversion only—no additional fine-tuning or retraining has been performed.
About Qwythos
Qwythos-9B is a full-parameter reasoning model built upon Qwen3.5-9B and trained on over 500 million tokens of carefully curated reasoning data.
The original model specializes in:
- 🧠 Advanced reasoning
- 💻 Programming
- 🛠 Native function calling
- 🤖 Tool use
- 🔐 Cybersecurity
- 🧬 Biomedical reasoning
- ➗ Mathematics
- 🔬 Scientific reasoning
- 📚 Long-context agent workflows
Key capabilities include:
- 1,048,576 token context window
- Native function calling
- Excellent coding performance
- Strong mathematical reasoning
- Tool-assisted self-correction
- Long-context understanding
- Uncensored technical reasoning
For complete benchmark results, training methodology, and evaluation details, please visit the original repository:
https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M
Quantization
This release uses oQ3.5 mixed-precision quantization.
Specifications
- Format: MLX
- Quantization: oQ3.5
- Method: Sensitivity-Aware Mixed Precision
- Target Platform: Apple Silicon
- Inference: MLX / oMLX
Unlike traditional uniform quantization, oQ dynamically allocates precision according to layer sensitivity, preserving higher precision for the most important weights while aggressively compressing less sensitive regions.
The goal is to balance model quality against resource use:
- Higher reasoning quality
- Better coding performance
- Lower memory usage
- Faster inference
- Excellent Apple Silicon efficiency
Recommended Settings
For the best reasoning performance:
temp: 0.6
top_p: 0.95
top_k: 20
min_p: 0
rep_penalty: 1.05
presence_penalty: 1.5
enable_thinking: true
These settings are intended to work well across:
- Reasoning
- Mathematics
- Programming
- Tool Use
- Scientific Questions
- Agent Workflows
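When the model is served behind an OpenAI-compatible endpoint, the recommended settings above could be passed in the request body. The sketch below only builds the JSON payload. The field names `top_k`, `min_p`, `repetition_penalty`, and `chat_template_kwargs` are assumptions: they are non-standard extensions that some servers accept, so check your server's documentation before relying on them. The `build_chat_request` helper and the `max_tokens` default are hypothetical.

```python
import json

# Recommended sampling settings from this model card, mapped to request fields.
# Assumption: the server accepts these extension fields in addition to the
# standard OpenAI ones (temperature, top_p, presence_penalty).
RECOMMENDED = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repetition_penalty": 1.05,
    "presence_penalty": 1.5,
}


def build_chat_request(model: str, user_text: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload that carries the recommended settings."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,  # hypothetical default, adjust as needed
    }
    payload.update(RECOMMENDED)
    # Assumption: the chat template reads an enable_thinking flag passed this way.
    payload["chat_template_kwargs"] = {"enable_thinking": True}
    return payload


request = build_chat_request(
    "yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5",
    "What is the capital of France?",
)
print(json.dumps(request, indent=2))
```

The printed JSON can be sent as the `--data` body of a `curl` call to a `/v1/chat/completions` endpoint.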
Example Usage
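from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized weights and tokenizer from this repository
model, tokenizer = load("yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5")

messages = [
    {
        "role": "user",
        "content": "Explain speculative decoding."
    }
]

# Render the chat template to a prompt string with thinking enabled
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than temp/top_p/top_k keyword arguments on generate()
sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=16384,
)
print(response)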
Optimized For
This release is optimized for:
- Apple M1
- Apple M1 Pro / Max / Ultra
- Apple M2 Series
- Apple M3 Series
- Apple M4 Series
Compatible with:
- MLX
- oMLX
- Open WebUI
- LM Studio (MLX)
- MLX-LM
- Local AI Applications
Intended Use
Qwythos-9B-oQ3.5 is well suited for:
- Software Engineering
- AI Coding Assistants
- Long Context Analysis
- Scientific Research
- Mathematical Reasoning
- Cybersecurity
- Biomedical Analysis
- Local AI Agents
- Tool Calling Applications
- Research & Education
Hardware Recommendations
Recommended systems:
- Apple M1 Pro / Max / Ultra
- Apple M2 Pro / Max / Ultra
- Apple M3 Series
- Apple M4 Series
Higher-memory configurations are recommended when utilizing the full 1M context window.
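To get a feel for why long contexts need more memory, the following back-of-the-envelope sketch estimates KV-cache size for standard attention layers. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the real Qwythos-9B or Qwen3.5-9B configuration, and the formula only covers layers that keep a conventional key/value cache. Read the actual values from the model's `config.json` before drawing conclusions.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for standard attention layers."""
    # Factor of 2 accounts for storing both keys and values.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


# Hypothetical architecture values, for illustration only.
HYPOTHETICAL = dict(layers=32, kv_heads=4, head_dim=128)

for tokens in (32_768, 262_144, 1_048_576):
    size = kv_cache_gib(tokens, **HYPOTHETICAL)
    print(f"{tokens:>9,} tokens -> {size:.1f} GiB")
```

With these placeholder values and 16-bit cache entries, the estimate grows linearly from 2 GiB at 32,768 tokens to 64 GiB at 1,048,576 tokens, on top of the model weights themselves.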
About oQ Quantization
oQ is a sensitivity-aware mixed-precision quantization technique designed to maximize model quality while significantly reducing memory usage.
Instead of quantizing every layer identically, oQ analyzes layer importance and preserves additional precision where it matters most.
Benefits include:
- Better reasoning retention
- Improved coding performance
- Higher mathematical accuracy
- Lower memory usage
- Faster inference
- Excellent Apple Silicon optimization
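The idea of giving sensitive layers more precision can be illustrated with a small greedy allocator. This is a simplified sketch, not the actual oQ algorithm: the sensitivity scores, layer names, candidate bit widths, and the 3.5-bit average target are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def allocate_bits(sensitivity: Dict[str, float],
                  params: Dict[str, int],
                  target_avg_bits: float,
                  choices: Sequence[int] = (3, 4, 6, 8)) -> Dict[str, int]:
    """Greedily assign bit widths so more sensitive layers get more precision.

    Every layer starts at the lowest width. Layers are then visited from most
    to least sensitive, and each takes the highest width that keeps the
    parameter-weighted average at or below target_avg_bits.
    """
    bits = {name: choices[0] for name in sensitivity}
    budget = target_avg_bits * sum(params.values())  # total bit budget
    used = sum(bits[name] * params[name] for name in bits)

    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        for width in sorted(choices, reverse=True):
            extra = (width - bits[name]) * params[name]
            if extra > 0 and used + extra <= budget:
                bits[name] = width
                used += extra
                break
    return bits


# Hypothetical layers with made-up sensitivity scores and parameter counts.
sensitivity = {"embed": 0.9, "attn.0": 0.5, "mlp.0": 0.1, "mlp.1": 0.05}
params = {name: 100 for name in sensitivity}

plan = allocate_bits(sensitivity, params, target_avg_bits=3.5)
print(plan)
```

In this toy setup the two most sensitive layers end up at 4 bits and the other two stay at 3 bits, for an average of 3.5 bits per weight. A real method would measure sensitivity from calibration data rather than use fixed scores.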
Credits
Original Model
All credit for the original model, datasets, training methodology, evaluation, benchmarks, and research belongs entirely to:
Empero AI
Original Repository:
https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M
Base Model:
https://huggingface.co/Qwen/Qwen3.5-9B
oQ3.5 MLX Quantized Release
This repository provides an Apple Silicon optimized oQ3.5 MLX quantized version of the original model.
No additional fine-tuning has been performed.
Acknowledgements
- Empero AI
- Alibaba Qwen Team
- Apple MLX
- Hugging Face
- Transformers
- TRL
- EleutherAI
- oMLX
- OptiQ Quantization
Citation
If you use this model in research, please cite the original Qwythos-9B model and the Qwen3.5 base model.
License
This release inherits the Apache-2.0 license from the original model.
Please refer to the original repository for complete licensing information.
Disclaimer
This repository contains an optimized oQ3.5 MLX quantized conversion intended for efficient local inference on Apple Silicon devices.
All original model architecture, datasets, training, benchmarks, evaluations, and research remain entirely the work of the original authors.
Model tree for yugeshkarunamurthy/Qwythos-9B-Claude-Mythos-5-1M-oQ3.5
Base model
Qwen/Qwen3.5-9B-Base