---
license: apache-2.0
datasets:
- Aquiles-ai/Athenea-40k
language:
- en
- es
- de
- fr
- it
base_model:
- huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated
tags:
- math
- code
- merge
- qwen3
- uncensored
- conversational
- agent
- athenea
pipeline_tag: text-generation
library_name: transformers
---
<h1 align="center">Athenea-4B-Thinking</h1>

**Athenea-4B-Thinking** is a fine-tuned version of [huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated), designed as a **general-purpose reasoning model** capable of handling mathematical, multilingual, and conversational reasoning tasks.
Trained on diverse, high-quality reasoning data with explicit `<think>` and `</think>` traces, this model represents the **core generalist** version of the Athenea family, intended as a foundation for specialized reasoning variants.
> ⚠️ **Important Note:** This model uses an *abliterated (uncensored)* base version, providing full expressive freedom and unrestricted output generation. Users are fully responsible for any use or content produced by the model. It is intended exclusively for research and experimentation purposes.
## 🎯 Model Description
Athenea-4B-Thinking leverages the structured reasoning framework of Huihui-Qwen3 and expands it across multiple domains and languages. It serves as a **multidomain reasoning model**, performing well in both conversational and analytical contexts.
Key features:
* **Step-by-step reasoning** within `<think>` blocks
* **General reasoning across math, language, and logic**
* **Multilingual understanding and response generation**
* **Uncensored reasoning output** for transparency
* **Improved logical consistency** through focused fine-tuning
* **Compatible with open inference frameworks** (Transformers, vLLM, etc.)
The model was fine-tuned using the dataset [Aquiles-ai/Athenea-40k](https://huggingface.co/datasets/Aquiles-ai/Athenea-40k).
> Note: Fine-tuning was performed using **Kronos**, Aquiles-ai’s proprietary enterprise fine-tuning system.
## 💻 Usage
### Installation
```bash
uv pip install transformers torch accelerate
```
### Basic Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and let accelerate place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)

# Without flash-attn:
# model = AutoModelForCausalLM.from_pretrained(
#     "Aquiles-ai/Athenea-4B-Thinking",
#     dtype="auto",
#     device_map="auto",
# )

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain to me in simple terms how reinforcement learning works."}
]

# Apply the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=8092,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print the output (prompt plus generated text)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
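Because the model writes its reasoning before the final answer, it can be useful to separate the two after decoding. The helper below is a minimal sketch, not part of the model's API: `split_reasoning` is a hypothetical name, and it assumes you decode only the newly generated tokens (for example `output[0][inputs["input_ids"].shape[-1]:]`). It also assumes the opening `<think>` tag may be missing from the generated text, since some chat templates place it in the prompt.

```python
def split_reasoning(text: str, close_tag: str = "</think>", open_tag: str = "<think>"):
    """Split decoded model output into (reasoning, answer).

    Hypothetical helper: everything before the first close_tag is treated
    as the reasoning trace, everything after it as the final answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # Assumption: with no closing tag, the whole text is returned as the answer.
        return "", text.strip()
    # The opening tag is optional; remove it once if it is present.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4.")
print(answer)
```

If generation stops before the closing tag (for instance when `max_new_tokens` is reached mid-reasoning), this sketch returns the partial trace as the answer, so you may want to handle that case differently.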
### Streaming Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)
tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain the difference between artificial intelligence, machine learning, and deep learning."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Create the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Build kwargs for generate
generate_kwargs = dict(
    **inputs,
    max_new_tokens=8092,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)

def _generate_thread(model, kwargs):
    # Run generation in a background thread so the main thread can consume the streamer
    with torch.no_grad():
        model.generate(**kwargs)

thread = Thread(target=_generate_thread, args=(model, generate_kwargs))
thread.start()

# Print text as it is produced, then wait for generation to finish
for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()
```
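When streaming, the reasoning trace arrives before the answer. If you only want to display the answer, one option is to wrap the stream in a small filter. This is a sketch under stated assumptions: `answer_only` is a hypothetical helper, and it assumes the `</think>` tag appears in the streamed text, which depends on how the tokenizer treats that token when `skip_special_tokens=True`. The tag may be split across chunks, so the filter keeps a short tail buffer.

```python
from typing import Iterable, Iterator


def answer_only(chunks: Iterable[str], close_tag: str = "</think>") -> Iterator[str]:
    """Yield only the text that follows the first close_tag in a chunked stream."""
    buffer = ""
    seen_close = False
    for chunk in chunks:
        if seen_close:
            # Past the reasoning trace: pass chunks through unchanged.
            yield chunk
            continue
        buffer += chunk
        idx = buffer.find(close_tag)
        if idx != -1:
            seen_close = True
            rest = buffer[idx + len(close_tag):]
            if rest:
                yield rest
            buffer = ""
        else:
            # Keep only the tail that could still be the start of a split tag.
            buffer = buffer[-(len(close_tag) - 1):]


# Simulated stream in which the closing tag is split across two chunks
demo = ["Let me ", "think.</th", "ink>\n\nThe ", "answer."]
print("".join(answer_only(demo)))
```

With the streamer above, the loop would become `for chunk in answer_only(streamer): ...`. If the closing tag never appears, nothing is yielded.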
### Production Deployment with vLLM
**Start server:**
```bash
vllm serve Aquiles-ai/Athenea-4B-Thinking \
--host 0.0.0.0 \
--port 8000 \
--api-key dummyapikey \
--max-model-len=16384 \
--async-scheduling \
--gpu-memory-utilization=0.90
```
**Request to the server from the OpenAI client:**
```python
from openai import OpenAI

# The API key must match the --api-key value passed to `vllm serve`
client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1")

stream = client.chat.completions.create(
    model="Aquiles-ai/Athenea-4B-Thinking",
    messages=[{
        "role": "user",
        "content": "Hey, tell me how a large language model like Llama or GPT is trained."
    }],
    max_tokens=8092,
    stream=True,
)

# Print each content delta as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
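The server above is started with `--max-model-len=16384` and the request asks for `max_tokens=8092`. Since the prompt and the generated tokens share the same context window, these two numbers bound how long the prompt can be. The arithmetic can be sketched as follows; `max_prompt_tokens` is a hypothetical helper, and the count includes the tokens added by the chat template.

```python
def max_prompt_tokens(max_model_len: int, max_tokens: int) -> int:
    """Largest prompt, in tokens, that still leaves room for max_tokens of output."""
    budget = max_model_len - max_tokens
    if budget <= 0:
        raise ValueError("max_tokens leaves no room for the prompt")
    return budget


# Values taken from the server command and client request above
print(max_prompt_tokens(16384, 8092))  # 8292
```

Long reasoning traces count against `max_tokens`, so lowering it leaves more room for the prompt but may cut the answer short.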
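**vLLM benefits:** an OpenAI-compatible API, continuous batching, async scheduling, and higher throughput than the plain Transformers loop (reported here as 20-30x faster; the actual gain depends on hardware, batch size, and sequence length).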
### Aquiles-playground
In addition to code usage, you can also try our models locally through an [open-source playground on GitHub](https://github.com/Aquiles-ai/aquiles-playground).

<p align="center">
Made with ❤️ by <a href="https://github.com/Aquiles-ai">Aquiles-ai</a>
</p>