---
license: apache-2.0
datasets:
- Aquiles-ai/Athenea-40k
language:
- en
- es
- de
- fr
- it
base_model:
- huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated
tags:
- math
- code
- merge
- qwen3
- uncensored
- conversational
- agent
- athenea
---
# Athenea-4B-Thinking

## 💻 Usage
### Installation
```bash
uv pip install transformers torch accelerate
```
### Basic Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)

# Without flash-attn:
# model = AutoModelForCausalLM.from_pretrained(
#     "Aquiles-ai/Athenea-4B-Thinking",
#     dtype="auto",
#     device_map="auto",
# )

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain to me in simple terms how reinforcement learning works."}
]

# Build the chat prompt and move the tensors to the device the model was placed on
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=8092,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print the output
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
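Since this is a thinking model, the decoded text usually contains the reasoning trace followed by the final answer. The sketch below shows one way to separate the two. It assumes the model closes its reasoning with a `</think>` tag, as the Qwen3 thinking models it derives from do; this card does not confirm that format, and `split_thinking` is a hypothetical helper, not part of `transformers`.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer) at the last closing tag.

    Assumption: the reasoning trace ends with `close_tag`.
    """
    reasoning, sep, answer = text.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop an opening tag if one is present in the decoded text.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


sample = "<think>\nAgents learn from rewards.\n</think>\n\nRL is trial and error."
reasoning, answer = split_thinking(sample)
print(answer)
```

If the decoded string still includes the prompt, as it does when the full `output[0]` is decoded, the prompt text ends up in the reasoning part; decoding only the newly generated tokens avoids this.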
### Streaming Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)
tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain the difference between artificial intelligence, machine learning, and deep learning."}
]

# Build the chat prompt and move the tensors to the device the model was placed on
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Create the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Build kwargs for generate
generate_kwargs = dict(
    **inputs,
    max_new_tokens=8092,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)


def _generate_thread(model, kwargs):
    with torch.no_grad():
        model.generate(**kwargs)


# Run generation in a background thread so the main thread can consume the stream
thread = Thread(target=_generate_thread, args=(model, generate_kwargs))
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)

# Wait for the generation thread to finish
thread.join()
```
### Production Deployment with vLLM
**Start server:**
```bash
vllm serve Aquiles-ai/Athenea-4B-Thinking \
--host 0.0.0.0 \
--port 8000 \
--api-key dummyapikey \
--max-model-len=16384 \
--async-scheduling \
--gpu-memory-utilization=0.90
```
**Query the server with the OpenAI client:**
```python
from openai import OpenAI

client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1")

stream = client.chat.completions.create(
    model="Aquiles-ai/Athenea-4B-Thinking",
    messages=[{
        "role": "user",
        "content": "Hey, tell me how a large language model like Llama or GPT is trained."
    }],
    max_tokens=8092,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
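The server above is started with `--max-model-len=16384`, so the prompt and the generated tokens share that budget. A minimal sketch of keeping `max_tokens` within it follows. `clamp_max_tokens` is a hypothetical helper, not part of vLLM or the OpenAI client, and the prompt token count is assumed to come from elsewhere, for example from the tokenizer.

```python
def clamp_max_tokens(prompt_tokens: int, requested: int, max_model_len: int = 16384) -> int:
    """Return the largest max_tokens value that still fits in the context window."""
    if prompt_tokens >= max_model_len:
        raise ValueError("prompt alone fills the context window")
    # Whatever the prompt does not use is left for generation.
    return min(requested, max_model_len - prompt_tokens)


print(clamp_max_tokens(prompt_tokens=12000, requested=8092))  # 4384
```

Short prompts leave the requested value unchanged; only long conversations get a reduced generation budget.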
**vLLM Benefits:** 20-30x faster inference, OpenAI-compatible API, continuous batching, async scheduling.