---
license: apache-2.0
datasets:
- Aquiles-ai/Athenea-40k
language:
- en
- es
- de
- fr
- it
base_model:
- huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated
tags:
- math
- code
- merge
- qwen3
- uncensored
- conversational
- agent
- athenea
---

# Athenea-4B-Thinking

![image](atheneamodel.png)

## 💻 Usage

### Installation

```bash
uv pip install transformers torch accelerate
```

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)

# Without flash-attn:
# model = AutoModelForCausalLM.from_pretrained(
#     "Aquiles-ai/Athenea-4B-Thinking",
#     dtype="auto",
#     device_map="auto",
# )

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain to me in simple terms how reinforcement learning works."}
]

# Apply the chat template and move the tensors to the device the model was loaded on
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=8092,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print the output (prompt included)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Streaming Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model = AutoModelForCausalLM.from_pretrained(
    "Aquiles-ai/Athenea-4B-Thinking",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn
)

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Thinking", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, explain the difference between artificial intelligence, machine learning, and deep learning."}
]

# Apply the chat template and move the tensors to the device the model was loaded on
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Create the streamer; skip_prompt=True yields only newly generated text
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Build kwargs for generate
generate_kwargs = dict(
    **inputs,
    max_new_tokens=8092,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)

def _generate_thread(model, kwargs):
    with torch.no_grad():
        model.generate(**kwargs)

# Run generation in a background thread so the main thread can consume the stream
thread = Thread(target=_generate_thread, args=(model, generate_kwargs))
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)

# Wait for the generation thread to finish
thread.join()
```

### Production Deployment with vLLM

**Start server:**

```bash
vllm serve Aquiles-ai/Athenea-4B-Thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90
```

**Request to the server from the OpenAI client:**

```python
from openai import OpenAI

client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1")

stream = client.chat.completions.create(
    model="Aquiles-ai/Athenea-4B-Thinking",
    messages=[{
        "role": "user",
        "content": "Hey, tell me how a large language model like Llama or GPT is trained."
    }],
    max_tokens=8092,
    stream=True
)

# Print each streamed delta as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

**vLLM Benefits:** 20-30x faster inference, OpenAI-compatible API, continuous batching, async scheduling.

### Aquiles-playground

In addition to using the model from code, you can try our models locally through an [open-source playground on GitHub](https://github.com/Aquiles-ai/aquiles-playground).

![preview](preview_playground.png)
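
### Separating reasoning from the final answer

Because Athenea-4B-Thinking is built on a Qwen3 thinking model, the text decoded in the inference examples above may contain a reasoning trace before the final answer. The sketch below shows one way to separate the two. It assumes the reasoning ends with a `</think>` tag, as in Qwen3 thinking models; this delimiter is not documented in this card, and `split_reasoning` is a hypothetical helper, not part of any library.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumption: the reasoning trace ends at the last occurrence of `end_tag`.
    If the tag is absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = text.rpartition(end_tag)
    if not sep:
        # No closing tag found: nothing to separate
        return "", text.strip()
    # Some chat templates emit the opening tag as well; drop it if present
    reasoning = reasoning.strip().removeprefix("<think>").strip()
    return reasoning, answer.strip()


# Example with a hard-coded string standing in for decoded model output
reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>\n\nThe answer is 4.")
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the basic inference example, the decoded string also contains the prompt, so it may be cleaner to decode only the newly generated tokens before calling the helper.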