Instructions to use BSC-LT/ALIA-40b-fc-2605 with libraries and local apps.

Libraries

How to use BSC-LT/ALIA-40b-fc-2605 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="BSC-LT/ALIA-40b-fc-2605")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/ALIA-40b-fc-2605")
model = AutoModelForCausalLM.from_pretrained("BSC-LT/ALIA-40b-fc-2605")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use BSC-LT/ALIA-40b-fc-2605 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "BSC-LT/ALIA-40b-fc-2605"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BSC-LT/ALIA-40b-fc-2605",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use BSC-LT/ALIA-40b-fc-2605 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "BSC-LT/ALIA-40b-fc-2605" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BSC-LT/ALIA-40b-fc-2605",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "BSC-LT/ALIA-40b-fc-2605" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use BSC-LT/ALIA-40b-fc-2605 with Docker Model Runner:
```
docker model run hf.co/BSC-LT/ALIA-40b-fc-2605
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

WARNING: This model has been trained on instructions but has not undergone safety or value alignment.

Work In Progress: New versions will be released over the coming months.

ALIA-40b-fc Model Card

The ALIA-40b-fc-2605 model is a fine-tuned variant of a context-extended base ALIA-40b model, which was pre-trained from scratch on 9.83 trillion tokens of carefully curated data spanning 35 European languages (including code). This version is primarily optimized for robust, reliable function calling, while remaining capable of following user prompts and engaging in multi-turn dialogue.

In keeping with our commitment to open-source development, all tools and sources used to process and create the training data are open-licensed. For clarity, our definition of open-licensed excludes any source, tool, model, or dataset whose terms of use impose restrictive conditions that impede standard open reuse.

This model is released under the permissive Apache 2.0 license.

To visit the model cards of other model versions, please refer to the Model Index.

Model Details

Description

ALIA-40b is a transformer-based, decoder-only language model that was pre-trained from scratch on 9.37 trillion tokens of meticulously curated data. It subsequently underwent continued pretraining on an additional 424 billion high-quality tokens, and was further extended with a supplementary 39 billion tokens drawn from a similarly diverse mixture, totalling 9.83 trillion tokens.
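
The token budget quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the published figures; the variable names are illustrative and not part of any ALIA tooling.

```python
# Token counts quoted in the description, expressed in billions of tokens
pretraining = 9_370            # 9.37 trillion
continued_pretraining = 424    # 424 billion
context_extension = 39         # 39 billion

total_billion = pretraining + continued_pretraining + context_extension
print(f"{total_billion / 1000:.2f} trillion tokens")  # prints "9.83 trillion tokens"
```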

ALIA-40b-fc is a fine-tuned variant of ALIA-40b. In contrast to previous versions, its development process comprises only two consecutive stages, each targeting a specific capability: (1) long-context adaptation to extend the model’s context window, and (2) supervised fine-tuning to improve function calling capabilities. As a result, this checkpoint, unlike previous versions, has not yet undergone an alignment process.

After long-context adaptation, our post-training process consists of a supervised fine-tuning (SFT) stage to strengthen function calling and include conversational capabilities.

Although the base model is highly multilingual, the post-training process focused primarily on English due to the limited availability of high-quality datasets in other languages. Evaluation coverage outside English also remains limited. Future releases aim to further strengthen multilingual capabilities through the generation of high-quality synthetic data.

Hyperparameters

Here we list the specific hyperparameters used during the different training stages.

Long context CPT

Hyperparameter	Value
Learning rate	9e-7
LR Scheduler	Constant
Tokens per update	4M
Training tokens (4k → 32k)	2B
Training tokens (32k → 160k)	36.8B
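
As a rough illustration of what these numbers imply, the sketch below derives the number of optimizer updates per stage from the table. It assumes that "4M" means 4 × 10^6 tokens per update (it could also denote 4 × 2^20, which would change the counts slightly); the resulting step counts are derived here and are not reported in the model card.

```python
# Assumption: "4M" tokens per update is read as 4 x 10^6
TOKENS_PER_UPDATE = 4_000_000

# Training tokens per long-context stage, from the table above
stage_tokens = {
    "4k -> 32k": 2_000_000_000,
    "32k -> 160k": 36_800_000_000,
}

# Number of optimizer updates implied by each stage
updates = {stage: tokens // TOKENS_PER_UPDATE for stage, tokens in stage_tokens.items()}
total_tokens = sum(stage_tokens.values())

for stage, n in updates.items():
    print(f"{stage}: {n:,} updates")
print(f"total long-context tokens: {total_tokens / 1e9:.1f}B")
```

The total of 38.8B tokens is consistent with the roughly 39 billion supplementary tokens mentioned in the description.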

Supervised Fine-Tuning (SFT)

Hyperparameter	Value
Learning rate	5e-6
Batch size	256
Epochs	1
LR Scheduler	Cosine
Warmup Ratio	4%
Total Steps	5,687

Architecture

Attribute	Value
Total Parameters	40,433,885,184
Embedding Parameters	2,097,152,000
Layers	48
Hidden size	8,192
Attention heads	64
Context length	163,840
Vocabulary size	256,000
Precision	bfloat16
Position embedding	RoPE
Activation Function	SwiGLU
Layer normalization	RMS Norm
Flash attention	✅
Grouped Query Attention	✅
Num. query groups	8
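
The figures in this table can be cross-checked, and they give a back-of-the-envelope memory estimate for deployment. The sketch below is not taken from the model's configuration files: it assumes that the head dimension equals hidden size divided by attention heads, that the 8 query groups correspond to 8 key/value heads, and that both weights and KV cache are stored in bfloat16.

```python
# Values copied from the architecture table
total_params = 40_433_885_184
embedding_params = 2_097_152_000
layers = 48
hidden_size = 8_192
attention_heads = 64
query_groups = 8          # assumed to equal the number of key/value heads
context_length = 163_840
vocab_size = 256_000
bytes_per_value = 2       # bfloat16

# One embedding matrix of shape (vocab_size, hidden_size)
assert vocab_size * hidden_size == embedding_params

# Assumption: head dimension = hidden size / attention heads
head_dim = hidden_size // attention_heads
heads_per_group = attention_heads // query_groups

# Weights only, ignoring activations and framework overhead
weights_gib = total_params * bytes_per_value / 2**30

# Keys and values for every layer, per token
kv_bytes_per_token = 2 * layers * query_groups * head_dim * bytes_per_value
kv_gib_full_context = kv_bytes_per_token * context_length / 2**30

print(f"head_dim={head_dim}, query heads per KV group={heads_per_group}")
print(f"context length = {context_length // 1024}k tokens")
print(f"weights ~ {weights_gib:.1f} GiB in bfloat16")
print(f"KV cache ~ {kv_bytes_per_token // 1024} KiB per token, "
      f"{kv_gib_full_context:.0f} GiB for one full-length sequence")
```

Under these assumptions the estimate comes to roughly 75 GiB of weights plus about 30 GiB of KV cache for a single sequence at the full context length, which is why serving on multiple GPUs or with a reduced maximum length is likely to be needed in practice.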

Intended Use

Direct Use

ALIA-40b-fc is primarily optimized for robust and reliable function calling in tool-augmented and multi-turn conversational settings, while remaining capable of supporting other general-purpose language tasks. As with all models in the ALIA family, it is released openly to support both research and commercial use in any of the covered languages.

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

Hardware and Software

Training Framework

The post-training process was conducted in NeMo-RL, with minor modifications to adapt it to our infrastructure.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

4x NVIDIA Hopper GPUs with 64 GB HBM2 memory
2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz and 32c each (64 cores)
4x NDR200 (BW per node 800 Gb/s)
512 GB of main memory (DDR5)
460 GB of NVMe storage

The SFT stage was run across 8 nodes with a total of 32 GPUs.

How to use

The model can be used either directly in Python using the transformers library or deployed as a service and used through standard API calls.

The former gives the most control over the inference process, but it requires running the code on a machine with a GPU powerful enough to host the model locally, and it is more error-prone than the alternative. We therefore strongly recommend the latter: a model deployed as a service can run either locally or on a remote server, and it can serve multiple clients in parallel, among other advantages.

Unless you have very specific needs (e.g. for research) that require adapting the inference process, it is preferable to follow the "deployment as a service" guidelines below.

In any case, we recommend using a temperature setting close to zero (0.0–0.2) to achieve optimal performance.
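
One practical detail when applying this recommendation with the transformers library: to our knowledge, `generate` rejects a temperature of exactly 0 when sampling is enabled, so the lower end of the range is usually expressed as greedy decoding instead. The helper below is a minimal sketch of that mapping; `generation_kwargs` is a hypothetical name introduced here for illustration and is not part of transformers.

```python
def generation_kwargs(temperature: float, max_new_tokens: int = 1000) -> dict:
    """Map a temperature to keyword arguments for `model.generate` (hypothetical helper)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Temperature 0 is expressed as greedy decoding rather than sampling
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }

print(generation_kwargs(0.0))
print(generation_kwargs(0.1))
```

The returned dictionary can be unpacked into the call, for example `model.generate(input_ids=inputs, **generation_kwargs(0.1))`. When the model is served through an OpenAI-compatible API, the temperature is passed directly as a request parameter instead.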

Local inference with Python / transformers

The model utilizes the widely adopted ChatML template to structure conversational inputs and outputs. Using this standardized chat format ensures a consistent and enhanced conversational experience. The template can be easily applied through the tokenizer’s built-in functions, as illustrated in the example snippet below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BSC-LT/ALIA-40b-fc-2605"

text = "What is the weather like in Paris today?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

message = [ { "role": "user", "content": text } ]

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current temperature for a given location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City and country e.g. Bogotá, Colombia"
            }
        },
        "required": [
            "location"
        ],
        "additionalProperties": False
    }
}]

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=1000)

# Decode only the newly generated tokens so that the prompt is not echoed back
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Output:

<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris, France"}}
</tool_call>
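
When running the model locally, the tool call arrives as plain text and has to be parsed before it can be executed. The sketch below extracts every `<tool_call>` block and decodes its JSON payload. It assumes the decoded text has exactly the shape shown in the output above; `parse_tool_calls` is a hypothetical helper, not part of transformers or of the model's tooling.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Return the JSON payload of every well-formed tool call found in `text`."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole turn
            continue
    return calls

generated = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"location": "Paris, France"}}\n'
    "</tool_call>"
)

for call in parse_tool_calls(generated):
    print(call["name"], call["arguments"])
```

Text without any `<tool_call>` block yields an empty list, which can be treated as a normal assistant reply.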

Deployment as a service and remote use (Messages API)

Deploy the model using the vLLM Docker image:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 80:80 \
    vllm/vllm-openai:latest \
    --model BSC-LT/ALIA-40b-fc-2605 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max_model_len 8196 \
    --port 80

Once the deployment is running, interact with the model through the OpenAI-compatible API:

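from openai import OpenAI

# The port must match the one published by the vLLM container (-p 80:80)
client = OpenAI(
    base_url="http://localhost:80/v1/",
    api_key="hf_xxxx"
)

models = client.models.list()
model = models.data[0].id

# Tool definitions in the OpenAI Chat Completions format
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        }
    }
}]

system_message = ""
messages = [{"role": "system", "content": system_message}] if system_message else []
messages.append({"role": "user", "content": "What is the weather like in Paris today?"})
print(messages)
chat_completion = client.chat.completions.create(
    model=model,
    tools=tools,
    messages=messages,
    stream=False,
    max_tokens=1000,
    temperature=0.1,
    frequency_penalty=0.2,
)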

msg = chat_completion.choices[0].message

# --- HANDLE TOOL CALL OR NORMAL CONTENT ---

if not getattr(msg, "tool_calls", None):
    # Normal assistant message
    print(msg.content)

    messages.append({
        "role": "assistant",
        "content": msg.content
    })

else:
    # Assistant tool call message
    print(msg.tool_calls)

    messages.append({"role": "assistant", "tool_calls": msg.tool_calls})

    # --- Fake tool execution example ---
    tool_call = msg.tool_calls[0]
    # Example: handle the get_weather tool
    if tool_call.function.name == "get_weather":
        # Fake tool result (this would come from your actual backend)
        fake_tool_result = '{"temperature": 18, "unit": "C", "description": "Partly cloudy in Paris"}'

        # Append the tool result message so the model can use it in the next turn
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": fake_tool_result,
        })
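
The fake tool execution above can be generalized into a small dispatcher that maps tool names to Python functions. The following is a sketch under stated assumptions: `TOOL_REGISTRY`, `run_tool_call` and the `get_weather` stand-in are hypothetical names introduced for illustration, and the tool call object is only assumed to expose `id`, `function.name` and `function.arguments` (a JSON string), as the objects in `msg.tool_calls` do.

```python
import json
from types import SimpleNamespace

def get_weather(location: str) -> dict:
    """Hypothetical stand-in for a real weather backend."""
    return {"temperature": 18, "unit": "C", "description": f"Partly cloudy in {location}"}

# Maps the tool names advertised to the model to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call, registry=TOOL_REGISTRY) -> dict:
    """Execute one tool call and return the matching `tool` message."""
    name = tool_call.function.name
    if name not in registry:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            arguments = json.loads(tool_call.function.arguments or "{}")
            result = registry[name](**arguments)
        except (json.JSONDecodeError, TypeError) as exc:
            # Invalid JSON or arguments that do not match the function signature
            result = {"error": f"invalid arguments: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "name": name,
        "content": json.dumps(result),
    }

# Stand-in for an element of msg.tool_calls, so the sketch runs without a server
example_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris, France"}'),
)
tool_message = run_tool_call(example_call)
print(tool_message)
```

In the real flow, each element of `msg.tool_calls` would be passed through `run_tool_call`, the resulting messages appended to `messages`, and `client.chat.completions.create` called again with the same `model` and `tools` so that the model can turn the tool result into a final answer.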

Training Data

The dataset used in the supervised fine-tuning stage is built from a mixture of high-quality, permissively licensed datasets developed by third parties and synthetic data generated in-house using DeepSeek-V3-0324.

The table below provides a detailed breakdown of the datasets included in this mixture:

Dataset	Generation Method	License	Instances
nvidia/When2Call	Synthetic	cc-by-4.0	14,800
Salesforce/xlam-function-calling-60k	Synthetic	cc-by-4.0	59,800
glaiveai/glaive-function-calling-v2	Synthetic	apache-2.0	102,891
Team-ACE/ToolACE	Synthetic	apache-2.0	11,068
Agent-Ark/Toucan-1.5M	Synthetic	apache-2.0	119,079
allenai/Dolci-Instruct-SFT-Tool-Use-SA	Synthetic	cc-by-sa-4.0	1,369
In-house function calling data (synthetically generated)	Synthetic	apache-2.0	19,227
Instruction-tuning data (see ALIA-40b-instruct)	Mix	apache-2.0	399,800
Total			728,034

Note: Counts may differ slightly from the original datasets due to quality filtering (e.g., removal of poorly formatted or invalid samples) and because a small portion of each dataset was held out for validation purposes (total of 2,000 instances).
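
To see how the mixture is balanced between function-calling data and general instruction-tuning data, the instance counts in the table can be turned into shares. The sketch below uses only the numbers listed above; the grouping into "function calling" versus "instruction tuning" is our reading of the table rather than an official breakdown.

```python
# Instance counts copied from the table above
instances = {
    "nvidia/When2Call": 14_800,
    "Salesforce/xlam-function-calling-60k": 59_800,
    "glaiveai/glaive-function-calling-v2": 102_891,
    "Team-ACE/ToolACE": 11_068,
    "Agent-Ark/Toucan-1.5M": 119_079,
    "allenai/Dolci-Instruct-SFT-Tool-Use-SA": 1_369,
    "In-house function calling data": 19_227,
    "Instruction-tuning data": 399_800,
}

total = sum(instances.values())
instruction_share = instances["Instruction-tuning data"] / total
function_calling_share = 1 - instruction_share

# Per-dataset share of the mixture, largest first
for name, count in sorted(instances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<42} {count:>8,} {count / total:6.1%}")

print(f"total instances: {total:,}")
print(f"function calling: {function_calling_share:.1%}, "
      f"instruction tuning: {instruction_share:.1%}")
```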

Evaluation

The model’s function-calling (FC) capabilities were evaluated using the BFCL benchmark, which is widely regarded as a standard and comprehensive suite for assessing tool-use and function invocation performance in large language models.

Metric	Category	Score
Simple AST	Non-Live	71.0%
Multiple AST	Non-Live	94.5%
Parallel AST	Non-Live	80.5%
Parallel Multiple AST	Non-Live	81.5%
Simple AST	Live	74.8%
Multiple AST	Live	74.4%
Parallel AST	Live	56.3%
Parallel Multiple AST	Live	70.8%
Base	Multi-Turn	15.5%
Miss Func	Multi-Turn	2.0%
Miss Param	Multi-Turn	12.0%
Long Context	Multi-Turn	7.0%
Relevance Detection	Hallucination	81.3%
Irrelevance Detection	Hallucination	84.0%

Ethical Considerations and Limitations

The ALIA-40b-fc model is an instruction-tuned variant. It has several limitations that users should be aware of. Ongoing work is addressing these areas, including comprehensive evaluation of societal and cognitive biases as well as safety.

Functional Limitations:

Reasoning & Math: The model is not guaranteed to perform robust chain-of-thought reasoning or advanced mathematics. Complex logical puzzles or multi-step inferences may fail or produce inconsistent answers.
Code Generation: Although exposed to code during pretraining, ALIA-40b-fc is not a specialized code-generation model. It may produce code-like text, but outputs should be verified and tested before use in production codebases.
Agentic Capabilities: The model does not have agentic or autonomous action capabilities. It cannot act as an autonomous agent or execute multi-step workflows.

Recommendations:

Developers should implement additional safety filters, human oversight, targeted evaluation suites, and secondary evaluation models when deploying this model. Do not deploy ALIA-40b-fc in critical applications without extensive testing and mitigation. Users are responsible for assessing and mitigating harmful behavior or misinformation resulting from model outputs, and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

Additional information

Author

The Language Modeling team from the AI Institute at Barcelona Supercomputing Center.

Contact

For further information, please send an email to ai_institute_languagemodeling@bsc.es.

Copyright

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

This work has been promoted and supported by the Government of Catalonia through the Aina Project.

Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.

We are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria. Many other institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà. We thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to: Marcelo Sanchez, Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipe Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

Disclaimer

Be aware that the model may show biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}