# How to use SalamandraTA models
## Table of Contents
- [Introduction](#introduction)
- [Quick Start Guide - Full Models](#quick-start-guide---full-models)
- [Installation Requirements](#installation-requirements)
- [Loading the Model and Tokenizer in Python](#loading-the-model-and-tokenizer-in-python)
- [Preparing Your First Prompt](#preparing-your-first-prompt)
- [Generating a Translation (Inference)](#generating-a-translation-inference)
- [Quick Start Guide - Quantized Models](#quick-start-guide---quantized-models)
- [Installation Requirements](#installation-requirements-1)
- [Loading the Model and Tokenizer in Python](#loading-the-model-and-tokenizer-in-python-1)
- [Preparing Your First Prompt](#preparing-your-first-prompt-1)
- [Generating a Translation (Inference)](#generating-a-translation-inference-1)
- [Prompting Guide](#prompting-guide)
- [Multi-turn Prompts](#multi-turn-prompts)
- [Load the Model and Tokenizer](#load-the-model-and-tokenizer)
- [Define a Chat Helper Function](#define-a-chat-helper-function)
- [Example: Alternative Translations](#example-alternative-translations)
- [Notes on multi-turn interactions](#notes-on-multi-turn-interactions)
- [Decoding Strategies](#decoding-strategies)
- [Greedy Decoding (Default)](#greedy-decoding-default)
- [Beam Search](#beam-search)
- [Important Note on End Tokens](#important-note-on-end-tokens)
- [Diverse Beam Search](#diverse-beam-search)
- [Quality-Aware Decoding Strategies](#quality-aware-decoding-strategies)
- [Tuned Re-ranking](#tuned-re-ranking)
- [Minimum Bayes Risk (MBR) Decoding](#minimum-bayes-risk-mbr-decoding)
- [Compatibility Wrapper](#compatibility-wrapper)
## Introduction
**SalamandraTA** is a family of multilingual language models fine-tuned for translation and related language technology tasks. These instruction-tuned variants (available in 2B and 7B sizes) are part of the broader Salamandra model suite and are particularly strong in **Catalan, Spanish, and English**, with support for over 40 languages.
The models support sentence- and paragraph-level translation, automatic post-editing, paraphrasing, grammar correction, and multilingual NER. Both models are available in full precision and quantized formats, making them accessible for a wide range of hardware configurations.
This document serves as a **wiki and quick start guide** for developers and researchers who want to use SalamandraTA for translation and MT-related tasks. It includes introductory notes on prompting, decoding strategies, and key technical specifications of the model.
## Quick Start Guide - Full Models
This section walks you through everything from installing the necessary tools to running your first translation, regardless of experience level.
### Installation Requirements
Ensure you have the required packages installed:
```bash
pip install transformers accelerate sentencepiece
```
### Loading the Model and Tokenizer in Python
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_id = "BSC-LT/salamandraTA-7b-instruct" # Can also use "BSC-LT/salamandraTA-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto", # Automatically selects GPU/CPU (recommended for seamless use)
torch_dtype=torch.bfloat16 # Use bfloat16 for faster inference (ensure hardware support)
)
```
### Preparing Your First Prompt
SalamandraTA models use a specific prompt format (ChatML) to understand instructions properly. This is how you prepare a prompt to translate text from Spanish to English:
```python
# Define your task
# These variables will change depending on the task (translation, NER, grammar correction, etc.)
source = "Spanish"
target = "English"
sentence = (
"Ayer se fue, tomó sus cosas y se puso a navegar. Una camisa, un pantalón vaquero "
"y una canción, dónde irá, dónde irá. Se despidió, y decidió batirse en duelo con el mar. "
"Y recorrer el mundo en su velero. Y navegar, nai-na-na, navegar."
)
# Prepare the input prompt using the standard instruction format
# see the Prompting Guide for recommended prompts for different tasks
text = (
f"Translate the following text from {source} into {target}.\n"
f"{source}: {sentence}\n"
f"{target}:"
)
# Structure the prompt using ChatML format
message = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(
message,
tokenize=False, # Return plain string, not tokenized
add_generation_prompt=True # Add special token to signal model should begin generating
)
```
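For orientation, the sketch below renders a message list in the generic ChatML layout, so you can see roughly what `apply_chat_template` produces. `render_chatml` is a hypothetical helper written for this guide, not part of `transformers`, and the template bundled with the tokenizer may differ in details (for example, whitespace or an additional system turn). Treat the tokenizer's own template as authoritative for real inference.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the generic ChatML layout (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn signals that the model should start generating
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml([
    {
        "role": "user",
        "content": "Translate the following text from Spanish into English.\nSpanish: Hola\nEnglish:",
    }
])
print(example)
```

Printing `prompt` from the block above and comparing it with this layout is a quick way to check what the model actually receives.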
### Generating a Translation (Inference)
```python
inputs = tokenizer.encode(
prompt,
add_special_tokens=False, # Don't add special tokens beyond what the ChatML template includes
return_tensors="pt" # Return as PyTorch tensor
)
input_length = inputs.shape[1] # Used to trim the prompt portion from the output
eos_tokens = [ tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>") ]
outputs = model.generate(
input_ids=inputs.to(model.device), # Automatically move to CPU/GPU as needed
max_new_tokens=200, # Increase this for longer inputs (e.g., paragraph or doc-level translation)
eos_token_id=eos_tokens, # Use instruction-tuned stop token and pre-training stop token
num_beams=1, # 1 = greedy decoding (use num_beams > 1 to enable beam search)
pad_token_id=tokenizer.eos_token_id # Required for some models during beam search
)
translation = tokenizer.decode(
outputs[0, input_length:], # Skip the prompt when decoding
skip_special_tokens=True # Remove any lingering special tokens
)
print(translation)
```
## Quick Start Guide - Quantized Models
### Installation Requirements
For inference with quantized models (GGUF), ensure you have the required packages installed:
```bash
pip install huggingface_hub vllm torch
pip install llama-cpp-python
```
### Loading the Model and Tokenizer in Python
```python
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
# Download the quantized GGUF model from Hugging Face
model_dir = snapshot_download(
repo_id="BSC-LT/salamandraTA-7B-instruct-GGUF", # Replace with the 2B repo if needed
revision="main" # Optional: pin to a specific version
)
# Choose which GGUF file to load; q4 is generally recommended
model_name = "salamandrata_7b_inst_q4.gguf"
# Load the quantized model and tokenizer for inference
llm = LLM(
model=model_dir + '/' + model_name, # Path to GGUF file
tokenizer=model_dir # Tokenizer is automatically detected from snapshot
)
```
### Preparing Your First Prompt
SalamandraTA models use a specific prompt format (ChatML) to understand instructions properly. This is how you prepare a prompt to translate text from Spanish to English:
```python
# Define your task
# These variables will change depending on the task (translation, NER, grammar correction, etc.)
source = "Spanish"
target = "English"
sentence = (
"Ayer se fue, tomó sus cosas y se puso a navegar. Una camisa, un pantalón vaquero "
"y una canción, dónde irá, dónde irá. Se despidió, y decidió batirse en duelo con el mar. "
"Y recorrer el mundo en su velero. Y navegar, nai-na-na, navegar."
)
# Prepare the input prompt using the standard instruction format
# see the Prompting Guide for recommended prompts for different tasks
prompt = f"Translate the following text from {source} into {target}.\n{source}: {sentence}\n{target}:"
# llm.chat applies the model's chat template to this message list
messages = [{'role': 'user', 'content': prompt}]
```
### Generating a Translation (Inference)
```python
outputs = llm.chat(messages,
sampling_params=SamplingParams(
temperature=0, # Greedy decoding (no randomness)
stop_token_ids=[5], # Stop generation at special token ID (should correspond to <|im_end|>)
max_tokens=200) # Maximum number of new tokens to generate; increase for longer texts
)[0].outputs
print(outputs[0].text)
```
## Prompting Guide
The SalamandraTA models were instruction-tuned using a combination of custom prompts and publicly available instruction datasets.
- For **sentence-, paragraph- and document-level translation**, we used a single fixed English prompt, applied consistently across language pairs. The model's output is most robust and consistent on these tasks.
- For **other tasks**, including post-editing, grammar correction, paraphrasing, alternative translations, machine translation evaluation and context-aware translation, we incorporated instruction samples from the [**TowerBlocks** dataset (Alves et al., 2024)](https://arxiv.org/pdf/2402.17733), which provides high-quality, task-diverse instruction tuning data.
- TowerBlocks includes multiple prompt phrasings per task, often paraphrased or reworded in different ways. As a result, the model has **some tolerance for varied instructions** on these secondary tasks. However, performance may be less consistent than for translation tasks unless prompt phrasing is close to what the model saw during training.
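Because translation uses one fixed prompt, it is convenient to build it in a single place. The sketch below wraps the prompt wording from the Quick Start Guide in a small function; `build_translation_prompt` and `build_translation_message` are hypothetical helper names introduced here for illustration.

```python
def build_translation_prompt(source, target, text):
    """Build the fixed English translation prompt used in the Quick Start Guide."""
    return (
        f"Translate the following text from {source} into {target}.\n"
        f"{source}: {text}\n"
        f"{target}:"
    )


def build_translation_message(source, target, text):
    """Wrap the prompt in a single-turn message list ready for a chat template."""
    return [{"role": "user", "content": build_translation_prompt(source, target, text)}]


message = build_translation_message("Spanish", "Catalan", "Ayer se fue.")
print(message[0]["content"])
```

Keeping the wording identical across language pairs stays close to what the model saw during instruction tuning.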
## Multi-turn Prompts
Some tasks supported by the SalamandraTA models, such as Alternative Translations, require multi-turn interactions. These are scenarios where the model is prompted more than once in a conversation-like structure, and each prompt depends on the model’s previous response.
Unlike single-shot tasks, multi-turn interactions must be handled programmatically or within a notebook environment, where you can:
1. Send an initial prompt,
2. Capture the model’s response,
3. Send a second prompt that builds on that response,
4. Repeat as needed.
This conversational style mirrors the format used during instruction tuning and is important for eliciting coherent, contextual follow-up behavior.
### Load the Model and Tokenizer
To interact with the model in this way, first load the model and tokenizer as described in the Quick Start Guide above. Then define the helper below.
### Define a Chat Helper Function
```python
def chat(messages, max_new_tokens=200):
    # Convert the message list into a ChatML-style prompt.
    # Increase max_new_tokens when dealing with longer (paragraph/doc-level) inputs.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    input_len = inputs.shape[1]
    # Generate a response using greedy decoding and the ChatML-specific stop token
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>")
    )
    # Return just the newly generated text, skipping special tokens
    return tokenizer.decode(outputs[0, input_len:], skip_special_tokens=True)
```
### Example: Alternative Translations
```python
# First user message: Request a Catalan translation
messages = [
{
"role": "user",
"content": 'Given the English sentence "Nothing is achieved without effort.", generate a translation in Catalan.'
}
]
# Generate and print the first response
response_1 = chat(messages)
print("Response 1:", response_1)
#Response 1: Translation in Catalan as listed:
#Res s'aconsegueix sense esforç.
```
```python
# Add model's first reply to the chat history
messages.append({"role": "assistant", "content": response_1})
# Add a second user message requesting an alternative translation
messages.append({
"role": "user",
"content": 'That was a useful translation. For a broader understanding, please provide one additional translation of the English phrase "Nothing is achieved without effort." into Catalan.'
})
# Generate and print the second response
response_2 = chat(messages)
print("Response 2:", response_2)
#Response 2: Res s'aconsegueix sense treballar.
```
### Notes on multi-turn interactions
This structure can be extended to more than two turns by continuing to append `{"role": "assistant", "content": ...}` and `{"role": "user", "content": ...}` objects to the `messages` list. It can also be used with other tasks - for example, in Machine Translation Evaluation, you could follow up a request for the best translation with a request for the worst.
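The append-and-generate pattern can be wrapped in a small loop. The sketch below is one possible way to do it: `run_conversation` is a hypothetical helper, and `echo_chat` is a stand-in for the model so that the example runs without loading any weights. In practice you would pass the `chat` function defined above instead.

```python
def run_conversation(chat_fn, user_turns):
    """Send each user turn in order, feeding earlier replies back as history."""
    messages = []
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = chat_fn(messages)  # chat_fn sees the full history so far
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return messages, replies


def echo_chat(messages):
    # Stand-in for the real model: reports how many messages it received
    return f"reply {len(messages)}"


history, replies = run_conversation(
    echo_chat,
    ["Translate this sentence.", "Please provide one additional translation."],
)
print(replies)
```

The returned `history` alternates user and assistant roles, which is the same shape the `messages` list has in the example above.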
## Decoding Strategies
There are different decoding strategies that can be used with SalamandraTA models for generating translations. The strategy you choose can affect the fluency, accuracy, and consistency of translations or other task outputs.
For most use cases, we recommend one of the following:
### **Greedy Decoding** (Default)
---
Greedy decoding selects the most likely next token at each step, producing fast and predictable outputs. It is well-suited for most translation tasks where consistency and speed are priorities.
- **Use when**: You want fast, reliable translations; ideal for most MT and post-editing tasks.
```python
outputs = model.generate(
input_ids=inputs.to(model.device),
max_new_tokens=200,
num_beams=1, # 1 = greedy decoding
pad_token_id=tokenizer.eos_token_id
)
```
### **Beam Search**
Beam search keeps multiple candidate sequences at each step and chooses the most probable overall sequence after considering these options. This can improve fluency and quality, especially for longer or more complex text, but requires more computational resources and may sometimes produce less consistent end tokens.
- **Use when**: You want more polished outputs, are working with longer paragraphs, or are post-selecting among alternatives.
```python
outputs = model.generate(
input_ids=inputs.to(model.device),
max_new_tokens=200,
num_beams=5, # > 1 = beam search; 5 beams is generally a good balance between quality and cost
pad_token_id=tokenizer.eos_token_id
)
```
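To make the difference between greedy decoding and beam search concrete, here is a minimal sketch on a toy model. `TOY_MODEL` is a made-up table of next-token probabilities, and the search omits details of real implementations such as end tokens and length penalties; it only illustrates why keeping more than one candidate can find a more probable sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities conditioned on the prefix
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def beam_search(model, num_beams, steps):
    """Keep the num_beams most probable prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


# num_beams=1 is greedy decoding: only the single best prefix survives each step
greedy_seq, greedy_lp = beam_search(TOY_MODEL, num_beams=1, steps=2)
beam_seq, beam_lp = beam_search(TOY_MODEL, num_beams=2, steps=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))  # ('the', 'cat') 0.33
print(beam_seq, round(math.exp(beam_lp), 2))      # ('a', 'cat') 0.36
```

Greedy decoding commits to "the" because it is the most likely first token, while the two-beam search keeps "a" alive long enough to discover the more probable overall sequence.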
### **Important Note on End Tokens**
The instruction-tuned SalamandraTA models are designed to end outputs with `<|im_end|>`.
Greedy decoding reliably respects this. Beam search sometimes ends a hypothesis with the pre-training end-of-sequence token (`tokenizer.eos_token`) instead, depending on how likely `<|im_end|>` is under the beam hypotheses.
If needed, set the following to ensure clean stopping behavior:
```python
stop_sequence = '<|im_end|>'
eos_tokens = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids(stop_sequence),
]
```
Then pass `eos_tokens` to `generate`:
```python
outputs = model.generate(
input_ids=inputs.to(model.device),
max_new_tokens=4000,
early_stopping=True,
eos_token_id=eos_tokens,
pad_token_id=tokenizer.eos_token_id,
num_beams=5
)
```
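If you post-process token IDs yourself (for example, when inspecting raw beam outputs), a small helper can cut each sequence at the first end token. This is a sketch under stated assumptions: `trim_at_stop` is a hypothetical helper, and the IDs in the example are placeholders rather than real vocabulary IDs. With a real model you would pass the `eos_tokens` list built above.

```python
def trim_at_stop(token_ids, stop_ids):
    """Return token_ids up to, but excluding, the first stop token."""
    stop_ids = set(stop_ids)
    for index, token_id in enumerate(token_ids):
        if token_id in stop_ids:
            return token_ids[:index]
    return token_ids  # no stop token found: keep everything


# Placeholder IDs: 101 and 102 stand for content tokens, 5 and 2 for end tokens
generated = [101, 102, 5, 2, 2]
print(trim_at_stop(generated, stop_ids=[2, 5]))  # [101, 102]
```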
### Diverse Beam Search
Diverse beam search is a variation of beam search that encourages variety among the translation candidates. Instead of just finding the most probable sequences, it aims to find a set of high-probability sequences that also differ from one another. It does so by splitting the beams into groups and penalizing a group for selecting tokens that other groups have already selected at the same step.
- **Use when:** You want to generate multiple, distinct translation options for a single source sentence. This is useful for tasks like providing alternative translations or for creating a richer set of candidates for a subsequent quality-aware decoding step.
```python
# Use num_beam_groups and diversity_penalty to enable diverse beam search
# This generates 5 beams split into 5 groups (one beam per group); num_beams must be divisible by num_beam_groups
# The diversity penalty discourages a group from picking tokens already picked by other groups at the same step.
outputs = model.generate(
input_ids=inputs.to(model.device),
max_new_tokens=200,
num_beams=5,
num_beam_groups=5,
diversity_penalty=1.0,
pad_token_id=tokenizer.eos_token_id
)
```
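The effect of `diversity_penalty` can be illustrated with a toy version of the Hamming-style penalty, as we understand it: at each step, a token's score for the current group is reduced by the penalty times the number of earlier groups that already chose that token. `apply_hamming_penalty` and the scores below are made up for illustration and are not the library's implementation.

```python
from collections import Counter


def apply_hamming_penalty(scores, tokens_from_previous_groups, diversity_penalty):
    """Lower the score of tokens already chosen by earlier groups at this step."""
    counts = Counter(tokens_from_previous_groups)
    return {
        token: score - diversity_penalty * counts[token]
        for token, score in scores.items()
    }


# Made-up log-scores for the next token in the current group
scores = {"boat": -0.2, "ship": -0.9, "sail": -1.5}
# An earlier group already picked "boat" at this step
penalized = apply_hamming_penalty(scores, ["boat"], diversity_penalty=1.0)
best = max(penalized, key=penalized.get)
print(best)  # ship
```

Without the penalty the current group would also pick "boat"; with it, the group moves to the next-best token, which is what pushes the groups apart.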
## **Quality-Aware Decoding Strategies**
While greedy decoding and beam search are effective, they optimize for model likelihood (what the model considers the most probable translation) rather than for translation quality directly. This can lead to the "beam search curse", where the highest-probability translation is not the best one and may read unnaturally. Quality-aware strategies address this by scoring candidates with a dedicated quality metric. To use them with SalamandraTA, install two libraries: `mbrs` and COMET (the `unbabel-comet` package).
```bash
pip install mbrs
pip install unbabel-comet
```
Next, we need a set of candidate translations (hypotheses). We generate them with diverse beam search, because standard beam search often produces very similar sequences across beams. Diverse beam search counteracts this by penalizing a beam group for choosing tokens already chosen by other groups at the same time step. The example reuses the `source`, `target`, `tokenizer`, and `model` objects defined in the Quick Start Guide.
```python
sentence = "Ayer se fue, tomó sus cosas y se puso a navegar"
text = (
f"Translate the following text from {source} into {target}.\n"
f"{source}: {sentence}\n"
f"{target}:"
)
message = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False,return_tensors="pt")
input_length = inputs.shape[1] # Used to trim the prompt portion from the output
eos_tokens = [ tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>") ]
# We will generate 20 candidate translations
num_hypotheses = 20
outputs = model.generate(
input_ids=inputs.to(model.device),
max_new_tokens=200,
num_beams=num_hypotheses,
num_return_sequences=num_hypotheses,
num_beam_groups=5,
diversity_penalty=1.0,
pad_token_id=tokenizer.eos_token_id,
eos_token_id=eos_tokens # Stop on either the pre-training or the instruction-tuned end token
)
# Decode all hypotheses into a list of strings
hypotheses = [tokenizer.decode(output[input_length:], skip_special_tokens=True).strip().replace('\n', ' ') for output in outputs]
```
### Tuned Re-ranking
Tuned re-ranking is a two-step process:
1. **Generation:** First, use a decoding method like beam search or diverse beam search to generate a list of multiple translation candidates (an "N-best list").
2. **Re-ranking:** Next, use a second, specialized model—a "re-ranker"—to score each candidate in the N-best list. This re-ranker is often a Quality Estimation (QE) model trained specifically to predict the quality of a translation without needing a human reference. The translation with the highest quality score is then selected as the final output.
- **Use when**: You want to prioritize translation quality over speed. A dedicated QE model can often pick a better translation from the N-best list than the one the main model would have chosen based on probability alone.
```python
from mbrs.metrics import MetricCOMETkiwi
from mbrs.decoders import DecoderRerank
# 1. Initialize the quality metric you want to use for re-ranking.
# Here, we use COMET-kiwi, a state-of-the-art metric for quality estimation
metric_cfg = MetricCOMETkiwi.Config(model="Unbabel/wmt22-cometkiwi-da")
metric = MetricCOMETkiwi(metric_cfg)
# 2. Initialize the re-rank decoder.
decoder_cfg = DecoderRerank.Config()
decoder = DecoderRerank(decoder_cfg, metric)
# 3. Decode! The 'decode' method takes your list of hypotheses
# and re-ranks them according to the selected metric
result = decoder.decode(hypotheses, source=sentence, nbest=1)
best_hyp = result.sentence[0]
# Print the result selected by Comet-kiwi re-ranking
print(f"Selected sentence: {best_hyp}")
```
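Stripped of the library details, the re-ranking step is an argmax over the N-best list under some scoring function. The sketch below shows that idea with a deliberately crude scorer. `rerank` and `toy_quality` are hypothetical names, and the word-count heuristic is only a stand-in for a real QE model such as the one used above.

```python
def rerank(hypotheses, score_fn, nbest=1):
    """Return the nbest hypotheses with the highest score_fn value."""
    return sorted(hypotheses, key=score_fn, reverse=True)[:nbest]


source_sentence = "Ayer se fue"


def toy_quality(hypothesis):
    # Stand-in for a QE model: prefers hypotheses whose word count is close
    # to that of the source. A real scorer would model meaning, not length.
    return -abs(len(hypothesis.split()) - len(source_sentence.split()))


candidates = [
    "Left",
    "He left yesterday",
    "Yesterday he left and took all of his things",
]
best = rerank(candidates, toy_quality)[0]
print(best)  # He left yesterday
```

Swapping `toy_quality` for a learned metric changes the scores but not the structure: generate candidates first, then select by quality.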
### Minimum Bayes Risk (MBR) Decoding
Minimum Bayes Risk (MBR) decoding is a powerful alternative that directly optimizes for output quality. Instead of selecting the single most probable hypothesis (as in standard beam search), MBR selects the hypothesis that is most similar, on average, to all other plausible hypotheses.
The core idea is to find the candidate translation that has the highest *expected utility* (i.e., the highest average quality score) when compared against a set of other high-quality potential translations, known as "pseudo-references". This approach is effective at avoiding the common problems of standard decoding by favoring consensus and quality over raw probability.
```python
from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR
# 1. Initialize the quality metric you want to use for the utility function.
# Here, we use COMET, a neural reference-based translation quality metric;
# the candidate translations themselves serve as pseudo-references
metric_cfg = MetricCOMET.Config(model="Unbabel/wmt22-comet-da")
metric = MetricCOMET(metric_cfg)
# 2. Initialize the MBR decoder.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)
# 3. Decode! The 'decode' method takes your list of hypotheses and also uses
# them as the pseudo-references to calculate the expected utility scores.
result = decoder.decode(hypotheses, hypotheses, source=sentence, nbest=1)
best_hyp = result.sentence[0]
# Print the result selected by MBR decoding
print(f"Selected sentence: {best_hyp}")
```
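The selection rule behind MBR can be written in a few lines. The sketch below is a simplified illustration, not the `mbrs` implementation: `token_f1` is a toy bag-of-words utility standing in for COMET, and `mbr_select` is a hypothetical helper that, like the call above, uses the full candidate list as pseudo-references.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Toy utility: bag-of-words F1 between two strings."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, utility):
    """Pick the hypothesis with the highest average utility against all candidates."""
    def expected_utility(hypothesis):
        return sum(utility(hypothesis, ref) for ref in hypotheses) / len(hypotheses)
    return max(hypotheses, key=expected_utility)


candidates = [
    "he left yesterday",
    "he went away yesterday",
    "he left yesterday evening",
    "she sang a song",
]
print(mbr_select(candidates, token_f1))  # he left yesterday
```

The outlier "she sang a song" shares no words with the others and scores lowest, while the candidate closest to the consensus wins.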
## **Compatibility Wrapper**
To integrate SalamandraTA as a drop-in replacement for translation services such as Google Translate or DeepL, you can use the wrapper provided at [**https://github.com/langtech-bsc/mt-wrapper**](https://github.com/langtech-bsc/mt-wrapper). This service accepts incoming requests in Google Translate or DeepL format and converts them into appropriately formatted requests to a SalamandraTA endpoint.
The wrapper service can be deployed locally or on any hosting platform with minimal resource requirements.