# How to use SalamandraTA models

<aside>

📌 Location of our published models:

[🤗 SalamandraTA-7B - Full model](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct)

→ [7B Quantized version](https://huggingface.co/BSC-LT/salamandraTA-7B-instruct-GGUF)

[🤗 SalamandraTA-2B - Full model](https://huggingface.co/BSC-LT/salamandraTA-2b-instruct)

→ [2B Quantized version](https://huggingface.co/BSC-LT/salamandraTA-2B-instruct-GGUF)

</aside>

## Table of Contents

- [Introduction](#introduction)
- [Quick Start Guide - Full Models](#quick-start-guide---full-models)
  - [Installation Requirements](#installation-requirements)
  - [Loading the Model and Tokenizer in Python](#loading-the-model-and-tokenizer-in-python)
  - [Preparing Your First Prompt](#preparing-your-first-prompt)
  - [Generating a Translation (Inference)](#generating-a-translation-inference)
- [Quick Start Guide - Quantized Models](#quick-start-guide---quantized-models)
  - [Installation Requirements](#installation-requirements-1)
  - [Loading the Model and Tokenizer in Python](#loading-the-model-and-tokenizer-in-python-1)
  - [Preparing Your First Prompt](#preparing-your-first-prompt-1)
  - [Generating a Translation (Inference)](#generating-a-translation-inference-1)
- [Prompting Guide](#prompting-guide)
- [Multi-turn Prompts](#multi-turn-prompts)
  - [Load the Model and Tokenizer](#load-the-model-and-tokenizer)
  - [Define a Chat Helper Function](#define-a-chat-helper-function)
  - [Example: Alternative Translations](#example-alternative-translations)
  - [Notes on multi-turn interactions](#notes-on-multi-turn-interactions)
- [Decoding Strategies](#decoding-strategies)
  - [Greedy Decoding (Default)](#greedy-decoding-default)
  - [Beam Search](#beam-search)
  - [Important Note on End Tokens](#important-note-on-end-tokens)
  - [Diverse Beam Search](#diverse-beam-search)
- [Quality-Aware Decoding Strategies](#quality-aware-decoding-strategies)
  - [Tuned Re-ranking](#tuned-re-ranking)
  - [Minimum Bayes Risk (MBR) Decoding](#minimum-bayes-risk-mbr-decoding)
- [Compatibility Wrapper](#compatibility-wrapper)

## Introduction

**SalamandraTA** is a family of multilingual language models fine-tuned for translation and related language technology tasks. These instruction-tuned variants (available in 2B and 7B sizes) are part of the broader Salamandra model suite and are particularly strong in **Catalan, Spanish, and English**, with support for over 40 languages.

The models support sentence- and paragraph-level translation, automatic post-editing, paraphrasing, grammar correction, and multilingual NER. Both models are available in full precision and quantized formats, making them accessible for a wide range of hardware configurations.

This document serves as a **wiki and quick start guide** for developers and researchers who want to use SalamandraTA for translation and MT-related tasks. It includes introductory notes on prompting, decoding strategies, and key technical specifications of the model.

<aside>

📎 **Overview:**

- **Languages supported:** 40 languages (and 3 varieties)
    
    Arabic, Aragonese, Asturian, Basque, Bulgarian, Catalan (and Catalan-Valencian variety), Chinese (simplified), Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Maltese, Norwegian (Bokmål and Nynorsk varieties), Occitan (and Aranese variety), Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Welsh
    
- **Tasks supported:** 9 translation and/or MT-related tasks
    
    - Sentence-level translation
    - Paragraph-level translation
    - Document-level translation
    - Automatic post-editing
    - Grammar checking
    - Machine translation evaluation
    - Alternative translations
    - Named-entity recognition
    - Context-aware translation
    
- **Architectures:**
    - salamandraTA-2b-instruct
        
        
        | **Feature** | **Value** |
        | --- | --- |
        | Total parameters | 2,253,490,176 |
        | Embedding parameters | 524,288,000 |
        | Layers | 24 |
        | Hidden size | 2,048 |
        | Attention heads | 16 |
        | Context length | 8,192 |
        | Vocabulary size | 256,000 |
        | Precision | bfloat16 |
        | Embedding type | RoPE |
        | Activation function | SwiGLU |
        | Layer normalization | RMS Norm |
        | Flash attention | ✅ |
        | Grouped query attention | ❌ |
        | Num. query groups | N/A |
    - salamandraTA-7b-instruct
        
        
        | **Feature** | **Value** |
        | --- | --- |
        | Total parameters | 7,768,117,248 |
        | Embedding parameters | 1,048,576,000 |
        | Layers | 32 |
        | Hidden size | 4,096 |
        | Attention heads | 32 |
        | Context length | 8,192 |
        | Vocabulary size | 256,000 |
        | Precision | bfloat16 |
        | Embedding type | RoPE |
        | Activation function | SwiGLU |
        | Layer normalization | RMS Norm |
        | Flash attention | ✅ |
        | Grouped query attention | ✅ |
        | Num. query groups | 8 |
- **Resource considerations:**
    - Full models
        - salamandraTA-2b-instruct
            
            Loading the SalamandraTA-2b model requires approximately **4.5 GB of GPU memory** just for the weights. Additional memory is needed for intermediate computations, activations, and buffers.
            
            For stable inference and to avoid out-of-memory errors, a GPU with at least **12 GB of VRAM** is recommended. Please see specifics below.
            
            | **Resource** | **Requirement** | **Notes** |
            | --- | --- | --- |
            | Weights (bf16) | ~4.5 GB | From 2.25B params |
            | KV cache (8192 ctx) | ~1.6 GB | Grows linearly with context length |
            | Total minimal | ~6.1 GB | Weights + KV only |
            | **Practical VRAM** | **≥12 GB** |  |
            | FLOPs/token | ~4.5B | ≈2 × params |
        - salamandraTA-7b-instruct
            
            Loading the SalamandraTA-7b model requires approximately **15.5 GB of GPU memory** just for the weights. Additional memory is needed for intermediate computations, activations, and buffers.
            
            For stable inference and to avoid out-of-memory errors, a GPU with at least **24 GB of VRAM** is recommended. Please see specifics below.
            
            | **Resource** | **Requirement** | **Notes** |
            | --- | --- | --- |
            | Weights (bf16) | ~15.5 GB | From 7.77B params |
            | KV cache (8192 ctx) | ~4.3 GB | Grows linearly with context length |
            | Total minimal | ~19.8 GB | Weights + KV only |
            | **Practical VRAM** | **≥24 GB** |  |
            | FLOPs/token | ~15.5B | ≈3.45 × heavier than 2B |
    - Quantized models
        - salamandraTA-2b-instruct-GGUF
            
            Loading the quantized SalamandraTA-2b model requires approximately **2.7 GB of GPU memory** just for the weights. Additional memory is needed for intermediate computations, activations, and buffers.
            
            For stable inference and to avoid out-of-memory errors, a GPU with at least **8 GB of VRAM** is recommended. Please see specifics below.
            
            | **Resource** | **Requirement** | **Notes** |
            | --- | --- | --- |
            | Weights (Q4, CPU) | ~2.7-3 GB | From 2.25B params quantized |
            | KV cache (8192 ctx, FP16) | ~1.6 GB | Grows linearly with context length |
            | Total minimal | ~4.3 GB | Weights + KV only |
            | **Practical VRAM** | **≥8 GB** |  |
            | FLOPs/token | ~2.2B | ≈params × quantization overhead |
        - salamandraTA-7b-instruct-GGUF
            
            Loading the quantized SalamandraTA-7b model requires approximately **8.6 GB of GPU memory** just for the weights. Additional memory is needed for intermediate computations, activations, and buffers.
            
            For stable inference and to avoid out-of-memory errors, a GPU with at least **16 GB of VRAM** is recommended. Please see specifics below.
            
            | **Resource** | **Requirement** | **Notes** |
            | --- | --- | --- |
            | Weights (Q4, CPU) | ~8.6 GB | From 6.7B params quantized |
            | KV cache (8192 ctx, FP16) | ~5 GB | Grows linearly with context length |
            | Total minimal | ~13.6 GB | Weights + KV only |
            | **Practical VRAM** | **≥16 GB** |  |
            | FLOPs/token | ~6.7B | ≈params × quantization overhead |
</aside>
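
The weight and KV-cache figures in the tables above can be approximated from the architecture values. The sketch below is a back-of-the-envelope estimate rather than a measurement. The helper names are illustrative, and it assumes 2 bytes per value (bf16/fp16) and one key tensor plus one value tensor per layer.

```python
def weights_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(layers: int, hidden_size: int, attn_heads: int,
                kv_heads: int, context_len: int = 8192,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size for one sequence at full context, in GB."""
    head_dim = hidden_size // attn_heads
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9


# 2B model: no grouped query attention, so kv_heads == attn_heads.
print(f"2B weights:  {weights_gb(2_253_490_176):.1f} GB")
print(f"2B KV cache: {kv_cache_gb(24, 2048, 16, 16):.1f} GB")

# 7B model: counting all 32 heads vs. the 8 query groups listed in the table.
print(f"7B weights:  {weights_gb(7_768_117_248):.1f} GB")
print(f"7B KV cache (32 KV heads): {kv_cache_gb(32, 4096, 32, 32):.1f} GB")
print(f"7B KV cache (8 KV heads):  {kv_cache_gb(32, 4096, 32, 8):.1f} GB")
```

Under these assumptions the 2B estimates come out at roughly 4.5 GB and 1.6 GB. For the 7B model, the ~4.3 GB KV-cache figure matches the formula when all 32 heads are counted; with 8 KV heads the same formula gives about 1.1 GB, so the table value can be read as a conservative upper bound.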

## Quick Start Guide - Full Models

This section takes you from installing the necessary tools to running your first translation task, regardless of your experience level.

### Installation Requirements

Ensure you have the required packages installed:

```bash
pip install transformers accelerate sentencepiece
```

### Loading the Model and Tokenizer in Python

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_id = "BSC-LT/salamandraTA-7b-instruct"  # Can also use "BSC-LT/salamandraTA-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # Automatically selects GPU/CPU (recommended for seamless use)
    torch_dtype=torch.bfloat16   # Use bfloat16 for faster inference (ensure hardware support)
)
```

### Preparing Your First Prompt

SalamandraTA models use a specific prompt format (ChatML) to understand instructions properly. This is how you prepare a prompt to translate text from Spanish to English:

```python
# Define your task
# These variables will change depending on the task (translation, NER, grammar correction, etc.)
source = "Spanish"
target = "English"
sentence = (
    "Ayer se fue, tomó sus cosas y se puso a navegar. Una camisa, un pantalón vaquero "
    "y una canción, dónde irá, dónde irá. Se despidió, y decidió batirse en duelo con el mar. "
    "Y recorrer el mundo en su velero. Y navegar, nai-na-na, navegar."
)
# Prepare the input prompt using the standard instruction format 
# see the Prompting Guide for recommended prompts for different tasks
text = (
    f"Translate the following text from {source} into {target}.\n"
    f"{source}: {sentence}\n"
    f"{target}:"
)
    
# Structure the prompt using ChatML format
message = [{"role": "user", "content": text}]

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,              # Return plain string, not tokenized
    add_generation_prompt=True   # Add special token to signal model should begin generating
)

```

### Generating a Translation (Inference)

```python
inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,   # Don't add special tokens beyond what the ChatML template includes
    return_tensors="pt"         # Return as PyTorch tensor
)

input_length = inputs.shape[1]  # Used to trim the prompt portion from the output

eos_tokens = [ tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>") ]

outputs = model.generate(
    input_ids=inputs.to(model.device),  # Automatically move to CPU/GPU as needed
    max_new_tokens=200,                 # Increase this for longer inputs (e.g., paragraph or doc-level translation)
    eos_token_id=eos_tokens,  # Use instruction-tuned stop token and pre-training stop token
    num_beams=1,                        # 1 = greedy decoding (use num_beams > 1 to enable beam search)

    pad_token_id=tokenizer.eos_token_id  # Required for some models during beam search
)

translation = tokenizer.decode(
    outputs[0, input_length:],          # Skip the prompt when decoding
    skip_special_tokens=True            # Remove any lingering special tokens
)

print(translation)
```
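
`max_new_tokens=200` is enough for short inputs, but longer inputs need a larger budget. One possible heuristic is to scale the budget with the prompt length while staying inside the 8,192-token context window. The helper name, ratio, and floor below are assumptions chosen for illustration, not values recommended for the model.

```python
def budget_new_tokens(input_length: int, ratio: float = 1.5,
                      floor: int = 200, context_length: int = 8192) -> int:
    """Pick a max_new_tokens value from the prompt length (heuristic)."""
    remaining = context_length - input_length
    if remaining <= 0:
        raise ValueError("Prompt already fills the context window.")
    # Scale with the prompt, but never go below the floor...
    wanted = max(floor, int(input_length * ratio))
    # ...and never exceed what is left of the context window.
    return min(wanted, remaining)


print(budget_new_tokens(60))    # short prompt: the floor applies
print(budget_new_tokens(1000))  # longer prompt: the budget scales up
```

It could then be passed to generation as `max_new_tokens=budget_new_tokens(input_length)`.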

## Quick Start Guide - Quantized Models

### Installation Requirements

For inference with quantized models (GGUF), ensure you have the required packages installed:

```bash
pip install huggingface_hub vllm torch
pip install llama-cpp-python
```

### Loading the Model and Tokenizer in Python

```python
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

# Download the quantized GGUF model from Hugging Face
model_dir = snapshot_download(
    repo_id="BSC-LT/salamandraTA-7B-instruct-GGUF",  # Replace with the 2B repo if needed
    revision="main"  # Optional: pin to a specific version
)
# Choose which GGUF file to load; q4 is generally recommended
model_name = "salamandrata_7b_inst_q4.gguf"

# Load the quantized model and tokenizer for inference
llm = LLM(
    model=model_dir + '/' + model_name,  # Path to GGUF file
    tokenizer=model_dir                  # Tokenizer is automatically detected from snapshot
)
```
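
If you prefer not to hard-code the GGUF file name, the snapshot directory can be searched for it. This is a minimal sketch using only the standard library; `find_gguf` is a hypothetical helper, and it assumes the quantization level appears in the file name, as in `salamandrata_7b_inst_q4.gguf`.

```python
from pathlib import Path


def find_gguf(model_dir: str, quant: str = "q4") -> str:
    """Return the path of the single .gguf file whose name contains `quant`."""
    matches = sorted(
        p for p in Path(model_dir).glob("*.gguf") if quant in p.name.lower()
    )
    if len(matches) != 1:
        names = [p.name for p in matches]
        raise FileNotFoundError(
            f"Expected exactly one '{quant}' GGUF file in {model_dir}, found {names}"
        )
    return str(matches[0])
```

The result can be passed as the `model` argument, for example `LLM(model=find_gguf(model_dir), tokenizer=model_dir)`.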

### Preparing Your First Prompt

SalamandraTA models use a specific prompt format (ChatML) to understand instructions properly. This is how you prepare a prompt to translate text from Spanish to English:

```python
# Define your task
# These variables will change depending on the task (translation, NER, grammar correction, etc.)
source = "Spanish"
target = "English"
sentence = (
    "Ayer se fue, tomó sus cosas y se puso a navegar. Una camisa, un pantalón vaquero "
    "y una canción, dónde irá, dónde irá. Se despidió, y decidió batirse en duelo con el mar. "
    "Y recorrer el mundo en su velero. Y navegar, nai-na-na, navegar."
)
# Prepare the input prompt using the standard instruction format
# (see the Prompting Guide for recommended prompts for different tasks)
prompt = f"Translate the following text from {source} into {target}.\n{source}: {sentence}\n{target}:"
messages = [{"role": "user", "content": prompt}]
```

### Generating a Translation (Inference)

```python
outputs = llm.chat(
    messages,
    sampling_params=SamplingParams(
        temperature=0,       # Greedy decoding (no randomness)
        stop_token_ids=[5],  # Stop generation at special token ID (should correspond to <|im_end|>)
        max_tokens=200,      # Maximum number of new tokens to generate; increase for longer texts
    ),
)[0].outputs

print(outputs[0].text)
```
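
The hard-coded `stop_token_ids=[5]` assumes that `<|im_end|>` has ID 5 in the tokenizer. A more defensive option is to look the ID up. The sketch below works with any tokenizer object that exposes `convert_tokens_to_ids` (as Hugging Face tokenizers do); `resolve_stop_ids` is an illustrative helper, not part of vLLM or transformers.

```python
def resolve_stop_ids(tokenizer, stop_token: str = "<|im_end|>") -> list:
    """Look up the stop token ID and pair it with the tokenizer's EOS ID."""
    stop_id = tokenizer.convert_tokens_to_ids(stop_token)
    unk_id = getattr(tokenizer, "unk_token_id", None)
    # Unknown tokens typically map to None or to the unknown-token ID.
    if stop_id is None or stop_id == unk_id:
        raise ValueError(f"{stop_token!r} is not in the tokenizer vocabulary")
    ids = [stop_id]
    eos_id = getattr(tokenizer, "eos_token_id", None)
    if eos_id is not None and eos_id not in ids:
        ids.append(eos_id)
    return ids
```

With vLLM, the tokenizer can usually be obtained through `llm.get_tokenizer()`, so `stop_token_ids=resolve_stop_ids(llm.get_tokenizer())` could replace the literal `[5]`.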

## Prompting Guide

The SalamandraTA models were instruction-tuned using a combination of custom prompts and publicly available instruction datasets.

- For **sentence-, paragraph- and document-level translation**, we used a single fixed English prompt consistently applied across language pairs. These tasks are the most robust and consistent across the model's output.
- For **other tasks**, including post-editing, grammar correction, paraphrasing, alternative translations, machine translation evaluation and context-aware translation, we incorporated instruction samples from the [**TowerBlocks** dataset (Alves et al., 2024)](https://arxiv.org/pdf/2402.17733), which provides high-quality, task-diverse instruction tuning data.
- TowerBlocks includes multiple prompt phrasings per task, often paraphrased or reworded in different ways. As a result, the model has **some tolerance for varied instructions** on these secondary tasks. However, performance may be less consistent than for translation tasks unless prompt phrasing is close to what the model saw during training.

<aside>
🤖

Below we provide examples of recommended prompt structures for each supported task:

- **Translation**
    
    ```
    Translate the following text from {src} into {tgt}.
    {src}: {context}
    {tgt}: 
    ```
    
- **Post-editing**
    
    ```
    Please fix any mistakes in the following {source}-{target} machine translation 
    or keep it unedited if it's correct.
    Source: {source_sentence}
    MT: {machine_translation}
    Corrected:
    ```
    
- **Context-aware Translation**
    
    ```
    What is the {target} translation of the sentence below?
    {sentence}
    Use the following context to help your translation:
    {context}
    Translation: 
    ```
    
- **Paraphrasing**
    
    ```
    Rewrite the following sentence in {target} without changing its meaning:
    Source: {sentence}
    Paraphrase:
    ```
    
- **Grammar Correction**
    
    ```
    Please fix any mistakes in the following {source} sentence or keep it unedited if it's correct.
    Sentence: {sentence}
    Corrected:
    ```
    
- **Named Entity Recognition (NER)**
    
    ```
    Analyse the following tokenized text and mark the tokens containing named entities.
    Use the following annotation guidelines with these tags for named entities:
    - ORG (Refers to named groups or organizations)
    - PER (Refers to individual people or named groups of people)
    - LOC (Refers to physical places or natural landmarks)
    - MISC (Refers to entities that don't fit into standard categories).
    Prepend B- to the first token of a given entity and I- to the remaining ones if they exist.
    If a token is not a named entity, label it as O.
    Input: {list of words in a sentence}
    Marked: 
    ```
    
- **Machine Translation Evaluation**
    
    ```
    Source: {sentence}
    Pick the best English translation from the following pool of candidates:
    1. {translation_1}
    2. {translation_2}
    3. {translation_3}
    4. {translation_4}
    Choose the translation that best conveys the meaning of the original source by indicating its corresponding number.
    ```
    
</aside>
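
To keep prompt wording close to the recommended structures, the templates can be stored once and filled programmatically. The sketch below covers three of the tasks listed above; `PROMPT_TEMPLATES` and `build_messages` are illustrative names, and the `{text}` placeholder stands in for the sentence or passage to process.

```python
PROMPT_TEMPLATES = {
    "translation": (
        "Translate the following text from {source} into {target}.\n"
        "{source}: {text}\n"
        "{target}:"
    ),
    "grammar_correction": (
        "Please fix any mistakes in the following {source} sentence "
        "or keep it unedited if it's correct.\n"
        "Sentence: {text}\n"
        "Corrected:"
    ),
    "paraphrasing": (
        "Rewrite the following sentence in {target} without changing its meaning:\n"
        "Source: {text}\n"
        "Paraphrase:"
    ),
}


def build_messages(task: str, **fields: str) -> list:
    """Fill a task template and wrap it as a single-turn chat message list."""
    content = PROMPT_TEMPLATES[task].format(**fields)
    return [{"role": "user", "content": content}]


messages = build_messages(
    "translation", source="Spanish", target="English", text="Ayer se fue."
)
print(messages[0]["content"])
```

The resulting `messages` list has the same shape as the one passed to `tokenizer.apply_chat_template` or `llm.chat` in the Quick Start sections.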

## Multi-turn Prompts

Some tasks supported by the SalamandraTA models, such as Alternative Translations, require multi-turn interactions. These are scenarios where the model is prompted more than once in a conversation-like structure, and each prompt depends on the model’s previous response.

Unlike single-shot tasks, multi-turn interactions must be handled programmatically or within a notebook environment, where you can:

1. Send an initial prompt,
2. Capture the model’s response,
3. Send a second prompt that builds on that response,
4. Repeat as needed.

This conversational style mirrors the format used during instruction tuning and is important for eliciting coherent, contextual follow-up behavior.

### Load the Model and Tokenizer

To interact with the model in this way, first load the model and tokenizer as described in the Quick Start Guide above. Then continue with the steps below.

### Define a Chat Helper Function

```python
def chat(messages, max_new_tokens=200):
    # Convert the message list into a ChatML-style prompt
    # increase max_new_tokens if dealing with longer (paragraph/doc-level) inputs.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    input_len = inputs.shape[1]
    
    # Generate a response using greedy decoding and ChatML-specific stop token
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>")
    )
    
    # Return just the newly generated text, skipping special tokens
    return tokenizer.decode(outputs[0, input_len:], skip_special_tokens=True)

```

### Example: Alternative Translations

```python
# First user message: Request a Catalan translation
messages = [
    {
        "role": "user",
        "content": 'Given the English sentence "Nothing is achieved without effort.", generate a translation in Catalan.'
    }
]

# Generate and print the first response
response_1 = chat(messages)
print("Response 1:", response_1)
#Response 1: Translation in Catalan as listed:
#Res s'aconsegueix sense esforç.
```

```python
# Add model's first reply to the chat history
messages.append({"role": "assistant", "content": response_1})

# Add a second user message requesting an alternative translation
messages.append({
    "role": "user",
    "content": 'That was a useful translation. For a broader understanding, please provide one additional translation of the English phrase "Nothing is achieved without effort." into Catalan.'
})

# Generate and print the second response
response_2 = chat(messages)
print("Response 2:", response_2)
#Response 2: Res s'aconsegueix sense treballar.
```

### Notes on multi-turn interactions

This structure can be extended to more than two turns by continuing to append `{"role": "assistant", "content": ...}` and `{"role": "user", "content": ...}` objects to the `messages` list. It can also be used with other tasks - for example, in Machine Translation Evaluation, you could follow up a request for the best translation with a request for the worst. 
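
The append-and-resend pattern can be wrapped in a small loop. This is a minimal sketch: `run_conversation` is a hypothetical helper, and `chat_fn` is assumed to behave like the `chat` function defined above (it takes the message list and returns the reply as a string).

```python
from typing import Callable


def run_conversation(chat_fn: Callable[[list], str], user_turns: list) -> list:
    """Send each user turn in order, feeding earlier replies back as history."""
    messages = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = chat_fn(messages)
        # Store the reply so the next turn sees the full conversation.
        messages.append({"role": "assistant", "content": reply})
    return messages
```

For the alternative-translations example, `run_conversation(chat, [first_prompt, second_prompt])` would return the full history, with the model's replies at the odd positions of the list.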

## Decoding Strategies

There are different decoding strategies that can be used with SalamandraTA models for generating translations. The strategy you choose can affect the fluency, accuracy, and consistency of translations or other task outputs.

For most use cases, we recommend one of the following:

### **Greedy Decoding** (Default)

---

Greedy decoding selects the most likely next token at each step, producing fast and predictable outputs. It is well-suited for most translation tasks where consistency and speed are priorities.

- **Use when**: You want fast, reliable translations. This is ideal for most MT and post-editing tasks.

```python
outputs = model.generate(
    input_ids=inputs.to(model.device), 
    max_new_tokens=200,                 
    num_beams=1,                        # 1 = greedy decoding
    pad_token_id=tokenizer.eos_token_id  
)
```

### **Beam Search**

Beam search keeps multiple candidate sequences at each step and chooses the most probable overall sequence after considering these options. This can improve fluency and quality, especially for longer or more complex text, but requires more computational resources and may sometimes produce less consistent end tokens.

- **Use when**: You want more polished outputs, are working with longer paragraphs, or are post-selecting among alternatives.

```python

outputs = model.generate(
    input_ids=inputs.to(model.device), 
    max_new_tokens=200,                 
    num_beams=5,                        # > 1 = beam search; 5 beams is generally a good balance between quality and cost
    pad_token_id=tokenizer.eos_token_id  
) 
```
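
To see why beam search can find a more probable sequence than greedy decoding, consider the toy example below. It does not use the SalamandraTA model: `TOY_MODEL` is a made-up table of next-token probabilities, and both functions are simplified sketches of the two strategies for fixed-length outputs.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def greedy(model, steps):
    """Pick the single most likely token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp


def beam_search(model, steps, num_beams):
    """Keep the num_beams most likely partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


for name, (seq, logp) in [
    ("greedy", greedy(TOY_MODEL, steps=2)),
    ("beam", beam_search(TOY_MODEL, steps=2, num_beams=2)),
]:
    print(f"{name}: {seq} with probability {math.exp(logp):.2f}")
```

Greedy decoding commits to `A` (0.6) and ends with a sequence of probability 0.30, while beam search keeps `B` alive and finds `B z` with probability 0.36. With `num_beams=1` the two functions return the same sequence.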

### **Important Note on End Tokens**

The instruction-tuned SalamandraTA models are designed to end outputs with `<|im_end|>`.

Greedy decoding reliably respects this. Beam search sometimes prefers `</s>`, depending on how likely `<|im_end|>` is under the beam hypotheses.

If needed, set the following to ensure clean stopping behavior:

```python
stop_sequence = '<|im_end|>'
eos_tokens = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids(stop_sequence),
    ]
```

In the generation, it can then be applied as in the following:

```python
outputs = model.generate(
    input_ids=inputs.to(model.device),
    max_new_tokens=4000,
    early_stopping=True,
    eos_token_id=eos_tokens,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=5
)
```

### Diverse Beam Search

Diverse beam search is a variation of beam search that encourages variety among the translation candidates. Instead of just finding the most probable sequences, it aims to find a set of high-probability sequences that are also different from one another. This is achieved by penalizing hypotheses that are too similar to others already in the beam.

- **Use when:** You want to generate multiple, distinct translation options for a single source sentence. This is useful for tasks like providing alternative translations or for creating a richer set of candidates for a subsequent quality-aware decoding step.

```python
# Use num_beam_groups and diversity_penalty to enable diverse beam search.
# This generates 5 beams divided into 5 groups (one beam per group).
# The diversity penalty discourages a group from picking tokens that other groups already chose at the same step.

outputs = model.generate(
    input_ids=inputs.to(model.device), 
    max_new_tokens=200,             
    num_beams=5,                    
    num_beam_groups=5,              
    diversity_penalty=1.0,          
    pad_token_id=tokenizer.eos_token_id  
)
```
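
Diverse beam search only works for certain parameter combinations. The sketch below checks them before calling `generate`. The checks reflect constraints that, to the best of our knowledge, `transformers` applies to group beam search; exact behavior may vary between versions, and `check_diverse_beam_config` is a hypothetical helper.

```python
def check_diverse_beam_config(num_beams: int, num_beam_groups: int,
                              diversity_penalty: float,
                              num_return_sequences: int = 1) -> None:
    """Raise ValueError for combinations that group beam search rejects."""
    if num_beam_groups < 1 or num_beams % num_beam_groups != 0:
        raise ValueError("num_beams must be divisible by num_beam_groups")
    if num_beam_groups > 1 and diversity_penalty <= 0.0:
        raise ValueError("diversity_penalty should be > 0, or groups will be identical")
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")


# The settings used above pass the checks.
check_diverse_beam_config(num_beams=5, num_beam_groups=5, diversity_penalty=1.0)
```

Setting `num_return_sequences` to the same value as `num_beams` returns every beam, which is useful when the candidates feed a later selection step.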

## **Quality-Aware Decoding Strategies**

While greedy decoding and beam search are effective, they prioritize statistical likelihood (what the model considers the most probable translation) over actual translation quality. This can lead to the "beam search curse", where the highest-probability translations are not the best in terms of quality and sometimes read unnaturally. Quality-aware strategies address this by incorporating a more direct measure of quality into the decoding process. To use quality-aware decoding strategies with SalamandraTA, install two libraries: `mbrs` and `unbabel-comet` (COMET).

```bash
pip install mbrs
pip install unbabel-comet
```

Next, we need a set of candidate translations. We use diverse beam search to generate the hypotheses, because traditional beam search often produces very similar sequences across beams. Diverse beam search addresses this by penalizing beam groups that generate tokens already chosen by other groups at the same time step.

```python
sentence =  "Ayer se fue, tomó sus cosas y se puso a navegar"
text = (
    f"Translate the following text from {source} into {target}.\n"
    f"{source}: {sentence}\n"
    f"{target}:"
)
message = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False,return_tensors="pt")
input_length = inputs.shape[1]  # Used to trim the prompt portion from the output
eos_tokens = [ tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>") ]

# We will generate 20 candidate translations
num_hypotheses = 20
outputs = model.generate(
    input_ids=inputs.to(model.device),
    max_new_tokens=200,
    num_beams=num_hypotheses,
    num_return_sequences=num_hypotheses,
    num_beam_groups=5,              
    diversity_penalty=1.0,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=eos_tokens  # Stop on either </s> or <|im_end|> (beam search may end with </s>)
)

# Decode all hypotheses into a list of strings
hypotheses = [ tokenizer.decode(output[input_length:], skip_special_tokens=True).strip().replace('\n', '') for output in outputs]
```

### Tuned Re-ranking

Tuned re-ranking is a two-step process:

1. **Generation:** First, use a decoding method like beam search or diverse beam search to generate a list of multiple translation candidates (an "N-best list").
2. **Re-ranking:** Next, use a second, specialized model—a "re-ranker"—to score each candidate in the N-best list. This re-ranker is often a Quality Estimation (QE) model trained specifically to predict the quality of a translation without needing a human reference. The translation with the highest quality score is then selected as the final output.

- **Use when**: You want to achieve the highest possible translation quality. By using a dedicated QE model for the final selection, you can often find a better translation in the N-best list than the main model would have chosen based on probability alone.

```python
from mbrs.metrics import MetricCOMETkiwi
from mbrs.decoders import DecoderRerank

# 1. Initialize the quality metric you want to use for re-ranking.
#    Here, we use COMET-kiwi, a state-of-the-art metric for quality estimation
metric_cfg = MetricCOMETkiwi.Config(model="Unbabel/wmt22-cometkiwi-da")
metric = MetricCOMETkiwi(metric_cfg)

# 2. Initialize the re-rank decoder.
decoder_cfg = DecoderRerank.Config()        
decoder = DecoderRerank(decoder_cfg, metric)

# 3. Decode! The 'decode' method takes your list of hypotheses 
# and re-ranks them according to the selected metric
result = decoder.decode(hypotheses, source=sentence, nbest=1)
best_hyp = result.sentence[0]

# Print the result selected by Comet-kiwi re-ranking
print(f"Selected sentence: {best_hyp}")
```
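
Independently of `mbrs`, the selection step itself is short. The sketch below shows it with a pluggable scoring function. `rerank` and `toy_score` are illustrative: the toy scorer only compares word counts and stands in for a real QE model such as COMET-kiwi, which would score the source and hypothesis with a neural network.

```python
from typing import Callable


def rerank(source: str, hypotheses: list,
           score_fn: Callable[[str, str], float]) -> list:
    """Score each candidate against the source; return (hyp, score), best first."""
    scored = [(hyp, score_fn(source, hyp)) for hyp in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(source: str, hyp: str) -> float:
    """Placeholder scorer: prefers hypotheses whose length is close to the source."""
    return -abs(len(hyp.split()) - len(source.split()))


source = "Ayer se fue, tomó sus cosas y se puso a navegar"
candidates = [
    "He left.",
    "Yesterday he left, took his things and started sailing.",
    "He went away yesterday, he took his things, and then he set out to sail.",
]
best, score = rerank(source, candidates, toy_score)[0]
print(best, score)
```

Replacing `toy_score` with a call to a QE model turns this into the re-ranking procedure described above.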

### Minimum Bayes Risk (MBR) Decoding

Minimum Bayes Risk (MBR) decoding is a powerful alternative that directly optimizes for output quality. Instead of selecting the single most probable hypothesis (as in standard beam search), MBR selects the hypothesis that is most similar, on average, to all other plausible hypotheses.

The core idea is to find the candidate translation that has the highest *expected utility* (i.e., the highest average quality score) when compared against a set of other high-quality potential translations, known as "pseudo-references". This approach is effective at avoiding the common problems of standard decoding by favoring consensus and quality over raw probability.

```python
from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR

# 1. Initialize the quality metric you want to use for the utility function.
#    Here, we use COMET, a neural reference-based metric for
#    translation quality evaluation
metric_cfg = MetricCOMET.Config(model="Unbabel/wmt22-comet-da")
metric = MetricCOMET(metric_cfg)

# 2. Initialize the MBR decoder.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)

# 3. Decode! The 'decode' method takes your list of hypotheses and also uses
#    them as the pseudo-references to calculate the expected utility scores.
result = decoder.decode(hypotheses, hypotheses, source=sentence, nbest=1)
best_hyp = result.sentence[0]

# Print the result selected by MBR decoding
print(f"Selected sentence: {best_hyp}")
```
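
The expected-utility idea can also be illustrated without any neural metric. In the sketch below, word-level Jaccard overlap is used as a stand-in utility function (an assumption for illustration; COMET would be used in practice), and each hypothesis is scored by its average utility against all the others. `jaccard` and `mbr_select` are hypothetical helper names.

```python
def jaccard(hyp: str, ref: str) -> float:
    """Toy utility: word-set overlap between a hypothesis and a pseudo-reference."""
    a, b = set(hyp.lower().split()), set(ref.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mbr_select(hypotheses: list, utility=jaccard) -> str:
    """Return the hypothesis with the highest mean utility against the others."""
    if len(hypotheses) == 1:
        return hypotheses[0]

    def expected_utility(i: int) -> float:
        others = [ref for j, ref in enumerate(hypotheses) if j != i]
        return sum(utility(hypotheses[i], ref) for ref in others) / len(others)

    best = max(range(len(hypotheses)), key=expected_utility)
    return hypotheses[best]


candidates = [
    "he left yesterday",
    "he left yesterday morning",
    "he left",
    "she arrived today",
]
print(mbr_select(candidates))
```

Here the first candidate is selected because it overlaps most with the rest of the pool, while the outlier `she arrived today` shares no words with any other candidate and scores lowest.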

## **Compatibility Wrapper**

In order to integrate SalamandraTA as a drop-in replacement for translation services such as Google Translate or DeepL you can use the wrapper provided at [**https://github.com/langtech-bsc/mt-wrapper**](https://github.com/langtech-bsc/mt-wrapper). This service accepts incoming requests in Google Translate or DeepL format and translates them to appropriately formatted requests to a SalamandraTA endpoint.

The wrapper service can be deployed locally or on any hosting platform with minimal resource requirements.