---
license: apache-2.0
language:
- en
- de
- fr
- nl
- it
- es
- pt
- ru
- zh
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for double7/Qwen2.5-7B-MT-GRRM-Optimized-CLA

## Model Details

### Model Description

**double7/Qwen2.5-7B-MT-GRRM-Optimized-CLA** is a multilingual **Machine Translation (MT)** model post-trained with **Group Relative Policy Optimization (GRPO)** using [**GRRM** (*Group Relative Reward Model*)](https://huggingface.co/double7/Qwen2.5-7B-GRRM) as the reward provider. The training goal is to improve translation quality—especially on challenging, reasoning-intensive translation cases—by leveraging **groupwise, relative reward signals** (GQM) that provide fine-grained intra-group ranking feedback.

This model is the **CLA variant** of [Qwen2.5-7B-MT-GRRM-Optimized](https://huggingface.co/double7/Qwen2.5-7B-MT-GRRM-Optimized). Compared to the non-CLA model (trained on the standard multilingual TowerBlocks MT set as-is), this version enables **Cross-Lingual Augmentation (CLA)** during GRPO training by pairing source sentences with alternative target languages from the dataset (e.g., Zh→De), while keeping the overall training budget comparable. The architecture and reward model (GRRM) remain the same; the difference is only the GRPO training data construction.


The model is initialized from **Qwen2.5-7B**, then:
1. **Cold-started via SFT** on Chinese–English data (with LLM-annotated comparative reasoning / CoT-style supervision for translation with reasoning).
2. **Optimized via GRPO** on multilingual MT data (TowerBlocks, ~150k samples spanning 10 languages) using **GRRM** as the reference-free reward model, with **Cross-Lingual Augmentation (CLA)** enabled.

- **Model type:** Causal Language Model (Instruction-tuned / MT-oriented post-training)
- **Primary use:** Machine Translation (multilingual, En↔X / Zh↔En emphasized)
- **Language(s):** English, Portuguese, Spanish, French, German, Dutch, Italian, Russian, Chinese (and potentially other languages, but not guaranteed)
- **License:** Apache License 2.0
- **Finetuned from model:** Qwen2.5-7B

### Model Sources

- **Paper:** [GRRM: Group Relative Reward Modeling for Machine Translation](https://arxiv.org/abs/2602.14028)
- **Repository:** https://github.com/NJUNLP/GRRM
- **Reward model (used in training):** [Qwen2.5-7B-GRRM](https://huggingface.co/double7/Qwen2.5-7B-GRRM)


## Uses

### Direct Use

This model is intended for **translation-with-reasoning**, including:
- General-domain MT across multiple language pairs (e.g., En↔De/Fr/Es/Pt/It/Nl/Ru/Zh).
- Challenging MT scenarios where reasoning about ambiguity, localization, idioms, discourse coherence, or subtle adequacy issues is required.


## Input / Output Format

### Input format

Format the input as an instruction-style MT prompt with an explicit request for reasoning (see [the example below](#how-to-get-started-with-the-model)). Wrap the source text in a **fenced code block** to avoid formatting ambiguity and to support multi-line inputs.
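
As a minimal sketch, the prompt from the getting-started example can be generalized to other language pairs with a small helper. The function name `build_prompt` and its parameters are hypothetical and not part of the official code; only the prompt wording is taken from this card.

```python
FENCE = "`" * 3  # literal triple backtick used to fence the source text


def build_prompt(source_text: str, src_lang: str, tgt_lang: str) -> str:
    """Hypothetical helper: fill this card's example prompt for a language pair."""
    return (
        f"Translate the following text from {src_lang} into {tgt_lang}. "
        "Perform a step by step analysis and output the final translation "
        "in a code block.\n\n"
        "Source text:\n"
        f"{FENCE}\n{source_text.strip()}\n{FENCE}\n"
    )


prompt = build_prompt(
    "The grass is always greener on the other side.", "English", "Chinese"
)
```

The sketch assumes the source text does not itself contain a triple-backtick fence; such inputs would need a different delimiter or escaping.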

### Output format

The model typically outputs:

1. A **step-by-step analysis** section in Markdown, explaining idioms, ambiguity resolution, and translation choices.
2. The **final translation** wrapped in a **fenced code block**, as requested by the prompt.
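
Because the translation arrives inside a fenced block after the analysis, downstream code usually needs to extract it. The sketch below is one possible approach, not the official parsing script (which lives in the code repository): it assumes the last fenced block in the response holds the final translation, and `extract_translation` is a hypothetical name.

```python
import re
from typing import Optional

FENCE = "`" * 3  # literal triple backtick
# Opening fence (with optional info string), block body, closing fence.
_BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_translation(response: str) -> Optional[str]:
    """Return the body of the last fenced block, or None if there is none."""
    blocks = _BLOCK_RE.findall(response)
    if not blocks:
        return None
    return blocks[-1].strip()


sample = f"### Final Translation\n\n{FENCE}\n这山望着那山高。\n{FENCE}\n"
translation = extract_translation(sample)
```

Returning `None` instead of raising lets callers decide how to handle responses that were truncated before the final block was emitted.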


## Bias, Risks, and Limitations

- **Language and domain coverage:** Although trained with multilingual GRPO, quality can degrade for low-resource languages, specialized terminology, or very creative text.
- **Prompt sensitivity:** Output style may vary with prompting. For consistent outputs, standardize prompts.
- **Reward-model-induced bias:** Optimization uses GRRM (a learned judge). While GRRM improves ranking sensitivity in groups, it may still encode preferences (e.g., fluency/style) and occasionally mis-rank edge cases, which can influence the final MT model.


## How to Get Started with the Model

> [!NOTE]
> For high-throughput inference with vLLM and the official parsing script, please refer to [our code repository](https://github.com/NJUNLP/GRRM).


````python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "double7/Qwen2.5-7B-MT-GRRM-Optimized-CLA"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, device_map="auto", torch_dtype="auto"
)

prompt = """Translate the following text from English into Chinese. Perform a step by step analysis and output the final translation in a code block.

Source text:
```
The grass is always greener on the other side.
```
"""

messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

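# Tokenize the templated prompt and move the tensors to the model's device.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=False,  # greedy decoding for reproducible output
        max_new_tokens=8192,
    )

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))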
````

<details><summary>Output Example</summary>

````markdown
### Step-by-step Analysis

1.  **"The grass is always greener on the other side."**: This is a well-known English idiom. A literal, word-for-word translation would be unnatural in Chinese. The core meaning is that people often feel their current situation is not as good as they perceive another person's or another situation to be.

2.  **Identifying the Chinese equivalent**: The most common and idiomatic translation for this concept in Chinese is "这山望着那山高" (zhè shān wàngzhe nà shān gāo), which literally means "To look at the other mountain, thinking it's higher." This perfectly captures the essence of wanting what one doesn't have.

3.  **Exploring other options**:
    *   "彼岸的草总是更绿" (Bǐ'àn de cǎo zǒngshì gèng lǜ): A more literal, poetic translation, "The grass on the other shore is always greener," which is also understandable but not as common in everyday speech as the idiom.
    *   "隔岸观火" (gé àn guān huǒ): This idiom means "to watch a fire from the other bank," implying a passive, detached, or envious observation, but it doesn't fully capture the comparison between one's own situation and another's.

4.  **Final Selection**: "这山望着那山高" is the most idiomatic and widely used equivalent. A slightly more formal and written version is "人往高处走，水往低处流" (rén wǎng gāochù zǒu, shuǐ wǎng dīchù liú), meaning "People strive for higher ground, water flows to lower ground," which also conveys a similar message. However, "这山望着那山高" is the most direct and natural translation.

### Final Translation

```
这山望着那山高。
```
````

</details>


## Training Details

### Training Data

- **SFT cold-start (Zh–En):** Chinese–English subset of TowerBlocks, supervised with LLM-annotated reasoning and translation signals.
- **GRPO stage (multilingual):** TowerBlocks multilingual translation data covering 10 languages, about 150k training samples, expanded to 300k samples with CLA.
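
The following is a rough sketch of how CLA-style pairing could be constructed, assuming each original sample is kept and one copy is added with a different, randomly chosen target language (which would be consistent with the 150k to 300k expansion). The sampling strategy, the record fields, and the function name `augment_cross_lingual` are all assumptions for illustration; the actual construction is described in the paper and repository. No new reference translation is needed for the added pair because GRRM is reference-free.

```python
import random

# Language codes listed in this card's metadata (assumed candidate pool).
LANGS = ["en", "de", "fr", "nl", "it", "es", "pt", "ru", "zh"]


def augment_cross_lingual(samples, langs=LANGS, seed=0):
    """Hypothetical CLA sketch: keep each sample and add one with a new target language."""
    rng = random.Random(seed)
    augmented = []
    for sample in samples:
        augmented.append(sample)
        # Exclude the source language and the original target language.
        candidates = [
            lang for lang in langs
            if lang not in (sample["src_lang"], sample["tgt_lang"])
        ]
        augmented.append({**sample, "tgt_lang": rng.choice(candidates)})
    return augmented


data = [{"src_lang": "zh", "tgt_lang": "en", "source": "你好"}]
augmented = augment_cross_lingual(data)
```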

### Training Procedure (High-level)

1. **SFT (cold start):** Initialize translation and basic reasoning behaviors (Zh–En).
2. **GRPO w/ GRRM feedback:** Sample groups of candidate translations per source, score/rank them with GRRM (groupwise), compute advantages within-group, and update the policy to prefer better candidates—targeting improved reasoning ability and robustness on challenging MT cases.
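
To make the within-group advantage step concrete, here is a minimal sketch that follows the settings listed under Training Hyperparameters: rewards are rescaled to [0, 0.1] and advantages are mean-centered within the group, without dividing by the standard deviation. The raw score range (0 to 10) and the function names are assumptions for illustration only; the card does not state how raw GRRM scores are produced or scaled.

```python
from typing import List


def scale_rewards(raw_scores: List[float], raw_min: float, raw_max: float,
                  target_max: float = 0.1) -> List[float]:
    """Linearly map raw scores from [raw_min, raw_max] to [0, target_max]."""
    span = raw_max - raw_min
    return [(score - raw_min) / span * target_max for score in raw_scores]


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-center rewards within one group (no standard deviation normalization)."""
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]


# Hypothetical GRRM scores for 4 rollouts of one prompt, on an assumed 0-10 scale.
raw = [8.0, 6.0, 9.0, 5.0]
rewards = scale_rewards(raw, raw_min=0.0, raw_max=10.0)
advantages = group_advantages(rewards)
```

Candidates scored above the group mean receive positive advantages and the rest receive negative ones, so the update depends only on relative quality within the group.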

### Training Hyperparameters

- **Hardware:** 16 × NVIDIA A100 (80GB)

#### SFT (policy cold-start)
- **Epochs:** 1  
- **Global batch size:** 64  
- **LR scheduler:** cosine  
- **Peak learning rate:** 1e-5  
- **Warmup ratio:** 0.1  

#### Reinforcement Learning (policy optimization with GRRM)
- **RL algorithm:** GSPO (Group Sequence Policy Optimization), with additional stabilization enhancements (see paper appendix)
- **Epochs:** 1
- **Learning rate:** 1e-5  
- **LR scheduler:** constant
- **Rollouts per prompt:** 4
- **Length control:** max 4096 tokens, soft length penalty with overlong buffer = 2048 tokens
- **Total batch size:** 512
- **PPO mini-batch size:** 128
- **KL penalty:** disabled (no KL divergence penalty)
- **Advantage normalization:** standard deviation normalization removed
- **Reward scaling:** scaled to **[0, 0.1]** for stability
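
The card does not give the exact formula for the soft length penalty. One common formulation, sketched below as an assumption, is the DAPO-style soft overlong penalty: responses shorter than `max_len - buffer_len` are not penalized, and the penalty then grows linearly until it reaches its maximum at `max_len`. The `penalty_factor` parameter and the function name are hypothetical, and how the penalty is combined with the scaled reward is not specified here.

```python
def soft_length_penalty(response_len: int, max_len: int = 4096,
                        buffer_len: int = 2048,
                        penalty_factor: float = 1.0) -> float:
    """DAPO-style soft overlong penalty (assumed form, not confirmed by the card)."""
    threshold = max_len - buffer_len  # length at which the penalty starts
    if response_len <= threshold:
        return 0.0
    # Linear ramp inside the buffer, capped once the response reaches max_len.
    exceed = min(response_len, max_len) - threshold
    return -exceed / buffer_len * penalty_factor


penalties = [soft_length_penalty(n) for n in (1000, 2048, 3072, 4096)]
```

With the listed settings (max 4096 tokens, buffer 2048 tokens), this form would start penalizing at 2048 tokens and reach the full penalty at 4096 tokens.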


## Evaluation

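The table below reports MT performance on the WMT and Seed-X-Challenge benchmarks, using BLEURT-20 and LLM-as-a-Judge scores (judged by DeepSeek-R1-0528, shown as "R1"). Optimizing with GRRM via GRPO significantly improves the translation quality and reasoning capabilities of the base model. For example, on WMT En→Zh the R1 score rises from 76.98 for Qwen2.5-7B-SFT to 88.29 with GRPO and CLA.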

| Model | WMT Zh→En (BLEURT / R1) | WMT En→Zh (BLEURT / R1) | WMT En→X (BLEURT / R1) | Seed-X Zh→En (BLEURT / R1) | Seed-X En→Zh (BLEURT / R1) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *General LLMs* | | | | | |
| Gemini-2.5-Pro | 68.66 / 92.92 | 66.00 / 91.31 | 68.87 / 90.35 | 71.59 / 89.41 | 69.19 / 86.06 |
| DeepSeek-R1-0528 | 67.78 / 92.34 | 64.87 / 89.24 | 67.72 / 88.48 | 70.92 / 87.95 | 68.23 / 84.40 |
| Qwen2.5-7B-Instruct | 67.31 / 88.49 | 59.92 / 80.51 | 58.72 / 72.51 | 66.59 / 79.23 | 62.75 / 72.37 |
| *Specialized Models* | | | | | |
| TowerInstruct-13B | 67.56 / 84.83 | 62.92 / 77.63 | 66.61 / 82.68 | 63.32 / 69.54 | 63.46 / 71.17 |
| SeedX-PPO | **69.02** / 90.47 | **67.21** / 87.98 | **68.35** / **86.04** | 69.37 / 82.47 | **68.72** / 80.56 |
| SSR-X-Zero-7B | 68.30 / 88.67 | 66.12 / 83.78 | - / - | 68.84 / 81.15 | 67.08 / 77.56 |
| Qwen2.5-7B-SFT | 67.07 / 87.78 | 59.99 / 76.98 | 57.14 / 67.91 | 67.65 / 80.91 | 62.36 / 72.42 |
| **⭐+ GRPO** | 67.41 / **92.24** | 64.80 / 87.80 | 64.65 / 83.86 | **69.55** / 85.90 | 67.05 / 82.55 |
| **⭐+ GRPO w/ CLA** | 67.39 / 92.09 | 63.91 / **88.29** | 64.50 / 83.71 | 69.25 / **88.58** | 67.07 / **83.33** |


## Citation

```bibtex
@article{yang2026grrmgrouprelativereward,
      title={GRRM: Group Relative Reward Modeling for Machine Translation}, 
      author={Sen Yang and Shanbo Cheng and Lu Xu and Jianbing Zhang and Shujian Huang},
      year={2026},
      eprint={2602.14028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14028},
}
```