rosettia-quy — Spanish → Chanka/Ayacucho Quechua (GSPO-NLLB)
A LoRA adapter for facebook/nllb-200-1.3B for Spanish → Chanka / Ayacucho
Quechua (quy_Latn), trained with GSPO reinforcement learning (Group Sequence
Policy Optimization) on top of a synthetic-augmented supervised model.
To our knowledge this is the first application of GSPO to an encoder–decoder NMT model. It is the strongest result we are aware of on the AmericasNLP 2021 spa→quy benchmark — but please read the Limitations section: this is a research-grade system, evaluated on a single benchmark with single-reference ChrF and no native-speaker evaluation.
Results — AmericasNLP 2021 spa→quy test
ChrF (sacrebleu, word_order=0), 1003 sentences, single reference. The test set was
never used for training, tuning, or checkpoint selection.
| System | ChrF (w0) |
|---|---|
| Sheffield 2023 (NLLB-3.3B, 3-model ensemble) | 34.01 |
| Helsinki 2021 (prior task winner) | 39.40 |
| Qwen-9B (ours), greedy | 40.55 |
| NLLB-1.3B (ours), supervised + synthetic | 42.95 |
| Qwen-9B ⊕ NLLB-1.3B ensemble (ours) | 45.01 |
| + GSPO RL, single model, beam5 | 45.53 |
| + MBR decoding, single model | 46.43 |
| GSPO multi-checkpoint MBR ensemble (best) | 46.71 |
GSPO adds +2.6 ChrF over the supervised model (42.95 → 45.53), and the best single 1.3B model (46.43) scores at least as high as our previous, much larger 9B + 1.3B ensemble (45.01): a simpler, smaller system at equal or better quality.
Which number is the "clean" one (please read). The fully pre-registered result is the validation-selected checkpoint (ckpt-600) with our standard decode (beam5 + apostrophe-suppression): 45.53 ChrF, a single test evaluation. The MBR (46.43) and multi-checkpoint-ensemble (46.71) numbers involve decode-time choices — the MBR sampling temperature and which checkpoints to ensemble — that we compared on the test set. They are therefore best-found configurations, not blind single evaluations, and the ~0.3–0.6 ChrF spread among them is within the noise of a 1003-sentence single-reference test. The robust, conservative claim is ≈46 ChrF, clearly above prior published work; the exact decimal of the MBR/ensemble rows is configuration-dependent. (The GSPO checkpoint was selected on a held-out validation split, never on test.)
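A note on comparability. Many shared-task papers report ChrF++ (word_order=2); BSC-2024, the 2024 task winner, reported 38.21 ChrF++, for example. ChrF++ also scores word n-grams and is not interchangeable with the word_order=0 ChrF used here: on our own systems it reads about 5.5 to 6 points lower, not higher (compare the ChrF and ChrF++ rows under Standard MT metrics). Cross-metric comparisons should be made with care; all of our numbers above are word_order=0.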
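To make the word_order setting concrete, here is a minimal sentence-level sketch of a chrF-style score. It is a simplification written for illustration, not sacrebleu's implementation (which aggregates at corpus level and handles short segments differently), and the function names are hypothetical. It only shows where word n-grams enter the average when word_order is greater than 0.

```python
from collections import Counter

def _ngrams(items, n):
    """Count the n-grams of a sequence (a string of characters or a list of words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

def simple_chrf(hyp, ref, char_order=6, word_order=0, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only).

    word_order=0 uses character n-grams only; word_order=2 also averages in
    word unigrams and bigrams, which is the ChrF++ setting.
    """
    # Character n-grams are taken over the text with spaces removed.
    hyp_chars, ref_chars = hyp.replace(" ", ""), ref.replace(" ", "")
    hyp_words, ref_words = hyp.split(), ref.split()
    orders = [(hyp_chars, ref_chars, n) for n in range(1, char_order + 1)]
    orders += [(hyp_words, ref_words, n) for n in range(1, word_order + 1)]

    precisions, recalls = [], []
    for h, r, n in orders:
        h_counts, r_counts = _ngrams(h, n), _ngrams(r, n)
        if not h_counts or not r_counts:
            continue  # skip orders longer than either side
        overlap = sum((h_counts & r_counts).values())
        precisions.append(overlap / sum(h_counts.values()))
        recalls.append(overlap / sum(r_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Calling `simple_chrf(hyp, ref, word_order=0)` and `simple_chrf(hyp, ref, word_order=2)` on the same pair generally gives different numbers, which is why scores from the two settings should not be placed in one column.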
How GSPO helped
Reward = sentence-ChrF against held-out, unseen Ayacucho references (deduplicated against the model's exact training corpus). Validation ChrF climbs, peaks, then declines (over-optimization); we select the peak checkpoint on validation and run a single test evaluation. The test set is never used for selection.
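The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's script: the function name is hypothetical, and the validation curve is made-up data chosen only to show the climb, peak, and decline shape.

```python
def select_peak_checkpoint(val_chrf):
    """Return the (step, score) pair with the highest validation ChrF."""
    if not val_chrf:
        raise ValueError("no validation scores to select from")
    # Ties go to the earliest step, i.e. the checkpoint with the least RL drift.
    best_step = min(val_chrf, key=lambda step: (-val_chrf[step], step))
    return best_step, val_chrf[best_step]

# Hypothetical validation curve (made-up numbers, not the project's logs).
curve = {200: 44.1, 400: 44.9, 600: 45.4, 800: 45.2, 1000: 44.6}
print(select_peak_checkpoint(curve))  # (600, 45.4)
```

The point of the design is that only validation scores enter this function; the test set is scored once, afterwards, on the checkpoint it returns.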
Quality beyond the surface metric
ChrF is a single-reference surface metric and saturates. To check that the RL gains reflect real quality (not metric-gaming), we score several automatic, speaker-free axes:
| Axis | NLLB-1.3B (pre-RL) | + GSPO | direction |
|---|---|---|---|
| ChrF (w0) | 43.17 | 45.53 | higher better |
| Adequacy (round-trip quy→spa vs source) | 48.28 | 52.86 | higher better |
| Spanish-leakage (% sentences) | 3.39 | 2.69 | lower better |
| Length miscalibration (mean \|len ratio − 1\|) | 0.201 | 0.175 | lower better |
GSPO improved adequacy (+4.6) by more than it improved ChrF (+2.4), and reduced leakage and length error — i.e. the gains are multi-axis quality, not surface gaming. (These are automatic proxies; no human judgments exist for this language pair.)
In the scorecard and the metrics table below, NLLB-1.3B is decoded with the same settings as the GSPO model (beam5 + no_repeat_ngram_size=3 + apostrophe-suppression) → 43.17, vs 42.95 (beam5 only) in the headline table. The matched-decode comparison is the fair one; it gives a slightly smaller GSPO gain (+2.4) than the headline table (+2.6).
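As one example of a speaker-free axis, the length-miscalibration row (mean |len ratio − 1|) can be sketched as below. This is a hedged sketch: the card does not say whether length is counted in characters or tokens, so character length is an assumption here, and the function name is hypothetical.

```python
def length_miscalibration(hyps, refs):
    """Mean of |len(hyp) / len(ref) - 1| over aligned sentence pairs (0 is perfect)."""
    if len(hyps) != len(refs):
        raise ValueError("hyps and refs must be aligned one-to-one")
    errors = []
    for hyp, ref in zip(hyps, refs):
        if not ref:
            continue  # the ratio is undefined for an empty reference
        # Assumption: length is measured in characters.
        errors.append(abs(len(hyp) / len(ref) - 1.0))
    return sum(errors) / len(errors) if errors else 0.0

print(length_miscalibration(["abcd", "ab"], ["abcd", "abcd"]))  # 0.25
```

Because the absolute value is taken per sentence, outputs that are too long and outputs that are too short both raise the score instead of cancelling out.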
Standard MT metrics (supervised vs GSPO)
| Metric | NLLB-1.3B (pre-RL) | + GSPO | direction |
|---|---|---|---|
| ChrF (w0) | 43.17 | 45.53 | higher better |
| ChrF++ (w2) | 37.63 | 39.64 | higher better |
| BLEU | 5.65 | 5.94 | higher better |
| TER | 87.02 | 88.62 | lower better |
We report all four for transparency. GSPO (reward = ChrF) improves the character-level metrics (ChrF, ChrF++) and BLEU marginally, but word-level TER does not improve — the RL optimized character overlap, not word edits. BLEU is near-floor for both systems: word n-gram matching is unreliable for an agglutinative language under a single reference, which is exactly why we treat ChrF as the primary metric here.
Training
- Base / SFT: NLLB-200-1.3B → LoRA (BSC-2024 recipe, r256/α512, lr 2e-4 inverse-sqrt) → + ~198k synthetic pairs (Spanish monolingual forward-translated by our Qwen-9B teacher; sequence-level distillation) → supervised model ("NLLB-r2", 42.95).
- RL: GSPO — sequence-level (length-normalized) importance ratio + group-relative advantage + KL-to-frozen-reference, with single inner-epoch updates per rollout (so the importance ratio is 1 at the update and the ratio/clip reduce to length-normalized policy gradient in this regime). Reward = sentence-ChrF, selected on validation over ChrF++, length-penalty, repetition-penalty, and round-trip-adequacy reward variants (an ablation; plain ChrF won — the round-trip reward was built and falsified). Group size 16; lr 2e-6, clip 0.2, KL-coef 0.04. Rollouts via an in-process vLLM engine (we implemented NLLB/M2M-100 support for vLLM, unsupported upstream) with per-step GPU→GPU weight sync.
- Data hygiene: RL data deduplicated against the model's exact training corpus and the test set; dialect-filtered to Ayacucho/Chanka; separate validation split for checkpoint selection; one final test evaluation.
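The core GSPO quantities named in the RL bullet (length-normalized sequence ratio, group-relative advantage, clipping) can be sketched in plain Python. This follows the objective as described in the GSPO paper and is not the project's training code: the function names are hypothetical, the inputs are per-token log-probabilities you would obtain from the policy, and the KL-to-reference term is deliberately left out.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio from per-token log-probs."""
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    diff = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(diff / len(logp_new))  # (pi_new / pi_old) ** (1 / |y|)

def gspo_surrogate(rewards, logps_new, logps_old, clip=0.2):
    """Clipped sequence-level surrogate for one group (to be maximized)."""
    advantages = group_advantages(rewards)
    total = 0.0
    for adv, new, old in zip(advantages, logps_new, logps_old):
        ratio = sequence_ratio(new, old)
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a single inner-epoch update per rollout, `logps_new` equals `logps_old` at the update, so every ratio is 1 and the clip never activates, which is the regime the bullet above describes.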
Reproduce
All scripts are in the GitHub repo; the full
narrative (problems, breakthroughs, methodology) is in docs/report/
(Typst source + compiled PDF). Outline:
# 1. Supervised NLLB (BSC recipe) + synthetic distillation -> "NLLB-r2"
python scripts/nllb/train_nllb_chanka.py --train-parquet clean_chanka/nllb_v2_corpus.parquet ...
# 2. GSPO RL (needs our NLLB-in-vLLM fork; reward = held-out-ref ChrF, G=16)
python scripts/rl/gspo_nllb_vllm.py --init-adapter <nllb-r2> --reward-type chrf --group-size 16 ...
# 3. Select the peak checkpoint on the held-out validation split, then ONE test eval
python scripts/nllb/eval_nllb_americasnlp.py --adapter <ckpt> --no-repeat-ngram 3 # standalone 45.53
# 4. (optional, configuration-dependent) MBR + multi-checkpoint ensemble
python scripts/decoding/gen_candidates_nllb.py --adapter <ckpt> --temperature 0.7
python scripts/decoding/ensemble_mbr_rerank.py --candidate-jsonls <ckpt600> <ckpt800> # 46.71
# audits / scorecard
python scripts/decoding/quality_scorecard.py --reverse-adapter <reverse> --pred-jsonl ...
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-1.3B", src_lang="spa_Latn", tgt_lang="quy_Latn")
m = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-1.3B", torch_dtype=torch.bfloat16)
m = PeftModel.from_pretrained(m, "Thermostatic/rosettia-quy-gspo-nllb13b-lora").cuda().eval()
bos = tok.convert_tokens_to_ids("quy_Latn")
enc = tok("No sé por qué sucedió eso.", return_tensors="pt").to("cuda")
out = m.generate(**enc, forced_bos_token_id=bos, num_beams=5, max_new_tokens=128)
print(tok.batch_decode(out, skip_special_tokens=True)[0]) # -> Manam yachanichu imarayku chay pasarqa.
The root adapter is the validation-selected single model (standalone 45.53 / self-MBR
46.43). For the best result (46.71), sample candidates from this adapter and the
checkpoint-800/ adapter, deduplicate, and pick the ChrF-MBR consensus. Apostrophe
suppression at decode is a small free gain (Ayacucho quy has no glottalization).
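The "deduplicate, then pick the consensus" step can be sketched as follows. This is a minimal illustration, not the repository's ensemble_mbr_rerank.py: the function names are hypothetical, and a character-trigram F1 stands in for sentence-level ChrF as the utility. A real ChrF scorer can be passed as `utility` instead.

```python
from collections import Counter

def char_ngram_f1(hyp, ref, n=3):
    """Stand-in utility: F1 over character n-grams (simpler than full ChrF)."""
    h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=char_ngram_f1):
    """Return the candidate with the highest mean utility against the others."""
    pool = list(dict.fromkeys(candidates))  # deduplicate, keep first-seen order
    if not pool:
        raise ValueError("no candidates to select from")
    if len(pool) == 1:
        return pool[0]

    def expected_utility(i):
        others = [pool[j] for j in range(len(pool)) if j != i]
        return sum(utility(pool[i], other) for other in others) / len(others)

    return pool[max(range(len(pool)), key=expected_utility)]

# Candidates may come from one adapter or be pooled from several checkpoints.
print(mbr_select(["abcdef", "abcdefg", "abcdeX", "zzzzzz", "abcdef"]))  # abcdef
```

Pooling candidates from two adapters before calling `mbr_select` gives the multi-checkpoint variant; the selection logic itself does not change.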
Limitations & intended use
- Research-grade, validated on a single benchmark with single-reference ChrF. ChrF ~46 means roughly half the character n-grams match one reference — useful as a draft, not production quality. Review with speakers before consequential use.
- No native-speaker / multi-reference evaluation was performed (no Chanka experts were available); all "quality" axes here are automatic proxies.
- Known residual issues: numbers are often kept as digits rather than spelled out in Quechua; occasional Spanish loanword spelling; rare repetition (mitigated with no_repeat_ngram_size=3 at decode).
- Dialect: Ayacucho/Chanka (quy). Not validated for Cuzco (quz) or Central varieties.
- License: cc-by-nc-4.0; non-commercial, consistent with the underlying data sources.
Authors & contributions
A two-person SomosNLP hackathon project:
- Estefanía Espinosa Fernández — data curation, and the initial Qwen3.5 LoRA experiments (comparing DoRA, rsLoRA and LoRA, and exploring data mixes).
- Irving Ernesto Quezada Ramírez (irvingernesto.com) — the subsequent modeling through the final system: synthetic-data distillation, the NLLB pipeline, GSPO reinforcement learning, decoding/ensembling, evaluation, and release.
The project was a close collaboration; both contributions were essential to the result.
Links & resources
- Code & methodology: https://github.com/Sekinal/rosettia-chanka
- Merged (standalone) model, no PEFT needed: https://huggingface.co/Thermostatic/rosettia-quy-gspo-nllb13b-merged
- NLLB / M2M-100 support for vLLM (our fork — used for fast GSPO rollouts; NLLB was unsupported upstream): https://github.com/Sekinal/vllm/tree/add-nllb-m2m100-support
- Data: https://huggingface.co/datasets/Thermostatic/rosettia-chanka-data
- Qwen-9B sibling model (the other ensemble member): https://huggingface.co/Thermostatic/rosettia-quy-v30b-9b-merged
- Base model: https://huggingface.co/facebook/nllb-200-1.3B
- GSPO: Zheng et al. 2025, Group Sequence Policy Optimization (arXiv:2507.18071)


