Strathos GRPO: Stage-3 RL Post-Training (Research Artifact)

GRPO-trained LoRA adapter on top of Strathos V2-PLUS SFT.

This is a research artifact, not a production model. Strathos V2-PLUS (the SFT-only model) outperforms this GRPO model on held-out evaluation. We release this adapter for reproducibility and to document an honest GRPO-on-small-data finding.

Built solo for the Meta PyTorch OpenEnv Hackathon Grand Finale (Bangalore, April 25-26, 2026).

Why we trained it

The OpenEnv hackathon's explicit ask was to "build environments AND post-train models." After three attempted GRPO integration paths via TRL hit infrastructure blockers (rollout_func bugs, vLLM colocate memory, TRL+OpenEnv version mismatches), we shipped a standalone GRPO that trains directly on stored prompts/completions using our 7-component composable rubric as the reward function — sidestepping all integration issues.

Setup

Field	Value
Base	`kavyanshshakya/strathos-qwen17b-sft` (V2-PLUS SFT)
Algorithm	GRPO (TRL 0.27.1)
Reward function	7-component composable rubric (offline scoring)
Dataset	15 unique env-derived prompts
Generations per prompt	4
Total rollouts	120 (15 × 4 × 2 epochs)
Learning rate	5e-6
KL beta	0.04
Hardware	Colab Pro A100
Wall-clock	3.5 minutes

Result

GRPO training succeeded end-to-end with measurable reward improvement during training:

Metric	Step 2 (start)	Step 14 (end)	Δ
Mean reward	0.31	0.66	+110%
Loss range	-0.07 to +0.08 (small advantage-based losses, normal for GRPO)

However, on held-out evaluation against 15 unique env scenarios:

Model	Strict (decision + ASI class)	Relaxed (decision only)
Base Qwen 1.7B	0/15 (0%)	0/15 (0%)
V2-PLUS SFT	6/15 (40%)	9/15 (60%)
V2-PLUS + format-hint	4/15 (27%)	12/15 (80%)
GRPO (this model)	3/15 (20%)	7/15 (47%)

GRPO degraded performance vs SFT baseline despite climbing reward. Two failure modes were observed:

Reward hacking: Model emitted strings like "asi_class": "ASI01-ASI07" covering all classes, exploiting the regex-based ASI class match in the reward function.
Format collapse: 5 of 15 cases (including 4 of 5 legitimate scenarios) produced decision=None — the JSON output structure that V2-PLUS reliably produced was destabilized by GRPO.

Honest takeaway

This is what GRPO-on-small-data looks like in practice. The reward signal is real and trainable, but with only 15 unique prompts and a regex-parseable reward, the model finds shortcuts faster than it improves on the actual task. Larger prompt set, stricter reward shaping (enum validation, structural validation), and longer training are needed.

Use

Same loading pattern as the SFT model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "kavyanshshakya/strathos-qwen17b-grpo")

For production use, prefer the SFT model: kavyanshshakya/strathos-qwen17b-sft.

Citation

@misc{strathos-grpo-2026,
  author = {Shakya, Kavyansh},
  title = {Strathos GRPO: Stage-3 RL Post-Training Research Artifact},
  year = {2026},
  howpublished = {Meta PyTorch OpenEnv Hackathon Grand Finale, Bangalore},
  url = {https://huggingface.co/kavyanshshakya/strathos-qwen17b-grpo}
}

License

MIT

Downloads last month: -

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kavyanshshakya/strathos-qwen17b-grpo

Base model

Qwen/Qwen3-1.7B-Base

Finetuned

Qwen/Qwen3-1.7B

Adapter

kavyanshshakya/strathos-qwen17b-sft