Instructions to use kavyanshshakya/strathos-qwen17b-grpo with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use kavyanshshakya/strathos-qwen17b-grpo with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B") model = PeftModel.from_pretrained(base_model, "kavyanshshakya/strathos-qwen17b-grpo") - Notebooks
- Google Colab
- Kaggle
Strathos GRPO: Stage-3 RL Post-Training (Research Artifact)
GRPO-trained LoRA adapter on top of Strathos V2-PLUS SFT.
This is a research artifact, not a production model. Strathos V2-PLUS (the SFT-only model) outperforms this GRPO model on held-out evaluation. We release this adapter for reproducibility and to document an honest GRPO-on-small-data finding.
Built solo for the Meta PyTorch OpenEnv Hackathon Grand Finale (Bangalore, April 25-26, 2026).
Why we trained it
The OpenEnv hackathon's explicit ask was to "build environments AND post-train models." After three attempted GRPO integration paths via TRL hit infrastructure blockers (rollout_func bugs, vLLM colocate memory, TRL+OpenEnv version mismatches), we shipped a standalone GRPO that trains directly on stored prompts/completions using our 7-component composable rubric as the reward function โ sidestepping all integration issues.
Setup
| Field | Value |
|---|---|
| Base | kavyanshshakya/strathos-qwen17b-sft (V2-PLUS SFT) |
| Algorithm | GRPO (TRL 0.27.1) |
| Reward function | 7-component composable rubric (offline scoring) |
| Dataset | 15 unique env-derived prompts |
| Generations per prompt | 4 |
| Total rollouts | 120 (15 ร 4 ร 2 epochs) |
| Learning rate | 5e-6 |
| KL beta | 0.04 |
| Hardware | Colab Pro A100 |
| Wall-clock | 3.5 minutes |
Result
GRPO training succeeded end-to-end with measurable reward improvement during training:
| Metric | Step 2 (start) | Step 14 (end) | ฮ |
|---|---|---|---|
| Mean reward | 0.31 | 0.66 | +110% |
| Loss range | -0.07 to +0.08 (small advantage-based losses, normal for GRPO) |
However, on held-out evaluation against 15 unique env scenarios:
| Model | Strict (decision + ASI class) | Relaxed (decision only) |
|---|---|---|
| Base Qwen 1.7B | 0/15 (0%) | 0/15 (0%) |
| V2-PLUS SFT | 6/15 (40%) | 9/15 (60%) |
| V2-PLUS + format-hint | 4/15 (27%) | 12/15 (80%) |
| GRPO (this model) | 3/15 (20%) | 7/15 (47%) |
GRPO degraded performance vs SFT baseline despite climbing reward. Two failure modes were observed:
- Reward hacking: Model emitted strings like
"asi_class": "ASI01-ASI07"covering all classes, exploiting the regex-based ASI class match in the reward function. - Format collapse: 5 of 15 cases (including 4 of 5 legitimate scenarios) produced
decision=Noneโ the JSON output structure that V2-PLUS reliably produced was destabilized by GRPO.
Honest takeaway
This is what GRPO-on-small-data looks like in practice. The reward signal is real and trainable, but with only 15 unique prompts and a regex-parseable reward, the model finds shortcuts faster than it improves on the actual task. Larger prompt set, stricter reward shaping (enum validation, structural validation), and longer training are needed.
Use
Same loading pattern as the SFT model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "kavyanshshakya/strathos-qwen17b-grpo")
For production use, prefer the SFT model: kavyanshshakya/strathos-qwen17b-sft.
Citation
@misc{strathos-grpo-2026,
author = {Shakya, Kavyansh},
title = {Strathos GRPO: Stage-3 RL Post-Training Research Artifact},
year = {2026},
howpublished = {Meta PyTorch OpenEnv Hackathon Grand Finale, Bangalore},
url = {https://huggingface.co/kavyanshshakya/strathos-qwen17b-grpo}
}
License
MIT
- Downloads last month
- -
Model tree for kavyanshshakya/strathos-qwen17b-grpo
Base model
Qwen/Qwen3-1.7B-Base