---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- lora
- grpo
- legal-reasoning
- syllogym
- multi-turn
- rlvr
---

# SylloGym: Qwen3-4B GRPO (LoRA adapter)

Fine-tuned from [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the [SylloGym](https://huggingface.co/spaces/farffadet/syllogym-env) legal reasoning environment using GRPO (Group Relative Policy Optimization).

- **Training:** 180 steps on an A100, multi-turn rollout via TRL + Unsloth
- **Environment:** 12 legal domains, 45 tasks, deterministic Python verifiers
- **Result:** +6.1 pp overall accuracy (61.7% → 67.8%), +8.3 pp on 5-turn episodes

This is a LoRA adapter. Load it on top of the base Qwen3-4B model.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model first, then attach the LoRA adapter weights on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "farffadet/syllogym-judge-qwen3-4b-grpo")

# The tokenizer is loaded from the adapter repository.
tokenizer = AutoTokenizer.from_pretrained("farffadet/syllogym-judge-qwen3-4b-grpo")
```

## Environment

- **Repo:** [eliot-gtn/syllogym](https://github.com/eliot-gtn/syllogym)
- **HF Space:** [farffadet/syllogym-env](https://huggingface.co/spaces/farffadet/syllogym-env)
- **Blog post:** See the repository for the full write-up.

## Training details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-4B (4-bit, LoRA r=32) |
| Steps | 180 (target: 300) |
| Batch size | 8 |
| Temperature | 0.9 |
| KL penalty (β) | 0.02 |
| Loss type | DR-GRPO |
| max_completion_length | 3072 tokens |
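
For readers unfamiliar with the `DR-GRPO` loss type listed in the training details, the sketch below illustrates the two ideas usually associated with it: advantages are centered on the group mean without dividing by the group standard deviation, and token losses are normalized by a fixed constant (here `max_completion_length`) rather than by each completion's own length. This is a minimal illustration under those assumptions, not the training code used for this adapter; the function names `group_relative_advantages` and `dr_grpo_aggregate` are hypothetical.

```python
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Center each reward on the mean of its rollout group.

    Assumption: no division by the group standard deviation,
    as in the Dr. GRPO variant of GRPO.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def dr_grpo_aggregate(token_losses: List[List[float]], max_completion_length: int) -> float:
    """Aggregate per-token losses with a constant normalizer.

    Dividing by (number of completions * max_completion_length) instead of
    each completion's own length avoids favoring long or short completions.
    """
    total = sum(sum(seq) for seq in token_losses)
    return total / (len(token_losses) * max_completion_length)


# Four rollouts of the same prompt, scored 1.0 (verifier passed) or 0.0 (failed).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # [0.5, -0.5, -0.5, 0.5]

# Two completions with 2 and 1 tokens, normalized by a constant length of 4.
print(dr_grpo_aggregate([[1.0, 1.0], [2.0]], max_completion_length=4))  # 0.5
```

With binary verifier rewards, a group in which every rollout passes (or every rollout fails) yields all-zero advantages, so only prompts with mixed outcomes contribute a learning signal.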