---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
  - lora
  - grpo
  - legal-reasoning
  - syllogym
  - multi-turn
  - rlvr
---

# SylloGym — Qwen3-4B GRPO (LoRA adapter)

Fine-tuned from [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the [SylloGym](https://huggingface.co/spaces/farffadet/syllogym-env) legal reasoning environment using GRPO (Group Relative Policy Optimization).

**Training:** 180 steps on an A100, multi-turn rollout via TRL + Unsloth  
**Environment:** 12 legal domains, 45 tasks, deterministic Python verifiers  
**Result:** +6.1 pp overall accuracy (61.7% → 67.8%), +8.3 pp on 5-turn episodes

This is a LoRA adapter — load it on top of the base Qwen3-4B model.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "farffadet/syllogym-judge-qwen3-4b-grpo")
tokenizer = AutoTokenizer.from_pretrained("farffadet/syllogym-judge-qwen3-4b-grpo")
```

## Environment

- **Repo:** [eliot-gtn/syllogym](https://github.com/eliot-gtn/syllogym)
- **HF Space:** [farffadet/syllogym-env](https://huggingface.co/spaces/farffadet/syllogym-env)
- **Blog post:** See the repository for the full write-up.

## Training details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-4B (4-bit, LoRA r=32) |
| Steps | 180 (target: 300) |
| Batch size | 8 |
| Temperature | 0.9 |
| KL penalty (β) | 0.02 |
| Loss type | DR-GRPO |
| max_completion_length | 3072 tokens |