farffadet's picture
docs: add model card
2d6c8f2 verified
metadata
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
  - lora
  - grpo
  - legal-reasoning
  - syllogym
  - multi-turn
  - rlvr

SylloGym — Qwen3-4B GRPO (LoRA adapter)

Fine-tuned from Qwen/Qwen3-4B on the SylloGym legal reasoning environment using GRPO (Group Relative Policy Optimization).

Training: 180 steps on an A100, multi-turn rollout via TRL + Unsloth
Environment: 12 legal domains, 45 tasks, deterministic Python verifiers
Result: +6.1 pp overall accuracy (61.7% → 67.8%), +8.3 pp on 5-turn episodes

This is a LoRA adapter — load it on top of the base Qwen3-4B model.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "farffadet/syllogym-judge-qwen3-4b-grpo")
tokenizer = AutoTokenizer.from_pretrained("farffadet/syllogym-judge-qwen3-4b-grpo")

Environment

Training details

Parameter Value
Base model Qwen/Qwen3-4B (4-bit, LoRA r=32)
Steps 180 (target: 300)
Batch size 8
Temperature 0.9
KL penalty (β) 0.02
Loss type DR-GRPO
max_completion_length 3072 tokens