Update model card: add full GRPO training details, reward function, agent architecture
README.md
@@ -1,66 +1,289 @@
-base_model: Qwen/Qwen3.5-27B
-- lora
-- grpo
-- fidc
-- pt
-- **Method**: GRPO (Group Relative Policy Optimization) via [Tinker API](https://thinkingmachines.ai/tinker/)
-- **LoRA**: Rank 32, Alpha 32, all-linear targets
-- **Dataset**: 13,697 dual-domain Q&A pairs (code + finance + cross-domain)
-- **Reward Function**: Dual-domain with 6 guardrail gates
-Code (λ=1.0): spec adherence, architecture, pipeline compliance, code quality
-Finance (λ=0.0): guardrail compliance, factual accuracy, source attribution, precision
-Cross (λ=0.5): both domains integrated
---
library_name: peft
base_model: Qwen/Qwen3.5-27B
tags:
- lora
- grpo
- rlhf
- fidc
- portuguese
- finance
- code
- reinforcement-learning
- peft
- qwen
language:
- pt
- en
license: apache-2.0
---

# Paganini GRPO LoRA - Qwen3.5-27B

<p align="center">
  <img src="https://img.shields.io/badge/Base%20Model-Qwen3.5--27B-blue" />
  <img src="https://img.shields.io/badge/Method-GRPO%20%2B%20LoRA-purple" />
  <img src="https://img.shields.io/badge/Language-PT--BR%20%7C%20EN-green" />
  <img src="https://img.shields.io/badge/Domain-FIDC%20%7C%20Code-orange" />
  <img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey" />
</p>

> **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.

---

## Model Overview

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
| **Parameters** | 27B |
| **Adapter Type** | LoRA (PEFT) |
| **Training Method** | GRPO (Group Relative Policy Optimization) |
| **LoRA Rank** | 32 |
| **LoRA Alpha** | 32 |
| **LoRA Targets** | all-linear |
| **Task** | CAUSAL_LM |
| **Adapter Size** | 966 MB (safetensors) |
| **Languages** | Portuguese (Brazil) + English |
| **Training Platform** | [Tinker API](https://tinkerchat.ai) (Thinking Machines Lab cloud GPUs) |
| **Training Duration** | ~3 hours (23 runs) |
| **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |
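
To make the rank and alpha values in the table concrete, here is a minimal sketch of the arithmetic they imply. It assumes standard LoRA scaling (`alpha / rank`, not the rank-stabilized variant), and the layer shape is a hypothetical value chosen for illustration, not the actual Qwen3.5-27B dimensions.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update under standard LoRA: alpha / rank."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Rank and alpha as listed in the table above.
RANK, ALPHA = 32, 32
print(lora_scaling(RANK, ALPHA))  # 1.0

# Hypothetical layer shape, for illustration only.
print(lora_params(d_in=4096, d_out=4096, rank=RANK))  # 262144
```

With rank equal to alpha, the low-rank update is applied unscaled, which is one common default.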

---

## Training Pipeline

Paganini follows a two-stage alignment pipeline:

```
Qwen3.5-27B (base)
         │
         ▼
┌─────────────────────────────────────────┐
│  Stage 1: Supervised Fine-Tuning (SFT)  │
│  Platform: RunPod A100 80GB             │
│  Accuracy: 87.75% | Loss: 0.454         │
└─────────────────────────────────────────┘
         │
         ▼  sttjr/paganini-qwen35-27b-sft-lora
         │
┌─────────────────────────────────────────┐
│  Stage 2: GRPO RL Alignment (this)      │
│  Platform: Tinker API (TML Cloud GPUs)  │
│  23 training runs | ~3 hours            │
│  Dual-domain reward optimization        │
└─────────────────────────────────────────┘
         │
         ▼  sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
```
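
Stage 2 relies on GRPO, which scores several completions for the same prompt and ranks them against each other instead of using a learned value function. The sketch below shows the group-relative advantage step in its commonly described form. It is an illustration under stated assumptions, not the training code used for this adapter: the use of the population standard deviation, the `eps` value, and the sample rewards are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
advantages = group_relative_advantages([0.90, 0.55, 0.30, 0.65])
print([round(a, 2) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced; those below average are pushed down.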

### SFT Predecessor

The GRPO run was initialized from the SFT checkpoint:
- **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
- **Platform**: RunPod A100 80GB
- **Accuracy**: 87.75%
- **Final Loss**: 0.454

---

## Dataset

**Name:** `dual-dataset-v2.jsonl`

| Split | Count |
|---|---|
| Total samples | 13,697 |
| Code domain | 6,848 |
| Finance domain | 6,849 |

**Difficulty distribution:**

| Level | Count |
|---|---|
| L1 (Basic) | 4,566 |
| L2 (Intermediate) | 4,566 |
| L3 (Advanced) | 4,565 |

**Sources:**
- **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios, a receivables investment fund) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
- **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development
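
The split and difficulty counts above can be checked with a simple tally over the JSON Lines file. The sketch below assumes each record carries `domain` and `difficulty` fields; the schema of `dual-dataset-v2.jsonl` is not documented here, so those field names and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def tally(jsonl_lines: Iterable[str], key: str) -> Counter:
    """Count records per value of `key` in an iterable of JSON Lines strings."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)[key]] += 1
    return counts


# Hypothetical records; the real field names may differ.
sample = [
    '{"domain": "code", "difficulty": "L1", "question": "...", "answer": "..."}',
    '{"domain": "finance", "difficulty": "L3", "question": "...", "answer": "..."}',
    '{"domain": "finance", "difficulty": "L1", "question": "...", "answer": "..."}',
]
print(tally(sample, "domain"))
print(tally(sample, "difficulty"))
```

Passing an open file handle instead of `sample` works the same way, since a text file iterates line by line.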

---

## Reward Function (Dual-Domain)

The GRPO training uses a composite reward function:

```
R(x) = λ · R_code + (1 - λ) · R_finance + R_shared
```

Here `λ = 1.0` for code samples and `λ = 0.0` for finance samples, so each sample is scored by its own domain reward plus the shared term.

### R_code: Code Domain Rewards

| Component | Reward |
|---|---|
| Spec adherence | +0.30 |
| Architecture patterns | +0.25 |
| Pipeline compliance | +0.15 |
| Code blocks present | +0.10 |
| TDD terms present | +0.10 |
| **Maximum** | **+0.90** |

### R_finance: Finance Domain Rewards

| Component | Reward |
|---|---|
| Guardrail compliance | +0.35 |
| Source attribution | +0.20 |
| CVM citation | +0.15 |
| Article reference | +0.15 |
| **Maximum** | **+0.85** |

### R_shared: Shared Penalty/Bonus

| Component | Reward |
|---|---|
| Hallucination penalty | -0.15 |
| Corporate speak penalty | -0.05 per occurrence |
| PT-BR language bonus | +0.05 |
| Length < 50 tokens penalty | -0.20 |
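
The three tables combine into a single score per response. The following sketch wires them together using the weights listed above. It assumes each component is an all-or-nothing check whose result is supplied by the caller; how the actual training run detects spec adherence, hallucinations, or corporate speak is not described here, so the names `SharedSignals`, `domain_reward`, `shared_reward`, and `composite_reward` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Weights copied from the tables above.
CODE_WEIGHTS: Dict[str, float] = {
    "spec_adherence": 0.30,
    "architecture_patterns": 0.25,
    "pipeline_compliance": 0.15,
    "code_blocks_present": 0.10,
    "tdd_terms_present": 0.10,
}
FINANCE_WEIGHTS: Dict[str, float] = {
    "guardrail_compliance": 0.35,
    "source_attribution": 0.20,
    "cvm_citation": 0.15,
    "article_reference": 0.15,
}


@dataclass
class SharedSignals:
    """Caller-supplied signals for the shared term (detection logic is out of scope)."""
    hallucination: bool = False
    corporate_speak_count: int = 0
    is_pt_br: bool = False
    token_count: int = 50


def domain_reward(weights: Dict[str, float], satisfied: Set[str]) -> float:
    """Sum the weights of the components a response satisfies."""
    return sum(w for name, w in weights.items() if name in satisfied)


def shared_reward(s: SharedSignals) -> float:
    """Apply the shared penalties and the PT-BR bonus."""
    reward = 0.0
    if s.hallucination:
        reward -= 0.15
    reward -= 0.05 * s.corporate_speak_count
    if s.is_pt_br:
        reward += 0.05
    if s.token_count < 50:
        reward -= 0.20
    return reward


def composite_reward(lam: float, code_ok: Set[str], finance_ok: Set[str],
                     shared: SharedSignals) -> float:
    """R(x) = lam * R_code + (1 - lam) * R_finance + R_shared."""
    return (lam * domain_reward(CODE_WEIGHTS, code_ok)
            + (1 - lam) * domain_reward(FINANCE_WEIGHTS, finance_ok)
            + shared_reward(shared))


# Finance sample (lam = 0.0) that satisfies every finance component and answers in PT-BR.
score = composite_reward(0.0, set(), set(FINANCE_WEIGHTS),
                         SharedSignals(is_pt_br=True, token_count=120))
print(round(score, 2))
```

For that finance sample the score is the finance maximum plus the language bonus, 0.85 + 0.05 = 0.90, up to floating-point rounding.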

---

## Use Case: Paganini AIOS

This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:

| Agent | Role |
|---|---|
| Admin | Administrative governance and fund operations |
| Custodian | Asset custody, settlement, and safekeeping |
| Manager | Portfolio management and investment decisions |
| Compliance | Regulatory adherence and audit trails |
| Reporting | Investor reporting and fund disclosures |
| Due Diligence | Assignor (cedente) and debtor analysis, credit assessment |
| RegWatch | Regulatory change monitoring (CVM, BACEN) |
| IR | Investor Relations communication |
| Pricing | Asset pricing and NAV calculation |

### 6-Gate Guardrail Pipeline

Each query passes through a sequential compliance chain:

```
Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
```

All 6 gates must pass before a response is delivered to end users. The chain is intended to keep outputs CVM 175-compliant and free of hallucinated content across all agent types.
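
The sequential chain can be pictured as a short-circuiting loop: the first gate that rejects a response stops delivery. This is a minimal sketch of that control flow, assuming each gate is a simple predicate over the response text. The gate names come from the chain above; the checks themselves are placeholders, including the toy "Compliance" rule, and do not reflect the platform's real logic.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

Gate = Callable[[str], bool]


class GateResult(NamedTuple):
    passed: bool
    failed_gate: Optional[str]


def run_gates(response: str, gates: List[Tuple[str, Gate]]) -> GateResult:
    """Run gates in order and stop at the first one that rejects the response."""
    for name, check in gates:
        if not check(response):
            return GateResult(False, name)
    return GateResult(True, None)


# Placeholder checks, for illustration only.
GATES: List[Tuple[str, Gate]] = [
    ("Eligibility", lambda r: True),
    ("Concentration", lambda r: True),
    ("Covenant", lambda r: True),
    ("PLD/AML", lambda r: True),
    ("Compliance", lambda r: "CVM 175" in r),  # toy rule: require a CVM 175 mention
    ("Risk", lambda r: True),
]

print(run_gates("Conforme a CVM 175, o fundo deve observar os limites.", GATES))
print(run_gates("Resposta sem fonte.", GATES))
```

Returning the name of the failing gate, rather than a bare boolean, makes it easier to log why a response was withheld.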

---

## Usage

### Installation

```bash
pip install transformers peft accelerate
```

### Load and Run

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B",
    device_map="auto",
    torch_dtype="auto"
)

# Load GRPO LoRA adapter
model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Finance domain example (PT-BR).
# English: "Explain the minimum PDD requirements for a FIDC under CVM 175."
prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### Merge Adapter (Optional)

```python
# `model` and `tokenizer` come from the Load and Run example above.
# Merge LoRA weights into base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("paganini-27b-merged")
tokenizer.save_pretrained("paganini-27b-merged")
```

---

## Checkpoints

| Checkpoint | Size | Description |
|---|---|---|
| `paganini-test` | 2.7 GB | Intermediate checkpoint |
| `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |

---

## Intended Use & Limitations

### Intended Use
- FIDC regulatory Q&A in Portuguese (Brazil)
- Software architecture guidance for AIOS agents
- Compliance-first financial analysis aligned with CVM 175
- Internal enterprise use within the Paganini AIOS platform

### Out-of-Scope Use
- General-purpose chatbot (use base Qwen3.5-27B instead)
- Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
- Real-time trading decisions or autonomous financial transactions

### Limitations
- Finance knowledge is bounded by CVM 175 regulatory corpus at training cutoff
- PT-BR outputs are prioritized; EN responses may be less fluent
- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
- LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)
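
The memory figures in the limitations follow from simple arithmetic on the parameter count. The sketch below covers weights only and uses decimal gigabytes; activations, the KV cache, and framework overhead come on top, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 27e9  # 27B parameters, from the model overview

print(weight_memory_gb(PARAMS, 2))  # fp16/bf16: 54.0
print(weight_memory_gb(PARAMS, 4))  # fp32: 108.0
```

The fp32 figure exceeds a single 80 GB card, which is consistent with the two-GPU recommendation for full-precision inference.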

---

## Project Links

| Resource | Link |
|---|---|
| GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
| Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
| SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |

---

## Citation

```bibtex
@misc{paganini-grpo-lora-2026,
  title        = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
  author       = {sttjr},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
  note         = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
}
```

---

## License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.

---

*Paganini AIOS: Built for the Brazilian FIDC ecosystem.*