Commit 1213e19 by sttjr (verified) · 1 Parent(s): f40791c

Update model card: add full GRPO training details, reward function, agent architecture

  1. README.md +261 -38
README.md CHANGED
@@ -1,66 +1,289 @@
1
  ---
2
- base_model: Qwen/Qwen3.5-27B
3
  library_name: peft
4
- license: apache-2.0
5
  tags:
6
- - lora
7
- - grpo
8
- - rl
9
- - fidc
10
- - finance
11
- - compliance
12
- - portuguese
13
- - paganini-aios
14
  language:
15
- - pt
16
- pipeline_tag: text-generation
17
  ---
18
 
19
- # Paganini AIOS: GRPO LoRA Adapter
20
 
21
- **Qwen3.5-27B + LoRA Rank 32** fine-tuned with Group Relative Policy Optimization (GRPO) for dual-domain expertise: **Brazilian FIDC compliance** and **software engineering**.
22
 
23
- ## Training Details
24
 
25
- - **Base Model**: Qwen/Qwen3.5-27B
26
- - **Method**: GRPO (Group Relative Policy Optimization) via [Tinker API](https://thinkingmachines.ai/tinker/)
27
- - **LoRA**: Rank 32, Alpha 32, all-linear targets
28
- - **Dataset**: 13,697 dual-domain Q&A pairs (code + finance + cross-domain)
29
- - **Reward Function**: Dual-domain with 6 guardrail gates
30
 
31
- ## Reward Function Design
32
 
33
  ```
34
- R(x) = λ·R_code + (1-λ)·R_fin + R_shared
35
 
36
- Code (λ=1.0): spec adherence, architecture, pipeline compliance, code quality
37
- Finance (λ=0.0): guardrail compliance, factual accuracy, source attribution, precision
38
- Cross (λ=0.5): both domains integrated
39
  ```
40
 
41
- ### Guardrail Gates
42
- 1. **Eligibility**: CVM 175 compliance check
43
- 2. **Concentration**: Portfolio concentration limits
44
- 3. **Covenant**: Fund covenant monitoring
45
- 4. **PLD/AML**: Anti-money laundering
46
- 5. **Compliance**: Regulatory compliance
47
- 6. **Risk**: Bayesian risk assessment
48
 
49
- ## Usage
50
 
51
  ```python
52
  from peft import PeftModel
53
  from transformers import AutoModelForCausalLM, AutoTokenizer
54
 
55
- base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-27B")
56
  model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
57
  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
58
  ```
59
 
60
- ## Part of Paganini AIOS
61
 
62
- [Paganini AIOS](https://github.com/juboyy/paganini-aios) is an autonomous AI system for Brazilian FIDC (Fundos de Investimento em Direitos Creditórios) operations, featuring 14 specialized agents, 6 guardrail gates, and a Bayesian risk network.
63
 
64
- ## SFT Checkpoint
65
 
66
- The SFT checkpoint (pre-GRPO) is available at: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
 
1
  ---
 
2
  library_name: peft
3
+ base_model: Qwen/Qwen3.5-27B
4
  tags:
5
+ - lora
6
+ - grpo
7
+ - rlhf
8
+ - fidc
9
+ - portuguese
10
+ - finance
11
+ - code
12
+ - reinforcement-learning
13
+ - peft
14
+ - qwen
15
  language:
16
+ - pt
17
+ - en
18
+ license: apache-2.0
19
+ ---
20
+
21
+ # Paganini GRPO LoRA: Qwen3.5-27B
22
+
23
+ <p align="center">
24
+ <img src="https://img.shields.io/badge/Base%20Model-Qwen3.5--27B-blue" />
25
+ <img src="https://img.shields.io/badge/Method-GRPO%20%2B%20LoRA-purple" />
26
+ <img src="https://img.shields.io/badge/Language-PT--BR%20%7C%20EN-green" />
27
+ <img src="https://img.shields.io/badge/Domain-FIDC%20%7C%20Code-orange" />
28
+ <img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey" />
29
+ </p>
30
+
31
+ > **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.
32
+
33
  ---
34
 
35
+ ## 🧠 Model Overview
36
 
37
+ | Property | Value |
38
+ |---|---|
39
+ | **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
40
+ | **Parameters** | 27B |
41
+ | **Adapter Type** | LoRA (PEFT) |
42
+ | **Training Method** | GRPO (Group Relative Policy Optimization) |
43
+ | **LoRA Rank** | 32 |
44
+ | **LoRA Alpha** | 32 |
45
+ | **LoRA Targets** | all-linear |
46
+ | **Task** | CAUSAL_LM |
47
+ | **Adapter Size** | 966 MB (safetensors) |
48
+ | **Languages** | Portuguese (Brazil) + English |
49
+ | **Training Platform** | [Tinker API](https://tinkerchat.ai), Thinking Machines Lab cloud GPUs |
50
+ | **Training Duration** | ~3 hours (23 runs) |
51
+ | **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |
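
The rank and alpha values in the table determine how strongly the adapter's low-rank update is applied. The following is a minimal sketch, assuming the standard LoRA scaling of `alpha / rank` (not a rank-stabilized variant); the `LoraSettings` class and the 4096 layer width are illustrative assumptions, not values taken from this model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters copied from the Model Overview table (class name is hypothetical)."""

    rank: int = 32
    alpha: int = 32
    targets: str = "all-linear"
    task_type: str = "CAUSAL_LM"

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        # One adapted linear layer adds A (rank x d_in) and B (d_out x rank).
        return self.rank * (d_in + d_out)


settings = LoraSettings()
print(settings.scaling)
# 4096 is an illustrative layer width, not a dimension of Qwen3.5-27B.
print(settings.added_params(4096, 4096))
```

With rank equal to alpha, the scaling factor is 1, so the learned update is added to the frozen weights without extra amplification or damping.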
52
 
53
+ ---
54
 
55
+ ## 🏗️ Training Pipeline
56
 
57
+ Paganini follows a two-stage alignment pipeline:
58
 
59
  ```
60
+ Qwen3.5-27B (base)
61
+        │
62
+        ▼
63
+ ┌─────────────────────────────────────────┐
64
+ │ Stage 1: Supervised Fine-Tuning (SFT)   │
65
+ │ Platform: RunPod A100 80GB              │
66
+ │ Accuracy: 87.75% | Loss: 0.454          │
67
+ └─────────────────────────────────────────┘
68
+        │
69
+        ▼  sttjr/paganini-qwen35-27b-sft-lora
70
+        │
71
+ ┌─────────────────────────────────────────┐
72
+ │ Stage 2: GRPO RL Alignment (this)       │
73
+ │ Platform: Tinker API (TML Cloud GPUs)   │
74
+ │ 23 training runs | ~3 hours             │
75
+ │ Dual-domain reward optimization         │
76
+ └─────────────────────────────────────────┘
77
+        │
78
+        ▼  sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
79
+ ```
80
+
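
Stage 2 relies on GRPO's group-relative advantage: several completions are sampled for the same prompt, and each one is scored against the mean reward of its group. The sketch below shows the commonly published form of that normalization. The exact variant used in this training run is not stated in the model card, so the function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sampled completions within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # Completions above the group mean get a positive advantage, below it a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for a single prompt, scored by the reward function.
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
print(advantages)

# When every completion earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.3, 0.3, 0.3])
print(flat)
```

Because the baseline is the group mean, no separate value model is needed; the trade-off is that prompts whose completions all score alike contribute nothing to the update.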
81
+ ### SFT Predecessor
82
+
83
+ The GRPO run was initialized from the SFT checkpoint:
84
+ - **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
85
+ - **Platform**: RunPod A100 80GB
86
+ - **Accuracy**: 87.75%
87
+ - **Final Loss**: 0.454
88
+
89
+ ---
90
+
91
+ ## 📦 Dataset
92
+
93
+ **Name:** `dual-dataset-v2.jsonl`
94
+
95
+ | Split | Count |
96
+ |---|---|
97
+ | Total samples | 13,697 |
98
+ | Code domain | 6,848 |
99
+ | Finance domain | 6,849 |
100
+
101
+ **Difficulty distribution:**
102
+
103
+ | Level | Count |
104
+ |---|---|
105
+ | L1 (Basic) | 4,566 |
106
+ | L2 (Intermediate) | 4,566 |
107
+ | L3 (Advanced) | 4,565 |
108
+
109
+ **Sources:**
110
+ - **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
111
+ - **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development
112
+
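
The split counts above can be checked with a short script. This is a sketch only: the schema of `dual-dataset-v2.jsonl` is not documented here, so the `domain` and `level` field names and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> tuple[Counter, Counter]:
    """Count records per domain and per difficulty level in JSONL text."""
    domains, levels = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        domains[record["domain"]] += 1  # hypothetical field name
        levels[record["level"]] += 1    # hypothetical field name
    return domains, levels


# Tiny in-memory stand-in for the real file (contents are made up).
sample = [
    '{"domain": "code", "level": "L1"}',
    '{"domain": "finance", "level": "L2"}',
    '{"domain": "finance", "level": "L3"}',
]
domains, levels = summarize(sample)
print(domains, levels)

# The counts reported in the tables are internally consistent with the total.
reported_domains = {"code": 6848, "finance": 6849}
reported_levels = {"L1": 4566, "L2": 4566, "L3": 4565}
print(sum(reported_domains.values()), sum(reported_levels.values()))
```

For the real dataset, `summarize(open(path, encoding="utf-8"))` would stream the file line by line without loading it into memory.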
113
+ ---
114
+
115
+ ## 🎯 Reward Function (Dual-Domain)
116
+
117
+ The GRPO training uses a composite reward function:
118
 
119
  ```
120
+ R(x) = λ · R_code + (1 - λ) · R_fin + R_shared
121
+ ```
122
+
123
+ Where `λ = 1.0` for code samples and `λ = 0.0` for finance samples.
124
+
125
+ ### R_code: Code Domain Rewards
126
+
127
+ | Component | Reward |
128
+ |---|---|
129
+ | Spec adherence | +0.30 |
130
+ | Architecture patterns | +0.25 |
131
+ | Pipeline compliance | +0.15 |
132
+ | Code blocks present | +0.10 |
133
+ | TDD terms present | +0.10 |
134
+ | **Maximum** | **+0.90** |
135
+
136
+ ### R_fin: Finance Domain Rewards
137
 
138
+ | Component | Reward |
139
+ |---|---|
140
+ | Guardrail compliance | +0.35 |
141
+ | Source attribution | +0.20 |
142
+ | CVM citation | +0.15 |
143
+ | Article reference | +0.15 |
144
+ | **Maximum** | **+0.85** |
145
 
146
+ ### R_shared: Shared Penalty/Bonus
147
+
148
+ | Component | Reward |
149
+ |---|---|
150
+ | Hallucination penalty | −0.15 |
151
+ | Corporate speak penalty | −0.05 per occurrence |
152
+ | PT-BR language bonus | +0.05 |
153
+ | Length < 50 tokens penalty | −0.20 |
154
+
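
The three tables combine into a single score through the composite formula. The sketch below uses the weights exactly as listed, but how each check is detected (for example, what counts as "spec adherence" or a hallucination) is not described in this card, so the check names, the set-of-passed-checks input, and all function names are assumptions.

```python
CODE_WEIGHTS = {
    "spec_adherence": 0.30,
    "architecture_patterns": 0.25,
    "pipeline_compliance": 0.15,
    "code_blocks_present": 0.10,
    "tdd_terms_present": 0.10,
}
FINANCE_WEIGHTS = {
    "guardrail_compliance": 0.35,
    "source_attribution": 0.20,
    "cvm_citation": 0.15,
    "article_reference": 0.15,
}


def weighted_sum(weights: dict[str, float], passed: set[str]) -> float:
    """Add up the weights of the checks a response passed."""
    return sum(w for name, w in weights.items() if name in passed)


def shared_reward(hallucinated: bool, corporate_hits: int, is_pt_br: bool, n_tokens: int) -> float:
    """Penalties and bonus applied to every sample regardless of domain."""
    reward = 0.0
    if hallucinated:
        reward -= 0.15
    reward -= 0.05 * corporate_hits  # applied per occurrence
    if is_pt_br:
        reward += 0.05
    if n_tokens < 50:
        reward -= 0.20
    return reward


def total_reward(lam: float, code_passed: set[str], fin_passed: set[str], **shared) -> float:
    """R(x) = lam * R_code + (1 - lam) * R_fin + R_shared."""
    r_code = weighted_sum(CODE_WEIGHTS, code_passed)
    r_fin = weighted_sum(FINANCE_WEIGHTS, fin_passed)
    return lam * r_code + (1 - lam) * r_fin + shared_reward(**shared)


# Finance sample (lam = 0.0) passing every check, answered in PT-BR.
finance_score = total_reward(
    0.0, set(), set(FINANCE_WEIGHTS),
    hallucinated=False, corporate_hits=0, is_pt_br=True, n_tokens=120,
)
# Code sample (lam = 1.0) passing every check, but short and with two filler phrases.
code_score = total_reward(
    1.0, set(CODE_WEIGHTS), set(),
    hallucinated=False, corporate_hits=2, is_pt_br=False, n_tokens=30,
)
print(round(finance_score, 2), round(code_score, 2))
```

Note that the shared term is added outside the `lam` weighting, so penalties for short or padded answers apply with full strength in both domains.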
155
+ ---
156
+
157
+ ## 🤖 Use Case: Paganini AIOS
158
+
159
+ This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:
160
+
161
+ | Agent | Role |
162
+ |---|---|
163
+ | 🏛️ Admin | Administrative governance and fund operations |
164
+ | 🏦 Custodian | Asset custody, settlement, and safekeeping |
165
+ | 📊 Manager | Portfolio management and investment decisions |
166
+ | ⚖️ Compliance | Regulatory adherence and audit trails |
167
+ | 📋 Reporting | Investor reporting and fund disclosures |
168
+ | 🔍 Due Diligence | Cedente/debtor analysis and credit assessment |
169
+ | 👁️ RegWatch | Regulatory change monitoring (CVM, BACEN) |
170
+ | 📧 IR | Investor Relations communication |
171
+ | 💹 Pricing | Asset pricing and NAV calculation |
172
+
173
+ ### 6-Gate Guardrail Pipeline
174
+
175
+ Each query passes through a sequential compliance chain:
176
+
177
+ ```
178
+ Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
179
+ ```
180
+
181
+ All 6 gates must pass before a response is delivered to end users. The chain is intended to keep outputs CVM 175-compliant and to filter out hallucinated content across all agent types.
182
+
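
A sequential chain like the one above can be modeled as an ordered list of predicates that stops at the first failure. This is a minimal sketch under stated assumptions: the gate order comes from the diagram, but the `run_gates` function, the `GateResult` type, and the stub gate logic are hypothetical and do not reflect the actual Paganini AIOS implementation.

```python
from typing import Callable, NamedTuple, Optional

Gate = Callable[[str], bool]

# Order taken from the pipeline diagram.
GATE_ORDER = ["Eligibility", "Concentration", "Covenant", "PLD/AML", "Compliance", "Risk"]


class GateResult(NamedTuple):
    passed: bool
    failed_gate: Optional[str]


def run_gates(response: str, gates: dict[str, Gate]) -> GateResult:
    """Run every gate in order and stop at the first one that rejects the response."""
    for name in GATE_ORDER:
        if not gates[name](response):
            return GateResult(False, name)
    return GateResult(True, None)


# Stub gates that accept everything, except a toy Covenant check (illustrative only).
gates: dict[str, Gate] = {name: (lambda text: True) for name in GATE_ORDER}
gates["Covenant"] = lambda text: "covenant" in text.lower()

ok = run_gates("Covenant thresholds were reviewed for this fund.", gates)
blocked = run_gates("No relevant terms here.", gates)
print(ok, blocked)
```

Returning the name of the failing gate, rather than a bare boolean, makes it possible to log which check blocked a response.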
183
+ ---
184
+
185
+ ## 🚀 Usage
186
+
187
+ ### Installation
188
+
189
+ ```bash
190
+ pip install transformers peft accelerate
191
+ ```
192
+
193
+ ### Load and Run
194
 
195
  ```python
196
  from peft import PeftModel
197
  from transformers import AutoModelForCausalLM, AutoTokenizer
198
 
199
+ # Load base model
200
+ base = AutoModelForCausalLM.from_pretrained(
201
+ "Qwen/Qwen3.5-27B",
202
+ device_map="auto",
203
+ torch_dtype="auto"
204
+ )
205
+
206
+ # Load GRPO LoRA adapter
207
  model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
208
  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
209
+
210
+ # Finance domain example (PT-BR)
211
+ prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
212
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
213
+ out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
214
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
215
+ ```
216
+
217
+ ### Merge Adapter (Optional)
218
+
219
+ ```python
220
+ # Merge LoRA weights into base model for faster inference
221
+ merged_model = model.merge_and_unload()
222
+ merged_model.save_pretrained("paganini-27b-merged")
223
+ tokenizer.save_pretrained("paganini-27b-merged")
224
+ ```
225
+
226
+ ---
227
+
228
+ ## 📊 Checkpoints
229
+
230
+ | Checkpoint | Size | Description |
231
+ |---|---|---|
232
+ | `paganini-test` | 2.7 GB | Intermediate checkpoint |
233
+ | `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |
234
+
235
+ ---
236
+
237
+ ## ⚠️ Intended Use & Limitations
238
+
239
+ ### Intended Use
240
+ - FIDC regulatory Q&A in Portuguese (Brazil)
241
+ - Software architecture guidance for AIOS agents
242
+ - Compliance-first financial analysis aligned with CVM 175
243
+ - Internal enterprise use within the Paganini AIOS platform
244
+
245
+ ### Out-of-Scope Use
246
+ - General-purpose chatbot (use base Qwen3.5-27B instead)
247
+ - Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
248
+ - Real-time trading decisions or autonomous financial transactions
249
+
250
+ ### Limitations
251
+ - Finance knowledge is bounded by CVM 175 regulatory corpus at training cutoff
252
+ - PT-BR outputs are prioritized; EN responses may be less fluent
253
+ - Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
254
+ - LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)
255
+
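
The hardware figures in the limitations follow from simple arithmetic on the parameter count. The sketch below estimates weight memory only, assuming 2 bytes per parameter for fp16 and 4 bytes for fp32 and decimal gigabytes; it ignores the KV cache, activations, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 27e9  # 27B parameters, from the Model Overview table

fp16_gb = weight_memory_gb(N_PARAMS, 2)
fp32_gb = weight_memory_gb(N_PARAMS, 4)
print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"fp32 weights: {fp32_gb:.0f} GB")

# A single 80 GB card cannot hold the fp32 weights, while two cards (160 GB) can.
A100_GB = 80
print(fp32_gb > A100_GB, fp32_gb <= 2 * A100_GB)
```

This matches the ~54 GB fp16 figure quoted above and is consistent with needing two 80 GB cards when the weights are kept in fp32.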
256
+ ---
257
+
258
+ ## 🔗 Project Links
259
+
260
+ | Resource | Link |
261
+ |---|---|
262
+ | 🐙 GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
263
+ | 📊 Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
264
+ | 🤗 SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |
265
+
266
+ ---
267
+
268
+ ## 📄 Citation
269
+
270
+ ```bibtex
271
+ @misc{paganini-grpo-lora-2026,
272
+ title = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
273
+ author = {sttjr},
274
+ year = {2026},
275
+ publisher = {HuggingFace},
276
+ howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
277
+ note = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
278
+ }
279
  ```
280
 
281
+ ---
282
+
283
+ ## 📜 License
284
 
285
+ Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.
286
 
287
+ ---
288
 
289
+ *Paganini AIOS: built for the Brazilian FIDC ecosystem.*