---
library_name: peft
base_model: Qwen/Qwen3.5-27B
tags:
- lora
- grpo
- rlhf
- fidc
- portuguese
- finance
- code
- reinforcement-learning
- peft
- qwen
language:
- pt
- en
license: apache-2.0
---
# Paganini GRPO LoRA - Qwen3.5-27B
> **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.
---
## Model Overview
| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
| **Parameters** | 27B |
| **Adapter Type** | LoRA (PEFT) |
| **Training Method** | GRPO (Group Relative Policy Optimization) |
| **LoRA Rank** | 32 |
| **LoRA Alpha** | 32 |
| **LoRA Targets** | all-linear |
| **Task** | CAUSAL_LM |
| **Adapter Size** | 966 MB (safetensors) |
| **Languages** | Portuguese (Brazil) + English |
| **Training Platform** | [Tinker API](https://tinkerchat.ai) (Thinking Machines Lab cloud GPUs) |
| **Training Duration** | ~3 hours (23 runs) |
| **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |
---
## Training Pipeline
Paganini follows a two-stage alignment pipeline:
```
Qwen3.5-27B (base)
        │
        ▼
┌─────────────────────────────────────────┐
│  Stage 1: Supervised Fine-Tuning (SFT)  │
│  Platform: RunPod A100 80GB             │
│  Accuracy: 87.75% | Loss: 0.454         │
└─────────────────────────────────────────┘
        │
        ▼  sttjr/paganini-qwen35-27b-sft-lora
        │
┌─────────────────────────────────────────┐
│  Stage 2: GRPO RL Alignment (this)      │
│  Platform: Tinker API (TML Cloud GPUs)  │
│  23 training runs | ~3 hours            │
│  Dual-domain reward optimization        │
└─────────────────────────────────────────┘
        │
        ▼  sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
```
### SFT Predecessor
The GRPO run was initialized from the SFT checkpoint:
- **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
- **Platform**: RunPod A100 80GB
- **Accuracy**: 87.75%
- **Final Loss**: 0.454
---
## Dataset
**Name:** `dual-dataset-v2.jsonl`

| Split | Count |
|---|---|
| Total samples | 13,697 |
| Code domain | 6,848 |
| Finance domain | 6,849 |

**Difficulty distribution:**

| Level | Count |
|---|---|
| L1 (Basic) | 4,566 |
| L2 (Intermediate) | 4,566 |
| L3 (Advanced) | 4,565 |

**Sources:**
- **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
- **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development
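
The domain and difficulty counts above could be re-checked with a short tally over the JSONL file. This is only a sketch: the record schema is not documented here, so the field names `domain` and `level` are assumptions and would need to be adjusted to the real keys.

```python
import json
from collections import Counter


def tally_jsonl(lines, domain_key="domain", level_key="level"):
    """Count records per domain and per difficulty level.

    `domain_key` and `level_key` are hypothetical field names; the actual
    schema of dual-dataset-v2.jsonl is not specified in this card.
    """
    domains, levels = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        domains[record.get(domain_key, "unknown")] += 1
        levels[record.get(level_key, "unknown")] += 1
    return domains, levels


# Usage (assumes the file is available locally):
# with open("dual-dataset-v2.jsonl", encoding="utf-8") as f:
#     domains, levels = tally_jsonl(f)
#     print(domains, levels)
```
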
---
## Reward Function (Dual-Domain)
The GRPO training uses a composite reward function:
```
R(x) = λ · R_code + (1 - λ) · R_finance + R_shared
```
Where `λ = 1.0` for code samples and `λ = 0.0` for finance samples.
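
A minimal sketch of how this composition could look in code, using the component weights from the tables in this section. It is not the training code: how each component is detected is not described in this card, so the sketch takes already-computed signals as input, treats every component as all-or-nothing, and uses hypothetical names (`SharedSignals`, `composite_reward`, and the dictionary keys).

```python
from dataclasses import dataclass

# Component weights copied from the reward tables in this section.
CODE_WEIGHTS = {
    "spec_adherence": 0.30,
    "architecture_patterns": 0.25,
    "pipeline_compliance": 0.15,
    "code_blocks_present": 0.10,
    "tdd_terms_present": 0.10,
}
FINANCE_WEIGHTS = {
    "guardrail_compliance": 0.35,
    "source_attribution": 0.20,
    "cvm_citation": 0.15,
    "article_reference": 0.15,
}


@dataclass
class SharedSignals:
    """Hypothetical container for the signals behind R_shared."""
    hallucinated: bool = False
    corporate_speak_hits: int = 0
    is_pt_br: bool = False
    n_tokens: int = 0


def shared_reward(s: SharedSignals) -> float:
    r = 0.0
    if s.hallucinated:
        r -= 0.15                       # hallucination penalty
    r -= 0.05 * s.corporate_speak_hits  # per-occurrence penalty
    if s.is_pt_br:
        r += 0.05                       # PT-BR language bonus
    if s.n_tokens < 50:
        r -= 0.20                       # short-response penalty
    return r


def composite_reward(domain: str, hits: set, shared: SharedSignals) -> float:
    """R = lam * R_code + (1 - lam) * R_finance + R_shared."""
    if domain not in ("code", "finance"):
        raise ValueError(f"unknown domain: {domain!r}")
    lam = 1.0 if domain == "code" else 0.0
    r_code = sum(w for name, w in CODE_WEIGHTS.items() if name in hits)
    r_fin = sum(w for name, w in FINANCE_WEIGHTS.items() if name in hits)
    return lam * r_code + (1.0 - lam) * r_fin + shared_reward(shared)
```

Because `λ` is either 0 or 1, only one domain term contributes per sample; the shared term applies to both domains.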
### R_code: Code Domain Rewards
| Component | Reward |
|---|---|
| Spec adherence | +0.30 |
| Architecture patterns | +0.25 |
| Pipeline compliance | +0.15 |
| Code blocks present | +0.10 |
| TDD terms present | +0.10 |
| **Maximum** | **+0.90** |
### R_finance: Finance Domain Rewards
| Component | Reward |
|---|---|
| Guardrail compliance | +0.35 |
| Source attribution | +0.20 |
| CVM citation | +0.15 |
| Article reference | +0.15 |
| **Maximum** | **+0.85** |
### R_shared: Shared Penalty/Bonus
| Component | Reward |
|---|---|
| Hallucination penalty | -0.15 |
| Corporate speak penalty | -0.05 per occurrence |
| PT-BR language bonus | +0.05 |
| Length < 50 tokens penalty | -0.20 |
---
## Use Case: Paganini AIOS
This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:
| Agent | Role |
|---|---|
| Admin | Administrative governance and fund operations |
| Custodian | Asset custody, settlement, and safekeeping |
| Manager | Portfolio management and investment decisions |
| Compliance | Regulatory adherence and audit trails |
| Reporting | Investor reporting and fund disclosures |
| Due Diligence | Cedente/debtor analysis and credit assessment |
| RegWatch | Regulatory change monitoring (CVM, BACEN) |
| IR | Investor Relations communication |
| Pricing | Asset pricing and NAV calculation |
### 6-Gate Guardrail Pipeline
Each query passes through a sequential compliance chain:
```
Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
```
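
The sequential chain above can be sketched as an ordered list of checks that stops at the first failure. This is an illustration under assumptions, not the platform's implementation: the function name `first_failed_gate` and the idea that each gate is a boolean predicate over some payload are hypothetical, and the actual gate logic is not described in this card.

```python
from typing import Callable, List, Optional, Tuple

# Gate order taken from the diagram above.
GATE_ORDER = ["Eligibility", "Concentration", "Covenant",
              "PLD/AML", "Compliance", "Risk"]

# A gate is a (name, predicate) pair; the predicate signature is an assumption.
Gate = Tuple[str, Callable[[str], bool]]


def first_failed_gate(payload: str, gates: List[Gate]) -> Optional[str]:
    """Run gates in order and return the name of the first one that fails.

    Returns None when every gate passes. Later gates are not evaluated
    once one gate has failed.
    """
    for name, check in gates:
        if not check(payload):
            return name
    return None


# Usage with placeholder predicates that always pass:
gates = [(name, lambda payload: True) for name in GATE_ORDER]
blocked_by = first_failed_gate("example query", gates)
print("delivered" if blocked_by is None else f"blocked by {blocked_by}")
```
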
All 6 gates must pass before a response is delivered to end users. The chain is designed to keep outputs compliant with CVM 175 and to filter out hallucinated content across all agent types.

---
## Usage
### Installation
```bash
pip install transformers peft accelerate
```
### Load and Run
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model; device_map="auto" places it across the available GPUs
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B",
    device_map="auto",
    torch_dtype="auto",
)

# Attach the GRPO LoRA adapter to the base model
model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Finance domain example (PT-BR)
prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
### Merge Adapter (Optional)
```python
# Continues from the previous snippet (`model` and `tokenizer` must be loaded).
# Merge LoRA weights into the base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("paganini-27b-merged")
tokenizer.save_pretrained("paganini-27b-merged")
```
---
## Checkpoints
| Checkpoint | Size | Description |
|---|---|---|
| `paganini-test` | 2.7 GB | Intermediate checkpoint |
| `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |
---
## Intended Use & Limitations
### Intended Use
- FIDC regulatory Q&A in Portuguese (Brazil)
- Software architecture guidance for AIOS agents
- Compliance-first financial analysis aligned with CVM 175
- Internal enterprise use within the Paganini AIOS platform
### Out-of-Scope Use
- General-purpose chatbot (use base Qwen3.5-27B instead)
- Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
- Real-time trading decisions or autonomous financial transactions
### Limitations
- Finance knowledge is bounded by the CVM 175 regulatory corpus as of the training cutoff
- PT-BR outputs are prioritized; EN responses may be less fluent
- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
- LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)
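
The ~54 GB figure follows from simple arithmetic on the parameter count. The sketch below assumes a nominal 27e9 parameters and decimal gigabytes, and counts weights only; activations, the KV cache, and the adapter itself add to the real footprint. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 27e9  # nominal parameter count of the base model

fp16_gb = weight_memory_gb(N_PARAMS, 2)  # 2 bytes per parameter
fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 4 bytes per parameter

print(f"fp16/bf16 weights: {fp16_gb:.0f} GB")
print(f"fp32 weights:      {fp32_gb:.0f} GB")
```
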
---
## Project Links
| Resource | Link |
|---|---|
| GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
| Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
| SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |
---
## Citation
```bibtex
@misc{paganini-grpo-lora-2026,
title = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
author = {sttjr},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
note = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
}
```
---
## License
Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.

---
*Paganini AIOS - Built for the Brazilian FIDC ecosystem.*