sttjr
/

paganini-qwen35-27b-grpo-lora

@@ -1,66 +1,289 @@
 ---
-base_model: Qwen/Qwen3.5-27B
 library_name: peft
-license: apache-2.0
 tags:
-- lora
-- grpo
-- rl
-- fidc
-- finance
-- compliance
-- portuguese
-- paganini-aios
 language:
-- pt
-pipeline_tag: text-generation
 ---
-# Paganini AIOS — GRPO LoRA Adapter
-**Qwen3.5-27B + LoRA Rank 32** fine-tuned with Group Relative Policy Optimization (GRPO) for dual-domain expertise: **Brazilian FIDC compliance** and **software engineering**.
-## Training Details
-- **Base Model**: Qwen/Qwen3.5-27B
-- **Method**: GRPO (Group Relative Policy Optimization) via [Tinker API](https://thinkingmachines.ai/tinker/)
-- **LoRA**: Rank 32, Alpha 32, all-linear targets
-- **Dataset**: 13,697 dual-domain Q&A pairs (code + finance + cross-domain)
-- **Reward Function**: Dual-domain with 6 guardrail gates
-## Reward Function Design
 ```
-R(x) = λ·R_code + (1-λ)·R_fin + R_shared
-Code (λ=1.0):   spec adherence, architecture, pipeline compliance, code quality
-Finance (λ=0.0): guardrail compliance, factual accuracy, source attribution, precision
-Cross (λ=0.5):   both domains integrated
 ```
-### Guardrail Gates
-1. **Eligibility** — CVM 175 compliance check
-2. **Concentration** — Portfolio concentration limits
-3. **Covenant** — Fund covenant monitoring
-4. **PLD/AML** — Anti-money laundering
-5. **Compliance** — Regulatory compliance
-6. **Risk** — Bayesian risk assessment
-## Usage
 ```python
 from peft import PeftModel
 from transformers import AutoModelForCausalLM, AutoTokenizer
-base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-27B")
 model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
 ```
-## Part of Paganini AIOS
-[Paganini AIOS](https://github.com/juboyy/paganini-aios) is an autonomous AI system for Brazilian FIDC (Fundos de Investimento em Direitos Creditórios) operations, featuring 14 specialized agents, 6 guardrail gates, and a Bayesian risk network.
-## SFT Checkpoint
-The SFT checkpoint (pre-GRPO) is available at: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)

 ---
 library_name: peft
+base_model: Qwen/Qwen3.5-27B
 tags:
+  - lora
+  - grpo
+  - rlhf
+  - fidc
+  - portuguese
+  - finance
+  - code
+  - reinforcement-learning
+  - peft
+  - qwen
 language:
+  - pt
+  - en
+license: apache-2.0
+---
+# Paganini GRPO LoRA — Qwen3.5-27B
+<p align="center">
+  <img src="https://img.shields.io/badge/Base%20Model-Qwen3.5--27B-blue" />
+  <img src="https://img.shields.io/badge/Method-GRPO%20%2B%20LoRA-purple" />
+  <img src="https://img.shields.io/badge/Language-PT--BR%20%7C%20EN-green" />
+  <img src="https://img.shields.io/badge/Domain-FIDC%20%7C%20Code-orange" />
+  <img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey" />
+</p>
+> **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.
 ---
+## 🧠 Model Overview
+| Property | Value |
+|---|---|
+| **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
+| **Parameters** | 27B |
+| **Adapter Type** | LoRA (PEFT) |
+| **Training Method** | GRPO (Group Relative Policy Optimization) |
+| **LoRA Rank** | 32 |
+| **LoRA Alpha** | 32 |
+| **LoRA Targets** | all-linear |
+| **Task** | CAUSAL_LM |
+| **Adapter Size** | 966 MB (safetensors) |
+| **Languages** | Portuguese (Brazil) + English |
+| **Training Platform** | [Tinker API](https://tinkerchat.ai) — Thinking Machines Lab cloud GPUs |
+| **Training Duration** | ~3 hours (23 runs) |
+| **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |
+---
+## 🏗️ Training Pipeline
+Paganini follows a two-stage alignment pipeline:
 ```
+Qwen3.5-27B (base)
+       │
+       ▼
+  ┌─────────────────────────────────────────┐
+  │  Stage 1: Supervised Fine-Tuning (SFT)  │
+  │  Platform: RunPod A100 80GB             │
+  │  Accuracy: 87.75% | Loss: 0.454         │
+  └─────────────────────────────────────────┘
+       │
+       ▼  sttjr/paganini-qwen35-27b-sft-lora
+       │
+  ┌─────────────────────────────────────────┐
+  │  Stage 2: GRPO RL Alignment (this)      │
+  │  Platform: Tinker API (TML Cloud GPUs)  │
+  │  23 training runs | ~3 hours            │
+  │  Dual-domain reward optimization        │
+  └─────────────────────────────────────────┘
+       │
+       ▼  sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
+```
+### SFT Predecessor
+The GRPO run was initialized from the SFT checkpoint:
+- **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
+- **Platform**: RunPod A100 80GB
+- **Accuracy**: 87.75%
+- **Final Loss**: 0.454
+---
+## 📦 Dataset
+**Name:** `dual-dataset-v2.jsonl`
+| Split | Count |
+|---|---|
+| Total samples | 13,697 |
+| Code domain | 6,848 |
+| Finance domain | 6,849 |
+**Difficulty distribution:**
+| Level | Count |
+|---|---|
+| L1 (Basic) | 4,566 |
+| L2 (Intermediate) | 4,566 |
+| L3 (Advanced) | 4,565 |
+**Sources:**
+- **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios) regulatory corpus under CVM Resolution 175 — covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
+- **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development
+---
+## 🎯 Reward Function (Dual-Domain)
+The GRPO training uses a composite reward function:
 ```
+R(x) = λ · R_code + (1 - λ) · R_fin + R_shared
+```
+Where `λ = 1.0` for code samples and `λ = 0.0` for finance samples.
+### R_code — Code Domain Rewards
+| Component | Reward |
+|---|---|
+| Spec adherence | +0.30 |
+| Architecture patterns | +0.25 |
+| Pipeline compliance | +0.15 |
+| Code blocks present | +0.10 |
+| TDD terms present | +0.10 |
+| **Maximum** | **+0.90** |
+### R_finance — Finance Domain Rewards
+| Component | Reward |
+|---|---|
+| Guardrail compliance | +0.35 |
+| Source attribution | +0.20 |
+| CVM citation | +0.15 |
+| Article reference | +0.15 |
+| **Maximum** | **+0.85** |
+### R_shared — Shared Penalty/Bonus
+| Component | Reward |
+|---|---|
+| Hallucination penalty | −0.15 |
+| Corporate speak penalty | −0.05 per occurrence |
+| PT-BR language bonus | +0.05 |
+| Length < 50 tokens penalty | −0.20 |
+---
+## 🤖 Use Case: Paganini AIOS
+This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:
+| Agent | Role |
+|---|---|
+| 🏛️ Admin | Administrative governance and fund operations |
+| 🏦 Custodian | Asset custody, settlement, and safekeeping |
+| 📊 Manager | Portfolio management and investment decisions |
+| ⚖️ Compliance | Regulatory adherence and audit trails |
+| 📋 Reporting | Investor reporting and fund disclosures |
+| 🔍 Due Diligence | Cedente/debtor analysis and credit assessment |
+| 👁️ RegWatch | Regulatory change monitoring (CVM, BACEN) |
+| 📧 IR | Investor Relations communication |
+| 💹 Pricing | Asset pricing and NAV calculation |
+### 6-Gate Guardrail Pipeline
+Each query passes through a sequential compliance chain:
+```
+Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
+```
+All 6 gates must pass before a response is delivered to end users. This ensures CVM 175-compliant, hallucination-free outputs across all agent types.
+---
+## 🚀 Usage
+### Installation
+```bash
+pip install transformers peft accelerate
+```
+### Load and Run
 ```python
 from peft import PeftModel
 from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load base model
+base = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3.5-27B",
+    device_map="auto",
+    torch_dtype="auto"
+)
+# Load GRPO LoRA adapter
 model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
+# Finance domain example (PT-BR)
+prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
+print(tokenizer.decode(out[0], skip_special_tokens=True))
+```
+### Merge Adapter (Optional)
+```python
+# Merge LoRA weights into base model for faster inference
+merged_model = model.merge_and_unload()
+merged_model.save_pretrained("paganini-27b-merged")
+tokenizer.save_pretrained("paganini-27b-merged")
+```
+---
+## 📊 Checkpoints
+| Checkpoint | Size | Description |
+|---|---|---|
+| `paganini-test` | 2.7 GB | Intermediate checkpoint |
+| `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |
+---
+## ⚠️ Intended Use & Limitations
+### Intended Use
+- FIDC regulatory Q&A in Portuguese (Brazil)
+- Software architecture guidance for AIOS agents
+- Compliance-first financial analysis aligned with CVM 175
+- Internal enterprise use within the Paganini AIOS platform
+### Out-of-Scope Use
+- General-purpose chatbot (use base Qwen3.5-27B instead)
+- Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
+- Real-time trading decisions or autonomous financial transactions
+### Limitations
+- Finance knowledge is bounded by CVM 175 regulatory corpus at training cutoff
+- PT-BR outputs are prioritized; EN responses may be less fluent
+- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
+- LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)
+---
+## 🔗 Project Links
+| Resource | Link |
+|---|---|
+| 🐙 GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
+| 📊 Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
+| 🤗 SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |
+---
+## 📄 Citation
+```bibtex
+@misc{paganini-grpo-lora-2026,
+  title        = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
+  author       = {sttjr},
+  year         = {2026},
+  publisher    = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
+  note         = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
+}
 ```
+---
+## 📜 License
+Apache 2.0 — See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.
+---
+*Paganini AIOS — Built for the Brazilian FIDC ecosystem.*