---
library_name: peft
base_model: Qwen/Qwen3.5-27B
tags:
- lora
- grpo
- rlhf
- fidc
- portuguese
- finance
- code
- reinforcement-learning
- peft
- qwen
language:
- pt
- en
license: apache-2.0
---

# Paganini GRPO LoRA — Qwen3.5-27B

> **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.

---

## 🧠 Model Overview

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
| **Parameters** | 27B |
| **Adapter Type** | LoRA (PEFT) |
| **Training Method** | GRPO (Group Relative Policy Optimization) |
| **LoRA Rank** | 32 |
| **LoRA Alpha** | 32 |
| **LoRA Targets** | all-linear |
| **Task** | CAUSAL_LM |
| **Adapter Size** | 966 MB (safetensors) |
| **Languages** | Portuguese (Brazil) + English |
| **Training Platform** | [Tinker API](https://tinkerchat.ai) (Thinking Machines Lab cloud GPUs) |
| **Training Duration** | ~3 hours (23 runs) |
| **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |

---

## 🏗️ Training Pipeline

Paganini follows a two-stage alignment pipeline:

```
Qwen3.5-27B (base)
        │
        ▼
┌─────────────────────────────────────────┐
│  Stage 1: Supervised Fine-Tuning (SFT)  │
│  Platform: RunPod A100 80GB             │
│  Accuracy: 87.75% | Loss: 0.454         │
└─────────────────────────────────────────┘
        │
        ▼
sttjr/paganini-qwen35-27b-sft-lora
        │
┌─────────────────────────────────────────┐
│  Stage 2: GRPO RL Alignment (this)      │
│  Platform: Tinker API (TML Cloud GPUs)  │
│  23 training runs | ~3 hours            │
│  Dual-domain reward optimization        │
└─────────────────────────────────────────┘
        │
        ▼
sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
```

### SFT Predecessor

The GRPO run was initialized from the SFT checkpoint:

- **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
- **Platform**: RunPod A100 80GB
- **Accuracy**: 87.75%
- **Final Loss**: 0.454

---

## 📦 Dataset

**Name:** `dual-dataset-v2.jsonl`

| Split | Count |
|---|---|
| Total samples | 13,697 |
| Code domain | 6,848 |
| Finance domain | 6,849 |

**Difficulty distribution:**

| Level | Count |
|---|---|
| L1 (Basic) | 4,566 |
| L2 (Intermediate) | 4,566 |
| L3 (Advanced) | 4,565 |

**Sources:**

- **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios, a receivables investment fund) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
- **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development

---

## 🎯 Reward Function (Dual-Domain)

GRPO training uses a composite reward function:

```
R(x) = λ · R_code + (1 - λ) · R_finance + R_shared
```

Here `λ = 1.0` for code samples and `λ = 0.0` for finance samples, so each sample is scored by exactly one domain reward plus the shared term.

### R_code: Code Domain Rewards

| Component | Reward |
|---|---|
| Spec adherence | +0.30 |
| Architecture patterns | +0.25 |
| Pipeline compliance | +0.15 |
| Code blocks present | +0.10 |
| TDD terms present | +0.10 |
| **Maximum** | **+0.90** |

### R_finance: Finance Domain Rewards

| Component | Reward |
|---|---|
| Guardrail compliance | +0.35 |
| Source attribution | +0.20 |
| CVM citation | +0.15 |
| Article reference | +0.15 |
| **Maximum** | **+0.85** |

### R_shared: Shared Penalties and Bonus

| Component | Reward |
|---|---|
| Hallucination penalty | −0.15 |
| Corporate speak penalty | −0.05 per occurrence |
| PT-BR language bonus | +0.05 |
| Short response penalty (< 50 tokens) | −0.20 |

---

## 🤖 Use Case: Paganini AIOS

This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:

| Agent | Role |
|---|---|
| 🏛️ Admin | Administrative governance and fund operations |
| 🏦 Custodian | Asset custody, settlement, and safekeeping |
| 📊 Manager | Portfolio management and investment decisions |
| ⚖️ Compliance | Regulatory adherence and audit trails |
| 📋 Reporting | Investor reporting and fund disclosures |
| 🔍 Due Diligence | Cedente (assignor)/debtor analysis and credit assessment |
| 👁️ RegWatch | Regulatory change monitoring (CVM, BACEN) |
| 📧 IR | Investor Relations communication |
| 💹 Pricing | Asset pricing and NAV calculation |

### 6-Gate Guardrail Pipeline

Each query passes through a sequential compliance chain:

```
Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
```

All 6 gates must pass before a response is delivered to end users. The chain is designed to keep outputs CVM 175-compliant and to filter out hallucinated content across all agent types.

---

## 🚀 Usage

### Installation

```bash
pip install torch transformers peft accelerate
```

### Load and Run

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, sharded across available GPUs
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B",
    device_map="auto",
    torch_dtype="auto",
)

# Attach the GRPO LoRA adapter to the base model
model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Finance domain example (PT-BR). In English: "Explain the minimum PDD
# requirements for FIDCs under CVM 175."
prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 512 new tokens
out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### Merge Adapter (Optional)

```python
# Merge LoRA weights into the base model for faster inference.
# `model` and `tokenizer` come from the "Load and Run" snippet above.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("paganini-27b-merged")
tokenizer.save_pretrained("paganini-27b-merged")
```

---

## 📊 Checkpoints

| Checkpoint | Size | Description |
|---|---|---|
| `paganini-test` | 2.7 GB | Intermediate checkpoint |
| `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |

---

## ⚠️ Intended Use & Limitations

### Intended Use

- FIDC regulatory Q&A in Portuguese (Brazil)
- Software architecture guidance for AIOS agents
- Compliance-first financial analysis aligned with CVM 175
- Internal enterprise use within the Paganini AIOS platform

### Out-of-Scope Use

- General-purpose chatbot (use base Qwen3.5-27B instead)
- Non-Brazilian regulatory domains (the model is specialized for CVM/BACEN frameworks)
- Real-time trading decisions or autonomous financial transactions

### Limitations

- Finance knowledge is bounded by the CVM 175 regulatory corpus as of the training cutoff
- PT-BR outputs are prioritized; EN responses may be less fluent
- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
- The LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)

---

## 🔗 Project Links

| Resource | Link |
|---|---|
| 🐙 GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
| 📊 Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
| 🤗 SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |

---

## 📄 Citation

```bibtex
@misc{paganini-grpo-lora-2026,
  title        = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
  author       = {sttjr},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
  note         = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
}
```

---

## 📜 License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.

---

*Paganini AIOS: built for the Brazilian FIDC ecosystem.*
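
---

## 🧪 Appendix: Reward Function Sketch

The composite reward described under "Reward Function (Dual-Domain)" can be illustrated with a short Python sketch. This is not the training code: the weights are copied from the tables in this card, but the names `SharedSignals`, `domain_reward`, `shared_reward`, and `composite_reward` are hypothetical, and the sketch assumes each component has already been detected upstream and is passed in as a set of component names. How each component is actually detected is not specified in this card.

```python
from dataclasses import dataclass

# Weights copied from the R_code and R_finance tables in this card.
CODE_WEIGHTS = {
    "spec_adherence": 0.30,
    "architecture_patterns": 0.25,
    "pipeline_compliance": 0.15,
    "code_blocks_present": 0.10,
    "tdd_terms_present": 0.10,
}
FINANCE_WEIGHTS = {
    "guardrail_compliance": 0.35,
    "source_attribution": 0.20,
    "cvm_citation": 0.15,
    "article_reference": 0.15,
}


@dataclass
class SharedSignals:
    """Hypothetical container for the R_shared inputs (assumed precomputed)."""
    hallucination: bool = False
    corporate_speak_count: int = 0
    is_pt_br: bool = False
    num_tokens: int = 50


def domain_reward(weights: dict, hits: set) -> float:
    """Sum the weights of the components that were detected in the response."""
    return sum(w for name, w in weights.items() if name in hits)


def shared_reward(s: SharedSignals) -> float:
    """Apply the shared penalties and bonus from the R_shared table."""
    r = 0.0
    if s.hallucination:
        r -= 0.15
    r -= 0.05 * s.corporate_speak_count  # penalty per occurrence
    if s.is_pt_br:
        r += 0.05
    if s.num_tokens < 50:
        r -= 0.20
    return r


def composite_reward(domain: str, hits: set, shared: SharedSignals) -> float:
    """R(x) = lam * R_code + (1 - lam) * R_finance + R_shared."""
    if domain not in ("code", "finance"):
        raise ValueError(f"unknown domain: {domain!r}")
    lam = 1.0 if domain == "code" else 0.0
    r_code = domain_reward(CODE_WEIGHTS, hits)
    r_finance = domain_reward(FINANCE_WEIGHTS, hits)
    return lam * r_code + (1.0 - lam) * r_finance + shared_reward(shared)


if __name__ == "__main__":
    # A code sample that hits every code component and incurs no shared terms.
    print(round(composite_reward("code", set(CODE_WEIGHTS), SharedSignals()), 2))
```

Under these assumptions, a code sample that satisfies every code component scores about 0.90, the documented maximum for `R_code`, and a finance sample that satisfies every finance component and is written in PT-BR scores about 0.85 + 0.05 = 0.90.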