---
library_name: peft
base_model: Qwen/Qwen3.5-27B
tags:
- lora
- grpo
- rlhf
- fidc
- portuguese
- finance
- code
- reinforcement-learning
- peft
- qwen
language:
- pt
- en
license: apache-2.0
---

# Paganini GRPO LoRA - Qwen3.5-27B

> **Paganini** is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.

---

## 🧠 Model Overview

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) |
| **Parameters** | 27B |
| **Adapter Type** | LoRA (PEFT) |
| **Training Method** | GRPO (Group Relative Policy Optimization) |
| **LoRA Rank** | 32 |
| **LoRA Alpha** | 32 |
| **LoRA Targets** | all-linear |
| **Task** | CAUSAL_LM |
| **Adapter Size** | 966 MB (safetensors) |
| **Languages** | Portuguese (Brazil) + English |
| **Training Platform** | [Tinker API](https://tinkerchat.ai) (Thinking Machines Lab cloud GPUs) |
| **Training Duration** | ~3 hours (23 runs) |
| **Run ID** | `7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0` |

---

## 🏗️ Training Pipeline

Paganini follows a two-stage alignment pipeline:

```
Qwen3.5-27B (base)
        │
        ▼
┌─────────────────────────────────────────┐
│  Stage 1: Supervised Fine-Tuning (SFT)  │
│  Platform: RunPod A100 80GB             │
│  Accuracy: 87.75% | Loss: 0.454         │
└─────────────────────────────────────────┘
        │
        ▼
sttjr/paganini-qwen35-27b-sft-lora
        │
┌─────────────────────────────────────────┐
│  Stage 2: GRPO RL Alignment (this)      │
│  Platform: Tinker API (TML Cloud GPUs)  │
│  23 training runs | ~3 hours            │
│  Dual-domain reward optimization        │
└─────────────────────────────────────────┘
        │
        ▼
sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
```

### SFT Predecessor

The GRPO run was initialized from the SFT checkpoint:

- **SFT Model**: [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora)
- **Platform**: RunPod A100 80GB
- **Accuracy**: 87.75%
- **Final Loss**: 0.454

---

## 📦 Dataset

**Name:** `dual-dataset-v2.jsonl`

| Split | Count |
|---|---|
| Total samples | 13,697 |
| Code domain | 6,848 |
| Finance domain | 6,849 |

**Difficulty distribution:**

| Level | Count |
|---|---|
| L1 (Basic) | 4,566 |
| L2 (Intermediate) | 4,566 |
| L3 (Advanced) | 4,565 |

**Sources:**

- **Finance**: FIDC (Fundo de Investimento em Direitos Creditórios) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
- **Code**: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development

---

## 🎯 Reward Function (Dual-Domain)

The GRPO training uses a composite reward function:

```
R(x) = λ · R_code + (1 - λ) · R_finance + R_shared
```

Where `λ = 1.0` for code samples and `λ = 0.0` for finance samples, so each sample is scored by its own domain reward plus the shared term.
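
The combination rule above can be sketched in a few lines of Python. This is a minimal illustration, not the training code: `RewardParts`, `composite_reward`, and `group_relative_advantages` are hypothetical names, the component scores are assumed to be computed elsewhere, and the group standardization (population standard deviation plus a small `eps`) follows the usual description of GRPO rather than any confirmed setting of this run.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class RewardParts:
    """Per-sample component scores (hypothetical container, not the training code)."""
    r_code: float = 0.0
    r_finance: float = 0.0
    r_shared: float = 0.0


def composite_reward(parts: RewardParts, domain: str) -> float:
    """Combine scores as lam * R_code + (1 - lam) * R_finance + R_shared."""
    if domain not in ("code", "finance"):
        raise ValueError(f"unknown domain: {domain!r}")
    lam = 1.0 if domain == "code" else 0.0  # domain switch from the formula
    return lam * parts.r_code + (1.0 - lam) * parts.r_finance + parts.r_shared


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: advantage = (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two sampled completions for the same code prompt (illustrative scores).
    group = [
        composite_reward(RewardParts(r_code=0.90, r_shared=0.05), "code"),
        composite_reward(RewardParts(r_code=0.40, r_shared=-0.15), "code"),
    ]
    print(group_relative_advantages(group))
```

Because `λ` is either 0 or 1, the domain reward that does not apply to a sample is ignored entirely, while the shared term always contributes.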
### R_code: Code Domain Rewards

| Component | Reward |
|---|---|
| Spec adherence | +0.30 |
| Architecture patterns | +0.25 |
| Pipeline compliance | +0.15 |
| Code blocks present | +0.10 |
| TDD terms present | +0.10 |
| **Maximum** | **+0.90** |

### R_finance: Finance Domain Rewards

| Component | Reward |
|---|---|
| Guardrail compliance | +0.35 |
| Source attribution | +0.20 |
| CVM citation | +0.15 |
| Article reference | +0.15 |
| **Maximum** | **+0.85** |

### R_shared: Shared Penalty/Bonus

| Component | Reward |
|---|---|
| Hallucination penalty | −0.15 |
| Corporate speak penalty | −0.05 per occurrence |
| PT-BR language bonus | +0.05 |
| Length < 50 tokens penalty | −0.20 |

---

## 🤖 Use Case: Paganini AIOS

This model is the intelligence backbone for **9 specialized FIDC domain agents** in the Paganini AIOS platform:

| Agent | Role |
|---|---|
| 🏛️ Admin | Administrative governance and fund operations |
| 🏦 Custodian | Asset custody, settlement, and safekeeping |
| 📊 Manager | Portfolio management and investment decisions |
| ⚖️ Compliance | Regulatory adherence and audit trails |
| 📋 Reporting | Investor reporting and fund disclosures |
| 🔍 Due Diligence | Cedente/debtor analysis and credit assessment |
| 👁️ RegWatch | Regulatory change monitoring (CVM, BACEN) |
| 📧 IR | Investor Relations communication |
| 💹 Pricing | Asset pricing and NAV calculation |

### 6-Gate Guardrail Pipeline

Each query passes through a sequential compliance chain:

```
Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
```

All 6 gates must pass before a response is delivered to end users. This ensures CVM 175-compliant, hallucination-free outputs across all agent types.
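
A sequential chain like the one above can be sketched as an ordered list of checks that short-circuits on the first failure. The gate names and their order come from the diagram; everything else here is an assumption for illustration: the `GateCheck` signature, the `first_failed_gate` and `deliver` helpers, the blocked-response format, and the toy PLD/AML rule are hypothetical and do not describe the actual Paganini AIOS implementation.

```python
from typing import Callable, List, Optional, Tuple

# Gate names in the order given by the pipeline diagram.
GATE_ORDER = ["Eligibility", "Concentration", "Covenant", "PLD/AML", "Compliance", "Risk"]

# Hypothetical signature: a gate takes the draft response and returns pass/fail.
GateCheck = Callable[[str], bool]


def first_failed_gate(draft: str, gates: List[Tuple[str, GateCheck]]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None if all pass."""
    for name, check in gates:
        if not check(draft):
            return name  # short-circuit: later gates are not evaluated
    return None


def deliver(draft: str, gates: List[Tuple[str, GateCheck]]) -> str:
    """Release the draft only when every gate passes; otherwise report the blocker."""
    failed = first_failed_gate(draft, gates)
    if failed is None:
        return draft
    return f"[blocked by {failed} gate]"


if __name__ == "__main__":
    # Placeholder checks: every gate passes except a toy PLD/AML rule.
    gates = [(name, lambda text: True) for name in GATE_ORDER]
    gates[3] = ("PLD/AML", lambda text: "anonymous" not in text.lower())
    print(deliver("Cedente passes all checks.", gates))
    print(deliver("Accept funds from an anonymous source.", gates))
```

Returning the name of the failing gate, rather than a bare boolean, keeps a record of which check stopped the response, which is useful for audit trails.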
---

## 🚀 Usage

### Installation

```bash
pip install transformers peft accelerate
```

### Load and Run

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B",
    device_map="auto",
    torch_dtype="auto",
)

# Load GRPO LoRA adapter
model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Finance domain example (PT-BR).
# English: "Explain the minimum PDD requirements for FIDCs under CVM 175."
prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### Merge Adapter (Optional)

```python
# Merge LoRA weights into base model for faster inference.
# Reuses `model` and `tokenizer` from the "Load and Run" snippet.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("paganini-27b-merged")
tokenizer.save_pretrained("paganini-27b-merged")
```

---

## 📊 Checkpoints

| Checkpoint | Size | Description |
|---|---|---|
| `paganini-test` | 2.7 GB | Intermediate checkpoint |
| `paganini-rl-final` | 2.7 GB | Final GRPO-aligned checkpoint |

---

## ⚠️ Intended Use & Limitations

### Intended Use

- FIDC regulatory Q&A in Portuguese (Brazil)
- Software architecture guidance for AIOS agents
- Compliance-first financial analysis aligned with CVM 175
- Internal enterprise use within the Paganini AIOS platform

### Out-of-Scope Use

- General-purpose chatbot (use base Qwen3.5-27B instead)
- Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
- Real-time trading decisions or autonomous financial transactions

### Limitations

- Finance knowledge is bounded by the CVM 175 regulatory corpus at training cutoff
- PT-BR outputs are prioritized; EN responses may be less fluent
- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
- LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)

---

## 🔗 Project Links

| Resource | Link |
|---|---|
| 🐙 GitHub (Paganini AIOS) | [juboyy/paganini-aios](https://github.com/juboyy/paganini-aios) |
| 📊 Dashboard | [dashboard-v2-pearl-rho.vercel.app](https://dashboard-v2-pearl-rho.vercel.app) |
| 🤗 SFT Predecessor | [sttjr/paganini-qwen35-27b-sft-lora](https://huggingface.co/sttjr/paganini-qwen35-27b-sft-lora) |

---

## 📄 Citation

```bibtex
@misc{paganini-grpo-lora-2026,
  title        = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
  author       = {sttjr},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
  note         = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
}
```

---

## 📜 License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.

---

*Paganini AIOS: built for the Brazilian FIDC ecosystem.*