Paganini GRPO LoRA: Qwen3.5-27B
Paganini is a dual-domain LoRA adapter trained via GRPO (Group Relative Policy Optimization) on top of Qwen3.5-27B. It serves as the intelligence backbone for 9 specialized FIDC agents in the Paganini AIOS platform, with deep expertise in Brazilian investment fund regulation (CVM 175) and software architecture.
Model Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Adapter Type | LoRA (PEFT) |
| Training Method | GRPO (Group Relative Policy Optimization) |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA Targets | all-linear |
| Task | CAUSAL_LM |
| Adapter Size | 966 MB (safetensors) |
| Languages | Portuguese (Brazil) + English |
| Training Platform | Tinker API (Thinking Machines Lab cloud GPUs) |
| Training Duration | ~3 hours (23 runs) |
| Run ID | 7e18a5a1-8a6b-530d-b443-4f855a3aa8c4:train:0 |
Training Pipeline
Paganini follows a two-stage alignment pipeline:
Qwen3.5-27B (base)
        │
        ▼
┌─────────────────────────────────────────┐
│ Stage 1: Supervised Fine-Tuning (SFT)   │
│ Platform: RunPod A100 80GB              │
│ Accuracy: 87.75% | Loss: 0.454          │
└─────────────────────────────────────────┘
        │
        ▼  sttjr/paganini-qwen35-27b-sft-lora
        │
┌─────────────────────────────────────────┐
│ Stage 2: GRPO RL Alignment (this)       │
│ Platform: Tinker API (TML Cloud GPUs)   │
│ 23 training runs | ~3 hours             │
│ Dual-domain reward optimization         │
└─────────────────────────────────────────┘
        │
        ▼  sttjr/paganini-qwen35-27b-grpo-lora  ← you are here
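Stage 2 uses GRPO, which scores each sampled completion relative to the other completions drawn for the same prompt rather than against a learned value function. The sketch below shows that group-relative advantage, assuming the common mean/standard-deviation normalization. The exact variant used in this training run is not stated in this card, and the function name, `eps` value, and sample rewards are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, already scored by a reward function.
advantages = group_relative_advantages([0.90, 0.50, 0.50, 0.10])
for adv in advantages:
    print(f"{adv:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are pushed down.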
SFT Predecessor
The GRPO run was initialized from the SFT checkpoint:
- SFT Model: sttjr/paganini-qwen35-27b-sft-lora
- Platform: RunPod A100 80GB
- Accuracy: 87.75%
- Final Loss: 0.454
Dataset
Name: dual-dataset-v2.jsonl
| Split | Count |
|---|---|
| Total samples | 13,697 |
| Code domain | 6,848 |
| Finance domain | 6,849 |
Difficulty distribution:
| Level | Count |
|---|---|
| L1 (Basic) | 4,566 |
| L2 (Intermediate) | 4,566 |
| L3 (Advanced) | 4,565 |
Sources:
- Finance: FIDC (Fundo de Investimento em Direitos Creditórios) regulatory corpus under CVM Resolution 175, covering eligibility, concentration limits, covenants, PLD/AML procedures, compliance gates, and risk management
- Code: Software architecture patterns, pipeline compliance, TDD practices, and spec adherence for AIOS agent development
Reward Function (Dual-Domain)
The GRPO training uses a composite reward function:
R(x) = λ · R_code + (1 - λ) · R_finance + R_shared
Where λ = 1.0 for code samples and λ = 0.0 for finance samples, so each sample is scored only by its own domain's components plus the shared terms.
R_code: Code Domain Rewards
| Component | Reward |
|---|---|
| Spec adherence | +0.30 |
| Architecture patterns | +0.25 |
| Pipeline compliance | +0.15 |
| Code blocks present | +0.10 |
| TDD terms present | +0.10 |
| Maximum | +0.90 |
R_finance: Finance Domain Rewards
| Component | Reward |
|---|---|
| Guardrail compliance | +0.35 |
| Source attribution | +0.20 |
| CVM citation | +0.15 |
| Article reference | +0.15 |
| Maximum | +0.85 |
R_shared: Shared Penalty/Bonus
| Component | Reward |
|---|---|
| Hallucination penalty | -0.15 |
| Corporate speak penalty | -0.05 per occurrence |
| PT-BR language bonus | +0.05 |
| Length < 50 tokens penalty | -0.20 |
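The tables above can be combined into a single scoring function. The following is a minimal sketch, not the training code: the weights are copied from the tables, but the function name, the `checks` dictionary of boolean detector results, and the keyword arguments are hypothetical. It also assumes the hallucination penalty is applied at most once per response, which this card does not specify.

```python
# Component weights copied from the reward tables.
CODE_WEIGHTS = {
    "spec_adherence": 0.30,
    "architecture_patterns": 0.25,
    "pipeline_compliance": 0.15,
    "code_blocks_present": 0.10,
    "tdd_terms_present": 0.10,
}
FINANCE_WEIGHTS = {
    "guardrail_compliance": 0.35,
    "source_attribution": 0.20,
    "cvm_citation": 0.15,
    "article_reference": 0.15,
}


def composite_reward(domain, checks, *, num_tokens, hallucinated=False,
                     corporate_speak_hits=0, is_pt_br=False):
    """R(x) = lam * R_code + (1 - lam) * R_finance + R_shared.

    `checks` maps component names to booleans (hypothetical detector outputs).
    """
    lam = 1.0 if domain == "code" else 0.0
    r_code = sum(w for name, w in CODE_WEIGHTS.items() if checks.get(name))
    r_finance = sum(w for name, w in FINANCE_WEIGHTS.items() if checks.get(name))

    r_shared = 0.0
    if hallucinated:
        r_shared -= 0.15                     # assumed to apply once per response
    r_shared -= 0.05 * corporate_speak_hits  # per occurrence
    if is_pt_br:
        r_shared += 0.05
    if num_tokens < 50:
        r_shared -= 0.20

    return lam * r_code + (1.0 - lam) * r_finance + r_shared


# A finance answer in PT-BR that satisfies every finance component:
# roughly 0.85 (finance maximum) + 0.05 (language bonus).
best_finance = composite_reward(
    "finance",
    {name: True for name in FINANCE_WEIGHTS},
    num_tokens=200,
    is_pt_br=True,
)
print(round(best_finance, 2))
```

Because λ is either 1.0 or 0.0, a code sample earns nothing from the finance components even if it happens to satisfy them, and vice versa.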
Use Case: Paganini AIOS
This model is the intelligence backbone for 9 specialized FIDC domain agents in the Paganini AIOS platform:
| Agent | Role |
|---|---|
| Admin | Administrative governance and fund operations |
| Custodian | Asset custody, settlement, and safekeeping |
| Manager | Portfolio management and investment decisions |
| Compliance | Regulatory adherence and audit trails |
| Reporting | Investor reporting and fund disclosures |
| Due Diligence | Cedente/debtor analysis and credit assessment |
| RegWatch | Regulatory change monitoring (CVM, BACEN) |
| IR | Investor Relations communication |
| Pricing | Asset pricing and NAV calculation |
6-Gate Guardrail Pipeline
Each query passes through a sequential compliance chain:
Input → [Eligibility] → [Concentration] → [Covenant] → [PLD/AML] → [Compliance] → [Risk] → Output
All 6 gates must pass before a response is delivered to end users. This ensures CVM 175-compliant, hallucination-free outputs across all agent types.
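The sequential, all-must-pass behavior of the chain can be sketched as a short-circuiting loop. This is an illustration only: the gate names come from the diagram above, while `run_gates`, `GateResult`, the query fields (`debtor_share`, `limit`), and the placeholder gate logic are hypothetical and do not reflect the platform's actual checks or any regulatory threshold.

```python
from typing import Callable, Dict, NamedTuple, Optional


class GateResult(NamedTuple):
    passed: bool
    failed_gate: Optional[str]  # name of the first gate that rejected the query


Gate = Callable[[dict], bool]

# Order taken from the pipeline diagram.
GATE_ORDER = ["Eligibility", "Concentration", "Covenant", "PLD/AML", "Compliance", "Risk"]


def run_gates(query: dict, gates: Dict[str, Gate]) -> GateResult:
    """Run the gates in order and stop at the first one that fails."""
    for name in GATE_ORDER:
        if not gates[name](query):
            return GateResult(False, name)
    return GateResult(True, None)


# Placeholder gates: everything passes except a made-up concentration check.
gates: Dict[str, Gate] = {name: (lambda q: True) for name in GATE_ORDER}
gates["Concentration"] = lambda q: q["debtor_share"] <= q["limit"]

ok = run_gates({"debtor_share": 0.10, "limit": 0.20}, gates)
blocked = run_gates({"debtor_share": 0.35, "limit": 0.20}, gates)
print(ok)
print(blocked)
```

Stopping at the first failure keeps later gates from running on input that has already been rejected, and the returned gate name gives an audit trail for why a response was withheld.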
Usage
Installation
pip install torch transformers peft accelerate
Load and Run
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-27B",
device_map="auto",
torch_dtype="auto"
)
# Load GRPO LoRA adapter
model = PeftModel.from_pretrained(base, "sttjr/paganini-qwen35-27b-grpo-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
# Finance domain example (PT-BR)
prompt = "Explique os requisitos de PDD mínima para FIDC conforme CVM 175."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Merge Adapter (Optional)
# Merge LoRA weights into base model for faster inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("paganini-27b-merged")
tokenizer.save_pretrained("paganini-27b-merged")
Checkpoints
| Checkpoint | Size | Description |
|---|---|---|
| paganini-test | 2.7 GB | Intermediate checkpoint |
| paganini-rl-final | 2.7 GB | Final GRPO-aligned checkpoint |
Intended Use & Limitations
Intended Use
- FIDC regulatory Q&A in Portuguese (Brazil)
- Software architecture guidance for AIOS agents
- Compliance-first financial analysis aligned with CVM 175
- Internal enterprise use within the Paganini AIOS platform
Out-of-Scope Use
- General-purpose chatbot (use base Qwen3.5-27B instead)
- Non-Brazilian regulatory domains (model is specialized for CVM/BACEN frameworks)
- Real-time trading decisions or autonomous financial transactions
Limitations
- Finance knowledge is bounded by CVM 175 regulatory corpus at training cutoff
- PT-BR outputs are prioritized; EN responses may be less fluent
- Requires at least 2× A100 80GB GPUs or equivalent for full-precision inference
- LoRA adapter requires the base Qwen3.5-27B model (~54 GB in fp16)
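The ~54 GB figure follows from 27B parameters stored at 2 bytes each. The quick check below uses decimal gigabytes and the adapter size from the Model Overview table. It counts weights only, so the remaining headroom still has to cover the KV cache, activations, and framework overhead, none of which are estimated here.

```python
PARAMS = 27e9         # 27B parameters (Model Overview table)
BYTES_PER_PARAM = 2   # fp16 or bf16
ADAPTER_GB = 0.966    # 966 MB safetensors adapter
GPU_GB = 80           # one A100 80GB
NUM_GPUS = 2

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
headroom_gb = NUM_GPUS * GPU_GB - weights_gb - ADAPTER_GB

print(f"base weights: {weights_gb:.0f} GB")
print(f"headroom on {NUM_GPUS} x {GPU_GB} GB: {headroom_gb:.1f} GB")
```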
Project Links
| Resource | Link |
|---|---|
| GitHub (Paganini AIOS) | juboyy/paganini-aios |
| Dashboard | dashboard-v2-pearl-rho.vercel.app |
| SFT Predecessor | sttjr/paganini-qwen35-27b-sft-lora |
Citation
@misc{paganini-grpo-lora-2026,
title = {Paganini GRPO LoRA -- Qwen3.5-27B: Dual-Domain RL Alignment for FIDC Regulatory Intelligence},
author = {sttjr},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/sttjr/paganini-qwen35-27b-grpo-lora}},
note = {GRPO-aligned LoRA adapter for Brazilian investment fund regulation and software architecture}
}
License
Apache 2.0. See LICENSE for details.
Paganini AIOS: built for the Brazilian FIDC ecosystem.