atlas-mistral-7b-legal-r2
ATLAS Forensic Audit System — Mistral-7B Extended Corpus (Round 2)
Trained on AMD MI300X · 6,437 records · Production Deployment
What is this?
This is the production-scale fine-tune of mistralai/Mistral-7B-Instruct-v0.2 on the full ATLAS audit corpus.
While atlas-mistral-7b-legal validated the architecture on 3,502 curated records (eval loss 0.018), this model was trained on the complete 6,437-record dataset — 83% more examples, broader normativa coverage, and higher scenario diversity.
This is the version deployed in production for ATLAS v2.0.
Training Configuration
| Parameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.2 |
| Dataset | atlas_training_dataset_final.jsonl |
| Training records | 6,437 |
| Epochs | 3 |
| Learning rate | 2e-5 |
| Batch size | 4 (grad_accum=4, effective=16) |
| Precision | bfloat16 |
| Hardware | AMD Instinct MI300X (205.8 GB VRAM) |
| Framework | PyTorch 2.5.1 + ROCm 6.2 |
| Optimizer | adamw_torch |
| attn_implementation | eager (SDPA disabled for ROCm stability) |
| Estimated runtime | ~50 min |
Dataset: What changed from Round 1
The expanded corpus (atlas_audit_master_unified.jsonl, 6,437 records) includes:
- All 3,502 records from Round 1 (verified, high-confidence)
- +2,935 records covering edge cases, multi-jurisdiction scenarios, and complex RFC validation chains
- Broader distribution across:
factura_electronica,comprobante_fiscal,contrato_servicios,estado_cuenta,declaracion_anual - More examples of compound anomalies (e.g., RFC inválido + IVA incorrecto + fecha inconsistente simultaneously)
Round 1 optimized for precision. Round 2 optimized for production recall.
Normativa Coverage
| Domain | Key Articles |
|---|---|
| MX — SAT/CFF | Art. 17-H Bis, Art. 69-B (EFOS/EDOS), Art. 29/29-A CFF |
| MX — IVA | 16% (general), Art. 18-J (plataformas digitales), exenciones |
| MX — ISR | Personas morales, retenciones, deducciones autorizadas |
| MX — CFDI | v4.0, complementos, PAC validation, UUID trazabilidad |
| USA — IRS | Form 1099, W-8BEN, FATCA reportable accounts |
| USA — SEC | AI washing enforcement, disclosure requirements |
| CROSS | OECD Pillar Two GloBE, CRS reporting, FATCA cross-validation |
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "Rafaelcedav/atlas-mistral-7b-legal-r2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
attn_implementation="eager",
device_map="auto"
)
prompt = """AUDITORÍA FORENSE REQUERIDA.
DOCUMENTO: factura_electronica
--- CAMPOS EXTRAÍDOS ---
{
"rfc_emisor": {"value": "XAXX010101000", "confidence": 0.99},
"total": {"value": 11600.00, "confidence": 0.98},
"iva": {"value": 1600.00, "confidence": 0.97},
"subtotal": {"value": 10000.00, "confidence": 0.99}
}
INSTRUCCIÓN: Analiza buscando errores matemáticos, RFCs inválidos o términos inusuales. Responde en JSON."""
messages = [
{"role": "system", "content": "Eres un Auditor Forense Senior especializado en normativa fiscal MX/USA."},
{"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=1024, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
Expected output:
{
"trap_detected": "RFC genérico XAXX010101000 — válido para CFDI pero indica operación con público general, no con persona específica. Verificar si aplica complemento carta porte.",
"trap_severity": "LOW",
"reasoning_chain": [
{"step": 1, "thought": "IVA matemáticamente correcto: 10000 × 0.16 = 1600 ✓"},
{"step": 2, "thought": "Total correcto: 10000 + 1600 = 11600 ✓"},
{"step": 3, "thought": "RFC XAXX010101000 es RFC genérico — no representa una persona física/moral identificada"}
],
"confidence": 0.91,
"reasoning_valid": true
}
ATLAS Pipeline Position
PDF/Image
│
▼
[Agent 1: Vision] ← InternVL2-40B (OCR + field extraction)
│
▼
[Agent 2: Reasoning] ← atlas-mistral-7b-legal-r2 ← YOU ARE HERE
│ (anomaly detection, math validation)
▼
[Agent 3: Validator] ← Rule engine (RFC regex, SAT blacklists)
│
▼
[Agent 4: Explainer] ← Qwen3-14B (executive-grade report)
│
▼
Forensic Report (PDF) + SSE Real-time X-Ray
Round 1 vs Round 2 — Comparison
| Metric | Round 1 (3,502 records) | Round 2 (6,437 records) |
|---|---|---|
| Training records | 3,502 | 6,437 (+83%) |
| Train loss | 0.0584 | Lower bound established by Round 1 |
| Eval loss | 0.0184 | Broader generalization target |
| Training time | 27 min | ~50 min |
| Use case | Validation + research | Production deployment |
| Scenario diversity | Curated core | Full production corpus |
Hardware Note
Trained entirely on AMD Instinct MI300X (205.8 GB HBM3 VRAM) using ROCm 6.2. Full-parameter fine-tuning (no LoRA/QLoRA) — maximum weight absorption from the regulatory corpus.
Related Models in the ATLAS Ecosystem
| Model | Role | Records | Notes |
|---|---|---|---|
| atlas-mistral-7b-legal | Reasoning v1 | 3,502 | Research baseline |
| atlas-mistral-7b-legal-r2 | Reasoning v2 | 6,437 | ← Production |
| atlas-r2-qwen3-14b | Explainer + Sandbox | 3,502 | 14B, thinking mode |
| atlas-finanzas-deepseek-r1-8b | Chain-of-thought | 6,437 | Distilled R1 |
License
MIT — Free to use, fine-tune, and deploy.
Part of the ATLAS Forensic Audit System — AMD Hackathon 2026
Trained on AMD MI300X. Zero cloud API calls. 100% open-source.
- Downloads last month
- 8
Model tree for Rafaelcedav/atlas-mistral-7b-legal-r2
Base model
mistralai/Mistral-7B-Instruct-v0.2