---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
library_name: transformers
base_model:
- zai-org/GLM-5.1
tags:
- macaron
- personal-agent
- tool-use
- mixture-of-lora
- generative-ui
- a2ui
- glm
---

# Macaron-V1-Preview-749B

Macaron-V1-Preview-749B is a 749B-class Mixture-of-LoRA personal-agent model from MindLab Research, post-trained from GLM-5.1 with MinT. It combines a 744B base model with five specialist LoRA adapters and a router-driven serving design for multi-turn personal-life assistance, tool-grounded planning, coding and terminal workflows, and protocol-grounded Generative UI.

Release blog: https://macaron.im/mindlab/research/macaron-v1-preview

## Highlights

- 749B-class Mixture-of-LoRA preview model: 744B base + 5 specialist LoRAs.
- Built for personal-agent tasks where user intent, private state, tools, and world state change across turns.
- Uses an explicit router-tool design: the default adapter can route to specialist LoRAs through `change_model`.
- Covers personal planning, search/calendar/tool workflows, coding and terminal tasks, computer-agent workflows, and A2UI Generative UI.
- Ships as a single Hugging Face repository: base model files at root, LoRA adapters in `l0/` through `l4/`.

## Model Overview

| Field | Value |
|---|---|
| Model name | Macaron-V1-Preview-749B |
| Organization | MindLab Research |
| Base model | GLM-5.1 |
| Architecture | Mixture-of-LoRA |
| Parameter footprint | 749B-class: 744B base + 5 x ~1B LoRA |
| Post-training system | MinT |
| Primary domain | Personal agents, tool-use agents, Generative UI |
| Release type | Preview |
| Checkpoint format | Single HF repo: base checkpoint at root; LoRAs under `l0/`-`l4/` |
| Context length | 202,752 tokens, from `config.json` / `tokenizer_config.json` |
| Precision | bfloat16, from `config.json` |
| License | MIT; see [License](#license) |

## Repository Layout

The release is intentionally kept in one Hugging Face model repository:

```text
.
|-- config.json
|-- generation_config.json
|-- model.safetensors.index.json
|-- model-00001-of-00282.safetensors
|-- ...
|-- model-00282-of-00282.safetensors
|-- tokenizer.json
|-- tokenizer_config.json
|-- l0/
|   |-- adapter_config.json
|   `-- adapter_model.safetensors
|-- l1/
|-- l2/
|-- l3/
`-- l4/
```

Adapter roles:

| Adapter | Role |
|---|---|
| `l0` | Default chat, general-purpose behavior, and routing entry point |
| `l1` | Personal-agent tasks such as calendar, planning, search, and life automation |
| `l2` | Coding, terminal, repository, and shell tasks |
| `l3` | A2UI and Generative UI |
| `l4` | Computer-agent / OpenClaw-style workflows |

## What Macaron Is For

A useful personal agent has to work where the user actually lives. Daily life is full of small contingent decisions: what to eat tonight, where to find a quiet table, how to reroute when traffic changes, how to schedule an errand around family obligations, or how to choose the right UI surface for a task. These tasks become hard because the user, tools, and environment all change while the agent is working.

Macaron-V1-Preview-749B targets three linked abilities:

- **Capability**: using real tools such as search, maps, restaurants, calendars, coding environments, and task APIs.
- **Coherence**: tracking a real human across turns, preferences, constraints, and changing intent.
- **Expression**: choosing the right surface, such as text, card, form, table, slider, or dashboard, and rendering it quickly enough to remain useful.

## Architecture

### Mixture-of-LoRA

Macaron-V1-Preview-749B keeps divergent skill families in separate LoRAs over a shared base model. This is intended to reduce interference between chat, personal-agent tool use, coding, computer-agent behavior, and Generative UI, while still allowing the system to add new specialist domains without modifying the base model or existing specialists.

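
To make the "separate LoRAs over a shared base" idea concrete, here is a minimal pure-Python sketch. It assumes the standard LoRA formulation, `y = W x + (alpha / r) * B (A x)`; this card does not state the rank, scaling, or target modules used in Macaron, so the shapes, values, and the `lora_forward` helper are illustrative only.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, adapters, name, x, alpha=1.0):
    """Shared base projection plus the low-rank delta of one selected adapter."""
    A, B = adapters[name]          # A: r x d_in, B: d_out x r
    r = len(A)                     # LoRA rank
    base = matvec(W, x)            # frozen base weights, shared by all adapters
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Toy shared base weight (2 x 2 identity); never modified below.
W = [[1, 0], [0, 1]]

# Toy adapter registry keyed by the adapter ids used in this repository.
adapters = {
    "l0": ([[0, 0]], [[0], [0]]),  # zero delta: behaves like the base
    "l1": ([[1, 0]], [[0], [1]]),  # adds x[0] to the second output
}

print(lora_forward(W, adapters, "l0", [2, 3]))  # [2.0, 3.0]
print(lora_forward(W, adapters, "l1", [2, 3]))  # [2.0, 5.0]
```

In this toy setup, adding a specialist means adding one entry to `adapters`; `W` and the existing entries are untouched, which mirrors the extensibility claim above.
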
### Router Tool

Macaron exposes model selection as a tool call rather than as an opaque separate router model. The default adapter is `l0`. When a specialist is needed, the serving harness can route through an OpenAI-compatible tool call such as:

```json
{
  "name": "change_model",
  "arguments": {
    "target_model": "l1"
  }
}
```

The route is visible in traces and compatible with a standard tool-calling serving loop. A complete deployment should define the adapter registry, routing policy, confirmation policy, and how the system returns to the default adapter after a specialist turn.

### Harness Co-Design

Macaron-V1-Preview-749B is a model-and-harness release. The model was trained and evaluated with a production-style agent harness that manages LoRA routing, tool calls, memory/state exposure, system prompts, and task metadata. Deployments that remove or replace that harness should expect behavior and benchmark results to change.

## Generative UI and A2UI

Generative UI is a core Macaron capability. For many personal-agent tasks, the best answer is not only text: it may be a comparison card, editable task summary, booking form, route choice, slider, or dashboard.

Macaron-V1-Preview-749B is trained and evaluated with A2UI-style protocol actions. A2UI-Bench scores Generative UI along three layers:

- **Protocol correctness**: emitted actions are well formed and faithful to protocol semantics.
- **Task construction correctness**: the generated UI answers the user's request.
- **User-experience lift**: the UI makes the task easier than a text-only answer.

The evaluation also includes rendered visual checks for failures that text-only judges can miss, such as overflow, broken layouts, hidden controls, and spacing issues.

## Evaluation

The headline benchmark suite focuses on personal-agent behavior, daily-life task surfaces, Generative UI, and OpenClaw-style workflows.

![Macaron-V1-Preview-749B benchmark bar chart](assets/macaron_benchmark_bar_chart.png)

![Macaron-V1-Preview-749B benchmark radar chart](assets/macaron_benchmark_radar_chart.png)

![Macaron-V1-Preview-749B benchmark table](assets/macaron_benchmark_table.png)

Higher is better for all scores shown in the figures.

### Evaluation Protocols

**Macaron LivingBench.** Models are evaluated on 30 multi-turn personal-agent cases with a 10-turn budget. The tested agent may make up to three tool-use decisions per user turn. API calls use a 240-second timeout and up to three request-level retries. The reported mean case score is `0.7 x need score + 0.3 x process score`.

**A2UI-Bench.** Macaron-V1-Preview-749B is evaluated without explicit schema hints. Scores include protocol correctness, task construction correctness, and rendered UI quality.

**VitaBench.** VitaBench is used to stress realistic daily-life workflows. Since the original official judge model is no longer available, GLM-5.1 is used as both the judge and user model. Each query is run three times and the reported value is the average score.

**PinchBench.** PinchBench is used for search-grounded, high-precision personal-agent tasks. The reported setup uses Claude Haiku 4.5 as the judge model and Perplexity as the search API, and reports the best observed score.

**Tau3 Bench.** The reported setup uses GPT-5.2 with `reasoning_effort=low` as the user simulator and reports pass@1.

**SWE-Bench Verified.** The reported setup allows up to three retries only when an evaluation error occurs and reports the best successful attempt. The overall evaluation-error rate is approximately 0.8%.

**Terminal-Bench 2.0.** The reported setup uses the Harbor framework to run Macaron with the Pi Coding Agent Harness in sandboxed environments, with a maximum timeout of 4 hours, and reports pass@1.

**AIME 2026.** The reported score is included as a general-capability reference; the preview release is optimized primarily for personal-agent behavior and Generative UI rather than for maximizing this benchmark.

## Intended Use

Macaron-V1-Preview-749B is intended for:

- personal assistant research
- multi-turn tool-use agents
- daily-life planning and automation
- coding and terminal-agent research
- Generative UI / A2UI research
- agent benchmark evaluation
- research on modular post-training and LoRA specialization

## Out-of-Scope Use

Macaron-V1-Preview-749B is not intended for:

- autonomous high-stakes decisions without human confirmation
- medical, legal, financial, or safety-critical advice as a sole authority
- covert surveillance or privacy-invasive automation
- fully unsupervised payments, bookings, messages, calendar changes, or other external write actions
- production deployment without task-specific safety testing, audit logs, and confirmation flows

## Installation and Loading

The repository contains both the base checkpoint and LoRA adapters, but full Macaron behavior depends on the router-aware serving harness. Loading a single LoRA is useful for inspection and specialist experiments; it is not equivalent to the full routed personal-agent system.

Install dependencies:

```bash
pip install -U transformers accelerate peft safetensors
```

Example: load the base checkpoint and attach one specialist LoRA:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

repo_id = "mindlab-research/Macaron-V1-Preview-749B"
adapter = "l1"  # adapter subfolder in the repo: one of l0..l4

# Tokenizer and base checkpoint both live at the repository root.
tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

# Load the base model in bfloat16 and shard it across available devices.
base_model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Attach a single specialist LoRA from its subfolder.
model = PeftModel.from_pretrained(
    base_model,
    repo_id,
    subfolder=adapter,
)
model.eval()
```

For full routed serving, use a harness that:

- registers all five LoRA specialists
- starts each conversation from `l0`
- exposes `change_model` as a tool call
- routes to specialists according to the adapter registry
- returns control to `l0` after specialist turns
- enforces confirmation for external write actions

## Tool Use

Macaron-V1-Preview-749B is designed to operate with external tools. Personal-agent deployments may include:

- search
- calendar
- route planning
- restaurant/place lookup
- booking
- messaging
- task-specific APIs
- A2UI rendering actions
- coding, shell, and repository tools

The model should request explicit user confirmation before external write actions such as booking, sending messages, changing calendars, or making purchases.

## Safety, Privacy, and Limitations

Macaron-V1-Preview-749B is designed for personal-agent settings where user state, calendar details, preferences, and inferred motivations may be sensitive. The model should avoid revealing private state unless the user explicitly authorizes disclosure.

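
The routed-serving checklist under Installation and Loading and the confirmation requirement under Tool Use can be sketched as a small session object. This is a minimal sketch, not the MindLab harness: only `change_model`, `target_model`, and the adapter ids `l0`-`l4` come from this card. `RoutedSession`, `WRITE_TOOLS`, the tool names inside it, and the `confirm` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

ADAPTERS = ("l0", "l1", "l2", "l3", "l4")  # adapter ids from the repository layout
DEFAULT_ADAPTER = "l0"

# Hypothetical names for external write tools; real schemas are deployment-specific.
WRITE_TOOLS = frozenset({"book", "send_message", "update_calendar", "purchase"})


@dataclass
class RoutedSession:
    """Tracks the active adapter and gates write actions for one conversation."""

    active: str = DEFAULT_ADAPTER
    audit_log: list = field(default_factory=list)

    def handle_tool_call(self, name, arguments, confirm):
        """Route, gate, or dispatch one tool call; returns a status string."""
        # Record every tool call together with the adapter that issued it.
        self.audit_log.append((self.active, name, dict(arguments)))

        if name == "change_model":
            target = arguments.get("target_model")
            if target not in ADAPTERS:
                return f"error: unknown adapter {target!r}"
            self.active = target
            return f"routed to {target}"

        # External write actions require explicit user confirmation.
        if name in WRITE_TOOLS and not confirm(name, arguments):
            return "cancelled: user did not confirm"

        # A real harness would execute the tool here.
        return f"dispatched {name}"

    def end_turn(self):
        """Return control to the default adapter after a specialist turn."""
        self.active = DEFAULT_ADAPTER


session = RoutedSession()
always_decline = lambda name, arguments: False  # stand-in for a user prompt

print(session.handle_tool_call("change_model", {"target_model": "l1"}, always_decline))
print(session.handle_tool_call("book", {"place": "example"}, always_decline))
session.end_turn()
print(session.active)
```

Keeping routing and confirmation in the harness rather than in the prompt means both decisions show up in `audit_log`, which is one way to satisfy the audit-log and confirmation recommendations in this section.
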
Deployment recommendations:

- keep audit logs for tool calls
- require confirmation for external write actions
- separate private user state from visible conversation
- evaluate privacy leakage in the target harness
- test tool schemas before production use

Limitations:

- Preview release; behavior may change across versions.
- Full behavior depends on a correct harness, router, and tool schema.
- Agent performance can degrade if tools return stale, partial, or contradictory data.
- Long-horizon personal-agent tasks still require human confirmation for external actions.
- A2UI quality depends on renderer and protocol compatibility.
- Benchmark scores may not transfer to deployments with different tools, user simulators, routing policies, or safety constraints.

## License

Macaron-V1-Preview-749B is released under the MIT License. Users should also respect any requirements inherited from the GLM-5.1 base model and from dependencies used by the serving harness.

## Citation

```bibtex
@misc{macaron2026preview749b,
  title = {Macaron-V1-Preview-749B: Mixture-of-LoRA Personal Agent Model},
  author = {MindLab Research},
  year = {2026},
  howpublished = {Hugging Face}
}
```

## Contact

- Organization: MindLab Research
- Project: Macaron
- Release blog: https://macaron.im/mindlab/research/macaron-v1-preview