---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
library_name: transformers
base_model:
- zai-org/GLM-5.1
tags:
- macaron
- personal-agent
- tool-use
- mixture-of-lora
- generative-ui
- a2ui
- glm
---

# Macaron-V1-Preview-749B

Macaron-V1-Preview-749B is a 749B-class Mixture-of-LoRA personal-agent model from MindLab Research, post-trained from GLM-5.1 with MinT. It combines a 744B base model with five specialist LoRA adapters and a router-driven serving design for multi-turn personal-life assistance, tool-grounded planning, coding and terminal workflows, and protocol-grounded Generative UI.

Release blog: https://macaron.im/mindlab/research/macaron-v1-preview

## Highlights

- 749B-class Mixture-of-LoRA preview model: 744B base + 5 specialist LoRAs.
- Built for personal-agent tasks where user intent, private state, tools, and world state change across turns.
- Uses an explicit router-tool design: the default adapter can route to specialist LoRAs through `change_model`.
- Covers personal planning, search/calendar/tool workflows, coding and terminal tasks, computer-agent workflows, and A2UI Generative UI.
- Ships as a single Hugging Face repository: base model files at root, LoRA adapters in `l0/` through `l4/`.

## Model Overview

| Field | Value |
|---|---|
| Model name | Macaron-V1-Preview-749B |
| Organization | MindLab Research |
| Base model | GLM-5.1 |
| Architecture | Mixture-of-LoRA |
| Parameter footprint | 749B-class: 744B base + 5 × ~1B LoRA adapters |
| Post-training system | MinT |
| Primary domain | Personal agents, tool-use agents, Generative UI |
| Release type | Preview |
| Checkpoint format | Single HF repo: base checkpoint at root; LoRAs under `l0/`-`l4/` |
| Context length | 202,752 tokens, from `config.json` / `tokenizer_config.json` |
| Precision | bfloat16, from `config.json` |
| License | MIT; see [License](#license) |

## Repository Layout

The release is intentionally kept in one Hugging Face model repository:

```text
.
|-- config.json
|-- generation_config.json
|-- model.safetensors.index.json
|-- model-00001-of-00282.safetensors
|-- ...
|-- model-00282-of-00282.safetensors
|-- tokenizer.json
|-- tokenizer_config.json
|-- l0/
|   |-- adapter_config.json
|   `-- adapter_model.safetensors
|-- l1/
|-- l2/
|-- l3/
`-- l4/
```

Adapter roles:

| Adapter | Role |
|---|---|
| `l0` | Default chat, general-purpose behavior, and routing entry point |
| `l1` | Personal-agent tasks such as calendar, planning, search, and life automation |
| `l2` | Coding, terminal, repository, and shell tasks |
| `l3` | A2UI and Generative UI |
| `l4` | Computer-agent / OpenClaw-style workflows |

## What Macaron Is For

A useful personal agent has to work where the user actually lives. Daily life is full of small contingent decisions: what to eat tonight, where to find a quiet table, how to reroute when traffic changes, how to schedule an errand around family obligations, or how to choose the right UI surface for a task. These tasks become hard because the user, tools, and environment all change while the agent is working.

Macaron-V1-Preview-749B targets three linked abilities:

- **Capability**: using real tools such as search, maps, restaurants, calendars, coding environments, and task APIs.
- **Coherence**: tracking a real human across turns, preferences, constraints, and changing intent.
- **Expression**: choosing the right surface, such as text, card, form, table, slider, or dashboard, and rendering it quickly enough to remain useful.

## Architecture

### Mixture-of-LoRA

Macaron-V1-Preview-749B keeps divergent skill families in separate LoRAs over a shared base model. This is intended to reduce interference between chat, personal-agent tool use, coding, computer-agent behavior, and Generative UI, while still allowing the system to add new specialist domains without modifying the base model or existing specialists.

### Router Tool

Macaron exposes model selection as a tool call rather than as an opaque separate router model. The default adapter is `l0`. When a specialist is needed, the serving harness can route through an OpenAI-compatible tool call such as:

```json
{
  "name": "change_model",
  "arguments": {
    "target_model": "l1"
  }
}
```

The route is visible in traces and compatible with a standard tool-calling serving loop. A complete deployment should define the adapter registry, routing policy, confirmation policy, and how the system returns to the default adapter after a specialist turn.
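
As a minimal sketch of how a harness might interpret such a call, the function below maps a tool call to the adapter that should be active next. The adapter names come from the repository layout; `KNOWN_ADAPTERS`, `DEFAULT_ADAPTER`, and `resolve_route` are hypothetical names for illustration, not part of the released harness. The fallback behavior (ignore malformed or unknown routes) is likewise an assumption.

```python
import json

# Adapter names from the repository layout; the registry shape is an assumption.
KNOWN_ADAPTERS = {"l0", "l1", "l2", "l3", "l4"}
DEFAULT_ADAPTER = "l0"


def resolve_route(tool_call: dict, current: str = DEFAULT_ADAPTER) -> str:
    """Return the adapter that should be active after a tool call.

    Non-routing tool calls and malformed routes leave the adapter unchanged.
    """
    if tool_call.get("name") != "change_model":
        return current
    args = tool_call.get("arguments", {})
    # OpenAI-compatible APIs commonly deliver arguments as a JSON string.
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return current
    target = args.get("target_model") if isinstance(args, dict) else None
    return target if target in KNOWN_ADAPTERS else current
```

Rejecting unknown targets rather than raising keeps a bad route from interrupting the conversation; a stricter deployment might log or surface the error instead.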

### Harness Co-Design

Macaron-V1-Preview-749B is a model-and-harness release. The model was trained and evaluated with a production-style agent harness that manages LoRA routing, tool calls, memory/state exposure, system prompts, and task metadata. Deployments that remove or replace that harness should expect behavior and benchmark results to change.

## Generative UI and A2UI

Generative UI is a core Macaron capability. For many personal-agent tasks, the best answer is more than text: it may be a comparison card, editable task summary, booking form, route choice, slider, or dashboard.

Macaron-V1-Preview-749B is trained and evaluated with A2UI-style protocol actions. A2UI-Bench scores Generative UI along three layers:

- **Protocol correctness**: emitted actions are well formed and faithful to protocol semantics.
- **Task construction correctness**: the generated UI answers the user's request.
- **User-experience lift**: the UI makes the task easier than a text-only answer.

The evaluation also includes rendered visual checks for failures that text-only judges can miss, such as overflow, broken layouts, hidden controls, and spacing issues.

## Evaluation

The headline benchmark suite focuses on personal-agent behavior, daily-life task surfaces, Generative UI, and OpenClaw-style workflows.

![Macaron-V1-Preview-749B benchmark bar chart](assets/macaron_benchmark_bar_chart.png)

![Macaron-V1-Preview-749B benchmark radar chart](assets/macaron_benchmark_radar_chart.png)

![Macaron-V1-Preview-749B benchmark table](assets/macaron_benchmark_table.png)

Higher is better for all scores shown in the figures.

### Evaluation Protocols

**Macaron LivingBench.** Models are evaluated on 30 multi-turn personal-agent cases with a 10-turn budget. The tested agent may make up to three tool-use decisions per user turn. API calls use a 240-second timeout and up to three request-level retries. The reported mean case score is `0.7 × need score + 0.3 × process score`.
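
The LivingBench weighting can be sketched as follows. The 0.7 and 0.3 weights come from the protocol above. The function names and the assumption that both sub-scores lie in the range 0 to 1 are illustrative only.

```python
from statistics import mean

NEED_WEIGHT = 0.7      # weight on the need score, from the protocol
PROCESS_WEIGHT = 0.3   # weight on the process score, from the protocol


def case_score(need: float, process: float) -> float:
    """Weighted score for one case (sub-scores assumed to lie in [0, 1])."""
    return NEED_WEIGHT * need + PROCESS_WEIGHT * process


def mean_case_score(cases) -> float:
    """Average the weighted score over (need, process) pairs."""
    return mean(case_score(need, process) for need, process in cases)
```

For example, a case with a need score of 0.8 and a process score of 0.5 scores roughly 0.71 under this weighting.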

**A2UI-Bench.** Macaron-V1-Preview-749B is evaluated without explicit schema hints. Scores include protocol correctness, task construction correctness, and rendered UI quality.

**VitaBench.** VitaBench is used to stress realistic daily-life workflows. Because the benchmark's original official judge model is no longer available, GLM-5.1 serves as both the judge and the user model. Each query is run three times, and the reported value is the average score.

**PinchBench.** PinchBench is used for search-grounded, high-precision personal-agent tasks. The reported setup uses Claude Haiku 4.5 as the judge model and Perplexity as the search API, and reports the best observed score.

**Tau3 Bench.** The reported setup uses GPT-5.2 with `reasoning_effort=low` as the user simulator and reports pass@1.

**SWE-Bench Verified.** The reported setup allows up to three retries only when an evaluation error occurs and reports the best successful attempt. The overall evaluation-error rate is approximately 0.8%.

**Terminal-Bench 2.0.** The reported setup uses the Harbor framework to run Macaron with the Pi Coding Agent Harness in sandboxed environments, with a maximum timeout of 4 hours, and reports pass@1.

**AIME 2026.** The reported score is included as a general-capability reference; the preview release is optimized primarily for personal-agent behavior and Generative UI rather than for maximizing this benchmark.

## Intended Use

Macaron-V1-Preview-749B is intended for:

- personal assistant research
- multi-turn tool-use agents
- daily-life planning and automation
- coding and terminal-agent research
- Generative UI / A2UI research
- agent benchmark evaluation
- research on modular post-training and LoRA specialization

## Out-of-Scope Use

Macaron-V1-Preview-749B is not intended for:

- autonomous high-stakes decisions without human confirmation
- medical, legal, financial, or safety-critical advice as a sole authority
- covert surveillance or privacy-invasive automation
- fully unsupervised payments, bookings, messages, calendar changes, or other external write actions
- production deployment without task-specific safety testing, audit logs, and confirmation flows

## Installation and Loading

The repository contains both the base checkpoint and LoRA adapters, but full Macaron behavior depends on the router-aware serving harness. Loading a single LoRA is useful for inspection and specialist experiments; it is not equivalent to the full routed personal-agent system.

Install dependencies:

```bash
pip install -U transformers accelerate peft safetensors
```

Example: load the base checkpoint and attach one specialist LoRA:

```python
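import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

repo_id = "mindlab-research/Macaron-V1-Preview-749B"
adapter = "l1"  # personal-agent specialist; see the adapter roles table

# The tokenizer and base weights live at the repository root.
tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

# Load the base checkpoint in bfloat16 and shard it across available devices.
base_model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Attach a single LoRA from its subfolder; this does not enable routing.
model = PeftModel.from_pretrained(
    base_model,
    repo_id,
    subfolder=adapter,
)
model.eval()  # inference mode: disables dropout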
```

For full routed serving, use a harness that:

- registers all five LoRA specialists
- starts each conversation from `l0`
- exposes `change_model` as a tool call
- routes to specialists according to the adapter registry
- returns control to `l0` after specialist turns
- enforces confirmation for external write actions
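
The routing requirements above can be sketched as a small state machine. This is an illustration under stated assumptions, not the released harness: `RoutedSession`, `route`, and `finish_specialist_turn` are hypothetical names, and the policy of returning to `l0` after every specialist turn is the simplest reading of the list.

```python
from dataclasses import dataclass, field

ADAPTERS = ("l0", "l1", "l2", "l3", "l4")  # all five specialists registered


@dataclass
class RoutedSession:
    """Tracks which adapter is active for one conversation."""

    active: str = "l0"  # every conversation starts from the default adapter
    trace: list = field(default_factory=list)  # (from, to) pairs for auditing

    def route(self, target: str) -> None:
        """Switch adapters, e.g. in response to a change_model tool call."""
        if target not in ADAPTERS:
            raise ValueError(f"unknown adapter: {target}")
        self.trace.append((self.active, target))
        self.active = target

    def finish_specialist_turn(self) -> None:
        """Return control to l0 once a specialist turn completes."""
        if self.active != "l0":
            self.route("l0")
```

In a PEFT-based harness the switch itself might map to `PeftModel.load_adapter` followed by `PeftModel.set_adapter`; serving engines with native multi-LoRA support would handle this differently.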

## Tool Use

Macaron-V1-Preview-749B is designed to operate with external tools. Personal-agent deployments may include:

- search
- calendar
- route planning
- restaurant/place lookup
- booking
- messaging
- task-specific APIs
- A2UI rendering actions
- coding, shell, and repository tools

The model should request explicit user confirmation before external write actions such as booking, sending messages, changing calendars, or making purchases.
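
A harness can also enforce this rule independently of the model. The sketch below is a minimal confirmation gate. The tool names in `WRITE_TOOLS` and the function `gate_tool_call` are hypothetical; a real deployment would derive the write set from its own tool schema.

```python
# Hypothetical tool names; derive the real set from the deployment's tool schema.
WRITE_TOOLS = {"book_table", "send_message", "update_calendar", "make_purchase"}


def gate_tool_call(tool_name: str, confirmed: bool) -> str:
    """Decide whether a proposed tool call may run.

    Returns "ask_user" for unconfirmed external write actions, else "execute".
    """
    if tool_name in WRITE_TOOLS and not confirmed:
        return "ask_user"
    return "execute"
```

Read-only tools such as search pass through unchanged, so the gate adds friction only where an action has external side effects.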

## Safety, Privacy, and Limitations

Macaron-V1-Preview-749B is designed for personal-agent settings where user state, calendar details, preferences, and inferred motivations may be sensitive. The model should avoid revealing private state unless the user explicitly authorizes disclosure.

Deployment recommendations:

- keep audit logs for tool calls
- require confirmation for external write actions
- separate private user state from visible conversation
- evaluate privacy leakage in the target harness
- test tool schemas before production use
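
As one possible shape for the audit-log recommendation, the sketch below appends one JSON line per tool call. The record fields and the function name `log_tool_call` are assumptions for illustration. Because arguments may contain private user state, a real deployment would likely redact or encrypt them before writing.

```python
import io
import json
import time


def log_tool_call(stream, adapter: str, name: str, arguments: dict, confirmed: bool) -> dict:
    """Append one JSON-lines audit record for a tool call and return it."""
    record = {
        "ts": time.time(),       # wall-clock timestamp in seconds
        "adapter": adapter,      # which LoRA issued the call, e.g. "l1"
        "tool": name,
        "arguments": arguments,  # consider redaction: may hold private state
        "confirmed": confirmed,  # whether the user approved a write action
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Usage with an in-memory stream; a file opened in append mode works the same way.
buffer = io.StringIO()
log_tool_call(buffer, "l1", "update_calendar", {"event": "dentist"}, True)
```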

Limitations:

- Preview release; behavior may change across versions.
- Full behavior depends on a correct harness, router, and tool schema.
- Agent performance can degrade if tools return stale, partial, or contradictory data.
- Long-horizon personal-agent tasks still require human confirmation for external actions.
- A2UI quality depends on renderer and protocol compatibility.
- Benchmark scores may not transfer to deployments with different tools, user simulators, routing policies, or safety constraints.

## License

Macaron-V1-Preview-749B is released under the MIT License. Users should also respect any requirements inherited from the GLM-5.1 base model and from dependencies used by the serving harness.

## Citation

```bibtex
@misc{macaron2026preview749b,
  title = {Macaron-V1-Preview-749B: Mixture-of-LoRA Personal Agent Model},
  author = {MindLab Research},
  year = {2026},
  howpublished = {Hugging Face}
}
```

## Contact

- Organization: MindLab Research
- Project: Macaron
- Release blog: https://macaron.im/mindlab/research/macaron-v1-preview