usermma committed
Commit 3d4d564 · verified · 1 parent: 33a6452

Update README.md

Reformat the benchmark table: bold the best score in each row.

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -137,20 +137,20 @@ The headline benchmark suite focuses on personal-agent behavior, daily-life task

  | Category | Benchmark | Macaron V1 Preview | GLM 5.1 | GPT 5.4 | Claude Opus 4.6 | Gemini 3.1 Pro | Qwen 3.6 Plus | Minimax 2.7 |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- | Personal Agent Benchmark | Macaron Livingbench | 75.2 | 63.2 | 66.5 | 68.9 | 57.6 | 59.0 | 58.2 |
- | | VitaBench | 59.6 | 56.8 | 48.7 | 53.0 | 55.2 | 47.5 | 52.2 |
- | | VitaBench (Delivery) | 67.0 | 64.2 | 50.0 | 65.0 | 63.0 | 58.0 | 63.0 |
- | | VitaBench (In-Store) | 75.0 | 70.0 | 55.0 | 66.0 | 68.0 | 58.0 | 58.0 |
- | | VitaBench (OTA) | 51.0 | 54.0 | 41.0 | 45.0 | 48.0 | 36.0 | 51.0 |
- | | VitaBench (Cross-shop) | 45.3 | 39.0 | -- | 36.0 | 42.0 | 38.0 | 37.0 |
- | | A2UI-Bench | 75.6 | 61.7 | 74.1 | 67.6 | 71.0 | 69.8 | 54.4 |
- | | A2UI L1 | 89.5 | 72.2 | 82.3 | 81.5 | 85.1 | 84.1 | 75.1 |
- | | A2UI L2 | 67.2 | 54.7 | 71.8 | 59.4 | 64.1 | 59.9 | 46.3 |
- | | A2UI L3 | 65.7 | 54.5 | 65.4 | 57.5 | 59.2 | 60.7 | 34.8 |
- | | PinchBench | 92.5 | 76.6 | 88.4 | 88.9 | 82.9 | 85.9 | 84.5 |
- | General Agent Benchmark | Tau3 Bench | 67.6 | 70.6 | 72.9 | 72.4 | 67.1 | 70.7 | 67.6 |
- | | SWE-bench Verified | 78.1 | 76.4 | 78.2 | 78.2 | 78.8 | 73.4 | 73.8 |
- | | Terminal-Bench 2.0 | 67.4 | 63.5 | 75.1 | 65.4 | 68.5 | 61.6 | 57.0 |
+ | Personal Agent Benchmark | Macaron Livingbench | **75.2** | 63.2 | 66.5 | 68.9 | 57.6 | 59.0 | 58.2 |
+ | | VitaBench | **59.6** | 56.8 | 48.7 | 53.0 | 55.2 | 47.5 | 52.2 |
+ | | VitaBench (Delivery) | **67.0** | 64.2 | 50.0 | 65.0 | 63.0 | 58.0 | 63.0 |
+ | | VitaBench (In-Store) | **75.0** | 70.0 | 55.0 | 66.0 | 68.0 | 58.0 | 58.0 |
+ | | VitaBench (OTA) | 51.0 | **54.0** | 41.0 | 45.0 | 48.0 | 36.0 | 51.0 |
+ | | VitaBench (Cross-shop) | **45.3** | 39.0 | -- | 36.0 | 42.0 | 38.0 | 37.0 |
+ | | A2UI-Bench | **75.6** | 61.7 | 74.1 | 67.6 | 71.0 | 69.8 | 54.4 |
+ | | A2UI L1 | **89.5** | 72.2 | 82.3 | 81.5 | 85.1 | 84.1 | 75.1 |
+ | | A2UI L2 | 67.2 | 54.7 | **71.8** | 59.4 | 64.1 | 59.9 | 46.3 |
+ | | A2UI L3 | **65.7** | 54.5 | 65.4 | 57.5 | 59.2 | 60.7 | 34.8 |
+ | | PinchBench | **92.5** | 76.6 | 88.4 | 88.9 | 82.9 | 85.9 | 84.5 |
+ | General Agent Benchmark | Tau3 Bench | 67.6 | 70.6 | **72.9** | 72.4 | 67.1 | 70.7 | 67.6 |
+ | | SWE-bench Verified | 78.1 | 76.4 | 78.2 | 78.2 | **78.8** | 73.4 | 73.8 |
+ | | Terminal-Bench 2.0 | **75.1** is bolded under GPT 5.4 |

  Higher is better for all scores shown in the charts and table.
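
The change in this commit, bolding the highest score in each benchmark row, could also be applied mechanically. The following is a minimal sketch of that idea, not the method the author used. The helper name `bold_row_max` and the `first_score_col` parameter are hypothetical, and it assumes the first two columns are labels and that non-numeric cells such as `--` are simply skipped.

```python
def bold_row_max(row: str, first_score_col: int = 2) -> str:
    """Bold the highest numeric cell in one Markdown table row (hypothetical helper)."""
    # Drop the outer pipes, then split the row into trimmed cells.
    cells = [c.strip() for c in row.strip().strip("|").split("|")]

    # Collect the cells that parse as numbers, ignoring any existing bold markers.
    scores = {}
    for i in range(first_score_col, len(cells)):
        try:
            scores[i] = float(cells[i].strip("*"))
        except ValueError:
            continue  # placeholders such as "--", header text, separator dashes

    if scores:
        best = max(scores.values())
        for i, value in scores.items():
            text = cells[i].strip("*")
            # Ties are all bolded; higher is assumed to be better.
            cells[i] = f"**{text}**" if value == best else text

    # Re-join, keeping empty cells as a single space ("| |").
    return "|" + "|".join(f" {c} " if c else " " for c in cells) + "|"


if __name__ == "__main__":
    print(bold_row_max("| | A2UI L2 | 67.2 | 54.7 | 71.8 | 59.4 | 64.1 | 59.9 | 46.3 |"))
```

Because existing `**` markers are stripped before comparison, running the helper on an already bolded row leaves it unchanged, and header or separator rows pass through without bolding since none of their cells parse as numbers.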