deucebucket commited on
Commit
5eb447f
·
verified ·
1 Parent(s): 9935c1b

docs: update templatefix test notes

Browse files
Files changed (1) hide show
  1. agentic_eval_20260522/README.md +81 -0
agentic_eval_20260522/README.md ADDED
@@ -0,0 +1,81 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Gemma 4 26B Non-Coding Agentic Tool Evaluation - 2026-05-22
2
+
3
+ This directory records non-coding agentic tool-use tests for the Gemma 4 26B
4
+ Cerebellum templatefix releases.
5
+
6
+ The goal is narrow: prove OpenAI-style tool automation for assistant workflows,
7
+ not coding-agent performance. Coding-agent behavior in opencode is tracked
8
+ separately and is not treated as a clean release claim yet.
9
+
10
+ ## Harness
11
+
12
+ Runtime:
13
+
14
+ - `llama-server`
15
+ - `--jinja`
16
+ - `--reasoning auto`
17
+ - request payload included `chat_template_kwargs: {"enable_thinking": false}`
18
+ - request payload included `thinking_budget_tokens: 0`
19
+
20
+ Mock tools:
21
+
22
+ - `list_calendar`
23
+ - `create_calendar_hold`
24
+ - `search_notes`
25
+ - `save_note`
26
+ - `add_task`
27
+
28
+ Tasks:
29
+
30
+ - Scheduling assistant: inspect calendar, find a Tuesday slot, create a hold.
31
+ - Release-note workflow: search internal notes, save a draft, create a follow-up
32
+ task.
33
+ - Creative-brief workflow: search style notes and save a production brief.
34
+
35
+ Pass criteria:
36
+
37
+ - Required tools are called.
38
+ - Tool arguments are valid JSON.
39
+ - No repeated search/tool loop.
40
+ - No template or thinking leakage in no-thinking mode.
41
+ - The scheduling task preserves the user-provided literal day instead of
42
+ inventing an ISO date.
43
+
44
+ ## Results
45
+
46
+ Regular v6.1 templatefix:
47
+
48
+ - File: `regular_v6_1_noncoding_agentic_tools_strict_summary.json`
49
+ - Result: 3/3 clean pass.
50
+ - Passed cases: `schedule_strict`, `release_notes_strict`,
51
+ `creative_brief_strict`.
52
+ - Warnings: none.
53
+ - Failures: none.
54
+
55
+ Heretic v1.1 templatefix:
56
+
57
+ - File: `heretic_v1_1_noncoding_agentic_tools_strict_retry_summary.json`
58
+ - Result: 3/3 clean pass on strict retry.
59
+ - Passed cases: `schedule_strict`, `release_notes_strict`,
60
+ `creative_brief_strict`.
61
+ - Warnings: none on strict retry.
62
+ - Failures: none on strict retry.
63
+
64
+ The first Heretic permissive run demonstrated tool capability but had quality
65
+ warnings: it invented an ISO date for a Tuesday scheduling task and over-called
66
+ `search_notes`. The strict retry corrected those issues.
67
+
68
+ ## Release Claim Supported
69
+
70
+ Supported:
71
+
72
+ - Non-coding OpenAI-compatible tool automation passed a strict three-task
73
+ harness.
74
+ - The model can plan over simple external state, call tools with valid JSON,
75
+ preserve user constraints, write notes, and create tasks.
76
+
77
+ Not supported by this directory:
78
+
79
+ - Proven coding-agent behavior.
80
+ - Fully autonomous agent behavior without human review.
81
+ - General MCP/opencode reliability.