File size: 2,401 Bytes
5eb447f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
# Gemma 4 26B Non-Coding Agentic Tool Evaluation - 2026-05-22

This directory records non-coding agentic tool-use tests for the Gemma 4 26B
Cerebellum templatefix releases.

The goal is narrow: prove OpenAI-style tool automation for assistant workflows,
not coding-agent performance. Coding-agent behavior in opencode is tracked
separately and is not treated as a clean release claim yet.

## Harness

Runtime:

- `llama-server`
- `--jinja`
- `--reasoning auto`
- request payload included `chat_template_kwargs: {"enable_thinking": false}`
- request payload included `thinking_budget_tokens: 0`

Mock tools:

- `list_calendar`
- `create_calendar_hold`
- `search_notes`
- `save_note`
- `add_task`

Tasks:

- Scheduling assistant: inspect calendar, find a Tuesday slot, create a hold.
- Release-note workflow: search internal notes, save a draft, create a follow-up
  task.
- Creative-brief workflow: search style notes and save a production brief.

Pass criteria:

- Required tools are called.
- Tool arguments are valid JSON.
- No repeated search/tool loop.
- No template or thinking leakage in no-thinking mode.
- The scheduling task preserves the user-provided literal day instead of
  inventing an ISO date.

## Results

Regular v6.1 templatefix:

- File: `regular_v6_1_noncoding_agentic_tools_strict_summary.json`
- Result: 3/3 clean pass.
- Passed cases: `schedule_strict`, `release_notes_strict`,
  `creative_brief_strict`.
- Warnings: none.
- Failures: none.

Heretic v1.1 templatefix:

- File: `heretic_v1_1_noncoding_agentic_tools_strict_retry_summary.json`
- Result: 3/3 clean pass on strict retry.
- Passed cases: `schedule_strict`, `release_notes_strict`,
  `creative_brief_strict`.
- Warnings: none on strict retry.
- Failures: none on strict retry.

The first Heretic permissive run demonstrated tool capability but had quality
warnings: it invented an ISO date for a Tuesday scheduling task and over-called
`search_notes`. The strict retry corrected those issues.

## Release Claim Supported

Supported:

- Non-coding OpenAI-compatible tool automation passed a strict three-task
  harness.
- The model can plan over simple external state, call tools with valid JSON,
  preserve user constraints, write notes, and create tasks.

Not supported by this directory:

- Proven coding-agent behavior.
- Fully autonomous agent behavior without human review.
- General MCP/opencode reliability.