gemma-4-12b-sdpo-pi-mono-trace-feedback-v2

LoRA adapter trained with TRL SDPO on filtered badlogicgames/pi-mono coding-agent traces that contain concrete tool errors or later user corrections.

This adapter is a self-distillation experiment, not a general-purpose coding model release. It uses TRL's experimental SDPO trainer with include_environment_feedback=True, so filtered trace diagnostics are supplied as privileged_context for teacher-conditioned reprompts.

Training Run

Field	Value
Hub repo	`burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2`
Base model	`google/gemma-4-12B-it`
Dataset	`badlogicgames/pi-mono`
Dataset split	`train`
Trackio SDK	`static`
Selected samples	`96`
Minimum filter score	`11.2`
Average filter score	`14.247`
Max steps	`48`
Learning rate	`5e-05`
Training method	`4-bit NF4 QLoRA + TRL SDPO`
LoRA rank	`8`
LoRA alpha	`16`
Num generations	`4`
Success reward threshold	`0.25`
Max prompt length	`768`
Max completion length	`160`
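
The quantization and LoRA rows above map onto configuration objects roughly as follows. This is a sketch rather than the training script: the keyword names are the usual `BitsAndBytesConfig` and `LoraConfig` arguments, while the `build_configs` helper and the `task_type` value are assumptions, and `target_modules` is left at the library default because the table does not list it.

```python
# Hyperparameters copied from the table above.
QUANT_KWARGS = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
LORA_KWARGS = {"r": 8, "lora_alpha": 16}


def build_configs():
    """Build quantization and LoRA config objects (hypothetical helper)."""
    # Imported lazily so the constants above are usable without the heavy deps.
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**QUANT_KWARGS)
    # task_type is an assumption; the model card does not state it.
    lora_config = LoraConfig(task_type="CAUSAL_LM", **LORA_KWARGS)
    return quant_config, lora_config
```

With rank 8 and alpha 16, the LoRA update is scaled by alpha / rank = 2.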

Trackio: Trackio run
Run name: gemma4-12b-sdpo-pi-mono-static-r48-g4-20260604
Output directory: outputs/sdpo-pi-mono-trace-feedback

Data Filtering

The source data is badlogicgames/pi-mono. The script uses the Dataset Viewer parquet export by default because the raw JSONL files have schema drift that can make direct load_dataset() reconstruction brittle.
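
One way to locate the parquet export is the Dataset Viewer `/parquet` endpoint, which lists the converted files per split. The sketch below is an illustration and may differ from what the training script does; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

DATASET_ID = "badlogicgames/pi-mono"
VIEWER_API = "https://datasets-server.huggingface.co/parquet"


def parquet_api_url(dataset_id: str) -> str:
    """URL of the Dataset Viewer endpoint that lists parquet files."""
    return f"{VIEWER_API}?{urllib.parse.urlencode({'dataset': dataset_id})}"


def select_parquet_urls(payload: dict, split: str = "train") -> list:
    """Pick the parquet file URLs for one split from the API response."""
    files = payload.get("parquet_files", [])
    return [f["url"] for f in files if f.get("split") == split]


def fetch_parquet_urls(dataset_id: str = DATASET_ID, split: str = "train") -> list:
    """Query the API (network call) and return parquet URLs for the split."""
    with urllib.request.urlopen(parquet_api_url(dataset_id), timeout=30) as resp:
        return select_parquet_urls(json.load(resp), split)
```

The returned URLs can then be passed to `datasets.load_dataset("parquet", data_files=urls)` or read one at a time with `pandas.read_parquet`.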

A row is kept when it has a substantive prompt plus at least one concrete tool error, environment diagnostic, or later user correction. Rows are scored higher for test/build failures, runtime exceptions, missing files or commands, explicit user feedback, and evidence that the trace continued after the failure.
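
The keep-and-score rule can be sketched as below. The category names come from this card, but the regular expressions, weights, and the `MIN_PROMPT_CHARS` threshold are hypothetical stand-ins, so the scores will not match the 11.2 to 19.5 range reported here.

```python
import re

# Hypothetical patterns and weights; the real script's values are not given in this card.
CATEGORY_RULES = {
    "test_or_assertion": (re.compile(r"AssertionError|\bFAIL(ED)?\b|expected .+ but got"), 3.0),
    "build_lint_compile": (re.compile(r"error TS\d+|SyntaxError|Transform failed"), 3.0),
    "runtime_exception": (re.compile(r"Traceback|triggerUncaughtException|Uncaught"), 2.5),
    "missing_file_or_command": (re.compile(r"command not found|No such file or directory|ENOENT"), 2.0),
    "command_error": (re.compile(r"exited with code [1-9]\d*"), 1.5),
}
MIN_PROMPT_CHARS = 40  # assumed threshold for a "substantive" prompt


def score_row(prompt, diagnostics, user_corrections, continued_after_failure):
    """Return (score, categories); a score of 0.0 means the row is dropped."""
    categories = set()
    score = 0.0
    text = "\n".join(diagnostics)
    for name, (pattern, weight) in CATEGORY_RULES.items():
        if pattern.search(text):
            categories.add(name)
            score += weight
    if user_corrections:
        categories.add("user_feedback")
        score += 2.0
    # Keep only substantive prompts with at least one concrete signal.
    if len(prompt.strip()) < MIN_PROMPT_CHARS or not categories:
        return 0.0, set()
    # Bonus when the trace kept going after the failure.
    if continued_after_failure:
        score += 1.0
    return score, categories
```

Rows would then be sorted by score and cut at a minimum threshold, which is how a fixed number of selected samples could be obtained.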

Selected-data summary:

Source sessions: 96
Source files: 96
Score range: 11.2 to 19.5
Average score: 14.247

Category counts:

build_lint_compile: 58
command_error: 96
missing_file_or_command: 57
other_marked_error: 3
permission_auth: 3
runtime_exception: 27
test_or_assertion: 89
tool_schema_validation: 5
user_feedback: 88

First selected sample preview:

score: 19.5
categories: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback
reward_terms: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback, environment, diagnostic, promises:332, triggeruncaughtexception

prompt:
Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291

For each issue:

1. Read the issue in full, including all comments and linked issues/PRs.
2. Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.

3. **For bugs**:
   - Ignore any root cause analysis in the issue (likely wrong)
   - Read all related code files in full (no truncation)
   - Trace the code path and identify the actual root cause
   - Propose a fix

4. **For feature requests**:
   - Do not trust implementation proposals in the issue without verification
   - Read all related code files in full (no truncation)
   - Propose the most concise implementation approach
   - List affected files and changes needed

Do NOT implement unless explicitly asked. Analyze and propose only.

privileged_context:
Tool/environment diagnostic 1 (test_or_assertion, runtime_exception, command_error):
node:internal/process/promises:332
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: Transform failed with 3 errors:
/eval.ts:29:2: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:31:22: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:46:2: ERROR: Top-level await is currently not supported with the "cjs" output format
    at failureErrorWithLog ($WORKSPACE/node_modules/esbuild/lib/main.js:1748:15)
    at $WOR

[... trimmed ...]

el' does not exist in type 'SimpleStreamOptions'.
packages/coding-agent/src/core/extensions/runner.ts(242,46): error TS2304: Cannot find name 'ProviderConfig'.

Command exited with code 2

Tool/environment diagnostic 3 (test_or_assertion):
No changes made to packages/coding-agent/src/core/agent-session.ts. The replacement produced identical content. This might indicate an issue with special characters or the text not existing as expected.

Later user correction 1:
what'st he most concise fix? this sounds overly complex

Later user correction 2:
well, we can't just fix it for session_start then

Reward Function

The included trace_grounding_reward is intentionally lightweight. It rewards completions that look like concrete coding-agent responses and mention terms grounded in the trace diagnostic. That is enough to exercise SDPO, Trackio, HF Jobs, LoRA push-to-Hub, and the filtered trace format.
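
A heuristic in this spirit might look like the sketch below. It is not the included `trace_grounding_reward`: the action hints, the token rule, and the 50/50 weighting are assumptions, and the batch wrapper assumes plain-string completions and that the SDPO trainer passes dataset columns as keyword arguments in the way TRL custom reward functions usually receive them.

```python
import re

ACTION_HINTS = ("fix", "inspect", "run", "check", "edit", "replace", "because")  # hypothetical


def diagnostic_terms(privileged_context, min_len=6):
    """Lowercased identifier-like tokens taken from the trace diagnostic."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]+", privileged_context)
    return {t.lower() for t in tokens if len(t) >= min_len}


def trace_grounding_score(completion, privileged_context):
    """Score in [0, 1]: half for grounded terms, half for agent-like wording."""
    text = completion.lower()
    terms = diagnostic_terms(privileged_context)
    grounded = sum(1 for t in terms if t in text)
    grounding = min(grounded / 3, 1.0) if terms else 0.0
    concreteness = min(sum(1 for h in ACTION_HINTS if h in text) / 2, 1.0)
    return 0.5 * grounding + 0.5 * concreteness


def reward_func(completions, privileged_context, **kwargs):
    """Batch wrapper returning one float per completion."""
    return [trace_grounding_score(c, p) for c, p in zip(completions, privileged_context)]
```

A substring heuristic like this is easy to game, which is why it only serves as a pipeline smoke test.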

For a serious training run, replace the heuristic reward with a verifier that can replay or grade the task, such as tests, build checks, lint output, tool-call validation, or another sandboxed outcome signal.
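
As a minimal illustration of an outcome-based signal, the sketch below turns the exit code of a check command into a binary reward. The function name and defaults are hypothetical, and the code does no sandboxing itself, so it assumes the command already runs inside an isolated workspace or container.

```python
import subprocess
import sys


def command_outcome_reward(command, cwd=None, timeout_s=120):
    """Return 1.0 if the check command exits with code 0, else 0.0."""
    try:
        result = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timeouts and missing executables count as failures.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    # Example: a trivially passing check using the current interpreter.
    print(command_outcome_reward([sys.executable, "-c", "raise SystemExit(0)"]))
```

In practice the command would be the project's test, build, or lint entry point, run after the model's proposed change has been applied to a scratch checkout.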

Pi Harness via vLLM

This repository contains a PEFT LoRA adapter, not a merged full model. A practical harness setup is to serve the Gemma base model once with vLLM, attach this adapter as a named LoRA module, and point the Pi harness at vLLM's OpenAI-compatible /v1 endpoint.

python -m pip install -U "vllm>=0.10.0" "huggingface_hub>=0.30.0"

hf download burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 \
  --repo-type model \
  --local-dir ./adapters/pi-mono-sdpo

vllm serve google/gemma-4-12B-it \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --api-key token-pi-harness \
  --enable-lora \
  --lora-modules pi-mono-sdpo=./adapters/pi-mono-sdpo

Then configure the Pi harness as an OpenAI-compatible provider:

export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=token-pi-harness
export OPENAI_MODEL=pi-mono-sdpo

Smoke-test the server before running the harness:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer token-pi-harness" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pi-mono-sdpo",
    "messages": [
      {"role": "system", "content": "You are a careful coding agent."},
      {"role": "user", "content": "A test says expected 2 but got 3. What should you inspect first?"}
    ],
    "max_tokens": 128
  }'
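
The same smoke test can be scripted in Python with only the standard library. This sketch reuses the base URL, API key, and model name from the commands above; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-compatible chat completion shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"
API_KEY = "token-pi-harness"
MODEL = "pi-mono-sdpo"


def build_chat_request(user_message, max_tokens=128):
    """Build the POST request for /chat/completions without sending it."""
    body = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a careful coding agent."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(user_message):
    """Send the request (needs the server running) and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling `smoke_test("A test says expected 2 but got 3. What should you inspect first?")` should return a short text reply if the adapter is being served under the `pi-mono-sdpo` name.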

If your Pi harness uses different variable names, map the same three values: OpenAI-compatible base URL, API key, and model name. If your vLLM version cannot serve this Gemma 4 adapter directly, merge the adapter into the base model first or run the adapter with a Transformers/PEFT inference process behind the same OpenAI-compatible API.

Training Metrics

epoch: 0.25
total_flos: 0.0
train_loss: -0.34031375994284946
train_runtime: 1108.2066
train_samples_per_second: 0.087
train_steps_per_second: 0.043
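
As a quick consistency check, the reported rates can be multiplied by the runtime. The rates are rounded to three decimals, so the products are approximate; the step count lands on the configured maximum of 48 steps.

```python
train_runtime = 1108.2066          # seconds, from the metrics above
train_steps_per_second = 0.043
train_samples_per_second = 0.087

approx_steps = train_runtime * train_steps_per_second      # about 47.7
approx_samples = train_runtime * train_samples_per_second  # about 96.4
print(round(approx_steps), round(approx_samples))          # prints "48 96"
```

That works out to roughly two samples per optimizer step and about 23 seconds per step.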

Reproduce

hf jobs uv run <raw-gist-url-for-train_sdpo_pi_mono_full.py> \
  --flavor a10g-large \
  --timeout 4h \
  --secrets HF_TOKEN \
  --env HUB_MODEL_ID=burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 \
  --mode train \
  --trackio-space-id burtenshaw/sdpo-pi-mono-trackio-static-v3 \
  --trackio-sdk static \
  --run-name gemma4-12b-sdpo-pi-mono-static-r48-g4-20260604

Limitations

The data is filtered from existing traces, so it reflects the trace collector's task distribution and failure modes.
The reward is a smoke-test heuristic and should not be interpreted as a reliable coding benchmark.
The model is pushed as a PEFT LoRA adapter; load it with the base model listed above.
SDPO is experimental in TRL, so pin versions for long-running comparisons.
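
For the version pinning mentioned above, one lightweight option is to record the installed versions alongside each run. The package list and helper names in this sketch are assumptions; adjust them to whatever the environment actually uses.

```python
from importlib import metadata

PACKAGES = ("trl", "transformers", "peft", "torch", "vllm")  # assumed list


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def requirements_lines(versions):
    """Format the installed packages as pinned requirement lines."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]
```

Writing `requirements_lines(installed_versions())` to a file next to the run output gives a pinned list that later comparisons can reinstall.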
