How to use burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-12B-it")
model = PeftModel.from_pretrained(base_model, "burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2")
gemma-4-12b-sdpo-pi-mono-trace-feedback-v2
LoRA adapter trained with TRL SDPO on filtered badlogicgames/pi-mono coding-agent traces that contain concrete tool errors or later user corrections.
This adapter is a self-distillation experiment, not a general-purpose coding model release. It uses TRL's experimental SDPO trainer with include_environment_feedback=True, so filtered trace diagnostics are supplied as privileged_context for teacher-conditioned reprompts.
Training Run
| Field | Value |
|---|---|
| Hub repo | burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 |
| Base model | google/gemma-4-12B-it |
| Dataset | badlogicgames/pi-mono |
| Dataset split | train |
| Trackio SDK | static |
| Selected samples | 96 |
| Minimum filter score | 11.2 |
| Average filter score | 14.247 |
| Max steps | 48 |
| Learning rate | 5e-05 |
| Training method | 4-bit NF4 QLoRA + TRL SDPO |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| Num generations | 4 |
| Success reward threshold | 0.25 |
| Max prompt length | 768 |
| Max completion length | 160 |
- Trackio: Trackio run
- Run name: gemma4-12b-sdpo-pi-mono-static-r48-g4-20260604
- Output directory: outputs/sdpo-pi-mono-trace-feedback
Data Filtering
The source data is badlogicgames/pi-mono. The script uses the Dataset Viewer parquet export by default because the raw JSONL files have schema drift that can make direct load_dataset() reconstruction brittle.
A row is kept when it has a substantive prompt plus at least one concrete tool error, environment diagnostic, or later user correction. Rows are scored higher for test/build failures, runtime exceptions, missing files or commands, explicit user feedback, and evidence that the trace continued after the failure.
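The keep-and-score rule described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual filter: the regex patterns, weights, the `MIN_PROMPT_CHARS` cutoff, and the `score_row` function are all hypothetical, and the resulting scores are not on the same scale as the filter scores reported in this card.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
# The real filtering script's rules are not given in this card.
CATEGORY_RULES = {
    "test_or_assertion": (r"assert|tests? failed|expected .* but got", 3.0),
    "build_lint_compile": (r"error TS\d+|Transform failed|SyntaxError", 3.0),
    "runtime_exception": (r"Traceback|UncaughtException|Error:", 2.5),
    "missing_file_or_command": (r"No such file|command not found|does not exist", 2.0),
    "command_error": (r"exited with code [1-9]", 1.5),
}
USER_FEEDBACK_WEIGHT = 2.0   # explicit later user correction
CONTINUATION_BONUS = 1.0     # trace continued after the failure
MIN_PROMPT_CHARS = 40        # crude stand-in for "substantive prompt"


def score_row(prompt, diagnostics, user_corrections, turns_after_failure=0):
    """Return (score, categories); score is 0.0 when the row should be dropped."""
    if len(prompt.strip()) < MIN_PROMPT_CHARS:
        return 0.0, []
    text = "\n".join(diagnostics)
    categories = []
    score = 0.0
    for name, (pattern, weight) in CATEGORY_RULES.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            categories.append(name)
            score += weight
    if user_corrections:
        categories.append("user_feedback")
        score += USER_FEEDBACK_WEIGHT
    if not categories:
        # No concrete error, diagnostic, or correction: drop the row.
        return 0.0, []
    if turns_after_failure > 0:
        score += CONTINUATION_BONUS
    return score, sorted(categories)


score, cats = score_row(
    prompt="Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291",
    diagnostics=["error TS2304: Cannot find name 'ProviderConfig'.\nCommand exited with code 2"],
    user_corrections=["this sounds overly complex"],
    turns_after_failure=3,
)
print(score, cats)
```

Rows would then be ranked by score and cut at a minimum threshold, in the same spirit as the minimum filter score listed in the training table.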
Selected-data summary:
- Source sessions: 96
- Source files: 96
- Score range: 11.2 to 19.5
- Average score: 14.247
Category counts:
build_lint_compile: 58
command_error: 96
missing_file_or_command: 57
other_marked_error: 3
permission_auth: 3
runtime_exception: 27
test_or_assertion: 89
tool_schema_validation: 5
user_feedback: 88
First selected sample preview:
score: 19.5
categories: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback
reward_terms: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback, environment, diagnostic, promises:332, triggeruncaughtexception
prompt:
Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291
For each issue:
1. Read the issue in full, including all comments and linked issues/PRs.
2. Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.
3. **For bugs**:
- Ignore any root cause analysis in the issue (likely wrong)
- Read all related code files in full (no truncation)
- Trace the code path and identify the actual root cause
- Propose a fix
4. **For feature requests**:
- Do not trust implementation proposals in the issue without verification
- Read all related code files in full (no truncation)
- Propose the most concise implementation approach
- List affected files and changes needed
Do NOT implement unless explicitly asked. Analyze and propose only.
privileged_context:
Tool/environment diagnostic 1 (test_or_assertion, runtime_exception, command_error):
node:internal/process/promises:332
triggerUncaughtException(err, true /* fromPromise */);
^
Error: Transform failed with 3 errors:
/eval.ts:29:2: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:31:22: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:46:2: ERROR: Top-level await is currently not supported with the "cjs" output format
at failureErrorWithLog ($WORKSPACE/node_modules/esbuild/lib/main.js:1748:15)
at $WOR
[... trimmed ...]
el' does not exist in type 'SimpleStreamOptions'.
packages/coding-agent/src/core/extensions/runner.ts(242,46): error TS2304: Cannot find name 'ProviderConfig'.
Command exited with code 2
Tool/environment diagnostic 3 (test_or_assertion):
No changes made to packages/coding-agent/src/core/agent-session.ts. The replacement produced identical content. This might indicate an issue with special characters or the text not existing as expected.
Later user correction 1:
what'st he most concise fix? this sounds overly complex
Later user correction 2:
well, we can't just fix it for session_start then
Reward Function
The included trace_grounding_reward is intentionally lightweight. It rewards completions that look like concrete coding-agent responses and mention terms grounded in the trace diagnostic. That is enough to exercise SDPO, Trackio, HF Jobs, LoRA push-to-Hub, and the filtered trace format.
For a serious training run, replace the heuristic reward with a verifier that can replay or grade the task, such as tests, build checks, lint output, tool-call validation, or another sandboxed outcome signal.
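To make the shape of such a heuristic concrete, here is a minimal sketch of a grounding-style reward. It is not the included trace_grounding_reward: the function name `trace_grounding_reward_sketch`, the `ACTION_CUES` list, and the 50/50 weighting are assumptions for illustration. Only the 0.25 threshold comes from the training table above.

```python
import re

# Hypothetical cue words for "looks like a concrete coding-agent response".
ACTION_CUES = ("inspect", "run", "fix", "check", "read", "patch", "test")
SUCCESS_THRESHOLD = 0.25  # "Success reward threshold" from the training table


def trace_grounding_reward_sketch(completion, reward_terms):
    """Score in [0, 1]: half for grounded term overlap, half for action wording."""
    text = completion.lower()
    terms = [t.lower() for t in reward_terms if t]
    # Fraction of trace-derived terms that appear verbatim in the completion.
    grounded = sum(1 for t in terms if t in text)
    grounding = grounded / len(terms) if terms else 0.0
    # Two or more distinct action cues saturate the concreteness half.
    words = set(re.findall(r"[a-z_]+", text))
    concreteness = min(1.0, sum(1 for cue in ACTION_CUES if cue in words) / 2)
    return 0.5 * grounding + 0.5 * concreteness


reward = trace_grounding_reward_sketch(
    "Run the eval script as ESM, then check why triggerUncaughtException fires.",
    ["triggeruncaughtexception", "runtime_exception", "promises:332", "command_error"],
)
print(reward, reward >= SUCCESS_THRESHOLD)
```

A reward like this can be gamed by echoing diagnostic terms, which is why the verifier-based replacement suggested above matters for any run whose results are meant to be trusted.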
Pi Harness via vLLM
This repository contains a PEFT LoRA adapter, not a merged full model. A practical harness setup is to serve the Gemma base model once with vLLM, attach this adapter as a named LoRA module, and point the Pi harness at vLLM's OpenAI-compatible /v1 endpoint.
python -m pip install -U "vllm>=0.10.0" "huggingface_hub>=0.30.0"
hf download burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 \
--repo-type model \
--local-dir ./adapters/pi-mono-sdpo
vllm serve google/gemma-4-12B-it \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--api-key token-pi-harness \
--enable-lora \
--lora-modules pi-mono-sdpo=./adapters/pi-mono-sdpo
Then configure the Pi harness as an OpenAI-compatible provider:
export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export OPENAI_API_KEY=token-pi-harness
export OPENAI_MODEL=pi-mono-sdpo
Smoke-test the server before running the harness:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Authorization: Bearer token-pi-harness" \
-H "Content-Type: application/json" \
-d '{
"model": "pi-mono-sdpo",
"messages": [
{"role": "system", "content": "You are a careful coding agent."},
{"role": "user", "content": "A test says expected 2 but got 3. What should you inspect first?"}
],
"max_tokens": 128
}'
If your Pi harness uses different variable names, map the same three values: OpenAI-compatible base URL, API key, and model name. If your vLLM version cannot serve this Gemma 4 adapter directly, merge the adapter into the base model first or run the adapter with a Transformers/PEFT inference process behind the same OpenAI-compatible API.
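The same smoke test can be run from Python using only the standard library, reading the three values from the environment variables shown above. This is a sketch: the helper names `build_chat_request` and `smoke_test` are made up for this example, and the fallback values assume the server settings from the serve command in this section.

```python
import json
import os
import urllib.request


def build_chat_request(base_url, api_key, model, user_message, max_tokens=128):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a careful coding agent."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def smoke_test(timeout=60):
    """Send one chat request and return the text of the first choice."""
    request = build_chat_request(
        base_url=os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8000/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", "token-pi-harness"),
        model=os.environ.get("OPENAI_MODEL", "pi-mono-sdpo"),
        user_message="A test says expected 2 but got 3. What should you inspect first?",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Call `print(smoke_test())` once the server is up. Building the request separately from sending it makes it easy to inspect the URL, headers, and payload without a running server.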
Training Metrics
epoch: 0.25
total_flos: 0.0
train_loss: -0.34031375994284946
train_runtime: 1108.2066
train_samples_per_second: 0.087
train_steps_per_second: 0.043
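As a quick sanity check, the throughput figures can be multiplied back out against the runtime. The sketch below uses only the values reported above and assumes train_runtime is in seconds, as the Transformers Trainer reports it. Because the rates are rounded, the results are approximate.

```python
metrics = {
    "train_runtime": 1108.2066,          # seconds
    "train_samples_per_second": 0.087,
    "train_steps_per_second": 0.043,
}

# Rates are rounded to three decimals, so expect small deviations.
steps = metrics["train_runtime"] * metrics["train_steps_per_second"]
samples = metrics["train_runtime"] * metrics["train_samples_per_second"]
seconds_per_step = 1 / metrics["train_steps_per_second"]

print(f"approx. steps: {steps:.1f}")
print(f"approx. samples: {samples:.1f}")
print(f"approx. seconds per step: {seconds_per_step:.1f}")
```

The step estimate rounds to the 48 max steps in the training table, and the sample estimate rounds to 96.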
Reproduce
hf jobs uv run <raw-gist-url-for-train_sdpo_pi_mono_full.py> \
--flavor a10g-large \
--timeout 4h \
--secrets HF_TOKEN \
--env HUB_MODEL_ID=burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback-v2 \
--mode train \
--trackio-space-id burtenshaw/sdpo-pi-mono-trackio-static-v3 \
--trackio-sdk static \
--run-name gemma4-12b-sdpo-pi-mono-static-r48-g4-20260604
Limitations
- The data is filtered from existing traces, so it reflects the trace collector's task distribution and failure modes.
- The reward is a smoke-test heuristic and should not be interpreted as a reliable coding benchmark.
- The model is pushed as a PEFT LoRA adapter; load it with the base model listed above.
- SDPO is experimental in TRL, so pin versions for long-running comparisons.