Dataset-loading failures, if any, are listed in dataset_audit_round3.json; failed anchors were dropped as instructed, while preserving at least six anchors per domain when available.
Code-task evaluation is string/span matching against reference code, not sandboxed unit-test execution; numbers should be interpreted as a cheap adapter-locality proxy rather than pass@1.
If an oracle adapter does not improve over base Y, the corresponding gap_recovered is unstable/meaningless and should be treated as diagnostic rather than evidence of mapping quality.
Math exact-match uses numeric extraction from generated text; formatting failures are counted as wrong.