feat(v0.3): inspector + what-if slider + falsification + community widget + registry bootstrap
c11b76c

---
name: ❌ Refute a prediction
about: TAF prediction contradicted by empirical measurement
title: '[Refute] '
labels: refuted
---

## Hash of analysis being refuted

`#__________` ← paste the hash from the original issue's title

## Original issue

Link: #__

## TAF prediction

What TAF said:

- Verdict: __
- Key number: __ (e.g. d_horizon = 47781)

## My empirical measurement

What actually happened:

- Verdict observed: __
- Key number measured: __ (e.g. NIAH collapse at L=12K, well before predicted ceiling)
- Magnitude of disagreement: __ (% or absolute)

## Setup

- Hardware: __
- Software: __ (versions matter!)
- Random seed(s) tried: __
- Number of trials: __

## Method

Detailed enough for a third party to reproduce:

```bash
# Step-by-step commands
```

```python
# Or full Python script
```

## Hypothesis on why TAF was wrong

- [ ] Out-of-regime (e.g. extrapolation beyond validity zone)
- [ ] Architecture-specific quirk not captured in formulas
- [ ] Model has unusual training data
- [ ] Bug in TAF formulas
- [ ] Other: __

Detailed thoughts:

## Suggested update to TAF

If applicable, what should the framework do differently?

- [ ] Update validity bounds for this recipe
- [ ] Add a caveat for this architecture family
- [ ] Withdraw the prediction (move to NR-X in paper appendix)
- [ ] No change needed (this is a known edge case)
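
The "Magnitude of disagreement" field in the measurement section asks for a percentage or an absolute figure. A minimal sketch of one way to compute both, assuming the disagreement is taken relative to the predicted value; `disagreement` is a hypothetical helper and not part of TAF, and the numbers are the template's own placeholder examples, with "12K" read as 12000:

```python
def disagreement(predicted: float, measured: float) -> tuple[float, float]:
    """Return (absolute, percent) disagreement of measured relative to predicted."""
    if predicted == 0:
        raise ValueError("percent disagreement is undefined for a zero prediction")
    absolute = measured - predicted
    percent = 100.0 * absolute / predicted
    return absolute, percent


# Placeholder values from the template; reading "12K" as 12000 is an assumption.
absolute, percent = disagreement(predicted=47781, measured=12000)
print(f"absolute: {absolute:+.0f}, relative: {percent:+.1f}%")
```

Reporting the sign as well as the size shows whether the measurement fell short of the prediction or overshot it.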