--- name: ❌ Refute a prediction about: TAF prediction contradicted by empirical measurement title: '[Refute] ' labels: refuted --- ## Hash of analysis being refuted `#__________` ← paste the hash from the original issue's title ## Original issue Link: #__ ## TAF prediction What did TAF say: - Verdict: __ - Key number: __ (e.g. d_horizon = 47781) ## My empirical measurement What actually happened: - Verdict observed: __ - Key number measured: __ (e.g. NIAH collapse at L=12K, well before predicted ceiling) - Magnitude of disagreement: __ (% or absolute) ## Setup - Hardware: __ - Software: __ (versions matter!) - Random seed(s) tried: __ - Number of trials: __ ## Method Detailed enough that a third party can reproduce: ```bash # Step-by-step commands ``` ```python # Or full Python script ``` ## Hypothesis on why TAF was wrong - [ ] Out-of-regime (e.g. extrapolation beyond validity zone) - [ ] Architecture-specific quirk not captured in formulas - [ ] Model has unusual training data - [ ] Bug in TAF formulas - [ ] Other: __ Detailed thoughts: ## Suggested update to TAF If applicable, what should the framework do differently? - [ ] Update validity bounds for this recipe - [ ] Add a caveat for this architecture family - [ ] Withdraw the prediction (move to NR-X in paper appendix) - [ ] No change needed (this is a known edge case)