CY-Sieve Attention Kernel — a falsifiable experiment (NEGATIVE result)
A positional-attention bias derived from the weight-5 Apéry-like binomial sum whose holonomic structure is a Calabi–Yau 3-fold period (an order-4 MUM, i.e. maximally unipotent monodromy, block). The geometry fixes the long-range decay slope at $\log\lambda = 3.762$ and the curvature at $\beta = 2$. The engineering bet: generate the positional bias on the fly from the order-4 recurrence, so the kernel reads O(L) bytes from HBM instead of an O(L²) table.
Code: GitHub — Mirror-Map-Sieve · Math paper: Zenodo 10.5281/zenodo.20747943 · Benchmark data: callensxavier/cy-sieve-attention-benchmark
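A minimal sketch of the memory bet, assuming the bias depends only on the query-key distance d = i − j under a causal mask (an assumption, not stated on this card). The linear `distance_bias` shape and its `slope` are hypothetical placeholders: the actual CY-Sieve values come from the order-4 recurrence, which is not reproduced here. The point is only that any row of the L×L table can be rebuilt from a length-L vector.

```python
import math

def distance_bias(length, slope=0.05):
    """O(L) vector of biases indexed by distance d = i - j.
    Placeholder linear decay; NOT the CY-Sieve recurrence."""
    return [-slope * d for d in range(length)]

def materialized_table(vec):
    """O(L^2) causal table: entry [i][j] = vec[i - j] if j <= i, else -inf."""
    n = len(vec)
    return [[vec[i - j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def on_the_fly_row(vec, i):
    """Row i of the same table, rebuilt from the O(L) vector alone."""
    n = len(vec)
    return [vec[i - j] for j in range(i + 1)] + [-math.inf] * (n - i - 1)

vec = distance_bias(8)
table = materialized_table(vec)
# Every row of the table is recoverable without storing the table.
print(all(table[i] == on_the_fly_row(vec, i) for i in range(len(vec))))
```

In a real kernel the row would be built per tile inside the attention loop rather than as a Python list; this sketch shows the indexing only.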
⚠️ Honest result (2026-06-22, NVIDIA L4): the quality gate KILLED it
This card corrects an earlier version that claimed "zero overhead" and implied a win. Those numbers came from an invalid method: monkey-patching the bias into a frozen pretrained model, which degrades every alternative scheme alike and so cannot tell them apart. The correct test trains small GPTs from scratch on real WikiText-2, one per positional scheme, and gives:
| scheme | ppl @512 (train length) | ppl @1024 (2×) | ppl @2048 (4×) |
|---|---|---|---|
| learned-absolute | 4.22 | 12.10 | 20.82 |
| ALiBi | 10.74 | 11.73 | 11.35 |
| sliding-window | 4.99 | 5.07 | 5.03 |
| CY-Sieve τ-ladder | 11.33 | 12.31 | 12.05 |
| CY-Sieve τ=128 | 6.80 | 7.12 | 7.00 |
| CY-Sieve τ=512 | 4.65 | 6.08 | 10.62 |
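Verdict: KILL (+10.15%). At the training length, the best CY-Sieve variant (τ=512, ppl 4.65) trails the best baseline (learned-absolute, ppl 4.22) by more than the pre-committed 5% kill threshold. A plain sliding window also extrapolates best, holding ppl ≈ 5.0 at 2× and 4× the training length, where CY-Sieve τ=512 degrades to 10.62. The geometry-fixed slope is too steep for a drop-in positional scheme.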
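The gate arithmetic can be re-derived from the table. The sketch below assumes the gate compares the best CY-Sieve perplexity against the best baseline perplexity at the training length (512), which is what the two quoted figures correspond to; the function name and the "PASS" label are illustrative, not taken from the repository.

```python
# Perplexity at the training length (512), copied from the table above.
PPL_AT_TRAIN_LEN = {
    "learned-absolute": 4.22,
    "ALiBi": 10.74,
    "sliding-window": 4.99,
    "CY-Sieve tau-ladder": 11.33,
    "CY-Sieve tau=128": 6.80,
    "CY-Sieve tau=512": 4.65,
}
KILL_THRESHOLD = 0.05  # pre-committed: more than 5% worse means KILL

def quality_gate(ppl, threshold=KILL_THRESHOLD):
    """Relative gap between best candidate and best baseline, plus verdict."""
    candidates = [v for k, v in ppl.items() if k.startswith("CY-Sieve")]
    baselines = [v for k, v in ppl.items() if not k.startswith("CY-Sieve")]
    gap = (min(candidates) - min(baselines)) / min(baselines)
    return gap, ("KILL" if gap > threshold else "PASS")

gap, verdict = quality_gate(PPL_AT_TRAIN_LEN)
print(f"{gap:+.2%} -> {verdict}")  # +10.19% -> KILL
```

The rounded table values give about +10.19%; the +10.15% reported above was presumably computed from the unrounded perplexities. Either way the gap is roughly twice the threshold.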
What did hold up
- §4 kernel correctness — PASS. The Triton kernel matches the NumPy reference within FP16 tolerance (4/4 tests).
- §6 memory claim — confirmed. The on-the-fly bias reads O(L) bytes of HBM vs O(L²) for a materialized table (8192× less at L=16384). But the current unfused kernel is ~4–6× slower than fused dense SDPA — a memory-traffic win, not a latency win. Per the project's reporting rule, with the quality gate failing, these numbers are not a contribution.
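As a back-of-the-envelope check on the memory claim, the sketch below counts bias bytes for a materialized table versus a per-distance vector. The element sizes are assumptions chosen for illustration (an FP16 table and an FP32 vector); this card does not state them. With equal element sizes the ratio would simply be L.

```python
def bias_bytes(seq_len, table_elem_bytes=2, vector_elem_bytes=4):
    """Bytes for an L x L bias table vs a length-L bias vector.
    Element sizes are hypothetical: FP16 table, FP32 vector."""
    table = seq_len * seq_len * table_elem_bytes
    vector = seq_len * vector_elem_bytes
    return table, vector, table / vector

table, vector, ratio = bias_bytes(16384)
print(f"{table / 2**20:.0f} MiB vs {vector / 2**10:.0f} KiB, ratio {ratio:.0f}x")
# 512 MiB vs 64 KiB, ratio 8192x
```

Under these assumed element sizes the ratio matches the 8192× quoted above at L=16384; other dtype choices would shift it by a constant factor without changing the O(L) vs O(L²) scaling.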
Why publish a negative result?
Because a fast, correct kernel that degrades model quality is a failed kernel, and saying so is the point. The Calabi–Yau geometry is a sound prior for the shape of the bias but not for its value: pinning the attention slope to the sequence's growth rate was the mistake. Redesign directions (a learnable, geometry-initialized slope; an exact local window with a gentle tail; a β=2 ablation) are in the findings writeup.
Reproduce
git clone https://github.com/xaviercallens/Mirror-Map-Sieve.git
cd Mirror-Map-Sieve
pip install -r 4_ai_hardware_attention/requirements-gpu.txt
python 4_ai_hardware_attention/run_gpu_phase.py # §4 parity + §5 quality + §6 perf
Citation
@misc{callens2026cysieve,
author = {Callens, Xavier},
title = {CY-Sieve Attention: a Calabi--Yau positional bias and its negative quality result},
year = {2026},
url = {https://huggingface.co/callensxavier/s20-attention-kernel},
doi = {10.5281/zenodo.20747943}
}