CY-Sieve Attention Kernel — a falsifiable experiment (NEGATIVE result)

A positional-attention bias derived from the weight-5 Apéry-like binomial sum

$$S_{20}(n) = \sum_{k=0}^{n} \binom{n}{k}^4 \binom{n+k}{k}$$

whose holonomic structure is a Calabi–Yau 3-fold period (an order-4 MUM block; the geometry fixes the long-range decay slope $\log\lambda = 3.762$ and curvature $\beta = 2$). The engineering bet: generate the positional bias on the fly from the order-4 recurrence, costing O(L) HBM instead of an O(L²) table.
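As a sanity check on the quoted slope, the sum can be evaluated exactly with integer arithmetic and its growth rate estimated from consecutive terms. This is a minimal sketch: it uses the defining sum directly rather than the order-4 recurrence (whose coefficients are not given here), and it assumes that β is the power-law exponent in an asymptotic form $S_{20}(n) \sim C\,\lambda^n n^{-\beta}$. The function names are illustrative, not taken from the repository.

```python
from math import comb, log


def s20(n: int) -> int:
    """Exact value of sum_{k=0}^{n} C(n,k)^4 * C(n+k,k) using Python big integers."""
    return sum(comb(n, k) ** 4 * comb(n + k, k) for k in range(n + 1))


def log_growth_estimate(n: int, beta: float = 2.0) -> float:
    """Estimate log(lambda) from two consecutive terms.

    Assumes S20(n) ~ C * lambda**n * n**(-beta), so that
    log S(n+1) - log S(n) = log(lambda) - beta * log(1 + 1/n) + smaller terms.
    Adding beta * log(1 + 1/n) back removes the leading finite-n correction.
    """
    return log(s20(n + 1)) - log(s20(n)) + beta * log(1 + 1 / n)


if __name__ == "__main__":
    print([s20(n) for n in range(4)])          # first few terms of the sequence
    print(round(log_growth_estimate(200), 3))  # should land close to 3.762
```

Direct summation costs O(n) big-integer products per term, so it is only suitable for checking constants, not for generating the bias inside a kernel.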

Code: GitHub — Mirror-Map-Sieve · Math paper: Zenodo 10.5281/zenodo.20747943 · Benchmark data: callensxavier/cy-sieve-attention-benchmark

⚠️ Honest result (2026-06-22, NVIDIA L4): the quality gate KILLED it

This card corrects an earlier version that claimed "zero overhead" and implied a win. Those numbers came from an invalid method (monkey-patching the bias into a frozen pretrained model, which collapses every alternative scheme equally). The correct test — training small GPTs from scratch, one per positional scheme, on real WikiText-2 — gives:

| Scheme | ppl @512 (train) | ppl @1024 (2×) | ppl @2048 (4×) |
|---|---|---|---|
| learned-absolute | 4.22 | 12.10 | 20.82 |
| ALiBi | 10.74 | 11.73 | 11.35 |
| sliding-window | 4.99 | 5.07 | 5.03 |
| CY-Sieve τ-ladder | 11.33 | 12.31 | 12.05 |
| CY-Sieve τ=128 | 6.80 | 7.12 | 7.00 |
| CY-Sieve τ=512 | 4.65 | 6.08 | 10.62 |

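Verdict: KILL (+10.15%). At the training length, the best CY-Sieve perplexity (4.65, τ=512) is worse than the best baseline (4.22, learned-absolute) by more than the pre-committed 5% kill threshold. A plain sliding window also beat every CY-Sieve variant at both extrapolation lengths (5.07 at 1024, 5.03 at 2048). The geometry-fixed slope is too steep for a drop-in positional scheme.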
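The gate arithmetic can be replayed from the table. The sketch below is an assumption about how the gate is evaluated (relative perplexity gap at the training length, best CY-Sieve variant against best baseline); the names `quality_gate` and `best_at` are hypothetical. With the two-decimal table values the gap comes out at about 10.2%, slightly above the reported 10.15%, which presumably was computed from unrounded perplexities.

```python
# Perplexities copied from the results table (rounded to two decimals).
ppl = {
    "learned-absolute":  {512: 4.22,  1024: 12.10, 2048: 20.82},
    "ALiBi":             {512: 10.74, 1024: 11.73, 2048: 11.35},
    "sliding-window":    {512: 4.99,  1024: 5.07,  2048: 5.03},
    "CY-Sieve τ-ladder": {512: 11.33, 1024: 12.31, 2048: 12.05},
    "CY-Sieve τ=128":    {512: 6.80,  1024: 7.12,  2048: 7.00},
    "CY-Sieve τ=512":    {512: 4.65,  1024: 6.08,  2048: 10.62},
}
BASELINES = ("learned-absolute", "ALiBi", "sliding-window")
KILL_THRESHOLD = 0.05  # pre-committed: kill if more than 5% worse than the best baseline


def best_at(schemes, length):
    """Return (name, perplexity) of the lowest-perplexity scheme at a given length."""
    name = min(schemes, key=lambda s: ppl[s][length])
    return name, ppl[name][length]


def quality_gate(train_len=512):
    """Compare the best CY-Sieve variant with the best baseline at the training length."""
    candidates = [s for s in ppl if s not in BASELINES]
    _, best_base = best_at(BASELINES, train_len)
    _, best_cand = best_at(candidates, train_len)
    gap = (best_cand - best_base) / best_base
    return ("KILL" if gap > KILL_THRESHOLD else "PASS"), gap


verdict, gap = quality_gate()
print(verdict, f"{gap:+.2%}")

# Extrapolation: which scheme has the lowest perplexity at each longer length?
for length in (1024, 2048):
    print(length, best_at(list(ppl), length))
```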

What did hold up

  • §4 kernel correctness — PASS. The Triton kernel matches the NumPy reference within FP16 tolerance (4/4 tests).
  • §6 memory claim — confirmed. The on-the-fly bias reads O(L) bytes of HBM vs O(L²) for a materialized table (8192× less at L=16384). But the current unfused kernel is ~4–6× slower than fused dense SDPA — a memory-traffic win, not a latency win. Per the project's reporting rule, with the quality gate failing, these numbers are not a contribution.
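Both points above rest on the same observation: a relative-position bias depends only on the offset i − j, so a length-L vector carries the same information as the L×L table. The NumPy sketch below illustrates that equivalence and the style of parity check involved. It is not the project's Triton kernel: the decay profile in `toy_offset_bias` and its `slope` and `beta` parameters are placeholders chosen for illustration, and the exact byte ratio in practice depends on dtypes and accounting that are not spelled out here.

```python
import numpy as np


def toy_offset_bias(L, slope=0.05, beta=2.0):
    """Hypothetical per-offset bias (linear decay plus log term); NOT the card's formula."""
    d = np.arange(L, dtype=np.float64)
    return -slope * d - beta * np.log1p(d)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend_with_table(q, k, v, bias_vec):
    """Reference path: materialize the full L x L causal bias table."""
    L, dim = q.shape
    i, j = np.indices((L, L))
    table = np.where(j <= i, bias_vec[np.abs(i - j)], -np.inf)
    return softmax(q @ k.T / np.sqrt(dim) + table) @ v, table.size


def attend_on_the_fly(q, k, v, bias_vec):
    """Row-wise path: slice each bias row out of the length-L vector, no table stored."""
    L, dim = q.shape
    out = np.empty_like(v)
    for i in range(L):
        # Keys j = 0..i sit at offsets i, i-1, ..., 0, i.e. bias_vec reversed up to i.
        scores = k[: i + 1] @ q[i] / np.sqrt(dim) + bias_vec[i::-1]
        out[i] = softmax(scores) @ v[: i + 1]
    return out, bias_vec.size


rng = np.random.default_rng(0)
L, dim = 64, 16
q, k, v = (rng.standard_normal((L, dim)) for _ in range(3))
bias_vec = toy_offset_bias(L)

ref, n_table = attend_with_table(q, k, v, bias_vec)
fly, n_vec = attend_on_the_fly(q, k, v, bias_vec)
print("outputs match:", np.allclose(ref, fly))
print("bias entries stored:", n_table, "vs", n_vec)
```

The row-wise loop is deliberately slow and unfused. It only shows that the bias storage drops from L² entries to L without changing the attention output.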

Why publish a negative result?

Because a fast, correct kernel that degrades model quality is a failed kernel, and saying so is the point. The Calabi–Yau geometry is a sound prior for the shape of the bias but not for its value: pinning the attention slope to the growth rate of the $S_{20}$ sequence was the mistake. Redesign directions (a learnable slope initialized from the geometry; a hybrid of an exact local window and a gentle tail; a β=2 ablation) are in the findings writeup.

Reproduce

git clone https://github.com/xaviercallens/Mirror-Map-Sieve.git
cd Mirror-Map-Sieve
pip install -r 4_ai_hardware_attention/requirements-gpu.txt
python 4_ai_hardware_attention/run_gpu_phase.py   # §4 parity + §5 quality + §6 perf

Citation

@misc{callens2026cysieve,
  author = {Callens, Xavier},
  title  = {CY-Sieve Attention: a Calabi--Yau positional bias and its negative quality result},
  year   = {2026},
  url    = {https://huggingface.co/callensxavier/s20-attention-kernel},
  doi    = {10.5281/zenodo.20747943}
}