CoDA-GQA-L Neural Database
Ingest documents, save neural states, and ask questions about them
Large language models have a memory problem. A 70B model serving 128K tokens of context burns 160 GB on KV cache alone. That cache grows linearly with sequence length, which makes long-context inference expensive. It is also stateless: every new session re-reads the entire document from scratch.
CoDA-GQA-L (Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks) replaces the standard KV cache with a fixed-size, 384-slot buffer per layer that never grows, regardless of input length.
| Context Length | Standard KV (70B) | CoDA-GQA-L | Compression |
|---|---|---|---|
| 2K | 2.5 GB | 120 MB | 21× |
| 32K | 40 GB | 120 MB | 341× |
| 128K | 160 GB | 120 MB | 1,365× |
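The compression column follows from simple arithmetic: the standard cache grows linearly while the CoDA-GQA-L buffer stays fixed. The sketch below reproduces the ratios under one assumption that is inferred from the table rather than stated in it, namely a per-token cost of 1.25 MiB (160 GB spread over 131,072 tokens). The function name is illustrative only.

```python
# Assumption: per-token KV cost inferred from the 128K row (160 GiB / 128Ki tokens).
MIB_PER_TOKEN = 160 * 1024 / (128 * 1024)  # 1.25 MiB per token
CODA_MIB = 120  # fixed buffer size, taken from the table


def compression_ratio(context_tokens: int) -> float:
    """Standard KV cache size divided by the fixed CoDA-GQA-L buffer size."""
    standard_mib = context_tokens * MIB_PER_TOKEN
    return standard_mib / CODA_MIB


ratios = {
    label: compression_ratio(k * 1024)
    for label, k in [("2K", 2), ("32K", 32), ("128K", 128)]
}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f}x")
```

Rounded down, these ratios match the 21×, 341×, and 1,365× entries in the table.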
The buffer has three segments:
The key insight: this bounded state is serializable. You can torch.save() it, come back a week later, torch.load() it, and query the document without re-reading a single token. We call this pattern stateful neural databases.
The Stateful Neural Database demo on Hugging Face Spaces showcases this workflow in three steps:
1. Paste or upload a document. The model (Qwen3-4B with CoDA adapters) processes it through bounded attention layers and produces a fixed-size neural state. The state size is the same whether your document is 500 tokens or 50,000.
2. The state serializes to a .pt file (~61 MB at 4B scale). Save it to the built-in library with a label, or download it for later.
3. Load any saved state (from the library dropdown or by uploading a .pt file) and ask questions. The original document is never re-read. Each query deep-copies the state, so multiple questions don't interfere with each other.
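The ingest, save, and query steps can be sketched with a toy state. This is not the demo's code: the dictionary layout, the round-robin update rule, and the `ingest` and `query` helpers are all hypothetical, and the standard-library `pickle` module stands in for `torch.save()` / `torch.load()` so the example runs without PyTorch. What it does illustrate is a state whose size does not depend on document length, and per-query deep copies that leave the saved state untouched.

```python
import copy
import pickle
import tempfile
from pathlib import Path

NUM_SLOTS = 384  # slots per layer, as stated in the article


def ingest(token_ids):
    """Build a toy fixed-size state (a single layer, hypothetical layout)."""
    state = {"slots": [0] * NUM_SLOTS, "tokens_seen": 0}
    for token_id in token_ids:
        write(state, token_id)
    return state


def write(state, token_id):
    # Toy update rule: overwrite slots round-robin. The real model uses
    # learned write gates and memory banks, not a ring buffer.
    state["slots"][state["tokens_seen"] % NUM_SLOTS] = token_id
    state["tokens_seen"] += 1


def query(saved_state, question_ids):
    # Work on a deep copy so the question's tokens never touch the saved state.
    working = copy.deepcopy(saved_state)
    for token_id in question_ids:
        write(working, token_id)
    return working["tokens_seen"]


# Ingest once, then serialize (pickle stands in for torch.save here).
state = ingest(range(50_000))
path = Path(tempfile.mkdtemp()) / "doc_state.pkl"
path.write_bytes(pickle.dumps(state))

# Later: load and ask several independent questions.
restored = pickle.loads(path.read_bytes())
seen_after_q1 = query(restored, [1, 2, 3])
seen_after_q2 = query(restored, [4, 5])
```

Because each query starts from a fresh copy, the second question sees no trace of the first, and `restored` can be reused indefinitely.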
The demo ships with three bundled example documents so you can skip the ingest step and jump straight to querying. It runs on ZeroGPU with carefully tuned duration budgets to stay within free-tier limits.
The stateful neural database pattern decouples document processing from document querying:
This isn't retrieval-augmented generation. There's no vector database, no chunk-and-embed pipeline, no retrieval step. The model's attention layers are the database — they've compressed the document into a fixed-size representation that supports direct querying.
Three ideas make this work:
Differential attention via orthogonal rotation. Building on Microsoft's Diff Transformer, CoDA uses a single query projection with a learned per-head rotation to derive signal and noise streams. The output is signal minus gated noise, computed in one scaled dot-product attention (SDPA) call. A factorial ablation shows this reduces the bounded-memory penalty by 5.7× compared to standard GQA.
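In symbols, the idea can be written roughly as follows. The first line is the Diff Transformer form, with two separate query and key projections. The remaining lines are only one reading of the description above, not the project's exact formulation: the rotation $R_h$, the gate $g_h$, and the shared $K$ and $V$ are assumed notation introduced here for illustration.

```latex
\begin{aligned}
\text{Diff Transformer:}\quad
  & \mathrm{out} = \Big(\operatorname{softmax}\big(\tfrac{Q_1 K_1^\top}{\sqrt{d}}\big)
      - \lambda\,\operatorname{softmax}\big(\tfrac{Q_2 K_2^\top}{\sqrt{d}}\big)\Big) V \\
\text{CoDA (one reading):}\quad
  & q^{\text{sig}}_h = q_h, \qquad q^{\text{noise}}_h = R_h\, q_h, \qquad R_h^\top R_h = I \\
  & \mathrm{out}_h = \operatorname{Attn}\big(q^{\text{sig}}_h, K, V\big)
      - g_h \,\operatorname{Attn}\big(q^{\text{noise}}_h, K, V\big)
\end{aligned}
```

Under this reading, the noise query costs no extra projection, only a per-head rotation of the query that already exists.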
Value-routing for memory banks. Keys carry RoPE positional encodings, so their cosine similarity depends on position: the same word at position 100 and at position 5,000 can look nearly orthogonal in key space. Memory banks route on values instead, which are RoPE-free and carry pure semantic content. That is what makes deduplication and exponential-moving-average (EMA) blending work.
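The position dependence of keys is easy to see with a few lines of plain Python. The sketch below applies a standard rotary embedding (adjacent-pair rotation, base 10,000) to the same content vector at two positions and compares cosine similarities. The 64-dimensional all-ones vector and the helper names are arbitrary choices for illustration, not taken from the project.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply a standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each adjacent pair (i, i+1) rotates at its own frequency.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


content = [1.0] * 64  # the "same word", before any positional encoding

# Keys: identical content, rotated for two different positions.
key_sim = cosine(rope(content, 100), rope(content, 5000))

# Values: no rotation applied, so identical content stays identical.
value_sim = cosine(content, content)

print(f"key similarity:   {key_sim:.3f}")
print(f"value similarity: {value_sim:.3f}")
```

The value similarity is exactly 1, while the key similarity falls well below it, so a deduplication check based on key similarity would treat two occurrences of the same content as different entries.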
Two-phase training. Phase 1 trains unbounded differential attention to baseline quality. Phase 2 switches to bounded cache with gradient flow through evictions, so the write gate learns what to keep and what to discard. Without Phase 2, bounded evaluation is catastrophic (PPL jumps from 5.75 to 2,464).
Load one of the bundled examples and ask a question. Or paste your own document, ingest it, save the state, and come back to it later.
The full source — attention module, memory banks, Triton kernels, training pipeline, and this demo — is available at github.com/anthony-maio/CoDA-GQA-L under the MIT license.
Built by @anthonym21. The CoDA-GQA-L mechanism is described in detail in the project's architecture deep dive.