TropicalGT and ToricGT: Tropical, Toric, and Graph-Token Geometry for Auditable Reasoning

Community Article
Published June 15, 2026

TLDR. ToricGT and TropicalGT are built around three linked breakthroughs. First, TokenGT-style graph tokenization changes what a transformer can see: vertices and edges become tokens, and orthonormal node identifiers make incidence and equality tests available to ordinary dot-product attention. This yields universal approximation of continuous permutation-equivariant graph-to-graph maps, exact finite graph-to-graph interpolation with output slots, threshold-circuit simulation over graph predicates, and tropical region growth controlled by the graph-token count $N_G=n+m$ or the dense order-$k$ count $n^k$. Second, GraphCG-style learned concept directions give a controllable basis for graph edits. When combined with tropical ring attention and graph-of-thought training with GFlowNets, those directions become a multi-axis, graph-conditioned, trajectory-aware generalization of Anthropic’s “Assistant Axis”: instead of monitoring one persona direction in a language model’s residual stream, the model learns many concept directions, connects them to graph edits and reasoning trajectories, and audits when those directions cross tropical fan walls. Third, these graph-to-graph universality and control results point toward a new hardware/software target: not a universal sequence-only transformer ASIC, but graph-token and tropical-ring accelerators for a universal graph-to-graph reasoning architecture that can operate across modalities represented as graphs. The claim is stronger than the universal seq2seq approximation already known for transformers. The model class approximates equivariant graph-to-graph functions, reasons over explicit relational structure and Boolean circuits (which makes it more general and expressive than transformers without the tokenization method), and is designed to generalize out of distribution through incidence-aware tokenization, tropical dynamic-programming heads, and finite toric sidecars.

In other words, we have found a universal architecture for any modality or data type that can be encoded as a graph. It is the vanilla transformer with mild modifications to tokenization and positional encoding, plus the option of tropical (ring) attention, a lifting of standard attention. These changes expand the expressiveness and generalization capacity of vanilla transformers (to Boolean circuits and chip design, for example) without modifying the base architecture. That makes transformer ASICs viable and perhaps prudent, and it equips the researcher with a more powerful and steerable version of Anthropic’s “Assistant Axis”.

ToricGT and TropicalGT are two parts of one research program. TropicalGT is the tropical-first model family: it uses max-plus and min-plus attention, graph tokens, graph-of-thought trajectories, GFlowNet training, and bits-per-byte-safe evaluation. ToricGT is the toric and homological audit branch: it takes finite tropical sidecars and studies them with fans, initial degenerations, toric ideals, vector bundles, sheaf gluing, Koszul complexes, BGG-style certificates, and multiparameter persistence.

The point is not to claim that a Transformer is a toric variety. That would be mathematically wrong. The correct claim is local and finite:

A fixed checkpoint, layer, head, token window, and rationalized probe can define a finite tropical or toric geometry. That geometry can be audited with exact algebraic and polyhedral tools.

The project has a practical aim: make parts of neural reasoning inspectable. A model should not only answer; it should expose which graph tokens, evidence paths, dynamic-programming predecessors, concept directions, or toric cells carried the computation.

1. The TokenGT breakthrough: graph tokenization changes the input universe

A standard sequence transformer receives a sequence. If the input is a graph, the graph must be serialized somehow. A weak serialization can destroy the relational structure. A node-only graph transformer without structural encodings may see a multiset of vertex features but not the edge relation itself.

TokenGT changes the object before learning begins. A graph is represented by vertex tokens and edge tokens. Each token receives structural identifiers. In the order-2 graph case, a vertex token for vertex $i$ carries identifiers like $(P_i,P_i)$, while an edge token for edge $(u,v)$ carries identifiers like $(P_u,P_v)$. The rows $P_1,\ldots,P_n$ are orthonormal node identifiers, so

$$\langle P_i,P_j\rangle=\mathbf 1\{i=j\}.$$

This one equation is the key. It means dot-product attention can test equality of graph indices. It can detect whether a vertex token is incident to an edge token. It can distinguish diagonal vertex tokens from off-diagonal edge tokens. It can build equality-pattern basis tensors.
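
The equality and incidence tests can be sketched in a few lines. This is a minimal illustration, assuming one-hot rows of the identity matrix as the orthonormal identifiers and plain concatenation of the two identifier halves; the helper names `tokenize`, `is_vertex_token`, and `incidence_score` are hypothetical and not taken from TokenGT.

```python
def one_hot(i, n):
    """Row P_i of the n x n identity: one simple choice of orthonormal identifiers."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tokenize(n, edges):
    """Return (kind, index, identifier) tokens: n vertex tokens, then one token per edge."""
    P = [one_hot(i, n) for i in range(n)]
    tokens = [("vertex", i, P[i] + P[i]) for i in range(n)]        # identifier (P_i, P_i)
    tokens += [("edge", (u, v), P[u] + P[v]) for u, v in edges]    # identifier (P_u, P_v)
    return tokens

def is_vertex_token(token):
    """Diagonal test: the two identifier halves coincide, so their dot product is 1."""
    ident = token[2]
    half = len(ident) // 2
    return dot(ident[:half], ident[half:]) == 1.0

def incidence_score(vertex_token, edge_token):
    """<(P_i,P_i),(P_u,P_v)> = 1{i=u} + 1{i=v}: positive iff vertex i touches the edge."""
    return dot(vertex_token[2], edge_token[2])

# Path graph 0 - 1 - 2: three vertex tokens followed by two edge tokens.
tokens = tokenize(3, [(0, 1), (1, 2)])
```

Here `incidence_score(tokens[0], tokens[3])` is positive because vertex 0 is an endpoint of edge (0, 1), while `incidence_score(tokens[0], tokens[4])` is zero. Any other orthonormal basis would give the same dot products.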

For a sparse graph $G=(V,E)$ with $|V|=n$ and $|E|=m$, the graph-token sequence length is

$$N_G=n+m.$$

For dense order-$k$ tokenization, the graph-token count behaves like

$$N_k=n^k.$$

This is not just a bigger sequence length. It is a different input vocabulary. Attention no longer sees only positions; it sees vertices, edges, hyperedges, and incidence relations.

2. Graph-to-graph universality

The paper develops the expressivity consequences of this tokenization. Its first major theorem is a graph-to-graph universal approximation theorem.

For fixed graph size $n$, an order-$(k,\ell)$ graph-to-graph map has the form

$$F:\mathcal X_{n,k}\subset \mathbb R^{[n]^k\times d}\longrightarrow \mathbb R^{[n]^\ell\times d'}.$$

If the graph is unlabeled, the meaningful functions are permutation equivariant:

$$F(\pi\cdot X)=\pi\cdot F(X).$$

TokenGT-style tokenization with order-$k$ identifiers can approximate continuous permutation-equivariant graph-to-graph maps on compact domains. The proof route is important:

  1. Invariant and equivariant graph networks are universal for continuous equivariant graph maps on fixed-size compact domains.
  2. Such networks are built from equivariant linear layers.
  3. Equivariant linear layers decompose into equality-pattern basis tensors.
  4. TokenGT identifiers let attention approximate every equality-pattern basis tensor.
  5. Feed-forward layers supply pointwise nonlinearities.
  6. Output-slot tokens allow node-and-edge or graph-to-graph outputs.

So the theorem is not merely inherited from sequence-transformer universality. It uses the TokenGT construction to recover the equivariant tensor basis.
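
To make step 3 concrete: in the invariant and equivariant graph network literature, a basis of equivariant linear maps from order-$k$ to order-$\ell$ tensors is indexed by equality patterns on the $k+\ell$ index positions, that is, by set partitions of those positions (assuming $n\ge k+\ell$ so that no pattern is empty). The sketch below only enumerates those patterns; the function names are illustrative.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks.

    Each block lists index positions that are forced to be equal.
    """
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into one of the existing blocks ...
        for b in range(len(partition)):
            yield partition[:b] + [[first] + partition[b]] + partition[b + 1:]
        # ... or give it a new block of its own.
        yield [[first]] + partition

def equality_patterns(k, ell):
    """Equality patterns on k input index positions and ell output index positions."""
    positions = [f"in{a}" for a in range(k)] + [f"out{b}" for b in range(ell)]
    return list(set_partitions(positions))

patterns = equality_patterns(2, 2)
print(len(patterns))  # 15 patterns for edge-to-edge (order 2 to order 2) layers
```

The count is the Bell number of $k+\ell$, so the number of basis tensors that attention has to approximate in step 4 depends on the tensor orders, not on the graph size.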

For finite labeled graph domains, the claim becomes stronger. If $\mathcal D$ is a finite set of labeled graphs and output slots are present, a sufficiently large TokenGT-style transformer can exactly interpolate any finite graph-to-graph function on $\mathcal D$. For unlabeled graphs, the target must be isomorphism respecting; otherwise the function is not well-defined on graph isomorphism classes.

This is a genuine step beyond universal seq2seq approximation. Sequence-to-sequence approximation says that a transformer can approximate functions on token sequences. TokenGT graph-to-graph approximation says that, after the right tokenization, a transformer can approximate equivariant relational maps whose inputs and outputs are graphs.

3. Why this matters for Boolean circuits

The same reference paper adapts transformer circuit-complexity results to graph-token inputs.

A graph property can be written in terms of Boolean variables such as adjacency bits

$$a_{uv}\in\{0,1\},$$

vertex-label bits $x_{u,r}$, and edge-label bits $e_{uv,s}$. A threshold circuit over these variables can compute properties such as majority over edges, parity over a fixed finite graph size, fixed-subgraph count thresholds, or combinations of local graph predicates.

TokenGT can simulate nonuniform constant-depth threshold circuits over graph predicates by using gate tokens. Gate tokens attend to the selected vertex and edge tokens, saturated attention aggregates selected inputs, and feed-forward layers threshold the result.

The circuit theorem is nonuniform in graph size. It does not say that one fixed finite model computes every graph property for all $n$. But it does say that for each graph size, polynomially many graph tokens, heads, and hidden units can simulate polynomial-size threshold circuits over explicit graph predicates.
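
A single gate token can be sketched as follows. This is a toy model under stated assumptions, not the construction from the reference paper: saturated attention is modeled as a uniform average over the maximal-score positions, the feed-forward layer is modeled as one comparison, and the names `saturated_attention` and `threshold_gate` are hypothetical.

```python
from fractions import Fraction

def saturated_attention(scores, values):
    """Hard attention that averages uniformly over all maximal-score positions."""
    top = max(scores)
    winners = [v for s, v in zip(scores, values) if s == top]
    return Fraction(sum(winners), len(winners))

def threshold_gate(bits, selected, theta):
    """Gate token: attend to the selected input tokens, then threshold.

    Returns 1 iff at least `theta` of the selected bits are 1.
    `selected` must be a non-empty subset of the keys of `bits`.
    """
    names = list(bits)
    scores = [1 if name in selected else 0 for name in names]
    mean = saturated_attention(scores, [bits[name] for name in names])
    return 1 if mean * len(selected) >= theta else 0

# Adjacency bits a_uv of a 3-vertex graph with edges {0,1} and {0,2}.
adjacency = {(0, 1): 1, (0, 2): 1, (1, 2): 0}
at_least_two_edges = threshold_gate(adjacency, set(adjacency), theta=2)  # 1
is_triangle = threshold_gate(adjacency, set(adjacency), theta=3)         # 0
```

Gate outputs can be fed to further gates in the same way, which is how a constant-depth circuit is layered. The wiring (`selected`) and thresholds (`theta`) are fixed per graph size, which is where the nonuniformity enters.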

This separates TokenGT from “naked” graph transformers. Without structural encodings, a graph transformer may not even see adjacency. With TokenGT identifiers, adjacency and incidence become available to ordinary attention.

This is why the tokenization method is not cosmetic. It is an expressivity upgrade.

4. Tropical geometry of graph-token attention

The tropical expressivity picture begins with zero-temperature attention. A softmax attention score becomes hard routing in the low-temperature limit. For fixed keys, the query space is partitioned into regions where one key wins. This is a power-diagram or normal-fan picture.

A tropical polynomial has the form

$$\psi(h)=\max_{1\le r\le R}\{\langle a_r,h\rangle+b_r\}.$$

The active set is

$$A_\psi(h)=\arg\max_r\{\langle a_r,h\rangle+b_r\}.$$

The lifted Newton polytope is

$$P_\psi=\operatorname{Conv}\{(a_r,b_r):1\le r\le R\}.$$

The hidden state $h$ exposes a face of this polytope by the functional

$$(a,b)\mapsto \langle a,h\rangle+b.$$

For a fixed active subset $A$, the corresponding closed cell is the set of hidden states at which every index in $A$ attains the maximum:

$$C_A= \left\{ h: \langle a_r-a_s,h\rangle+b_r-b_s\ge 0 \text{ for all }r\in A,\ 1\le s\le R \right\}.$$

Thus tropical attention partitions hidden space into polyhedral cells.
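
A small sketch of evaluating a tropical polynomial together with its active set. Exact rational arithmetic is used so that ties are detected reliably; the three terms are an illustrative choice, and `tropical_eval` is a hypothetical helper name.

```python
from fractions import Fraction

def affine_scores(terms, h):
    """Evaluate every affine piece <a_r, h> + b_r of the tropical polynomial."""
    return [sum(a_i * h_i for a_i, h_i in zip(a, h)) + b for a, b in terms]

def tropical_eval(terms, h):
    """Return psi(h) and the active set (indices attaining the maximum)."""
    scores = affine_scores(terms, h)
    value = max(scores)
    active = [r for r, s in enumerate(scores) if s == value]
    return value, active

# psi(h) = max(h_1, h_2, 1/2), written as (a_r, b_r) pairs.
terms = [((1, 0), 0), ((0, 1), 0), ((0, 0), Fraction(1, 2))]

inside_chamber = tropical_eval(terms, (1, 0))   # unique winner: index 0
on_wall = tropical_eval(terms, (1, 1))          # tie between indices 0 and 1
```

Points with a single active index lie in the interior of a chamber. Points with two or more active indices lie on the walls between chambers.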

For TokenGT, the important substitution is

$$N\longmapsto N_G.$$

If a standard sequence transformer has region growth controlled by sequence length $N$, then a TokenGT graph transformer has region growth controlled by graph-token count $N_G=n+m$ or dense order-$k$ count $n^k$.

Under the same genericity assumptions as the tropical transformer expressivity theorem, the paper states a TokenGT region bound of the form

$$\mathcal N_{\mathrm{TokenGT}}(N_G,d,L)=\Theta(N_G^{dL}),$$

where $d$ is embedding dimension and $L$ is depth.

For dense ordinary graphs with all ordered edge tokens, $N_G=n^2+O(1)$, so

$$\mathcal N_{\mathrm{TokenGT}}=\Theta(n^{2dL}).$$

For dense order-$k$ hypergraph tokenization,

$$\mathcal N_{\mathrm{TokenGT}}=\Theta(n^{kdL}).$$

For sparse graph families with $m=O(n)$, the token count is $N_G=\Theta(n)$, so the region count remains $\Theta(n^{dL})$, with the same exponent $dL$ as a sequence model of length $n$. The model still gains relational information and better constants because edge tokens expose graph structure.

5. Tropical attention as dynamic programming

TropicalGT uses this geometry operationally. Max-plus and min-plus attention are natural dynamic-programming primitives.

The max-plus tropical semiring is

$$\mathbb T_{\max}=\mathbb R\cup\{-\infty\}, \qquad a\oplus b=\max(a,b), \qquad a\odot b=a+b.$$

The min-plus tropical semiring is

$$\mathbb T_{\min}=\mathbb R\cup\{+\infty\}, \qquad a\oplus_{\min} b=\min(a,b), \qquad a\odot b=a+b.$$

A shortest-path recurrence is min-plus:

$$d(v)=\min_{u\to v}\{d(u)+w(u,v)\}.$$

A best-evidence or Viterbi-style recurrence is max-plus:

$$s(v)=\max_{u\to v}\{s(u)+\operatorname{score}(u,v)\}.$$

TokenGT makes the external graph available to attention. Edge tokens store endpoints and weights. Vertex tokens store state. A max-plus graph update can be simulated by a two-stage edge-to-vertex computation:

  1. Each edge token $(u,v)$ attends to source vertex $u$ and forms $d_u^{(t)}+w_{uv}$.
  2. Each vertex token $v$ attends over incoming edge tokens and takes the max or min.

For a max-plus update,

$$d_v^{(t+1)}=\max_{(u,v)\in E}\{d_u^{(t)}+w_{uv}\}.$$

A TokenGT transformer with edge tokens can simulate $T$ steps of this recurrence in $T+O(1)$ zero-temperature attention layers. This is the algorithmic meaning of the architecture: graph tokenization supplies the relational substrate, and tropical attention supplies the recurrence operator.
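
The two-stage edge-to-vertex update can be sketched directly on a token list. This is a plain-Python illustration of the recurrence, not an attention implementation. It makes one assumption beyond the text: each vertex keeps its previous state as a candidate, which is equivalent to adding a weight-0 self-loop edge token. The function name `tropical_step` is hypothetical.

```python
import math

def tropical_step(dist, edge_tokens, reduce=min):
    """One recurrence step over (source, target, weight) edge tokens.

    Stage 1: each edge token reads its source state and adds its weight.
    Stage 2: each vertex token reduces (min or max) over its incoming messages.
    """
    messages = [(v, dist[u] + w) for (u, v, w) in edge_tokens]  # stage 1
    new_dist = list(dist)  # previous state kept as a candidate (weight-0 self-loop)
    for v, m in messages:  # stage 2
        new_dist[v] = reduce(new_dist[v], m)
    return new_dist

# Min-plus shortest paths from vertex 0 on a small weighted digraph.
edge_tokens = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]
dist = [0, math.inf, math.inf, math.inf]
for _ in range(3):
    dist = tropical_step(dist, edge_tokens)
print(dist)  # [0, 3, 1, 4]
```

For a max-plus (best-evidence) update, pass `reduce=max` and initialize unreached vertices with `-math.inf`.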

6. Tropical ring attention for long contexts

TropicalGT also uses blockwise tropical evaluation. Partition keys into blocks $B_1,\ldots,B_R$. Define

$$Y_{ic}^{(r)}=\max_{j\in B_r}\{S_{ij}+V_{jc}\}, \qquad Y_{ic}=\max_r Y_{ic}^{(r)}.$$

Then

$$\max_j\{S_{ij}+V_{jc}\} = \max_r\max_{j\in B_r}\{S_{ij}+V_{jc}\}.$$

This is exact. It is not an approximation. The schedule changes, but the function does not.

If each block also returns its local argmax set, the global argmax set is the union of the local argmax sets from those blocks whose local maximum equals the global maximum. Thus ring tropical attention preserves both values and provenance.

This matters for long graph-of-thought contexts. The model can stream over candidate proof branches, retrieved memories, graph paths, or tool outputs while retaining exact max-plus support information.
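
A minimal sketch of the blockwise schedule for one query row $i$ and one output channel $c$, assuming exact (integer) scores so that ties are well defined. The names `block_max` and `ring_tropical` are illustrative; only the running maximum and its argmax set are carried from block to block.

```python
def block_max(S_row, V_col, block):
    """Local max-plus value and local argmax set for one key block."""
    scores = {j: S_row[j] + V_col[j] for j in block}
    best = max(scores.values())
    return best, {j for j, s in scores.items() if s == best}

def ring_tropical(S_row, V_col, blocks):
    """Stream over key blocks, carrying the running max and its argmax set."""
    best, support = float("-inf"), set()
    for block in blocks:
        local_best, local_support = block_max(S_row, V_col, block)
        if local_best > best:
            best, support = local_best, set(local_support)  # new global max
        elif local_best == best:
            support |= local_support                        # tie across blocks
    return best, support

S_row = [3, 1, 4, 1, 5, 2]   # scores S_ij for a fixed query i
V_col = [2, 0, 1, 4, 0, 3]   # values V_jc for a fixed channel c
blockwise = ring_tropical(S_row, V_col, [[0, 1], [2, 3], [4, 5]])
monolithic = ring_tropical(S_row, V_col, [list(range(6))])
```

Both schedules return the same value and the same support set, which is the exactness claim above in miniature.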

7. The toric sidecar: attention provenance as initial degeneration

ToricGT begins from the tropical attention sidecar.

Let

$$N\simeq\mathbb Z^r$$

be a cocharacter lattice, and let

$$M=\operatorname{Hom}(N,\mathbb Z)$$

be the character lattice. Let

$$T_N=\operatorname{Spec}K[M]$$

be an algebraic torus over a nonarchimedean field $K$, with uniformizer $\tau$ and valuation

$$\operatorname{val}(\tau)=1.$$

A rationalized tropical head or audit probe supplies characters

$$a_1,\ldots,a_R\in M$$

and affine candidates

$$\ell_r(u)=\langle a_r,u\rangle+b_r, \qquad u=Ph\in N_{\mathbb R}, \qquad b_r\in\mathbb Q.$$

Define the Laurent polynomial

$$f=\sum_{r=1}^R \tau^{-b_r}\chi^{a_r}\in K[M].$$

The sign convention is essential. TropicalGT uses max-plus scores, while many algebra systems use nonarchimedean min conventions. The bridge is

$$\operatorname{val}(\tau^{-b_r})+\langle -u,a_r\rangle = -b_r-\langle a_r,u\rangle = -\ell_r(u).$$

The initial form keeps the monomials of minimal weight, and minimizing $-\ell_r(u)$ is the same as maximizing $\ell_r(u)$. Therefore the monomials kept by $\operatorname{in}_{-u}(f)$ are exactly the active attention candidates:

$$\operatorname{supp}\operatorname{in}_{-u}(f) = \arg\max_r \ell_r(u).$$

This is the key ToricGT statement:

attention provenance is an initial degeneration.

The tropical hypersurface of $f$ is the tie locus. Unique-max chambers are not part of the hypersurface, but they are part of the attention decision stratification because they record the selected predecessor, proof step, evidence item, or graph-token witness.
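
The sign bridge can be checked numerically on a small example. The sketch below works only with exponents and valuations, not with an actual field $K$; the term list and the function names `active_candidates` and `initial_form_support` are illustrative assumptions.

```python
from fractions import Fraction

def active_candidates(terms, u):
    """Max-plus side: indices r maximizing l_r(u) = <a_r, u> + b_r."""
    scores = [sum(x * y for x, y in zip(a, u)) + b for a, b in terms]
    top = max(scores)
    return {r for r, s in enumerate(scores) if s == top}

def initial_form_support(terms, w):
    """Min-convention side: monomials of f = sum tau^{-b_r} chi^{a_r} whose
    weight val(tau^{-b_r}) + <w, a_r> = -b_r + <w, a_r> is minimal."""
    weights = [-b + sum(x * y for x, y in zip(w, a)) for a, b in terms]
    low = min(weights)
    return {r for r, wt in enumerate(weights) if wt == low}

# (a_r, b_r) pairs with rational offsets b_r.
terms = [((1, 0), Fraction(0)), ((0, 1), Fraction(1, 3)), ((1, 1), Fraction(-1, 2))]

def negate(u):
    return tuple(-x for x in u)

tie_point = (Fraction(1, 3), Fraction(0))   # l_0 = l_1 = 1/3, l_2 = -1/6
generic_point = (Fraction(2), Fraction(1))  # l_2 = 5/2 is the unique maximum
```

At both points the support of the initial form with respect to $-u$ equals the max-plus active set: `{0, 1}` at the tie point and `{2}` at the generic point.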

8. From Laurent polynomials to toric audit envelopes

For multiple heads, output coordinates, or graph-token windows, form a finite family

$$\mathcal F_H=\{f_\alpha:\alpha\in\mathcal A_H\}\subset K[M].$$

If an ideal sidecar is enabled, define

$$I_H=\langle \mathcal F_H\rangle.$$

Set

$$Y_H=V(I_H)\cap T_N.$$

The tropicalization is

$$\operatorname{Trop}(Y_H) = \{u\in N_{\mathbb R}: \operatorname{in}_{-u}(I_H)\text{ contains no monomial}\}.$$

Choose a rational fan $\Sigma$ refining the active normal decomposition and any relevant finite Gröbner decomposition. The fan defines a toric variety

$$X_\Sigma.$$

Let $\overline Y_H$ denote the closure of $Y_H$ in $X_\Sigma$. If

$$|\Sigma|=\operatorname{Trop}(Y_H)$$

and the multiplication map

$$T_N\times\overline Y_H\to X_\Sigma$$

is flat and surjective, then the closure is a tropical compactification. If the multiplication map is smooth and surjective, it is a schön compactification.

Those hypotheses are serious. Ordinary neural training windows do not automatically satisfy them. Without those checks, $X_\Sigma$ is a finite toric audit envelope, not a global theorem about the entire network.

The audit envelope still gives useful data:

  • active-cell labels;
  • orbit-cone incidence;
  • one-dimensional-cone incidence;
  • initial forms;
  • initial ideals;
  • toric ideal relations;
  • bend data;
  • Hilbert series;
  • Betti tables;
  • free-resolution summaries.

9. Toric ideals and exact algebraic sidecars

Given exponent vectors $a_1,\ldots,a_m\in\mathbb Z^r$, let $A$ be the $r\times m$ integer matrix with columns $a_i$, and define the monomial map

$$\varphi_A:k[x_1,\ldots,x_m]\to k[t_1^{\pm1},\ldots,t_r^{\pm1}], \qquad \varphi_A(x_i)=t^{a_i}.$$

The toric ideal is

$$I_A=\ker\varphi_A = \langle x^u-x^v: u,v\in\mathbb N^m,\ Au=Av\rangle.$$

A binomial relation $x^u-x^v$ is exact when $Au=Av$. A training loop may use a differentiable residual such as

$$\mathcal L_{\rm binom} = \frac{1}{|\mathcal R|} \sum_{(i,j,k,l)\in\mathcal R} (z_i+z_j-z_k-z_l)^2,$$

but the exact sidecar must verify the exponent relation, for example

$$a_i+a_j=a_k+a_l.$$

This is a central project rule:

Exact sidecars certify finite mathematical objects. Differentiable losses are companions, not replacements.

The same distinction applies to fan cells, Klyachko filtrations, Čech cocycles, Koszul complexes, persistence modules, and Chow/Minkowski weights.
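
The split between an exact check and a differentiable companion can be sketched as two separate functions. The exponent vectors below (the four lattice points of the unit square) and the function names are illustrative assumptions. The exact check uses integer arithmetic only, while the loss works on floating-point scores $z_i$.

```python
def exact_relation(A, i, j, k, l):
    """Exact sidecar check: a_i + a_j == a_k + a_l over the integers."""
    return all(p + q == r + s for p, q, r, s in zip(A[i], A[j], A[k], A[l]))

def binomial_loss(z, relations):
    """Differentiable companion: mean squared residual z_i + z_j - z_k - z_l."""
    return sum((z[i] + z[j] - z[k] - z[l]) ** 2 for i, j, k, l in relations) / len(relations)

# Exponent vectors a_0..a_3; a_0 + a_3 = a_1 + a_2 gives the binomial x_0 x_3 - x_1 x_2.
A = [(0, 0), (1, 0), (0, 1), (1, 1)]
candidates = [(0, 3, 1, 2), (0, 1, 2, 3)]

# Only relations that pass the exact check are allowed into the loss.
certified = [rel for rel in candidates if exact_relation(A, *rel)]
loss = binomial_loss([0.0, 1.0, 2.0, 3.5], certified)
```

A small loss value never certifies a relation by itself. In this sketch the exact check acts as the gate, and the loss is only computed over relations that have already passed it.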

10. GraphCG directions and the Assistant Axis

Anthropic’s “Assistant Axis” work identifies a direction in activation space associated with a model’s default assistant-like persona. The Anthropic research post describes the axis as a way to monitor drift away from the helpful, professional Assistant persona, and the paper defines it as a contrast vector between mean default Assistant activations and mean role-playing activations. The paper reports that this direction aligns with the leading component of persona space and that steering or capping activations along it can stabilize behavior in some settings.

That is a useful one-dimensional activation-space control. But it is narrow:

  • it is primarily persona-oriented;
  • it is a single dominant axis;
  • it is measured in language-model residual-stream activations;
  • it monitors or steers default-assistant behavior;
  • it does not by itself give a graph-to-graph function model;
  • it does not expose dynamic-programming support, graph-token incidence, or tropical fan-wall crossings.

GraphCG-style training generalizes this idea in a different direction. Instead of one assistant-persona axis, GraphCG learns multiple semantic directions $d_i$ and step sizes $\alpha$ for graph editing. A graph $x$ is encoded as $z=f(x)$, edited as

$$\bar z_{i,\alpha}=h(z,d_i,\alpha),$$

and decoded as

$$\bar x=g(\bar z_{i,\alpha}).$$

A linear edit has the form

$$h(z,d_i,\alpha)=z+\alpha d_i.$$

A residual edit has the form

$$h_\psi(z,d_i,\alpha)=z+\alpha d_i+r_\psi(z,d_i,\alpha).$$

In TokenGT and TropicalGT, this becomes a graph-to-graph edit system. The directions are not merely persona axes; they can correspond to graph concepts, proof modes, molecular properties, retrieval styles, safety constraints, tool-use regimes, or behavior-control states.

The reference paper proves that GraphCG-style objectives do not enlarge the formal architecture class by themselves. They change which parameters optimization prefers. For linear GraphCG, the reachable latent set for fixed input $x$ lies in

$$f_\theta(x)+\operatorname{span}\{d_1,\ldots,d_r\}.$$

So GraphCG imposes a low-dimensional semantic-flow structure on a universal graph-to-graph architecture.

11. Why GraphCG plus tropical attention is better than a single axis

The Assistant Axis is a powerful demonstration that high-level behavior can align with a direction in activation space. But ToricGT/TropicalGT needs a richer object.

A reasoning model does not only need to stay “assistant-like.” It must choose evidence, move along graph paths, edit graph states, preserve equivariance, avoid contradictory claims, select proof branches, retrieve memories, call tools, and decide when to stop.

GraphCG-style basis directions can be many-dimensional:

$$D=\{d_1,\ldots,d_r\}.$$

Each direction can be tied to a concept or edit family. Tropical ring attention supplies hard supports and margins for the local computation. Graph-of-thought GFlowNet training supplies trajectory-level credit assignment.

This gives a richer control stack:

  • GraphCG directions say which semantic edit or concept flow is being applied.
  • Tropical attention says which token, edge, predecessor, or evidence item won locally.
  • Tropical fan cells say which piecewise-linear regime the latent edit occupies.
  • GFlowNet training says which graph-of-thought trajectories receive probability mass.
  • Toric sidecars say which finite algebraic cell, initial form, or relation is active.

In the zero-temperature piecewise-linear skeleton, a linear GraphCG edit path

$$z(\alpha)=f_\theta(x)+\alpha d_i$$

moves through a polyhedral fan. The decoded trajectory

$$\gamma_{x,i}(\alpha)=g_\phi(f_\theta(x)+\alpha d_i)$$

is piecewise affine. Its breakpoints occur when the line crosses fan walls.

is piecewise affine. Its breakpoints occur when the line crosses fan walls.

So GraphCG directions are not just vectors; they become navigational axes through the tropical complex of the decoder.
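
Wall crossings along a linear edit path can be located exactly when one decoder coordinate is modeled as a single tropical polynomial. That modeling choice is an assumption of this sketch, as are the example terms and the function names; inputs are integers or `Fraction` values so that the arithmetic stays exact.

```python
from fractions import Fraction
from itertools import combinations

def line_pieces(terms, z0, d):
    """Restrict each affine piece to the path: score_r(alpha) = c_r + alpha * m_r."""
    def dot(a, x):
        return sum(p * q for p, q in zip(a, x))
    return [(dot(a, z0) + b, dot(a, d)) for a, b in terms]

def wall_crossings(terms, z0, d, lo, hi):
    """Sorted alphas in [lo, hi] where two pieces of different slope tie for the max."""
    pieces = line_pieces(terms, z0, d)
    crossings = set()
    for (c1, m1), (c2, m2) in combinations(pieces, 2):
        if m1 == m2:
            continue  # parallel along this direction: the two pieces never cross
        alpha = Fraction(c2 - c1, m1 - m2)
        top = max(c + alpha * m for c, m in pieces)
        if lo <= alpha <= hi and c1 + alpha * m1 == top:
            crossings.add(alpha)  # the tie happens on the upper envelope
    return sorted(crossings)

# Decoder coordinate max(-z_1, 1, z_1), edited from z0 = (0, 0) along d = (1, 0).
terms = [((-1, 0), 0), ((0, 0), 1), ((1, 0), 0)]
breakpoints = wall_crossings(terms, (0, 0), (1, 0), -2, 2)
print(breakpoints)  # [Fraction(-1, 1), Fraction(1, 1)]
```

Between consecutive breakpoints the active piece is fixed, so the decoded value is affine in $\alpha$ on each interval. The pair of pieces that ties at $\alpha=0$ is discarded because it lies below the constant piece there.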

Compared with the Assistant Axis, this is better for ToricGT/TropicalGT because:

  1. It is multi-axis, not single-axis.
  2. It is graph-conditioned, not only persona-conditioned.
  3. It is tied to graph-to-graph edits, not only text persona drift.
  4. It interacts with explicit tokenized structure: vertices, edges, proofs, memories, and tools.
  5. It can be audited by tropical fan crossings.
  6. It can be trained through GFlowNet trajectory rewards.
  7. It supports concept disentanglement while preserving graph equivariance.
  8. It can represent controlled transformations across modalities that are encoded as graphs.

The honest caveat is that GraphCG does not guarantee unique semantic disentanglement. Contrastive objectives identify condition-discriminative factors only up to transformations that preserve the relevant density ratios. Orthogonality and sparsity regularizers help, but they do not solve identifiability by themselves.

That is why the project combines GraphCG with tropical supports, graph-of-thought trajectories, exact sidecars, and BPB-gated ablations.

12. GFlowNets and graph-of-thought reasoning

A graph-of-thought trajectory is a sequence or DAG of reasoning states: expand, retrieve, verify, merge, refine, compress, stop. GFlowNets train a policy to assign probability mass to trajectories according to a reward.

In TropicalGT, a trajectory state can carry:

  • graph-token hidden states;
  • active tropical supports;
  • top-two margins;
  • GraphCG direction summaries;
  • retrieved memory atoms;
  • persistence signatures;
  • verifier labels;
  • toric sidecar cells.

A bundle-guided trajectory reward can have the schematic form

$$R_{\rm bundle}(\tau) = R_{\rm task}(\tau) -\eta_1\sum_{(a,b)\in\tau}\mathcal L_{\rm transport}(a,b) -\eta_2\sum_{(a,b,c)\in\tau}\mathcal L_{\rm cocycle}(a,b,c) -\eta_3\sum_{a\in\tau}\mathcal L_{\rm flat}(a).$$

This does not claim that a neural reasoning trajectory is literally a vector bundle. It says that trajectories that reuse information across charts should pay for inconsistent transports, while task reward and BPB decide whether the auxiliary is useful.
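
The schematic reward translates into a few lines once the per-item penalties have been computed. How the transport, cocycle, and flatness penalties are obtained is left open here; the function name, the argument layout, and the default weights are hypothetical placeholders.

```python
def bundle_reward(task_reward, transport, cocycle, flat, eta=(0.1, 0.1, 0.1)):
    """R_bundle = R_task - eta1*sum(transport) - eta2*sum(cocycle) - eta3*sum(flat).

    transport: penalties for pairs (a, b) along the trajectory
    cocycle:   penalties for triples (a, b, c) along the trajectory
    flat:      penalties for single states a along the trajectory
    """
    eta1, eta2, eta3 = eta
    return task_reward - eta1 * sum(transport) - eta2 * sum(cocycle) - eta3 * sum(flat)

# One trajectory with two transport terms, one cocycle term, and one flatness term.
reward = bundle_reward(10.0, [1.0, 1.0], [2.0], [4.0], eta=(0.5, 0.25, 0.125))
print(reward)  # 8.0
```

Setting all weights to zero recovers the plain task reward, which is the natural baseline for a matched ablation.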

The interplay with GraphCG is important. A GFlowNet can explore graph-of-thought branches; GraphCG directions can steer the latent graph edits; tropical attention can certify local predecessor choices; toric sidecars can audit finite algebraic structure. This is a much richer control system than a single assistant-persona vector.

13. Why this points toward transformer ASICs

The reference paper’s graph-to-graph universality result has hardware implications.

The usual transformer hardware story is sequence-centric. Accelerators optimize dense matrix multiplication, softmax attention, and sequence batches. That is useful, but it assumes that the universal substrate is a sequence.

ToricGT and TropicalGT suggest a different substrate: graph-token computation with tropical and toric sidecars.

If many modalities can be represented as graphs, then a universal reasoning architecture should operate on graph-to-graph maps:

  • text as token graphs;
  • molecules as atom-bond graphs;
  • circuits as component-wire graphs;
  • proofs as dependency graphs;
  • programs as AST or control-flow graphs;
  • retrieval contexts as citation graphs;
  • multimodal scenes as object-relation graphs;
  • tool-use workflows as state-transition graphs;
  • memory as typed relational records.

A universal graph-to-graph approximator is therefore a stronger target than a universal seq2seq approximator. It can still process sequences, because a sequence is a path graph. But it can also process relational objects without flattening them into brittle strings.

This motivates future transformer ASIC work that is not just “more sequence attention.” The next hardware step should support:

  • vertex and edge token streams;
  • incidence-aware attention patterns;
  • sparse graph-token routing;
  • tropical max-plus and min-plus reductions;
  • exact argmax provenance;
  • blockwise ring attention;
  • GFlowNet trajectory sampling;
  • low-rank concept-direction steering;
  • sidecar telemetry export;
  • efficient graph-BPB evaluation.

The hardware thesis is that reasoning accelerators should not be designed only around universal sequence modeling. They should be designed around universal graph-to-graph modeling, with tropical reductions as first-class operations.

14. Why graph-to-graph universality is stronger than seq2seq universality

Sequence-to-sequence universality says a transformer can approximate continuous functions on compact sets of sequences, under the relevant positional or equivariant assumptions. That is powerful, but it does not automatically preserve graph symmetries or expose graph relations.

Graph-to-graph universality says the model can approximate maps whose inputs and outputs are relational structures. The output may itself be a graph: edited molecules, proof graphs, tool plans, reasoning DAGs, memory updates, or circuit transformations.

TokenGT supplies output slots for graph outputs. This matters because a graph-level vector is not enough to represent a graph-to-graph map. The model needs places to write vertices, edges, labels, and relations.

A sequence model can imitate graph reasoning after serialization, but it must learn or infer the relational structure through the serialization. TokenGT moves the relational structure into the input basis. That reduces the burden on learning and improves out-of-distribution prospects.

The OOD argument is not magic. It is structural:

  • incidence is hard-coded by identifiers;
  • equivariance is preserved under relabeling;
  • edge tokens expose graph relations;
  • tropical heads implement recurrence-like updates;
  • GraphCG directions organize semantic edits;
  • GFlowNets train trajectory distributions;
  • toric sidecars audit finite polyhedral regimes.

This combination should generalize better when the test graph is relabeled, when edge structure matters, when a recurrence extends to larger graphs, or when a concept direction transfers across graph instances.

15. Current status of the project

The current program has several pieces already specified or implemented in papers and planning notes:

  • TokenGT-style graph tokenization with vertex and edge tokens.
  • Orthonormal identifier theory for equality and incidence tests.
  • Universal approximation of continuous equivariant graph-to-graph maps.
  • Exact finite graph-to-graph interpolation with output slots.
  • Threshold-circuit simulation over graph predicates.
  • Tropical region bounds controlled by graph-token count.
  • Tropical attention heads for max-plus and min-plus dynamic programs.
  • Ring tropical attention for exact long-context max evaluation.
  • GraphCG-style concept directions and fan-slice interpretation.
  • GFlowNet graph-of-thought training objectives.
  • Finite tropical-to-toric sidecars.
  • Explicit max-plus/min-initial sign convention.
  • Toric ideal, Gröbner, Hilbert, Betti, and free-resolution audit language.
  • Klyachko/vector-bundle and Čech/sheaf training signals.
  • Multiparameter persistence audits with GUDHI and Macaulay2.
  • BPB and graph-BPB promotion gates.

The project is not finished. The most important implementation needs are:

  • automatic sidecar export from real checkpoints;
  • stable JSON schemas for CAS certificates;
  • richer Sage and Macaulay2 certificate generation;
  • exact teacher supports for algorithmic graph tasks;
  • better graph-of-thought datasets;
  • matched ablations for every auxiliary loss;
  • hardware-aware tropical reduction kernels;
  • artifact-byte accounting for any exported telemetry;
  • clearer dashboards linking geometry metrics to BPB and OOD performance.

16. Failure modes

The project has real risks.

A geometric auxiliary can look elegant and fail to improve the model. A GraphCG direction can be non-identifiable. A toric relation set can be too shallow. A persistence summary can be stable but irrelevant. A fan-cell metric can correlate with dataset artifacts. A CAS certificate can certify the wrong sidecar if projection metadata are wrong. A graph-token model can still overfit to finite graph sizes. A hardware design can accelerate the wrong bottleneck.

The mitigation is staged promotion:

  1. exact small-window certification;
  2. telemetry-only logging;
  3. low-weight auxiliary training;
  4. matched ablation;
  5. BPB and graph-BPB validation;
  6. OOD graph-slice validation;
  7. artifact-byte accounting;
  8. export only if useful.

Geometry is not allowed to win by vibes. It earns its place by improving held-out behavior, exact auditability, or interpretable control without damaging the primary score.

17. A minimal end-to-end example

Suppose a TropicalGT head chooses a proof predecessor. Its candidate scores are

$$\ell_r(u)=\langle a_r,u\rangle+b_r.$$

The active predecessor is

$$r^\star=\arg\max_r \ell_r(u).$$

The margin is

$$\Delta(u)= \min_{s\ne r^\star} \left[ \ell_{r^\star}(u)-\ell_s(u) \right].$$

After rationalization, define

$$f=\sum_r \tau^{-b_r}\chi^{a_r}.$$

Then $\operatorname{in}_{-u}(f)$ retains $\chi^{a_{r^\star}}$ if the maximizer is unique. If two candidates tie, the initial form retains both monomials and $u$ lies on the tropical hypersurface.
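
As a worked check, with numbers chosen purely for illustration, take a rank-one lattice and three candidates with $(a_1,b_1)=(1,0)$, $(a_2,b_2)=(2,-1)$, and $(a_3,b_3)=(0,\tfrac14)$:

$$\ell_1(u)=u,\qquad \ell_2(u)=2u-1,\qquad \ell_3(u)=\tfrac14,\qquad f=\chi^{1}+\tau\,\chi^{2}+\tau^{-1/4}\chi^{0}.$$

At $u=\tfrac12$ the scores are $(\tfrac12,0,\tfrac14)$, so $r^\star=1$ and $\Delta(u)=\tfrac14$. The weights $\operatorname{val}(\tau^{-b_r})+\langle -u,a_r\rangle=-\ell_r(u)$ are $(-\tfrac12,0,-\tfrac14)$, the minimum is attained only at $r=1$, and

$$\operatorname{in}_{-u}(f)=\chi^{1}.$$

At $u=1$ the scores are $(1,1,\tfrac14)$, the margin is zero, and

$$\operatorname{in}_{-u}(f)=\chi^{1}+\chi^{2},$$

so $u=1$ lies on the tropical hypersurface.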

A GraphCG direction $d_i$ may steer the reasoning state along a concept direction. The edit path

$$z(\alpha)=z_0+\alpha d_i$$

moves through the decoder’s tropical fan. Wall crossings correspond to changes in the affine decoder law. A GFlowNet policy assigns probability mass to graph-of-thought trajectories using task reward, verifier reward, and geometry-aware penalties.

The full loop is:

  • graph tokens expose incidence;
  • tropical attention selects predecessors;
  • GraphCG directions steer concept flows;
  • GFlowNets train trajectory distributions;
  • toric sidecars audit finite algebraic structure;
  • BPB and graph-BPB decide what survives.

18. Conclusion

ToricGT and TropicalGT are not attempts to replace neural networks with symbolic algebra. They are attempts to make selected pieces of neural computation finite, inspectable, and mathematically auditable.

The TokenGT result is the foundation: vertex and edge tokenization with orthonormal identifiers makes ordinary attention aware of graph incidence, equality patterns, and relational structure. This yields graph-to-graph universality, finite interpolation, threshold-circuit simulation, and tropical region growth governed by graph-token count.

TropicalGT adds the algorithmic layer: max-plus and min-plus heads, ring attention, active supports, margins, graph-of-thought trajectories, and BPB-safe validation.

ToricGT adds the audit layer: Laurent-polynomial sidecars, initial degenerations, toric ideals, fans, toric audit envelopes, vector-bundle filtrations, sheaf gluing, BGG-style skeletons, Koszul complexes, and multiparameter persistence.

GraphCG plus tropical ring attention plus GFlowNet reasoning generalizes the idea behind the Assistant Axis from a single persona direction to a multi-axis, graph-conditioned, trajectory-aware control geometry. The result is not just a safer or more steerable language model; it is a path toward universal graph-to-graph reasoning systems.

That is why the ASIC implication matters. If the universal substrate is graph-to-graph reasoning rather than sequence-to-sequence prediction, then future transformer hardware should accelerate graph-token routing, incidence-aware attention, tropical reductions, provenance extraction, and trajectory search.

The project’s honest claim remains local and finite: a sampled head window can define an exact sidecar. That sidecar can be audited. Its compact outputs can supervise training. Its losses remain auxiliary. Its deployment value must be earned.

References

  • Kim et al., “Pure Transformers are Powerful Graph Learners” (TokenGT). arXiv:2207.02505
  • Su and Liu, “Expressivity of Transformers: A Tropical Geometry Perspective.”
  • Yun et al., “Are Transformers Universal Approximators of Sequence-to-Sequence Functions?” arXiv:1912.10077
  • Merrill, Sabharwal, and Smith, “Saturated Transformers are Constant-Depth Threshold Circuits.” arXiv:2110.16249
  • Hahn, “Theoretical Limitations of Self-Attention in Neural Sequence Models.” arXiv:1906.06755
  • Liu et al., “GraphCG: Unsupervised Discovery of Steerable Factors in Graph Generation.”
  • Anthropic, “The assistant axis: situating and stabilizing the character of large language models.” Anthropic research post
  • Lu, Gallagher, Michala, Fish, and Lindsey, “The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models.” arXiv:2601.10387
  • Hashemi, Pasque, Teska, and Yoshida, “Tropical Attention.”
  • Cox, Little, and Schenck, Toric Varieties. American Mathematical Society, 2011.
  • Miller and Sturmfels, Combinatorial Commutative Algebra. Springer, 2005.
  • Maclagan and Sturmfels, Introduction to Tropical Geometry. American Mathematical Society, 2015.
  • Tevelev, “Compactifications of subvarieties of tori.” arXiv:math/0412329
  • Maclagan and Rincón, “Tropical ideals.” arXiv:1609.03838
  • Fulton and Sturmfels, “Intersection theory on toric varieties.” arXiv:alg-geom/9403002
  • Katz, “Tropical intersection theory from toric varieties.” arXiv:0907.2488
  • Améndola et al., “Computing tropical varieties in Macaulay2.” arXiv:1710.10651
  • Kaveh and Manon, “Toric vector bundles, valuations and tropical geometry.” arXiv:2304.11211
  • GUDHI Project documentation. https://gudhi.inria.fr/python/latest/
  • Macaulay2 documentation. https://macaulay2.com/
