An autoregressive transformer language model, asked for the next token, sometimes performs a lookup and sometimes computes something genuinely new. Projective Incidence Calculus (PIC) is a theory that makes that distinction precise and, more ambitiously, identifies the model’s irreducible “computing” part with a forty-year-old idea from symbolic AI: Alan Bundy’s incidence calculus. This guide builds the theory from the ground up for a reader who knows linear algebra and a little probability but is not assumed to know either transformers or incidence calculus. We first develop just enough transformer mathematics (the residual stream, logits, the softmax, direct logit attribution) and then work through classical incidence calculus with full numerical examples. We then explain, conceptually, visually, and mathematically, what PIC is: a single object viewed at two “temperatures” as a probabilistic logic, a piece of tropical geometry, and an executable Datalog program. We walk through the key proofs that hold the theory together, ground every abstraction in measurements from a real model (Qwen2.5 and the Pythia ladder, via the fieldrun toolchain), and close with the open questions the theory leaves on the table.
This section collects the notation used throughout, so that a reader who has not seen some of it can refer back. Nothing here is needed before it appears in context; skim it now and return as required. Each symbol is also re-explained at first use, and the most important ones are restated in the Glossary (Appendix B). Standing conventions. Lowercase italic letters such as r, dj , Uv denote vectors in a real space Rd (we do not bold them); the same letters subscripted are still vectors unless they appear as a single coordinate. The index j ranges over sources (the model’s components: the embedding, each attention head, each MLP), and v, w range over tokens in the vocabulary V . Capital Lv , cvj , Gvw are real scalars. We write := for “is defined as.” The one symbol to fix first: ⟨·, ·⟩. The angle brackets ⟨u, v⟩ denote the inner product (the generalised dot product) of two vectors: a single number measuring how much they point the same way. For ordinary real vectors it is
so it is large and positive when u, v are aligned, zero when orthogonal, and negative when A dot used as an argument marks an empty slot: ⟨·, Uv ⟩ is not a number but a function, the rule “feed me any vector x and I will return the number ⟨x, Uv ⟩.” Writing the dot is just a way to name that function without inventing a letter for its input; the same object could be written x 7→ ⟨x, Uv ⟩. (Such a “fix one slot, leave the other open” inner product is called a linear functional: it eats a vector and outputs a scalar, linearly.) So Lv = ⟨r, Uv ⟩ + bv is what you get by feeding the specific residual vector r into the slot of ⟨·, Uv ⟩ and adding the bias. (Watch one reuse: in Theorem 6 the notation P ⟨E⟩ means “the probability of event E,” where the brackets merely delimit a described event rather than pair two vectors.)
| Symbol | Meaning (section of first use) |
|---|---|
| Spaces and vectors | |
| ℝd | the d-dimensional real representation space (§3) |
| ⟨u, v⟩ | inner product / generalised dot product of u and v (§3) |
| ‖u‖ | Euclidean length (norm) of u, ‖u‖ = √⟨u, u⟩ (§3) |
| r | the residual-stream vector at the final position (§3) |
| dj | the write of source j into the residual stream, r = Σj dj (§3) |
| Uv | the unembedding direction (output direction) of token v (§3) |
| bv | the (optional) output bias for token v (§3) |
| The readout | |
| Lv | the logit (score) of token v, Lv = ⟨r, Uv⟩ + bv = Σj cjv (§3) |
| cjv | source j's vote for token v, cjv = ⟨dj, Uv⟩ (DLA) (§3) |
| softmax(L)v | exp(Lv) / Σw exp(Lw), the output probability of v (§3) |
| arg maxv | the token attaining the largest value (greedy prediction) (§3) |
| Gvw | Gram matrix of token directions, Gvw = ⟨Uv, Uw⟩ (§3) |
| Diagnostics | |
| PR | participation ratio, the effective number of contributing sources (§3) |
| µt | readout multiplicity: how many sources' own arg max is t (§3) |
| Δ | winning margin Lt − Lv* (v* is the runner-up) (§3) |
| Dj | differential incidence cjt − cjv* = ⟨dj, Ut − Uv*⟩ (§6) |
| Logic and incidence | |
| I | a finite set of incidences (possible worlds / samples) (§4) |
| i(A) | the incidence set of proposition A, i(A) ⊆ I (§4) |
| P(A) | probability of A, recovered as |i(A)| / |I| (§4) |
| |·| | cardinality of a set (or absolute value of a number) (§4) |
| Semiring / temperature | |
| ⊕, ⊗ | the "add" and "multiply" of a semiring (§5) |
| T | temperature: T = 1 gives softmax, T → 0 gives greedy arg max (§5) |
| ⊕v | semiring sum over tokens (e.g. maxv in the tropical case) (§5) |
Ask a large language model for the capital of France and it answers “Paris.” It is tempting to say the model has memorised a fact and simply looked it up. But producing “Paris” as the next token under a real prompt is not a clean lookup: internally the decision is a mixture of many partial contributions, and tightening the context can push the answer out of anything resembling a stored table and into a regime where the model is genuinely assembling the answer on the fly. This raises a sharp, empirical question. For any given next-token decision, is the model
(a) retrieving, reproducing the output of a single symbolic rule (an n-gram, an induction copy, a grammar constraint);
(b) selecting, choosing among a small set that a rule has already proposed; or
(c) composing, producing something no single rule proposed, an irreducible computation?
(ii) a linear readout into an inner-product geometry, Lv = ⟨r, Uv ⟩ + bv (this is what supplies the Gram matrix G); and (iii) a softmax over a finite outcome set V . None of (i)–(iii) is autoregressive, and none is even specific to transformers, any model meeting them inherits the static theory. What is specifically autoregressive is the empirical content of Sections 7: the retrieve/select/compose split, the route fractions, the participation ratio PR ≈ 45, the scale-growing composed core, and the ablation results. These are measured on next-token prediction with causal attention, and the very notions of “a decision” and “its sources dj ” are fixed by the autoregressive forward pass. Accordingly we say transformer only for architectural facts (the residual stream, attention and MLP writes) and autoregressive wherever an empirical or decoding claim is made; the generalisation to non-autoregressive models (e.g. diffusion) is an explicit open direction, treated in Section 8.
Fix a model with hidden width d and a vocabulary V of tokens. As a transformer processes a sequence, each token position carries a vector in Rd that is repeatedly updated by attention heads and feed-forward (MLP) blocks. The decisive architectural fact, the one that makes the whole theory possible, is that these updates are additive. Each component writes its output into a shared running sum called the residual stream. If r ∈ Rd is the residual vector at the final position after the last layer, then
an inner product (plus an optional bias bv ). The predicted token is the one with the largest logit, t = arg maxv Lv , and the full probability distribution is the softmax
j dj and the logit (2) is linear in r, the logit itself splits into one contribution per
The scalar cjv is how much component j “votes” for token v. This exact, bias-free decomposition is direct logit attribution (DLA), and it is the bridge between the continuous machinery of the network and the discrete, logical reading PIC will give it: each component j is a source casting a real-valued vote cjv for each proposition (token) v, and the model’s decision is driven by the column sums Lv = j cvj .
P t 2 , an effective count of how many sources
meaningfully contribute to the winning logit (a vote spread evenly over k sources gives PR ≈ k; a vote dominated by one source gives PR ≈ 1);a • the normalised margin ∆/∥Ut − Uv∗ ∥ with ∆ = Lt − Lv∗ , which Section 6 shows is the exact Euclidean distance from r to the decision boundary between t and v ∗ .
The empirical surprise is that PR is route-invariant: it is nearly the same for retrieved, selected and composed tokens alike, so no single circuit ever “decides” a token by dominating the sum. Whatever separates the three routes, it is not vote magnitude. The value of PR is, however, an architecture-shape parameter rather than a universal constant: it is ≈ 42–49 on the Qwen models that anchor most of our examples (we will write “PR ≈ 45” for those), but it ranges more widely across the Pythia ladder (13 → 27 → 63 → 35 from 70M to 1B), tracking how many circuits a given architecture has to spread the vote over. Treat 45 as a fact about these models, not a law.
The participation ratio is a borrowed metric: the same ( x)2 / x2 is the inverse participation ratio used in physics to measure how localised a state is, and the effective dimensionality used in neuroscience to count active modes (e.g. Litwin-Kumar et al. 2017; see Further reading in §10).
One more idea closes the primer. Suppose two tokens t and v are near-synonyms, their unembedding directions point almost the same way, Ut ≈ Uv . Then any evidence for t is almost automatically evidence for v: the two logits move together. Conversely if Ut and Uv are orthogonal, evidence for one says nothing about the other. The matrix of all these pairwise is the Gram matrix of the unembedding frame. It is the geometric carrier of the fact that, for a transformer, the “probabilities” of competing tokens are not independent; they are coupled through G. This non-independence is exactly the non-truth-functionality that incidence calculus was invented to handle, and G is the object that will let us handle it. We turn to incidence calculus now and return to G in Section 5.
Paris Lyon bicycle
Read it off: the diagonal is 1 because each unit direction is perfectly aligned with itself; the large off-diagonal GParis,Lyon = 0.95 couples the two French cities, so pushing the model toward one nudges it toward the other; and GParis,bicycle = 0.00 leaves those two effectively independent. That single off-diagonal 0.95 is exactly why the probabilities of “Paris” and “Lyon” cannot be set independently, the whole non-truth-functional story in one number.
Incidence calculus was introduced by Alan Bundy in 1985 to do probabilistic reasoning correctly in a setting where naive probabilistic logic gives wrong answers. The motivating defect is worth stating carefully because the entire PIC programme is built on repairing it.
A logic is truth-functional if the truth value of a compound proposition is a function of the truth values of its parts: A ∧ B is true exactly when A is true and B is true, full stop. Probability has no such property. Knowing P (A) and P (B) does not determine P (A ∧ B):
for any fixed f.
If A = B then P (A ∧ B) = P (A); if A and B are mutually exclusive then P (A ∧ B) = 0; if independent then P (A)P (B). Same marginals, three different answers. The missing information is how the events overlap, and no amount of bookkeeping with the marginal numbers alone can
Bundy’s move is to stop treating probability as the primitive object and treat overlap as primitive. Fix a finite set I of incidences: think of them as possible worlds, or sampled situations. To each proposition A attach the set of incidences in which A holds,
The logical connectives become honest set operations on these incidence sets:
Probability is then recovered as the relative size of an incidence set,
A truth-functional guess would have had no way to produce 1/3: the independent guess gives 23 · 23 = 49 , the mutually-exclusive guess gives 0. Both are wrong, and they are wrong because they discard the overlap i(A) ∩ i(B). Now move the same probabilities to a different overlap, i(B ′ ) = {1, 2, 3, 4} = i(A): same P (A), P (B ′ ), but now P (A ∧ B ′ ) = 46 = 23 . The marginals are identical in both scenarios; only the incidences distinguish them.
Notice what |i(A) ∩ i(B)| is: if we represent each proposition by its incidence indicator vector an inner product. Classical incidence calculus already carries a non-diagonal “Gram matrix” Gvw = |i(v) ∩ i(w)| counting overlaps. The single conceptual step from Bundy’s calculus to PIC is to let these indicator vectors become arbitrary directions in an inner-product space rather than {0, 1}-vectors over a fixed world set, and to notice that a transformer has handed us exactly such directions and exactly such a Gram matrix already, in the unembedding frame of Section 3. That is the projective generalisation, and it is the subject of the next section.
We now have both halves of the bridge. From the transformer side (Section 3) we have sources dj , propositions Uv , votes cvj = ⟨dj , Uv ⟩, an additive logit Lv = j cvj , and a Gram matrix Gvw = ⟨Uv , Uw ⟩. From the logic side (Section 4) we have incidences, propositions identified with incidence vectors, set-operation connectives, probability as overlap, and an overlap Gram matrix. PIC is the identification of these two pictures.
Definition 1 (The PIC dictionary). Projective Incidence Calculus reads the composition core of a transformer as an incidence calculus in which incidence sets are replaced by directions in an inner-product space:
| Bundy's incidence calculus | Projective Incidence Calculus | in the model |
|---|---|---|
| incidence set i(A) ⊆ I | direction Uv in ℝd | unembedding row of token v |
| |i(A)| / |I| | Lv = ⟨r, Uv⟩ + bv | token logit |
| overlap |i(A) ∩ i(B)| | Gvw = ⟨Uv, Uw⟩ | Gram of the unembedding frame |
| a source / sample | dj, with vote cjv = ⟨dj, Uv⟩ | a head / MLP / embedding write |
| ∧, ∨ (set ops) | weighted threshold Σj wj xj > θ | the composed connective |
| P(A) = |i(A)| / |I| | softmax(L)v | output probability |
The word “projective” marks the key liberalisation: in Bundy’s calculus the incidence of a proposition is a set and overlap is an intersection; in PIC propositions are directions and overlap is the kernel value Gvw . Because propositions now share a continuous inner-product space, their incidences overlap intrinsically, near-synonyms have Gvw close to its maximum, mutually exclusive outcomes have Gvw = 0, and the non-truth-functional structure is carried by G rather than by an explicit list of shared worlds. When G is diagonal (all outcomes mutually exclusive) PIC collapses back to Bundy’s classical disjoint-outcome case exactly. PIC is therefore a strict generalisation: it adds the off-diagonal of G, the part that couples competing tokens.
Before the three interpretations, here is the empirical picture the theory must explain: every next-token decision falls into one of three routes, and the routes are stable across model scale.
Figure 1: The measured three-way route split. Retrieval (Retrieved + Selected) dominates every model, but the composed band, the irreducible computation, grows monotonically with scale against a fixed retrieval store (7.8%→14.8% up the Pythia ladder), even as the underlying geometry stays scale-stable. PIC is the theory of that red band. Two facts about Figure 1 drive the whole theory. First, the split is not a magnitude distinction: the participation ratio PR ≈ 45 is the same for all three routes, so no circuit “wins” by shouting loudest. Second, the routes separate cleanly on two other, measurable axes, a decision margin (geometry) and a readout multiplicity (logic). Those two axes are what the three interpretations
The central structural claim of PIC is that the composition core is a single mathematical object that can be read three ways, and that which reading you get is governed by a single knob, a
sums and products
the softmax distribution
the greedy arg max
Reading the same program in the log-semiring gives the probabilities the model samples from; reading it in the tropical semiring gives the single token greedy decoding picks. Nothing about the program changes, only the arithmetic it is interpreted under. This is the machinery behind all three interpretations below and behind Theorem 5.
At T = 1 this is ordinary log-sum-exp (the soft, smooth “add” of the log-semiring, the one that yields the softmax). As T → 0 it becomes
the hard “add” of the tropical semiring (the one that yields the greedy arg max). Maslov dequantization is exactly this limit: the smooth log-semiring deforms continuously into the tropical (max, +) semiring as the temperature drops to zero. The name is by analogy with physics: T plays the role of a Planck-like parameter, and T → 0 is the “classical limit” in which the soft, quantum-like superposition of many tokens collapses onto the single sharpest one, just as a cooling physical system settles into its lowest-energy ground state. Theorem 5 makes the limit rigorous via a squeeze (the Maslov sandwich); the takeaway here is simply that softmax and arg max are not two algorithms but two ends of one temperature-controlled deformation.
One kernel G a semiring-weighted program Π
T = 1 (log-semiring) probabilistic incidence calculus; softmax recovered as incidence frequency
T = 0 (tropical semiring) arg max decision surface = Laguerre power diagram of {Uv } (greedy decode)
III. Computation any T (one program) semiring-weighted Datalog Π; provenance evaluation = inference
Figure 2: The unification. The composition core is one object, a semiring-weighted program Π carrying the Gram kernel G, read at three settings of a single temperature knob. At T = 1 it is a probabilistic logic whose output probabilities are softmax; at T = 0 it is a tropical geometry whose decision surface is a power diagram; as an executable artifact it is a Datalog program. Maslov dequantization is the formal T → 0 limit linking the two temperatures. Interpretation I: Logic (T = 1). Sources S (the circuits) are vectors dj ; propositions V (the tokens) are directions Uv . The vote cvj = ⟨dj , Uv ⟩ is the direct logit attribution, and aggregated evidence is the residual r = j dj . Where Bundy’s incidence of a proposition is a set i(v), PIC’s is the linear functional ⟨·, Uv ⟩, and the overlap of two propositions is the kernel value Gvw , not a set intersection. Combination is signed linear accumulation in the exponent, Lv = j cvj , which is equivalently a product of experts, each source reweighting a proposition’s mass by exp(cvj ). The decision is a weighted threshold (Boolean ∩/∪ become j wj xj > θ), and, as Theorem 1 will show, the output probabilities come out exactly as incidence frequencies, i.e. the softmax, parameter-free.
“Product of experts” vs. the “Mixture of Experts” you may have heard of The phrase product of experts sounds like the Mixture of Experts (MoE) in modern large models, and the shared word “experts” hides a real difference. Both combine several opinions into one distribution, but they combine them in opposite ways. Product of experts (PoE) multiplies the experts’ distributions and renormalises, p(v) ∝ k log pk
(v) − log Z. An outcome k pk (v). Taking logs, the log-scores add: log p(v) = survives only if every expert grants it some mass, so PoE behaves like a logical AND: it intersects constraints and sharpens the distribution. This is precisely what the residual stream does, Lv = j cvj is a sum of log-votes, so by Theorem 1 the model’s readout is a product of experts (the “experts” being the circuits dj ). Mixture of Experts (MoE) instead averages the experts’ distributions with gating weights from a learned router, p(v) = k gk pk
(v) with k gk = 1. An outcome is likely if any weighted expert grants it mass, so a mixture behaves like a soft OR: it unions options and hedges. In today’s transformers “MoE” names an architectural trick: a router sends each token to its top-k feed-forward expert subnetworks (the experts the models in §7 page in from disk), saving computation.
multiply (×), log-scores add
AND / intersection / sharpen
no, all experts always contribute
yes, selects top-k experts
the circuits dj (heads, MLPs)
dedicated FFN subnetworks
The two are complementary, not rival, and an MoE model uses both at once: the router selects which expert subnetworks run (a mixture step), and then whatever components fire write additively into the residual stream and are read out as a product of experts. MoE picks the panel; PoE tallies its votes.
Interpretation II: Geometry (T = 0). Take the hard decision arg maxv (⟨r, Uv ⟩ + bv ). The bv ⊗ xUv = max ⟨r, Uv ⟩ + bv
is a tropical polynomial in r (in the (max, +) semiring, a ⊕ b = max(a, b) and a ⊗ b = a + b). Its linear regions, the cells of r-space on which a fixed token wins, tile space into a Laguerre power diagram of the sites {Uv }. To unpack that name: start from the everyday Voronoi diagram, which carves a map into the region closest to each site, like the coverage zones of cell towers or the catchment areas of shops. A power diagram (also called a Laguerre diagram) is the same idea but each site carries a weight, so the boundaries shift toward the lighter sites; ordinary Voronoi is the special case of equal weights. Here the sites are the token directions {Uv }, the weights encode the biases and lengths (bv , ∥Uv ∥2 ), and the cell a residual vector r lands in names the token the model predicts. The wall between two adjacent cells, where two tokens tie for the lead, is what tropical geometry calls the tropical hypersurface (just the set of points where the max in M (r) is achieved by two tokens at once). This is the picture of Figure 3, and its facet distances are exactly the normalised margins of Section 3: a residual sitting deep inside its cell is a confident, large-margin decision, and one hugging a wall is a near-tie.
Interpretation III: Computation (executable). The same core is a finite relational computation: a semiring-weighted Datalog program Π. Propositions are facts with directions Uv and Gram G; sources are clauses (an induction head, for instance, is the recursive clause next(T):-match_prefix(P),follows(P,T)); the retrievable fragment is compact stratified clauses and the composed fragment is a dense aggregate. Evaluating Π under a provenance semiring is inference, and the temperature is the semiring choice: the log-semiring gives the softmax distribution, the tropical semiring gives the greedy decode.
• Uv : a site, the unembedding (output) direction of token v Ut winner, Uv∗ runner-up; Ua , Ub , Uc other tokens • r: the residual vector (current state) shaded area: a cell, where one token wins the arg max grey edge: a facet, where two tokens tie
Figure 3: Interpretation II made concrete. The residual-space decision surface is the Laguerre power diagram of the unembedding directions {Uv } with weights (bv , ∥Uv ∥2 ): each cell is the set of residual vectors r for which a given token wins the arg max. A retrieved token sits deep inside its cell (large margin); a composed token sits close to a facet (small margin). The normalised margin ∆/∥Ut − Uv∗ ∥ is exactly the perpendicular Euclidean distance from r to the nearest facet The empirical payoff of Figure 3 is that this facet distance is measurable and monotone across the three routes: retrieved tokens sit far inside their cells, composed tokens sit near facets. Geometry, not magnitude, is one of the two axes that separate the routes. The other axis, readout multiplicity, is logical and is the subject of the next section’s key proofs.
The theory’s claims are not posited; they are recovered as consequences of the single additive structure Lv = j cvj together with the Gram matrix G. We walk through the load-bearing results. Each is elementary: this is part of the point: the conceptual leap is in the identification, after which the mathematics is short. (Each result below corresponds to a theorem in the source paper and has been transcribed into a proof skeleton checkable by an interactive theorem prover; we give the human-readable argument.)
This is the result that most directly inherits Bundy’s “probability as proportion of worlds,” and it is the cleanest justification for calling the core an incidence calculus at all. Theorem 1 (Recovered probability). Maintain a measure over propositions with a uniform base M0 > 0, and let each source j reweight every proposition v’s mass multiplicatively by exp(cvj ).
Then the normalised mass of v is exactly the softmax of the logits:
The Gibbs/softmax measure is recovered as a PIC incidence frequency, parameter-free, and is equivalently the distribution of the i.i.d.-Gumbel arg max over {Lv } (the Gumbel-max trick). Proof. After all sources act, the mass of v is the base times the product of the per-source exp(cvj ) = M0 exp cvj = M0 exp(Lv ).
The total mass is w m(w) = M0 w exp(Lw ), pulling the common base M0 out of the sum (this uses only distributivity). Since M0 > 0 it is nonzero and cancels:
A natural worry: maybe a token wins because many sources vote for it. The next result says the raw count of voting sources is causally inert; only the totals matter. Theorem 2 (Cardinality-inertness). Under the projective pairing the decision depends only on the column totals Lv = j cvj (hence on the pairwise differences ∆ and the differential incidences Dj ), and is invariant to the readout multiplicity µt , the number of individual sources whose own arg max is t. Two source tensors with equal column sums induce the same arg max even when their per-source arg max counts differ. Proof. The decision is t = arg maxv Lv , a function of the totals {Lv } alone. If c and c′ satisfy j cj for every v, then Lv = Lv for all v and the two arg max sets coincide. The multiplicity µt = #{j : arg maxv cj = t} never enters L, so it cannot affect the decision.
arg max c1 = A, arg max c2 = A ⇒ µA = 2 (both sources already pick A)
arg max c1 = B, arg max c2 = C ⇒ µA = 0 (no source picks A)
Both source tables have the identical column sums L = (10, 6, 6), so they produce the identical softmax and the identical winner, A, by a margin of ∆ = 4. Yet the multiplicities could hardly be more different: in Scenario 1 both sources already vote A (µA = 2, the textbook “retrieved” look), while in Scenario 2 one source insists on B, the other on C, and neither proposes A (µA = 0, the “composed” look where A is the winner of a sum that is the winner of no summand). The count of supporters swung from two to zero and the decision did not move at all. That is cardinality-inertness: only the totals L vote, and µA is a label we read off afterwards, never a cause. This matters because µt is precisely what would distinguish “retrieved” (some single source already picks t, µt ≥ 1) from “composed” (µt = 0, no single source picks t). The theorem says µt is a readout property, a diagnostic label, not the causal lever; empirically the causal variable is the pivotality/margin, and µt is only a proxy for where the margin sits.
The differential incidence Dj = ctj − cvj = ⟨dj , Ut − Uv∗ ⟩ is the amount by which source j pushes the t-versus-v ∗ margin. The theorem says competition hardness is read directly off the frame Gram: when two outcomes are near-synonyms, evidence for one cancels in the difference (common mode), and the genuine competition lives in ρ. This is the structural fact that the whole non-truth-functionality story rests on.
so the amount it actually moves the t-vs-v ∗ margin is the differential incidence
Now watch what the runner-up’s coherence ρ does to that same physical source:
orthogonal: full strength half survives near-synonyms: almost all cancels
The source delivers “3 units of evidence for t” in every row, yet the margin it buys collapses from 3 to 0.15 as t and v ∗ line up. The reason is geometric and is exactly Theorem most any unit-strength source can contribute to the t-vs-v ∗ margin is ∥Ut −Uv∗ ∥ = 2(1 − ρ) (Cauchy–Schwarz), so the separation of the two outcomes is a hard budget on how much differential evidence is even available. When the outcomes are near-synonyms that budget is tiny, the shared part of the evidence is common mode (it lifts t and v ∗ together and cancels in the difference), and only the sliver orthogonal to the shared direction can separate them.
Theorem 4 (Weighted-threshold expressivity). There exist composed conclusions (µt = 0, no single source’s arg max is t) that are the arg max of the weighted sum j cvj yet are not expressible in the Horn / ∩–∪ fragment of classical incidence calculus. A two-source, three-outcome witness: Then L0 > L1 and L0 > L2 (the sum prefers outcome 0), while source 1 prefers outcome 1 (c11 > c01 ) and source 2 prefers outcome 2 (c22 > c02 ). Outcome 0 is the winner of the sum but the arg max of no single source. Proof. Direct arithmetic: L0 = 2 + 2 = 4, L1 = 3 + 0 = 3, L2 = 0 + 3 = 3, so L0 > L1 and L0 > L2 . And c11 = 3 > 2 = c01 , c22 = 3 > 2 = c02 . Hence µ0 = 0: no single source selects 0, yet the weighted threshold j cvj does. This is exactly the operational definition of emergence, “the arg max of a sum that is the arg max of no summand”, realised by a weighted-threshold connective but by no singleton sufficient sub-conjunction.
The two geometric facts of Figure 3 are propositions, not pictures. Proposition 1 (Cells are a Laguerre power diagram). The linear regions of the max-logit M (r) = maxv (⟨r, Uv ⟩ + bv ) are the Laguerre power diagram of the sites {Uv } with weights (bv , ∥Uv ∥2 ): the power-distance difference between two sites equals −2 times the score difference, so the cell of minimum power distance is exactly the arg max token. Proof sketch. Polarise the squared distances ∥r − Uv ∥2 and ∥r − Uw ∥2 as in Theorem 3. Subtracting, the ∥r∥2 terms cancel and one obtains, with weight ωv = ∥Uv ∥2 + 2bv ,
The left side is the difference of power distances; the right is −2 times the difference of scores. Minimum power distance therefore coincides with maximum score, which is the arg max token. Proposition 2 (Margin is facet distance). The normalised margin (Lt − Lv∗ )/∥Ut − Uv∗ ∥ is the exact signed Euclidean distance from r to the t–v ∗ facet of the tropical hypersurface. Proof. By linearity ⟨r, Ut −Uv∗ ⟩ = ⟨r, Ut ⟩−⟨r, Uv∗ ⟩, so the facet numerator ⟨r, Ut −Uv∗ ⟩+(bt −bv∗ ) equals Lt − Lv∗ = ∆. The set {r : ⟨r, Ut − Uv∗ ⟩ + (bt − bv∗ ) = 0} is the bisecting hyperplane (the facet) with normal Ut − Uv∗ ; the signed distance from a point to such a hyperplane is the linear form evaluated at the point divided by the norm of the normal, i.e. ∆/∥Ut − Uv∗ ∥.
The unification of Figure 2 is itself a theorem: the same program Π, evaluated under two different semirings, returns the softmax distribution and the greedy decode, and the two are joined by Maslov dequantization. Theorem 5 (Two-temperature soundness). Read Π as a semiring functional aggregate query in which sources are combined by ⊗ along the residual derivation (so j cvj = j cvj = Lv ) and competing propositions are combined by ⊕. Then on a finite nonempty V : • under the log-semiring (⊕ = log - exp, ⊗ = +), the aggregate is log v exp(Lv ) and the per-proposition share is exp(Lv )/Z = P (v), the softmax distribution (Interpretation I,
• under the tropical semiring (⊕ = max, ⊗ = +), the aggregate is maxv Lv with witness arg maxv Lv , the greedy decode (Interpretation II, T = 0); and the two are linked by the Maslov sandwich max Lv ≤ T log
exp(Lv /T ) ≤ max Lv + T log|V |,
Proof sketch. The log-semiring claim is the identity defining log-sum-exp and Theorem 1. For the tropical claim, on a finite nonempty set the maximum is attained, maxv Lv ∈ {Lv }, giving an explicit witness. For the sandwich: the lower bound holds because the max term is one of the summands and log is monotone, so v exp(Lv /T ) ≥ exp(maxv Lv /T ) and multiplying by T gives T log v exp(Lv /T ) ≥ maxv Lv . The upper bound holds because every summand is at most exp(maxv Lv /T ), so the sum is at most |V | exp(maxv Lv /T ), and taking T log gives maxv Lv + T log|V |. As T → 0 the slack T log|V | → 0, squeezing the middle to the max. This T → 0 limit is Maslov dequantization, the homomorphism from the log-semiring to the tropical This is the formal content of “one program, two temperatures.” The softmax you sample from at T = 1 and the greedy arg max you decode at T = 0 are not two algorithms; they are one semiring program evaluated under two semirings, and the temperature interpolates continuously between them. The logic picture (I) and the geometry picture (II) are the two ends of this single dial.
PIC is a measured theory: every quantity above is computed on real autoregressive models (primarily two Qwen2.5-0.5B variants, the Coder-Instruct and the base Instruct,1 and the Pythia ladder from 70M to 1B) with the fieldrun toolchain,2 which decompiles a model into a flat retrieval store plus a composition kernel and runs explain-only probes that classify each decision and report PR, the margin, µt , and the nearest power-diagram facet. We ground the three routes in concrete decisions.
Consider a position where the prompt ends “. . . the capital of France is”. A single induction/ngram circuit already places “ Paris” at the top of its own vote: µt ≥ 1. In power-diagram terms (Figure 3) the residual r sits deep inside the “ Paris” cell, a large normalised margin, so a small perturbation will not flip the decision. The measured signature: large facet distance, µt ≥ 1, and (the diagnostic that makes this “retrieved” rather than merely “selected”) the candidate set is exactly one store idiom’s top-1. The participation ratio is still ≈ 45; even this canonical lookup is assembled from a ∼45-way additive sum, not decided by one dominant circuit.
Now a position whose continuation no single circuit proposes, a token that is the arg max of the summed vote but of no summand, µt = 0. The witness of Theorem 4 is the schematic: source 1 votes (2, 3, 0), source 2 votes (2, 0, 3), and the sum (4, 3, 3) elects outcome 0, which neither source chose. On the real model the measured signature is the mirror image of the retrieved case: the residual sits near a facet (small margin), µt = 0, and, a sharper geometric test, the nearest facet is not the bisector with the store’s own predicted token for ∼85% of composed tokens, so composition is a genuine non-local divergence from the rule, not a near-miss. Roughly 80% of composed tokens are strictly emergent (µt = 0).
Qwen2.5: see the Qwen Team’s technical report in the references (§10, additional attributions); the Pythia suite is ref. [3]. The two Qwen variants share a vocabulary, hence a shared unembedding frame. fieldrun: https://github.com/jascal/fieldrun.
Proof. With em = E/PR and E = ̸ 0, em /E = 1/PR immediately. For A ⊆ {1, . . . , PR}, m∈A em /E = |A| · (E/PR)/E = |A|/PR. For fixed k = |A|, k/PR → 0 as PR → ∞.
The deployment consequence is concrete: because the retrievable tier is low-participationratio and the computed core is the high-PR, diffuse layer, you can quantise the retrievable tier hard but should protect the computed core, which is exactly where quantisation error concentrates in the measurements (composed int4 flip rate 5.9–18.8% versus retrieved
PIC is anchored where it can be measured and deliberately conjectural where the measurements stop. Incompleteness is treated as first-class: several of the results above are proved only on the measured set, and the corresponding general statements are explicit “frontier holes.” The sharpest open questions: Soundness and completeness of weighted incidence resolution. Is the coalition bound j∈S ′ Dj a sound inference rule for weighted incidence resolution, and is the resolution complete? The measured coalition additivity predicts joint-ablation flips at ∼75–83%, but a general soundness/completeness theorem is open. The general expressivity separation. Theorem 4 exhibits composed tokens inexpressible in the Horn / ∩–∪ fragment on the measured µt = 0 set. Proving the separation in general rather than only on measured points is open; it is the precise sense in which “composition is genuinely more expressive than retrieval” would become a theorem. The support number σ(t). Define the support number σ(t) of a token to be the size of its smallest sufficient circuit set: the fewest sources whose partial sum already makes t the winner. It is the quantitative handle on the reducible/irreducible distinction of §6: an irreducible composed token is one whose support requires all of its sources (σ(t) large and no proper sub-coalition suffices), whereas a merely µt = 0 token may have small σ(t). Two questions are open. First, does σ(t) scale with the participation ratio PR (i.e. is the size of the deciding coalition tied to the effective number of contributing sources)? The relationship is cleanest at the individual-circuit level; measured at coarser block granularity it mildly reverses, so the circuit level is where the question is well posed. Second, and relatedly, is there a general criterion that separates the genuinely irreducible composed tokens (no proper sub-coalition decides t) from the merely µt = 0 ones? The two classes already come apart on small machine-checked witnesses; a general separation would turn the upper-bound nature of the “∼80% strictly emergent” figure into a sharp count. The provenance gap. The aggregate value of Π is correct under any G (the static decomposition is exact). What a dense, non-diagonal G breaks is not the value but the provenance: the per-source explanation and the ablation counterfactuals. Scalar semiring Datalog is provenance-exact only on a diagonal G (the independent-outcome case); for dense G, scalar provenance cannot carry the cross-outcome correlation the frame Gram encodes. Closing this “provenance gap” would simultaneously settle the non-truth-functionality theorem; the one-to-one correspondence is the strongest evidence the three interpretations are one theory. The tropical-rank floor. The irreducible computation is conjectured to be lower-bounded by the gap between the tropical (Barvinok) rank of the core’s decision map and that of any flat lookup table, a gap that a linear (SVD) rank structurally cannot measure. This would explain why a low-rank linear
approximation misranks the composed core: its hardness is a count of tropical monomials (decision cells), which a Frobenius rank does not see. Treewidth as a third measure. The same wall may be seen as the treewidth of the core’s factor graph: sum-product is exponential in treewidth, so a high-treewidth dense-G region is at once intractable and non-compact. Whether participation ratio, tropical rank, and treewidth coincide (one wall, three measures) or diverge on the dense fragment is open. Beyond autoregressive transformers. The static algebra of Section 3 needs only additive composition, a linear readout, and a softmax over a finite vocabulary, so it is not, in itself, autoregressive or even transformer-specific. The empirical taxonomy, by contrast, is defined by the autoregressive forward pass. The cleanest generalisation target is therefore a non-autoregressive model, and diffusion language models split the question in two. Discrete / masked diffusion (with a transformer backbone and a per-position categorical softmax at each denoising step) inherits the static decomposition, DLA, G, recovered probability, the power diagram, essentially unchanged per position per step; what is genuinely undone is the dynamics, since a “decision” becomes a whole denoising trajectory rather than one step, bidirectional attention changes what counts as a source dj , and the margin/multiplicity statistics must be redefined over the trajectory. Continuous diffusion (score/noise prediction in an embedding space with a separate rounding step) has no softmax over a vocabulary and no discrete arg max, so the recovered-probability anchor (Thm 1) and the power-diagram geometry (Prop 1) have no direct analog and would require a measure-theoretic reformulation rather than a transcription. Establishing which PIC results survive each setting, and what the analog of the retrieve/compose split is when generation is iterative refinement rather than left-to-right, is open. The evidence is from sub-billion-parameter models. Whether the same three-way structure and the PR-mechanism survive at frontier scale, and whether the split behaves the same at long context, is untested. Why it matters. If the provenance gap and the expressivity separation close, the consequence is a clean, falsifiable picture of a transformer: the retrievable fragment is an additive, low-rank, low-treewidth part that exports to a compact, verifiable program, while the computed fragment is a diffuse, high-rank, dense-G remainder that admits no compact symbolic form. That would turn “which parts of a model can we extract and audit, and which are irreducibly computed” from a slogan into a measurable, bounded region, with direct consequences for interpretability (audit the retrievable tier), safety (the computed core is where capability genuinely grows with scale), and efficiency (quantise the retrievable tier, protect the core).
An autoregressive transformer, asked for a token, spends most of its labour retrieving and selecting, and a small, scale-growing remainder computing something new. PIC makes that distinction measurable on real models, roughly 25% retrieved, 60% selected, 15% composed, gives the composed core an exact geometric and causal characterisation, and shows that the same structure is simultaneously a probabilistic logic, a tropical geometry, and an executable Datalog program, according to a single degree of freedom, the temperature. The conceptual payoff is the lineage. Incidence calculus was built in 1985 on the insight that probability is not truth-functional but incidence is, and that one recovers probability by measuring incidences. Forty years on, a transformer’s composition core turns out to be exactly
The references are organised in four groups. We begin with the source paper this guide accompanies, then reproduce the works it cites, then add curated further reading (broad introductions and the seminal works behind the ideas this guide leans on: transformers, mechanistic interpretability, mixtures and products of experts, the softmax/Gibbs link, tropical geometry, and graphical-model intractability), and finally a short list of attributions the source paper leaves implicit (the measured model, the participation-ratio metric, the XOR/perceptron lineage, and the proof assistant).
Each theorem in Section 6 corresponds to a result in the source paper that has been transcribed into a machine-checkable proof skeleton (in the i-orca format,3 lowering to Isabelle/Isar). Two honest caveats apply, and are worth stating because they model how the theory treats its own limits. First, a passing static check verifies the proof skeleton, the method coverage, and is not, by itself, a kernel-checked proof; the real status requires running the prover. Second, the steps the paper itself leaves open (general Horn separation, the cited Maslov limit handled by automation, the asymptotic localisation bound) are marked as explicit frontier holes rather than hidden. The
i-orca: https://github.com/jascal/i-orca.
map between the human-readable results here and the formal artifacts is: cardinality-inertness (Thm 2), non-truth-functionality budget (Thm 3), weighted-threshold expressivity (Thm 4), recovered probability (Thm 1), diffuseness (Thm 6), two-temperature soundness (Thm 5), the power diagram (Prop. 1) and the margin distance (Prop. 2).
Residual stream r the running additive sum of all component writes at a position, r = j dj . the score of token v, Lv = ⟨r, Uv ⟩ + bv = j cvj . DLA cvj direct logit attribution, the vote of source j for token v, cvj = ⟨dj , Uv ⟩. Gram matrix Gvw ⟨Uv , Uw ⟩, the carrier of non-truth-functional coupling between tokens. Participation ratio PR effective number of contributing sources; route-invariant, with a value that is architecturedependent (≈ 45 on the Qwen models; wider across the Pythia ladder). Readout multiplicity µt number of sources whose own arg max is t; µt = 0 marks a composed (emergent) token. Differential incidence Dj ctj − cvj = ⟨dj , Ut − Uv∗ ⟩, source j’s push on the winning margin. Normalised margin ∆/∥Ut − Uv∗ ∥, the Euclidean distance from r to the nearest decision facet. Power (Laguerre) diagram the tessellation of residual space into arg max cells, with sites {Uv } and weights (bv , ∥Uv ∥2 ). Maslov dequantization the T → 0 homomorphism from the log-semiring (softmax) to the tropical semiring (greedy arg max).
The measurements and the formal artifacts referenced throughout are available in two public the pure-Rust inference and explain toolchain that decompiles a model into a flat retrieval store plus a composition kernel and produces the per-decision diagnostics (PR, margin, µt , nearest power-diagram facet) used in Section 7. https://github.com/jascal/fieldrun the proof-skeleton format into which the theorems of Section 6 are transcribed for machine checking (lowering to Isabelle/Isar), as discussed in Appendix A. A representative fieldrun example lives at examples/fieldrun/fieldrun.i.orca.md. https://github.com/jascal/i-orca
The reader above is the supplement's text, and the three figures are recreated inline. Tables and equations remain vector graphics pdftotext can't extract, so view or download the full supplement below. The companion research paper, “What a Transformer Retrieves and What It Computes”, is available here.