Tech watch

Percepta's 2D Heads: Programs Compiled Into Weights

2026-03-25 12:44:11 · 01960e00-0001-7000-8000-000000000001

The paper

Percepta published a result in March 2025: a WebAssembly interpreter compiled into the weights of a standard PyTorch transformer. The model doesn't call an external tool — it executes the program itself, token by token, in its own decoding loop. The execution trace is transparent: each step appears in the output.

The core technical innovation: restricting attention head dimension to 2. In 2D, "find the key with the highest dot product with the query" becomes a computational geometry problem — finding the farthest point of a convex hull in a given direction. This is solvable in O(log t) per step instead of the standard O(t). On their benchmark, the hull-based lookup reaches 31,000 tok/s, versus 305 tok/s for a standard KV cache on the same workload.

The two paths

The architecture splits into two paths:

Slow path — full-dimensional heads. Standard LLM behavior: reasoning, planning, generation. This is the path that thinks.

Fast path — 2D heads with convex hull lookup. Deterministic execution at O(log t). This is the path that executes. The program lives in the weights. The data lives in the trace.

The key insight: changing what the fast path does doesn't require retraining. The algorithm is in the weights; the data (dictionaries, schemas, routing tables) is injected as tokens. Swap the dictionary, swap the behavior. Hot-reload without touching the model.
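As an analogy for that split (the token format below is invented, not the paper's): the lookup routine stands in for the algorithm compiled into weights, the KEY=VALUE tokens stand in for the injected data, and swapping the tokens swaps the behavior without touching the "model".

```python
def fast_lookup(injected_tokens, query):
    """Stand-in for the compiled fast path: the scan-and-match
    'algorithm' is fixed; the KEY=VALUE tokens are the injected data."""
    table = dict(tok.split("=", 1) for tok in injected_tokens if "=" in tok)
    return table.get(query)

# Hot-reload: swap the injected dictionary, not the model.
dict_v1 = ["fr=French", "en=English"]
dict_v2 = ["fr=France", "en=England"]
```

Same function, two behaviors: `fast_lookup(dict_v1, "fr")` and `fast_lookup(dict_v2, "fr")` disagree purely because the injected tokens differ.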

Applications on small LLMs

The paper demonstrates a WebAssembly interpreter, which is a proof of concept. The interesting question is what else you could compile into the fast path of a small, deployable model.

Correction and eviction patterns:

  • NER with dynamic dictionaries. The slow path predicts an entity; the fast path looks it up in an injected dictionary; if it's a false positive, the fast path overrides. Dictionary changes don't require retraining — they're token injections.
  • SQL validation. The slow path generates a query; the fast path runs a parser embedded in the weights, validates syntax and schema, corrects inline. Zero external round-trip.
  • Format control. The slow path generates JSON or YAML; the fast path validates against an injected schema, corrects if invalid. Zero post-processing.

Embedded execution patterns:

  • Agnostic router. The routing algorithm is in the weights; the routing table is injected as tokens. Strategy selection (local, HTTP, noop) is data, not code. Hot-reload the routing table without reloading the model.
  • Constraint solver. The slow path formulates the problem; the fast path solves it exactly. Scheduling, assignment, bin-packing — deterministic results, guaranteed correct.
  • Persistent state machine. Workflow transitions are the compiled algorithm; current state is the generated trace. Exact recovery after crash, because the trace is the state.
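The last pattern can be sketched as follows (the workflow states and event names are invented): the transition table plays the role of the compiled algorithm, and the event trace is the only state, so replaying the trace recovers the exact position after a crash.

```python
# Transition table: stands in for the algorithm compiled into weights.
TRANSITIONS = {
    ("draft", "submit"): "review",
    ("review", "approve"): "done",
    ("review", "reject"): "draft",
}

def replay(trace, start="draft"):
    """Recover the current state by replaying the event trace.
    Unknown (state, event) pairs leave the state unchanged."""
    state = start
    for event in trace:
        state = TRANSITIONS.get((state, event), state)
    return state
```

There is no state variable to persist or restore: the trace is appended to as events occur, and `replay` reconstructs the machine from it at any time.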

The differentiability question

The fast path is differentiable. A correction made via the injected dictionary can propagate a gradient signal back to the slow path: a false positive caught by the fast path can teach the slow path not to make that prediction again. This feedback loop between deterministic execution and probabilistic reasoning is where the real research potential lies, and it is completely open.
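A deliberately tiny illustration of that loop (a one-parameter "slow path" with a squared-error signal; nothing here comes from the paper): the fast path's deterministic correction acts as the target for a gradient step on the slow path.

```python
def feedback_step(w, x, corrected, lr=0.1):
    """One gradient step on the slow path's single weight `w`, using the
    fast path's correction as the target.
    Loss: (w*x - corrected)^2, so dL/dw = 2*(w*x - corrected)*x."""
    pred = w * x
    grad = 2.0 * (pred - corrected) * x
    return w - lr * grad

w = 0.0  # the slow path initially scores 0 -- a systematic miss
for _ in range(100):
    w = feedback_step(w, x=1.0, corrected=1.0)
```

After repeated corrections, `w` converges toward the value the fast path keeps asserting, i.e. the slow path stops producing the error the fast path was catching.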

Connection to HOROS

Several HOROS patterns map directly to these concepts:

  • SQLite WAL as persistent trace — exact state recovery between sessions
  • dbsync (snapshot + atomic swap) as a state transfer mechanism analogous to HullKVCache
  • squeueHA (bounded execution, claim/ack, MaxAttempts) as guarantees on the execution trace
  • usertenant catalog (routing agnostic to content) as data-in-tokens, not algorithm-in-code

The convergence isn't accidental. Both architectures separate the "what to do" (compiled, deterministic) from the "what to do it on" (injected, hot-swappable).


hazyhaar — open research, sovereign infrastructure · github.com/hazyhaar · hazyhaar.fr

percepta · neural-networks · research