Luiz Spies — Technical Notes

Transformer X-Ray, Part II: The BOS Bottleneck

2026-06-16T00:00:00+00:00

The first x-ray post looked at where the prediction position sends attention mass. That was useful, but it flattened the forward pass into one aggregate graph.

The more interesting question is when the routing pattern changes.

I ran a small sweep of 16 prompt pairs and compared layerwise attention divergence between an intended continuation context and a contradictory one. Three examples stood out:

dna_acid
shakespeare_romeo
wwii_end

They produce three different depth profiles. Together they suggest a consistent structure: the forward pass compresses into a BOS-dominated bottleneck around layer 16 of 24, then reopens near the output.

This post is about that shape.

What was measured

For each prompt pair, I traced the full forward pass and computed layerwise Jensen-Shannon divergence between the attention-flow distributions under:

an intended continuation context
a contradictory continuation context

Separately, I aggregated the prediction-position attention by layer and measured how much of that mass went to position 0 (BOS).
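A minimal sketch of the two measurements above. It assumes each layer's prediction-position attention is available as a list of non-negative weights indexed by source position, and that both contexts have the same number of positions. The function names, the base-2 logarithm, and the toy numbers are illustrative assumptions, not the tracer's actual code.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(weights_a, weights_b):
    """Jensen-Shannon divergence in bits (bounded by 1) between two attention rows."""
    p, q = normalize(weights_a), normalize(weights_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def bos_share(weights):
    """Fraction of attention mass on position 0 (BOS)."""
    return normalize(weights)[0]

# Toy per-layer rows for one prompt pair (illustrative values only).
intended = [[0.1, 0.5, 0.4], [0.9, 0.05, 0.05]]
contradictory = [[0.1, 0.2, 0.7], [0.9, 0.06, 0.04]]
per_layer_js = [js_divergence(a, b) for a, b in zip(intended, contradictory)]
per_layer_bos = [bos_share(a) for a in intended]
```

Because the mixture m is positive wherever p or q is, the divergence stays finite even when one context puts zero mass on a position.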

One caveat up front: in this sweep, “correct” means the prompt was continued with the intended context, not necessarily that the model produced the expected answer token. Two of the three examples below do not produce the expected final token even in the intended context. They are still useful because the claim here is about routing depth, not benchmark accuracy.

The relevant trace files are:

/tmp/dna_correct.jsonl
/tmp/shakespeare_correct.jsonl
/tmp/wwii_correct.jsonl
/tmp/kappa_results.jsonl

Shared structure

Across all three traces:

L0-L2: local/contextual routing dominates
L3: BOS turns on abruptly
L16: BOS reaches maximum compression
L17-L23: BOS releases and semantic positions reopen

The normalized BOS share at the bottleneck layer:

Example	`L16` BOS share
`dna_acid`	`0.933`
`shakespeare_romeo`	`0.915`
`wwii_end`	`0.889`

This is the strongest invariant in the data so far. The bottleneck is not vaguely “somewhere in the middle.” On these traces it sits around two-thirds depth: layer 16 of 24.

Case 1: `dna_acid`

Prompt family: "DNA stands for deoxyribonucleic ..."

This is the clean memorized-phrase case.

Layerwise divergence:

average JS divergence: 0.0553
L2 peak: 0.1803

That early peak matters. The contradiction shows up before the BOS sink fully activates.

Prediction-position BOS share:

Layer	BOS share
`L0`	`0.045`
`L1`	`0.049`
`L2`	`0.034`
`L3`	`0.510`
`L16`	`0.933`
`L23`	`0.536`

Interpretation:

early layers are reading a memorized lexical pattern
BOS compression takes over at L3
by L16, the trace is almost fully collapsed into BOS
by L23, BOS is still dominant, but semantic positions re-emerge strongly

This is an early-detection contradiction.

Case 2: `shakespeare_romeo`

Prompt family: "Romeo and Juliet was written by ..."

This is the low-confidence late-separation case.

Layerwise divergence:

average JS divergence: 0.0503
L23 peak: 0.1506

Prediction-position BOS share:

Layer	BOS share
`L0`	`0.037`
`L1`	`0.146`
`L2`	`0.091`
`L3`	`0.734`
`L16`	`0.915`
`L23`	`0.416`

Confidence on the intended-context trace is low: 0.2678.

That shows up in the late layers. BOS weakens, but the trace does not collapse into one clear semantic target. The upper layers remain distributed.

This is not the same pattern as dna_acid. The contradiction is not caught early. The middle stack is comparatively stable. The separation comes late, and the final routing remains diffuse.

Case 3: `wwii_end`

Prompt family: "World War II ended in ..."

This is the strongest late-release case.

Layerwise divergence:

average JS divergence: 0.0402
L23 peak: 0.1308

Prediction-position BOS share:

Layer	BOS share
`L0`	`0.049`
`L1`	`0.121`
`L2`	`0.049`
`L3`	`0.677`
`L16`	`0.889`
`L23`	`0.101`

The important detail here is not that BOS is literally absent in the early layers. It is not. The important detail is that BOS holds a small share relative to the semantic positions before L3 (at most 0.121), then drops back to 0.101 by L23.

At L23, the top positions are no longer BOS-dominated:

position 2: 4.808
position 4: 4.333
position 5: 2.070
BOS: 1.411

So the late layers are reading the slot directly. The model stops using BOS as the dominant routing anchor and turns back to the content positions that matter for the year completion.
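A quick arithmetic check on the four L23 values listed above, assuming they are unnormalized attention mass on the same scale as the BOS share table. Only the listed positions are included, so the first number is an upper bound on the BOS share, not a recomputation of the table value. The implied mass on unlisted positions is an inference, not a traced quantity.

```python
# Unnormalized L23 attention mass for the four positions listed above.
l23_mass = {"BOS": 1.411, "position 2": 4.808, "position 4": 4.333, "position 5": 2.070}

listed_total = sum(l23_mass.values())
bos_share_listed = l23_mass["BOS"] / listed_total

# Mass the unlisted positions would need to hold for the BOS share to equal
# the 0.101 reported in the table (inferred, not measured).
implied_total = l23_mass["BOS"] / 0.101
implied_unlisted = implied_total - listed_total

print(f"BOS share among listed positions: {bos_share_listed:.3f}")
print(f"implied mass on unlisted positions: {implied_unlisted:.2f}")
```

Under those assumptions BOS holds about 0.112 of the listed mass, which is consistent with a 0.101 share overall if the remaining positions carry roughly 1.35 of additional mass.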

The shape

The hourglass is real, but it is not symmetric.

The observed structure on these traces is:

Bottom opens early: L0-L2 are local and content-heavy
Compression begins abruptly: BOS turns on at L3
Neck sits late: maximum compression is around L16
Top reopens near output: L17-L23 release BOS and return mass to semantic positions

That is more precise than saying “attention becomes abstract in the middle.”

More concretely:

early layers detect lexical or contextual pattern
middle layers compress routing into a stable transport regime
late layers reopen that routing to make the final commitment

The three examples differ in where contradiction becomes visible:

dna_acid: early lexical contradiction
shakespeare_romeo: late, low-confidence separation
wwii_end: late semantic slot reading

What this does and does not show

What it shows:

the forward pass has a measurable depth profile
BOS can act as a real routing bottleneck
different prompt types diverge at different depths

What it does not yet show:

that this bottleneck is universal
that L16 is stable across models
that the early/late split cleanly maps to “memorized” vs “compositional” at scale

The current sweep is N=16. That is enough to show the pattern, not enough to treat it as settled.

The next useful run is larger and simpler:

50-100 matched prompt pairs
first divergence layer
peak divergence layer
area under the divergence curve
split by prompt type
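The three per-pair metrics in that list could be computed from a per-layer divergence curve roughly as follows. This is a sketch: the 0.05 threshold for "first divergence" is an arbitrary placeholder, the function names are hypothetical, and the toy curve is made up to resemble the early-peak shape described for dna_acid.

```python
def first_divergence_layer(curve, threshold):
    """Index of the first layer whose divergence exceeds threshold, or None."""
    for layer, value in enumerate(curve):
        if value > threshold:
            return layer
    return None

def peak_divergence_layer(curve):
    """Index of the layer with the largest divergence (earliest on ties)."""
    return max(range(len(curve)), key=lambda layer: curve[layer])

def area_under_curve(curve):
    """Trapezoidal area under the curve, with unit spacing between layers."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

def summarize(curve, threshold=0.05):
    """Bundle the three metrics for one prompt pair."""
    return {
        "first": first_divergence_layer(curve, threshold),
        "peak": peak_divergence_layer(curve),
        "auc": area_under_curve(curve),
    }

# Toy curve with an early peak (values are made up).
toy_curve = [0.02, 0.04, 0.18, 0.06, 0.03, 0.03]
summary = summarize(toy_curve)
```

Splitting by prompt type is then a matter of grouping these summaries by a label attached to each pair.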

If the layer-16 bottleneck and the early-vs-late split survive that, this stops being a visual anecdote and starts becoming a routing diagnostic.

Files

The x-ray tracer is in rocmforge. The visualization work is in geographdb-core.

Relevant local artifacts from this run:

/tmp/kappa_results.jsonl
/tmp/dna_correct.jsonl
/tmp/shakespeare_correct.jsonl
/tmp/wwii_correct.jsonl

The first post in this series is here:

X-raying a Transformer Forward Pass

X-raying a Transformer Forward Pass

2026-06-16T00:00:00+00:00

What does attention actually do, token by token, layer by layer? Not the textbook answer — the actual numbers, on a real prompt, with a real model.

I built a forward-pass tracer into rocmforge that captures every attention edge as inference runs, then renders it as a graph. This post shows what came out.

What the tracer captures

Every transformer forward pass is a flow: embeddings at the bottom, logits at the top, attention routing information between positions at each layer.

The tracer records this as a JSONL stream:

node records: one per component (input_embedding, query, key, value, attention_output, mlp_hidden, logits, confidence) per layer per sequence position
edge records: attention edges with src_position, dst_position, weight — the raw softmax output, summed across heads
meta record: predicted token, confidence, and expected attention positions for the prompt

Weights are summed across all 24 layers and all heads. This gives total attention mass per (src, dst) pair across the full forward pass.
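A sketch of that aggregation over the JSONL stream. The field names src_position, dst_position, and weight come from the description above. The "kind" tag used to tell edge records from node records, the "layer" field, and the choice of dst_position as the attending side are guesses about the schema and should be adjusted to the real trace format.

```python
import json
from collections import defaultdict

def sum_edge_weights(lines, kind_key="kind"):
    """Total attention weight per (src_position, dst_position) pair.

    Assumes edge records are JSON objects tagged kind_key == "edge"
    (the tag name is a guess, not the tracer's documented schema).
    """
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(kind_key) != "edge":
            continue
        totals[(record["src_position"], record["dst_position"])] += record["weight"]
    return dict(totals)

def incoming_mass(totals, pred_position):
    """Mass per source position for edges ending at pred_position.

    Assumes dst_position is the attending (query) side.
    """
    return {src: w for (src, dst), w in totals.items() if dst == pred_position}

# Hand-written sample records standing in for a real trace file.
sample = [
    '{"kind": "edge", "layer": 0, "src_position": 0, "dst_position": 2, "weight": 0.7}',
    '{"kind": "edge", "layer": 1, "src_position": 0, "dst_position": 2, "weight": 0.9}',
    '{"kind": "edge", "layer": 0, "src_position": 1, "dst_position": 2, "weight": 0.3}',
    '{"kind": "node", "layer": 0, "component": "query", "position": 2}',
]
totals = sum_edge_weights(sample)
mass_at_pred = incoming_mass(totals, pred_position=2)
```

With a real trace, an open file handle can be passed in place of the sample list, since the function only iterates over lines.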

Correct prediction: Paris → city

Prompt: “The capital of France is Paris. Paris is a…”

Predicted token: city (confidence 0.773)

Expected positions: {0, 4, 5} — BOS token, “Paris”, “is”

Left: attention flow graph, positions 0–8, components stacked bottom to top. Right: what the last position (pos 8) attends to, colored by expected (green) vs unexpected (orange/red).

The convergence bar is what matters. Position 8 (prediction position) attends to:

Position	Token	Weight	Status
0	BOS	166	expected
4	“Paris”	31	expected
2	“capital”	~8	unexpected
5	“is”	~6	expected

Strong BOS sink. Dominant expected positions. Four unexpected positions with low mass. Model routes to the right context.

Wrong prediction: Myanmar → Yangon

Prompt: “The capital of Myanmar is”

Predicted token: Yang (→ Yangon, confidence 0.9999)

Correct answer: Naypyidaw (Myanmar moved its capital in 2006)

Expected positions: {0, 1, 3, 4} — BOS, “The”, “capital”, “of”

Same layout. Position 13 (prediction) attends to 14 positions. Many are unexpected.

Position	Weight	Status
0 (BOS)	173	expected
13 (self)	30	unexpected
12	22	unexpected
6 (“Myanmar”)	22	unexpected
9	14	unexpected
7	13	unexpected
4	10	expected
5	7	unexpected
3	7	expected
11	5	unexpected
8	5	unexpected
10	2	unexpected
1	2	expected

What the comparison shows

Metric	Correct	Wrong
BOS sink (pos 0)	166	173
Active positions	9	14
Unexpected positions > 0.1	4	10
Confidence	0.773	0.9999

The BOS sink does not move. It gets slightly stronger in the wrong prediction (173 vs 166). In this pair, that argues against sink displacement as the failure cause.

What changes: unexpected positions dominate the non-BOS mass (about 120 of the 139 listed outside position 0). The model’s final token pulls mass from positions that activate the Myanmar→Yangon co-occurrence: Yangon was the capital until 2006 and appears far more frequently in training data than Naypyidaw. The model commits to this with 0.9999 confidence, not because the readout layer fails, but because attention routed to the wrong context.

Failure modes observed: higher attention entropy (#3) and unexpected-position mass dominance (#4). The readout layer (logit projection) works correctly on whatever context attention delivered — the error is upstream.

Where this runs

The tracer is in rocmforge, emitting JSONL from the CPU inference hotpath. It runs on any GGUF model loadable by the existing CPU engine. The visualization is a Python script (plot_forward_graph.py) in geographdb-core.

Invocation:

cargo run --example infer -- \
  --model models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --prompt "The capital of Myanmar is" \
  --forward-graph-trace /tmp/trace.jsonl \
  --expected-attention '{"13": [0,1,3,4]}'

python examples/plot_forward_graph.py /tmp/trace.jsonl

What comes next

Two traces is not a result. It is a signal worth testing.

The claim — that wrong predictions show higher attention entropy and more unexpected-position mass while BOS sink strength remains constant — needs a controlled study before it can be asserted. What I am planning:

Run 20–50 correct/wrong prompt pairs on Qwen2.5-0.5B-Instruct, matched on approximate prompt length
Compute Shannon entropy of the pred-position attention distribution for each trace
Test whether entropy(wrong) > entropy(correct) holds across the dataset
Separate routing failure (this post) from readout failure by checking whether wrong predictions with high confidence differ from wrong predictions with low confidence
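The entropy step in that plan is small enough to sketch. The function below is the standard Shannon entropy in bits. The two weight lists are copied from the tables earlier in this post and are incomplete: the Paris table lists only 4 of 9 positions with approximate values, and the Myanmar table omits position 2. The comparison therefore only illustrates the computation.

```python
import math

def attention_entropy(weights):
    """Shannon entropy in bits of an attention row after normalization."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

# Weights as listed in the Myanmar table (position 2 is not listed there).
myanmar_listed = [173, 30, 22, 22, 14, 13, 10, 7, 7, 5, 5, 2, 2]
# Weights as listed in the Paris table (approximate, 4 of 9 positions).
paris_listed = [166, 31, 8, 6]

h_wrong = attention_entropy(myanmar_listed)
h_correct = attention_entropy(paris_listed)
print(f"wrong: {h_wrong:.2f} bits, correct: {h_correct:.2f} bits")
```

A full test would run the same function on complete prediction-position rows from each trace rather than on the truncated table rows.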

If the entropy separation holds, it gives an interpretability signal derivable from a single forward pass, without any fine-tuning or probing classifier. That is what makes it worth checking.

GPU path via rocmforge ROCm kernels is deferred pending flash-attention changes. CPU path works now.

Multi-Layer Graphs, Ricci Curvature, and a Hypothesis About How Computation Should Route

2026-06-15T00:00:00+00:00

This post is about an idea, not a result. The experiment described here has not been run. The hypothesis may be wrong. I’m writing it down because the reasoning is worth making explicit before touching code.

The transformer’s structural problem

A transformer collapses all abstraction levels into one operation. Syntactic prediction, semantic disambiguation, logical inference, meta-reasoning — the same matrix multiply handles all of them. The “knowledge” is distributed across billions of parameters with no structural distinction between “this weight encodes grammar” and “this weight encodes logical entailment.”

This creates two practical problems:

The frozen state problem. Weights are fixed after training. To update what the model knows, retrain everything. There’s no mechanism for local update — no way to say “strengthen this specific connection because it produced a correct prediction.”

The cost problem. A matrix multiply over a 50k vocabulary doesn’t distinguish between an unambiguous token (“the” after “in”) and a highly ambiguous one (“bank” after “I went to the”). Both pay the same computational cost. The operation is uniform where the problem is not.

Research on Ollivier-Ricci curvature in transformers shows that the geometry is already there. Attention heads develop curvature concentration at semantically load-bearing positions — some edges carry most of the semantic weight. The structure is real; it’s just hidden inside the matrix where it can’t be used for routing.

A multi-layer graph alternative

The hypothesis: replace dense matrix computation with a layered sparse graph where each layer handles one abstraction level, and a tensor field of Ricci curvature determines where layers naturally couple.

Each layer is a sparse graph with its own nodes, edges, and responsibility:

Layer 4: Meta — reasoning about reasoning, past trace annotations
Layer 3: Logical / causal — entailment, causality, succession
Layer 2: Semantic / conceptual — similarity, sense disambiguation  
Layer 1: Syntactic / token — co-occurrence, raw sequence statistics

Each layer produces a partial result. The output is a weighted sum:

output = Σ(layer_i_result × inter_layer_weight_i)

The inter-layer weights are graph edges, not learned gates. Sparse, inspectable, updatable from prediction outcomes without retraining.

Cost scales with active connections, not with vocabulary size squared. An unambiguous token may require only Layer 1. An ambiguous one — high local curvature — pulls weight from Layer 2 upward.

Adding a new layer means adding new edges into the sum. Existing layers are unchanged.
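As a toy illustration of the weighted sum and of cost following active connections, here is a sketch in which each layer's partial result is a short score vector and only layers that actually fired contribute. The function name, the dictionary layout, and the weight values are all hypothetical, since none of this has been built.

```python
def combine_layers(layer_results, inter_layer_weights):
    """Weighted sum over the layers that produced a result.

    layer_results maps layer id -> score vector (list of floats).
    inter_layer_weights maps layer id -> scalar edge weight.
    Layers absent from layer_results are never touched.
    """
    output = None
    for layer_id, result in layer_results.items():
        weight = inter_layer_weights.get(layer_id, 0.0)
        if weight == 0.0:
            continue
        scaled = [weight * x for x in result]
        output = scaled if output is None else [o + s for o, s in zip(output, scaled)]
    return output

# Hypothetical inter-layer edge weights for Layers 1 to 4.
weights = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.1}

# Unambiguous token: only Layer 1 produced a result.
easy = combine_layers({1: [0.9, 0.1]}, weights)
# Ambiguous token: Layer 2 is pulled in as well.
hard = combine_layers({1: [0.5, 0.5], 2: [0.2, 0.8]}, weights)
```

Adding a fifth layer in this sketch means adding one entry to weights and passing one more key in layer_results. Nothing about the existing entries changes.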

The tensor as routing signal

In Riemannian geometry, the metric tensor defines distances, and curvature is derived from it. Curvature tells you whether nearby shortest paths converge or spread apart, which determines which paths are actually short.

The hypothesis applies the same idea to the multi-layer graph: Ricci curvature at an inter-layer edge is the routing weight.

Where the tensor bends sharply — high local curvature — layers are pulled together. Inter-layer edges activate, computation lifts to the next abstraction level. Where it’s flat, layers don’t interact.

The analogy to General Relativity: matter tells spacetime how to curve, curvature tells matter how to move. Here: information density in the graph tells the tensor how to curve, curvature tells computation where to flow between layers.

High curvature = concept boundary = ambiguity = lift to higher layer. Low curvature = unambiguous region = stay at current layer.

This is not a new routing mechanism bolted on. It’s the geometry of the graph itself, expressed as a curvature field, doing the routing.

Why observe before building

The previous geographdb experiments produced a clear lesson: don’t impose structure, measure it.

The PMI substrate experiments (50.4–50.5 ppl ceiling) showed that geometry imposed from co-occurrence statistics doesn’t transfer to next-token prediction. The geometry had to be the right kind for the task. Imposing the wrong geometry was worse than no geometry.

The same lesson applies here. If the curvature field is imposed — “high-curvature nodes connect to Layer 2, low-curvature to Layer 1” — that’s a design decision that may or may not match what the data actually needs.

The right approach, following the Ollivier-Ricci methodology: build the multi-layer graph from existing data, compute curvature on every edge (within layers AND inter-layer), and observe where it concentrates.

The null hypothesis: curvature distributes randomly across layers and inter-layer edges. If high-curvature points in Layer 1 don’t align with high-curvature points in Layer 2 at the same concept, the inter-layer routing hypothesis is wrong.

The signal to look for: if curvature aligns across layers at the same concepts without being forced — if the geometry self-organizes the layer boundaries — that’s the data telling you where layers naturally touch.
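One way to make the null hypothesis testable is a rank correlation between per-token curvature in two layers, with a permutation test for significance. The sketch below assumes curvature has already been reduced to one number per token per layer. That reduction, the function names, and the toy values are assumptions; the statistics themselves (Spearman correlation, one-sided permutation p-value) are standard.

```python
import random

def ranks(values):
    """Average 1-based ranks, so tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def permutation_p_value(x, y, trials=2000, seed=0):
    """Observed correlation and the share of shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if spearman(x, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)

# Made-up per-token curvature in two layers, same token order.
layer1_curvature = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3]
layer2_curvature = [0.2, 0.5, 0.1, 0.8, 0.9, 0.4]
rho, p_value = permutation_p_value(layer1_curvature, layer2_curvature)
```

A small p-value would count against the random-distribution null. A correlation near zero would count against the inter-layer routing hypothesis.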

What already exists

The experiment is closer to runnable than it might seem:

Layer 1 exists: PMI co-occurrence graph from TinyStories, used in the geographdb experiments
Layer 2 exists: SVD embeddings of the PMI graph, 3D token positions
Layer 3 exists: directed transition matrix, already built and tested (showed same ~50.5 ppl ceiling as PMI — a result that itself says something about what these layers can and can’t do alone)
Curvature tensor: already added to geographdb-core as a per-token feature field

Missing: Ollivier-Ricci curvature computation on the inter-layer edges, and the inter-layer edges themselves.

The plan:

Build inter-layer edges: identity mapping, same token in Layer 1 → same node in Layer 2
Compute Ollivier-Ricci curvature: within each layer, then across inter-layer edges
Plot the curvature distribution — within layers vs inter-layer
Identify what data sits at high-curvature inter-layer points
Report what the geometry says, not what we expected it to say

What this would mean if it works

If inter-layer curvature aligns with within-layer curvature at the same concepts without being imposed, it means the geometry of the data naturally encodes where abstraction levels interact. The routing doesn’t need to be learned — it emerges from the structure.

That’s a different kind of efficiency than quantization or sparse attention. Those reduce compute by approximating the matrix. This replaces the matrix with a structure that computes less because most concepts don’t require multiple abstraction levels to predict.

Whether it works is an empirical question. The experiment will say.

rocmforge: GeoGraph Execution Engine and Branch Selection with a 0.5B Model

2026-06-15T00:00:00+00:00

This post covers two things built in rocmforge over the past week: a graph-based CPU execution engine with temporal rollback, and a first working result using a local 0.5B model to select between branches.

The execution engine

The core problem: CPU inference traces are sequences of operations on tensors. If you want to explore multiple continuation branches from the same prefix, you need to capture the prefix state and replay it without re-executing from scratch. You also need to roll back to the prefix state after evaluating a branch.

The implementation is CpuGraphArena with CpuGraph:

CpuGraphArena owns all captured tensor bytes. Handles (F32Handle, U8Handle) are stable arena offsets, not raw pointer addresses. Pointer arithmetic on arena data is safe through the handle abstraction.
CaptureContext copies inputs and outputs into the arena during capture.
CpuGraph::execute_window(&mut arena, window) replays a slice of the captured graph.
graph.regress(t) invalidates all nodes after timestamp t and restores arena bindings. Rolling back to prefix state is a single call.
read_back copies the final arena state to caller buffers.
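To make the handle and rollback idea concrete, here is a toy Python model. It is not the rocmforge implementation, which is Rust and stores tensor bytes. ToyArena, ToyGraph, and record_write are invented names, and the undo-log strategy is just one possible way to get the behavior described for regress(t).

```python
class ToyArena:
    """Toy arena: values live in one list and handles are list indices."""

    def __init__(self):
        self.slots = []

    def alloc(self, value):
        self.slots.append(value)
        return len(self.slots) - 1  # a handle is a stable offset, not a pointer


class ToyGraph:
    """Records timestamped writes so they can be undone later."""

    def __init__(self, arena):
        self.arena = arena
        self.nodes = []  # (timestamp, handle, value before the write)
        self.clock = 0

    def record_write(self, handle, new_value):
        """Capture one operation that overwrites the slot behind handle."""
        self.clock += 1
        self.nodes.append((self.clock, handle, self.arena.slots[handle]))
        self.arena.slots[handle] = new_value
        return self.clock

    def regress(self, t):
        """Drop every node after timestamp t and restore what it overwrote."""
        while self.nodes and self.nodes[-1][0] > t:
            _, handle, previous = self.nodes.pop()
            self.arena.slots[handle] = previous


arena = ToyArena()
state = arena.alloc(0.0)
graph = ToyGraph(arena)

prefix_t = graph.record_write(state, 1.0)  # shared prefix
graph.record_write(state, 2.0)             # branch A
graph.regress(prefix_t)                    # back to prefix state
after_rollback = arena.slots[state]
graph.record_write(state, 3.0)             # branch B
```

The handle stays valid across the rollback because it is an index into the arena, which is the property the handle abstraction is meant to provide.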

Verified: test_cpu_graph_parity max abs error = 0.00000000. The graph replay is numerically identical to direct execution.

Benchmark (10 samples, single layer):

Mode	Latency
Direct imperative	~689 µs
GeoGraph replay	~709 µs

~3% overhead from the arena indirection. The prefix capture + branch rollback pattern costs nothing at replay time beyond this baseline.

The search/rollback test: capture shared prefix → evaluate branch A → regress(t) → capture and evaluate branch B → both match direct execution. That test passes.

Branch selection with a 0.5B model

With rollback working, the next question: can a local model pick which branch is better?

Three attempts failed before finding what works. The failures are worth documenting because the root causes are non-obvious.

What failed:

Numeric-score-only prompts. Asking the 0.5B instruct model to compare raw decimal scores (“Branch A: 6.2, Branch B: 5.8, which is better?”) and answer with a single letter. The model either ignored the format, repeated the number, or answered randomly. The model has no useful representation for “6.2 is better than 5.8 as a branch score” — it’s not a task it was trained on.
Logit extraction at the wrong token position. The label token (A or B) was not the first response token. Extracted logits were dominated by format words (“CHOICE”, “Choose”, digits), not the A/B decision.
4-branch multi-class task. The hidden state does not linearly separate arbitrary numeric scores at this scale. Too many classes, not enough signal.

What works:

Semantic two-branch task: one branch described as moving toward the target direction, the other away. Natural language descriptions, not raw scores.
Chat template applied before tokenization so the instruct model actually follows the prompt.
BranchChoiceHead: a small linear binary classifier trained on the final hidden-state vector of the full multi-branch prompt, on top of the frozen 0.5B model.
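For readers unfamiliar with the idea of a linear head on frozen hidden states, a minimal logistic-regression sketch follows. It is not the BranchChoiceHead code. The three-dimensional "hidden states" are made up, and the training loop is plain batch gradient descent chosen for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, h):
    """Probability that branch B is the correct choice."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def train_linear_head(features, labels, epochs=200, lr=0.5):
    """Fit w and b by batch gradient descent on the logistic loss."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(features, labels):
            err = score(w, b, h) - y
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def choose_branch(w, b, h):
    return "B" if score(w, b, h) >= 0.5 else "A"

# Made-up hidden states; real ones would come from the frozen 0.5B model.
train_h = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.2]]
train_y = [0, 0, 1, 1]  # 0 = branch A is correct, 1 = branch B is correct
w, b = train_linear_head(train_h, train_y)
```

Only w and b are trained here, which mirrors the setup described above: the base model stays frozen and the head reads one vector per prompt.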

Result on the integration test:

Method	Correct (of 8)
Trained choice head	8
Random baseline	4

The mechanical property holds: the trained head picks the correct branch more often than random. This is on a toy task with a small held-out set. It demonstrates the mechanism works, not that it generalizes broadly.

The key constraint from this experiment: the 0.5B model needs semantic framing. Branch descriptions must describe what each branch does in natural language. Raw numeric annotations don’t activate useful representations. This shapes everything downstream — any annotation stored in a GraphMap needs to be semantically meaningful, not just a score.

What’s next

The gap in the current system: real inference sessions don’t yet capture a GraphMap. The branch selection mechanism works on synthetic traces. Wiring CaptureContext into the actual CPU inference path is what closes the loop — real forward passes produce real traces, real traces feed the branch selector, the selector’s choices get stored as annotations.

After that: token-level reranking. At each decode step, take the top-N candidate tokens, run a short forward pass for each, score the resulting hidden states with a trained value head, bias the logit distribution toward higher-scoring candidates. Cost is N forward passes per generated token. Whether that cost is worth the quality improvement is an empirical question not yet measured.
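A sketch of what that reranking step might look like. The value function stands in for "run a short forward pass and score the hidden state" and is any callable from token id to float. The additive bias, the strength parameter, and the toy scores are assumptions, since this has not been implemented.

```python
import math

def rerank_logits(logits, value_fn, top_n=3, strength=1.0):
    """Add strength * value_fn(token_id) to the top_n logits; leave the rest alone."""
    top = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:top_n]
    adjusted = list(logits)
    for token_id in top:
        # One value_fn call per candidate: this is the N-forward-pass cost.
        adjusted[token_id] += strength * value_fn(token_id)
    return adjusted

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical value scores: token 2 leads to a better continuation than token 0.
toy_values = {0: -1.0, 1: 0.0, 2: 1.5}
logits = [2.0, 0.5, 1.8, -3.0]
adjusted = rerank_logits(logits, lambda t: toy_values.get(t, 0.0))
probs = softmax(adjusted)
```

In this toy case token 0 starts with the highest logit and token 2 ends up on top after the bias, which is the intended effect when the value head disagrees with the raw distribution.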

Geometric-Only Attention: Linear Scaling from Sparse Neighborhoods

2026-06-14T00:00:00+00:00

The previous posts documented a ceiling: every geometric attention variant converged to ~50.5 val perplexity on TinyStories, while a plain trigram MLP reached 32. Static geometry was the bottleneck.

This post covers two things: a new sparse attention mode in geographdb-core 0.5.3, and the first training run that goes below that ceiling.

Geometric-only attention

geographdb-core 0.5.3 adds GraphAttentionClassifier::set_geometric_attention_only(bool). When set:

Each token attends only to itself and its geometric graph neighbors (fixed count k)
The O(L²) full self-attention term is dropped entirely
Complexity becomes O(L × k), where k is constant
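The idea in that list can be sketched in plain Python as single-head attention restricted to a neighbor list. This is not the geographdb-core implementation: it omits RoPE, multiple heads, and the backward pass, and the function name and toy inputs are made up.

```python
import math

def neighbor_attention(queries, keys, values, neighbors):
    """Single-head attention where token i sees only itself and neighbors[i].

    Score evaluations are proportional to L * (k + 1) instead of L * L.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        attended = [i] + [j for j in neighbors[i] if j != i]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in attended
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out = [0.0] * len(values[0])
        for e, j in zip(exps, attended):
            for d, v in enumerate(values[j]):
                out[d] += (e / total) * v
        outputs.append(out)
    return outputs

# Four tokens, one graph neighbor each (k = 1); zero keys give uniform weights.
q = k = [[0.0], [0.0], [0.0], [0.0]]
v = [[1.0], [2.0], [3.0], [4.0]]
graph_neighbors = [[1], [0], [3], [2]]
out = neighbor_attention(q, k, v, graph_neighbors)
```

With uniform weights each output is the mean of a token's own value and its neighbor's value, which makes the restriction easy to see: tokens 0 and 1 never receive anything from tokens 2 and 3.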

The implementation is a sparse index build (build_attended_indices) shared between forward and backward pass. Softmax and RoPE are dispatched to CPU-native kernels with AVX2 where available, scalar fallback elsewhere.

Benchmark

Measured on CPU, forward pass, context lengths L ∈ {8, 16, 32, 64, 128}. Geometric-only mode only goes to L=128 in this run; hybrid stops at L=64 because the quadratic cost makes longer sequences prohibitive in the benchmark setup.

L	Hybrid (default)	Geometric-only	Speedup
8	122 µs	57 µs	~2.1×
16	390 µs	110 µs	~3.5×
32	1.38 ms	219 µs	~6.3×
64	5.13 ms	436 µs	~11.8×
128	—	865 µs	—

The speedup grows with L because the hybrid cost grows as L², while geometric-only grows as L. At L=64 it’s ~12×; at L=128 the hybrid isn’t benchmarked but would project to ~20 ms based on the quadratic trend.
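The arithmetic behind those statements can be checked directly from the table. The only assumption is the pure quadratic extrapolation from L=64 to L=128.

```python
# Measured forward-pass latencies from the table above, in microseconds.
hybrid_us = {8: 122, 16: 390, 32: 1380, 64: 5130}
geometric_us = {8: 57, 16: 110, 32: 219, 64: 436, 128: 865}

speedups = {L: hybrid_us[L] / geometric_us[L] for L in hybrid_us}

# Growth per doubling of L: near 4 suggests quadratic cost, near 2 linear.
hybrid_growth = [hybrid_us[2 * L] / hybrid_us[L] for L in (8, 16, 32)]
geometric_growth = [geometric_us[2 * L] / geometric_us[L] for L in (8, 16, 32, 64)]

# Pure quadratic extrapolation of the hybrid cost from L=64 to L=128.
projected_hybrid_128_ms = hybrid_us[64] * (128 / 64) ** 2 / 1000

print({L: round(s, 1) for L, s in speedups.items()})
print([round(g, 2) for g in hybrid_growth], [round(g, 2) for g in geometric_growth])
print(f"projected hybrid at L=128: {projected_hybrid_128_ms:.2f} ms")
```

The hybrid growth factor per doubling rises from about 3.2 to about 3.7 across the measured range and the geometric-only factor stays close to 2, which is what the L² versus L reading predicts once fixed overheads stop mattering.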

These are wall-clock times, not theoretical FLOPs. AVX2 dispatch affects both modes so the ratio is a fair comparison.

What this costs in accuracy

Unknown yet. The benchmark measures speed; it doesn’t measure whether the geometric-only output is close to the hybrid output. That comparison is the next measurement. The sparse mode attends to a strict subset of the tokens that full attention sees, so output divergence is expected. How much, and whether it correlates with perplexity loss, is not yet measured.

The training run

Separately, a cross-context attention variant is training on TinyStories (20k train / 2k val, 10 epochs). At the time of writing it is on epoch 3. Numbers so far:

Run	Val perplexity
Bigram baseline	115.300
Epoch 1	34.527
Epoch 2	31.607

For reference, all prior geometric attention variants — PMI substrate, transition-matrix substrate, with and without RoPE, with and without curvature weighting — converged in the range 50.4–50.5 and did not go lower. The trigram MLP baseline is 32.02.

Epoch 2 of this run is 31.607. That is below the prior ceiling and below the trigram baseline.

This is a single run, not yet complete. Early-stopping patience is 2 epochs. The model may still plateau or overfit. The architecture change that produced this is cross-context attention on top of geometric positions — the model can now query global context rather than being restricted to local neighborhood structure.

Whether it holds through epoch 10, and what the final number is, will be in a follow-up post.

Connection between the two

Geometric-only mode is a sparse approximation to full attention. The benchmark above shows the cost reduction. The training run shows a model that needs cross-context (full) attention to break the 50-ppl ceiling — which means the full O(L²) term carries information the geometric neighborhood alone does not.

That is the tradeoff: geometric-only is fast and scales linearly, but (based on current evidence) loses the cross-context signal that closed the gap with the trigram baseline. Whether a hybrid strategy — geometric-only for most tokens, full attention on a small selected subset — can recover accuracy at lower cost is an open question and not yet measured.

Code

geographdb-core 0.5.3 is at github.com/oldnordic/geographdb-core. The geometric-only flag, benchmark, and new test (learns_simple_task_geometric_only) are in this commit.

Geometry as Substrate: What the Failing Results Are Telling Us

2026-06-13T00:00:00+00:00

The previous post documented a series of negative results: PMI+SVD geometric positions don’t beat a trigram baseline, attention over geometry doesn’t beat a trigram baseline, RoPE doesn’t help, curvature weighting consistently hurts. Every experiment lost to two token IDs fed into a flat MLP.

This post is about what those results mean — and where the geometry idea goes from here.

Negative results are signals, not failures

The full experiment arc so far, all at 20k TinyStories / 15 epochs:

Architecture	Representation	Val perplexity
MLP	One-hot trigram	32.02
Attention	Graph neighbors (PMI)	55.54
MLP	Hybrid (token ID + geometry)	43.24
MLP	Geometric rotated + neighbors	126.85
MLP	Geometric absolute	145.08
MLP	Geometric rotated	272.98
Bigram baseline	—	72.97

Read as failures: every geometric model lost.

Read as signals:

Attention over geometry is about 2.3x better than MLP over geometry (55.54 vs 126.85 ppl). The architecture matters. Query/key/value over graph neighbors extracts real structure that a flat MLP can’t see.
Geometric models peak earlier than trigram. Geo-attention trained to its best validation around epoch 2-3. Trigram kept improving through epoch 15. The geometry finds something fast — it just can’t go as deep as direct token identity.
Adding geometry to token identity hurts (hybrid 43 vs trigram 32). The PMI positions are not just uninformative — they’re noise on top of the token signal.
Rotation alone is catastrophic; neighbors rescue it (273 vs 127 ppl). Local geometric context carries signal, but only when the model can query it selectively.

The consistent pattern: the PMI+SVD substrate has some structure (early peaking, attention extractability) but the wrong kind of structure for next-token prediction. PMI encodes symmetric co-occurrence similarity. Next-token prediction needs directed successor structure. Those are different things.

Why PMI is the wrong map

PMI+SVD clusters tokens by shared context neighborhood. “Dog” and “cat” end up near each other because they appear in similar sentences. Neither predicts the other as a next token. The map encodes what is similar — it doesn’t encode what comes next.

Language has both. Words that are semantically similar (similarity structure) and words that tend to follow each other (succession structure). Current LLMs learn succession directly from token sequences. The PMI graph captures similarity and ignores succession.

The next experiment: swap PMI for a directed transition matrix. Build P(w₂ | w₁) from bigram counts, embed its spectral structure in 3D. Positions now encode “where does this token’s probability mass flow to” — successor structure, not similarity structure. Same geo-attention architecture, different substrate. If the gap closes, directionality was the missing piece.
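The first step of that experiment, estimating P(w₂ | w₁) from bigram counts, is simple to sketch. The function name and the toy corpus are illustrative, there is no smoothing, and the spectral embedding into 3D is not covered here.

```python
from collections import Counter, defaultdict

def transition_probs(tokens):
    """Estimate P(w2 | w1) from adjacent-token counts, without smoothing."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (w1, w2), count in pair_counts.items():
        probs[w1][w2] = count / first_counts[w1]
    return dict(probs)

corpus = "the dog ran . the cat ran . the dog sat .".split()
P = transition_probs(corpus)

# Directed: "the" flows to "dog", but "dog" never flows back to "the".
forward = P["the"].get("dog", 0.0)
backward = P["dog"].get("the", 0.0)
```

The asymmetry between forward and backward is the property PMI lacks: PMI(the, dog) equals PMI(dog, the) by construction, while the transition matrix keeps the direction.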

The deeper problem: static geometry

Even with the right directionality, there’s a harder constraint: the geometry is fixed at training time. PMI positions are computed from the corpus, frozen, and never updated. The model learns to read a static map.

Brains don’t work this way. Synaptic weights update continuously from prediction outcomes. Connections that contribute to correct predictions strengthen. Connections that lead to errors weaken. The geometry itself is the thing being trained — not just the weights on top of it.

The current architecture has:

Fixed graph topology (PMI-derived)
Fixed node positions (SVD coordinates)
Learned attention weights (W_q, W_k, W_v)
Learned MLP head

The learned pieces are layered on top of a frozen substrate. The question the experiments are really answering is: how much can learned attention compensate for a wrong substrate? The answer so far: partially (55 vs 127 ppl) but not enough (55 vs 32 ppl).

Learned graph plasticity is the next structural change. Make the edge weights learnable. Gradient flows back through the attention mechanism and updates not just the attention projections but the strength of each graph connection. The set of edges stays fixed (the PMI-derived initial graph), but each edge strengthens or weakens from prediction signal.

Over training, edges that helped predict correct next tokens survive. Edges that didn’t, decay. The graph self-organizes from co-occurrence structure toward successor structure — the same information the trigram uses directly, but learned geometrically rather than counted statistically.

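This resembles Hebbian plasticity (neurons that fire together wire together), with one difference: the update here is driven by the gradient of the prediction loss rather than by co-activation alone. In the graph: paths that correctly predicted the next token get reinforced. The geometry evolves to encode what the task needs.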

Three connected ideas

The experiments are testing stage one of a larger architecture. The three ideas are connected:

Geometry is the substrate — the space where tokens live and where computation happens. Not flat embedding space (which transformers use), but a space with native structure: distance, direction, curvature, neighborhood.

Multi-sense tokens (quantum-token) is the representation — each token is not a point but a distribution over possible states. The same token “bank” in different geometric neighborhoods activates different sense vectors. Context collapses the superposition. This is what attention approximates, but with fixed weights and no geometric grounding. A token whose local curvature is high is near a boundary between senses — geometrically ambiguous.

Plasticity is the learning rule — edge weights update from prediction outcomes, the geometry self-organizes toward the task. Not backprop through frozen structure, but backprop that changes the structure itself.

Transformers have a version of each: attention approximates multi-sense (context-dependent activations), positional encodings inject weak geometry, and gradient descent updates the weights. But the geometry is not native — it’s injected as a correction to an orderless token bag. And the weights freeze after training. No online adaptation, no structural update.

Fractals, attractors, and deterministic structure

Current LLMs are statistical approximators. They learn to predict what usually comes next from training data. They guess from patterns.

Language has deterministic structure underneath the statistics. Grammar rules are recursive — sentences contain sentences, phrases contain phrases, at every scale the same rules apply. Mathematical proofs are deterministic — the same axioms always produce the same theorems. Code is exact. Even narrative has deep structure (the same story morphology appears across cultures and languages independently).

Fractals are the extreme case: one formula, infinite complexity, perfectly deterministic. z = z² + c generates the Mandelbrot set. Zoom into any boundary region and the same structure appears, because the same rule is being applied. Nature uses this everywhere — tree branching, leaf venation, vascular networks, coastlines — because recursive self-similar geometry is how you pack maximum function into minimum description.
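To make "one formula, perfectly deterministic" concrete, here is the standard escape-time test for the Mandelbrot iteration. The function name and the iteration budget are my own choices for illustration.

```python
def escape_time(c, max_iter=100):
    """Iterate z -> z*z + c from z = 0.

    Returns the step at which |z| first exceeds 2, or max_iter if the
    orbit stays bounded for the whole budget (treated as inside the set).
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

inside = escape_time(0j)         # orbit stays at 0 forever
periodic = escape_time(-1 + 0j)  # orbit alternates 0, -1, 0, -1, ...
outside = escape_time(1 + 0j)    # orbit 0, 1, 2, 5, ... escapes
```

The same three lines of arithmetic decide every point in the plane; all of the boundary detail comes from repeating them.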

The implication for language geometry: if concepts have geometric attractors — regions of the token space that the dynamics always flows toward given certain inputs — then reasoning is navigation, not guessing. The geometry carries the generative rule. The forward pass follows it.

This is speculative. But the tensor field added to geographdb-core is a step toward it. Local curvature tensors encode how the space bends around each token — how strongly the geometry pulls toward an attractor in that region. High curvature = strong rule = low ambiguity. Low curvature = flat region = multiple paths equally likely.

If that curvature signal can be incorporated as a per-token feature — alongside position, direction, and neighborhood — the geometry starts to encode not just where tokens live but how strongly the local rules constrain what comes next.

Where this is going

The immediate next experiment: directed transition matrix substrate. One variable changes — PMI positions swap for transition-spectral positions. Same geo-attention model. If the gap with trigram closes, directionality was the bottleneck and the architecture is sound.

That experiment ran. Directionality is not the missing piece.

Substrate	Best val ppl	Epoch
PMI (undirected co-occurrence)	50.45	2
Transition (directed successor)	50.46	1
Trigram baseline	32.02	—

Both substrates converge to ~50.5 ppl and then overfit. The difference between them is 0.01 ppl — noise. Whether the geometry encodes symmetric similarity or directed successor probability, the ceiling is the same.

The gap is not about what the static geometry encodes. It’s about the fact that it’s static. A pre-computed position — whether from PMI or a transition matrix — gives the attention mechanism a fixed map. The map has a hard ceiling around 50 ppl regardless of how it was built. The trigram doesn’t use a map; it reads successor counts directly from the training distribution. That’s why it reaches 32.
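For contrast, this is roughly what "reads successor counts directly" means. It is a bare count-based trigram sketch on a toy corpus, with no smoothing or backoff; the actual baseline may well use both, so treat this as an assumption about its shape rather than its code.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count successors for every (w1, w2) context in the training stream."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def predict(counts, a, b):
    """Return P(next | a, b) as a dict; empty if the context was never seen."""
    successors = counts.get((a, b))
    if not successors:
        return {}
    total = sum(successors.values())
    return {tok: n / total for tok, n in successors.items()}

corpus = "the cat sat on the mat the cat sat on the rug".split()
model = train_trigram(corpus)
dist = predict(model, "on", "the")
```

There is no map between the context and the prediction: the distribution is the training statistics themselves, which is exactly what a fixed coordinate per token cannot store.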

The problem is static geometry itself. Learned edge plasticity is the next experiment: make edge weights learnable, train them with the same gradient that updates attention weights. The topology stays fixed (initial k-NN graph) but edges strengthen or weaken from prediction error. The geometry self-organizes toward what the task needs rather than what corpus statistics provided at build time.

Longer term:

Tensor curvature as per-token input feature
Multi-sense token geometry (mixture of position distributions, context-selected)
Online plasticity: edge weights that update at inference time from context, not just during training

None of these alone is a new transformer. Together, they’re a different kind of substrate — one where structure is native rather than approximated, and where the geometry itself carries information that statistics would need much more data to recover.

The failing results are pointing at what the substrate is missing. That’s exactly what experiments are for.

Atheneum: Persistent Memory for AI Coding Agents

2026-06-12T00:00:00+00:00

Every AI coding session starts from zero. The assistant that helped you trace a bug yesterday has no memory of it today. You explain the same context again, re-answer the same questions, and watch it rediscover the same facts. The tools I’ve built over the last six months — magellan, llmgrep, mirage-analyzer — solve the code structure problem. They make the codebase queryable. But they don’t solve the session continuity problem. An agent still can’t carry decisions, discoveries, or hard-won debugging context from one session into the next.

atheneum is the attempt to fix that. It’s an embedded knowledge graph that persists across sessions: tool calls, decisions, wiki content, code complexity signals, and raw session memory, all in a SQLite database with structured edges between them.

atheneum is twelve days old, and v0.5.0 is the current release. This is not a finished product. It is, however, running continuously and generating real data.

What’s in the graph

The live database on my machine contains 4,677 entities and 15,015 edges. Here’s the breakdown of the main entity kinds (the rows below account for 4,150 of the total):

ToolCall      2,358   — every Claude Code tool use, timestamped and linked to session
WikiPage        280   — Logseq journal pages and wiki articles, synced in
Session         221   — coding sessions with branch, timestamps, tool counts
ReasoningLog    315   — reasoning traces stored during sessions
Reference       338   — symbol references (file:line)
Memory          130   — stable facts and dream-consolidated findings
File            198   — source files touched across sessions
Symbol          190   — code symbols (indexed via magellan)
TestRun         120   — test results linked to tool calls

Edges link these together: belongs_to_project, observed_in, wikilink, handled_by_tool, accessed, modified, CALLS, IMPORTS. The graph structure is what makes retrieval useful — you can ask “which sessions touched this symbol” or “what wiki pages link to this concept” and get answers from the edge traversal.

Navigate and search

The two most-used CLI commands:

# Lexical search across all entity kinds
atheneum search ~/.magellan/atheneum/atheneum.db "AMD GPU inference" --limit 5

# Navigate: start from matching entities, walk edges outward
atheneum navigate ~/.magellan/atheneum/atheneum.db "session accountability" --concise --max-tokens 500

search does lexical search over stored content and returns scored results. navigate uses HopGraph — more on that below — and returns a token-budgeted subgraph. The --concise flag formats output as compact Markdown intended for paste into a language-model context window. --max-tokens 500 hard-truncates at approximately that budget.

Both work against whatever is actually in the database. The entity and edge counts above are real and were read from a fresh process: all runtime counters start at zero, so what you see is the persisted graph state, not a cached view.

Dreaming module

v0.3.0 added a reflective memory consolidation pass. After a session ends, the dreaming module runs a 6-phase pipeline over stored memories:

SCAN → DEDUPLICATE → STALE → CONTRADICTION → VERBOSE → CONSOLIDATED

It uses trigram Jaccard similarity to detect near-duplicates, marks stale entries, flags contradictions, strips verbose redundancy, and produces consolidated Knowledge entities from surviving discoveries.
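A minimal sketch of trigram Jaccard deduplication, assuming character trigrams over lowercased, whitespace-normalized text and a threshold of 0.8. Both choices are my assumptions for illustration; the dreaming module may tokenize and threshold differently.

```python
def trigrams(text):
    """Set of character trigrams after lowercasing and collapsing whitespace."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + 3] for i in range(len(norm) - 2)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def near_duplicates(memories, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    return [(i, j)
            for i in range(len(memories))
            for j in range(i + 1, len(memories))
            if jaccard(memories[i], memories[j]) >= threshold]

# Hypothetical memory texts.
memories = [
    "HNSW index is rebuilt during reindex",
    "HNSW index is rebuilt during  reindex.",
    "Envoy listens on port 9876",
]
pairs = near_duplicates(memories)
```

The first two entries differ only in spacing and a trailing period, so they are flagged as one pair; the third shares almost no trigrams with either.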

The memory-list command shows what survived:

atheneum memory-list ~/.magellan/atheneum/atheneum.db --limit 5

The current database contains complexity hotspot entries from the dreaming module — cross-project code quality signals that were extracted from session data and consolidated into stable memory. An example from the live DB:

dream/code/complexity_hotspot/abtop-draw_sessions_panel_active
  High cyclomatic complexity in abtop: 5 functions > 20
  Top: draw_sessions_panel_active=91 (loc=881, fan_in=2, fan_out=52)

dream/code/complexity_hotspot/llmgrep-search_symbols_impl
  High cyclomatic complexity in llmgrep: 6 functions > 20
  Top: search_symbols_impl=63 (loc=724, fan_in=2, fan_out=30)

These entries persist across sessions. The next agent that loads context for abtop or llmgrep gets this signal without re-running any analysis.

There’s also a dry-run mode:

atheneum dream ~/.magellan/atheneum/atheneum.db --dry-run --scope dream

which reports what would be consolidated without committing anything.

HopGraph

navigate is backed by HopGraph (v0.2.0): vector-based entry point + BFS subgraph walk + token-budgeted truncation.

The flow:

Embed the query text (HashEmbedder at 128 dims by default; OllamaEmbedder at 768 dims as an optional feature)
HNSW search across all indexed entities → ranked candidates
BFS from top-k candidates, following allowed edge types to depth N
Truncate subgraph to stay within the token budget
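The last two steps of that flow can be sketched as a breadth-first walk that stops once the token budget is spent. This assumes the seeds already came out of the vector search; the graph layout, the `cost` callback, and the function name are hypothetical, not HopGraph's API.

```python
from collections import deque

def navigate(graph, seeds, allowed, max_depth, token_budget, cost):
    """BFS from seed entities, keeping nodes until the token budget is spent.

    graph:   {node: [(edge_type, neighbor), ...]}
    seeds:   entry points, e.g. the top-k hits of a vector search
    allowed: set of edge types the walk may follow
    cost:    function node -> estimated token count for rendering it
    """
    kept, spent = [], 0
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if spent + cost(node) > token_budget:
            break  # truncate: nodes nearer to the seeds were already kept
        kept.append(node)
        spent += cost(node)
        if depth == max_depth:
            continue
        for edge_type, nxt in graph.get(node, []):
            if edge_type in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return kept

# Hypothetical toy graph; every node costs a flat 10 tokens.
graph = {
    "session:1": [("accessed", "file:a"), ("modified", "file:b"), ("wikilink", "wiki:x")],
    "file:a": [("CALLS", "sym:f")],
}
edges_ok = {"accessed", "modified", "CALLS"}
full = navigate(graph, ["session:1"], edges_ok, 2, 40, lambda n: 10)
tight = navigate(graph, ["session:1"], edges_ok, 2, 25, lambda n: 10)
```

Because BFS visits nodes in order of distance from the entry points, truncating at the budget drops the farthest context first.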

The HNSW index is persistent — it survives process restarts and is rebuilt during reindex. The in-process query cache (added in v0.3.1) means repeated identical queries don’t touch SQLite.

Cross-project queries (v0.5.0)

The latest release adds cross-project search. Rather than copying data between databases (which goes stale immediately), atheneum maintains a lightweight routing registry (meta.db) and lazily attaches each registered project’s magellan DB via ATTACH DATABASE on demand:

# Register once
atheneum meta-register envoy /home/feanor/Projects/envoy \
  /home/feanor/Projects/envoy/.magellan/magellan.db --language rust

# Search across all registered Rust projects
atheneum cross-search "build_router" --language rust --k 10

# Navigate with BFS walk per project
atheneum cross-navigate "error handling" --language rust --k 5 --depth 2

SQLite allows up to 10 attached databases per connection by default (the compile-time SQLITE_MAX_ATTACHED limit, which can be raised to at most 125). The LRU cache defaults to 8 to stay safely under that limit. Unreadable or missing DBs are skipped with a warning, so one broken project does not abort the query.
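A rough sketch of the lazy-attach idea using Python's sqlite3 module and an OrderedDict as the LRU. The class name, the `symbols` table, and the alias scheme are hypothetical; this illustrates the eviction logic only and leaves out the missing-DB handling.

```python
import os
import sqlite3
import tempfile
from collections import OrderedDict

class AttachCache:
    """Keeps at most `capacity` project databases attached to one connection."""

    def __init__(self, conn, capacity=8):
        self.conn = conn
        self.capacity = capacity
        self.attached = OrderedDict()  # project name -> schema alias
        self._next = 0

    def schema_for(self, name, path):
        if name in self.attached:
            self.attached.move_to_end(name)       # mark as most recently used
            return self.attached[name]
        if len(self.attached) >= self.capacity:   # evict least recently used
            _, old_alias = self.attached.popitem(last=False)
            self.conn.execute(f"DETACH DATABASE {old_alias}")
        alias = f"proj_{self._next}"              # generated, so safe to interpolate
        self._next += 1
        self.conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
        self.attached[name] = alias
        return alias

# Build three throwaway project databases for the demo.
tmp = tempfile.mkdtemp()
paths = {}
for name in ("envoy", "magellan", "llmgrep"):
    paths[name] = os.path.join(tmp, f"{name}.db")
    db = sqlite3.connect(paths[name])
    db.execute("CREATE TABLE symbols (name TEXT)")
    db.execute("INSERT INTO symbols VALUES (?)", (f"{name}_main",))
    db.commit()
    db.close()

conn = sqlite3.connect(":memory:")
cache = AttachCache(conn, capacity=2)
found = []
for name in ("envoy", "magellan", "llmgrep"):
    alias = cache.schema_for(name, paths[name])
    found += [row[0] for row in conn.execute(f"SELECT name FROM {alias}.symbols")]
```

With a capacity of 2, attaching the third project detaches the first, and every query still reads the project's live database rather than a copy.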

This is the piece that makes atheneum genuinely cross-project rather than per-project. A query for “error handling” can surface results from magellan, llmgrep, envoy, and rocmforge simultaneously, pulling from their live magellan symbol graphs.

Wiki sync

Logseq journal files and wiki pages can be synced directly into the graph:

atheneum sync-logseq ~/.magellan/atheneum/atheneum.db ~/wiki grounded
atheneum reindex ~/.magellan/atheneum/atheneum.db

This creates WikiPage entities from the markdown content and wikilink edges from [[...]] references. Navigate queries can then traverse from code symbols into wiki pages and back — linking a design decision in a journal to the code symbols it describes.
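The link-extraction half of that sync is simple enough to sketch. This assumes plain `[[target]]` references and de-duplicates targets per page; the function name and the edge triple layout are my own, not atheneum's.

```python
import re

WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def wikilink_edges(page_name, markdown):
    """Return (source, 'wikilink', target) triples for each [[target]] in the page."""
    targets = []
    for match in WIKILINK.finditer(markdown):
        target = match.group(1).strip()
        if target and target not in targets:  # keep first-seen order, no duplicates
            targets.append(target)
    return [(page_name, "wikilink", t) for t in targets]

# Hypothetical journal entry.
edges = wikilink_edges(
    "2026-06-10",
    "Decided to use [[HopGraph]] for navigate; see [[atheneum]] and [[HopGraph]] again.",
)
```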

The 280 WikiPage entities in the live database came from this sync. Most are stub pages with link structure but limited body content; pages that have been visited recently via navigate queries have their full content indexed.

Multiple assistants

The database is not tied to a single assistant. Three consumers currently write to the same atheneum DB:

Claude Code (this environment): session data, tool calls, reasoning logs, and discoveries all go through the envoy coordination layer, which writes to atheneum.

atheneum-py: a Python port of the core atheneum library, used to connect Gemini CLI to the same graph. It implements the same memory and knowledge APIs in Python, so a Gemini session can read discoveries written by a Claude Code session and vice versa.

Hermes: an open-source Python AI assistant. The atheneum plugin at ~/.hermes/plugins/atheneum/ gives Hermes read/write access to the same graph. It uses plugin.yaml for discovery and exposes atheneum’s search and memory APIs as Hermes tools.

The multi-assistant aspect is the core design goal. The knowledge graph accumulates across every session, regardless of which assistant ran it. What Claude Code learned about llmgrep’s complexity hotspots yesterday is available to Hermes today.

MCP server

The atheneum-mcp crate (in the same repository) implements an MCP server using the rmcp library. It exposes atheneum’s memory, search, navigate, and discovery APIs as MCP tools for any MCP-compatible client.

I have not tested this end-to-end against a running MCP client in this session. The crate builds and the protocol implementation exists, but I can’t claim it’s been verified beyond compilation.

Current state

v0.5.0 — released 2026-06-09
v0.1.0 — released 2026-05-31

Twelve days of active development. It’s moving fast because it’s solving an immediate problem in my workflow. The graph is real and in use daily. The API is not stable — v0.3.x through v0.5.0 had breaking changes in store_memory, the dreaming module was added and rewritten, and cross-project queries didn’t exist a week ago.

What works: the CLI (search, navigate, memory-list, graph-stats, dream dry-run), the dreaming consolidation pass, wiki sync, memory persistence across sessions. What’s less certain: the MCP server end-to-end, atheneum-py feature parity with the Rust version, HopGraph accuracy at high entity counts.

The source is at github.com/oldnordic. The crate is on crates.io. It requires a magellan-indexed project to be most useful; read the grounded coding workflow for context on how the tools fit together.

Envoy v0.2.0: Observability, Lock-Free Paths, and Bug Fixes

2026-06-12T00:00:00+00:00

Three weeks after the initial release, envoy v0.2.0 is out. This isn’t a feature dump – it’s the result of running the server continuously and fixing the things that actually hurt. Three bugs from the original article got fixed, Prometheus metrics landed, and a performance improvement from another project turned out to transfer cleanly.

What changed

The diff is 1,551 insertions, 701 deletions across 28 files. The big items:

parking_lot everywhere

The original article mentioned envoy uses SQLite for persistence. What I didn’t mention is that the in-memory state (agent registry, circuit breaker, message store) was protected by std::sync::Mutex. Every lock site had poison recovery code – .lock().unwrap_or_else(|e| e.into_inner()) or .lock().map_err(|e| EnvoyError::LockPoisoned(...)). In practice, a poisoned mutex means a panic already happened and the data might be corrupt. “Recovering” by ignoring the poison doesn’t help.

I was already migrating rs3gw (a separate project) to parking_lot::Mutex and noticed the pattern transferred directly. parking_lot mutexes don’t use poisoning – lock() returns a MutexGuard directly, no Result. The changes:

Removed LockPoisoned error variant entirely
Removed recover_lock() helper function
Removed FastMutex type alias
Simplified 25+ .lock() call sites across 7 files
Each mutex went from ~40 bytes to ~1 byte

No behavioral change for callers – the server responds to the same endpoints the same way. But the code is cleaner and the lock overhead is measurably smaller.

Prometheus `/metrics` endpoint

The original envoy had two monitoring endpoints: /health (returns {"status":"ok","uptime_seconds":N}) and /stats (returns aggregate counters). Useful for ad-hoc checks, useless for dashboards or alerting.

v0.2.0 adds GET /metrics in Prometheus exposition format:

# HELP envoy_requests_total Total HTTP requests, labeled by operation and status class
# TYPE envoy_requests_total counter
envoy_requests_total{method="GET",path="/health",status="2xx"} 14

# HELP envoy_agents_online Number of currently active agents
# TYPE envoy_agents_online gauge
envoy_agents_online 3

# HELP envoy_request_duration_ms Request latency in milliseconds
# TYPE envoy_request_duration_ms histogram
envoy_request_duration_ms_bucket{path="/health",le="0.5"} 14
envoy_request_duration_ms_sum{path="/health"} 0.821
envoy_request_duration_ms_count{path="/health"} 14

Three metric families:

Metric	Type	What it measures
`envoy_requests_total`	counter	Request count by method, path, status class
`envoy_request_duration_ms`	histogram	Latency distribution with 9 buckets (0.5ms to 5s)
`envoy_agents_online`	gauge	Active agent count, updated on register/retire

Path normalization. Raw URL paths cause cardinality explosions in Prometheus. A path like /agents/id1/messages/42/ack becomes its own label value, so every distinct agent and message combination creates a new time series. The middleware normalizes path segments that look like IDs – numeric (42), named (id1121), or UUID (338b8adc-...) – into a single :id token. So /agents/id1/messages/42/ack becomes /agents/:id/messages/:id/ack, and the metric is the same regardless of which agent or message is involved.

The approach is borrowed from rs3gw, which uses the same metrics + metrics-exporter-prometheus crate combination. The middleware wraps every request, records start time, normalizes the path, increments the counter, and observes the duration.
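The normalization rule can be sketched in a few lines. The regular expressions below are my reading of "numeric, named, or UUID"; the real middleware is Rust and may detect IDs differently, and treating dotted subagent IDs like id1.1 as IDs is an assumption on my part.

```python
import re

UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
NUMERIC_RE = re.compile(r"\d+")
NAMED_RE = re.compile(r"id\d+(?:\.\d+)*")  # id1, id1121, dotted IDs like id1.1

def looks_like_id(segment):
    """True if the whole segment matches one of the ID shapes."""
    return any(rx.fullmatch(segment) for rx in (UUID_RE, NUMERIC_RE, NAMED_RE))

def normalize_path(path):
    """Replace ID-like path segments with ':id' to bound label cardinality."""
    return "/".join(":id" if looks_like_id(seg) else seg
                    for seg in path.split("/"))

normalized = normalize_path("/agents/id1/messages/42/ack")
```

Static segments such as `agents` or `health` pass through unchanged, so the number of distinct path labels is bounded by the number of routes rather than the number of agents.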

Prometheus scrape config:

scrape_configs:
  - job_name: 'envoy'
    static_configs:
      - targets: ['127.0.0.1:9876']
    scrape_interval: 15s
    metrics_path: /metrics

Request tracing

Every HTTP response now includes an x-request-id header with a unique UUID. This is done through tower-http layers – SetRequestIdLayer generates the UUID, PropagateRequestIdLayer ensures it appears on responses, and TraceLayer logs request/response pairs when RUST_LOG=tower_http=debug is set.

This is useful when debugging: if an agent reports a failed request, the request ID lets you find the exact log entry.

Bug fixes

Three issues from the original article’s “What’s rough” section got fixed:

`cross/navigate` no longer errors

The cross-project graph navigation endpoint (GET /atheneum/cross/navigate?q=build_router&language=rust&depth=2) was broken because the BFS edge query referenced a kind column, but production magellan databases use edge_type. The fix was a SQL alias: SELECT id, edge_type AS kind, .... Now the alias works with both the production schema and the test fixtures (which use kind). Symbol search (/atheneum/cross/search) was unaffected – it queries graph_entities which does use kind.

Evidence endpoints return JSON

Eight POST handlers – post_prompt, post_tool_call, post_file_write, post_commit, post_test_run, post_fix_chain, post_bench_run, post_subagent_handover – returned bare 201 Created with no response body. Now all return {"recorded": true} so callers can confirm success without relying on HTTP status alone. This was one of the API discoverability complaints from the original article.

API documentation rewritten

The original API.md was incomplete and sometimes wrong – several required fields weren’t documented. v0.2.0 rewrites it from the Rust struct definitions. Every endpoint now has correct request fields, required/optional markers, and response shapes. Verified against the actual handler code, not from memory.

What’s still rough

Honest update on what hasn’t improved:

The MCP polling problem is unchanged. Agents still poll for messages. The WebSocket endpoint exists but coding agents don’t speak WebSocket. This requires a protocol-level change in MCP, not an envoy fix.
Token savings counter still returns 0. Low priority – it’s a nice-to-have metric, not a correctness issue.
No Grafana dashboards yet. The Prometheus metrics are there, but I haven’t built the dashboard JSON. On the list.
Single-node only. Envoy uses a single SQLite database. No clustering, no replication, no multi-node coordination. If you need that, you need a different tool.

Numbers

Version:    0.2.0
LOC:        11,800 (Rust)
Tests:      65 (was 57)
Endpoints:  21+ (added /metrics)
Runtime:    SQLite (no external services)

The test count went from 57 to 65 – the 8 new tests cover the metrics module (path normalization, ID detection, UUID handling, histogram recording).

Install

cargo install agent-envoy

Source: github.com/oldnordic/envoy

Envoy: The Coordination Server AI Coding Agents Were Missing

2026-06-12T00:00:00+00:00

I run multiple AI coding agents in parallel. Claude Code sessions, Hermes agents, subagents spawning subagents. After a while I noticed something: none of them know the others exist. They overwrite each other’s files, repeat discoveries, and have no memory of what happened yesterday. There is no infrastructure for this. So I built one.

Envoy is an HTTP+JSON coordination server for AI coding agents. It provides agent identity, structured messaging, session accountability, and knowledge persistence – all backed by SQLite, no Postgres, no Redis, no Node.js.

What’s missing

Every major AI coding tool (Claude Code, Cursor, Copilot) treats each session as isolated:

Problem	Consequence
No persistent identity	Agents can’t address each other (“tell agent X to stop editing file Y”)
No cross-session memory	Every session re-discovers the same bugs, re-reads the same files
No audit trail	You can’t answer “who changed this file and why?”
No subagent accountability	Subagents fail silently; parents don’t know what happened
No cross-project search	Working on 3 repos means running 3 separate queries

Envoy fills all of these. Whether that’s a good idea depends on whether you actually run multiple agents – if you don’t, this is overkill.

How it works

Everything is SQLite-backed. The stack is:

envoy (HTTP server, this project)
  └── atheneum (embedded knowledge graph)
        └── sqlitegraph (SQLite graph engine with pub/sub)

Start the server:

envoy serve --port 9876

Or as a systemd user service:

systemctl --user start envoy

The server has been running on my machine for 42+ hours straight with no restarts. Health check:

$ curl http://127.0.0.1:9876/health
{"status":"ok","uptime_seconds":152986,"agents_online":2}

Agent identity

Agents register at session start. The server assigns hierarchical IDs:

$ curl -X POST http://127.0.0.1:9876/agents \
  -H "content-type: application/json" \
  -d '{"name":"claude-main","kind":"claude"}'
{"agent_id":"id1","name":"claude-main","is_new":true,...}

Subagents get dotted IDs that encode the hierarchy:

$ curl -X POST http://127.0.0.1:9876/agents \
  -H "content-type: application/json" \
  -d '{"name":"sub-agent-1","kind":"claude","parent_id":"id1"}'
{"agent_id":"id1.1","name":"sub-agent-1","parent_id":"id1",...}

Retiring an agent cascades to its children:

$ curl -X POST http://127.0.0.1:9876/agents/id1/retire \
  -H "X-Agent-Id: id1" \
  -H "content-type: application/json" \
  -d '{"agent_id":"id1"}'
{"affected":["id1","id1.1"],"retired":true}
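The ID scheme and the cascade can be modeled in memory like this. It is a sketch of the behavior shown in the curl output, not envoy's registry code: the class and field names are hypothetical, and it does not check that a parent exists.

```python
class AgentRegistry:
    def __init__(self):
        self.agents = {}       # agent_id -> {"name", "parent_id", "retired"}
        self.child_count = {}  # parent_id (None for roots) -> children issued so far

    def register(self, name, parent_id=None):
        n = self.child_count.get(parent_id, 0) + 1
        self.child_count[parent_id] = n
        agent_id = f"id{n}" if parent_id is None else f"{parent_id}.{n}"
        self.agents[agent_id] = {"name": name, "parent_id": parent_id, "retired": False}
        return agent_id

    def retire(self, agent_id):
        """Retire an agent and every descendant; return the affected IDs."""
        prefix = agent_id + "."  # the dot keeps id1 from matching id10
        affected = sorted(a for a in self.agents
                          if a == agent_id or a.startswith(prefix))
        for a in affected:
            self.agents[a]["retired"] = True
        return affected

reg = AgentRegistry()
main = reg.register("claude-main")
sub = reg.register("sub-agent-1", parent_id=main)
other = reg.register("hermes-main")
affected = reg.retire(main)
```

Because the hierarchy is encoded in the ID itself, finding all descendants is a prefix match and needs no separate parent-child table.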

Session accountability

Every session writes structured data through envoy-hook (a companion binary that plugs into Claude Code’s hook system). The lifecycle is:

SessionStart   → POST /atheneum/sessions
PostToolUse    → POST /atheneum/tool-calls
SubagentStop   → POST /atheneum/sessions/{id}/handover
Stop           → PATCH /atheneum/sessions/{id}

Query prior sessions before starting work:

$ curl -s "http://127.0.0.1:9876/atheneum/sessions?project=envoy&last=1"
[{
  "session_id": "...",
  "project": "envoy",
  "git_branch": "main",
  "tool_call_count": 47,
  "file_write_count": 12,
  "last_tool": "cargo test",
  "last_tool_summary": "all 34 tests passed"
}]

Tool call logging requires session_id, tool_name, and exit_status (the fields that tripped me up during testing – the API is precise about what it expects):

$ curl -X POST http://127.0.0.1:9876/atheneum/tool-calls \
  -H "X-Agent-Id: id1" \
  -H "content-type: application/json" \
  -d '{"session_id":"...","tool_name":"read_file",
       "exit_status":"success","input_summary":"read src/main.rs",
       "output_summary":"42 lines","latency_ms":150}'

Messaging between agents

Agents send messages to each other. This is the core coordination primitive:

$ curl -X POST http://127.0.0.1:9876/messages \
  -H "X-Agent-Id: id1" \
  -H "content-type: application/json" \
  -d '{"type":"direct","from":"id1","to":"id2",
       "parts":[{"text":"hey, the build is green"}]}'
{"message_id":"6751","from":"id1","to":"id2",...}

The recipient polls for pending messages:

$ curl -s "http://127.0.0.1:9876/agents/id2/messages/pending"
{
  "count": 1,
  "messages": [{
    "message_id": "6751",
    "from": "id1",
    "parts": [{"text": "hey, the build is green"}]
  }]
}

And acknowledges receipt:

$ curl -X POST http://127.0.0.1:9876/messages/6751/ack \
  -H "X-Agent-Id: id2" \
  -H "content-type: application/json" \
  -d '{"agent_id":"id2"}'
{"acked_by":["id2"],"message_id":"6751"}

The polling problem. This is the biggest pain point. The MCP (Model Context Protocol) interface that coding agents use is request-response: the agent asks a question, the server answers. There is no push mechanism. When agent A sends agent B a message, agent B only finds out the next time it explicitly polls pending. In practice, agents need to check periodically, which means either:

Wasting tokens on poll loops (“any messages for me? no? ok”)
Adding latency – a message sits undelivered until the next poll

The WebSocket endpoint exists (/ws) but coding agents don’t speak WebSocket natively. They speak HTTP. Until MCP adds a push/subscription mechanism, polling is the only option. This is a protocol limitation, not an implementation choice.

Knowledge persistence

Agents store discoveries so future sessions don’t re-derive them:

$ curl -X POST http://127.0.0.1:9876/atheneum/discoveries \
  -H "X-Agent-Id: id1" \
  -H "content-type: application/json" \
  -d '{"agent":"claude","discovery_type":"Bug",
       "target":"query_sessions",
       "metadata":{"file":"evidence.rs","line":547,
                   "why":"anonymous ? params required"}}'
{"discovery_id":7502,...}

Cross-project code search

This one I use daily. When you work on multiple codebases simultaneously, you need to find symbols across all of them. Envoy queries all magellan-indexed projects from one endpoint without copying data:

# One-time setup per project
atheneum meta-register envoy ~/Projects/envoy \
  ~/.magellan/envoy/envoy.db --language rust

# Search across all registered projects
$ curl "http://127.0.0.1:9876/atheneum/cross/search?q=build_router&language=rust&k=5"
{
  "count": 5,
  "results": [
    {"project":"envoy","name":"build_router","kind":"Function",
     "file":"src/http/router.rs","line":81},
    {"project":"envoy","name":"build_router calls build_base_routes",
     "kind":"Call","file":"src/http/router.rs","line":82},
    ...
  ]
}

How it works: envoy delegates to atheneum’s CrossRouter, which lazily attaches each project’s magellan DB read-only via ATTACH DATABASE and queries across schemas. An LRU cache keeps hot DBs attached across requests. SQLite’s default limit is 10 attached databases per connection, so the cache defaults to 8.

The deeper navigate endpoint (/atheneum/cross/navigate) that does BFS graph walks across projects currently errors on the cross-schema edge queries. That’s a known bug – the UNION ALL over attached schemas doesn’t find the edges table. Search works, graph navigation doesn’t yet.

The knowledge graph underneath

Envoy sits on top of atheneum, which stores everything as a property graph. Real numbers from my running instance:

Entity counts (4,747 total):
  ToolCall:    2,399    Session:     231    File:      203
  Reference:     338    WikiPage:    280    Import:    198
  ReasoningLog:  329    Symbol:      190    Memory:    130
  TestRun:       120    Discovery:     3    Event:       3

Edge counts (15,210 total):
  belongs_to_project: 4,184    accessed:    635
  observed_in:        3,435    modified:    393
  wikilink:           3,220    CALLS:       116
  handled_by_tool:    2,399    IMPORTS:      84
  performed_by:         233    REFERENCES:  145

This is what makes cross-session memory possible. When a new session starts, it queries the graph for prior context instead of re-discovering everything.

What’s rough

Honest assessment of what doesn’t work well:

The MCP polling problem described above. No push mechanism, no subscriptions, no server-sent events. Agents waste tokens polling or accept delivery latency.
The /atheneum/cross/navigate endpoint errors on cross-schema edge queries. Symbol search works, graph walks don’t.
API discoverability is poor. Several endpoints have required fields that aren’t documented anywhere except the Rust source. I found agent is required on session creation, tool_name instead of tool on tool-calls, agent_id on ack – all through 422 errors.
The events endpoint returns an empty body on success (no confirmation JSON), which makes it hard to verify it worked.
Token savings counter in the knowledge endpoint always returns 0. Never got around to implementing the calculation.
v0.1.1 – 127 commits, 11.5K LOC, but still early. No backward compatibility guarantees yet.

The post-mortem that shaped it

During development, a private git dependency broke CI for 8 consecutive runs. The dependency was specified as a git = "..." URL in Cargo.toml. It resolved fine locally (cached) but failed on every CI runner (fresh clone, no cache, no SSH key for the private repo). The error was misleading – cargo reported “revival failed” which looked like a registry issue, not an access issue.

That incident directly led to three envoy features:

Session accountability – if CI had logged what it actually did vs. what it claimed, the SSH key issue would have been obvious in 1 run instead of 8
Structured tool call logging – the difference between “cargo check failed” and “cargo check failed because SSH key was missing for git+https://…” is the difference between 1 hour and 6 hours of debugging
The subagent trust model – subagents are not trusted by default. Their output is only valid when all verification gates pass (magellan queries ran, cargo check green, no stubs). If a subagent’s hooks blocked it, its summary is discarded as unreliable

Current state

Version:    0.1.1
LOC:        11,562 (Rust)
Tests:      5,223 lines
Commits:    127
Endpoints:  20+ (agents, sessions, messages, tool-calls, events,
              discoveries, graph, cross-project search, health,
              circuit breakers)
Runtime:    SQLite (no external services)
License:    GPL-3.0-only

Install:

cargo install agent-envoy

Or as part of the grounded-coding stack (also installs magellan, llmgrep, mirage, splice):

curl -fsSL https://raw.githubusercontent.com/oldnordic/grounded-coding/master/install.sh | sh

Source: github.com/oldnordic/envoy

Training a Geometric Language Model in Pure Rust: First Results

2026-06-12T00:00:00+00:00

The geometric decoder post described how a corpus-native graph can guide token decoding through Rodrigues rotation and curvature weighting. This post covers what happens when you connect that graph to a training loop and actually try to learn next-token prediction from it.

Everything runs on CPU. No GPU, no autograd framework — just pure Rust with manual backprop.

What’s being trained

The architecture:

input:  8 context token positions × 3D coords = 24 floats
hidden: MLP(24 → hidden_dim → vocab_size)
output: softmax over dense vocab

The positions come from the same PMI+SVD pipeline as the decoder experiments: co-occurrence statistics → TruncatedSVD → unit-sphere 3D coordinates per token. The MLP maps those geometric coordinates to a next-token distribution.

What’s not there: learned embeddings, attention, positional encoding, transformer blocks. The geometry is the representation. The MLP is the prediction head.
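The forward pass is small enough to write out. This sketch assumes a single hidden layer with ReLU (the activation is not stated above) and uses a tiny hidden size and vocab so it runs instantly; the weights are random placeholders.

```python
import math
import random

def forward(context_coords, w1, b1, w2, b2):
    """context_coords: 8 (x, y, z) tuples -> probability distribution over the vocab."""
    x = [c for coord in context_coords for c in coord]        # flatten to 24 floats
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)  # hidden layer, ReLU
         for row, b in zip(w1, b1)]
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + b
              for row, b in zip(w2, b2)]
    m = max(logits)                                            # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
HIDDEN, VOCAB = 16, 10  # toy sizes for the sketch
w1 = [[random.uniform(-0.1, 0.1) for _ in range(24)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
b2 = [0.0] * VOCAB
context = [(1.0, 0.0, 0.0)] * 8  # eight unit-sphere positions
probs = forward(context, w1, b1, w2, b2)
```

The only inputs are coordinates: token identity never reaches the network except through where the token sits on the sphere.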

Why no framework. Burn and Candle both have poor CPU performance and are primarily CUDA infrastructure. The experiments are CPU-first and AMD GPU later. Writing the forward and backward passes directly in Rust costs a few hundred lines (algorithms/mlp.rs, algorithms/adam.rs in geographdb-core) and avoids pulling in a dependency chain that doesn’t fit the use case.

The backward pass starts at the output layer (taking logits = h @ W_out) and then flows back through the Rodrigues transport step:

δW_out  = h.T @ δlogits
δh      = δlogits @ W_out.T

Rodrigues rotation matrices are orthogonal, so R^T = R^{-1}. Gradients flow back through the transport step without inverting anything:

δh_u = Σ_{v: u∈N(v)} R_{vu}^T · δh2_v

Trainable parameters: the MLP weights only. The coordinate positions are fixed (frozen from the SVD), and the Rodrigues rotations have no parameters — they’re computed from 3D positions at forward-pass time.
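To make the "no parameters, no inversion" point concrete, here is a sketch of a Rodrigues rotation built from two unit vectors, plus the helpers needed to check that its transpose undoes it. This is a textbook construction under my own naming (`rodrigues`, `apply`, `transpose`). The real code in parallel_transport.rs may derive the axis and angle differently, and the antiparallel case is deliberately left unhandled.

```rust
type Vec3 = [f32; 3];
type Mat3 = [[f32; 3]; 3];

const IDENTITY: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]];

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn cross(a: Vec3, b: Vec3) -> Vec3 {
    [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
}

fn matmul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(m: &Mat3) -> Mat3 {
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[j][i] = m[i][j];
        }
    }
    out
}

/// Matrix-vector product m @ x.
fn apply(m: &Mat3, x: Vec3) -> Vec3 {
    [dot(m[0], x), dot(m[1], x), dot(m[2], x)]
}

/// Rotation taking unit vector `a` onto unit vector `b`:
/// R = I + sin(theta) K + (1 - cos(theta)) K^2, with K the cross-product
/// matrix of the normalized axis a x b. Nearly parallel inputs return the
/// identity; antiparallel inputs are not handled in this sketch.
fn rodrigues(a: Vec3, b: Vec3) -> Mat3 {
    let v = cross(a, b);
    let sin_t = dot(v, v).sqrt();
    let cos_t = dot(a, b);
    if sin_t < 1e-6 {
        return IDENTITY;
    }
    let k = [v[0] / sin_t, v[1] / sin_t, v[2] / sin_t];
    let kx: Mat3 = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]];
    let kx2 = matmul(&kx, &kx);
    let mut r = IDENTITY;
    for i in 0..3 {
        for j in 0..3 {
            r[i][j] += sin_t * kx[i][j] + (1.0 - cos_t) * kx2[i][j];
        }
    }
    r
}
```

In the backward pass, a gradient arriving at the rotated vector is carried back with `apply(&transpose(&r), grad)`, which is the R^T step in the formula above.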

Toy corpus: does the implementation work?

Before touching TinyStories, the training loop was tested on a hand-built two-community graph: 8 nodes split into two spatial clusters, sequences that walk within or between communities.

Result: 100% accuracy after 200 epochs. Loss curve is monotonically decreasing. The MLP can learn to separate the two communities from 3D coordinate context alone.

This isn’t impressive on its own — it’s 8 nodes — but it validates that the forward pass, backward pass, Adam update, and the gradient accumulation are all correct.

First run on TinyStories: a training bug

The first TinyStories run (2,000 stories, vocab 3,547 tokens + UNK, dim 64, lr=0.001) showed a diagnostic failure:

epoch 1  loss=7.50
epoch 2  loss=7.76

Loss went up on epoch 2. That’s optimizer divergence, not architecture failure.

Root cause: the training loop was calling one Adam step per training example. Per-example Adam is stochastic gradient descent at maximum noise: each of the ~100K examples in a 2,000-story epoch produces its own independent parameter update, and Adam’s moment estimates are dominated by noise when every gradient comes from a single data point. With a 3,547-class output, each update is 226K parameter changes computed from one token’s gradient.

Fix: accumulate gradients over batches of 128, divide by the batch size, then take one Adam step per batch. This is standard mini-batch training. The default LR was lowered to 1e-4, and the clip_gradients routine that already existed in the Adam implementation was wired in.
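The shape of that fix, as a self-contained sketch: accumulate per-example gradients, average them, clip, then take a single Adam step. The `Adam` struct, `clip_by_norm`, and `batch_step` here are stand-ins written for this post with textbook defaults (β1 = 0.9, β2 = 0.999). They are not the signatures in algorithms/adam.rs, and the real clip_gradients may clip differently.

```rust
/// Textbook Adam over a flat parameter vector (illustrative stand-in).
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    t: i32,
    m: Vec<f32>,
    v: Vec<f32>,
}

impl Adam {
    fn new(n_params: usize, lr: f32) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, t: 0,
               m: vec![0.0; n_params], v: vec![0.0; n_params] }
    }

    fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        self.t += 1;
        // Bias corrections for the running first and second moments.
        let bc1 = 1.0 - self.beta1.powi(self.t);
        let bc2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}

/// Rescale the gradient so its L2 norm is at most `max_norm`.
fn clip_by_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

/// One optimizer step per batch: sum per-example gradients, divide by the
/// batch size, clip, then call Adam once.
fn batch_step<E>(
    params: &mut [f32],
    adam: &mut Adam,
    batch: &[E],
    max_norm: f32,
    grad_fn: impl Fn(&[f32], &E) -> Vec<f32>,
) {
    if batch.is_empty() {
        return;
    }
    let mut acc = vec![0.0f32; params.len()];
    for example in batch {
        for (a, g) in acc.iter_mut().zip(grad_fn(&*params, example)) {
            *a += g;
        }
    }
    let inv = 1.0 / batch.len() as f32;
    for a in acc.iter_mut() {
        *a *= inv;
    }
    clip_by_norm(&mut acc, max_norm);
    adam.step(params, &acc);
}
```

The training loop then calls `batch_step` once per chunk of 128 examples instead of calling `Adam::step` once per example.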

With the fix, the same run:

epoch 1  loss=6.006
epoch 2  loss=5.831
epoch 3  loss=5.741
epoch 4  loss=5.668
epoch 5  loss=5.616

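Monotonically decreasing across all 5 epochs. No divergence. Assuming the loss is mean cross-entropy in nats, the final train loss of 5.616 corresponds to a train perplexity of exp(5.616) ≈ 275, close to the validation perplexity of 282.8 reported below (no overfitting: the model hasn’t learned enough to overfit).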
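Since the logs report loss and the tables report perplexity, here is the conversion I am assuming throughout: perplexity is the exponential of the mean negative log-likelihood, with natural logs. The function names are mine, and the natural-log convention is an assumption about how the training binary reports loss.

```rust
/// Perplexity from the natural-log probabilities the model assigned to
/// each true next token. An empty slice yields NaN.
fn perplexity(log_probs: &[f64]) -> f64 {
    let mean_nll = -log_probs.iter().sum::<f64>() / log_probs.len() as f64;
    mean_nll.exp()
}

/// Perplexity from an already-averaged cross-entropy loss in nats.
fn perplexity_from_loss(mean_nll: f64) -> f64 {
    mean_nll.exp()
}

/// A model that spreads probability uniformly over `vocab_size` tokens
/// has perplexity equal to the vocabulary size.
fn uniform_perplexity(vocab_size: usize) -> f64 {
    vocab_size as f64
}
```

The uniform case is the natural "random guessing" reference point: a model is learning something once its perplexity drops well below the vocabulary size.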

Results: 2k stories, 5 epochs

Model	Validation perplexity
Bigram (Laplace-smoothed)	175.7
Geometric MLP (frozen coords)	282.8
Geometric MLP + curvature weighting	304.5

Bigram wins. The geometric model is learning (282.8 vs. roughly 3,500 for uniform guessing over the 3,547-token + UNK vocabulary), but not beating the baseline.

Three things are worth unpacking here.

Why bigram wins. Bigram takes the exact previous token as input and directly reads co-occurrence counts. The geometric MLP takes 3D positions as input. The SVD compression maps tokens to unit-sphere coordinates based on shared neighborhood structure — tokens that co-occur with similar neighbors end up nearby. But nearby tokens aren’t identical: the 3D position is a lossy representation of the token identity. The MLP has to recover discriminative signal from compressed coordinates. Bigram has no such compression; it works directly from identity.
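For reference, a Laplace-smoothed bigram is only a few lines. This is a generic add-one sketch, assuming integer token ids and a known vocabulary size. It is not the baseline code from the experiments repo, and the struct and method names are illustrative.

```rust
use std::collections::HashMap;

/// Add-one (Laplace) smoothed bigram model over integer token ids.
struct Bigram {
    vocab_size: usize,
    pair_counts: HashMap<(u32, u32), u32>,
    prev_counts: HashMap<u32, u32>,
}

impl Bigram {
    fn train(tokens: &[u32], vocab_size: usize) -> Self {
        let mut pair_counts = HashMap::new();
        let mut prev_counts = HashMap::new();
        // Count every adjacent (previous, next) pair in order.
        for w in tokens.windows(2) {
            *pair_counts.entry((w[0], w[1])).or_insert(0) += 1;
            *prev_counts.entry(w[0]).or_insert(0) += 1;
        }
        Bigram { vocab_size, pair_counts, prev_counts }
    }

    /// P(next | prev) = (count(prev, next) + 1) / (count(prev) + V).
    fn prob(&self, prev: u32, next: u32) -> f64 {
        let pair = *self.pair_counts.get(&(prev, next)).unwrap_or(&0) as f64;
        let total = *self.prev_counts.get(&prev).unwrap_or(&0) as f64;
        (pair + 1.0) / (total + self.vocab_size as f64)
    }
}
```

The lookup key is the exact previous token id, which is the point of the comparison: nothing about the token's identity is compressed away before the counts are read.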

Why the comparison is slightly asymmetric. Bigram uses 1-token context. The geometric model uses 8-token context (8 × 3D positions). The geometric model has more information in principle, but at this data scale the 3D coordinates don’t carry enough structure to exploit the longer context. With 2,000 training stories, the PMI co-occurrence matrix is sparse — many token pairs never co-occur, and the SVD positions don’t reliably separate semantically distinct tokens.

Why curvature weighting hurts. The curvature evaluation adds a heuristic log-probability bias (angle continuity + κ penalty, both with fixed coefficients) on top of the learned MLP logits at inference time. If the MLP has already learned something useful, overlaying an untuned heuristic distorts it. The curvature signal isn’t useless — it actively helped in the decoder traversal experiments — but there it was the only signal. Adding it as a fixed-coefficient bonus over a trained model requires tuning those coefficients, not hardcoding them at 1.0.

20,000 stories, 15 epochs: the full result

The 20k run added two variants not tested before: a trigram model (takes two previous token IDs, no geometry) and a hybrid model (two previous token IDs + 8 previous 3D positions). This makes the comparison direct: does geometry add anything on top of token identity?

Model	Validation perplexity
Bigram baseline	72.97
Trigram (token identity only)	32.02
Hybrid (token identity + geometry)	43.24
Hybrid + κ weighting	43.81

Geometry does not add signal. Trigram beats hybrid by 11 perplexity points. The MLP gets a cleaner signal from two token IDs than from two token IDs plus 8 × 3D coordinates. The curvature-weighted variant is slightly worse than plain hybrid.

Training dynamics match the numbers. Trigram fit the training set harder and plateaued around loss 3.18. Hybrid plateaued around 3.61 and started overfitting after epoch 9 — the geometric features are hurting generalisation, not helping it.

Why geometry doesn’t help here:

PMI+SVD positions encode shared co-occurrence neighborhood structure. Tokens that appear in similar contexts end up nearby in 3D space. That’s useful for finding semantically related tokens, but next-token prediction doesn’t need semantically related tokens — it needs the likely next token given the current context. A 3D coordinate tells you what a token is like; it doesn’t tell you what comes after it. The token ID tells you both.

The 8-position geometric context should in principle carry more information than a single token ID (which is what bigram uses). In practice, the MLP can’t extract that signal from the SVD coordinates. The two-token-ID trigram dominates by a large margin over everything else.

Geo-attention: single-head graph attention over geometric neighbors

The MLP result raised a different question: maybe the architecture is the constraint, not the representation. An MLP treats all 8 context positions equally and independently. A token’s geometric neighbors might carry signal that only becomes useful when actively queried — matching what the current token is “looking for” against what its neighbors know.

GraphAttentionClassifier implements this directly:

Token embedding table (learned)
Learned W_q, W_k, W_v projections
Each context token attends to itself + its k geometric neighbors from the PMI graph
Residual update: h = embedding + attention(...)
MLP head on the last context position
Full backward pass through attention weights and MLP
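The forward half of that list can be sketched as a single function: one token queries itself plus its geometric neighbors, and the attended value is added back onto its embedding. Scaled dot-product scoring, the `Vec<Vec<f32>>` matrices, and the function names are assumptions of this sketch. GraphAttentionClassifier may score and store things differently, and W_v must map back to the embedding dimension for the residual to type-check.

```rust
type Matrix = Vec<Vec<f32>>;

/// Matrix-vector product m @ x.
fn matvec(m: &Matrix, x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Numerically stable softmax.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-head attention for one context token over itself and its
/// neighbors, followed by the residual update h = embedding + attention.
fn attend(
    token: usize,
    neighbors: &[usize],
    emb: &Matrix,
    w_q: &Matrix,
    w_k: &Matrix,
    w_v: &Matrix,
) -> Vec<f32> {
    let q = matvec(w_q, &emb[token]);
    let scale = (q.len() as f32).sqrt();
    // The token attends to itself first, then to each neighbor.
    let members: Vec<usize> =
        std::iter::once(token).chain(neighbors.iter().copied()).collect();
    let scores: Vec<f32> = members
        .iter()
        .map(|&v| {
            let k = matvec(w_k, &emb[v]);
            q.iter().zip(&k).map(|(a, b)| a * b).sum::<f32>() / scale
        })
        .collect();
    let weights = softmax(&scores);
    let mut h = emb[token].clone();
    for (&v, &w) in members.iter().zip(&weights) {
        let value = matvec(w_v, &emb[v]);
        for (hi, vi) in h.iter_mut().zip(&value) {
            *hi += w * vi;
        }
    }
    h
}
```

With zero query and key projections every score ties, so the update reduces to the embedding plus the plain average of the neighborhood values. Anything the model learns in W_q and W_k is a departure from that uniform average.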

The same 20k/15ep setup, four variants in parallel:

Model	Validation perplexity
One-hot trigram (baseline)	32.02
Geo-attention + 4 neighbors	55.54
Geometric rotated + 4 neighbors	126.85
Geometric absolute	145.08
Geometric rotated (no neighbors)	272.98

Attention over geometry is much better than MLP over geometry. Geo-attention (55.54) is roughly 2.3x better than the best MLP-on-geometry variant (127 ppl). The query/key/value mechanism gives the model a “search and correlate” capability the flat MLP doesn’t have: it can weight neighbors selectively based on what the current token embedding is asking for.

Geometry still loses to token identity. Even with attention, geo-attention is 23 ppl behind one-hot trigram. Rotation alone (no neighbors) was near-useless (273 ppl); adding 4 neighbors rescued it to 127 ppl. Local geometric neighborhoods carry some signal — but only when actively queried, and not enough to close the gap with trigram.

Why the gap persists. PMI+SVD positions cluster tokens by shared co-occurrence context — tokens that appear in similar environments end up nearby in 3D space. That’s a semantic similarity measure. Next-token prediction needs successor structure: which token tends to follow this one. These are different things. “Dog” and “cat” are geometric neighbors (similar contexts); neither predicts the other as a next token. The trigram baseline reads co-occurrence directly as successor frequency. The PMI graph doesn’t preserve that direction.
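The difference between the two kinds of count is easy to see in code. The sketch below contrasts ordered successor counts with unordered window co-occurrence counts. Keying the symmetric table by (min, max) and the `window` parameter are my illustrative choices, not a description of how the PMI pipeline stores its matrix.

```rust
use std::collections::HashMap;

/// Directed counts: how often `next` immediately follows `prev`.
fn successor_counts(tokens: &[u32]) -> HashMap<(u32, u32), u32> {
    let mut counts = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Symmetric counts: how often two tokens appear within `window`
/// positions of each other, in either order. The key is the unordered
/// pair, stored as (smaller id, larger id).
fn symmetric_counts(tokens: &[u32], window: usize) -> HashMap<(u32, u32), u32> {
    let mut counts = HashMap::new();
    for i in 0..tokens.len() {
        let end = (i + window + 1).min(tokens.len());
        for j in (i + 1)..end {
            let key = (tokens[i].min(tokens[j]), tokens[i].max(tokens[j]));
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```

In the directed table, (a, b) and (b, a) are separate entries. In the symmetric table they collapse into one, so anything built on it cannot tell which token came first.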

Where this leaves things

The full experiment arc so far, at 20k stories / 15 epochs:

Architecture	Representation	Validation ppl
MLP	One-hot trigram	32.02
MLP	Hybrid (token ID + geometry)	43.24
Attention	Graph neighbors	55.54
MLP	Geometric rotated + neighbors	126.85
MLP	Geometric absolute	145.08
MLP	Geometric rotated	272.98

The bottleneck is the PMI+SVD graph construction, not the model. To beat trigram with geometry, the geometric space itself needs to encode successor structure — either learned end-to-end, or derived from a graph that preserves directional co-occurrence rather than symmetric neighborhood similarity. That’s the next question.

Reproduce

git clone https://github.com/oldnordic/geographdb-core
git clone https://github.com/oldnordic/geographdb-experiments

cd geographdb-experiments
cargo run --release --bin train_geometric -- \
  --dataset roneneldan/TinyStories \
  --vocab-size 4096 \
  --dim 64 \
  --epochs 5 \
  --lr 1e-4 \
  --max-train-stories 2000 \
  --max-val-stories 1000

Hardware: AMD Ryzen 7 7800X3D, 64 GB RAM, no GPU used. Training 2k stories for 5 epochs takes roughly 8 minutes on this machine.

The tokenizer is cached to --output (default /tmp/train_geometric_tinystories) after the first run.

Code

MLP ops + backward: geographdb-core/src/algorithms/mlp.rs
Adam optimizer: geographdb-core/src/algorithms/adam.rs
Training binary: geographdb-experiments/src/bin/train_geometric.rs
Rodrigues rotation: geographdb-core/src/algorithms/parallel_transport.rs
