<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://oldnordic.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://oldnordic.github.io/" rel="alternate" type="text/html" /><updated>2026-06-16T14:08:53+00:00</updated><id>https://oldnordic.github.io/feed.xml</id><title type="html">Luiz Spies — Technical Notes</title><subtitle>Notes on language geometry, LLM inference, code intelligence, and agentic AI systems. All results reproducible. No hype.</subtitle><author><name>Luiz Spies</name></author><entry><title type="html">Transformer X-Ray, Part II: The BOS Bottleneck</title><link href="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray-depth-profiles.html" rel="alternate" type="text/html" title="Transformer X-Ray, Part II: The BOS Bottleneck" /><published>2026-06-16T00:00:00+00:00</published><updated>2026-06-16T00:00:00+00:00</updated><id>https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray-depth-profiles</id><content type="html" xml:base="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray-depth-profiles.html"><![CDATA[<p>The first <a href="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray.html">x-ray post</a> looked at where the prediction position sends attention mass. That was useful, but it flattened the forward pass into one aggregate graph.</p>

<p>The more interesting question is when the routing pattern changes.</p>

<p>I ran a small sweep of 16 prompt pairs and compared layerwise attention divergence between an intended continuation context and a contradictory one. Three examples stood out:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dna_acid</code></li>
  <li><code class="language-plaintext highlighter-rouge">shakespeare_romeo</code></li>
  <li><code class="language-plaintext highlighter-rouge">wwii_end</code></li>
</ul>

<p>They produce three different depth profiles. Together they suggest a consistent structure: the forward pass compresses into a BOS-dominated bottleneck around layer 16 of 24, then reopens near the output.</p>

<p>This post is about that shape.</p>

<hr />

<h2 id="what-was-measured">What was measured</h2>

<p>For each prompt pair, I traced the full forward pass and computed layerwise Jensen-Shannon divergence between the attention-flow distributions under:</p>

<ul>
  <li>an intended continuation context</li>
  <li>a contradictory continuation context</li>
</ul>
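
<p>For reference, a minimal sketch of the Jensen-Shannon divergence between two attention-mass vectors at one layer. It assumes both vectors cover the same positions in the same order and uses base-2 logarithms so the value stays between 0 and 1; the post does not state which base the tracer uses, and the function names and example numbers are hypothetical.</p>

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("attention weights sum to zero")
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p_weights, q_weights):
    """Jensen-Shannon divergence in bits (assumed base 2, range 0..1)."""
    p = normalize(p_weights)
    q = normalize(q_weights)
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical per-position attention mass at one layer, not taken from the traces.
intended = [0.50, 0.30, 0.15, 0.05]
contradictory = [0.50, 0.10, 0.15, 0.25]
print(round(js_divergence(intended, contradictory), 4))
```

<p>Running this once per layer gives one divergence value per layer, which is the kind of curve the cases below summarize by average and peak.</p>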

<p>Separately, I aggregated the prediction-position attention by layer and measured how much of that mass went to position 0 (BOS).</p>
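
<p>A minimal sketch of that BOS-share measurement, assuming the per-layer attention mass at the prediction position is already available as a mapping from position to mass. The data layout and function names are assumptions for illustration, not the tracer's actual output format.</p>

```python
def bos_share(mass_by_position):
    """Fraction of prediction-position attention mass that lands on position 0."""
    total = sum(mass_by_position.values())
    if total == 0:
        return 0.0
    return mass_by_position.get(0, 0.0) / total

def bos_share_by_layer(layer_masses):
    """Map layer index to BOS share, given layer -> {position: mass}."""
    return {layer: bos_share(masses) for layer, masses in sorted(layer_masses.items())}

# Hypothetical masses for two layers; the numbers are illustrative only.
example = {
    0: {0: 0.5, 1: 4.5, 2: 5.0},
    16: {0: 9.0, 1: 0.5, 2: 0.5},
}
print(bos_share_by_layer(example))
```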

<p>One caveat up front: in this sweep, “correct” (as in the trace file names) means the prompt was continued with the intended context, not necessarily that the model produced the expected answer token. Two of the three examples below do <strong>not</strong> produce the expected final token even in the intended context. They are still useful because the claim here is about routing depth, not benchmark accuracy.</p>

<p>The relevant trace files are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">/tmp/dna_correct.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/shakespeare_correct.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/wwii_correct.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/kappa_results.jsonl</code></li>
</ul>

<hr />

<h2 id="shared-structure">Shared structure</h2>

<p>Across all three traces:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">L0-L2</code>: local/contextual routing dominates</li>
  <li><code class="language-plaintext highlighter-rouge">L3</code>: BOS turns on abruptly</li>
  <li><code class="language-plaintext highlighter-rouge">L16</code>: BOS reaches maximum compression</li>
  <li><code class="language-plaintext highlighter-rouge">L17-L23</code>: BOS releases and semantic positions reopen</li>
</ul>

<p>The normalized BOS share at the bottleneck layer:</p>

<table>
  <thead>
    <tr>
      <th>Example</th>
      <th><code class="language-plaintext highlighter-rouge">L16</code> BOS share</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dna_acid</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.933</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">shakespeare_romeo</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.915</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">wwii_end</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.889</code></td>
    </tr>
  </tbody>
</table>

<p>This is the strongest invariant in the data so far. The bottleneck is not vaguely “somewhere in the middle.” On these traces it sits around two-thirds depth: layer 16 of 24.</p>

<hr />

<h2 id="case-1-dna_acid">Case 1: <code class="language-plaintext highlighter-rouge">dna_acid</code></h2>

<p>Prompt family: <code class="language-plaintext highlighter-rouge">"DNA stands for deoxyribonucleic ..."</code></p>

<p>This is the clean memorized-phrase case.</p>

<p>Layerwise divergence:</p>

<ul>
  <li>average JS divergence: <code class="language-plaintext highlighter-rouge">0.0553</code></li>
  <li><code class="language-plaintext highlighter-rouge">L2</code> peak: <code class="language-plaintext highlighter-rouge">0.1803</code></li>
</ul>

<p>That early peak matters. The contradiction shows up before the BOS sink fully activates.</p>

<p>Prediction-position BOS share:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>BOS share</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L0</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.045</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L1</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.049</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L2</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.034</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L3</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.510</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L16</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.933</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L23</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.536</code></td>
    </tr>
  </tbody>
</table>
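
<p>The table can also be read mechanically. The sketch below takes the six listed values and reports the first listed layer where the BOS share reaches a threshold and the listed layer where it is largest. The 0.5 threshold is an arbitrary choice for illustration, and layers missing from the table are simply not considered.</p>

```python
# BOS share at the layers reported in the table above (dna_acid).
# Layers between L3 and L16, and between L16 and L23, are not listed there.
DNA_ACID_BOS_SHARE = {0: 0.045, 1: 0.049, 2: 0.034, 3: 0.510, 16: 0.933, 23: 0.536}

def onset_layer(profile, threshold=0.5):
    """First listed layer whose BOS share reaches the threshold, or None."""
    for layer in sorted(profile):
        if profile[layer] >= threshold:
            return layer
    return None

def neck_layer(profile):
    """Listed layer with the largest BOS share."""
    return max(profile, key=profile.get)

print(onset_layer(DNA_ACID_BOS_SHARE))  # 3
print(neck_layer(DNA_ACID_BOS_SHARE))   # 16
```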

<p>Interpretation:</p>

<ul>
  <li>early layers are reading a memorized lexical pattern</li>
  <li>BOS compression takes over at <code class="language-plaintext highlighter-rouge">L3</code></li>
  <li>by <code class="language-plaintext highlighter-rouge">L16</code>, the trace is almost fully collapsed into BOS</li>
  <li>by <code class="language-plaintext highlighter-rouge">L23</code>, BOS is still dominant, but semantic positions re-emerge strongly</li>
</ul>

<p>This is an early-detection contradiction.</p>

<hr />

<h2 id="case-2-shakespeare_romeo">Case 2: <code class="language-plaintext highlighter-rouge">shakespeare_romeo</code></h2>

<p>Prompt family: <code class="language-plaintext highlighter-rouge">"Romeo and Juliet was written by ..."</code></p>

<p>This is the low-confidence late-separation case.</p>

<p>Layerwise divergence:</p>

<ul>
  <li>average JS divergence: <code class="language-plaintext highlighter-rouge">0.0503</code></li>
  <li><code class="language-plaintext highlighter-rouge">L23</code> peak: <code class="language-plaintext highlighter-rouge">0.1506</code></li>
</ul>

<p>Prediction-position BOS share:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>BOS share</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L0</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.037</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L1</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.146</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L2</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.091</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L3</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.734</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L16</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.915</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L23</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.416</code></td>
    </tr>
  </tbody>
</table>

<p>Confidence on the intended-context trace is low: <code class="language-plaintext highlighter-rouge">0.2678</code>.</p>

<p>That shows up in the late layers. BOS weakens, but the trace does not collapse into one clear semantic target. The upper layers remain distributed.</p>

<p>This is not the same pattern as <code class="language-plaintext highlighter-rouge">dna_acid</code>. The contradiction is not caught early. The middle stack is comparatively stable. The separation comes late, and the final routing remains diffuse.</p>

<hr />

<h2 id="case-3-wwii_end">Case 3: <code class="language-plaintext highlighter-rouge">wwii_end</code></h2>

<p>Prompt family: <code class="language-plaintext highlighter-rouge">"World War II ended in ..."</code></p>

<p>This is the strongest late-release case.</p>

<p>Layerwise divergence:</p>

<ul>
  <li>average JS divergence: <code class="language-plaintext highlighter-rouge">0.0402</code></li>
  <li><code class="language-plaintext highlighter-rouge">L23</code> peak: <code class="language-plaintext highlighter-rouge">0.1308</code></li>
</ul>

<p>Prediction-position BOS share:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>BOS share</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L0</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.049</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L1</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.121</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L2</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.049</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L3</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.677</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L16</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.889</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">L23</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.101</code></td>
    </tr>
  </tbody>
</table>

<p>BOS is not literally absent in the early layers. What matters is that it holds only a small share before <code class="language-plaintext highlighter-rouge">L3</code> (at most <code class="language-plaintext highlighter-rouge">0.121</code>), then drops back to about the same level (<code class="language-plaintext highlighter-rouge">0.101</code>) by <code class="language-plaintext highlighter-rouge">L23</code>.</p>

<p>At <code class="language-plaintext highlighter-rouge">L23</code>, the top positions are no longer BOS-dominated:</p>

<ul>
  <li>position <code class="language-plaintext highlighter-rouge">2</code>: <code class="language-plaintext highlighter-rouge">4.808</code></li>
  <li>position <code class="language-plaintext highlighter-rouge">4</code>: <code class="language-plaintext highlighter-rouge">4.333</code></li>
  <li>position <code class="language-plaintext highlighter-rouge">5</code>: <code class="language-plaintext highlighter-rouge">2.070</code></li>
  <li>BOS: <code class="language-plaintext highlighter-rouge">1.411</code></li>
</ul>

<p>So the late layers are reading the slot directly. The model stops using BOS as the dominant routing anchor and turns back to the content positions that matter for the year completion.</p>

<hr />

<h2 id="the-shape">The shape</h2>

<p>The hourglass is real, but it is not symmetric.</p>

<p>The observed structure on these traces is:</p>

<ol>
  <li><strong>Bottom opens early</strong>: <code class="language-plaintext highlighter-rouge">L0-L2</code> are local and content-heavy</li>
  <li><strong>Compression begins abruptly</strong>: BOS turns on at <code class="language-plaintext highlighter-rouge">L3</code></li>
  <li><strong>Neck sits late</strong>: maximum compression is around <code class="language-plaintext highlighter-rouge">L16</code></li>
  <li><strong>Top reopens near output</strong>: <code class="language-plaintext highlighter-rouge">L17-L23</code> release BOS and return mass to semantic positions</li>
</ol>

<p>That is more precise than saying “attention becomes abstract in the middle.”</p>

<p>More concretely:</p>

<ul>
  <li>early layers detect lexical or contextual pattern</li>
  <li>middle layers compress routing into a stable transport regime</li>
  <li>late layers reopen that routing to make the final commitment</li>
</ul>

<p>The three examples differ in where contradiction becomes visible:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dna_acid</code>: early lexical contradiction</li>
  <li><code class="language-plaintext highlighter-rouge">shakespeare_romeo</code>: late, low-confidence separation</li>
  <li><code class="language-plaintext highlighter-rouge">wwii_end</code>: late semantic slot reading</li>
</ul>

<hr />

<h2 id="what-this-does-and-does-not-show">What this does and does not show</h2>

<p>What it shows:</p>

<ul>
  <li>the forward pass has a measurable depth profile</li>
  <li>BOS can act as a real routing bottleneck</li>
  <li>different prompt types diverge at different depths</li>
</ul>

<p>What it does <strong>not</strong> yet show:</p>

<ul>
  <li>that this bottleneck is universal</li>
  <li>that <code class="language-plaintext highlighter-rouge">L16</code> is stable across models</li>
  <li>that the early/late split cleanly maps to “memorized” vs “compositional” at scale</li>
</ul>

<p>The current sweep is <code class="language-plaintext highlighter-rouge">N=16</code>. That is enough to show the pattern, not enough to treat it as settled.</p>

<p>The next useful run is larger and simpler:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">50-100</code> matched prompt pairs</li>
  <li>first divergence layer</li>
  <li>peak divergence layer</li>
  <li>area under the divergence curve</li>
  <li>split by prompt type</li>
</ol>
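
<p>The three per-pair summaries in that list could be computed along these lines. This is a sketch under stated assumptions: the divergence curve is a plain list indexed by layer, "first divergence" means the first layer at or above a fixed threshold (0.05 here, chosen arbitrarily), and the area uses the trapezoid rule with unit layer spacing. None of these choices come from the existing sweep.</p>

```python
def first_divergence_layer(js_by_layer, threshold):
    """Lowest layer index whose JS divergence reaches the threshold, or None."""
    for layer, value in enumerate(js_by_layer):
        if value >= threshold:
            return layer
    return None

def peak_divergence_layer(js_by_layer):
    """Layer index with the largest JS divergence (first one on ties)."""
    return max(range(len(js_by_layer)), key=lambda layer: js_by_layer[layer])

def area_under_curve(js_by_layer):
    """Trapezoidal area under the divergence curve with unit layer spacing."""
    return sum((a + b) / 2 for a, b in zip(js_by_layer, js_by_layer[1:]))

def summarize(js_by_layer, threshold=0.05):
    """Bundle the three summaries for one prompt pair."""
    return {
        "first": first_divergence_layer(js_by_layer, threshold),
        "peak": peak_divergence_layer(js_by_layer),
        "auc": area_under_curve(js_by_layer),
    }

# Hypothetical 6-layer curve, not from the sweep.
curve = [0.01, 0.02, 0.18, 0.04, 0.03, 0.06]
print(summarize(curve))
```

<p>Splitting by prompt type then reduces to grouping these dictionaries by a label attached to each pair.</p>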

<p>If the layer-16 bottleneck and the early-vs-late split survive that, this stops being a visual anecdote and starts becoming a routing diagnostic.</p>

<hr />

<h2 id="files">Files</h2>

<p>The x-ray tracer is in <a href="https://github.com/oldnordic/rocmforge">rocmforge</a>. The visualization work is in <a href="https://github.com/oldnordic/geographdb-core">geographdb-core</a>.</p>

<p>Relevant local artifacts from this run:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">/tmp/kappa_results.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/dna_correct.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/shakespeare_correct.jsonl</code></li>
  <li><code class="language-plaintext highlighter-rouge">/tmp/wwii_correct.jsonl</code></li>
</ul>

<p>The first post in this series is here:</p>

<ul>
  <li><a href="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray.html">X-raying a Transformer Forward Pass</a></li>
</ul>]]></content><author><name>Luiz Spies</name></author><category term="rocmforge" /><category term="mechanistic-interpretability" /><summary type="html"><![CDATA[The first x-ray post looked at where the prediction position sends attention mass. That was useful, but it flattened the forward pass into one aggregate graph.]]></summary></entry><entry><title type="html">X-raying a Transformer Forward Pass</title><link href="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray.html" rel="alternate" type="text/html" title="X-raying a Transformer Forward Pass" /><published>2026-06-16T00:00:00+00:00</published><updated>2026-06-16T00:00:00+00:00</updated><id>https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray</id><content type="html" xml:base="https://oldnordic.github.io/rocmforge/mechanistic-interpretability/2026/06/16/transformer-xray.html"><![CDATA[<p>What does attention actually do, token by token, layer by layer? Not the textbook answer — the actual numbers, on a real prompt, with a real model.</p>

<p>I built a forward-pass tracer into <a href="https://github.com/oldnordic/rocmforge">rocmforge</a> that captures every attention edge as inference runs, then renders it as a graph. This post shows what came out.</p>

<hr />

<h2 id="what-the-tracer-captures">What the tracer captures</h2>

<p>Every transformer forward pass is a flow: embeddings at the bottom, logits at the top, attention routing information between positions at each layer.</p>

<p>The tracer records this as a JSONL stream:</p>

<ul>
  <li><strong>node</strong> records: one per component (input_embedding, query, key, value, attention_output, mlp_hidden, logits, confidence) per layer per sequence position</li>
  <li><strong>edge</strong> records: attention edges with <code class="language-plaintext highlighter-rouge">src_position</code>, <code class="language-plaintext highlighter-rouge">dst_position</code>, <code class="language-plaintext highlighter-rouge">weight</code> — the raw softmax output, summed across heads</li>
  <li><strong>meta</strong> record: predicted token, confidence, and expected attention positions for the prompt</li>
</ul>

<p>Weights are summed across all 24 layers and all heads. This gives total attention mass per (src, dst) pair across the full forward pass.</p>
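
<p>A minimal sketch of that aggregation over a JSONL trace. The <code class="language-plaintext highlighter-rouge">src_position</code>, <code class="language-plaintext highlighter-rouge">dst_position</code>, and <code class="language-plaintext highlighter-rouge">weight</code> fields are the ones listed above; the <code class="language-plaintext highlighter-rouge">type</code> tag used to tell edge records from node records, and the other fields in the sample lines, are assumptions, so check the real trace before relying on them.</p>

```python
import json
from collections import defaultdict

def total_edge_mass(lines):
    """Sum edge weights per (src_position, dst_position) over all layers."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: edge records are tagged with "type": "edge".
        if record.get("type") != "edge":
            continue
        key = (record["src_position"], record["dst_position"])
        totals[key] += record["weight"]
    return dict(totals)

# Hypothetical trace lines; a real trace would be read from a file.
sample = [
    '{"type": "edge", "layer": 0, "src_position": 0, "dst_position": 8, "weight": 0.25}',
    '{"type": "edge", "layer": 1, "src_position": 0, "dst_position": 8, "weight": 0.5}',
    '{"type": "node", "layer": 1, "component": "query", "position": 8}',
    '{"type": "edge", "layer": 1, "src_position": 4, "dst_position": 8, "weight": 0.125}',
]
print(total_edge_mass(sample))
```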

<hr />

<h2 id="correct-prediction-paris--city">Correct prediction: Paris → city</h2>

<p>Prompt: <em>“The capital of France is Paris. Paris is a…”</em></p>

<p>Predicted token: <strong>city</strong> (confidence 0.773)</p>

<p>Expected positions: <code class="language-plaintext highlighter-rouge">{0, 4, 5}</code> — BOS token, “Paris”, “is”</p>

<p><img src="/assets/images/xray_correct.png" alt="Correct prediction x-ray" /></p>

<p><em>Left: attention flow graph, positions 0–8, components stacked bottom to top. Right: what the last position (pos 8) attends to, colored by expected (green) vs unexpected (orange/red).</em></p>

<p>The convergence bar is what matters. Position 8 (prediction position) attends to:</p>

<table>
  <thead>
    <tr>
      <th>Position</th>
      <th>Token</th>
      <th>Weight</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>BOS</td>
      <td>166</td>
      <td>expected</td>
    </tr>
    <tr>
      <td>4</td>
      <td>“Paris”</td>
      <td>31</td>
      <td>expected</td>
    </tr>
    <tr>
      <td>2</td>
      <td>“capital”</td>
      <td>~8</td>
      <td>unexpected</td>
    </tr>
    <tr>
      <td>5</td>
      <td>“is”</td>
      <td>~6</td>
      <td>expected</td>
    </tr>
  </tbody>
</table>

<p>Strong BOS sink. Dominant expected positions. Four unexpected positions with low mass. Model routes to the right context.</p>

<hr />

<h2 id="wrong-prediction-myanmar--yangon">Wrong prediction: Myanmar → Yangon</h2>

<p>Prompt: <em>“The capital of Myanmar is”</em></p>

<p>Predicted token: <strong>Yang</strong> (→ Yangon, confidence 0.9999)</p>

<p>Correct answer: Naypyidaw (Myanmar moved its capital in 2006)</p>

<p>Expected positions: <code class="language-plaintext highlighter-rouge">{0, 1, 3, 4}</code> — BOS, “The”, “capital”, “of”</p>

<p><img src="/assets/images/xray_wrong.png" alt="Wrong prediction x-ray" /></p>

<p><em>Same layout. Position 13 (prediction) attends to 14 positions. Many are unexpected.</em></p>

<table>
  <thead>
    <tr>
      <th>Position</th>
      <th>Weight</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (BOS)</td>
      <td>173</td>
      <td>expected</td>
    </tr>
    <tr>
      <td>13 (self)</td>
      <td>30</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>12</td>
      <td>22</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>6 (“Myanmar”)</td>
      <td>22</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>9</td>
      <td>14</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>7</td>
      <td>13</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>4</td>
      <td>10</td>
      <td>expected</td>
    </tr>
    <tr>
      <td>5</td>
      <td>7</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>3</td>
      <td>7</td>
      <td>expected</td>
    </tr>
    <tr>
      <td>11</td>
      <td>5</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>8</td>
      <td>5</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>10</td>
      <td>2</td>
      <td><strong>unexpected</strong></td>
    </tr>
    <tr>
      <td>1</td>
      <td>2</td>
      <td>expected</td>
    </tr>
  </tbody>
</table>
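
<p>The rows above can be totalled by status. The sketch below uses only the 13 rows listed in the table, so the totals cover only those positions. Dropping BOS from the second split is a choice made here for illustration, to separate the sink from the rest of the routing.</p>

```python
# Rows copied from the table above: position -> summed attention weight.
WEIGHTS = {0: 173, 13: 30, 12: 22, 6: 22, 9: 14, 7: 13, 4: 10,
           5: 7, 3: 7, 11: 5, 8: 5, 10: 2, 1: 2}
EXPECTED = {0, 1, 3, 4}

def split_mass(weights, expected, skip=()):
    """Return (expected_mass, unexpected_mass), ignoring positions in skip."""
    exp = sum(w for pos, w in weights.items() if pos in expected and pos not in skip)
    unexp = sum(w for pos, w in weights.items() if pos not in expected and pos not in skip)
    return exp, unexp

with_bos = split_mass(WEIGHTS, EXPECTED)
without_bos = split_mass(WEIGHTS, EXPECTED, skip={0})
print(with_bos)      # (192, 120)
print(without_bos)   # (19, 120)
```

<p>With BOS included, expected positions still hold most of the listed mass; once BOS is set aside, unexpected positions hold 120 of the remaining 139.</p>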

<hr />

<h2 id="what-the-comparison-shows">What the comparison shows</h2>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Correct</th>
      <th>Wrong</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>BOS sink (pos 0)</td>
      <td>166</td>
      <td><strong>173</strong></td>
    </tr>
    <tr>
      <td>Active positions</td>
      <td>9</td>
      <td><strong>14</strong></td>
    </tr>
    <tr>
      <td>Unexpected positions with weight &gt; 0.1</td>
      <td>4</td>
      <td><strong>10</strong></td>
    </tr>
    <tr>
      <td>Confidence</td>
      <td>0.773</td>
      <td>0.9999</td>
    </tr>
  </tbody>
</table>

<p>The BOS sink does not move. It gets slightly stronger in the wrong prediction. On these two traces, that argues against sink displacement as the failure cause.</p>

<p>What changes: outside the BOS sink, unexpected positions dominate. The prediction position pulls mass from positions that plausibly activate the Myanmar→Yangon co-occurrence. Yangon was the capital until 2006 and very likely appears far more often in training data than Naypyidaw. The model commits to this with 0.9999 confidence. On this reading, that is not because the readout layer fails, but because attention routed to the wrong context.</p>

<p>Failure modes observed: higher attention entropy (#3) and unexpected-position mass dominance (#4). The readout layer (logit projection) works correctly on whatever context attention delivered — the error is upstream.</p>

<hr />

<h2 id="where-this-runs">Where this runs</h2>

<p>The tracer is in rocmforge, emitting JSONL from the CPU inference hotpath. It runs on any GGUF model loadable by the existing CPU engine. The visualization is a Python script (<code class="language-plaintext highlighter-rouge">plot_forward_graph.py</code>) in <a href="https://github.com/oldnordic/geographdb-core/blob/master/examples/plot_forward_graph.py">geographdb-core</a>.</p>

<p>Invocation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo run <span class="nt">--example</span> infer <span class="nt">--</span> <span class="se">\</span>
  <span class="nt">--model</span> models/qwen2.5-0.5b-instruct-q8_0.gguf <span class="se">\</span>
  <span class="nt">--prompt</span> <span class="s2">"The capital of Myanmar is"</span> <span class="se">\</span>
  <span class="nt">--forward-graph-trace</span> /tmp/trace.jsonl <span class="se">\</span>
  <span class="nt">--expected-attention</span> <span class="s1">'{"13": [0,1,3,4]}'</span>

python examples/plot_forward_graph.py /tmp/trace.jsonl
</code></pre></div></div>

<hr />

<h2 id="what-comes-next">What comes next</h2>

<p>Two traces is not a result. It is a signal worth testing.</p>

<p>The claim — that wrong predictions show higher attention entropy and more unexpected-position mass while BOS sink strength remains constant — needs a controlled study before it can be asserted. What I am planning:</p>

<ol>
  <li>Run 20–50 correct/wrong prompt pairs on Qwen2.5-0.5B-Instruct, matched on approximate prompt length</li>
  <li>Compute Shannon entropy of the pred-position attention distribution for each trace</li>
  <li>Test whether <code class="language-plaintext highlighter-rouge">entropy(wrong) &gt; entropy(correct)</code> holds across the dataset</li>
  <li>Separate routing failure (this post) from readout failure by checking whether wrong predictions with high confidence differ from wrong predictions with low confidence</li>
</ol>
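
<p>Steps 2 and 3 could look roughly like this. It is a sketch, assuming each trace has already been reduced to a list of non-negative attention weights at the prediction position. Entropy is in bits, and the example weights are hypothetical rather than measured.</p>

```python
import math

def attention_entropy(weights):
    """Shannon entropy in bits of the normalized attention distribution."""
    total = sum(weights)
    if total == 0:
        raise ValueError("attention weights sum to zero")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

def fraction_wrong_higher(pairs):
    """Share of (correct, wrong) weight pairs where the wrong trace has higher entropy."""
    hits = sum(1 for correct, wrong in pairs
               if attention_entropy(wrong) > attention_entropy(correct))
    return hits / len(pairs)

# Hypothetical pred-position weights; not measured values.
pairs = [
    ([8.0, 1.0, 1.0], [4.0, 3.0, 3.0]),
    ([9.0, 0.5, 0.5], [5.0, 2.5, 2.5]),
]
print(fraction_wrong_higher(pairs))
```

<p>A fraction near 0.5 across a matched dataset would mean no separation. A paired significance test on the entropy differences would be the natural follow-up.</p>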

<p>If the entropy separation holds, it gives an interpretability signal derivable from a single forward pass, without any fine-tuning or probing classifier. That is what makes it worth checking.</p>

<p>GPU path via rocmforge ROCm kernels is deferred pending flash-attention changes. CPU path works now.</p>]]></content><author><name>Luiz Spies</name></author><category term="rocmforge" /><category term="mechanistic-interpretability" /><summary type="html"><![CDATA[What does attention actually do, token by token, layer by layer? Not the textbook answer — the actual numbers, on a real prompt, with a real model.]]></summary></entry><entry><title type="html">Multi-Layer Graphs, Ricci Curvature, and a Hypothesis About How Computation Should Route</title><link href="https://oldnordic.github.io/language-geometry/2026/06/15/multilayer-graph-ricci-hypothesis.html" rel="alternate" type="text/html" title="Multi-Layer Graphs, Ricci Curvature, and a Hypothesis About How Computation Should Route" /><published>2026-06-15T00:00:00+00:00</published><updated>2026-06-15T00:00:00+00:00</updated><id>https://oldnordic.github.io/language-geometry/2026/06/15/multilayer-graph-ricci-hypothesis</id><content type="html" xml:base="https://oldnordic.github.io/language-geometry/2026/06/15/multilayer-graph-ricci-hypothesis.html"><![CDATA[<p>This post is about an idea, not a result. The experiment described here has not been run. The hypothesis may be wrong. I’m writing it down because the reasoning is worth making explicit before touching code.</p>

<hr />

<h2 id="the-transformers-structural-problem">The transformer’s structural problem</h2>

<p>A transformer collapses all abstraction levels into one operation. Syntactic prediction, semantic disambiguation, logical inference, meta-reasoning — the same matrix multiply handles all of them. The “knowledge” is distributed across billions of parameters with no structural distinction between “this weight encodes grammar” and “this weight encodes logical entailment.”</p>

<p>This creates two practical problems:</p>

<p><strong>The frozen state problem.</strong> Weights are fixed after training. To update what the model knows, retrain everything. There’s no mechanism for local update — no way to say “strengthen this specific connection because it produced a correct prediction.”</p>

<p><strong>The cost problem.</strong> A matrix multiply over a 50k vocabulary doesn’t distinguish between an unambiguous token (“the” after “in”) and a highly ambiguous one (“bank” after “I went to the”). Both pay the same computational cost. The operation is uniform where the problem is not.</p>

<p>Research on Ollivier-Ricci curvature in transformers shows that the geometry is already there. Attention heads develop curvature concentration at semantically load-bearing positions — some edges carry most of the semantic weight. The structure is real; it’s just hidden inside the matrix where it can’t be used for routing.</p>

<hr />

<h2 id="a-multi-layer-graph-alternative">A multi-layer graph alternative</h2>

<p>The hypothesis: replace dense matrix computation with a layered sparse graph where each layer handles one abstraction level, and a tensor field of Ricci curvature determines where layers naturally couple.</p>

<p>Each layer is a sparse graph with its own nodes, edges, and responsibility:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 4: Meta — reasoning about reasoning, past trace annotations
Layer 3: Logical / causal — entailment, causality, succession
Layer 2: Semantic / conceptual — similarity, sense disambiguation  
Layer 1: Syntactic / token — co-occurrence, raw sequence statistics
</code></pre></div></div>

<p>Each layer produces a partial result. The output is a weighted sum:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output = Σ(layer_i_result × inter_layer_weight_i)
</code></pre></div></div>

<p>The inter-layer weights are graph edges, not learned gates. Sparse, inspectable, updatable from prediction outcomes without retraining.</p>

<p>Cost scales with active connections, not with the size of a dense projection over the whole vocabulary. An unambiguous token may require only Layer 1. An ambiguous one (high local curvature) pulls weight from Layer 2 upward.</p>

<p>Adding a new layer means adding new edges into the sum. Existing layers are unchanged.</p>
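
<p>A minimal sketch of the weighted sum, assuming each layer returns a score per candidate token and the inter-layer weights are plain numbers keyed by layer. The scores, weights, and function name are hypothetical, since the experiment has not been run.</p>

```python
def combine_layers(layer_results, inter_layer_weights):
    """Weighted sum of per-layer score dicts: output = sum_i(result_i * weight_i)."""
    output = {}
    for layer, scores in layer_results.items():
        weight = inter_layer_weights.get(layer, 0.0)
        if weight == 0.0:
            continue  # inactive layer: skipped entirely, so it adds no work
        for token, score in scores.items():
            output[token] = output.get(token, 0.0) + weight * score
    return output

# Hypothetical candidate scores from two layers.
layer_results = {
    1: {"bank": 0.5, "store": 0.5},    # Layer 1: co-occurrence statistics
    2: {"bank": 0.25, "store": 0.75},  # Layer 2: sense disambiguation
}
print(combine_layers(layer_results, {1: 1.0}))          # only Layer 1 active
print(combine_layers(layer_results, {1: 0.5, 2: 0.5}))  # Layer 2 pulled in
```

<p>Because inactive layers are skipped rather than multiplied by zero, the work done tracks the number of active inter-layer edges, which is the cost property described above.</p>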

<hr />

<h2 id="the-tensor-as-routing-signal">The tensor as routing signal</h2>

<p>In Riemannian geometry, the metric tensor defines distances and angles, and the curvature derived from it describes how space bends. Curvature determines whether nearby shortest paths converge or spread apart, which in turn determines which paths are actually short.</p>

<p>The hypothesis applies the same idea to the multi-layer graph: <strong>Ricci curvature at an inter-layer edge is the routing weight.</strong></p>

<p>Where the tensor bends sharply — high local curvature — layers are pulled together. Inter-layer edges activate, computation lifts to the next abstraction level. Where it’s flat, layers don’t interact.</p>

<p>The analogy to General Relativity: matter tells spacetime how to curve, curvature tells matter how to move. Here: information density in the graph tells the tensor how to curve, curvature tells computation where to flow between layers.</p>

<p>High curvature = concept boundary = ambiguity = lift to higher layer. Low curvature = unambiguous region = stay at current layer.</p>
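
<p>As a toy illustration of that rule, the sketch below lifts computation one layer for each curvature threshold reached. The threshold values are invented, and using the magnitude of the curvature (so that strongly negative values also count as "high") is an assumption; the hypothesis above does not fix either choice.</p>

```python
def layers_to_activate(curvature, thresholds=(0.2, 0.5, 0.8)):
    """Layer 1 is always active; each threshold reached lifts one layer higher."""
    active = [1]
    # thresholds must be ascending; entry k gates Layer k + 2.
    for layer, threshold in enumerate(thresholds, start=2):
        if abs(curvature) >= threshold:
            active.append(layer)
        else:
            break
    return active

print(layers_to_activate(0.05))  # [1]
print(layers_to_activate(0.6))   # [1, 2, 3]
```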

<p>This is not a new routing mechanism bolted on. It’s the geometry of the graph itself, expressed as a curvature field, doing the routing.</p>

<hr />

<h2 id="why-observe-before-building">Why observe before building</h2>

<p>The previous geographdb experiments produced a clear lesson: don’t impose structure, measure it.</p>

<p>The PMI substrate experiments (50.4–50.5 ppl ceiling) showed that geometry imposed from co-occurrence statistics doesn’t transfer to next-token prediction. The geometry had to be the <em>right kind</em> for the task. Imposing the wrong geometry was worse than no geometry.</p>

<p>The same lesson applies here. If the curvature field is imposed — “high-curvature nodes connect to Layer 2, low-curvature to Layer 1” — that’s a design decision that may or may not match what the data actually needs.</p>

<p>The right approach, following the Ollivier-Ricci methodology: build the multi-layer graph from existing data, compute curvature on every edge (within layers AND inter-layer), and observe where it concentrates.</p>

<p>The null hypothesis: curvature distributes randomly across layers and inter-layer edges. If high-curvature points in Layer 1 don’t align with high-curvature points in Layer 2 at the same concept, the inter-layer routing hypothesis is wrong.</p>

<p>The signal to look for: if curvature aligns across layers at the same concepts without being forced — if the geometry self-organizes the layer boundaries — that’s the data telling you where layers naturally touch.</p>
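<p>One way to make the null hypothesis testable, sketched under assumptions: treat each layer’s curvature as a per-token score, measure the cross-layer correlation, and compare it against correlations obtained after shuffling which token goes with which score. The scores below are made-up placeholders, not measured values, and the choice of Pearson correlation as the alignment statistic is an assumption:</p>

```python
import math
import random


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(layer1, layer2, trials=2000, seed=0):
    """Share of shuffles whose correlation is at least the observed one."""
    rng = random.Random(seed)
    observed = pearson(layer1, layer2)
    shuffled = list(layer2)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if pearson(layer1, shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible zero.
    return observed, (hits + 1) / (trials + 1)


# Placeholder curvature scores for the same eight tokens in two layers.
layer1 = [0.9, 0.1, 0.8, 0.2, 0.7, 0.15, 0.85, 0.05]
layer2 = [0.8, 0.2, 0.9, 0.1, 0.6, 0.25, 0.75, 0.10]

observed, p_value = permutation_p_value(layer1, layer2)
print(f"correlation={observed:.3f}  p={p_value:.4f}")
```

<p>A small p-value would mean the cross-layer alignment is unlikely under random assignment. A large one would support the null hypothesis stated above.</p>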

<hr />

<h2 id="what-already-exists">What already exists</h2>

<p>The experiment is closer to runnable than it might seem:</p>

<ul>
  <li><strong>Layer 1</strong> exists: PMI co-occurrence graph from TinyStories, used in the geographdb experiments</li>
  <li><strong>Layer 2</strong> exists: SVD embeddings of the PMI graph, 3D token positions</li>
  <li><strong>Layer 3</strong> exists: directed transition matrix, already built and tested (showed the same ~50.5 ppl ceiling as PMI, a result that itself says something about what these layers can and can’t do alone)</li>
  <li><strong>Curvature tensor</strong>: already added to <code class="language-plaintext highlighter-rouge">geographdb-core</code> as a per-token feature field</li>
</ul>

<p>Missing: Ollivier-Ricci curvature computation on the inter-layer edges, and the inter-layer edges themselves.</p>

<p>The plan:</p>
<ol>
  <li>Build inter-layer edges: identity mapping, same token in Layer 1 → same node in Layer 2</li>
  <li>Compute Ollivier-Ricci curvature: within each layer, then across inter-layer edges</li>
  <li>Plot the curvature distribution — within layers vs inter-layer</li>
  <li>Identify what data sits at high-curvature inter-layer points</li>
  <li>Report what the geometry says, not what we expected it to say</li>
</ol>
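<p>Step 1 of the plan can be sketched as follows, assuming each layer is available as a token-to-neighbors adjacency map. Nodes become (layer, token) pairs and the identity mapping adds one edge per token shared by adjacent layers. The toy layers and the function name are placeholders:</p>

```python
def build_multilayer_graph(layers):
    """layers: list of {token: set of neighbor tokens} adjacency maps.
    Returns an adjacency map over (layer_index, token) nodes plus the
    list of inter-layer edges that were added."""
    graph = {}
    for index, layer in enumerate(layers):
        for token, neighbors in layer.items():
            graph.setdefault((index, token), set()).update(
                (index, nbr) for nbr in neighbors
            )
    inter_layer = []
    for index in range(len(layers) - 1):
        shared = set(layers[index]).intersection(layers[index + 1])
        for token in sorted(shared):
            lower, upper = (index, token), (index + 1, token)
            graph[lower].add(upper)
            graph[upper].add(lower)
            inter_layer.append((lower, upper))
    return graph, inter_layer


# Placeholder layers: two graphs over the same three tokens.
layer1 = {"dog": {"cat", "bark"}, "cat": {"dog"}, "bark": {"dog"}}
layer2 = {"dog": {"cat"}, "cat": {"dog"}, "bark": set()}

graph, inter_layer = build_multilayer_graph([layer1, layer2])
print(len(graph), "nodes,", len(inter_layer), "inter-layer edges")
```

<p>Because within-layer and inter-layer edges live in one adjacency map, a single edge-curvature routine can be run over both kinds of edge, and the returned edge list keeps them separable for the distribution plot in step 3.</p>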

<hr />

<h2 id="what-this-would-mean-if-it-works">What this would mean if it works</h2>

<p>If inter-layer curvature aligns with within-layer curvature at the same concepts without being imposed, it means the geometry of the data naturally encodes where abstraction levels interact. The routing doesn’t need to be learned — it emerges from the structure.</p>

<p>That’s a different kind of efficiency than quantization or sparse attention. Those reduce compute by approximating the matrix. This replaces the matrix with a structure that computes less because most concepts don’t require multiple abstraction levels to predict.</p>

<p>Whether it works is an empirical question. The experiment will say.</p>]]></content><author><name>Luiz Spies</name></author><category term="language-geometry" /><summary type="html"><![CDATA[This post is about an idea, not a result. The experiment described here has not been run. The hypothesis may be wrong. I’m writing it down because the reasoning is worth making explicit before touching code.]]></summary></entry><entry><title type="html">rocmforge: GeoGraph Execution Engine and Branch Selection with a 0.5B Model</title><link href="https://oldnordic.github.io/rocmforge/2026/06/15/rocmforge-cpu-graph-engine.html" rel="alternate" type="text/html" title="rocmforge: GeoGraph Execution Engine and Branch Selection with a 0.5B Model" /><published>2026-06-15T00:00:00+00:00</published><updated>2026-06-15T00:00:00+00:00</updated><id>https://oldnordic.github.io/rocmforge/2026/06/15/rocmforge-cpu-graph-engine</id><content type="html" xml:base="https://oldnordic.github.io/rocmforge/2026/06/15/rocmforge-cpu-graph-engine.html"><![CDATA[<p>This post covers two things built in rocmforge over the past week: a graph-based CPU execution engine with temporal rollback, and a first working result using a local 0.5B model to select between branches.</p>

<hr />

<h2 id="the-execution-engine">The execution engine</h2>

<p>The core problem: CPU inference traces are sequences of operations on tensors. If you want to explore multiple continuation branches from the same prefix, you need to capture the prefix state and replay it without re-executing from scratch. You also need to roll back to the prefix state after evaluating a branch.</p>

<p>The implementation is <code class="language-plaintext highlighter-rouge">CpuGraphArena</code> with <code class="language-plaintext highlighter-rouge">CpuGraph</code>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">CpuGraphArena</code> owns all captured tensor bytes. Handles (<code class="language-plaintext highlighter-rouge">F32Handle</code>, <code class="language-plaintext highlighter-rouge">U8Handle</code>) are stable arena offsets, not raw pointer addresses. Pointer arithmetic on arena data is safe through the handle abstraction.</li>
  <li><code class="language-plaintext highlighter-rouge">CaptureContext</code> copies inputs and outputs into the arena during capture.</li>
  <li><code class="language-plaintext highlighter-rouge">CpuGraph::execute_window(&amp;mut arena, window)</code> replays a slice of the captured graph.</li>
  <li><code class="language-plaintext highlighter-rouge">graph.regress(t)</code> invalidates all nodes after timestamp <code class="language-plaintext highlighter-rouge">t</code> and restores arena bindings. Rolling back to prefix state is a single call.</li>
  <li><code class="language-plaintext highlighter-rouge">read_back</code> copies the final arena state to caller buffers.</li>
</ul>

<p>Verified: <code class="language-plaintext highlighter-rouge">test_cpu_graph_parity</code> max abs error = 0.00000000. The graph replay is numerically identical to direct execution.</p>

<p>Benchmark (10 samples, single layer):</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Direct imperative</td>
      <td>~689 µs</td>
    </tr>
    <tr>
      <td>GeoGraph replay</td>
      <td>~709 µs</td>
    </tr>
  </tbody>
</table>

<p>~3% overhead from the arena indirection. The prefix capture + branch rollback pattern costs nothing at replay time beyond this baseline.</p>

<p>The search/rollback test: capture shared prefix → evaluate branch A → <code class="language-plaintext highlighter-rouge">regress(t)</code> → capture and evaluate branch B → both match direct execution. That test passes.</p>
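<p>The capture, evaluate, regress, evaluate sequence can be illustrated with a toy model in Python. This is not the rocmforge API: <code class="language-plaintext highlighter-rouge">ToyArena</code>, its methods, and <code class="language-plaintext highlighter-rouge">run_branch</code> are hypothetical stand-ins, and the sketch assumes timestamps never decrease between captures:</p>

```python
class ToyArena:
    """Append-only store; handles are offsets into it, never raw pointers."""

    def __init__(self):
        self.slots = []   # captured values
        self.stamps = []  # logical timestamp of each slot

    def capture(self, value, timestamp):
        self.slots.append(value)
        self.stamps.append(timestamp)
        return len(self.slots) - 1  # stable handle

    def regress(self, timestamp):
        """Drop every slot captured after the given timestamp
        (assumes stamps were appended in non-decreasing order)."""
        keep = sum(1 for stamp in self.stamps if timestamp >= stamp)
        del self.slots[keep:]
        del self.stamps[keep:]


def run_branch(arena, prefix_handle, delta, timestamp):
    """Evaluate one branch on top of the shared prefix state."""
    result = [x + delta for x in arena.slots[prefix_handle]]
    return arena.capture(result, timestamp)


arena = ToyArena()
prefix = arena.capture([1.0, 2.0, 3.0], timestamp=0)    # shared prefix

a = run_branch(arena, prefix, delta=0.5, timestamp=1)   # branch A
out_a = list(arena.slots[a])
arena.regress(0)                                        # back to prefix state

b = run_branch(arena, prefix, delta=-0.5, timestamp=1)  # branch B
out_b = list(arena.slots[b])

print(out_a, out_b, len(arena.slots))
```

<p>The prefix handle stays valid across the rollback because it is an offset below the truncation point, which is the property the handle abstraction is meant to provide.</p>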

<hr />

<h2 id="branch-selection-with-a-05b-model">Branch selection with a 0.5B model</h2>

<p>With rollback working, the next question: can a local model pick which branch is better?</p>

<p>Three attempts failed before finding what works. The failures are worth documenting because the root causes are non-obvious.</p>

<p><strong>What failed:</strong></p>

<ol>
  <li>
    <p><strong>Numeric-score-only prompts.</strong> Asking the 0.5B instruct model to compare raw decimal scores (“Branch A: 6.2, Branch B: 5.8, which is better?”) and answer with a single letter. The model either ignored the format, repeated the number, or answered randomly. The model has no useful representation for “6.2 is better than 5.8 as a branch score” — it’s not a task it was trained on.</p>
  </li>
  <li>
    <p><strong>Logit extraction at the wrong token position.</strong> The label token (A or B) was not the first response token. Extracted logits were dominated by format words (“CHOICE”, “Choose”, digits), not the A/B decision.</p>
  </li>
  <li>
    <p><strong>4-branch multi-class task.</strong> The hidden state does not linearly separate arbitrary numeric scores at this scale. Too many classes, not enough signal.</p>
  </li>
</ol>

<p><strong>What works:</strong></p>

<ul>
  <li>Semantic two-branch task: one branch described as moving toward the target direction, the other away. Natural language descriptions, not raw scores.</li>
  <li>Chat template applied before tokenization so the instruct model actually follows the prompt.</li>
  <li><code class="language-plaintext highlighter-rouge">BranchChoiceHead</code>: a small linear binary classifier trained on the final hidden-state vector of the full multi-branch prompt, on top of the frozen 0.5B model.</li>
</ul>

<p>Result on the integration test:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Correct / 8</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Trained choice head</td>
      <td>8</td>
    </tr>
    <tr>
      <td>Random baseline</td>
      <td>4</td>
    </tr>
  </tbody>
</table>

<p>The mechanical property holds: the trained head picks the correct branch more often than random. This is on a toy task with a small held-out set. It demonstrates the mechanism works, not that it generalizes broadly.</p>
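<p>The idea of a small linear head on frozen features can be sketched in a few lines. This is an illustration only, not the actual <code class="language-plaintext highlighter-rouge">BranchChoiceHead</code>: the vectors below are synthetic stand-ins for final hidden states, and logistic regression trained by batch gradient descent is an assumed choice of head:</p>

```python
import math
import random


def train_linear_head(features, labels, epochs=200, lr=0.5):
    """Logistic regression by batch gradient descent. The backbone that
    produced the features is treated as frozen and never updated."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-logit)) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def synthetic_state(label, rng, dim=8):
    """Stand-in for a hidden state: label 1 clusters near +1, label 0 near -1."""
    center = 1.0 if label == 1 else -1.0
    return [center + rng.gauss(0.0, 0.3) for _ in range(dim)]


rng = random.Random(0)
train_labels = [i % 2 for i in range(32)]
train = [synthetic_state(y, rng) for y in train_labels]
test_labels = [i % 2 for i in range(8)]
test = [synthetic_state(y, rng) for y in test_labels]

weights, bias = train_linear_head(train, train_labels)
correct = sum(predict(weights, bias, x) == y for x, y in zip(test, test_labels))
print(f"{correct} / {len(test)} correct")
```

<p>The synthetic clusters are far easier to separate than real hidden states, so this only shows the training mechanics, not how hard the real task is.</p>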

<p>The key constraint from this experiment: <strong>the 0.5B model needs semantic framing.</strong> Branch descriptions must describe what each branch does in natural language. Raw numeric annotations don’t activate useful representations. This shapes everything downstream — any annotation stored in a GraphMap needs to be semantically meaningful, not just a score.</p>

<hr />

<h2 id="whats-next">What’s next</h2>

<p>The gap in the current system: real inference sessions don’t yet capture a GraphMap. The branch selection mechanism works on synthetic traces. Wiring <code class="language-plaintext highlighter-rouge">CaptureContext</code> into the actual CPU inference path is what closes the loop — real forward passes produce real traces, real traces feed the branch selector, the selector’s choices get stored as annotations.</p>

<p>After that: token-level reranking. At each decode step, take the top-N candidate tokens, run a short forward pass for each, score the resulting hidden states with a trained value head, bias the logit distribution toward higher-scoring candidates. Cost is N forward passes per generated token. Whether that cost is worth the quality improvement is an empirical question not yet measured.</p>]]></content><author><name>Luiz Spies</name></author><category term="rocmforge" /><summary type="html"><![CDATA[This post covers two things built in rocmforge over the past week: a graph-based CPU execution engine with temporal rollback, and a first working result using a local 0.5B model to select between branches.]]></summary></entry><entry><title type="html">Geometric-Only Attention: Linear Scaling from Sparse Neighborhoods</title><link href="https://oldnordic.github.io/language-geometry/2026/06/14/geometric-attention-linear-scaling.html" rel="alternate" type="text/html" title="Geometric-Only Attention: Linear Scaling from Sparse Neighborhoods" /><published>2026-06-14T00:00:00+00:00</published><updated>2026-06-14T00:00:00+00:00</updated><id>https://oldnordic.github.io/language-geometry/2026/06/14/geometric-attention-linear-scaling</id><content type="html" xml:base="https://oldnordic.github.io/language-geometry/2026/06/14/geometric-attention-linear-scaling.html"><![CDATA[<p>The <a href="/language-geometry/2026/06/13/geometry-as-substrate.html">previous posts</a> documented a ceiling: every geometric attention variant converged to ~50.5 val perplexity on TinyStories, while a plain trigram MLP reached 32. Static geometry was the bottleneck.</p>

<p>This post covers two things: a new sparse attention mode in <code class="language-plaintext highlighter-rouge">geographdb-core</code> 0.5.3, and the first training run that goes below that ceiling.</p>

<hr />

<h2 id="geometric-only-attention">Geometric-only attention</h2>

<p><code class="language-plaintext highlighter-rouge">geographdb-core</code> 0.5.3 adds <code class="language-plaintext highlighter-rouge">GraphAttentionClassifier::set_geometric_attention_only(bool)</code>. When set:</p>

<ul>
  <li>Each token attends only to itself and its geometric graph neighbors (fixed count <code class="language-plaintext highlighter-rouge">k</code>)</li>
  <li>The O(L²) full self-attention term is dropped entirely</li>
  <li>Complexity becomes O(L × k), where k is constant</li>
</ul>

<p>The implementation is a sparse index build (<code class="language-plaintext highlighter-rouge">build_attended_indices</code>) shared between forward and backward pass. Softmax and RoPE are dispatched to CPU-native kernels with AVX2 where available, scalar fallback elsewhere.</p>
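<p>A rough sketch of what such an index build implies for cost, under assumptions: each position attends to itself plus its listed neighbors, and the number of score computations is the total length of the index rows. The ring-shaped neighborhoods are placeholders for the geometric graph, and <code class="language-plaintext highlighter-rouge">attended_indices</code> is a hypothetical helper, not the crate’s function:</p>

```python
def attended_indices(neighbors):
    """For each position, the positions it may attend to: itself first,
    then its graph neighbors (duplicates and self-loops removed)."""
    table = []
    for position, nbrs in enumerate(neighbors):
        seen = [position]
        for nbr in nbrs:
            if nbr not in seen:
                seen.append(nbr)
        table.append(seen)
    return table


def pair_count(table):
    """Number of (query, key) score computations the table implies."""
    return sum(len(row) for row in table)


k = 4
for length in (8, 16, 32, 64, 128):
    # Placeholder neighborhoods: the k next positions on a ring.
    neighbors = [[(i + step) % length for step in range(1, k + 1)]
                 for i in range(length)]
    sparse = pair_count(attended_indices(neighbors))
    dense = length * length
    print(f"L={length:3d}  sparse={sparse:5d}  dense={dense:6d}")
```

<p>With a fixed k the sparse count is L × (k + 1), so it doubles when L doubles, while the dense count quadruples.</p>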

<hr />

<h2 id="benchmark">Benchmark</h2>

<p>Measured on CPU, forward pass, context lengths L ∈ {8, 16, 32, 64, 128}. Geometric-only mode is measured up to L=128 in this run; hybrid stops at L=64 because the quadratic cost makes longer sequences prohibitive in the benchmark setup.</p>

<table>
  <thead>
    <tr>
      <th>L</th>
      <th>Hybrid (default)</th>
      <th>Geometric-only</th>
      <th>Speedup</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>8</td>
      <td>122 µs</td>
      <td>57 µs</td>
      <td>~2.1×</td>
    </tr>
    <tr>
      <td>16</td>
      <td>390 µs</td>
      <td>110 µs</td>
      <td>~3.5×</td>
    </tr>
    <tr>
      <td>32</td>
      <td>1.38 ms</td>
      <td>219 µs</td>
      <td>~6.3×</td>
    </tr>
    <tr>
      <td>64</td>
      <td>5.13 ms</td>
      <td>436 µs</td>
      <td>~11.8×</td>
    </tr>
    <tr>
      <td>128</td>
      <td>—</td>
      <td>865 µs</td>
      <td>—</td>
    </tr>
  </tbody>
</table>

<p>The speedup grows with L because the hybrid cost grows as L², while geometric-only grows as L. At L=64 it’s ~12×; at L=128 the hybrid isn’t benchmarked but would project to ~20 ms based on the quadratic trend.</p>
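<p>The arithmetic behind these figures can be checked directly from the table. The sketch below recomputes the speedups, the growth factor per doubling of L, and the L=128 projection. The projection assumes one more doubling at exactly 4×, which is an extrapolation and not a measurement:</p>

```python
# Forward-pass latencies from the table above, in microseconds.
hybrid_us = {8: 122, 16: 390, 32: 1380, 64: 5130}
geometric_us = {8: 57, 16: 110, 32: 219, 64: 436, 128: 865}

speedup = {L: hybrid_us[L] / geometric_us[L] for L in hybrid_us}

# Growth each time L doubles: about 4 for quadratic cost, about 2 for linear.
lengths = sorted(hybrid_us)
hybrid_growth = [hybrid_us[b] / hybrid_us[a]
                 for a, b in zip(lengths, lengths[1:])]
geo_lengths = sorted(geometric_us)
geo_growth = [geometric_us[b] / geometric_us[a]
              for a, b in zip(geo_lengths, geo_lengths[1:])]

# Extrapolation (assumption): one more doubling at exactly 4x.
projected_hybrid_128_us = hybrid_us[64] * 4
projected_speedup_128 = projected_hybrid_128_us / geometric_us[128]

for L in lengths:
    print(f"L={L:3d}  speedup={speedup[L]:.1f}x")
print("hybrid growth per doubling:", [round(g, 2) for g in hybrid_growth])
print("geometric growth per doubling:", [round(g, 2) for g in geo_growth])
print(f"projected hybrid at L=128: {projected_hybrid_128_us / 1000:.1f} ms "
      f"(~{projected_speedup_128:.0f}x)")
```

<p>The measured hybrid growth per doubling is between roughly 3.2× and 3.7×, slightly under the ideal 4×, so the ~20 ms projection sits toward the upper end of what the trend supports.</p>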

<p>These are wall-clock times, not theoretical FLOPs. AVX2 dispatch affects both modes so the ratio is a fair comparison.</p>

<hr />

<h2 id="what-this-costs-in-accuracy">What this costs in accuracy</h2>

<p>Unknown yet. The benchmark measures speed; it doesn’t measure whether the geometric-only output is close to the hybrid output. That comparison is the next measurement. The sparse mode attends to a strict subset of the tokens that full attention sees, so some output divergence is expected. How much, and whether it correlates with perplexity loss, is not yet measured.</p>

<hr />

<h2 id="the-training-run">The training run</h2>

<p>Separately, a cross-context attention variant is training on TinyStories (20k train / 2k val, 10 epochs). At the time of writing it is on epoch 3. Numbers so far:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Val perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bigram baseline</td>
      <td>115.300</td>
    </tr>
    <tr>
      <td>Epoch 1</td>
      <td>34.527</td>
    </tr>
    <tr>
      <td>Epoch 2</td>
      <td>31.607</td>
    </tr>
  </tbody>
</table>

<p>For reference, all prior geometric attention variants — PMI substrate, transition-matrix substrate, with and without RoPE, with and without curvature weighting — converged in the range 50.4–50.5 and did not go lower. The trigram MLP baseline is 32.02.</p>

<p>Epoch 2 of this run is 31.607. That is below the prior ceiling and below the trigram baseline.</p>

<p>This is a single run, not yet complete. Early-stopping patience is 2 epochs. The model may still plateau or overfit. The architecture change that produced this is cross-context attention on top of geometric positions — the model can now query global context rather than being restricted to local neighborhood structure.</p>

<p>Whether it holds through epoch 10, and what the final number is, will be in a follow-up post.</p>

<hr />

<h2 id="connection-between-the-two">Connection between the two</h2>

<p>Geometric-only mode is a sparse approximation to full attention. The benchmark above shows the cost reduction. The training run shows a model that needs cross-context (full) attention to break the 50-ppl ceiling — which means the full O(L²) term carries information the geometric neighborhood alone does not.</p>

<p>That is the tradeoff: geometric-only is fast and scales linearly, but (based on current evidence) loses the cross-context signal that closed the gap with the trigram baseline. Whether a hybrid strategy — geometric-only for most tokens, full attention on a small selected subset — can recover accuracy at lower cost is an open question and not yet measured.</p>

<hr />

<h2 id="code">Code</h2>

<p><code class="language-plaintext highlighter-rouge">geographdb-core</code> 0.5.3 is at <a href="https://github.com/oldnordic/geographdb-core">github.com/oldnordic/geographdb-core</a>. The geometric-only flag, benchmark, and new test (<code class="language-plaintext highlighter-rouge">learns_simple_task_geometric_only</code>) are in this commit.</p>]]></content><author><name>Luiz Spies</name></author><category term="language-geometry" /><summary type="html"><![CDATA[The previous posts documented a ceiling: every geometric attention variant converged to ~50.5 val perplexity on TinyStories, while a plain trigram MLP reached 32. Static geometry was the bottleneck.]]></summary></entry><entry><title type="html">Geometry as Substrate: What the Failing Results Are Telling Us</title><link href="https://oldnordic.github.io/language-geometry/2026/06/13/geometry-as-substrate.html" rel="alternate" type="text/html" title="Geometry as Substrate: What the Failing Results Are Telling Us" /><published>2026-06-13T00:00:00+00:00</published><updated>2026-06-13T00:00:00+00:00</updated><id>https://oldnordic.github.io/language-geometry/2026/06/13/geometry-as-substrate</id><content type="html" xml:base="https://oldnordic.github.io/language-geometry/2026/06/13/geometry-as-substrate.html"><![CDATA[<p>The <a href="/language-geometry/2026/06/12/geometric-lm-training.html">previous post</a> documented a series of negative results: PMI+SVD geometric positions don’t beat a trigram baseline, attention over geometry doesn’t beat a trigram baseline, RoPE doesn’t help, curvature weighting consistently hurts. Every experiment lost to two token IDs fed into a flat MLP.</p>

<p>This post is about what those results mean — and where the geometry idea goes from here.</p>

<hr />

<h2 id="negative-results-are-signals-not-failures">Negative results are signals, not failures</h2>

<p>The full experiment arc so far, all at 20k TinyStories / 15 epochs:</p>

<table>
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Representation</th>
      <th>Val perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MLP</td>
      <td>One-hot trigram</td>
      <td><strong>32.02</strong></td>
    </tr>
    <tr>
      <td>Attention</td>
      <td>Graph neighbors (PMI)</td>
      <td>55.54</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Hybrid (token ID + geometry)</td>
      <td>43.24</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric rotated + neighbors</td>
      <td>126.85</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric absolute</td>
      <td>145.08</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric rotated</td>
      <td>272.98</td>
    </tr>
    <tr>
      <td>Bigram baseline</td>
      <td>—</td>
      <td>72.97</td>
    </tr>
  </tbody>
</table>

<p>Read as failures: every geometric model lost.</p>

<p>Read as signals:</p>

<ol>
  <li>
    <p><strong>Attention over geometry is ~2.3x better than MLP over geometry</strong> (55.5 vs 126.9 ppl). The architecture matters. Query/key/value over graph neighbors extracts real structure that a flat MLP can’t see.</p>
  </li>
  <li>
    <p><strong>Geometric models peak earlier than trigram.</strong> Geo-attention reached its best validation perplexity around epochs 2–3. Trigram kept improving through epoch 15. The geometry finds <em>something</em> fast; it just can’t go as deep as direct token identity.</p>
  </li>
  <li>
    <p><strong>Adding geometry to token identity hurts</strong> (hybrid 43 vs trigram 32). The PMI positions are not just uninformative — they’re noise on top of the token signal.</p>
  </li>
  <li>
    <p><strong>Rotation alone is catastrophic; neighbors rescue it</strong> (273 vs 127 ppl). Local geometric context carries signal, but only when the model can query it selectively.</p>
  </li>
</ol>

<p>The consistent pattern: the PMI+SVD substrate has <em>some</em> structure (early peaking, attention extractability) but the wrong kind of structure for next-token prediction. PMI encodes symmetric co-occurrence similarity. Next-token prediction needs directed successor structure. Those are different things.</p>

<hr />

<h2 id="why-pmi-is-the-wrong-map">Why PMI is the wrong map</h2>

<p>PMI+SVD clusters tokens by shared context neighborhood. “Dog” and “cat” end up near each other because they appear in similar sentences. Neither predicts the other as a next token. The map encodes <em>what is similar</em> — it doesn’t encode <em>what comes next</em>.</p>

<p>Language has both. Words that are semantically similar (similarity structure) and words that tend to follow each other (succession structure). Current LLMs learn succession directly from token sequences. The PMI graph captures similarity and ignores succession.</p>

<p>The next experiment: swap PMI for a <strong>directed transition matrix</strong>. Build <code class="language-plaintext highlighter-rouge">P(w₂ | w₁)</code> from bigram counts, embed its spectral structure in 3D. Positions now encode “where does this token’s probability mass flow to” — successor structure, not similarity structure. Same geo-attention architecture, different substrate. If the gap closes, directionality was the missing piece.</p>
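<p>The difference between the two substrates shows up even on a tiny corpus. The sketch below builds <code class="language-plaintext highlighter-rouge">P(w₂ | w₁)</code> from bigram counts and compares it with a PMI computed over unordered adjacent pairs. The corpus and the exact PMI windowing are assumptions for illustration, and the spectral embedding step is left out:</p>

```python
import math
from collections import Counter

corpus = "the dog chased the cat . the cat chased the dog .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_pairs = sum(bigrams.values())


def transition(w1, w2):
    """P(w2 | w1): share of w1's successors that are w2. Directed."""
    out_of_w1 = sum(c for (a, _), c in bigrams.items() if a == w1)
    return bigrams[(w1, w2)] / out_of_w1 if out_of_w1 else 0.0


def pmi(w1, w2):
    """PMI over unordered adjacent pairs, so pmi(a, b) == pmi(b, a)."""
    joint = (bigrams[(w1, w2)] + bigrams[(w2, w1)]) / total_pairs
    if joint == 0:
        return float("-inf")
    p1 = unigrams[w1] / len(corpus)
    p2 = unigrams[w2] / len(corpus)
    return math.log(joint / (p1 * p2))


print("P(dog | the) =", transition("the", "dog"))
print("P(the | dog) =", transition("dog", "the"))
print("PMI symmetric:", pmi("the", "dog") == pmi("dog", "the"))
```

<p>In this corpus “dog” follows “the” half the time while “the” never follows “dog”, yet the PMI score is identical in both directions. That asymmetry is the information the transition substrate keeps and the PMI substrate discards.</p>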

<hr />

<h2 id="the-deeper-problem-static-geometry">The deeper problem: static geometry</h2>

<p>Even with the right directionality, there’s a harder constraint: the geometry is fixed at training time. PMI positions are computed from the corpus, frozen, and never updated. The model learns to read a static map.</p>

<p>Brains don’t work this way. Synaptic weights update continuously from prediction outcomes. Connections that contribute to correct predictions strengthen. Connections that lead to errors weaken. The geometry itself is the thing being trained — not just the weights on top of it.</p>

<p>The current architecture has:</p>
<ul>
  <li>Fixed graph topology (PMI-derived)</li>
  <li>Fixed node positions (SVD coordinates)</li>
  <li>Learned attention weights (W_q, W_k, W_v)</li>
  <li>Learned MLP head</li>
</ul>

<p>The learned pieces are layered on top of a frozen substrate. The question the experiments are really answering is: <em>how much can learned attention compensate for a wrong substrate?</em> The answer so far: partially (55 vs 127 ppl) but not enough (55 vs 32 ppl).</p>

<p><strong>Learned graph plasticity</strong> is the next structural change. Make the edge weights learnable. Gradient flows back through the attention mechanism and updates not just the attention projections but the graph connectivity itself. The topology stays fixed (PMI-derived initial graph) but edges strengthen or weaken from prediction signal.</p>

<p>Over training, edges that helped predict correct next tokens survive. Edges that didn’t, decay. The graph self-organizes from co-occurrence structure toward successor structure — the same information the trigram uses directly, but learned geometrically rather than counted statistically.</p>
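<p>A toy version of this update, under heavy assumptions: one node with three fixed edges, each neighbor “votes” for a next token, edge strengths are a softmax over learnable logits, and the loss is the negative log-probability of the observed token. None of this is the planned implementation; it only shows that a plain gradient step strengthens edges that predicted correctly while the topology stays fixed:</p>

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_edges(votes, target, steps=50, lr=1.0):
    """votes[j] is the token that neighbor j predicts. Edge logits start
    equal and follow the gradient of -log P(target)."""
    logits = [0.0] * len(votes)
    for _ in range(steps):
        weights = softmax(logits)
        hit = [1.0 if v == target else 0.0 for v in votes]
        p_target = sum(w * h for w, h in zip(weights, hit))
        # d(-log P) / d logit_j = w_j * (1 - hit_j / P)
        grads = [w * (1.0 - h / p_target) for w, h in zip(weights, hit)]
        logits = [logit - lr * g for logit, g in zip(logits, grads)]
    return softmax(logits)


# Three neighbors of one node; only the first predicts the observed token.
votes = ["cat", "dog", "tree"]
edge_weights = train_edges(votes, target="cat")
print([round(w, 3) for w in edge_weights])
```

<p>The number of edges never changes; only their relative strength does, with most of the mass moving to the edge whose vote matched the target.</p>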

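<p>The closest analogy is Hebbian plasticity: neurons that fire together wire together. The update here comes from a gradient rather than a purely local rule, but the effect is similar. In the graph, paths that correctly predicted the next token get reinforced. The geometry evolves to encode what the task needs.</p>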

<hr />

<h2 id="three-connected-ideas">Three connected ideas</h2>

<p>The experiments are testing stage one of a larger architecture. The three ideas are connected:</p>

<p><strong>Geometry</strong> is the substrate — the space where tokens live and where computation happens. Not flat embedding space (which transformers use), but a space with native structure: distance, direction, curvature, neighborhood.</p>

<p><strong>Multi-sense tokens (quantum-token)</strong> is the representation — each token is not a point but a distribution over possible states. The same token “bank” in different geometric neighborhoods activates different sense vectors. Context collapses the superposition. This is what attention approximates, but with fixed weights and no geometric grounding. A token whose local curvature is high is near a boundary between senses — geometrically ambiguous.</p>

<p><strong>Plasticity</strong> is the learning rule — edge weights update from prediction outcomes, the geometry self-organizes toward the task. Not backprop through frozen structure, but backprop that changes the structure itself.</p>

<p>Transformers have a version of each: attention approximates multi-sense (context-dependent activations), positional encodings inject weak geometry, and gradient descent updates the weights. But the geometry is not native — it’s injected as a correction to an orderless token bag. And the weights freeze after training. No online adaptation, no structural update.</p>

<hr />

<h2 id="fractals-attractors-and-deterministic-structure">Fractals, attractors, and deterministic structure</h2>

<p>Current LLMs are statistical approximators. They learn to predict what <em>usually</em> comes next from training data. They guess from patterns.</p>

<p>Language has deterministic structure underneath the statistics. Grammar rules are recursive — sentences contain sentences, phrases contain phrases, at every scale the same rules apply. Mathematical proofs are deterministic — the same axioms always produce the same theorems. Code is exact. Even narrative has deep structure (the same story morphology appears across cultures and languages independently).</p>

<p>Fractals are the extreme case: one formula, infinite complexity, perfectly deterministic. Iterating <code class="language-plaintext highlighter-rouge">z → z² + c</code> from z = 0 and keeping the values of c for which z stays bounded generates the Mandelbrot set. Zoom into any boundary region and similar structure reappears, because the same rule is being applied. Nature uses this everywhere (tree branching, leaf venation, vascular networks, coastlines) because recursive self-similar geometry is how you pack maximum function into minimum description.</p>

<p>The implication for language geometry: if concepts have <strong>geometric attractors</strong> — regions of the token space that the dynamics always flows toward given certain inputs — then reasoning is navigation, not guessing. The geometry carries the generative rule. The forward pass follows it.</p>

<p>This is speculative. But the tensor field added to <code class="language-plaintext highlighter-rouge">geographdb-core</code> is a step toward it. Local curvature tensors encode how the space bends around each token — how strongly the geometry pulls toward an attractor in that region. High curvature = strong rule = low ambiguity. Low curvature = flat region = multiple paths equally likely.</p>

<p>If that curvature signal can be incorporated as a per-token feature — alongside position, direction, and neighborhood — the geometry starts to encode not just <em>where tokens live</em> but <em>how strongly the local rules constrain what comes next</em>.</p>

<hr />

<h2 id="where-this-is-going">Where this is going</h2>

<p>The immediate next experiment: <strong>directed transition matrix substrate</strong>. One variable changes — PMI positions swap for transition-spectral positions. Same geo-attention model. If the gap with trigram closes, directionality was the bottleneck and the architecture is sound.</p>

<p><strong>That experiment ran. Directionality is not the missing piece.</strong></p>

<table>
  <thead>
    <tr>
      <th>Substrate</th>
      <th>Best val ppl</th>
      <th>Epoch</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PMI (undirected co-occurrence)</td>
      <td>50.45</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Transition (directed successor)</td>
      <td>50.46</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Trigram baseline</td>
      <td>32.02</td>
      <td>—</td>
    </tr>
  </tbody>
</table>

<p>Both substrates converge to ~50.5 ppl and then overfit. The difference between them is 0.01 ppl — noise. Whether the geometry encodes symmetric similarity or directed successor probability, the ceiling is the same.</p>

<p>The gap is not about what the static geometry encodes. It’s about the fact that it’s static. A pre-computed position — whether from PMI or a transition matrix — gives the attention mechanism a fixed map. The map has a hard ceiling around 50 ppl regardless of how it was built. The trigram doesn’t use a map; it reads successor counts directly from the training distribution. That’s why it reaches 32.</p>

<p>The problem is static geometry itself. <strong>Learned edge plasticity</strong> is the next experiment: make edge weights learnable, train them with the same gradient that updates attention weights. The topology stays fixed (initial k-NN graph) but edges strengthen or weaken from prediction error. The geometry self-organizes toward what the task needs rather than what corpus statistics provided at build time.</p>

<p>Longer term:</p>

<ul>
  <li>Tensor curvature as per-token input feature</li>
  <li>Multi-sense token geometry (mixture of position distributions, context-selected)</li>
  <li>Online plasticity: edge weights that update at inference time from context, not just during training</li>
</ul>

<p>None of these alone is a new transformer. Together, they’re a different kind of substrate — one where structure is native rather than approximated, and where the geometry itself carries information that statistics would need much more data to recover.</p>

<p>The failing results are pointing at what the substrate is missing. That’s exactly what experiments are for.</p>]]></content><author><name>Luiz Spies</name></author><category term="language-geometry" /><summary type="html"><![CDATA[The previous post documented a series of negative results: PMI+SVD geometric positions don’t beat a trigram baseline, attention over geometry doesn’t beat a trigram baseline, RoPE doesn’t help, curvature weighting consistently hurts. Every experiment lost to two token IDs fed into a flat MLP.]]></summary></entry><entry><title type="html">Atheneum: Persistent Memory for AI Coding Agents</title><link href="https://oldnordic.github.io/engineering/2026/06/12/atheneum.html" rel="alternate" type="text/html" title="Atheneum: Persistent Memory for AI Coding Agents" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://oldnordic.github.io/engineering/2026/06/12/atheneum</id><content type="html" xml:base="https://oldnordic.github.io/engineering/2026/06/12/atheneum.html"><![CDATA[<p>Every AI coding session starts from zero. The assistant that helped you trace a bug yesterday has no memory of it today. You explain the same context again, re-answer the same questions, and watch it rediscover the same facts. The tools I’ve built over the last six months — <a href="https://crates.io/crates/magellan">magellan</a>, <a href="https://crates.io/crates/llmgrep">llmgrep</a>, <a href="https://crates.io/crates/mirage-analyzer">mirage-analyzer</a> — solve the code structure problem. They make the codebase queryable. But they don’t solve the session continuity problem. An agent still can’t carry decisions, discoveries, or hard-won debugging context from one session into the next.</p>

<p><a href="https://crates.io/crates/atheneum">atheneum</a> is the attempt to fix that. It’s an embedded knowledge graph that persists across sessions: tool calls, decisions, wiki content, code complexity signals, and raw session memory, all in a SQLite database with structured edges between them.</p>

<p>v0.5.0 is twelve days old. This is not a finished product. It is, however, running continuously and generating real data.</p>

<hr />

<h2 id="whats-in-the-graph">What’s in the graph</h2>

<p>The live database on my machine contains <strong>4,677 entities and 15,015 edges</strong>. Here’s the breakdown by entity kind (the kinds listed below account for 4,150 of the entities):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ToolCall      2,358   — every Claude Code tool use, timestamped and linked to session
WikiPage        280   — Logseq journal pages and wiki articles, synced in
Session         221   — coding sessions with branch, timestamps, tool counts
ReasoningLog    315   — reasoning traces stored during sessions
Reference       338   — symbol references (file:line)
Memory          130   — stable facts and dream-consolidated findings
File            198   — source files touched across sessions
Symbol          190   — code symbols (indexed via magellan)
TestRun         120   — test results linked to tool calls
</code></pre></div></div>

<p>Edges link these together: <code class="language-plaintext highlighter-rouge">belongs_to_project</code>, <code class="language-plaintext highlighter-rouge">observed_in</code>, <code class="language-plaintext highlighter-rouge">wikilink</code>, <code class="language-plaintext highlighter-rouge">handled_by_tool</code>, <code class="language-plaintext highlighter-rouge">accessed</code>, <code class="language-plaintext highlighter-rouge">modified</code>, <code class="language-plaintext highlighter-rouge">CALLS</code>, <code class="language-plaintext highlighter-rouge">IMPORTS</code>. The graph structure is what makes retrieval useful — you can ask “which sessions touched this symbol” or “what wiki pages link to this concept” and get answers from the edge traversal.</p>

<hr />

<h2 id="navigate-and-search">Navigate and search</h2>

<p>The two most-used CLI commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Lexical search across all entity kinds</span>
atheneum search ~/.magellan/atheneum/atheneum.db <span class="s2">"AMD GPU inference"</span> <span class="nt">--limit</span> 5

<span class="c"># Navigate: start from matching entities, walk edges outward</span>
atheneum navigate ~/.magellan/atheneum/atheneum.db <span class="s2">"session accountability"</span> <span class="nt">--concise</span> <span class="nt">--max-tokens</span> 500
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">search</code> does lexical search over stored content and returns scored results. <code class="language-plaintext highlighter-rouge">navigate</code> uses HopGraph (described below) and returns a token-budgeted subgraph. The <code class="language-plaintext highlighter-rouge">--concise</code> flag formats output as compact Markdown for pasting into a language-model context window. <code class="language-plaintext highlighter-rouge">--max-tokens 500</code> hard-truncates the output at approximately that budget.</p>

<p>Both commands work against whatever is actually in the database. The entity and edge counts above are real and were read by a fresh process: all runtime counters start at zero, so what you see is the persisted graph state, not a cached view.</p>

<hr />

<h2 id="dreaming-module">Dreaming module</h2>

<p>v0.3.0 added a reflective memory consolidation pass. After a session ends, the dreaming module runs a 6-phase pipeline over stored memories:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SCAN → DEDUPLICATE → STALE → CONTRADICTION → VERBOSE → CONSOLIDATED
</code></pre></div></div>

<p>It uses trigram Jaccard similarity to detect near-duplicates, marks stale entries, flags contradictions, strips verbose redundancy, and produces consolidated <code class="language-plaintext highlighter-rouge">Knowledge</code> entities from surviving discoveries.</p>
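<p>As a rough illustration of the near-duplicate check, here is a minimal Rust sketch of Jaccard similarity over character trigrams. The normalization (lowercasing, collapsing whitespace), the function names, and the 0.8 threshold are assumptions made for the example; the post does not say how atheneum normalizes text or which cut-off it uses.</p>

```rust
use std::collections::HashSet;

/// Character trigrams of a lowercased, whitespace-normalized string.
fn trigrams(text: &str) -> HashSet<String> {
    let normalized: Vec<char> = text
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .chars()
        .collect();
    normalized
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the two trigram sets.
fn trigram_jaccard(a: &str, b: &str) -> f64 {
    let (ta, tb) = (trigrams(a), trigrams(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0; // both inputs are shorter than three characters
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

fn main() {
    let a = "HNSW index is rebuilt during reindex";
    let b = "hnsw index is  rebuilt during reindex";
    let c = "parking_lot mutexes do not poison";
    // Hypothetical cut-off; the threshold atheneum uses is not stated in the post.
    let threshold = 0.8;
    for (label, other) in [("a~b", b), ("a~c", c)] {
        let score = trigram_jaccard(a, other);
        println!("{label} = {score:.2}, duplicate: {}", score >= threshold);
    }
}
```

<p>Two memories that differ only in case or spacing score 1.0 here, while unrelated text shares few trigrams and scores low.</p>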

<p>The <code class="language-plaintext highlighter-rouge">memory-list</code> command shows what survived:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>atheneum memory-list ~/.magellan/atheneum/atheneum.db <span class="nt">--limit</span> 5
</code></pre></div></div>

<p>The current database contains complexity hotspot entries from the dreaming module — cross-project code quality signals that were extracted from session data and consolidated into stable memory. An example from the live DB:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dream/code/complexity_hotspot/abtop-draw_sessions_panel_active
  High cyclomatic complexity in abtop: 5 functions &gt; 20
  Top: draw_sessions_panel_active=91 (loc=881, fan_in=2, fan_out=52)
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dream/code/complexity_hotspot/llmgrep-search_symbols_impl
  High cyclomatic complexity in llmgrep: 6 functions &gt; 20
  Top: search_symbols_impl=63 (loc=724, fan_in=2, fan_out=30)
</code></pre></div></div>

<p>These entries persist across sessions. The next agent that loads context for <code class="language-plaintext highlighter-rouge">abtop</code> or <code class="language-plaintext highlighter-rouge">llmgrep</code> gets this signal without re-running any analysis.</p>

<p>There’s also a dry-run mode:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>atheneum dream ~/.magellan/atheneum/atheneum.db <span class="nt">--dry-run</span> <span class="nt">--scope</span> dream
</code></pre></div></div>

<p>which reports what would be consolidated without committing anything.</p>

<hr />

<h2 id="hopgraph">HopGraph</h2>

<p><code class="language-plaintext highlighter-rouge">navigate</code> is backed by HopGraph (v0.2.0): vector-based entry point + BFS subgraph walk + token-budgeted truncation.</p>

<p>The flow:</p>
<ol>
  <li>Embed the query text (HashEmbedder at 128 dims by default; OllamaEmbedder at 768 dims as an optional feature)</li>
  <li>HNSW search across all indexed entities → ranked candidates</li>
  <li>BFS from top-k candidates, following allowed edge types to depth N</li>
  <li>Truncate subgraph to stay within the token budget</li>
</ol>
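<p>Steps 3 and 4 of the flow above can be sketched in Rust as follows. This is a simplified illustration, not HopGraph itself: the <code class="language-plaintext highlighter-rouge">Node</code> type, the four-characters-per-token estimate, and the sample graph are all hypothetical, and the embedding and HNSW steps are replaced by a seed list passed in by the caller.</p>

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical in-memory node: the text that would be emitted plus its edges.
struct Node {
    text: &'static str,
    /// Outgoing edges as (edge type, target id).
    edges: Vec<(&'static str, u32)>,
}

/// Rough token estimate (about four characters per token). An assumption,
/// not the tokenizer atheneum uses.
fn estimate_tokens(text: &str) -> usize {
    (text.len() + 3) / 4
}

/// Breadth-first walk from `seeds`, following only `allowed` edge types for at
/// most `max_depth` hops, stopping once the next node would exceed `max_tokens`.
fn walk(
    graph: &HashMap<u32, Node>,
    seeds: &[u32],
    allowed: &[&str],
    max_depth: usize,
    max_tokens: usize,
) -> Vec<u32> {
    let mut visited: HashSet<u32> = HashSet::new();
    let mut queue: VecDeque<(u32, usize)> = VecDeque::new();
    for &seed in seeds {
        if visited.insert(seed) {
            queue.push_back((seed, 0));
        }
    }
    let mut picked = Vec::new();
    let mut used = 0;
    while let Some((id, depth)) = queue.pop_front() {
        let Some(node) = graph.get(&id) else { continue };
        let cost = estimate_tokens(node.text);
        if used + cost > max_tokens {
            break; // budget exhausted; nodes closer to the seeds were emitted first
        }
        used += cost;
        picked.push(id);
        if depth == max_depth {
            continue;
        }
        for &(edge_type, target) in &node.edges {
            if allowed.contains(&edge_type) && visited.insert(target) {
                queue.push_back((target, depth + 1));
            }
        }
    }
    picked
}

/// Made-up four-node graph used only for the demonstration.
fn sample_graph() -> HashMap<u32, Node> {
    HashMap::from([
        (1, Node { text: "session: fixed cross/navigate", edges: vec![("accessed", 2), ("wikilink", 4)] }),
        (2, Node { text: "file: src/navigate.rs", edges: vec![("CALLS", 3)] }),
        (3, Node { text: "symbol: build_router", edges: vec![] }),
        (4, Node { text: "wiki: design notes", edges: vec![] }),
    ])
}

fn main() {
    let graph = sample_graph();
    // Node 1 stands in for the top HNSW hit; only two edge types are followed.
    let ids = walk(&graph, &[1], &["accessed", "CALLS"], 2, 500);
    println!("{ids:?}");
}
```

<p>Because the walk is breadth-first, truncation drops the most distant nodes first, which is one plausible reason to pair BFS with a token budget.</p>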

<p>The HNSW index is persistent — it survives process restarts and is rebuilt during <code class="language-plaintext highlighter-rouge">reindex</code>. The in-process query cache (added in v0.3.1) means repeated identical queries don’t touch SQLite.</p>

<hr />

<h2 id="cross-project-queries-v050">Cross-project queries (v0.5.0)</h2>

<p>The latest release adds cross-project search. Rather than copying data between databases (copies go stale immediately), atheneum maintains a lightweight routing registry (<code class="language-plaintext highlighter-rouge">meta.db</code>) and lazily attaches each registered project’s magellan DB on demand with <code class="language-plaintext highlighter-rouge">ATTACH DATABASE</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Register once</span>
atheneum meta-register envoy /home/feanor/Projects/envoy <span class="se">\</span>
  /home/feanor/Projects/envoy/.magellan/magellan.db <span class="nt">--language</span> rust

<span class="c"># Search across all registered Rust projects</span>
atheneum cross-search <span class="s2">"build_router"</span> <span class="nt">--language</span> rust <span class="nt">--k</span> 10

<span class="c"># Navigate with BFS walk per project</span>
atheneum cross-navigate <span class="s2">"error handling"</span> <span class="nt">--language</span> rust <span class="nt">--k</span> 5 <span class="nt">--depth</span> 2
</code></pre></div></div>

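<p>SQLite allows up to 10 <code class="language-plaintext highlighter-rouge">ATTACH</code>ed databases per connection by default (the compile-time <code class="language-plaintext highlighter-rouge">SQLITE_MAX_ATTACHED</code> limit, which can be raised to at most 125). The LRU cache defaults to 8 to stay safely under that limit. Unreadable or missing DBs are skipped with a warning, so one broken project does not abort the query.</p>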
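<p>To make the LRU idea concrete, here is a minimal Rust sketch of a cache that tracks which project databases are attached and evicts the least recently used one when full. The <code class="language-plaintext highlighter-rouge">AttachCache</code> type and its method are hypothetical; the sketch only tracks names and marks with comments where real code would issue the attach and detach statements.</p>

```rust
use std::collections::VecDeque;

/// Tracks which project databases are currently attached, most recent last.
/// Hypothetical sketch: real code would run ATTACH/DETACH where noted.
struct AttachCache {
    capacity: usize,
    attached: VecDeque<String>,
}

impl AttachCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, attached: VecDeque::new() }
    }

    /// Marks `project` as used; returns the project evicted to make room, if any.
    fn touch(&mut self, project: &str) -> Option<String> {
        if let Some(pos) = self.attached.iter().position(|p| p == project) {
            // Already attached: move it to the most-recently-used end.
            let name = self.attached.remove(pos).expect("position is in range");
            self.attached.push_back(name);
            return None;
        }
        let evicted = if self.attached.len() >= self.capacity {
            self.attached.pop_front() // here: DETACH DATABASE for the evicted alias
        } else {
            None
        };
        self.attached.push_back(project.to_string()); // here: ATTACH DATABASE ... AS alias
        evicted
    }
}

fn main() {
    // Capacity 8 mirrors the default mentioned above.
    let mut cache = AttachCache::new(8);
    for i in 0..9 {
        let name = format!("project-{i}");
        if let Some(old) = cache.touch(&name) {
            println!("attaching {name} evicted {old}");
        }
    }
}
```

<p>Keeping the cap below the SQLite limit leaves headroom for the main database and the registry on the same connection.</p>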

<p>This is the piece that makes atheneum genuinely cross-project rather than per-project. A query for “error handling” can surface results from magellan, llmgrep, envoy, and rocmforge simultaneously, pulling from their live magellan symbol graphs.</p>

<hr />

<h2 id="wiki-sync">Wiki sync</h2>

<p>Logseq journal files and wiki pages can be synced directly into the graph:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>atheneum sync-logseq ~/.magellan/atheneum/atheneum.db ~/wiki grounded
atheneum reindex ~/.magellan/atheneum/atheneum.db
</code></pre></div></div>

<p>This creates <code class="language-plaintext highlighter-rouge">WikiPage</code> entities from the markdown content and <code class="language-plaintext highlighter-rouge">wikilink</code> edges from <code class="language-plaintext highlighter-rouge">[[...]]</code> references. Navigate queries can then traverse from code symbols into wiki pages and back — linking a design decision in a journal to the code symbols it describes.</p>
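<p>The link extraction step could look roughly like the Rust sketch below. It is an illustration under assumptions, not the sync code: the function name is made up, and treating a <code class="language-plaintext highlighter-rouge">|alias</code> suffix as display text is a guess about the wiki syntax in use.</p>

```rust
/// Extracts the targets of `[[...]]` references from Markdown text.
/// Assumption: a `|alias` suffix is display text and is dropped.
fn wikilinks(markdown: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = markdown;
    while let Some(start) = rest.find("[[") {
        let after = &rest[start + 2..];
        // An opening marker without a closing one ends the scan.
        let Some(end) = after.find("]]") else { break };
        let target = after[..end].split('|').next().unwrap_or("").trim();
        if !target.is_empty() {
            links.push(target.to_string());
        }
        rest = &after[end + 2..];
    }
    links
}

fn main() {
    let page = "Decision recorded in [[HopGraph]]; see [[Dreaming module|dreaming]].";
    for target in wikilinks(page) {
        // Each target would become one wikilink edge from this page.
        println!("wikilink -> {target}");
    }
}
```

<p>Each returned target corresponds to one <code class="language-plaintext highlighter-rouge">wikilink</code> edge from the page being synced.</p>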

<p>The 280 WikiPage entities in the live database came from this sync. Most are stub pages with link structure but limited body content; pages that have been visited recently via <code class="language-plaintext highlighter-rouge">navigate</code> queries have their full content indexed.</p>

<hr />

<h2 id="multiple-assistants">Multiple assistants</h2>

<p>The database is not tied to a single assistant. Three consumers currently write to the same atheneum DB:</p>

<p><strong>Claude Code</strong> (this environment): session data, tool calls, reasoning logs, and discoveries all go through the envoy coordination layer, which writes to atheneum.</p>

<p><strong>atheneum-py</strong>: a Python port of the core atheneum library, used to connect Gemini CLI to the same graph. It implements the same memory and knowledge APIs in Python, so a Gemini session can read discoveries written by a Claude Code session and vice versa.</p>

<p><strong>Hermes</strong>: an open-source Python AI assistant. The atheneum plugin at <code class="language-plaintext highlighter-rouge">~/.hermes/plugins/atheneum/</code> gives Hermes read/write access to the same graph. It uses <code class="language-plaintext highlighter-rouge">plugin.yaml</code> for discovery and exposes atheneum’s search and memory APIs as Hermes tools.</p>

<p>The multi-assistant aspect is the core design goal. The knowledge graph accumulates across every session, regardless of which assistant ran it. What Claude Code learned about <code class="language-plaintext highlighter-rouge">llmgrep</code>’s complexity hotspots yesterday is available to Hermes today.</p>

<hr />

<h2 id="mcp-server">MCP server</h2>

<p>The <code class="language-plaintext highlighter-rouge">atheneum-mcp</code> crate (in the same repository) implements an MCP server using the <a href="https://crates.io/crates/rmcp">rmcp</a> library. It exposes atheneum’s memory, search, navigate, and discovery APIs as MCP tools for any MCP-compatible client.</p>

<p>I have not tested this end-to-end against a running MCP client in this session. The crate builds and the protocol implementation exists, but I can’t claim it’s been verified beyond compilation.</p>

<hr />

<h2 id="current-state">Current state</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>v0.5.0 — released 2026-06-09
v0.1.0 — released 2026-05-31
</code></pre></div></div>

<p>Twelve days of active development. It’s moving fast because it’s solving an immediate problem in my workflow. The graph is real and in use daily. The API is not stable — v0.3.x through v0.5.0 had breaking changes in <code class="language-plaintext highlighter-rouge">store_memory</code>, the dreaming module was added and rewritten, and cross-project queries didn’t exist a week ago.</p>

<p>What works: the CLI (search, navigate, memory-list, graph-stats, dream dry-run), the dreaming consolidation pass, wiki sync, memory persistence across sessions. What’s less certain: the MCP server end-to-end, atheneum-py feature parity with the Rust version, HopGraph accuracy at high entity counts.</p>

<p>The source is at <a href="https://github.com/oldnordic">github.com/oldnordic</a>. The crate is on <a href="https://crates.io/crates/atheneum">crates.io</a>. It requires a magellan-indexed project to be most useful; read the <a href="https://oldnordic.github.io/engineering/2026/06/11/magellan.html">grounded coding workflow</a> for context on how the tools fit together.</p>]]></content><author><name>Luiz Spies</name></author><category term="engineering" /><summary type="html"><![CDATA[Every AI coding session starts from zero. The assistant that helped you trace a bug yesterday has no memory of it today. You explain the same context again, re-answer the same questions, and watch it rediscover the same facts. The tools I’ve built over the last six months — magellan, llmgrep, mirage-analyzer — solve the code structure problem. They make the codebase queryable. But they don’t solve the session continuity problem. An agent still can’t carry decisions, discoveries, or hard-won debugging context from one session into the next.]]></summary></entry><entry><title type="html">Envoy v0.2.0: Observability, Lock-Free Paths, and Bug Fixes</title><link href="https://oldnordic.github.io/engineering/2026/06/12/envoy-v0.2.html" rel="alternate" type="text/html" title="Envoy v0.2.0: Observability, Lock-Free Paths, and Bug Fixes" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://oldnordic.github.io/engineering/2026/06/12/envoy-v0.2</id><content type="html" xml:base="https://oldnordic.github.io/engineering/2026/06/12/envoy-v0.2.html"><![CDATA[<p>Three weeks after the <a href="/engineering/2026/06/12/envoy/">initial release</a>, envoy v0.2.0 is out. This isn’t a feature dump – it’s the result of running the server continuously and fixing the things that actually hurt. Three bugs from the original article got fixed, Prometheus metrics landed, and a performance improvement from another project turned out to transfer cleanly.</p>

<h2 id="what-changed">What changed</h2>

<p>The diff is 1,551 insertions, 701 deletions across 28 files. The big items:</p>

<h3 id="parking_lot-everywhere">parking_lot everywhere</h3>

<p>The original article mentioned envoy uses SQLite for persistence. What I didn’t mention is that the in-memory state (agent registry, circuit breaker, message store) was protected by <code class="language-plaintext highlighter-rouge">std::sync::Mutex</code>. Every lock site had poison recovery code – <code class="language-plaintext highlighter-rouge">.lock().unwrap_or_else(|e| e.into_inner())</code> or <code class="language-plaintext highlighter-rouge">.lock().map_err(|e| EnvoyError::LockPoisoned(...))</code>. In practice, a poisoned mutex means a panic already happened and the data might be corrupt. “Recovering” by ignoring the poison doesn’t help.</p>
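<p>A small, standard-library-only Rust sketch shows why ignoring the poison flag does not restore consistency. The <code class="language-plaintext highlighter-rouge">Registry</code> type and its invariant are invented for the example and are not envoy’s actual state.</p>

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Invariant the code is supposed to keep: `total == items.len()`.
struct Registry {
    items: Vec<&'static str>,
    total: usize,
}

/// Poisons the mutex by panicking halfway through an update, then "recovers"
/// the guard by ignoring the poison. Returns (items.len(), total).
fn recover_after_panic() -> (usize, usize) {
    let registry = Arc::new(Mutex::new(Registry { items: Vec::new(), total: 0 }));

    let writer = Arc::clone(&registry);
    let result = thread::spawn(move || {
        let mut guard = writer.lock().unwrap();
        guard.items.push("agent-1");
        let fail = true; // stands in for any bug between the two updates
        if fail {
            panic!("simulated failure before `total` is updated");
        }
        guard.total += 1;
    })
    .join();
    assert!(result.is_err()); // the writer panicked while holding the lock

    assert!(registry.is_poisoned());
    // Ignoring the poison hands back the guard, but not a consistent value.
    let guard = registry.lock().unwrap_or_else(|e| e.into_inner());
    (guard.items.len(), guard.total)
}

fn main() {
    let (len, total) = recover_after_panic();
    println!("items.len() = {len}, total = {total}");
}
```

<p>The recovered guard exposes one item with a total of zero, so the invariant is already broken by the time any recovery code runs.</p>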

<p>I was already migrating <a href="https://github.com/oldnordic/rs3gw">rs3gw</a> (a separate project) to <code class="language-plaintext highlighter-rouge">parking_lot::Mutex</code> and noticed the pattern transferred directly. parking_lot mutexes don’t use poisoning – <code class="language-plaintext highlighter-rouge">lock()</code> returns a <code class="language-plaintext highlighter-rouge">MutexGuard&lt;T&gt;</code> directly, no <code class="language-plaintext highlighter-rouge">Result</code>. The changes:</p>

<ul>
  <li>Removed <code class="language-plaintext highlighter-rouge">LockPoisoned</code> error variant entirely</li>
  <li>Removed <code class="language-plaintext highlighter-rouge">recover_lock()</code> helper function</li>
  <li>Removed <code class="language-plaintext highlighter-rouge">FastMutex</code> type alias</li>
  <li>Simplified 25+ <code class="language-plaintext highlighter-rouge">.lock()</code> call sites across 7 files</li>
  <li>Each mutex went from ~40 bytes to ~1 byte</li>
</ul>

<p>No behavioral change for callers – the server responds to the same endpoints the same way. But the code is cleaner and the lock overhead is measurably smaller.</p>

<h3 id="prometheus-metrics-endpoint">Prometheus <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint</h3>

<p>The original envoy had two monitoring endpoints: <code class="language-plaintext highlighter-rouge">/health</code> (returns <code class="language-plaintext highlighter-rouge">{"status":"ok","uptime_seconds":N}</code>) and <code class="language-plaintext highlighter-rouge">/stats</code> (returns aggregate counters). Useful for ad-hoc checks, useless for dashboards or alerting.</p>

<p>v0.2.0 adds <code class="language-plaintext highlighter-rouge">GET /metrics</code> in Prometheus exposition format:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># HELP envoy_requests_total Total HTTP requests, labeled by operation and status class
# TYPE envoy_requests_total counter
envoy_requests_total{method="GET",path="/health",status="2xx"} 14

# HELP envoy_agents_online Number of currently active agents
# TYPE envoy_agents_online gauge
envoy_agents_online 3

# HELP envoy_request_duration_ms Request latency in milliseconds
# TYPE envoy_request_duration_ms histogram
envoy_request_duration_ms_bucket{path="/health",le="0.5"} 14
envoy_request_duration_ms_sum{path="/health"} 0.821
envoy_request_duration_ms_count{path="/health"} 14
</code></pre></div></div>

<p>Three metric families:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Type</th>
      <th>What it measures</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">envoy_requests_total</code></td>
      <td>counter</td>
      <td>Request count by method, path, status class</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">envoy_request_duration_ms</code></td>
      <td>histogram</td>
      <td>Latency distribution with 9 buckets (0.5ms to 5s)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">envoy_agents_online</code></td>
      <td>gauge</td>
      <td>Active agent count, updated on register/retire</td>
    </tr>
  </tbody>
</table>

<p><strong>Path normalization.</strong> Raw URL paths cause cardinality explosions in Prometheus. A path like <code class="language-plaintext highlighter-rouge">/agents/id1/messages/42/ack</code> becomes a unique label, and with thousands of agents and messages you get thousands of time series. The middleware normalizes path segments that look like IDs – numeric (<code class="language-plaintext highlighter-rouge">42</code>), named (<code class="language-plaintext highlighter-rouge">id1121</code>), or UUID (<code class="language-plaintext highlighter-rouge">338b8adc-...</code>) – into a single <code class="language-plaintext highlighter-rouge">:id</code> token. So <code class="language-plaintext highlighter-rouge">/agents/id1/messages/42/ack</code> becomes <code class="language-plaintext highlighter-rouge">/agents/:id/messages/:id/ack</code>. Same metric regardless of which agent or message.</p>
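<p>A minimal Rust sketch of this kind of normalization follows. The exact rules envoy applies are not spelled out here, so the three predicates are assumptions: all-digit segments, <code class="language-plaintext highlighter-rouge">id</code> followed by dot-separated digit groups, and canonical 8-4-4-4-12 hexadecimal UUIDs. The UUID in the demo is a made-up value.</p>

```rust
/// True for all-digit segments such as `42`.
fn is_numeric(seg: &str) -> bool {
    !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit())
}

/// True for agent-style IDs such as `id1` or `id1.1` (assumed shape:
/// `id` followed by dot-separated digit groups).
fn is_agent_id(seg: &str) -> bool {
    match seg.strip_prefix("id") {
        Some(rest) => rest.split('.').all(is_numeric),
        None => false,
    }
}

/// True for canonical 8-4-4-4-12 hexadecimal UUIDs.
fn is_uuid(seg: &str) -> bool {
    let groups: Vec<&str> = seg.split('-').collect();
    groups.len() == 5
        && groups.iter().zip([8, 4, 4, 4, 12]).all(|(g, len)| {
            g.len() == len && g.bytes().all(|b| b.is_ascii_hexdigit())
        })
}

/// Replaces every ID-like path segment with `:id`.
fn normalize_path(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if is_numeric(seg) || is_agent_id(seg) || is_uuid(seg) {
                ":id"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    for path in [
        "/agents/id1/messages/42/ack",
        "/atheneum/sessions/123e4567-e89b-12d3-a456-426614174000",
        "/health",
    ] {
        println!("{path} -> {}", normalize_path(path));
    }
}
```

<p>Checking whole segments rather than substrings keeps ordinary route words such as <code class="language-plaintext highlighter-rouge">ideas</code> intact, since only an exact ID shape is rewritten.</p>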

<p>The approach is borrowed from rs3gw, which uses the same <code class="language-plaintext highlighter-rouge">metrics</code> + <code class="language-plaintext highlighter-rouge">metrics-exporter-prometheus</code> crate combination. The middleware wraps every request, records start time, normalizes the path, increments the counter, and observes the duration.</p>

<p>Prometheus scrape config:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">scrape_configs</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">job_name</span><span class="pi">:</span> <span class="s1">'</span><span class="s">envoy'</span>
    <span class="na">static_configs</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">targets</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">127.0.0.1:9876'</span><span class="pi">]</span>
    <span class="na">scrape_interval</span><span class="pi">:</span> <span class="s">15s</span>
    <span class="na">metrics_path</span><span class="pi">:</span> <span class="s">/metrics</span>
</code></pre></div></div>

<h3 id="request-tracing">Request tracing</h3>

<p>Every HTTP response now includes an <code class="language-plaintext highlighter-rouge">x-request-id</code> header with a unique UUID. This is done through tower-http layers – <code class="language-plaintext highlighter-rouge">SetRequestIdLayer</code> generates the UUID, <code class="language-plaintext highlighter-rouge">PropagateRequestIdLayer</code> ensures it appears on responses, and <code class="language-plaintext highlighter-rouge">TraceLayer</code> logs request/response pairs when <code class="language-plaintext highlighter-rouge">RUST_LOG=tower_http=debug</code> is set.</p>

<p>This is useful when debugging: if an agent reports a failed request, the request ID lets you find the exact log entry.</p>

<h2 id="bug-fixes">Bug fixes</h2>

<p>Three issues from the original article’s “What’s rough” section got fixed:</p>

<h3 id="crossnavigate-no-longer-errors"><code class="language-plaintext highlighter-rouge">cross/navigate</code> no longer errors</h3>

<p>The cross-project graph navigation endpoint (<code class="language-plaintext highlighter-rouge">GET /atheneum/cross/navigate?q=build_router&amp;language=rust&amp;depth=2</code>) was broken because the BFS edge query referenced a <code class="language-plaintext highlighter-rouge">kind</code> column, but production magellan databases use <code class="language-plaintext highlighter-rouge">edge_type</code>. The fix was a SQL alias: <code class="language-plaintext highlighter-rouge">SELECT id, edge_type AS kind, ...</code>. Now the alias works with both the production schema and the test fixtures (which use <code class="language-plaintext highlighter-rouge">kind</code>). Symbol search (<code class="language-plaintext highlighter-rouge">/atheneum/cross/search</code>) was unaffected – it queries <code class="language-plaintext highlighter-rouge">graph_entities</code> which does use <code class="language-plaintext highlighter-rouge">kind</code>.</p>

<h3 id="evidence-endpoints-return-json">Evidence endpoints return JSON</h3>

<p>Eight POST handlers – <code class="language-plaintext highlighter-rouge">post_prompt</code>, <code class="language-plaintext highlighter-rouge">post_tool_call</code>, <code class="language-plaintext highlighter-rouge">post_file_write</code>, <code class="language-plaintext highlighter-rouge">post_commit</code>, <code class="language-plaintext highlighter-rouge">post_test_run</code>, <code class="language-plaintext highlighter-rouge">post_fix_chain</code>, <code class="language-plaintext highlighter-rouge">post_bench_run</code>, <code class="language-plaintext highlighter-rouge">post_subagent_handover</code> – returned bare <code class="language-plaintext highlighter-rouge">201 Created</code> with no response body. Now all return <code class="language-plaintext highlighter-rouge">{"recorded": true}</code> so callers can confirm success without relying on HTTP status alone. This was one of the API discoverability complaints from the original article.</p>

<h3 id="api-documentation-rewritten">API documentation rewritten</h3>

<p>The original <code class="language-plaintext highlighter-rouge">API.md</code> was incomplete and sometimes wrong – several required fields weren’t documented. v0.2.0 rewrites it from the Rust struct definitions. Every endpoint now has correct request fields, required/optional markers, and response shapes. Verified against the actual handler code, not from memory.</p>

<h2 id="whats-still-rough">What’s still rough</h2>

<p>Honest update on what hasn’t improved:</p>

<ul>
  <li><strong>The MCP polling problem</strong> is unchanged. Agents still poll for messages. The WebSocket endpoint exists but coding agents don’t speak WebSocket. This requires a protocol-level change in MCP, not an envoy fix.</li>
  <li><strong>Token savings counter</strong> still returns 0. Low priority – it’s a nice-to-have metric, not a correctness issue.</li>
  <li><strong>No Grafana dashboards</strong> yet. The Prometheus metrics are there, but I haven’t built the dashboard JSON. On the list.</li>
  <li><strong>Single-node only.</strong> Envoy uses a single SQLite database. No clustering, no replication, no multi-node coordination. If you need that, you need a different tool.</li>
</ul>

<h2 id="numbers">Numbers</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Version:    0.2.0
LOC:        11,800 (Rust)
Tests:      65 (was 57)
Endpoints:  21+ (added /metrics)
Runtime:    SQLite (no external services)
</code></pre></div></div>

<p>The test count went from 57 to 65 – the 8 new tests cover the metrics module (path normalization, ID detection, UUID handling, histogram recording).</p>

<h2 id="install">Install</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo <span class="nb">install </span>agent-envoy
</code></pre></div></div>

<p>Source: <a href="https://github.com/oldnordic/envoy">github.com/oldnordic/envoy</a></p>]]></content><author><name>Luiz Spies</name></author><category term="engineering" /><summary type="html"><![CDATA[Three weeks after the initial release, envoy v0.2.0 is out. This isn’t a feature dump – it’s the result of running the server continuously and fixing the things that actually hurt. Three bugs from the original article got fixed, Prometheus metrics landed, and a performance improvement from another project turned out to transfer cleanly.]]></summary></entry><entry><title type="html">Envoy: The Coordination Server AI Coding Agents Were Missing</title><link href="https://oldnordic.github.io/engineering/2026/06/12/envoy.html" rel="alternate" type="text/html" title="Envoy: The Coordination Server AI Coding Agents Were Missing" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://oldnordic.github.io/engineering/2026/06/12/envoy</id><content type="html" xml:base="https://oldnordic.github.io/engineering/2026/06/12/envoy.html"><![CDATA[<p>I run multiple AI coding agents in parallel. Claude Code sessions, Hermes agents, subagents spawning subagents. After a while I noticed something: <strong>none of them know the others exist.</strong> They overwrite each other’s files, repeat discoveries, and have no memory of what happened yesterday. There is no infrastructure for this. So I built one.</p>

<p>Envoy is an HTTP+JSON coordination server for AI coding agents. It provides agent identity, structured messaging, session accountability, and knowledge persistence – all backed by SQLite, no Postgres, no Redis, no Node.js.</p>

<h2 id="whats-missing">What’s missing</h2>

<p>Every major AI coding tool (Claude Code, Cursor, Copilot) treats each session as isolated:</p>

<table>
  <thead>
    <tr>
      <th>Problem</th>
      <th>Consequence</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No persistent identity</td>
      <td>Agents can’t address each other (“tell agent X to stop editing file Y”)</td>
    </tr>
    <tr>
      <td>No cross-session memory</td>
      <td>Every session re-discovers the same bugs, re-reads the same files</td>
    </tr>
    <tr>
      <td>No audit trail</td>
      <td>You can’t answer “who changed this file and why?”</td>
    </tr>
    <tr>
      <td>No subagent accountability</td>
      <td>Subagents fail silently; parents don’t know what happened</td>
    </tr>
    <tr>
      <td>No cross-project search</td>
      <td>Working on 3 repos means running 3 separate queries</td>
    </tr>
  </tbody>
</table>

<p>Envoy addresses all of these. Whether that’s a good idea depends on whether you actually run multiple agents – if you don’t, this is overkill.</p>

<h2 id="how-it-works">How it works</h2>

<p>Everything is SQLite-backed. The stack is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>envoy (HTTP server, this project)
  └── atheneum (embedded knowledge graph)
        └── sqlitegraph (SQLite graph engine with pub/sub)
</code></pre></div></div>

<p>Start the server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>envoy serve <span class="nt">--port</span> 9876
</code></pre></div></div>

<p>Or as a systemd user service:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl <span class="nt">--user</span> start envoy
</code></pre></div></div>

<p>The server has been running on my machine for 42+ hours straight with no restarts. Health check:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl http://127.0.0.1:9876/health
<span class="o">{</span><span class="s2">"status"</span>:<span class="s2">"ok"</span>,<span class="s2">"uptime_seconds"</span>:152986,<span class="s2">"agents_online"</span>:2<span class="o">}</span>
</code></pre></div></div>

<h3 id="agent-identity">Agent identity</h3>

<p>Agents register at session start. The server assigns hierarchical IDs:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/agents <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"name":"claude-main","kind":"claude"}'</span>
<span class="o">{</span><span class="s2">"agent_id"</span>:<span class="s2">"id1"</span>,<span class="s2">"name"</span>:<span class="s2">"claude-main"</span>,<span class="s2">"is_new"</span>:true,...<span class="o">}</span>
</code></pre></div></div>

<p>Subagents get dotted IDs that encode the hierarchy:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/agents <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"name":"sub-agent-1","kind":"claude","parent_id":"id1"}'</span>
<span class="o">{</span><span class="s2">"agent_id"</span>:<span class="s2">"id1.1"</span>,<span class="s2">"name"</span>:<span class="s2">"sub-agent-1"</span>,<span class="s2">"parent_id"</span>:<span class="s2">"id1"</span>,...<span class="o">}</span>
</code></pre></div></div>

<p>Retiring an agent cascades to its children:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/agents/id1/retire <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Agent-Id: id1"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"agent_id":"id1"}'</span>
<span class="o">{</span><span class="s2">"affected"</span>:[<span class="s2">"id1"</span>,<span class="s2">"id1.1"</span><span class="o">]</span>,<span class="s2">"retired"</span>:true<span class="o">}</span>
</code></pre></div></div>
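<p>One way to picture the cascade is as a prefix match on the dotted IDs. The Rust sketch below is an assumption about how such a lookup could work; whether envoy derives the affected set from the ID string or from stored <code class="language-plaintext highlighter-rouge">parent_id</code> links is not stated here, and both function names are made up.</p>

```rust
/// True when `candidate` is `ancestor` itself or sits below it in the dotted hierarchy.
fn is_self_or_descendant(candidate: &str, ancestor: &str) -> bool {
    match candidate.strip_prefix(ancestor) {
        // Require a dot after the prefix so `id10` is not treated as a child of `id1`.
        Some(rest) => rest.is_empty() || rest.starts_with('.'),
        None => false,
    }
}

/// Returns the IDs a retire call on `target` would affect, in registry order.
fn retire_cascade<'a>(registry: &[&'a str], target: &str) -> Vec<&'a str> {
    registry
        .iter()
        .copied()
        .filter(|id| is_self_or_descendant(id, target))
        .collect()
}

fn main() {
    let registry = ["id1", "id1.1", "id1.1.2", "id10", "id2"];
    println!("{:?}", retire_cascade(&registry, "id1"));
}
```

<p>The dot check matters: a plain string prefix test would also retire <code class="language-plaintext highlighter-rouge">id10</code> when <code class="language-plaintext highlighter-rouge">id1</code> is retired.</p>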

<h3 id="session-accountability">Session accountability</h3>

<p>Every session writes structured data through <code class="language-plaintext highlighter-rouge">envoy-hook</code> (a companion binary that plugs into Claude Code’s hook system). The lifecycle is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SessionStart   → POST /atheneum/sessions
PostToolUse    → POST /atheneum/tool-calls
SubagentStop   → POST /atheneum/sessions/{id}/handover
Stop           → PATCH /atheneum/sessions/{id}
</code></pre></div></div>

<p>Query prior sessions before starting work:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-s</span> <span class="s2">"http://127.0.0.1:9876/atheneum/sessions?project=envoy&amp;last=1"</span>
<span class="o">[{</span>
  <span class="s2">"session_id"</span>: <span class="s2">"..."</span>,
  <span class="s2">"project"</span>: <span class="s2">"envoy"</span>,
  <span class="s2">"git_branch"</span>: <span class="s2">"main"</span>,
  <span class="s2">"tool_call_count"</span>: 47,
  <span class="s2">"file_write_count"</span>: 12,
  <span class="s2">"last_tool"</span>: <span class="s2">"cargo test"</span>,
  <span class="s2">"last_tool_summary"</span>: <span class="s2">"all 34 tests passed"</span>
<span class="o">}]</span>
</code></pre></div></div>

<p>Tool call logging requires <code class="language-plaintext highlighter-rouge">session_id</code>, <code class="language-plaintext highlighter-rouge">tool_name</code>, and <code class="language-plaintext highlighter-rouge">exit_status</code> (the fields that tripped me up during testing – the API is precise about what it expects):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/atheneum/tool-calls <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Agent-Id: id1"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"session_id":"...","tool_name":"read_file",
       "exit_status":"success","input_summary":"read src/main.rs",
       "output_summary":"42 lines","latency_ms":150}'</span>
</code></pre></div></div>
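
<p>Since the API rejects incomplete payloads, a client can check the three required fields before sending. This is a sketch under assumptions: the <code class="language-plaintext highlighter-rouge">ToolCall</code> struct and <code class="language-plaintext highlighter-rouge">missing_fields</code> helper are hypothetical client-side names, and treating an empty string as missing is my choice, not documented server behaviour.</p>

```rust
/// Hypothetical client-side mirror of the tool-call payload.
#[derive(Default)]
pub struct ToolCall {
    pub session_id: Option<String>,
    pub tool_name: Option<String>,
    pub exit_status: Option<String>,
    pub input_summary: Option<String>, // optional, never reported as missing
}

/// Names of required fields that are absent or empty.
pub fn missing_fields(call: &ToolCall) -> Vec<&'static str> {
    let required = [
        ("session_id", &call.session_id),
        ("tool_name", &call.tool_name),
        ("exit_status", &call.exit_status),
    ];
    required
        .iter()
        .filter(|(_, value)| value.as_deref().map_or(true, |s| s.is_empty()))
        .map(|(name, _)| *name)
        .collect()
}
```

<p>An empty result means the payload is safe to POST; anything else names the fields that would otherwise come back as a 422.</p>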

<h3 id="messaging-between-agents">Messaging between agents</h3>

<p>Agents send messages to each other. This is the core coordination primitive:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/messages <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Agent-Id: id1"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"type":"direct","from":"id1","to":"id2",
       "parts":[{"text":"hey, the build is green"}]}'</span>
<span class="o">{</span><span class="s2">"message_id"</span>:<span class="s2">"6751"</span>,<span class="s2">"from"</span>:<span class="s2">"id1"</span>,<span class="s2">"to"</span>:<span class="s2">"id2"</span>,...<span class="o">}</span>
</code></pre></div></div>

<p>The recipient polls for pending messages:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-s</span> <span class="s2">"http://127.0.0.1:9876/agents/id2/messages/pending"</span>
<span class="o">{</span>
  <span class="s2">"count"</span>: 1,
  <span class="s2">"messages"</span>: <span class="o">[{</span>
    <span class="s2">"message_id"</span>: <span class="s2">"6751"</span>,
    <span class="s2">"from"</span>: <span class="s2">"id1"</span>,
    <span class="s2">"parts"</span>: <span class="o">[{</span><span class="s2">"text"</span>: <span class="s2">"hey, the build is green"</span><span class="o">}]</span>
  <span class="o">}]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>And acknowledges receipt:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/messages/6751/ack <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Agent-Id: id2"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"agent_id":"id2"}'</span>
<span class="o">{</span><span class="s2">"acked_by"</span>:[<span class="s2">"id2"</span><span class="o">]</span>,<span class="s2">"message_id"</span>:<span class="s2">"6751"</span><span class="o">}</span>
</code></pre></div></div>

<p><strong>The polling problem.</strong> This is the biggest pain point. The MCP (Model Context Protocol) interface that coding agents use is request-response: the agent asks a question, the server answers. There is no push mechanism. When agent A sends agent B a message, agent B only finds out the next time it explicitly polls <code class="language-plaintext highlighter-rouge">pending</code>. In practice, agents need to check periodically, which means either:</p>

<ol>
  <li>Wasting tokens on poll loops (“any messages for me? no? ok”)</li>
  <li>Adding latency – a message sits undelivered until the next poll</li>
</ol>

<p>The WebSocket endpoint exists (<code class="language-plaintext highlighter-rouge">/ws</code>) but coding agents don’t speak WebSocket natively. They speak HTTP. Until MCP adds a push/subscription mechanism, polling is the only option. This is a protocol limitation, not an implementation choice.</p>
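
<p>The trade-off between the two costs is simple arithmetic. The sketch below makes it explicit; the function name and every input value are illustrative assumptions, not measurements from envoy.</p>

```rust
/// Cost of polling at a fixed interval over one session.
/// All inputs are illustrative assumptions, not measurements.
pub struct PollCost {
    pub polls: u64,
    pub tokens_spent: u64,
    pub mean_delay_secs: f64,
}

pub fn poll_cost(session_secs: u64, interval_secs: u64, tokens_per_poll: u64) -> PollCost {
    assert!(interval_secs > 0, "interval must be positive");
    let polls = session_secs / interval_secs;
    PollCost {
        polls,
        tokens_spent: polls * tokens_per_poll,
        // A message arriving at a uniformly random moment waits
        // half an interval on average before the next poll sees it.
        mean_delay_secs: interval_secs as f64 / 2.0,
    }
}
```

<p>For a one-hour session polled every 30 seconds at an assumed 200 tokens per poll, that is 120 polls, 24,000 tokens, and a mean delivery delay of 15 seconds. Halving the delay doubles the token bill.</p>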

<h3 id="knowledge-persistence">Knowledge persistence</h3>

<p>Agents store discoveries so future sessions don’t re-derive them:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl <span class="nt">-X</span> POST http://127.0.0.1:9876/atheneum/discoveries <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"X-Agent-Id: id1"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"content-type: application/json"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"agent":"claude","discovery_type":"Bug",
       "target":"query_sessions",
       "metadata":{"file":"evidence.rs","line":547,
                   "why":"anonymous ? params required"}}'</span>
<span class="o">{</span><span class="s2">"discovery_id"</span>:7502,...<span class="o">}</span>
</code></pre></div></div>

<h3 id="cross-project-code-search">Cross-project code search</h3>

<p>This one I use daily. When you work on multiple codebases simultaneously, you need to find symbols across all of them. Envoy queries all magellan-indexed projects from one endpoint without copying data:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># One-time setup per project</span>
atheneum meta-register envoy ~/Projects/envoy <span class="se">\</span>
  ~/.magellan/envoy/envoy.db <span class="nt">--language</span> rust

<span class="c"># Search across all registered projects</span>
<span class="nv">$ </span>curl <span class="s2">"http://127.0.0.1:9876/atheneum/cross/search?q=build_router&amp;language=rust&amp;k=5"</span>
<span class="o">{</span>
  <span class="s2">"count"</span>: 5,
  <span class="s2">"results"</span>: <span class="o">[</span>
    <span class="o">{</span><span class="s2">"project"</span>:<span class="s2">"envoy"</span>,<span class="s2">"name"</span>:<span class="s2">"build_router"</span>,<span class="s2">"kind"</span>:<span class="s2">"Function"</span>,
     <span class="s2">"file"</span>:<span class="s2">"src/http/router.rs"</span>,<span class="s2">"line"</span>:81<span class="o">}</span>,
    <span class="o">{</span><span class="s2">"project"</span>:<span class="s2">"envoy"</span>,<span class="s2">"name"</span>:<span class="s2">"build_router calls build_base_routes"</span>,
     <span class="s2">"kind"</span>:<span class="s2">"Call"</span>,<span class="s2">"file"</span>:<span class="s2">"src/http/router.rs"</span>,<span class="s2">"line"</span>:82<span class="o">}</span>,
    ...
  <span class="o">]</span>
<span class="o">}</span>
</code></pre></div></div>

<p>How it works: envoy delegates to atheneum’s <code class="language-plaintext highlighter-rouge">CrossRouter</code>, which lazily attaches each project’s magellan DB read-only via <code class="language-plaintext highlighter-rouge">ATTACH DATABASE</code> and queries across schemas. An LRU cache keeps hot DBs attached across requests. SQLite’s default build allows at most 10 attached databases, so the cache defaults to 8.</p>
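
<p>The eviction logic behind such a cache is small. This is a minimal sketch of the idea, not atheneum’s code: <code class="language-plaintext highlighter-rouge">AttachCache</code> and <code class="language-plaintext highlighter-rouge">touch</code> are hypothetical names, and the actual <code class="language-plaintext highlighter-rouge">ATTACH</code>/<code class="language-plaintext highlighter-rouge">DETACH</code> statements are left to the caller.</p>

```rust
use std::collections::VecDeque;

/// Tracks which project databases are attached, most recently used at the back.
pub struct AttachCache {
    capacity: usize, // assumed to be at least 1
    attached: VecDeque<String>,
}

impl AttachCache {
    pub fn new(capacity: usize) -> Self {
        AttachCache { capacity, attached: VecDeque::new() }
    }

    /// Mark `project` as used. Returns the project to DETACH, if one was evicted.
    pub fn touch(&mut self, project: &str) -> Option<String> {
        if let Some(pos) = self.attached.iter().position(|p| p == project) {
            // Already attached: move it to the most-recently-used end.
            let entry = self.attached.remove(pos).unwrap();
            self.attached.push_back(entry);
            return None;
        }
        // Not attached yet: the caller would run ATTACH DATABASE here.
        let evicted = if self.attached.len() >= self.capacity {
            self.attached.pop_front()
        } else {
            None
        };
        self.attached.push_back(project.to_string());
        evicted
    }

    pub fn is_attached(&self, project: &str) -> bool {
        self.attached.iter().any(|p| p == project)
    }
}
```

<p>With a capacity of 8, the ninth distinct project pushes out whichever one was queried longest ago, which keeps the connection under SQLite’s attach limit with some headroom.</p>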

<p>The deeper navigate endpoint (<code class="language-plaintext highlighter-rouge">/atheneum/cross/navigate</code>) that does BFS graph walks across projects currently errors on the cross-schema edge queries. That’s a known bug – the <code class="language-plaintext highlighter-rouge">UNION ALL</code> over attached schemas doesn’t find the edges table. Search works, graph navigation doesn’t yet.</p>

<h2 id="the-knowledge-graph-underneath">The knowledge graph underneath</h2>

<p>Envoy sits on top of atheneum, which stores everything as a property graph. Real numbers from my running instance:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Entity counts (4,747 total):
  ToolCall:    2,399    Session:     231    File:      203
  Reference:     338    WikiPage:    280    Import:    198
  ReasoningLog:  329    Symbol:      190    Memory:    130
  TestRun:       120    Discovery:     3    Event:       3

Edge counts (15,210 total):
  belongs_to_project: 4,184    accessed:    635
  observed_in:        3,435    modified:    393
  wikilink:           3,220    CALLS:       116
  handled_by_tool:    2,399    IMPORTS:      84
  performed_by:         233    REFERENCES:  145
</code></pre></div></div>

<p>This is what makes cross-session memory possible. When a new session starts, it queries the graph for prior context instead of re-discovering everything.</p>

<h2 id="whats-rough">What’s rough</h2>

<p>Honest assessment of what doesn’t work well:</p>

<ul>
  <li><strong>The MCP polling problem</strong> described above. No push mechanism, no subscriptions, no server-sent events. Agents waste tokens polling or accept delivery latency.</li>
  <li><strong>The <code class="language-plaintext highlighter-rouge">/atheneum/cross/navigate</code> endpoint</strong> errors on cross-schema edge queries. Symbol search works, graph walks don’t.</li>
  <li><strong>API discoverability</strong> is poor. Several endpoints have required fields that aren’t documented anywhere except the Rust source. I found <code class="language-plaintext highlighter-rouge">agent</code> is required on session creation, <code class="language-plaintext highlighter-rouge">tool_name</code> instead of <code class="language-plaintext highlighter-rouge">tool</code> on tool-calls, <code class="language-plaintext highlighter-rouge">agent_id</code> on ack – all through 422 errors.</li>
  <li><strong>The events endpoint</strong> returns an empty body on success (no confirmation JSON), which makes it hard to verify it worked.</li>
  <li><strong>Token savings counter</strong> in the knowledge endpoint always returns 0. Never got around to implementing the calculation.</li>
  <li><strong>v0.1.1</strong> – 127 commits, 11.5K LOC, but still early. No backward compatibility guarantees yet.</li>
</ul>

<h2 id="the-post-mortem-that-shaped-it">The post-mortem that shaped it</h2>

<p>During development, a private git dependency broke CI for 8 consecutive runs. The dependency was specified as a <code class="language-plaintext highlighter-rouge">git = "..."</code> URL in <code class="language-plaintext highlighter-rouge">Cargo.toml</code>. It resolved fine locally (cached) but failed on every CI runner (fresh clone, no cache, no SSH key for the private repo). The error was misleading – cargo reported “revival failed” which looked like a registry issue, not an access issue.</p>

<p>That incident directly led to three envoy features:</p>

<ol>
  <li><strong>Session accountability</strong> – if CI had logged what it actually did vs. what it claimed, the SSH key issue would have been obvious in 1 run instead of 8</li>
  <li><strong>Structured tool call logging</strong> – the difference between “cargo check failed” and “cargo check failed because SSH key was missing for git+https://…” is the difference between 1 hour and 6 hours of debugging</li>
  <li><strong>The subagent trust model</strong> – subagents are not trusted by default. Their output is only valid when all verification gates pass (magellan queries ran, cargo check green, no stubs). If a subagent’s hooks blocked it, its summary is discarded as unreliable</li>
</ol>
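
<p>The trust rule in the third item reduces to a conjunction of gates. A minimal sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">Gates</code> struct whose field names are my own labels for the checks listed above:</p>

```rust
/// Verification gates named in the trust model (field names are illustrative).
pub struct Gates {
    pub magellan_queries_ran: bool,
    pub cargo_check_green: bool,
    pub no_stubs: bool,
    pub blocked_by_hook: bool,
}

/// Return the summary only if every gate passed and no hook blocked the run.
pub fn accept_summary<'a>(gates: &Gates, summary: &'a str) -> Option<&'a str> {
    let verified = gates.magellan_queries_ran && gates.cargo_check_green && gates.no_stubs;
    if verified && !gates.blocked_by_hook {
        Some(summary)
    } else {
        // Untrusted by default: a failed or blocked run yields nothing.
        None
    }
}
```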

<h2 id="current-state">Current state</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Version:    0.1.1
LOC:        11,562 (Rust)
Tests:      5,223 lines
Commits:    127
Endpoints:  20+ (agents, sessions, messages, tool-calls, events,
              discoveries, graph, cross-project search, health,
              circuit breakers)
Runtime:    SQLite (no external services)
License:    GPL-3.0-only
</code></pre></div></div>

<p>Install:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo <span class="nb">install </span>agent-envoy
</code></pre></div></div>

<p>Or as part of the grounded-coding stack (also installs magellan, llmgrep, mirage, splice):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/oldnordic/grounded-coding/master/install.sh | sh
</code></pre></div></div>

<p>Source: <a href="https://github.com/oldnordic/envoy">github.com/oldnordic/envoy</a></p>]]></content><author><name>Luiz Spies</name></author><category term="engineering" /><summary type="html"><![CDATA[I run multiple AI coding agents in parallel. Claude Code sessions, Hermes agents, subagents spawning subagents. After a while I noticed something: none of them know the others exist. They overwrite each other’s files, repeat discoveries, and have no memory of what happened yesterday. There is no infrastructure for this. So I built one.]]></summary></entry><entry><title type="html">Training a Geometric Language Model in Pure Rust: First Results</title><link href="https://oldnordic.github.io/language-geometry/2026/06/12/geometric-lm-training.html" rel="alternate" type="text/html" title="Training a Geometric Language Model in Pure Rust: First Results" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://oldnordic.github.io/language-geometry/2026/06/12/geometric-lm-training</id><content type="html" xml:base="https://oldnordic.github.io/language-geometry/2026/06/12/geometric-lm-training.html"><![CDATA[<p>The <a href="/language-geometry/2026/06/09/geometric-graph-attention-decoder.html">geometric decoder post</a> described how a corpus-native graph can guide token decoding through Rodrigues rotation and curvature weighting. This post covers what happens when you connect that graph to a training loop and actually try to learn next-token prediction from it.</p>

<p>Everything runs on CPU. No GPU, no autograd framework — just pure Rust with manual backprop.</p>

<hr />

<h2 id="whats-being-trained">What’s being trained</h2>

<p>The architecture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input:  8 context token positions × 3D coords = 24 floats
hidden: MLP(24 → hidden_dim → vocab_size)
output: softmax over dense vocab
</code></pre></div></div>

<p>The positions come from the same PMI+SVD pipeline as the decoder experiments: co-occurrence statistics → TruncatedSVD → unit-sphere 3D coordinates per token. The MLP maps those geometric coordinates to a next-token distribution.</p>

<p>What’s not there: learned embeddings, attention, positional encoding, transformer blocks. The geometry is the representation. The MLP is the prediction head.</p>
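
<p>As a rough picture of that forward pass, here is a minimal sketch. It is not the code in <code class="language-plaintext highlighter-rouge">mlp.rs</code>: the function names, the row-major weight layout, and the ReLU activation are assumptions made for illustration.</p>

```rust
/// Dense layer: out[j] = b[j] + sum_i x[i] * w[i][j].
fn dense(x: &[f32], w: &[Vec<f32>], b: &[f32]) -> Vec<f32> {
    let mut out = b.to_vec();
    for (xi, row) in x.iter().zip(w) {
        for (o, wij) in out.iter_mut().zip(row) {
            *o += xi * wij;
        }
    }
    out
}

/// Numerically stable softmax.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// 8 context positions x 3 coords = 24 floats -> hidden (ReLU, assumed) -> vocab.
pub fn forward(
    coords: &[[f32; 3]; 8],
    w1: &[Vec<f32>],
    b1: &[f32],
    w2: &[Vec<f32>],
    b2: &[f32],
) -> Vec<f32> {
    let x: Vec<f32> = coords.iter().flatten().copied().collect();
    let hidden: Vec<f32> = dense(&x, w1, b1).into_iter().map(|h| h.max(0.0)).collect();
    softmax(&dense(&hidden, w2, b2))
}
```

<p>With all-zero weights the output is the uniform distribution over the vocabulary, which is a handy sanity check before training starts.</p>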

<p><strong>Why no framework.</strong> Burn and Candle both have poor CPU performance and are primarily CUDA infrastructure. The experiments are CPU-first and AMD GPU later. Writing the forward and backward passes directly in Rust costs a few hundred lines (<code class="language-plaintext highlighter-rouge">algorithms/mlp.rs</code>, <code class="language-plaintext highlighter-rouge">algorithms/adam.rs</code> in <a href="https://github.com/oldnordic/geographdb-core">geographdb-core</a>) and avoids pulling in a dependency chain that doesn’t fit the use case.</p>

<p>The backward pass for the Rodrigues layer:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>δW_out  = h.T @ δlogits
δh      = δlogits @ W_out.T
</code></pre></div></div>

<p>Rodrigues rotation matrices are orthogonal, so <code class="language-plaintext highlighter-rouge">R^T = R^{-1}</code>. Gradients flow back through the transport step without inverting anything:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>δh_u = Σ_{v: u∈N(v)} R_{vu}^T · δh2_v
</code></pre></div></div>

<p>Trainable parameters: the MLP weights only. The coordinate positions are fixed (frozen from the SVD), and the Rodrigues rotations have no parameters — they’re computed from 3D positions at forward-pass time.</p>
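
<p>The orthogonality claim is easy to check numerically. The sketch below builds the rotation that carries one unit vector onto another using Rodrigues’ formula and applies its transpose to go back. The function names are hypothetical and this is not the code in <code class="language-plaintext highlighter-rouge">parallel_transport.rs</code>; it also assumes the two inputs are unit vectors and not antipodal.</p>

```rust
type Mat3 = [[f64; 3]; 3];

fn cross(a: [f64; 3], b: [f64; 3]) -> [f64; 3] {
    [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
}

fn dot(a: [f64; 3], b: [f64; 3]) -> f64 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Rotation taking unit vector u to unit vector v:
/// R = I + K + K^2 / (1 + cos), where K is the cross-product matrix of u x v.
pub fn rotation_between(u: [f64; 3], v: [f64; 3]) -> Mat3 {
    let w = cross(u, v); // rotation axis scaled by sin(theta)
    let c = dot(u, v); // cos(theta); c = -1 (antipodal) is not handled
    let k: Mat3 = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]];
    let mut r = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            let k2: f64 = (0..3).map(|m| k[i][m] * k[m][j]).sum();
            let identity = if i == j { 1.0 } else { 0.0 };
            r[i][j] = identity + k[i][j] + k2 / (1.0 + c);
        }
    }
    r
}

/// Forward transport: R x.
pub fn apply(r: &Mat3, x: [f64; 3]) -> [f64; 3] {
    [dot(r[0], x), dot(r[1], x), dot(r[2], x)]
}

/// Backward transport: R^T x, which equals R^{-1} x for a rotation.
pub fn apply_transpose(r: &Mat3, x: [f64; 3]) -> [f64; 3] {
    let mut out = [0.0; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[i] += r[j][i] * x[j];
        }
    }
    out
}
```

<p>Rotating the x-axis onto the y-axis and then applying the transpose returns the x-axis, which is the property the gradient step relies on: no matrix inverse is ever computed.</p>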

<hr />

<h2 id="toy-corpus-does-the-implementation-work">Toy corpus: does the implementation work?</h2>

<p>Before touching TinyStories, the training loop was tested on a hand-built two-community graph: 8 nodes split into two spatial clusters, sequences that walk within or between communities.</p>

<p>Result: <strong>100% accuracy</strong> after 200 epochs. Loss curve is monotonically decreasing. The MLP can learn to separate the two communities from 3D coordinate context alone.</p>

<p>This isn’t impressive on its own — it’s 8 nodes — but it validates that the forward pass, backward pass, Adam update, and the gradient accumulation are all correct.</p>

<hr />

<h2 id="first-run-on-tinystories-a-training-bug">First run on TinyStories: a training bug</h2>

<p>The first TinyStories run (2,000 stories, vocab 3,547 tokens + UNK, dim 64, lr=0.001) showed a diagnostic failure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>epoch 1  loss=7.50
epoch 2  loss=7.76
</code></pre></div></div>

<p>Loss went up on epoch 2. That’s optimizer divergence, not architecture failure.</p>

<p><strong>Root cause:</strong> the training loop was calling one Adam step per training example. Per-example Adam is stochastic gradient descent with maximum noise: each of the ~100K examples in a 2,000-story epoch produces its own independent parameter update, and Adam’s moment estimates are dominated by noise when each one is computed from a single data point. With a 3,548-class output (3,547 tokens + UNK) and a 64-wide hidden layer, each update changes roughly 227K output weights based on one token’s gradient.</p>

<p><strong>Fix:</strong> accumulate gradients over batches of 128, divide by batch size, then take one Adam step. That is standard mini-batch training. I also lowered the default LR to 1e-4 and wired in the <code class="language-plaintext highlighter-rouge">clip_gradients</code> method that <code class="language-plaintext highlighter-rouge">Adam</code> already had.</p>
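
<p>The shape of that fix, as a minimal sketch: average the per-example gradients, clip the result, and take a single Adam step. This is not the code in <code class="language-plaintext highlighter-rouge">adam.rs</code>. The <code class="language-plaintext highlighter-rouge">batch_gradient</code> helper, the global-norm clipping rule, and the default betas and epsilon are assumptions.</p>

```rust
/// Minimal Adam state for a flat parameter vector (defaults are assumed).
pub struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    t: i32,
    m: Vec<f32>,
    v: Vec<f32>,
}

impl Adam {
    pub fn new(n_params: usize, lr: f32) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, t: 0,
               m: vec![0.0; n_params], v: vec![0.0; n_params] }
    }

    /// One bias-corrected Adam update.
    pub fn step(&mut self, params: &mut [f32], grad: &[f32]) {
        self.t += 1;
        let c1 = 1.0 - self.beta1.powi(self.t);
        let c2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grad[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grad[i] * grad[i];
            let m_hat = self.m[i] / c1;
            let v_hat = self.v[i] / c2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}

/// Average per-example gradients, then clip the global L2 norm to `max_norm`.
pub fn batch_gradient(per_example: &[Vec<f32>], max_norm: f32) -> Vec<f32> {
    assert!(!per_example.is_empty(), "batch must not be empty");
    let n = per_example.len() as f32;
    let mut g = vec![0.0; per_example[0].len()];
    for example in per_example {
        for (acc, gi) in g.iter_mut().zip(example) {
            *acc += gi / n;
        }
    }
    let norm = g.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for x in g.iter_mut() {
            *x *= scale;
        }
    }
    g
}
```

<p>The training loop then calls <code class="language-plaintext highlighter-rouge">batch_gradient</code> once per 128 examples and <code class="language-plaintext highlighter-rouge">step</code> once per batch, so the moment estimates see an averaged gradient instead of a single token’s.</p>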

<p>With the fix, the same run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>epoch 1  loss=6.006
epoch 2  loss=5.831
epoch 3  loss=5.741
epoch 4  loss=5.668
epoch 5  loss=5.616
</code></pre></div></div>

<p>Monotonically decreasing across all 5 epochs. No divergence. Training perplexity (exp(5.616) ≈ 275) sits close to the validation perplexity of 282.8, so there is no overfitting yet: the model hasn’t learned enough to overfit.</p>
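
<p>Loss and perplexity are two views of the same number: perplexity is the exponential of the mean cross-entropy in nats. A short sketch of the conversion, with hypothetical function names:</p>

```rust
/// Perplexity from a mean cross-entropy loss measured in nats.
pub fn loss_to_perplexity(mean_loss_nats: f64) -> f64 {
    mean_loss_nats.exp()
}

/// Perplexity from the probabilities a model assigned to each true next token.
pub fn perplexity(true_token_probs: &[f64]) -> f64 {
    let nll: f64 = true_token_probs.iter().map(|p| -p.ln()).sum();
    (nll / true_token_probs.len() as f64).exp()
}
```

<p>A model that always gives the true token probability 0.25 has perplexity 4: it is as uncertain as a uniform choice among four options.</p>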

<hr />

<h2 id="results-2k-stories-5-epochs">Results: 2k stories, 5 epochs</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Validation perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bigram (Laplace-smoothed)</td>
      <td>175.7</td>
    </tr>
    <tr>
      <td>Geometric MLP (frozen coords)</td>
      <td>282.8</td>
    </tr>
    <tr>
      <td>Geometric MLP + curvature weighting</td>
      <td>304.5</td>
    </tr>
  </tbody>
</table>

<p>Bigram wins. The geometric model is learning (a perplexity of 282.8 vs. roughly 3,548 for a uniform guess over the 3,547-token + UNK vocabulary), but not beating the baseline.</p>

<p>Three things are worth unpacking here.</p>

<p><strong>Why bigram wins.</strong> Bigram takes the exact previous token as input and directly reads co-occurrence counts. The geometric MLP takes 3D positions as input. The SVD compression maps tokens to unit-sphere coordinates based on shared neighborhood structure — tokens that co-occur with similar neighbors end up nearby. But nearby tokens aren’t identical: the 3D position is a lossy representation of the token identity. The MLP has to recover discriminative signal from compressed coordinates. Bigram has no such compression; it works directly from identity.</p>
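
<p>For reference, a Laplace-smoothed bigram fits in a few lines, which is part of why it is a hard baseline to beat at this scale. The sketch below assumes add-one smoothing and integer token ids; the <code class="language-plaintext highlighter-rouge">Bigram</code> type is a hypothetical name, not the baseline code from the experiments.</p>

```rust
use std::collections::HashMap;

/// Bigram counts keyed by token id.
pub struct Bigram {
    vocab: usize,
    pair_counts: HashMap<(u32, u32), u32>,
    prev_counts: HashMap<u32, u32>,
}

impl Bigram {
    pub fn train(tokens: &[u32], vocab: usize) -> Self {
        let mut pair_counts = HashMap::new();
        let mut prev_counts = HashMap::new();
        for w in tokens.windows(2) {
            *pair_counts.entry((w[0], w[1])).or_insert(0) += 1;
            *prev_counts.entry(w[0]).or_insert(0) += 1;
        }
        Bigram { vocab, pair_counts, prev_counts }
    }

    /// P(next | prev) with add-one (Laplace) smoothing.
    pub fn prob(&self, prev: u32, next: u32) -> f64 {
        let pair = *self.pair_counts.get(&(prev, next)).unwrap_or(&0) as f64;
        let total = *self.prev_counts.get(&prev).unwrap_or(&0) as f64;
        (pair + 1.0) / (total + self.vocab as f64)
    }
}
```

<p>The lookup is keyed on the exact previous token id, so no information about token identity is lost before the prediction is made.</p>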

<p><strong>Why the comparison is slightly asymmetric.</strong> Bigram uses 1-token context. The geometric model uses 8-token context (8 × 3D positions). The geometric model has more information in principle, but at this data scale the 3D coordinates don’t carry enough structure to exploit the longer context. With 2,000 training stories, the PMI co-occurrence matrix is sparse — many token pairs never co-occur, and the SVD positions don’t reliably separate semantically distinct tokens.</p>

<p><strong>Why curvature weighting hurts.</strong> The curvature evaluation adds a heuristic log-probability bias (angle continuity + κ penalty, both with fixed coefficients) on top of the learned MLP logits at inference time. If the MLP has already learned something useful, overlaying an untuned heuristic distorts it. The curvature signal isn’t useless — it actively helped in the decoder traversal experiments — but there it was the only signal. Adding it as a fixed-coefficient bonus over a trained model requires tuning those coefficients, not hardcoding them at 1.0.</p>

<hr />

<h2 id="20000-stories-15-epochs-the-full-result">20,000 stories, 15 epochs: the full result</h2>

<p>The 20k run added two variants not tested before: a <strong>trigram</strong> model (takes two previous token IDs, no geometry) and a <strong>hybrid</strong> model (two previous token IDs + 8 previous 3D positions). This makes the comparison direct: does geometry add anything on top of token identity?</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Validation perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bigram baseline</td>
      <td>72.97</td>
    </tr>
    <tr>
      <td>Trigram (token identity only)</td>
      <td>32.02</td>
    </tr>
    <tr>
      <td>Hybrid (token identity + geometry)</td>
      <td>43.24</td>
    </tr>
    <tr>
      <td>Hybrid + κ weighting</td>
      <td>43.81</td>
    </tr>
  </tbody>
</table>

<p><strong>Geometry does not add signal.</strong> Trigram beats hybrid by 11 perplexity points. The MLP gets a cleaner signal from two token IDs than from two token IDs plus 8 × 3D coordinates. The curvature-weighted variant is slightly worse than plain hybrid.</p>

<p>Training dynamics match the numbers. Trigram fit the training set harder and plateaued around loss 3.18. Hybrid plateaued around 3.61 and started overfitting after epoch 9 — the geometric features are hurting generalisation, not helping it.</p>

<p><strong>Why geometry doesn’t help here:</strong></p>

<p>PMI+SVD positions encode shared co-occurrence neighborhood structure. Tokens that appear in similar contexts end up nearby in 3D space. That’s useful for finding semantically related tokens, but next-token prediction doesn’t need semantically related tokens — it needs the likely <em>next</em> token given the current context. A 3D coordinate tells you what a token is <em>like</em>; it doesn’t tell you what comes after it. The token ID tells you both.</p>

<p>The 8-position geometric context should in principle carry more information than a single token ID (which is what bigram uses). In practice, the MLP can’t extract that signal from the SVD coordinates. The two-token-ID trigram dominates by a large margin over everything else.</p>

<hr />

<h2 id="geo-attention-single-head-graph-attention-over-geometric-neighbors">Geo-attention: single-head graph attention over geometric neighbors</h2>

<p>The MLP result raised a different question: maybe the architecture is the constraint, not the representation. An MLP treats all 8 context positions equally and independently. A token’s geometric neighbors might carry signal that only becomes useful when actively queried — matching what the current token is “looking for” against what its neighbors know.</p>

<p><code class="language-plaintext highlighter-rouge">GraphAttentionClassifier</code> implements this directly:</p>

<ul>
  <li>Token embedding table (learned)</li>
  <li>Learned W_q, W_k, W_v projections</li>
  <li>Each context token attends to itself + its k geometric neighbors from the PMI graph</li>
  <li>Residual update: <code class="language-plaintext highlighter-rouge">h = embedding + attention(...)</code></li>
  <li>MLP head on the last context position</li>
  <li>Full backward pass through attention weights and MLP</li>
</ul>
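
<p>The attention step in that list can be sketched for a single token as follows. This is an illustration rather than the <code class="language-plaintext highlighter-rouge">GraphAttentionClassifier</code> code: the <code class="language-plaintext highlighter-rouge">attend</code> function is a hypothetical name, the scaling by the square root of the query dimension is an assumption, and the value projection is assumed to be square so the residual sum is well-defined.</p>

```rust
/// Matrix-vector product; each row of `w` produces one output element.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// One attention step for a single token.
/// `context[0]` is the token's own embedding; the rest are its graph neighbors.
pub fn attend(
    context: &[Vec<f32>],
    wq: &[Vec<f32>],
    wk: &[Vec<f32>],
    wv: &[Vec<f32>],
) -> Vec<f32> {
    let own = &context[0];
    let q = matvec(wq, own);
    let scale = (q.len() as f32).sqrt();
    // Scaled dot-product score of the query against every key.
    let scores: Vec<f32> = context
        .iter()
        .map(|e| matvec(wk, e).iter().zip(&q).map(|(k, qi)| k * qi).sum::<f32>() / scale)
        .collect();
    // Softmax over self + neighbors.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    // Residual update: h = embedding + sum_j weight_j * (Wv e_j).
    let mut h = own.clone();
    for (e, w) in context.iter().zip(&exps) {
        for (hi, vi) in h.iter_mut().zip(matvec(wv, e)) {
            *hi += (w / total) * vi;
        }
    }
    h
}
```

<p>With zero query and key projections every neighbor gets equal weight, so the update degenerates to adding the mean projected value. Training the projections is what lets the model favour some neighbors over others.</p>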

<p>The same 20k/15ep setup, four variants in parallel:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Validation perplexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>One-hot trigram (baseline)</td>
      <td>32.02</td>
    </tr>
    <tr>
      <td>Geo-attention + 4 neighbors</td>
      <td>55.54</td>
    </tr>
    <tr>
      <td>Geometric rotated + 4 neighbors</td>
      <td>126.85</td>
    </tr>
    <tr>
      <td>Geometric absolute</td>
      <td>145.08</td>
    </tr>
    <tr>
      <td>Geometric rotated (no neighbors)</td>
      <td>272.98</td>
    </tr>
  </tbody>
</table>

<p><strong>Attention over geometry is much better than MLP over geometry.</strong> Geo-attention (55.54 ppl) reaches roughly 2.3x lower perplexity than the best MLP-on-geometry variant (126.85 ppl). The query/key/value mechanism gives the model a “search and correlate” capability the flat MLP doesn’t have: it can weight neighbors selectively based on what the current token embedding is asking for.</p>

<p><strong>Geometry still loses to token identity.</strong> Even with attention, geo-attention is 23 ppl behind one-hot trigram. Rotation alone (no neighbors) was near-useless (273 ppl); adding 4 neighbors rescued it to 127 ppl. Local geometric neighborhoods carry some signal — but only when actively queried, and not enough to close the gap with trigram.</p>

<p><strong>Why the gap persists.</strong> PMI+SVD positions cluster tokens by shared co-occurrence context — tokens that appear in similar environments end up nearby in 3D space. That’s a <em>semantic</em> similarity measure. Next-token prediction needs <em>successor</em> structure: which token tends to follow this one. These are different things. “Dog” and “cat” are geometric neighbors (similar contexts); neither predicts the other as a next token. The trigram baseline reads co-occurrence directly as successor frequency. The PMI graph doesn’t preserve that direction.</p>
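
<p>The loss of direction is visible in a toy example. The sketch below counts adjacent pairs both ways and computes a PMI from symmetric counts; it assumes a window of one token and simple maximum-likelihood estimates, which may differ from the pipeline’s actual window size and weighting. The function names are hypothetical.</p>

```rust
/// How often `b` immediately follows `a` (directional).
pub fn successor_count(tokens: &[u32], a: u32, b: u32) -> usize {
    tokens.windows(2).filter(|w| w[0] == a && w[1] == b).count()
}

/// PMI from symmetric adjacency counts: the order of `a` and `b` is discarded.
/// Returns negative infinity when the pair never co-occurs.
pub fn pmi(tokens: &[u32], a: u32, b: u32) -> f64 {
    assert!(tokens.len() >= 2, "need at least one adjacent pair");
    let n = tokens.len() as f64;
    let pairs = (tokens.len() - 1) as f64;
    let count = |t: u32| tokens.iter().filter(|&&x| x == t).count() as f64;
    // Symmetric co-occurrence: "a then b" and "b then a" are pooled.
    let joint = (successor_count(tokens, a, b) + successor_count(tokens, b, a)) as f64 / pairs;
    (joint / ((count(a) / n) * (count(b) / n))).ln()
}
```

<p>On the sequence <code class="language-plaintext highlighter-rouge">0 1 0 1 0 1</code>, token 1 follows token 0 three times and token 0 follows token 1 twice, yet the PMI of the pair is identical in both directions. A graph built from that PMI cannot tell which token comes next.</p>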

<hr />

<h2 id="where-this-leaves-things">Where this leaves things</h2>

<p>The full experiment arc so far, at 20k stories / 15 epochs:</p>

<table>
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Representation</th>
      <th>Validation ppl</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MLP</td>
      <td>One-hot trigram</td>
      <td><strong>32.02</strong></td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Hybrid (token ID + geometry)</td>
      <td>43.24</td>
    </tr>
    <tr>
      <td>Attention</td>
      <td>Graph neighbors</td>
      <td>55.54</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric rotated + neighbors</td>
      <td>126.85</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric absolute</td>
      <td>145.08</td>
    </tr>
    <tr>
      <td>MLP</td>
      <td>Geometric rotated</td>
      <td>272.98</td>
    </tr>
  </tbody>
</table>

<p>The bottleneck is the PMI+SVD graph construction, not the model. To beat trigram with geometry, the geometric space itself needs to encode successor structure — either learned end-to-end, or derived from a graph that preserves directional co-occurrence rather than symmetric neighborhood similarity. That’s the next question.</p>

<hr />

<h2 id="reproduce">Reproduce</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/oldnordic/geographdb-core
git clone https://github.com/oldnordic/geographdb-experiments

<span class="nb">cd </span>geographdb-experiments
cargo run <span class="nt">--release</span> <span class="nt">--bin</span> train_geometric <span class="nt">--</span> <span class="se">\</span>
  <span class="nt">--dataset</span> roneneldan/TinyStories <span class="se">\</span>
  <span class="nt">--vocab-size</span> 4096 <span class="se">\</span>
  <span class="nt">--dim</span> 64 <span class="se">\</span>
  <span class="nt">--epochs</span> 5 <span class="se">\</span>
  <span class="nt">--lr</span> 1e-4 <span class="se">\</span>
  <span class="nt">--max-train-stories</span> 2000 <span class="se">\</span>
  <span class="nt">--max-val-stories</span> 1000
</code></pre></div></div>

<p>Hardware: AMD Ryzen 7 7800X3D, 64 GB RAM, no GPU used. Training 2k stories for 5 epochs takes roughly 8 minutes on this machine.</p>

<p>The tokenizer is cached to <code class="language-plaintext highlighter-rouge">--output</code> (default <code class="language-plaintext highlighter-rouge">/tmp/train_geometric_tinystories</code>) after the first run.</p>

<h2 id="code">Code</h2>

<ul>
  <li>MLP ops + backward: <code class="language-plaintext highlighter-rouge">geographdb-core/src/algorithms/mlp.rs</code></li>
  <li>Adam optimizer: <code class="language-plaintext highlighter-rouge">geographdb-core/src/algorithms/adam.rs</code></li>
  <li>Training binary: <code class="language-plaintext highlighter-rouge">geographdb-experiments/src/bin/train_geometric.rs</code></li>
  <li>Rodrigues rotation: <code class="language-plaintext highlighter-rouge">geographdb-core/src/algorithms/parallel_transport.rs</code></li>
</ul>]]></content><author><name>Luiz Spies</name></author><category term="language-geometry" /><summary type="html"><![CDATA[The geometric decoder post described how a corpus-native graph can guide token decoding through Rodrigues rotation and curvature weighting. This post covers what happens when you connect that graph to a training loop and actually try to learn next-token prediction from it.]]></summary></entry></feed>