A Geometric Decoder for Language Graphs
This post is the decoder half of a larger experiment: can a language model decode by moving through a graph instead of comparing query/key vectors?
The standard transformer decoder scores every candidate token with softmax(QK^T). The geometric alternative scores candidates by how well an outgoing edge continues the local curvature of the graph. No learned attention weights, no softmax over the full vocabulary, no feed-forward on hidden states. For this experiment, the graph acts as the decoding substrate.
Mechanism
Three pieces are wired together in geographdb-core:
D1 — Rodrigues rotation. The walker carries a velocity vector along the incoming edge. To score a candidate outgoing edge, it rotates that velocity onto the candidate direction using Rodrigues’ formula. The rotation angle is the cost: a straight continuation scores high, a sharp turn scores low.
D2 — Graph attention + UCB. For each candidate neighbor, the attention score is the cosine similarity of the transported neighbor feature against the current node’s feature. W_Q and W_K are identity for this first test, so the geometry is doing all the work. An upper-confidence-bound term adds exploration pressure based on how often each node has been visited.
E6 — Ricci curvature κ. The UCB bonus is weighted by 1 - κ, where κ comes from a fast Dice neighbor-overlap approximation. Negative κ means the edge is a bridge between communities; positive κ means it sits inside a tight cluster. The walker therefore explores more aggressively across bridges and less inside echo chambers.
Metrics
Two simple counts:
unique-node coverage rate = unique_nodes / steps. Higher means the walker visits more distinct nodes; lower means it loops.cross-communitytransitions: how often the walker moves from one side of the graph to the other.
Interpreting the two together matters. A walker can achieve high coverage by bouncing around inside one community without ever leaving. The geometric walker deliberately trades some local coverage for cross-community coherence.
Toy graph
A 6-node graph with two clusters:
| Mode | Unique nodes | Coverage rate | Cross-community |
|---|---|---|---|
| Distance-KNN | 2 | 0.065 | 0 |
| Graph attention + UCB | 6 | 0.194 | 3 |
| Graph attention + UCB + κ | 6 | 0.194 | 7 |
Distance-KNN gets stuck between the two nearest nodes and never crosses communities. The geometric walker visits every node, and κ doubles the number of community crossings.
Corpus-scale: full TinyStories
The same comparison on the full TinyStories training set:
- 2,119,719 texts
- 460,035,016 tokens
- 43,047 vocabulary tokens → 54,919 sense-nodes
- 859,917 PMI edges
| Mode | Unique nodes | Coverage rate | Cross-community |
|---|---|---|---|
| Distance-KNN | 397 | 0.792 | 62 |
| Graph attention + UCB | 284 | 0.567 | 186 |
| Graph attention + UCB + κ | 304 | 0.607 | 247 |
Distance-KNN has the widest surface coverage (79% new nodes) but shallow semantic reach: only 62 cross-community moves. The geometric walker visits fewer unique nodes overall but makes 3× more cross-community transitions. Adding κ pushes cross-community movement another +33%.
Ablation readout
import numpy as np
knn = np.array([397, 62])
attn = np.array([284, 186])
ricci = np.array([304, 247])
print("cross-community gain over Distance-KNN:")
print(f" attention: {attn[1] / knn[1] - 1:.2%}")
print(f" attention+κ: {ricci[1] / knn[1] - 1:.2%}")
print(f" κ-only lift: {ricci[1] / attn[1] - 1:.2%}")
Output:
cross-community gain over Distance-KNN:
attention: +200.00%
attention+κ: +298.39%
κ-only lift: +32.80%
The gain from κ is additive, not redundant: it specifically increases the fraction of moves that cross semantic boundaries.
What this does and does not show
This shows that a geometric decoder can traverse a corpus-native graph more coherently than a pure distance heuristic, and that curvature is a useful signal for where to explore. It does not show that this decoder beats a trained transformer on perplexity or downstream tasks. The comparison is against a strong-but-myopic baseline, not against a full language model.
Reproduce
Build the graph:
cargo run --release --bin corpus_native_graph -- \
--dataset text=roneneldan/TinyStories:data/train- \
--vocab-size 50000 \
--output /tmp/tinystories_full_graph
Run the three-way comparison:
cargo run --release --bin d3_attention_comparison -- \
/tmp/tinystories_full_graph 500 0.0
The 0.0 is the edge-weight threshold for the fast κ computation; TinyStories edges are PMI-weighted, so zero keeps all of them.
Code
- D1/D2:
geographdb-core/src/algorithms/parallel_transport.rs,geographdb-core/src/corpus/inference.rs - D3:
geographdb-experiments/src/bin/d3_attention_comparison.rs - E6:
geographdb-core/src/algorithms/ricci.rs(curvature_map_fast)
Encoder note
The decoder experiments above assume the node positions are already meaningful. A first encoder test (E2) trains a small MLP, CoordinateBranch, to predict Memoria’s layer-23 PCA activation coordinates directly from token IDs.
Those PCA coordinates are unnormalized:
| Stat | Value |
|---|---|
| Per-axis std | [9.55, 7.62, 6.20] |
| Norm std (‖σ‖) | 13.70 |
| Mean norm (‖x‖) | 12.06 |
After 200 epochs the model reaches mean L2 error = 0.519. In context:
0.519 / 13.70 ≈ 3.8%of the coordinate spread- 92.6% of nodes are within L2 error < 1.0
- 99.99% are within L2 error < 2.0
That says PMI co-occurrence geometry captures the rough layout of the activation manifold, but not the fine-grained positions. It is a meaningful correlation, not a pixel-perfect reconstruction.
Next
Training results — connecting the geometric graph to a learning loop — are in Geometric Language Model: First Training Results. That post covers the pure-Rust MLP+Adam training infrastructure, the mini-batch fix that stopped early divergence, and the first perplexity comparison against a bigram baseline on TinyStories.