Envoy v0.2.0: Observability, Lock-Free Paths, and Bug Fixes

Three weeks after the initial release, envoy v0.2.0 is out. This isn’t a feature dump – it’s the result of running the server continuously and fixing the things that actually hurt. Three bugs from the original article got fixed, Prometheus metrics landed, and a performance improvement from another project turned out to transfer cleanly.

What changed

The diff is 1,551 insertions, 701 deletions across 28 files. The big items:

parking_lot everywhere

The original article mentioned envoy uses SQLite for persistence. What I didn’t mention is that the in-memory state (agent registry, circuit breaker, message store) was protected by std::sync::Mutex. Every lock site had poison recovery code – .lock().unwrap_or_else(|e| e.into_inner()) or .lock().map_err(|e| EnvoyError::LockPoisoned(...)). In practice, a poisoned mutex means a panic already happened and the data might be corrupt. “Recovering” by ignoring the poison doesn’t help.
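To make the poisoning point concrete, here is a minimal sketch using only the standard library. `Registry`, `poison`, and `agent_count_ignoring_poison` are hypothetical names for illustration, not envoy's actual types or helpers. It shows that the quoted `unwrap_or_else(|e| e.into_inner())` pattern hands back whatever state the panicking thread left behind.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for envoy's in-memory state (the name is an assumption).
#[derive(Debug, Default)]
pub struct Registry {
    pub agents: Vec<String>,
}

/// Poisons the mutex by panicking in another thread while the guard is held.
/// The push happens before the panic, so the registry is left half-updated.
pub fn poison(state: &Arc<Mutex<Registry>>) {
    let shared = Arc::clone(state);
    let result = thread::spawn(move || {
        let mut guard = shared.lock().unwrap();
        guard.agents.push("half-registered".to_string());
        panic!("simulated failure while holding the lock");
    })
    .join();
    // join() returns Err because the worker thread panicked.
    assert!(result.is_err());
}

/// The recovery pattern quoted above: ignore the poison flag and take the guard anyway.
pub fn agent_count_ignoring_poison(state: &Mutex<Registry>) -> usize {
    let guard = state.lock().unwrap_or_else(|e| e.into_inner());
    guard.agents.len()
}
```

After `poison(&state)` runs, `state.is_poisoned()` is true and a plain `lock()` returns `Err`, yet the recovery helper still reads the registry and counts the half-registered entry. Nothing about the data was actually repaired.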

I was already migrating rs3gw (a separate project) to parking_lot::Mutex and noticed the pattern transferred directly. parking_lot mutexes don’t use poisoning – lock() returns a MutexGuard<T> directly, no Result. The changes:

Removed LockPoisoned error variant entirely
Removed recover_lock() helper function
Removed FastMutex type alias
Simplified 25+ .lock() call sites across 7 files
Each mutex went from ~40 bytes to ~1 byte

No behavioral change for callers – the server responds to the same endpoints the same way. But the code is cleaner and each lock carries less memory overhead.

Prometheus `/metrics` endpoint

The original envoy had two monitoring endpoints: /health (returns {"status":"ok","uptime_seconds":N}) and /stats (returns aggregate counters). Useful for ad-hoc checks, useless for dashboards or alerting.

v0.2.0 adds GET /metrics in Prometheus exposition format:

# HELP envoy_requests_total Total HTTP requests, labeled by operation and status class
# TYPE envoy_requests_total counter
envoy_requests_total{method="GET",path="/health",status="2xx"} 14

# HELP envoy_agents_online Number of currently active agents
# TYPE envoy_agents_online gauge
envoy_agents_online 3

# HELP envoy_request_duration_ms Request latency in milliseconds
# TYPE envoy_request_duration_ms histogram
envoy_request_duration_ms_bucket{path="/health",le="0.5"} 14
envoy_request_duration_ms_sum{path="/health"} 0.821
envoy_request_duration_ms_count{path="/health"} 14

Three metric families:

| Metric | Type | What it measures |
| --- | --- | --- |
| `envoy_requests_total` | counter | Request count by method, path, status class |
| `envoy_request_duration_ms` | histogram | Latency distribution with 9 buckets (0.5ms to 5s) |
| `envoy_agents_online` | gauge | Active agent count, updated on register/retire |
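For readers unfamiliar with how a Prometheus histogram turns into `_bucket`, `_sum`, and `_count` lines, here is a rough std-only sketch. It is not envoy's code, which relies on the metrics crates. Only the endpoints of the bucket range (0.5ms and 5s) come from the table above. The seven interior bounds and the `Histogram` type itself are assumptions made for illustration.

```rust
/// Nine upper bounds in milliseconds. Only the endpoints (0.5 ms and 5 s) come
/// from the release notes; the interior values are assumptions.
pub const BOUNDS_MS: [f64; 9] = [0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0];

/// Hypothetical cumulative histogram for a single path label.
#[derive(Debug, Default)]
pub struct Histogram {
    /// Cumulative counts: buckets[i] counts observations <= BOUNDS_MS[i].
    pub buckets: [u64; 9],
    pub sum: f64,
    pub count: u64,
}

impl Histogram {
    /// Records one latency sample, incrementing every bucket whose bound covers it.
    pub fn observe(&mut self, ms: f64) {
        for (slot, bound) in self.buckets.iter_mut().zip(BOUNDS_MS.iter()) {
            if ms <= *bound {
                *slot += 1;
            }
        }
        self.sum += ms;
        self.count += 1;
    }

    /// Renders the sample lines in exposition format for one path label.
    pub fn render(&self, name: &str, path: &str) -> String {
        let mut out = String::new();
        for (n, bound) in self.buckets.iter().zip(BOUNDS_MS.iter()) {
            out.push_str(&format!(
                "{}_bucket{{path=\"{}\",le=\"{}\"}} {}\n",
                name, path, bound, n
            ));
        }
        // The +Inf bucket always equals the total observation count.
        out.push_str(&format!(
            "{}_bucket{{path=\"{}\",le=\"+Inf\"}} {}\n",
            name, path, self.count
        ));
        out.push_str(&format!("{}_sum{{path=\"{}\"}} {}\n", name, path, self.sum));
        out.push_str(&format!("{}_count{{path=\"{}\"}} {}\n", name, path, self.count));
        out
    }
}
```

Buckets are cumulative, so a 0.25ms request increments all nine of them, while a 7s request lands only in `+Inf`. That is why the `le="0.5"` bucket in the sample output can equal the total count when every request is fast.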

Path normalization. Raw URL paths cause cardinality explosions in Prometheus. A path like /agents/id1/messages/42/ack becomes a unique label, and with thousands of agents and messages you get thousands of time series. The middleware normalizes path segments that look like IDs – numeric (42), named (id1121), or UUID (338b8adc-...) – into a single :id token. So /agents/id1/messages/42/ack becomes /agents/:id/messages/:id/ack. Same metric regardless of which agent or message.
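The normalization rule can be sketched in a few lines of std-only Rust. This is an approximation under stated assumptions, not the middleware's actual code: the function names are hypothetical, and I am assuming a "named" ID means the literal prefix `id` followed by digits and that UUIDs use the canonical 8-4-4-4-12 hex layout.

```rust
/// Returns true for a segment made only of ASCII digits, e.g. "42".
fn is_numeric(seg: &str) -> bool {
    !seg.is_empty() && seg.bytes().all(|b| b.is_ascii_digit())
}

/// Returns true for "id" followed by digits, e.g. "id1121".
/// Assumption: this is what "named" IDs look like; the real rule may differ.
fn is_named_id(seg: &str) -> bool {
    seg.strip_prefix("id").map_or(false, is_numeric)
}

/// Returns true for the canonical 8-4-4-4-12 hex UUID layout.
fn is_uuid(seg: &str) -> bool {
    let groups: Vec<&str> = seg.split('-').collect();
    groups.len() == 5
        && groups
            .iter()
            .zip([8usize, 4, 4, 4, 12].iter())
            .all(|(g, len)| g.len() == *len && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Replaces every ID-like path segment with the single token ":id".
pub fn normalize_path(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if is_numeric(seg) || is_named_id(seg) || is_uuid(seg) {
                ":id"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

With these rules, `normalize_path("/agents/id1/messages/42/ack")` yields `/agents/:id/messages/:id/ack`, while a path with no ID-like segments, such as `/health`, passes through unchanged.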

The approach is borrowed from rs3gw, which uses the same metrics + metrics-exporter-prometheus crate combination. The middleware wraps every request, records start time, normalizes the path, increments the counter, and observes the duration.
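The per-request flow looks roughly like the sketch below. To keep it dependency-free it swaps the `metrics` crate for a hypothetical in-process `Recorder`, and models the handler as a closure that returns a status code. None of these names come from envoy. The path is assumed to be normalized before it reaches `track`.

```rust
use std::collections::BTreeMap;
use std::time::Instant;

/// Maps an HTTP status code to its class label, e.g. 201 -> "2xx".
pub fn status_class(code: u16) -> &'static str {
    match code / 100 {
        1 => "1xx",
        2 => "2xx",
        3 => "3xx",
        4 => "4xx",
        5 => "5xx",
        _ => "other",
    }
}

/// Hypothetical in-process recorder standing in for the `metrics` crate.
#[derive(Default)]
pub struct Recorder {
    /// (method, normalized path, status class) -> request count
    pub requests: BTreeMap<(String, String, &'static str), u64>,
    /// normalized path -> observed latencies in milliseconds
    pub durations_ms: BTreeMap<String, Vec<f64>>,
}

impl Recorder {
    /// Wraps one request: start the clock, run the handler, then record
    /// the counter increment and the latency observation.
    pub fn track<F>(&mut self, method: &str, path: &str, handler: F) -> u16
    where
        F: FnOnce() -> u16,
    {
        let start = Instant::now();
        let status = handler();
        let elapsed_ms = start.elapsed().as_secs_f64() * 1000.0;

        let key = (method.to_string(), path.to_string(), status_class(status));
        *self.requests.entry(key).or_insert(0) += 1;
        self.durations_ms
            .entry(path.to_string())
            .or_default()
            .push(elapsed_ms);
        status
    }
}
```

Labeling by status class rather than exact status code keeps the label set small for the same reason path normalization does: a 200 and a 204 on `/health` share one `2xx` series.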

Prometheus scrape config:

scrape_configs:
  - job_name: 'envoy'
    static_configs:
      - targets: ['127.0.0.1:9876']
    scrape_interval: 15s
    metrics_path: /metrics

Request tracing

Every HTTP response now includes an x-request-id header with a unique UUID. This is done through tower-http layers – SetRequestIdLayer generates the UUID, PropagateRequestIdLayer ensures it appears on responses, and TraceLayer logs request/response pairs when RUST_LOG=tower_http=debug is set.

This is useful when debugging: if an agent reports a failed request, the request ID lets you find the exact log entry.

Bug fixes

Three issues from the original article’s “What’s rough” section got fixed:

`cross/navigate` no longer errors

The cross-project graph navigation endpoint (GET /atheneum/cross/navigate?q=build_router&language=rust&depth=2) was broken because the BFS edge query referenced a kind column, but production magellan databases use edge_type. The fix was a SQL alias: SELECT id, edge_type AS kind, .... Now the alias works with both the production schema and the test fixtures (which use kind). Symbol search (/atheneum/cross/search) was unaffected – it queries graph_entities which does use kind.

Evidence endpoints return JSON

Eight POST handlers – post_prompt, post_tool_call, post_file_write, post_commit, post_test_run, post_fix_chain, post_bench_run, post_subagent_handover – returned bare 201 Created with no response body. Now all return {"recorded": true} so callers can confirm success without relying on HTTP status alone. This was one of the API discoverability complaints from the original article.

API documentation rewritten

The original API.md was incomplete and sometimes wrong – several required fields weren’t documented. v0.2.0 rewrites it from the Rust struct definitions. Every endpoint now has correct request fields, required/optional markers, and response shapes. Verified against the actual handler code, not from memory.

What’s still rough

Honest update on what hasn’t improved:

The MCP polling problem is unchanged. Agents still poll for messages. The WebSocket endpoint exists but coding agents don’t speak WebSocket. This requires a protocol-level change in MCP, not an envoy fix.
Token savings counter still returns 0. Low priority – it’s a nice-to-have metric, not a correctness issue.
No Grafana dashboards yet. The Prometheus metrics are there, but I haven’t built the dashboard JSON. On the list.
Single-node only. Envoy uses a single SQLite database. No clustering, no replication, no multi-node coordination. If you need that, you need a different tool.

Numbers

Version:    0.2.0
LOC:        11,800 (Rust)
Tests:      65 (was 57)
Endpoints:  21+ (added /metrics)
Runtime:    SQLite (no external services)

The test count went from 57 to 65 – the 8 new tests cover the metrics module (path normalization, ID detection, UUID handling, histogram recording).

Install

cargo install agent-envoy

Source: github.com/oldnordic/envoy