17 — Observability¶
Cites: research/reviews/02_mlops_engineer.md (#10, recommendations on run_id correlation + alexandria status --json), 13_hostile_verifier.md (the runs table is the observability backbone).
Principles
- Local, structured, correlatable. Every event lands as a line in a local JSONL file with a run_id field. No cloud service, no telemetry, no Prometheus scrape unless the user explicitly asks for it.
- Actionable. Every observable signal must answer "what do I do next?". Raw stats without guidance are noise.
- Read-only by design. Observability tools never write to wiki/. They read the runs table, read the log files, format, and return.
- One run_id spans the whole causality chain. A user invokes ingest; the parent assigns a run_id; the scheduler logs it; the adapter logs it; the LLM provider logs it; the verifier logs it; the git commit records it; wiki/log.md references it. One ID, one story.
The log families
All under ~/.alexandria/logs/, rotated daily:
| File | Source | Contents |
|---|---|---|
| daemon-YYYY-MM-DD.jsonl | Parent process supervisor | Lifecycle events, child starts/kills, schema migration results, FTS verification, orphaned-run sweeps |
| sync-YYYY-MM-DD.jsonl | Source and subscription adapters | Adapter runs: start, items fetched, rate-limit stalls, circuit-breaker state changes, errors |
| mcp-YYYY-MM-DD.jsonl | MCP server (stdio and HTTP) | Tool calls: tool name, workspace, args hash (no values), latency, result size, errors |
| llm-usage-YYYY-MM-DD.jsonl | LLM provider wrappers | Completions: preset, model, tokens in/out, cache read/write, USD, latency, stop reason |
| verifier-YYYY-MM-DD.jsonl | Hostile verifier runs | Verdict, per-claim findings, findings by category, tokens used, time to verdict |
| eval-YYYY-MM-DD.jsonl | Evaluation scaffold | Metric runs: M1-M5 scores, sample sizes, trend signals, threshold breaches |
| hook-YYYY-MM-DD.jsonl | Conversation-capture hooks | Hook invocations, mined session IDs, skipped sessions, errors |
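The naming scheme in the table can be sketched as a small path helper. This is a minimal illustration, not the emitter's actual code: `LOG_FAMILIES`, `log_path`, and the `root` parameter are hypothetical names. The caller supplies the date, since the doc says only "rotated daily" and does not specify the timezone used for rotation.

```python
from datetime import date
from pathlib import Path
from typing import Optional

# The seven log families listed in the table (names taken from the file prefixes).
LOG_FAMILIES = ("daemon", "sync", "mcp", "llm-usage", "verifier", "eval", "hook")


def log_path(layer: str, day: date, root: Optional[Path] = None) -> Path:
    """Return the daily JSONL file for one log family (hypothetical helper)."""
    if layer not in LOG_FAMILIES:
        raise ValueError(f"unknown log family: {layer!r}")
    # Default location from the doc: ~/.alexandria/logs/
    base = root if root is not None else Path.home() / ".alexandria" / "logs"
    return base / f"{layer}-{day.isoformat()}.jsonl"
```

Under this sketch, adding a log family means adding one entry to the tuple.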
Common line shape
{
"ts": "2026-04-16T14:23:45.012Z",
"run_id": "2026-04-16-abc123",
"workspace": "research",
"layer": "llm-usage",
"event": "completion",
"level": "info",
"correlation": {
"parent_run": null,
"trace_id": "2026-04-16-abc123"
},
"data": {
"preset": "claude-sonnet",
"model": "claude-sonnet-4-6",
"tokens_in": 3421,
"tokens_out": 812,
"cache_read": 3200,
"cache_write": 0,
"usd": 0.018,
"latency_ms": 4271,
"stop_reason": "end_turn"
}
}
Required fields on every line:
- ts — ISO 8601 with millisecond precision.
- run_id — from the runs table in 13_hostile_verifier.md, OR the daemon's per-boot session id if no run exists.
- workspace — slug, or __global__ for daemon-level events.
- layer — which log family this line belongs to (denormalized for grep).
- event — short identifier.
- level — debug|info|warn|error|crash.
- data — arbitrary per-event payload.
Optional:
- correlation.parent_run — when a child run was spawned by another run (e.g., verifier by ingest).
- correlation.trace_id — usually equal to run_id but can span multiple runs for an end-to-end user action.
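A sketch of how an emitter might build one line with the required fields. The function name `make_line` and its parameters are assumptions made for illustration; only the field names and the level values come from the list above. The sketch always writes a `correlation` object and defaults `trace_id` to `run_id`, which matches the "usually equal to run_id" note but is otherwise a choice of this example.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional

LEVELS = ("debug", "info", "warn", "error", "crash")
REQUIRED = ("ts", "run_id", "workspace", "layer", "event", "level", "data")


def make_line(run_id: str, workspace: str, layer: str, event: str, level: str,
              data: dict[str, Any], parent_run: Optional[str] = None,
              trace_id: Optional[str] = None,
              now: Optional[datetime] = None) -> str:
    """Serialize one log event as a single JSONL line (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # ISO 8601 in UTC with millisecond precision and a trailing "Z".
    ts = moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    record = {
        "ts": ts,
        "run_id": run_id,
        "workspace": workspace,
        "layer": layer,
        "event": event,
        "level": level,
        "correlation": {"parent_run": parent_run, "trace_id": trace_id or run_id},
        "data": data,
    }
    # Compact separators keep each event on exactly one line.
    return json.dumps(record, separators=(",", ":"))
```

Because every timestamp is UTC with the same fixed-width format, lines from different files can be ordered by comparing the `ts` strings directly.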
run_id correlation: the story
When a user runs alexandria ingest paper.pdf:
- The CLI generates run_id = 2026-04-16-<uuid7> and passes it to the daemon via the IPC socket.
- The daemon records a runs row (status='pending', run_type='ingest').
- The daemon writes to daemon-*.jsonl: {run_id, event='run_started', data={...}}.
- The daemon spawns the guardian loop with the run_id bound in context.
- The guardian reads the source (logs to mcp-*.jsonl), calls the LLM (logs to llm-usage-*.jsonl), stages writes to runs/<run_id>/staged/, and passes the run to the verifier.
- The verifier is a sub-run with parent_run = <run_id> and its own run_id. It logs to verifier-*.jsonl.
- On commit, the daemon logs to daemon-*.jsonl: {run_id, event='run_committed'}.
- The git commit carries run_id in its message footer: alexandria: run 2026-04-16-abc123.
- wiki/log.md records ## [2026-04-16] ingest | paper.pdf (run 2026-04-16-abc123).
- wiki_log_entries records the same run_id.
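One way to keep the run_id "bound in context" without passing it through every call is Python's `contextvars`. The sketch below is an assumption about the mechanism, not a description of the daemon's code: `new_run_id`, `current_run_id`, and `run_with_id` are hypothetical names, and `uuid4` stands in for the uuid7 suffix named above because a uuid7 generator is not available in every Python standard library version.

```python
import contextvars
import uuid
from datetime import date
from typing import Any, Callable, Optional

# Holds the run_id for whatever code is executing inside a bound context.
_run_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "run_id", default=None
)


def new_run_id(day: date) -> str:
    """Build '<date>-<unique suffix>'; uuid4 is a stand-in for uuid7."""
    return f"{day.isoformat()}-{uuid.uuid4().hex}"


def current_run_id() -> Optional[str]:
    """What a log emitter would call to stamp its lines."""
    return _run_id.get()


def run_with_id(run_id: str, fn: Callable[..., Any], *args: Any) -> Any:
    """Run fn with run_id bound, leaving the caller's context untouched."""
    ctx = contextvars.copy_context()

    def _bound() -> Any:
        _run_id.set(run_id)
        return fn(*args)

    return ctx.run(_bound)
```

Any emitter called underneath `run_with_id` can read the ID through `current_run_id()`, and the binding disappears once the call returns.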
alexandria logs show <run_id> merges all log-family entries with matching run_id OR matching correlation.parent_run into a single timestamp-sorted stream:
$ alexandria logs show 2026-04-16-abc123
2026-04-16T14:20:00.000Z daemon run_started research ingest paper.pdf
2026-04-16T14:20:00.142Z mcp tool_call research read(path=raw/papers/paper.pdf)
2026-04-16T14:20:00.891Z llm-usage completion research claude-opus 14k→2.8k tokens $0.12
2026-04-16T14:20:05.234Z mcp tool_call research write(create,...)
2026-04-16T14:20:08.501Z daemon run_staged research 7 files staged
2026-04-16T14:20:08.601Z verifier verifier_started research [child run 2026-04-16-abc123/v1]
2026-04-16T14:20:12.412Z llm-usage completion research claude-sonnet 18k→1.2k tokens $0.02
2026-04-16T14:20:12.501Z verifier verdict research commit (per-claim findings: 12 supports, 0 weak)
2026-04-16T14:20:12.800Z daemon run_committed research git: c0ffee1
2026-04-16T14:20:12.850Z daemon wiki_log_appended research log.md updated
This is the post-mortem tool. A user asks "what happened in last night's scheduled run?" and a single command tells the whole story in order.
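The merge rule described above (same run_id, or correlation.parent_run pointing at it, sorted by timestamp) can be sketched in a few lines. `iter_lines` and `merge_run` are hypothetical names; this is an illustration of the rule, not the implementation behind alexandria logs show. It reads every file in the directory, so a real tool would presumably narrow the scan by date.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_lines(log_dir: Path) -> Iterator[dict]:
    """Yield every parsed JSONL record from every log family file."""
    for path in sorted(log_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for raw in fh:
                raw = raw.strip()
                if raw:
                    yield json.loads(raw)


def merge_run(log_dir: Path, run_id: str) -> list[dict]:
    """Return all records for run_id and its child runs, oldest first."""

    def matches(rec: dict) -> bool:
        corr = rec.get("correlation") or {}
        return rec.get("run_id") == run_id or corr.get("parent_run") == run_id

    # ts is fixed-width ISO 8601 in UTC, so string order equals time order.
    return sorted((r for r in iter_lines(log_dir) if matches(r)),
                  key=lambda r: r["ts"])
```

Child runs are picked up through `parent_run`, which is why the verifier's lines appear in the stream even though the verifier has its own run_id.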
alexandria status --json
The single-call operational dashboard. Returns a JSON blob covering:
{
"daemon": {
"pid": 12834,
"uptime_sec": 482391,
"state": "running",
"children": {
"scheduler": {"pid": 12835, "state": "running", "last_beat": "2026-04-16T14:23:41Z"},
"adapter_workers": {"pool": 4, "active": 2, "dead": 0},
"synthesis_worker":{"pid": 12840, "state": "idle", "last_beat": "2026-04-16T14:23:40Z"},
"mcp_http": {"pid": 12841, "state": "running", "connections": 1},
"webhook_recv": {"pid": 12842, "state": "running"},
"web_ui": {"pid": 12843, "state": "running"}
}
},
"database": {
"schema_version": 7,
"sqlite_size_bytes": 48293721,
"wal_size_bytes": 1048576,
"fts_integrity": "ok",
"last_backup_at": "2026-04-15T03:00:00Z",
"last_backup_age_hours": 35.4
},
"workspaces": [
{
"slug": "research",
"wiki_pages": 94,
"raw_sources": 123,
"events_last_7d": 412,
"pending_subscriptions": 6,
"last_ingest_at": "2026-04-16T13:15:00Z",
"last_synthesis_at": "2026-04-14T03:00:00Z",
"synthesis_paused": false,
"drafts_awaiting_review": 0,
"eval": {
"M1": {"score": 0.97, "status": "healthy", "last_run": "2026-04-14T03:30:00Z"},
"M2": {"score": 0.93, "status": "healthy", "last_run": "2026-04-16T13:20:00Z"},
"M3": {"score": 0.84, "status": "healthy", "last_run": "2026-04-01T04:00:00Z"},
"M4": {"crossover_week": 19, "last_run": "2026-04-16T14:00:00Z"},
"M5": {"score": 0.96, "status": "healthy", "last_run": "2026-04-14T04:30:00Z"}
}
}
],
"adapters": [
{
"id": "gh-acme-web",
"type": "github",
"workspace": "research",
"circuit_breaker": "closed",
"last_run_at": "2026-04-16T14:20:00Z",
"last_run_status": "success",
"next_run_at": "2026-04-16T14:25:00Z",
"items_last_run": 12
},
{
"id": "slack-research",
"type": "slack",
"workspace": "research",
"circuit_breaker": "open",
"opened_at": "2026-04-16T13:45:00Z",
"cooldown_until": "2026-04-16T14:50:00Z",
"last_error": "429 Too Many Requests"
}
],
"rate_limits": {
"github": {"capacity": 5000, "available": 4823, "refill_per_sec": 1.389},
"slack": {"capacity": 50, "available": 0, "refill_per_sec": 1}
},
"budgets": {
"research": {
"scheduled_synthesis_monthly_usd_used": 4.28,
"scheduled_synthesis_monthly_usd_cap": 20.00,
"scheduled_synthesis_runs_this_month": 2
}
},
"warnings": [
"slack-research adapter circuit breaker open until 14:50 UTC",
"last backup is 35h old — consider running alexandria backup create"
]
}
The warnings array is the actionable subset. If this array is empty, the system is healthy; if non-empty, the user reads it and acts.
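As an illustration of how the warnings array could be derived from the rest of the blob, the sketch below reproduces the two warnings in the example. `derive_warnings` is a hypothetical function, and the 24-hour backup threshold is an assumption (borrowed from the doctor check for missing backups); the real rules may differ.

```python
def derive_warnings(status: dict, max_backup_age_hours: float = 24.0) -> list[str]:
    """Turn raw status fields into actionable warning strings (sketch)."""
    warnings: list[str] = []

    # An open circuit breaker means an adapter is currently not syncing.
    for adapter in status.get("adapters", []):
        if adapter.get("circuit_breaker") == "open":
            until = adapter.get("cooldown_until", "unknown")
            warnings.append(
                f"{adapter['id']} adapter circuit breaker open until {until}"
            )

    # A stale backup comes with the command that fixes it.
    age = status.get("database", {}).get("last_backup_age_hours")
    if age is not None and age > max_backup_age_hours:
        warnings.append(
            f"last backup is {age:.0f}h old; consider running alexandria backup create"
        )
    return warnings
```

A script consuming alexandria status --json only needs to check whether the array is empty; each string already says what to do next.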
CLI formats
alexandria status # pretty-printed human output
alexandria status --json # the blob above
alexandria status --workspace X # scoped to one workspace
alexandria status --watch # refresh every 5s (curses-ish)
Crash dumps
On an unhandled exception anywhere in the daemon:
- A traceback is written to ~/.alexandria/crashes/<timestamp>-<pid>-<child>.txt.
- A state snapshot is appended: recent tool calls from the crashed run (if any), last heartbeats, current rate-limiter state, open file handles.
- The parent supervisor logs {event='child_crashed', child=<name>, pid=<pid>, dump_path=<path>} to daemon-*.jsonl.
- The crash count is surfaced in alexandria status as a warning.
Python faulthandler is enabled at import time for every child, which catches segfaults from C extensions (FTS5, asyncpg bindings) and writes a stack trace to stderr. The parent supervisor captures child stderr and routes it to the crash dump file.
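A minimal sketch of the child-side half of this, assuming a hypothetical `write_crash_dump` helper and an `install_crash_handlers` entry point that each child would call at import. The timestamp format in the file name is an assumption (the doc specifies only `<timestamp>`), and the state snapshot is omitted.

```python
import faulthandler
import os
import sys
import time
import traceback
from pathlib import Path


def write_crash_dump(crash_dir: Path, child: str, exc: BaseException) -> Path:
    """Write the traceback to <timestamp>-<pid>-<child>.txt and return its path."""
    crash_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())  # assumed format
    path = crash_dir / f"{stamp}-{os.getpid()}-{child}.txt"
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    path.write_text(text, encoding="utf-8")
    return path


def install_crash_handlers(child: str, crash_dir: Path) -> None:
    """Enable faulthandler on stderr and dump unhandled Python exceptions."""
    # Segfaults in C extensions bypass Python's exception machinery;
    # faulthandler prints the stack to stderr, which the parent captures.
    faulthandler.enable()

    def hook(exc_type, exc, tb):
        write_crash_dump(crash_dir, child, exc.with_traceback(tb))
        sys.__excepthook__(exc_type, exc, tb)  # keep the default stderr output

    sys.excepthook = hook
```

Keeping `write_crash_dump` separate from the hook installation makes the dump format testable without crashing a process.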
alexandria doctor
alexandria doctor [--workspace X]
Runs a suite of checks and reports pass/fail with actionable remediation:
[OK] SQLite database is readable
[OK] Schema version 7 matches binary expectation
[OK] FTS5 integrity verified
[OK] All workspace directories present
[OK] Keyring secret store accessible
[WARN] No backup in the last 24 hours
→ run: alexandria backup create
[OK] MCP stdio entry point is executable
[OK] Claude Code hook installed and verified
[ERR] slack-research adapter circuit breaker open
→ investigate: cat ~/.alexandria/logs/sync-2026-04-16.jsonl | jq 'select(.data.adapter_id == "slack-research")'
[WARN] No scheduled synthesis in the last 14 days
→ check: alexandria synthesize review
[OK] Eval M1 healthy on all workspaces
[OK] Eval M2 healthy on all workspaces
Doctor is the first thing a user runs when "something feels off". It points at specific log lines, specific commands, specific fixes. No generic advice.
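The report format above suggests a flat list of checks, each returning a status, a message, and at most one remediation line. The sketch below shows one way such a runner could look; `CheckResult` and `run_checks` are hypothetical names, and treating a check that raises as an error is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

_RANK = {"OK": 0, "WARN": 1, "ERR": 2}


@dataclass
class CheckResult:
    status: str                   # "OK", "WARN", or "ERR"
    message: str
    remedy: Optional[str] = None  # one actionable line, e.g. a command to run


def run_checks(checks: list[Callable[[], CheckResult]]) -> tuple[list[str], int]:
    """Run checks in order; return the report lines and the worst severity."""
    lines: list[str] = []
    worst = 0
    for check in checks:
        try:
            result = check()
        except Exception as exc:  # a broken check is itself a finding
            name = getattr(check, "__name__", "check")
            result = CheckResult("ERR", f"{name} raised {exc!r}")
        lines.append(f"[{result.status}] {result.message}")
        if result.remedy:
            lines.append(f"  → {result.remedy}")
        worst = max(worst, _RANK[result.status])
    return lines, worst
```

The worst severity could serve as a process exit code so that scripts can react to a failing doctor run, though the doc does not specify exit codes.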
Tracing the verifier's work specifically
Because the hostile verifier is the load-bearing correctness mechanism, its runs are observable in two places:
- verifier-*.jsonl: every verdict with per-claim findings.
- alexandria runs show <run_id>: the cross-log unified view with the verifier as a sub-run.
A user who wants to understand "why did the verifier reject this ingest?" runs:
alexandria runs show 2026-04-16-abc123
And sees every step from read-source → stage → verify → reject, with the reject reason and the per-claim findings inlined. This is the debug surface for correctness issues.
No telemetry, no opt-out required
Everything in this doc writes to local files under ~/.alexandria/. There is no network traffic, no phone-home, no anonymized metrics server, no crash reporter uploading to an issue tracker. The observability surface is fully inspectable by the user with jq, grep, and a text editor. This is non-negotiable — it follows directly from invariant #1 (single-user, local).
Users who want remote observability can run their own OpenTelemetry collector against the JSONL files. alexandria does not ship it, and the architecture assumes it does not exist.
SOLID application
- Single Responsibility. Each log family is a separate file with a separate emitter. One log family, one purpose. Downstream tools that want to merge read from all of them via run_id.
- Open/Closed. Adding a new log family is adding a file path to the emitter registry. No changes to alexandria logs show.
- Liskov. Every log line has the common shape above. Code that parses one line parses them all.
- Interface Segregation. The status command reads from the runs table, the heartbeats table, and the last N log lines. It does not couple to the adapter implementations or the LLM provider internals.
- Dependency Inversion. The status blob is produced by a StatusReporter class that composes DaemonInspector, DbInspector, WorkspaceInspector, and AdapterInspector. Each inspector is a narrow interface; tests substitute fakes.
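The Dependency Inversion point can be illustrated with a structural interface. The class name `StatusReporter` comes from the text; the `Inspector` protocol, its `key` attribute, its `inspect()` method, and `FakeDbInspector` are assumptions made for this sketch, since the doc does not define the inspector interface.

```python
from typing import Any, Iterable, Protocol


class Inspector(Protocol):
    """Narrow interface: one top-level key of the status blob, one read."""

    key: str

    def inspect(self) -> Any: ...


class StatusReporter:
    """Composes inspectors without knowing what is behind them."""

    def __init__(self, inspectors: Iterable[Inspector]) -> None:
        self._inspectors = list(inspectors)

    def report(self) -> dict[str, Any]:
        return {insp.key: insp.inspect() for insp in self._inspectors}


class FakeDbInspector:
    """Test double standing in for a real database inspector."""

    key = "database"

    def inspect(self) -> Any:
        return {"schema_version": 7, "fts_integrity": "ok"}
```

In this sketch, `StatusReporter([FakeDbInspector()]).report()` yields a blob with a single `database` key, so the reporter can be tested without a daemon or a SQLite file.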
DRY notes
- One run_id threaded through every log family eliminates the need for separate correlation systems.
- One line shape means one parser.
- One status command aggregates what would otherwise be a dozen separate list-* commands.
- One doctor centralizes every "is my install healthy" check instead of scattering them across per-component scripts.
KISS notes
- Logs are JSONL. Readable with tail -f | jq. No Parquet, no Avro, no binary format.
- Status is one JSON document. No Prometheus, no Grafana, no push gateway.
- Doctor is a sequence of checks with pass/fail/warn and one remediation line each. No dependency graph.
- Crash dumps are text files. Read them with cat.
What this doc does NOT cover
- The runs and source_runs tables: defined in 13_hostile_verifier.md and 06_data_model.md.
- What each log family's event types mean semantically: that belongs in a reference doc that I will build as the emitter registry grows. For now, the JSONL is self-describing.
- Daemon supervision logic itself: 16_operations_and_reliability.md.
- Secrets redaction in logs: 18_secrets_and_hooks.md.
- Evaluation metric scoring: 14_evaluation_scaffold.md.
Summary
Seven log families, one run_id field correlating everything, one alexandria status --json surfacing operational state, one alexandria logs show <run_id> for post-mortem tracing, one alexandria doctor for health checks, one alexandria runs show for verifier debugging. Crash dumps go to ~/.alexandria/crashes/. No telemetry. No cloud.
Every observable question the user will ask has exactly one command that answers it.