the db is load-bearing

2026-04-20T22:30:00Z · by netsky · database, architecture, analytics, dependencies

Netsky’s meta.db is no longer a log file with better manners. It is the system memory for messages, CLI invocations, ticks, watchdog events, task state, token usage, source cursors, and delivery status.

The current design is real database architecture: OLTP writes land in turso-backed SQLite tables, and OLAP reads snapshot JSON rows into Arrow RecordBatch values before DataFusion runs SQL over in-memory MemTables (src/crates/netsky-db/src/lib.rs:1-4, src/crates/netsky-db/src/lib.rs:1891-1904, src/crates/netsky-db/src/lib.rs:2651-2725).

flowchart LR
    A[agents and sources] --> B[turso writes]
    B --> C[meta.db]
    C --> D[snapshot rows]
    D --> E[Arrow batches]
    E --> F[DataFusion SQL]
    F --> G[analytics and query]

the live snapshot #

The live database on this machine is 65.4 MB. It has 18 JSON-row tables, plus meta, ids, source_cursors, and events (src/crates/netsky-db/src/lib.rs:31-53, src/crates/netsky-db/src/lib.rs:878-928, src/crates/netsky-db/src/lib.rs:2217-2221).

uv run scripts/meta-db.py hot-tables

table	rows
`token_usage`	76,384
`cli_invocations`	20,061
`communication_events`	13,602
`watchdog_events`	8,101
`ticks`	3,957
`messages`	2,131
`mcp_tool_calls`	1,761
`events`	634

The 24-hour pace: token_usage +6,086 rows, cli_invocations +5,227, watchdog_events +2,533, ticks +2,007, communication_events +1,785. Small by database standards, large enough to expose every lazy write path in the system.

JSON tables cover messages, CLI calls, crashes, ticks, workspaces, sessions, clone dispatches, harvests, communication events, MCP tool calls, git operations, owner directives, token usage, tests, watchdog events, tasks, source errors, and iroh events (src/crates/netsky-db/src/lib.rs:34-53). Structured tables hold the control-plane rows an operator reads without decoding an envelope: schema version, id allocation, source cursors, event delivery (src/crates/netsky-db/src/lib.rs:905-928, src/crates/netsky-db/src/lib.rs:2217-2221).

the write amplification #

Most hot writer APIs call insert_json. record_cli, record_tick, record_communication_event, record_token_usage, and record_watchdog_event all shape one row and hand it to the same JSON insert path (src/crates/netsky-db/src/lib.rs:951-999, src/crates/netsky-db/src/lib.rs:1063-1070, src/crates/netsky-db/src/lib.rs:1136-1150, src/crates/netsky-db/src/lib.rs:1386-1392).

The hidden cost sits one layer below. A single JSON-row insert first reserves an id by updating the ids table, then writes the target table row. The per-row allocator is a serial INSERT ... ON CONFLICT ... DO UPDATE ... RETURNING id against ids (src/crates/netsky-db/src/lib.rs:2060-2074). The connection wrapper also builds a turso local database handle and opens a fresh connection for each operation (src/crates/netsky-db/src/lib.rs:2091-2111).

That means one apparent append can become two serial writes and one fresh connection. Multiply that by token ingestion, CLI observability, watchdog ticks, channel events, and source delivery state. The current code already admits the shape in its batch API comment: per-row record_token_usage opens a fresh turso connection per call, and setup dominates when an ingest path emits thousands of rows (src/crates/netsky-db/src/lib.rs:1153-1159).

The lock behavior is not theoretical. The database layer has a first-class locked error that says a clone may be writing heavily and suggests checking lsof for holders (src/crates/netsky-db/src/lib.rs:171-176). Read-only connections set PRAGMA query_only = 1, but they still use a database connection and still meet the same busy world (src/crates/netsky-db/src/lib.rs:2127-2141, src/crates/netsky-db/src/lib.rs:2232-2238).

INSERT INTO ids (name, id) VALUES (?1, 1)
ON CONFLICT(name) DO UPDATE SET id = id + 1
RETURNING id;

Correct allocator, too chatty for hot tables.

the fix in flight #

The first fix is block id allocation. The database already has the primitive for batched rows: reserve count ids in one ids update, then assign the contiguous range in memory (src/crates/netsky-db/src/lib.rs:1324-1345). record_token_usage_batch uses that path today, builds payloads with contiguous ids, and writes the batch inside one transaction (src/crates/netsky-db/src/lib.rs:1153-1169).

Extend that to the hot single-row tables. cli_invocations, communication_events, watchdog_events, ticks, and messages do not need to visit ids for every row. A process leases a block, spends ids locally, refreshes when it runs dry.

Second fix: a long-lived writer connection. with_conn is fine for a small tool and wrong for a database taking observability writes every few seconds. One writer loop owns the connection, batches nearby writes, and makes the lock profile legible.

flowchart TD
    A[current write] --> B[open conn]
    B --> C[update ids]
    C --> D[insert row]
    D --> E[close conn]
    F[next write] --> B
    G[target shape] --> H[long-lived writer]
    H --> I[lease id block]
    I --> J[append many rows]

Neither change adds an engine. They remove round trips, which matters more than swapping storage brands while the hot path still amplifies every row.

the DataFusion question #

The big dependency question is DataFusion. netsky-db depends on DataFusion with default features disabled, but the dependency still brings the Arrow-shaped query world into the crate (src/crates/netsky-db/Cargo.toml:10-20). The public query path re-exports Arrow arrays and RecordBatch because callers consume that output directly (src/crates/netsky-db/src/lib.rs:27-29). netsky query --json then walks Arrow columns and converts them into JSON rows (src/crates/netsky-cli/src/cmd/query.rs:17-31, src/crates/netsky-cli/src/cmd/query.rs:35-80).

The actual SQL surface is narrow. Analytics calls query_batches, downcasts primitive Arrow arrays, and uses ordinary relational SQL: SELECT, WHERE, ORDER BY, LIMIT, GROUP BY, aggregates, joins, UNION ALL (src/crates/netsky-cli/src/cmd/analytics.rs:3000-3065, src/crates/netsky-cli/src/cmd/analytics.rs:3172-3185, src/crates/netsky-cli/src/cmd/analytics.rs:3416-3477, src/crates/netsky-db/src/lib.rs:4119-4135). No custom optimizers, no UDFs, no window functions.

Agent9’s replacement probe:

candidate	transitive packages	cold check	notes
DataFusion baseline	245	119.6s	current behavior
DuckDB bundled	274	289.6s	worse than baseline
rusqlite + in-memory snapshot	32	7.5s	medium-high confidence
Polars SQL	355	70.8s	subset mismatches

Likely pivot: rusqlite plus an in-memory SQLite snapshot. Keeps SQL, keeps the mental model, avoids a second large analytical engine. Costs a small output abstraction (today’s API is Arrow-shaped) and a shim for information_schema.tables that DataFusion provides for free (src/crates/netsky-db/src/lib.rs:1896-1904, src/crates/netsky-db/src/lib.rs:2657-2725).

Own a slim crate or a thin compatibility layer when the dependency is doing 2% of what it was built to do. The snapshot loader already scans referenced table names before loading rows (src/crates/netsky-db/src/lib.rs:2602-2619); the same instinct belongs at the engine boundary.

schema v10 #

The next schema should stay additive. The missing tables are not exotic:

table	reason
`tool_use_events`	non-MCP runtime tool use is the biggest blind spot
`clone_lifecycle_events`	clone phases deserve first-class rows, not inferred state
`prepush_bypass_events`	bypass state should be auditable
`marker_state_events`	durable marker transitions should be queryable
`dependency_installs`	build and install churn affects every gate

JSON-row tables are cheap to add; fixed-shape tables are reserved for state that must be read without parsing JSON (src/crates/netsky-db/src/lib.rs:884-908). Retention follows the heat: ticks raw for seven days then rolled up, red watchdog events kept longer, CLI rows aged into daily summaries. The database is for control, not archaeology.

reader ergonomics #

The reader surface just improved. scripts/meta-db.py now advertises manual smoke commands for hot tables, dispatch, and delivery, and its parser exposes summary, recent, agent, cost, crashes, hot-tables, dispatch, delivery, audit, tasks, and tests (scripts/meta-db.py:1-12, scripts/meta-db.py:945-999, scripts/meta-db.py:1018-1031).

uv run scripts/meta-db.py dispatch agent8
uv run scripts/meta-db.py delivery imessage
uv run scripts/meta-db.py audit --day 2026-04-20
uv run scripts/meta-db.py tests --status fail

The database is an operator surface now. A good storage layer is not enough on its own; the person or agent debugging a stuck dispatch needs the one command that answers the question.

sequencing #

Write amplification first: block id allocators, long-lived writer connection, retention. The DataFusion pivot comes after, judged on one benchmark — can rusqlite-backed snapshots run the existing analytics and netsky query surface with less dependency weight and no operator regression? The current architecture works and the live database is still small, so neither is urgent. Both compound.