the meta.db

Every observability write lands in one of three places: ~/.netsky/meta.db, ~/.netsky/logs/meta-db-errors-<date>.jsonl, or a source-specific JSONL trail such as watchdog-events-<date>.jsonl.

The schema list is concrete: messages, cli_invocations, crashes, ticks, workspaces, sessions, clone_dispatches, harvest_events, communication_events, mcp_tool_calls, git_operations, owner_directives, token_usage, watchdog_events, source_errors, and iroh_events (src/crates/netsky-db/README.md:27-44).

The stack is in the README: Turso SQLite for OLTP, DataFusion over Arrow snapshots for OLAP, and JSONL fallback on write failure (src/crates/netsky-db/README.md:5-10). The useful operational fact is simpler: lots of short-lived processes call Db::record_*.

redb worked until concurrency mattered

The first backend was redb, a pure-Rust embedded key-value store with ACID transactions and MVCC. It was a reasonable first choice. One file. No service. A thin Db wrapper exposed record_message, record_cli, record_tick, record_session, and the other writer APIs.

It worked for one writer.

Netsky is not one writer. Every netsky channel send is a writer. Every escalation attempt is a writer. Every watchdog tick is a writer. Every clone dispatch can be a writer. A normal minute can have several processes opening the same file.

redb is single-writer at the process level. A second process can fail at open with:

Database already open. Cannot acquire lock.

The operational path swallowed those failures because observability was best-effort. The command still ran. The row was missing. That is the worst class of observability failure: the system looks healthy until the missing rows are the rows needed to explain an outage.

Commit 09844c1 moved the store from redb to turso, the Rust rewrite of SQLite from the Turso team. The earlier benchmark work had tested libsql with WAL mode against redb. The production choice was turso, not libsql, because the Rust crate matched the rest of the netsky dependency graph.

The important properties are in code, not marketing copy:

  • conn.busy_timeout(Duration::from_secs(10)) in configure_conn (src/crates/netsky-db/src/lib.rs:1238-1241).
  • PRAGMA journal_mode=WAL and PRAGMA synchronous=NORMAL on schema init (src/crates/netsky-db/src/lib.rs:1243-1249).
  • JSONL spooling to ~/.netsky/logs/meta-db-errors-<date>.jsonl (src/crates/netsky-db/src/lib.rs:1376-1385).
  • Stable writer API surface listed in the README (src/crates/netsky-db/README.md:54-71).
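What a busy timeout buys the caller can be modeled in a few lines. This is a minimal sketch of the retry-until-deadline behavior, not turso's implementation: `WriteError`, `with_busy_timeout`, and the backoff numbers are hypothetical names and values chosen for illustration.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a write that may find the database locked.
#[derive(Debug, PartialEq)]
pub enum WriteError {
    Busy,
}

/// Retry `attempt` until it succeeds or `timeout` elapses.
/// This models what a busy timeout does for a caller; it is not turso's code.
pub fn with_busy_timeout<T>(
    timeout: Duration,
    mut attempt: impl FnMut() -> Result<T, WriteError>,
) -> Result<T, WriteError> {
    let deadline = Instant::now() + timeout;
    let mut backoff = Duration::from_millis(1);
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            // Still inside the timeout window: wait, then try again.
            Err(WriteError::Busy) if Instant::now() < deadline => {
                thread::sleep(backoff);
                // Cap the backoff so a long timeout still polls regularly.
                backoff = (backoff * 2).min(Duration::from_millis(50));
            }
            // Deadline passed: surface the error to the caller.
            Err(err) => return Err(err),
        }
    }
}
```

The point of the sketch is the trade: contention turns into waiting instead of an immediate error, up to the configured limit.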

That swap mattered. It made multiple writer processes a supported shape instead of an accidental race.

It did not end the lock storm.

the real bug was DDL on every open

After 09844c1, the failure class survived. The fix that mattered next was c1ac194.

The root cause sat in configure_conn. Before c1ac194, every open ran CREATE TABLE IF NOT EXISTS .... After c1ac194, the function returns early when user_version(conn) == SCHEMA_VERSION and skips the DDL path entirely (src/crates/netsky-db/src/lib.rs:1238-1249).

Netsky opens many short-lived connections. They were not racing on normal event inserts. They were racing to prove that tables already existed.

That is the whole fix: one schema-version check, then no table creation on the hot path. git show --stat c1ac194 reports 25 inserted lines in one file.

That was the P1 fix. The lock storm was not ended by “switch to turso.” It was ended by “do not run DDL when the current schema is already present.”
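The gate can be sketched without a real database. This is a minimal model, not the netsky-db source: `FakeConn`, `run_ddl`, and the value of `SCHEMA_VERSION` are assumptions for illustration. Only the shape (compare the stored version, return early, otherwise run DDL) comes from the description above.

```rust
/// Assumed value for illustration; the real constant lives in netsky-db.
const SCHEMA_VERSION: u32 = 1;

/// Hypothetical in-memory stand-in for a database connection.
#[derive(Default)]
pub struct FakeConn {
    pub user_version: u32,
    pub ddl_statements_run: u32,
}

impl FakeConn {
    /// Stands in for `CREATE TABLE IF NOT EXISTS ...`, which needs the write lock.
    fn run_ddl(&mut self) {
        self.ddl_statements_run += 1;
        self.user_version = SCHEMA_VERSION;
    }
}

/// Returns true when the DDL path ran, false when the open skipped it.
pub fn configure_conn(conn: &mut FakeConn) -> bool {
    if conn.user_version == SCHEMA_VERSION {
        // Schema already current: no DDL, no write lock on this open.
        return false;
    }
    conn.run_ddl();
    true
}
```

In this model the first open pays for DDL once. Every later open, however many there are, is a read of one integer.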

the fallback path got cleaned up too

Commit e782375 fixed the fallback path next. spool_error_json still writes one JSON object per missed write, but the append now goes through netsky_core::jsonl::append_json_line (src/crates/netsky-db/src/lib.rs:1376-1385, src/crates/netsky-core/src/jsonl.rs).

That matters because the error spool is the last forensic path when the database write fails. A fallback log that can tear under concurrency is not a fallback. The shared appender writes through a temp file or atomic append path instead of each caller inventing its own partial JSONL behavior.

The database can still fail. The operational caller still keeps moving. At least the fallback log now has one append path, not five.
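A single shared append path might look like the sketch below. It is not the body of netsky_core::jsonl::append_json_line, which is not shown here. `append_error_line`, `escape_json`, and the two-field record are hypothetical. The sketch assumes an append-mode open plus one write of the whole line, which on POSIX systems positions each write at end of file.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Minimal JSON string escaping for the sketch (quotes, backslashes, control characters).
fn escape_json(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Append one JSON object as one line. Hypothetical helper, not the netsky API.
pub fn append_error_line(path: &Path, table: &str, error: &str) -> io::Result<()> {
    // Build the full line first, newline included, so it goes out in one buffer.
    let line = format!(
        "{{\"table\":\"{}\",\"error\":\"{}\"}}\n",
        escape_json(table),
        escape_json(error)
    );
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(line.as_bytes())
}
```

The design choice worth copying is the single helper. Every caller gets the same escaping and the same append behavior, so a torn line has one place to be fixed.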

what the benchmark says

The current benchmark pass was not a perfectly controlled suite. The agents ran nearby commits, not one pinned SHA. Treat the numbers as current-state signals. They are still useful because they test the shape that broke netsky: many processes writing at once.

concurrent writers

Turso completed every write at every level up to 32 parallel writers, with no returned errors and no error-spool lines.

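| Backend | Writers | Attempted | Successful | Failed | Median ms | P95 ms | Wall ms | Lock errors |
|---|---|---|---|---|---|---|---|---|
| turso | 1 | 500 | 500 | 0 | 5.090 | 11.979 | 3022.112 | 0 |
| turso | 4 | 2000 | 2000 | 0 | 3.688 | 20.609 | 3375.691 | 0 |
| turso | 8 | 4000 | 4000 | 0 | 5.396 | 39.616 | 6107.799 | 0 |
| turso | 16 | 8000 | 8000 | 0 | 10.345 | 95.208 | 13560.356 | 0 |
| turso | 32 | 16000 | 16000 | 0 | 25.660 | 252.552 | 34196.684 | 0 |
| redb | 1 | 500 | 500 | 0 | 5.029 | 6.243 | 2481.809 | 0 |
| redb | 4 | 2000 | 500 | 1500 | 4.572 | 5.285 | 2391.005 | 3 |
| redb | 8 | 4000 | 500 | 3500 | 5.135 | 8.103 | 2748.757 | 7 |
| redb | 16 | 8000 | 500 | 7500 | 4.965 | 5.663 | 2519.020 | 15 |
| redb | 32 | 16000 | 500 | 15500 | 4.483 | 5.294 | 2297.465 | 31 |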

The redb result is the old topology failing at file open. At 32 writers, 500 writes land and 15,500 do not. The turso result is the current topology absorbing contention as latency. At 32 writers, p95 reaches 252.552 ms, but all 16,000 writes land.
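The two 32-writer rows reduce to two numbers each. A small sketch of that arithmetic, using values copied from the table; the `Row` type and its method names are hypothetical and exist only for this example.

```rust
/// One row of the concurrent-writer table (subset of columns).
pub struct Row {
    pub backend: &'static str,
    pub writers: u32,
    pub attempted: u32,
    pub successful: u32,
    pub wall_ms: f64,
}

impl Row {
    /// Share of attempted writes that landed, as a percentage.
    pub fn landed_pct(&self) -> f64 {
        100.0 * f64::from(self.successful) / f64::from(self.attempted)
    }

    /// Successful writes per second of wall-clock time.
    pub fn writes_per_sec(&self) -> f64 {
        f64::from(self.successful) / (self.wall_ms / 1000.0)
    }
}

/// Values copied from the 32-writer rows of the benchmark table.
pub const TURSO_32: Row = Row {
    backend: "turso",
    writers: 32,
    attempted: 16000,
    successful: 16000,
    wall_ms: 34196.684,
};

pub const REDB_32: Row = Row {
    backend: "redb",
    writers: 32,
    attempted: 16000,
    successful: 500,
    wall_ms: 2297.465,
};
```

`REDB_32.landed_pct()` is 3.125 and `TURSO_32.landed_pct()` is 100. `TURSO_32.writes_per_sec()` comes out at roughly 468 writes per second. The redb run finishes faster only because 31 of its 32 writers never got to write.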

reads under write load

8 writers ran for 20s. They wrote 4403 messages rows and 4403 clone_dispatches rows. Read errors were 0.

| Query | Samples | Median ms | P95 ms | P99 ms | Max ms |
|---|---|---|---|---|---|
| messages_count | 80 | 15.081 | 41.370 | 56.915 | 138.946 |
| dispatch_avg | 80 | 25.824 | 60.440 | 70.351 | 83.848 |

Diagnostics can run while agents write. The count query stayed below 42 ms p95 under load. The grouped aggregate stayed below 61 ms p95. No read errors surfaced.
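For reading the tail columns, it helps to see how a percentile falls out of 80 samples. The sketch below uses the nearest-rank definition, which is an assumption: the benchmark's own definition is not stated, and since the table's P99 differs from Max it presumably used another one, such as interpolation. `percentile` is a hypothetical helper.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of all samples are less than or equal to it.
/// Returns None for an empty slice or a percentile above 100.
pub fn percentile(samples: &[f64], pct: usize) -> Option<f64> {
    if samples.is_empty() || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank = ceil(pct * n / 100), computed in integers, at least 1.
    let rank = ((pct * sorted.len() + 99) / 100).max(1);
    Some(sorted[rank - 1])
}
```

Under nearest-rank with 80 samples, p95 is the 76th smallest value and p99 is the largest one. Either way, a p99 over 80 samples rests on the top one or two readings, so the P95 column is the steadier signal here.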

cold and warm writes

30 cold first-write iterations and 30 warm open-write iterations completed.

| Scenario | Median ms | P95 ms |
|---|---|---|
| cold first-write | 9.462 | 24.106 |
| warm open-write | 5.309 | 22.902 |

The cold median overhead was 4.153 ms (9.462 ms cold minus 5.309 ms warm). That is small enough for CLI observability. The cost was not one slow command. The cost was thousands of short-lived commands touching the schema path and occasionally losing the write lock.

what did not happen

There was a tempting bigger fix: make JSONL the primary store, then ingest it into the database later. One review pass caught that 460-line redesign before it landed. It would have added a new primary write path, a replay protocol, and a second source of truth. The actual P1 fix was one schema check in configure_conn.

That is the useful lesson. When a database backend change reduces a failure but does not eliminate it, the next move is not always a larger architecture. Sometimes the hot path is doing a tiny write-lock operation thousands of times per day.

the shape now

meta.db is a turso SQLite database with WAL and a 10s busy timeout. Writer APIs live behind netsky_db::Db. Failed writes append one JSONL line to the error spool. Read-side analytics snapshot rows into Arrow and query them through DataFusion.
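The write-then-spool contract can be sketched as one function. This is a model of the behavior described, not the netsky_db::Db code: `Outcome`, `record_best_effort`, and the closure-based signature are hypothetical.

```rust
/// What happened to one observability write.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The row landed in the database.
    Recorded,
    /// The database write failed and the error went to the spool.
    Spooled,
    /// Both paths failed; the caller still keeps moving.
    Lost,
}

/// Try the database write; on failure, hand the error text to the spool.
/// Never returns an error, because observability is best-effort.
pub fn record_best_effort<E: std::fmt::Display>(
    write: impl FnOnce() -> Result<(), E>,
    spool: impl FnOnce(&str) -> std::io::Result<()>,
) -> Outcome {
    match write() {
        Ok(()) => Outcome::Recorded,
        Err(err) => match spool(&err.to_string()) {
            Ok(()) => Outcome::Spooled,
            Err(_) => Outcome::Lost,
        },
    }
}
```

Returning an `Outcome` instead of `()` is a choice made for the sketch. It keeps the caller unblocked while still letting a test or a counter see how often the fallback fires.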

It is not a queue. The filesystem-backed agent bus coordinates work. meta.db records that work happened.

It is not the source of truth for live operational state. State files, pane hashes, and restart markers remain authoritative for what the system is doing right now.

It is not proof that the database path can never lock again. Three things are proved: the old redb open-lock failure is gone, the turso topology handles the current concurrent-writer shape, and the post-swap lock storm was fixed at c1ac194.

The short version: 09844c1 made the backend capable of the workload. c1ac194 stopped the workload from taking an unnecessary DDL lock on every open. e782375 made the fallback log worth trusting when the database still misses. The replay tool is still a separate problem.