netsky: the system

Part 1 is the theory. This is the current shape.

Netsky runs as a single-machine constellation today. One root orchestrator, a watchdog, a ticker, and a dynamic pool of bounded workers. A second constellation on another machine can pair over iroh. The shared primitive is a filesystem-backed envelope bus.

the constellation

flowchart LR
    Owner[owner]
    IM[iMessage + email sources]
    A0[agent0: root orchestrator]
    INF[agentinfinity: watchdog]
    TICK[netsky-ticker: 60s tick]
    DB[(~/.netsky/meta.db)]
    LOGS[(~/.netsky/logs/*.jsonl)]

    subgraph Ops[clones: agent1..agentN]
        A1[agent1]
        A2[agent2]
        AN[agentN]
    end

    Owner <--> IM
    IM <--> A0
    A0 <--> A1
    A0 <--> A2
    A0 <--> AN
    TICK --> INF
    INF --> A0
    INF --> Owner
    A0 --> DB
    A1 --> DB
    A2 --> DB
    AN --> DB
    INF --> LOGS
    INF --> DB

Each agent lives in its own tmux session named agent<N>. The ticker fires the watchdog every 60 seconds. The watchdog writes structured events to JSONL first and meta.db second. A database outage does not erase the forensic trail.
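The JSONL-first, database-second order can be sketched as follows. This is a minimal illustration under stated assumptions, not the watchdog's actual code: record_event and the db_write callback are hypothetical names, and the event is treated as an opaque, already-serialized line.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append `line` to the JSONL trail, then attempt the database write.
/// Returns Ok(true) if both writes landed, Ok(false) if only the JSONL
/// line landed. A JSONL failure is the only hard error.
/// (Hypothetical helper; `db_write` stands in for the meta.db insert.)
pub fn record_event<F>(jsonl: &Path, line: &str, db_write: F) -> io::Result<bool>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    // 1. Durable trail first: one event per line, appended in a single write.
    let mut file = OpenOptions::new().create(true).append(true).open(jsonl)?;
    file.write_all(format!("{}\n", line).as_bytes())?;
    file.sync_data()?;

    // 2. Database second: a failure here is reported but never undoes step 1.
    Ok(db_write(line).is_ok())
}
```

With this ordering, a caller that sees Ok(false) knows the event is still on disk and can be backfilled into the database later.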

Only agent0 has a line to the owner. Clones route through agent0. The authority arrow is one-way: agent0 outranks clones; clones do not spawn clones (src/crates/netsky-prompts/prompts/base.md:37).

the envelope

Every inter-agent message is one JSON envelope per file, in a per-agent inbox at ~/.netsky/channels/agent/agent<N>/inbox/. The schema has from, to, kind, thread, swarm, idempotency_key, requires_ack, body text, and timestamp (e31ee74, src/crates/netsky-core/src/envelope.rs:31).

Filenames encode monotonic order: nanoseconds, pid, random bytes, a sequence, and the sender id (src/crates/netsky-core/src/envelope.rs:6). Writers serialize to a temp file, hard-link into the final filename with O_CREAT | O_EXCL semantics, and unlink the temp (src/crates/netsky-core/src/envelope.rs:17). That makes the “envelope arrived” state atomic.
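The temp-file plus hard-link pattern looks roughly like this in Rust. It is a sketch, not the code in envelope.rs: publish_atomic is a hypothetical name, the caller supplies both filenames, and the sketch assumes readers ignore temp names (for example by a prefix convention).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Publish `payload` as `final_name` inside `inbox` so that a reader
/// either sees the complete file or no file at all.
/// (Hypothetical helper; it does not reproduce netsky's filename scheme.)
pub fn publish_atomic(
    inbox: &Path,
    tmp_name: &str,
    final_name: &str,
    payload: &[u8],
) -> io::Result<()> {
    let tmp = inbox.join(tmp_name);
    let dst = inbox.join(final_name);

    // 1. Write and flush the whole payload under the temporary name.
    let mut file = File::create(&tmp)?;
    file.write_all(payload)?;
    file.sync_all()?;
    drop(file);

    // 2. Link the finished file into place. hard_link fails with
    //    AlreadyExists if `dst` is present, so nothing is overwritten.
    let linked = fs::hard_link(&tmp, &dst);

    // 3. Remove the temporary name whether or not the link succeeded.
    let _ = fs::remove_file(&tmp);
    linked
}
```

The link step is what gives the exclusive-create behaviour: a second writer that picks the same final name gets an error instead of clobbering the first envelope.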

Delivery is at-least-once. The MCP agent source claims an envelope by moving it into a claimed/ subdirectory, emits the channel event to the model, then archives into delivered/ (src/crates/netsky-io/src/sources/agent.rs:15). A crash after claim but before archive leaves the envelope in claimed/. The next process adopts it. The shell drain path follows the same sequence.
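The claim, deliver, archive sequence can be sketched with plain renames. The function names below are hypothetical and the model-delivery step is left to the caller; only the claimed/ and delivered/ directory names come from the description above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Claim an envelope by moving it from inbox/ into inbox/claimed/.
/// rename is atomic within one filesystem, so a second claimer of the
/// same file gets an error instead of a duplicate.
pub fn claim(inbox: &Path, name: &str) -> io::Result<PathBuf> {
    let claimed_dir = inbox.join("claimed");
    fs::create_dir_all(&claimed_dir)?;
    let dst = claimed_dir.join(name);
    fs::rename(inbox.join(name), &dst)?;
    Ok(dst)
}

/// Archive a claimed envelope into inbox/delivered/ once the channel
/// event has been emitted to the model.
pub fn archive(inbox: &Path, claimed: &Path) -> io::Result<PathBuf> {
    let delivered_dir = inbox.join("delivered");
    fs::create_dir_all(&delivered_dir)?;
    let name = claimed
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no file name"))?;
    let dst = delivered_dir.join(name);
    fs::rename(claimed, &dst)?;
    Ok(dst)
}

/// On startup, list envelopes a previous process claimed but never
/// archived. Re-delivering these is what makes delivery at-least-once.
pub fn adopt_claimed(inbox: &Path) -> io::Result<Vec<PathBuf>> {
    let claimed_dir = inbox.join("claimed");
    if !claimed_dir.exists() {
        return Ok(Vec::new());
    }
    let mut pending = Vec::new();
    for entry in fs::read_dir(claimed_dir)? {
        let path = entry?.path();
        if path.is_file() {
            pending.push(path);
        }
    }
    pending.sort();
    Ok(pending)
}
```

Because an adopted envelope may already have reached the model before the crash, receivers need to tolerate duplicates. The idempotency_key field in the schema is the natural hook for that.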

sequenceDiagram
    participant A0 as agent0
    participant Bus as filesystem inbox
    participant AN as agentN
    participant Ack as channel-acks JSONL
    participant DB as meta.db

    A0->>Bus: write envelope to agentN inbox (tempfile + hardlink)
    Bus->>AN: claim pending JSON
    AN->>AN: validate from, ts, wrapper tokens
    AN->>AN: deliver channel text to model
    AN->>Bus: archive delivered envelope
    AN->>Bus: reply envelope to agent0
    Bus->>A0: claim and deliver response
    A0->>Ack: append received ack
    A0->>DB: record communication event

The bus refuses invalid agent ids, invalid RFC3339 timestamps, and body text containing channel wrapper tokens (f984ae5, src/crates/netsky-core/src/envelope.rs:135). The MCP reader repeats the validation before emitting to the model (src/crates/netsky-io/src/sources/agent.rs:275). Two checks: one at the writer edge, one at the reader edge.
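A rough sketch of the three checks follows. Everything specific in it is an assumption: the agent<N> id shape is inferred from the tmux session names (the real rule may accept other ids such as agentinfinity), the wrapper token list is a placeholder, and the timestamp check is a simplified shape test rather than a full RFC3339 parser.

```rust
/// Placeholder wrapper tokens. The real list is not shown in this document.
const WRAPPER_TOKENS: [&str; 2] = ["<channel", "</channel"];

#[derive(Debug, PartialEq)]
pub enum Reject {
    BadAgentId,
    BadTimestamp,
    WrapperToken,
}

/// Accepts ids shaped like `agent0` or `agent12` (assumed shape).
fn valid_agent_id(id: &str) -> bool {
    match id.strip_prefix("agent") {
        Some(n) => !n.is_empty() && n.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// Simplified shape check: `YYYY-MM-DDTHH:MM:SS`, an optional `.digits`
/// fraction, then `Z` or `+HH:MM` / `-HH:MM`. Field ranges are not checked.
fn looks_like_rfc3339(ts: &str) -> bool {
    let b = ts.as_bytes();
    if b.len() < 20 {
        return false;
    }
    // 'd' marks a digit position; every other byte must match literally.
    let pattern = b"dddd-dd-ddTdd:dd:dd";
    for (i, &p) in pattern.iter().enumerate() {
        let ok = if p == b'd' { b[i].is_ascii_digit() } else { b[i] == p };
        if !ok {
            return false;
        }
    }
    let mut rest = &b[19..];
    if rest.first() == Some(&b'.') {
        let digits = rest[1..].iter().take_while(|c| c.is_ascii_digit()).count();
        if digits == 0 {
            return false;
        }
        rest = &rest[1 + digits..];
    }
    match rest {
        [b'Z'] => true,
        [s, h1, h2, b':', m1, m2] => {
            (*s == b'+' || *s == b'-')
                && [h1, h2, m1, m2].iter().all(|c| c.is_ascii_digit())
        }
        _ => false,
    }
}

/// Run the same checks at the writer edge and again at the reader edge.
pub fn validate(from: &str, to: &str, ts: &str, body: &str) -> Result<(), Reject> {
    if !valid_agent_id(from) || !valid_agent_id(to) {
        return Err(Reject::BadAgentId);
    }
    if !looks_like_rfc3339(ts) {
        return Err(Reject::BadTimestamp);
    }
    if WRAPPER_TOKENS.iter().any(|t| body.contains(*t)) {
        return Err(Reject::WrapperToken);
    }
    Ok(())
}
```

Calling the same validate function from both the writer and the reader keeps the two edges from drifting apart.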

netsky channel watch <agent> --tmux <session> (602e8ab) drains an inbox, pastes each envelope into a tmux pane, writes a received ack to a JSONL ledger, and archives. That replaces the manual tmux-send-keys drain hack for Codex CLI resident clones.

the observability layer

Five signal sources, one reader.

  • watchdog events JSONL at ~/.netsky/logs/watchdog-events-<date>.jsonl (740e7da). Every pane-hash transition, suppression, hang marker, and page is written here first and to meta.db second.
  • channel acks JSONL at ~/.netsky/logs/channel-acks-<date>.jsonl (e31ee74). One line per received envelope.
  • meta.db communication events in the turso SQLite backend at ~/.netsky/meta.db (09844c1). Structured records for every send, drain, ack.
  • restart status JSON under ~/.netsky/state/restart-status/*.json (9ad3a67). One file per restart cycle with phase, exit, and errors.
  • escalate-failed markers under ~/.netsky/state/escalate-failed-<ts> (740e7da). Only written when both escalate attempts fail.

netsky events [--since] [--agent] [--kind] [--limit] [--json] (72c5ff1) merges all five into one chronological timeline. During a meta.db outage the other four sources still work. The merge reader skips unavailable backends gracefully.
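The merge-and-skip behaviour can be sketched as below. This is an illustration only: Event, merge_timeline, and the source names are hypothetical, and timestamps are assumed to be UTC RFC3339 strings of equal precision so that string order matches time order.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub ts: String, // assumed: UTC RFC3339, same precision everywhere
    pub source: &'static str,
    pub summary: String,
}

/// Merge every source that loaded into one timeline and report the
/// names of the sources that were skipped.
pub fn merge_timeline(
    sources: Vec<(&'static str, Result<Vec<Event>, String>)>,
) -> (Vec<Event>, Vec<&'static str>) {
    let mut events = Vec::new();
    let mut skipped = Vec::new();
    for (name, result) in sources {
        match result {
            Ok(mut batch) => events.append(&mut batch),
            // An unavailable backend costs its own rows, not the timeline.
            Err(_) => skipped.push(name),
        }
    }
    // Stable sort: events with equal timestamps keep their source order.
    events.sort_by(|a, b| a.ts.cmp(&b.ts));
    (events, skipped)
}
```

Returning the skipped names alongside the events lets a caller tell the operator that the timeline is partial rather than silently dropping a source.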

The observability backbone is JSONL-first, database-second. That is not a style choice. It is required for a system whose analytical backend can go down without the operational layer going down.

the storage backend

netsky-db records messages, ticks, sessions, workspaces, clone dispatches, harvests, tool calls, git operations, directives, token usage, and watchdog events (09844c1, src/crates/netsky-db/README.md:3). CLI invocations and crashes are in the schema at lines 29-42.

The backend is turso, a full Rust rewrite of SQLite, opened at ~/.netsky/meta.db with WAL mode and a busy timeout. The earlier redb backend produced “Database already open. Cannot acquire lock.” on every concurrent write. That warning appeared 762 times in a single 24-hour period before the swap. Zero since.

OLAP reads snapshot rows into Apache Arrow and DataFusion. netsky query 'SELECT ...' exposes raw SQL. The analytical path does not reopen the write connection; it reads a file-backed snapshot.

cross-machine

One viable system is one constellation. Two constellations pair root-to-root.

flowchart LR
    subgraph MachineA[machine A]
        A0[agent0]
        AIO[iroh source]
        ADB[(meta.db)]
        A0 <--> AIO
        A0 --> ADB
    end

    subgraph MachineB[machine B]
        B0[agent0]
        BIO[iroh source]
        BDB[(meta.db)]
        B0 <--> BIO
        B0 --> BDB
    end

    AIO <-->|QUIC + TLS 1.3, ALPN /netsky/agent/1| BIO

The iroh source mirrors the agent inbox framing across machines (c2cb0d9, src/crates/netsky-io/src/sources/iroh/mod.rs:1). The transport is QUIC + TLS 1.3 with ed25519 EndpointId authentication and a custom ALPN. The handler checks the remote EndpointId against the paired-peer allowlist before reading any payload byte (src/crates/netsky-io/src/sources/iroh/mod.rs:443), then overwrites the JSON from field with the verified peer label. Payload identity is not trusted.
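The gate-then-relabel order can be sketched without any iroh types. In this illustration the endpoint id is a plain string, Envelope carries only two fields, and accept and read_payload are hypothetical names. The real handler works on iroh connections.

```rust
use std::collections::HashMap;

pub struct Envelope {
    pub from: String,
    pub body: String,
}

/// `peers` maps an authenticated endpoint id to the local label for
/// that peer. `read_payload` stands in for reading and parsing the
/// stream; it is only invoked after the allowlist check passes.
pub fn accept<F>(
    peers: &HashMap<String, String>,
    remote_id: &str,
    read_payload: F,
) -> Option<Envelope>
where
    F: FnOnce() -> Envelope,
{
    // Gate first: an unknown peer is dropped before any payload is read.
    let label = peers.get(remote_id)?;

    let mut envelope = read_payload();

    // Whatever sender the payload claims, replace it with the label
    // bound to the verified endpoint id.
    envelope.from = label.clone();
    Some(envelope)
}
```

Taking the payload reader as a closure makes the ordering explicit: there is no code path that touches payload bytes for a peer outside the allowlist.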

Pairing currently uses an env var plus a CLI: netsky io iroh pair add|list|remove, with a companion netsky io iroh status --json for the operator view (c2cb0d9). The pair state lives at ~/.config/netsky-io/peers.toml.

Local agents on the originating constellation never talk to remote clones directly. A request that needs a remote clone routes through the remote root, which dispatches locally. That preserves each root’s authority over its own operations.

the lifecycle

Boot sequence on a clean machine is three steps: cargo install netsky (one binary published from netsky-cli), netsky io access ... to allowlist the owner channels, and netsky up N to start N clones plus agent0, agentinfinity, and netsky-ticker.

Restart is a durable protocol. netsky restart writes a handoff file under ~/.netsky/state/restart-status/ and kills the constellation; the watchdog then respawns it from the handoff (src/crates/netsky-cli/src/cmd/restart.rs). The ticker launchd plist is a failsafe if the tmux ticker session dies.

Shutdown is graceful by default. netsky down sends shutdown envelopes, waits up to 60 seconds for clean close or ack, and only then kills remaining sessions (ff9721f, src/crates/netsky-cli/src/cmd/down.rs:16). --force bypasses the preflight.
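The wait-then-kill step reduces to a bounded polling loop. The sketch below assumes the shutdown envelopes have already been sent. wait_for_close and the is_closed probe are hypothetical names, and the caller decides how a clean close or ack is detected.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll until every session reports closed or `deadline` elapses.
/// Returns the sessions that are still open and therefore need a kill.
pub fn wait_for_close<F>(
    sessions: &[String],
    deadline: Duration,
    poll: Duration,
    mut is_closed: F,
) -> Vec<String>
where
    F: FnMut(&str) -> bool,
{
    let start = Instant::now();
    let mut open: Vec<String> = sessions.to_vec();
    loop {
        // Drop every session that has closed or acked since the last poll.
        open.retain(|name| !is_closed(name.as_str()));
        if open.is_empty() || start.elapsed() >= deadline {
            return open;
        }
        thread::sleep(poll);
    }
}
```

With a 60 second deadline this matches the behaviour described above: sessions that close cleanly are never killed, and only the stragglers returned by the loop are.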

what this is, in a sentence

A tree of tmux sessions, a filesystem bus with an atomic write pattern, a turso-backed analytical store, a JSONL-first event trail, and a root-to-root QUIC path for cross-machine.

Every moving part is a file or a message. That is deliberate. Files and messages can be reasoned about, replayed, and gated. Prose cannot.


Part 2 of 3. Part 1: netsky: the cybernetics. Part 3: netsky: the code.