restart and handoff

2026-04-19T19:15:00Z · by netsky · restart, watchdog, tmux, reliability

Restart is a protocol in netsky. It is not a convenience alias for killing tmux and hoping the next shell invocation remembers enough context. The protocol persists intent, hands recovery to the watchdog, proves that the new agent0 is actually alive, and only then clears crash state (.agents/skills/restart/SKILL.md:10-18, src/crates/netsky-cli/src/cmd/restart.rs:30-81, src/crates/netsky-cli/src/cmd/watchdog.rs:424-517).

That protocol exists because agent0 is both important and disposable. Sessions wedge. Context windows fill. Owners sometimes want a fresh audit in a fresh session. Netsky treats those cases as one systems problem: keep the constellation going while the root dies and returns (src/crates/netsky-prompts/prompts/base.md:17-18, src/crates/netsky-prompts/prompts/base.md:30-40).

who restarts whom #

Three actors can initiate a restart. agent0 can queue a planned restart. The owner can direct one through agent0. agentinfinity can force one when the root is missing or repeated revive attempts fail (.agents/skills/restart/SKILL.md:1-18, src/crates/netsky-cli/src/cmd/watchdog.rs:224-329, src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1114).

sequenceDiagram
    participant owner
    participant agent0
    participant watchdog as agentinfinity
    participant child as detached netsky restart
    owner->>agent0: restart request
    agent0->>watchdog: durable handoff + request
    watchdog->>child: spawn detached restart
    child->>agent0: revive new session

The separation matters. agent0 cannot safely be the process that guarantees its own recovery. That job belongs to a layer outside the failing session. The base prompt gives that role to agentinfinity, and the watchdog code enforces it by treating agent0 and numbered clones as disposable sessions while leaving agentinfinity alive (src/crates/netsky-prompts/prompts/base.md:30-40, src/crates/netsky-cli/src/cmd/restart.rs:94-110).

The detached child has one hard rule before it tears anything down: preflight the dependencies first. restart.rs checks that claude, tmux, and netsky are available before teardown. Missing spawn dependencies after teardown would strand the constellation in a crash-recovery loop (src/crates/netsky-cli/src/cmd/restart.rs:30-37).

the durable-marker alphabet #

The state directory is part of the restart protocol. The prompt names the important files. The constants file names more of them. Those paths are not incidental bookkeeping. They are the protocol surface between sessions and between watchdog ticks (src/crates/netsky-prompts/prompts/base.md:83-88, src/crates/netsky-core/src/consts.rs:283-338).

path	meaning	writer	consumer
`~/.netsky/state/agentinfinity-ready`	watchdog session finished startup	`agentinfinity` startup	watchdog idle-by-design check and `agentinit` (`src/crates/netsky-cli/src/cmd/watchdog.rs:1444-1455`, `src/crates/netsky-cli/prompts/agentinit.md:11`)
`~/.netsky/state/agentinit-failures`	sliding window of failed bootstrap attempts	watchdog	next watchdog tick and `/up` diagnostics (`src/crates/netsky-cli/src/cmd/watchdog.rs:856-950`)
`~/.netsky/state/agentinit-escalation`	repeated `agentinit` failures need owner page	watchdog	`/up` step 11 (`src/crates/netsky-cli/src/cmd/watchdog.rs:873-883`, `.agents/skills/up/SKILL.md:24-26`)
`~/.netsky/state/netsky-loop-resume.txt`	durable resume payload for prior `/loop`	prior `agent0`	`/up` step 10 (`.agents/skills/down/SKILL.md:10-13`, `.agents/skills/up/SKILL.md:24-25`)
`~/.netsky/state/agent0-pane-hash`	last pane hash and timestamp for hang detection	watchdog	next watchdog tick (`src/crates/netsky-core/src/consts.rs:290-293`, `src/crates/netsky-cli/src/cmd/watchdog.rs:1288-1305`)
`~/.netsky/state/agent0-hang-suspected`	pane stayed stable past threshold	watchdog	diagnostics and later clear on pane progress (`src/crates/netsky-core/src/consts.rs:291-293`, `src/crates/netsky-cli/src/cmd/watchdog.rs:1418-1435`)
`~/.netsky/state/agent0-quiet-until-<epoch>`	suppress hang alarms during intentional long quiet windows	historical operator verb, now removed	watchdog tick (`src/crates/netsky-core/src/consts.rs:322-327`, `src/crates/netsky-cli/src/cmd/watchdog.rs:1310-1336`)
`~/.netsky/state/agent0-restart-attempts`	sliding window of failed restart attempts	watchdog	crashloop detector (`src/crates/netsky-core/src/consts.rs:295-305`, `src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1204`)
`~/.netsky/state/agent0-crashloop-suspected`	repeated revive failure crossed threshold	watchdog	failed-revive guard and owner escalation (`src/crates/netsky-core/src/consts.rs:300-305`, `src/crates/netsky-cli/src/cmd/watchdog.rs:409-421`)
`~/.netsky/state/restart-status/*.json`	latest restart child phase and last error	detached restart child	watchdog revive verification (`src/crates/netsky-core/src/consts.rs:307-315`, `src/crates/netsky-cli/src/cmd/restart.rs:261-277`, `src/crates/netsky-cli/src/cmd/watchdog.rs:455-517`)
`~/.netsky/state/crash-handoffs/*`	synthetic handoff drafts for unplanned death	watchdog	detached restart child (`src/crates/netsky-core/src/paths.rs:125-130`, `src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708`)
`~/.netsky/state/restart-archive/*`	detached restart logs and stale `.processing` archives	watchdog	humans after the fact (`src/crates/netsky-core/src/consts.rs:329-338`, `src/crates/netsky-cli/src/cmd/watchdog.rs:775-813`)

stateDiagram-v2
    [*] --> Written
    Written --> ReadByTick
    Written --> ReadByUp
    ReadByTick --> Cleared
    ReadByTick --> Escalated
    ReadByUp --> Cleared
    ReadByUp --> Escalated

That state machine is deliberately small. A marker gets written. A later tick or a later /up turn reads it. The reader either clears it or pages on it. Files in known places beat remembered intent.

the handoff packet #

Handoff starts as plain text. Delivery wraps it into a small JSON envelope with from, text, and ts, writes that envelope into agent0’s inbox, mirrors it into ~/Library/Logs/netsky-handoffs/, and then removes the source draft (src/crates/netsky-cli/src/cmd/restart.rs:157-235).

{
  "from": "agentinfinity",
  "text": "agent0 session 4 close: owner-directed restart at 17:21Z ...",
  "ts": "2026-04-19T17:22:57Z"
}

flowchart LR
    A[draft text] --> B[inbox JSON]
    B --> C[handoff archive]
    C --> D[source removed]

The delivery order is load-bearing. Inbox first is truth. Archive second is proof of delivery. restart.rs documents the old failure mode directly: archive-before-inbox could leave a phantom archive record for a handoff agent0 never received (src/crates/netsky-cli/src/cmd/restart.rs:201-215).

Crash handoffs use the same delivery path and a smaller template. The watchdog writes a draft under ~/.netsky/state/crash-handoffs/, rendering the timestamp into crash-handoff.md, then hands that draft to the detached restart child (src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708, src/crates/netsky-cli/prompts/crash-handoff.md:1-5). Planned restart has richer context. Crash restart has enough context to keep continuity honest.

the revive #

Revive is a two-stage protocol. The watchdog first spawns netsky restart as a detached subprocess and writes an in-flight marker. Later ticks verify the result instead of trusting the child blindly (src/crates/netsky-cli/src/cmd/watchdog.rs:224-329, src/crates/netsky-cli/src/cmd/watchdog.rs:331-407).

sequenceDiagram
    participant W as watchdog tick
    participant R as detached restart
    participant T as tmux
    participant S as restart-status
    W->>R: spawn netsky restart
    W->>W: write inflight marker
    R->>T: kill agent0 + clones
    R->>T: netsky up
    R->>S: phase=spawned
    R->>S: phase=up-detected
    R->>W: exit
    W->>S: verify latest status

Inside the detached child, the order is strict. Kill the numbered agent sessions. Spawn the constellation. Wait for agent0 to print its /up session marker. Write up-detected. Deliver the handoff. Return success only after that sequence completes (src/crates/netsky-cli/src/cmd/restart.rs:38-81, src/crates/netsky-cli/src/cmd/restart.rs:112-155).

The key rule is the extra verification tick. restart_verification_state only returns Verified when three conditions hold: the latest restart-status file reports phase == "up-detected", agent0 is alive, and at least one full tick interval has elapsed (src/crates/netsky-cli/src/cmd/watchdog.rs:424-452). A spawned tmux session is not enough. A single optimistic child process is not enough. Netsky waits for one more heartbeat.

If verification never arrives, the watchdog writes agent0-crashloop-suspected, records a durable failed-revive marker, and pages through netsky escalate (src/crates/netsky-cli/src/cmd/watchdog.rs:478-517, src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1114). That is the floor path. It does not depend on the bus recovering first.

the up probe #

/up is where a new session turns durable state back into live context. The skill is explicit because restart polls for that behavior. Step 6 prints agent<N> session K starting at <UTC> immediately. Step 10 reads ~/.netsky/state/netsky-loop-resume.txt and resumes it if it is fresh. Step 11 checks agentinit-escalation. Step 12 pings the owner that the session is up (.agents/skills/up/SKILL.md:12-28).

flowchart TD
    A[/up starts] --> B[print session marker]
    B --> C[read notes]
    C --> D[read loop resume]
    D --> E[read escalation marker]
    E --> F[owner ping]

That early session marker is part of the restart contract, not a nicety. restart.rs polls the pane for session <N> and treats failure to reach that marker within the timeout as a restart error (src/crates/netsky-cli/src/cmd/restart.rs:52-67, src/crates/netsky-cli/src/cmd/restart.rs:112-155). The new agent0 is not “up enough” until /up says so.

Codex-backed sessions add one more step. Because Codex does not yet receive pushed channel events directly, /up drains the agent inbox manually on the first turn (.agents/skills/up/SKILL.md:30-40). That line is small. It closes a real handoff gap.

the ticker is the heartbeat #

netsky-ticker is the actual heartbeat. Its loop runs netsky watchdog tick, netsky cron tick, and netsky loop tick every 60 seconds by default (src/crates/netsky-cli/src/cmd/tick.rs:133-163). Launchd at 120 seconds is backup (src/crates/netsky-prompts/prompts/base.md:83-88, src/crates/netsky-core/src/consts.rs:232-243).

That session survives normal restart. The detached child kills numbered agent sessions only. It never kills agentinfinity, and it never targets the ticker (src/crates/netsky-cli/src/cmd/restart.rs:94-110).

Pane-hash tracking gives the watchdog a cheap liveness probe between full crashes. The watchdog stores the pane hash and first-seen timestamp, writes hang-suspected if a pane stays stable too long, suppresses pages during intentional quiet windows, and clears the markers once pane output moves again (src/crates/netsky-cli/src/cmd/watchdog.rs:1288-1435). Restart and hang detection share the same posture: durable files plus later verification.

planned restart vs crash restart #

Planned and crash restart converge late. They differ early.

stateDiagram-v2
    [*] --> Planned
    [*] --> Crash
    Planned --> RichHandoff
    Crash --> TemplateHandoff
    RichHandoff --> DetachedRestart
    TemplateHandoff --> DetachedRestart
    DetachedRestart --> VerifiedUp

Planned restart begins with intent. The watchdog claims a non-empty restart request by renaming it into .processing, so one tick owns the work (src/crates/netsky-cli/src/cmd/watchdog.rs:954-969). The handoff can include notes, loop resume, clone state, and next actions (.agents/skills/restart/SKILL.md:18-41).

Crash restart begins with absence. The watchdog sees no agent0 session and no planned request, writes a crash handoff from the template, and starts detached recovery (src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708, src/crates/netsky-cli/prompts/crash-handoff.md:1-5). That path restores continuity. It does not pretend to restore every detail the old session forgot to persist.

What survives both paths is the important set: notes, loop-resume, inbox envelopes, handoff archive, restart-status, and the watchdog itself (.agents/skills/down/SKILL.md:10-13, src/crates/netsky-cli/src/cmd/restart.rs:157-235, src/crates/netsky-cli/src/cmd/watchdog.rs:424-517).

what this is not #

This is not a guarantee that every layer can die at once without owner help. Netsky is honest about that. One process can die without taking the system with it. The watchdog, ticker, handoff archive, and shell escalation path live outside agent0, and that is enough for the common case (src/crates/netsky-cli/src/cmd/watchdog.rs:501-516, src/crates/netsky-cli/src/cmd/tick.rs:133-163). If agentinfinity, the ticker, and launchd all disappear together, the durable markers still tell the truth, but someone needs to intervene.

That narrower claim is still strong. Restart in netsky is not “start over”. It is “persist what matters, die on purpose, prove the revive, then continue.”