restart and handoff
Restart is a protocol in netsky. It is not a convenience alias for killing tmux and hoping the next shell invocation remembers enough context. The protocol persists intent, hands recovery to the watchdog, proves that the new agent0 is actually alive, and only then clears crash state (.agents/skills/restart/SKILL.md:10-18, src/crates/netsky-cli/src/cmd/restart.rs:30-81, src/crates/netsky-cli/src/cmd/watchdog.rs:424-517).
That protocol exists because agent0 is both important and disposable. Sessions wedge. Context windows fill. Owners sometimes want a fresh audit in a fresh session. Netsky treats those cases as one systems problem: keep the constellation going while the root dies and returns (src/crates/netsky-prompts/prompts/base.md:17-18, src/crates/netsky-prompts/prompts/base.md:30-40).
who restarts whom #
Three actors can initiate a restart. agent0 can queue a planned restart. The owner can direct one through agent0. agentinfinity can force one when the root is missing or repeated revive attempts fail (.agents/skills/restart/SKILL.md:1-18, src/crates/netsky-cli/src/cmd/watchdog.rs:224-329, src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1114).
sequenceDiagram
participant owner
participant agent0
participant watchdog as agentinfinity
participant child as detached netsky restart
owner->>agent0: restart request
agent0->>watchdog: durable handoff + request
watchdog->>child: spawn detached restart
child->>agent0: revive new session
The separation matters. agent0 cannot safely be the process that guarantees its own recovery. That job belongs to a layer outside the failing session. The base prompt gives that role to agentinfinity, and the watchdog code enforces it by treating agent0 and numbered clones as disposable sessions while leaving agentinfinity alive (src/crates/netsky-prompts/prompts/base.md:30-40, src/crates/netsky-cli/src/cmd/restart.rs:94-110).
The detached child has one hard rule before it tears anything down: preflight the dependencies first. restart.rs checks that claude, tmux, and netsky are available before teardown. Missing spawn dependencies after teardown would strand the constellation in a crash-recovery loop (src/crates/netsky-cli/src/cmd/restart.rs:30-37).
the durable-marker alphabet #
The state directory is part of the restart protocol. The prompt names the important files. The constants file names more of them. Those paths are not incidental bookkeeping. They are the protocol surface between sessions and between watchdog ticks (src/crates/netsky-prompts/prompts/base.md:83-88, src/crates/netsky-core/src/consts.rs:283-338).
| path | meaning | writer | consumer |
|---|---|---|---|
~/.netsky/state/agentinfinity-ready | watchdog session finished startup | agentinfinity startup | watchdog idle-by-design check and agentinit (src/crates/netsky-cli/src/cmd/watchdog.rs:1444-1455, src/crates/netsky-cli/prompts/agentinit.md:11) |
~/.netsky/state/agentinit-failures | sliding window of failed bootstrap attempts | watchdog | next watchdog tick and /up diagnostics (src/crates/netsky-cli/src/cmd/watchdog.rs:856-950) |
~/.netsky/state/agentinit-escalation | repeated agentinit failures need owner page | watchdog | /up step 11 (src/crates/netsky-cli/src/cmd/watchdog.rs:873-883, .agents/skills/up/SKILL.md:24-26) |
~/.netsky/state/netsky-loop-resume.txt | durable resume payload for prior /loop | prior agent0 | /up step 10 (.agents/skills/down/SKILL.md:10-13, .agents/skills/up/SKILL.md:24-25) |
~/.netsky/state/agent0-pane-hash | last pane hash and timestamp for hang detection | watchdog | next watchdog tick (src/crates/netsky-core/src/consts.rs:290-293, src/crates/netsky-cli/src/cmd/watchdog.rs:1288-1305) |
~/.netsky/state/agent0-hang-suspected | pane stayed stable past threshold | watchdog | diagnostics and later clear on pane progress (src/crates/netsky-core/src/consts.rs:291-293, src/crates/netsky-cli/src/cmd/watchdog.rs:1418-1435) |
~/.netsky/state/agent0-quiet-until-<epoch> | suppress hang alarms during intentional long quiet windows | historical operator verb, now removed | watchdog tick (src/crates/netsky-core/src/consts.rs:322-327, src/crates/netsky-cli/src/cmd/watchdog.rs:1310-1336) |
~/.netsky/state/agent0-restart-attempts | sliding window of failed restart attempts | watchdog | crashloop detector (src/crates/netsky-core/src/consts.rs:295-305, src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1204) |
~/.netsky/state/agent0-crashloop-suspected | repeated revive failure crossed threshold | watchdog | failed-revive guard and owner escalation (src/crates/netsky-core/src/consts.rs:300-305, src/crates/netsky-cli/src/cmd/watchdog.rs:409-421) |
~/.netsky/state/restart-status/*.json | latest restart child phase and last error | detached restart child | watchdog revive verification (src/crates/netsky-core/src/consts.rs:307-315, src/crates/netsky-cli/src/cmd/restart.rs:261-277, src/crates/netsky-cli/src/cmd/watchdog.rs:455-517) |
~/.netsky/state/crash-handoffs/* | synthetic handoff drafts for unplanned death | watchdog | detached restart child (src/crates/netsky-core/src/paths.rs:125-130, src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708) |
~/.netsky/state/restart-archive/* | detached restart logs and stale .processing archives | watchdog | humans after the fact (src/crates/netsky-core/src/consts.rs:329-338, src/crates/netsky-cli/src/cmd/watchdog.rs:775-813) |
stateDiagram-v2
[*] --> Written
Written --> ReadByTick
Written --> ReadByUp
ReadByTick --> Cleared
ReadByTick --> Escalated
ReadByUp --> Cleared
ReadByUp --> Escalated
That state machine is deliberately small. A marker gets written. A later tick or a later /up turn reads it. The reader either clears it or pages on it. Files in known places beat remembered intent.
the handoff packet #
Handoff starts as plain text. Delivery wraps it into a small JSON envelope with from, text, and ts, writes that envelope into agent0’s inbox, mirrors it into ~/Library/Logs/netsky-handoffs/, and then removes the source draft (src/crates/netsky-cli/src/cmd/restart.rs:157-235).
{ "from": "agentinfinity", "text": "agent0 session 4 close: owner-directed restart at 17:21Z ...", "ts": "2026-04-19T17:22:57Z" }
flowchart LR
A[draft text] --> B[inbox JSON]
B --> C[handoff archive]
C --> D[source removed]
The delivery order is load-bearing. Inbox first is truth. Archive second is proof of delivery. restart.rs documents the old failure mode directly: archive-before-inbox could leave a phantom archive record for a handoff agent0 never received (src/crates/netsky-cli/src/cmd/restart.rs:201-215).
Crash handoffs use the same delivery path and a smaller template. The watchdog writes a draft under ~/.netsky/state/crash-handoffs/, rendering the timestamp into crash-handoff.md, then hands that draft to the detached restart child (src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708, src/crates/netsky-cli/prompts/crash-handoff.md:1-5). Planned restart has richer context. Crash restart has enough context to keep continuity honest.
the revive #
Revive is a two-stage protocol. The watchdog first spawns netsky restart as a detached subprocess and writes an in-flight marker. Later ticks verify the result instead of trusting the child blindly (src/crates/netsky-cli/src/cmd/watchdog.rs:224-329, src/crates/netsky-cli/src/cmd/watchdog.rs:331-407).
sequenceDiagram
participant W as watchdog tick
participant R as detached restart
participant T as tmux
participant S as restart-status
W->>R: spawn netsky restart
W->>W: write inflight marker
R->>T: kill agent0 + clones
R->>T: netsky up
R->>S: phase=spawned
R->>S: phase=up-detected
R->>W: exit
W->>S: verify latest status
Inside the detached child, the order is strict. Kill the numbered agent sessions. Spawn the constellation. Wait for agent0 to print its /up session marker. Write up-detected. Deliver the handoff. Return success only after that sequence completes (src/crates/netsky-cli/src/cmd/restart.rs:38-81, src/crates/netsky-cli/src/cmd/restart.rs:112-155).
The key rule is the extra verification tick. restart_verification_state only returns Verified when three conditions hold: the latest restart-status file reports phase == "up-detected", agent0 is alive, and at least one full tick interval has elapsed (src/crates/netsky-cli/src/cmd/watchdog.rs:424-452). A spawned tmux session is not enough. A single optimistic child process is not enough. Netsky waits for one more heartbeat.
If verification never arrives, the watchdog writes agent0-crashloop-suspected, records a durable failed-revive marker, and pages through netsky escalate (src/crates/netsky-cli/src/cmd/watchdog.rs:478-517, src/crates/netsky-cli/src/cmd/watchdog.rs:1053-1114). That is the floor path. It does not depend on the bus recovering first.
the up probe #
/up is where a new session turns durable state back into live context. The skill is explicit because restart polls for that behavior. Step 6 prints agent<N> session K starting at <UTC> immediately. Step 10 reads ~/.netsky/state/netsky-loop-resume.txt and resumes it if it is fresh. Step 11 checks agentinit-escalation. Step 12 pings the owner that the session is up (.agents/skills/up/SKILL.md:12-28).
flowchart TD
A[/up starts] --> B[print session marker]
B --> C[read notes]
C --> D[read loop resume]
D --> E[read escalation marker]
E --> F[owner ping]
That early session marker is part of the restart contract, not a nicety. restart.rs polls the pane for session <N> and treats failure to reach that marker within the timeout as a restart error (src/crates/netsky-cli/src/cmd/restart.rs:52-67, src/crates/netsky-cli/src/cmd/restart.rs:112-155). The new agent0 is not “up enough” until /up says so.
Codex-backed sessions add one more step. Because Codex does not yet receive pushed channel events directly, /up drains the agent inbox manually on the first turn (.agents/skills/up/SKILL.md:30-40). That line is small. It closes a real handoff gap.
the ticker is the heartbeat #
netsky-ticker is the actual heartbeat. Its loop runs netsky watchdog tick, netsky cron tick, and netsky loop tick every 60 seconds by default (src/crates/netsky-cli/src/cmd/tick.rs:133-163). Launchd at 120 seconds is backup (src/crates/netsky-prompts/prompts/base.md:83-88, src/crates/netsky-core/src/consts.rs:232-243).
That session survives normal restart. The detached child kills numbered agent sessions only. It never kills agentinfinity, and it never targets the ticker (src/crates/netsky-cli/src/cmd/restart.rs:94-110).
Pane-hash tracking gives the watchdog a cheap liveness probe between full crashes. The watchdog stores the pane hash and first-seen timestamp, writes hang-suspected if a pane stays stable too long, suppresses pages during intentional quiet windows, and clears the markers once pane output moves again (src/crates/netsky-cli/src/cmd/watchdog.rs:1288-1435). Restart and hang detection share the same posture: durable files plus later verification.
planned restart vs crash restart #
Planned and crash restart converge late. They differ early.
stateDiagram-v2
[*] --> Planned
[*] --> Crash
Planned --> RichHandoff
Crash --> TemplateHandoff
RichHandoff --> DetachedRestart
TemplateHandoff --> DetachedRestart
DetachedRestart --> VerifiedUp
Planned restart begins with intent. The watchdog claims a non-empty restart request by renaming it into .processing, so one tick owns the work (src/crates/netsky-cli/src/cmd/watchdog.rs:954-969). The handoff can include notes, loop resume, clone state, and next actions (.agents/skills/restart/SKILL.md:18-41).
Crash restart begins with absence. The watchdog sees no agent0 session and no planned request, writes a crash handoff from the template, and starts detached recovery (src/crates/netsky-cli/src/cmd/watchdog.rs:1692-1708, src/crates/netsky-cli/prompts/crash-handoff.md:1-5). That path restores continuity. It does not pretend to restore every detail the old session forgot to persist.
What survives both paths is the important set: notes, loop-resume, inbox envelopes, handoff archive, restart-status, and the watchdog itself (.agents/skills/down/SKILL.md:10-13, src/crates/netsky-cli/src/cmd/restart.rs:157-235, src/crates/netsky-cli/src/cmd/watchdog.rs:424-517).
what this is not #
This is not a guarantee that every layer can die at once without owner help. Netsky is honest about that. One process can die without taking the system with it. The watchdog, ticker, handoff archive, and shell escalation path live outside agent0, and that is enough for the common case (src/crates/netsky-cli/src/cmd/watchdog.rs:501-516, src/crates/netsky-cli/src/cmd/tick.rs:133-163). If agentinfinity, the ticker, and launchd all disappear together, the durable markers still tell the truth, but someone needs to intervene.
That narrower claim is still strong. Restart in netsky is not “start over”. It is “persist what matters, die on purpose, prove the revive, then continue.”