the quiet sentinel
Two false URGENT pages hit the owner on 2026-04-15. The regression comment is in the test block: one page during a planned 55-minute ScheduleWakeup at 10:34 UTC, one after an intentional /loop stop at 12:51 UTC (src/crates/netsky-cli/src/cmd/watchdog.rs:2281-2285). Commit b1b2a87 fixed it by adding the agent0-quiet-until-<epoch> sentinel contract.
two contracts, both right
The watchdog path is literal. If the pane hash stays unchanged for 1800 seconds, it writes agent0-hang-suspected unless a suppression rule fires (src/crates/netsky-cli/src/cmd/watchdog.rs:1290-1317, src/crates/netsky-core/src/consts.rs:270, src/crates/netsky-core/src/consts.rs:300-305).
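A minimal sketch of that rule, not the code in watchdog.rs. The enum, the function name, and the choice of `>=` at the boundary are assumptions made for illustration; only the 1800-second figure comes from the text above.

```rust
/// Stale-pane threshold in seconds (the 1800 quoted above).
const HANG_THRESHOLD_S: u64 = 1800;

/// Hypothetical outcome of one watchdog tick.
#[derive(Debug, PartialEq, Eq)]
enum HangDecision {
    /// Pane changed recently enough; nothing to do.
    Healthy,
    /// Pane is stale, but a suppression rule fired.
    Suppressed,
    /// Pane is stale and nothing vouches for it: write agent0-hang-suspected.
    HangSuspected,
}

/// Hypothetical decision function: `unchanged_for_s` is how long the pane
/// hash has stayed the same, `suppressed` is whether any suppression rule fired.
fn decide(unchanged_for_s: u64, suppressed: bool) -> HangDecision {
    if unchanged_for_s < HANG_THRESHOLD_S {
        HangDecision::Healthy
    } else if suppressed {
        HangDecision::Suppressed
    } else {
        HangDecision::HangSuspected
    }
}
```

Note that the suppression flag only matters once the pane is already stale. The detector stays literal; it just gets one more input.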
/loop is an Anthropic harness built-in. Its ScheduleWakeup primitive puts the session to sleep for up to 3600 seconds between cadence ticks. Sleeping past 300s gives up the prompt-cache TTL, so the economical sleeps are either short (sub-270s) or long (1200s+). Long naps are how agent0 paces an overnight watch without burning the cache twelve times per hour for nothing.
Both contracts are right on their own. They overlap from 1800s to 3600s – precisely the window in which a legitimate nap looks identical to a dead session. The hang detector was firing correctly given what it could see. It just could not see the other half of the story.
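The overlap is plain arithmetic. A small sketch, with hypothetical constant and function names, of which nap lengths land in the ambiguous window:

```rust
/// Watchdog stale-pane threshold, in seconds.
const HANG_THRESHOLD_S: u64 = 1800;
/// Longest sleep ScheduleWakeup allows, in seconds.
const MAX_WAKEUP_S: u64 = 3600;

/// True when a planned nap is long enough to trip the hang detector
/// while still being a legal ScheduleWakeup duration.
fn nap_trips_detector(nap_s: u64) -> bool {
    (HANG_THRESHOLD_S..=MAX_WAKEUP_S).contains(&nap_s)
}
```

A 55-minute nap is 3300 seconds, inside the window. A 1200-second nap is cache-economical and still invisible to the detector.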
what a fix looks like
The honest fix is not a smarter heuristic. No cleverer pane-hashing would have helped: a napping agent0 and a wedged agent0 are pixel-identical. The fix is to give the detector the one piece of information the agent already had – that it intended to be quiet.
Shipped today as b1b2a87:
~/.netsky/state/agent0-quiet-until-<epoch>
The sentinel filename stores now + quiet_window in Unix seconds. The file body is human-readable debug text. The watchdog uses the filename, not the body.
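The write side can be sketched in a few lines. This is an illustration under assumptions, not the writer that shipped: the function names, the `reason` parameter, and the body format are all hypothetical. Only the filename shape and the "name carries the data, body is debug text" rule come from the text above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

const SENTINEL_PREFIX: &str = "agent0-quiet-until-";

/// Build the sentinel filename: prefix plus (now + quiet window) in Unix seconds.
fn sentinel_name(now_s: u64, quiet_window_s: u64) -> String {
    format!("{}{}", SENTINEL_PREFIX, now_s + quiet_window_s)
}

/// Hypothetical writer: creates the state directory if needed and drops the
/// sentinel file. The body is for humans; readers should only parse the name.
fn write_sentinel(
    state_dir: &Path,
    now_s: u64,
    quiet_window_s: u64,
    reason: &str,
) -> io::Result<PathBuf> {
    fs::create_dir_all(state_dir)?;
    let path = state_dir.join(sentinel_name(now_s, quiet_window_s));
    let body = format!("quiet until epoch {}: {}\n", now_s + quiet_window_s, reason);
    fs::write(&path, body)?;
    Ok(path)
}
```

Putting the epoch in the name means the reader never has to open the file, so a half-written body cannot confuse it.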
The watchdog checks quiet_sentinel_status(&state_dir(), now_s) before it writes hang-suspected. If any sentinel carries a future epoch, it logs "hang-suspected suppressed by quiet sentinel" and returns early (src/crates/netsky-cli/src/cmd/watchdog.rs:1293-1317). Past-epoch files are reaped on every tick, not only on the hang path (src/crates/netsky-cli/src/cmd/watchdog.rs:724-755).
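A hedged sketch of the read side, to make the two behaviours concrete. It is not the body of quiet_sentinel_status: the function names here are hypothetical, and two details are assumptions, namely that an epoch equal to now counts as past and that a missing directory means no sentinel.

```rust
use std::fs;
use std::path::Path;

const SENTINEL_PREFIX: &str = "agent0-quiet-until-";

/// Parse the epoch out of a sentinel filename; anything else yields None.
fn sentinel_epoch(file_name: &str) -> Option<u64> {
    file_name.strip_prefix(SENTINEL_PREFIX)?.parse().ok()
}

/// Latest future epoch among the sentinels in `state_dir`, if any.
/// Unreadable or missing directories are treated as "no sentinel".
fn quiet_until(state_dir: &Path, now_s: u64) -> Option<u64> {
    let entries = fs::read_dir(state_dir).ok()?;
    entries
        .filter_map(|e| e.ok())
        .filter_map(|e| sentinel_epoch(&e.file_name().to_string_lossy()))
        .filter(|&epoch| epoch > now_s)
        .max()
}

/// Remove sentinels whose epoch has passed; returns how many were deleted.
/// Malformed and unrelated files are left alone.
fn reap_expired(state_dir: &Path, now_s: u64) -> usize {
    let entries = match fs::read_dir(state_dir) {
        Ok(entries) => entries,
        Err(_) => return 0,
    };
    let mut reaped = 0;
    for entry in entries.filter_map(|e| e.ok()) {
        let name = entry.file_name();
        if let Some(epoch) = sentinel_epoch(&name.to_string_lossy()) {
            if epoch <= now_s && fs::remove_file(entry.path()).is_ok() {
                reaped += 1;
            }
        }
    }
    reaped
}
```

Reaping is kept separate from the status check so it can run on every tick, whether or not the hang path is reached.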
sequenceDiagram
    participant A0 as agent0
    participant FS as ~/.netsky/state
    participant W as agentinfinity
    A0->>FS: write agent0-quiet-until-<epoch>
    Note over FS: agent0-quiet-until-<epoch> written
    Note over A0: ScheduleWakeup 3300s
    loop every 60s
        W->>FS: readdir(state)
        FS-->>W: scan agent0-quiet-until-*
        alt max epoch > now
            W->>W: skip hang-suspected write
        else all epochs past
            W->>FS: rm stale file(s)
            W->>W: detect hang normally
        end
    end
Same shape as the permissions watcher: one writer, one reader, a known directory. No daemon, no broker, no IPC. The filesystem is the coordination layer.
codify-as-code
Before this fix, the rule was prose. “Agent0 should probably not be in a long ScheduleWakeup right before the watchdog fires.” A line to that effect had been sitting in the base prompt for weeks, and agent0 ignored it twice today.
A prose rule is a suggestion. The new gate is a file-backed contract. The temporary netsky quiet and netsky nap shell verbs were later removed when ScheduleWakeup stopped being a valid caller.
The test matrix is concrete: absent, future, past, mixed, malformed suffix, unrelated files, missing directory, stale reap, and no-op reap (src/crates/netsky-cli/src/cmd/watchdog.rs:2291-2433). If a future edit breaks the read path, cargo test -p netsky quiet_sentinel fails before the phone does.
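To show the shape of such a matrix, here is a table-driven sketch over bare filenames. It is an illustration, not the test block in watchdog.rs: the `Quiet` enum, `classify`, and the case values are hypothetical, and it skips the filesystem cases (missing directory, reaping) to stay pure.

```rust
/// Hypothetical three-way status for a directory listing.
#[derive(Debug, PartialEq, Eq)]
enum Quiet {
    /// No well-formed sentinel at all.
    Absent,
    /// At least one future epoch; carries the latest.
    Active(u64),
    /// Sentinels exist, but every epoch is in the past.
    Expired,
}

/// Classify a listing of filenames against the current time.
fn classify(names: &[&str], now_s: u64) -> Quiet {
    let epochs: Vec<u64> = names
        .iter()
        .filter_map(|n| n.strip_prefix("agent0-quiet-until-"))
        .filter_map(|s| s.parse().ok())
        .collect();
    match epochs.iter().copied().filter(|&e| e > now_s).max() {
        Some(latest) => Quiet::Active(latest),
        None if epochs.is_empty() => Quiet::Absent,
        None => Quiet::Expired,
    }
}

#[test]
fn quiet_sentinel_matrix() {
    let now = 100;
    let cases: Vec<(&str, Vec<&str>, Quiet)> = vec![
        ("absent", vec![], Quiet::Absent),
        ("future", vec!["agent0-quiet-until-200"], Quiet::Active(200)),
        ("past", vec!["agent0-quiet-until-50"], Quiet::Expired),
        (
            "mixed",
            vec!["agent0-quiet-until-50", "agent0-quiet-until-200"],
            Quiet::Active(200),
        ),
        ("malformed suffix", vec!["agent0-quiet-until-soon"], Quiet::Absent),
        ("unrelated files", vec!["agent0-hang-suspected"], Quiet::Absent),
    ];
    for (label, names, expected) in cases {
        assert_eq!(classify(&names, now), expected, "case: {}", label);
    }
}
```

Each row names one way the directory can look. A regression in the read path fails a labelled row instead of paging anyone.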
That is codify-as-code. When a failure mode is drift-shaped – “remember to do X” – prose is not enough. You write the gate. You write the test. You delete the hope.
small pieces, known places
The read-side suppression and reap path lives in src/crates/netsky-cli/src/cmd/watchdog.rs. git show --stat b1b2a87 reports 284 insertions across 7 files. The temporary CLI writer from that patch was removed later during CLI consolidation.
Small file-based coordination primitives keep adding up. A file per agent for the bus. A file per tick for the watchdog. A file in state/ to park a planned restart. Now a file with an epoch in its name to announce a nap. Every time the obvious move looked like a new process or a new protocol, and every time a single file in a known directory was enough. The viable system keeps answering “just write a file” and keeps being right.
The next page the owner gets will be a real one.