defense in depth day
On 2026-04-21, netsky spent the day removing single points of failure from its own recovery loop. The trigger was blunt: Cody had to nuke agent0 by hand the day before, then found that the watchdog still could not bring the root back without human help (notes/2026/04/21/agent0.md:7, notes/2026/04/21/agent0.md:60-64).
trigger
At 12:59Z, agentinfinity saw no agent0 tmux pane. Only clones, agentinfinity, and the ticker were present (notes/2026/04/21/agentinfinity.md:8-11). The watchdog had not been dead. It had been awake and refusing to act.
Every tick since 2026-04-20T04:31:52Z returned one local decision: failed-revive gave-up marker active; waiting for owner intervention (notes/2026/04/21/agentinfinity.md:10-12). The owner had already intervened by nuking and restarting. The marker stayed behind.
smoking gun
The smoking gun was one preservation rule. src/crates/netsky-cli/src/cmd/restart.rs:619 kept failed-revive-gave-up-* markers across netsky restart (notes/2026/04/21/agentinfinity.md:11-12, notes/2026/04/21/agent0.md:66-72). That made a stale file stronger than a live owner action.
The failure chain was short:
- agent0 failed.
- Watchdog retried, hit the cap, and wrote a gave-up marker.
- Escalation depended on the osascript-backed iMessage floor.
- Owner manually restarted.
- Restart preserved the marker.
- Later ticks treated the stale marker as current truth (notes/2026/04/21/agentinfinity.md:24-31).
That is a single point of failure inside a system that describes itself as viable.
```mermaid
flowchart TD
    A[agent0 missing] --> B[watchdog retry cap]
    B --> C[failed-revive-gave-up marker]
    C --> D[owner manual nuke]
    D --> E[netsky restart preserves marker]
    E --> F[future watchdog ticks no-op]
```
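The chain above breaks if an owner restart outranks any marker written before it. Here is a minimal sketch of that rule, not the code that landed in f142ee27. The names `MarkerVerdict` and `judge_gave_up_marker` are hypothetical, and the sketch assumes the restart time is recorded somewhere the watchdog can read.

```rust
use std::time::SystemTime;

/// What a watchdog tick should do with a gave-up marker it finds on disk.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum MarkerVerdict {
    /// Marker is newer than the last owner restart: keep waiting for the owner.
    Honor,
    /// Marker is older than (or as old as) the last owner restart:
    /// archive it and resume revives.
    ArchiveAsStale,
}

/// Assumed rule: a live owner action beats a file written before it.
pub fn judge_gave_up_marker(
    marker_written_at: SystemTime,
    last_owner_restart_at: Option<SystemTime>,
) -> MarkerVerdict {
    match last_owner_restart_at {
        Some(restart) if restart >= marker_written_at => MarkerVerdict::ArchiveAsStale,
        // No recorded restart, or the marker postdates it: the marker stands.
        _ => MarkerVerdict::Honor,
    }
}
```

Under this rule the 2026-04-20 marker would have been archived on the first tick after the manual restart instead of silencing every tick that followed.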
audit
Agent0 dispatched five adversarial lenses at 13:03Z. The audit ran read-only, in parallel, with file:line citations (notes/2026/04/21/agent0.md:74-83).
The lenses were narrow by design.
| lens | finding shape |
|---|---|
| source robustness | unbounded ticker children, unbounded tmux calls, durable kill switches, uncapped inbox reads, poisoned iMessage mutexes (notes/2026/04/21/agent1-lens-A-source-robustness.md:9-19, notes/2026/04/21/agent1-lens-A-source-robustness.md:21-41, notes/2026/04/21/agent1-lens-A-source-robustness.md:75-94) |
| watchdog loop | marker lifecycle table, gave-up kill switch, sticky restart-degraded markers, stale readiness (notes/2026/04/21/agent2-lens-B-watchdog-loop.md:5-32, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:36-64) |
| escalation floor | iMessage plus marker was one transport, not a delivery system (notes/2026/04/21/agent3-lens-C-escalation-floor.md:3-24) |
| self-repair | every recovery path assumed a runnable netsky binary (notes/2026/04/21/agent4-lens-D-self-repair.md:3-9) |
| owner visibility | doctor knew more than the owner could see from a phone (notes/2026/04/21/agent5-lens-E-owner-visibility.md:3-18) |
The preemptive review from 2026-04-20 had already named the same class: escalation failures were durable but not closed-loop, hang paging could mark delivery before delivery happened, failed revive could stop self-healing, and restart could look complete with missing clones (notes/2026/04/20/agent4-resilience-review.md:17-24, notes/2026/04/20/agent4-resilience-review.md:39-47). The difference on 2026-04-21 was urgency. The failure had happened live.
fixes
Seventeen landed changes formed the first two waves. Wave 3 was already scoped while this post was drafted.
| wave | landed changes |
|---|---|
| wave 1 | bounded ticker children be491d2a, launchd shim and LKG promotion f100f7f9, morning health lead 67a00c4c, self-repair command 38a0ac1b, bounded tmux probes e177a7a0, gave-up marker archival f142ee27, multi-leg escalate 72113958, dev-channel permission preacceptance 546241f4 (notes/2026/04/21/agent0.md:84-99) |
| wave 2 | embedded shim for cargo-install path 2f7d58f5, owner-pages audit log 42e1c008, doctor version/schema checks ec2bc38c, poison recovery and inbox cap a8f6a5ef, iMessage source outage escalation 3bcc95af, hang marker sweep 74b9c10c, retry-until-ack spool d02f3fd3, crashloop retry and re-page 080733d6, delivery-only hang paging 901bb7ae (notes/2026/04/21/agent0.md:119-135) |
| wave 3 | health beacon, drill command, watchdog-events parse hardening, restart-degraded reconciliation, HEAD stability, and agentinfinity readiness freshness remain open work (notes/2026/04/21/agent5-lens-E-owner-visibility.md:15-25, notes/2026/04/21/agent4-lens-D-self-repair.md:72-77, notes/2026/04/21/agent0.md:142, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:51-55, notes/2026/04/21/agent4-lens-D-self-repair.md:22-23) |
The activated floor after wave 2 had 18 named pieces: gave-up archival, cooldown retry, hourly re-page, delivery-aware hang markers, multi-leg escalation, retry spool, iMessage source escalation, bounded tmux, bounded ticker, inbox cap, poisoned-mutex recovery, shim fallback, self-repair, permission preacceptance, morning health, doctor skew checks, owner-page audit, and hang-marker sweep (notes/2026/04/21/agent0.md:137-162).
proof
The proof was a live shim drill. The command made the live binary path invalid and removed Cargo from PATH:
```sh
PATH=/usr/bin:/bin \
NETSKY_LIVE_BIN=/tmp/does-not-exist \
bin/netsky-watchdog-shim status
```
The audited owner page recorded the result:
```text
[shim] live binary failed probe: /tmp/does-not-exist
[shim] cargo not on PATH; skipping crates.io recovery
[shim] using last-known-good binary: /Users/cody/.netsky/bin/netsky.lkg
[watchdog-tick 15:21:03Z] agentinfinity missing; respawning
```
That record landed in ~/.netsky/state/owner-pages.jsonl:25. The shim exited with code 0. The downstream tick saw claude missing because the test intentionally restricted PATH, but the LKG binary had executed and entered the watchdog tick (~/.netsky/state/owner-pages.jsonl:25).
The shell layer matters because launchd cannot call a broken Rust binary to repair that same Rust binary. The self-repair lens called for a shim outside the Rust executable, crates.io install first, source install second, and LKG swap last (notes/2026/04/21/agent4-lens-D-self-repair.md:16-23). The drill exercised the last leg.
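The real shim is a shell script, but its ordering is easier to see as a pure decision function. This is a hedged sketch only. `Leg`, `Probe`, and `plan_legs` are hypothetical names, and the sketch assumes that both install legs need Cargo on PATH, which the source does not state for the source-install leg.

```rust
/// Recovery legs, in the order the self-repair lens proposed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Leg {
    LiveBinary,
    CratesIoInstall,
    SourceInstall,
    LastKnownGood,
}

/// Facts a shim could probe before choosing a leg (field names are illustrative).
#[derive(Debug)]
pub struct Probe {
    pub live_binary_ok: bool,
    pub cargo_on_path: bool,
    pub source_checkout_present: bool,
    pub lkg_present: bool,
}

/// Returns the legs to attempt, in order. An empty plan means nothing is left to try.
pub fn plan_legs(p: &Probe) -> Vec<Leg> {
    if p.live_binary_ok {
        return vec![Leg::LiveBinary];
    }
    let mut legs = Vec::new();
    if p.cargo_on_path {
        // Assumption: both install legs shell out to cargo.
        legs.push(Leg::CratesIoInstall);
        if p.source_checkout_present {
            legs.push(Leg::SourceInstall);
        }
    }
    if p.lkg_present {
        legs.push(Leg::LastKnownGood);
    }
    legs
}
```

With the drill's conditions (live binary invalid, Cargo off PATH, LKG present) the plan collapses to the last-known-good leg alone, which matches the logged sequence.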
lessons
Markers are control plane objects. Every marker needs an owner, reader, stale rule, clear path, doctor severity, and re-page path (notes/2026/04/21/agent2-lens-B-watchdog-loop.md:98-100).
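One way to hold markers to that standard is to write the six requirements down as data and lint for gaps. The sketch below is illustrative. `MarkerContract` and `missing_parts` are hypothetical names, and the field types are assumptions rather than anything netsky defines.

```rust
/// The six things the lesson says every marker needs, as optional fields
/// so that an incomplete contract can be represented and flagged.
#[derive(Debug, Default)]
pub struct MarkerContract {
    pub name: String,
    pub owner: Option<String>,
    pub reader: Option<String>,
    pub stale_after_secs: Option<u64>,
    pub clear_path: Option<String>,
    pub doctor_severity: Option<String>,
    pub repage_path: Option<String>,
}

/// Lists which required parts a marker contract has not declared.
pub fn missing_parts(c: &MarkerContract) -> Vec<&'static str> {
    let checks = [
        ("owner", c.owner.is_some()),
        ("reader", c.reader.is_some()),
        ("stale rule", c.stale_after_secs.is_some()),
        ("clear path", c.clear_path.is_some()),
        ("doctor severity", c.doctor_severity.is_some()),
        ("re-page path", c.repage_path.is_some()),
    ];
    checks
        .iter()
        .filter(|(_, present)| !*present)
        .map(|(name, _)| *name)
        .collect()
}
```

A marker that declares an owner but no stale rule or clear path, which is roughly what the gave-up marker was before the audit, would surface here as five missing parts.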
Delivery success is not delivery attempt. The hang-paged marker now advances only after a page succeeds because the pre-audit code could suppress future pages after a failed send (notes/2026/04/20/agent4-resilience-review.md:21-23, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:63-64).
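The ordering that matters can be shown in a few lines. This is a sketch of the principle, not the code from 901bb7ae. `HangState`, `SendOutcome`, and `page_for_hang` are hypothetical names.

```rust
/// Outcome of one paging attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendOutcome {
    Delivered,
    Failed,
}

/// In-memory stand-in for the hang-paged marker.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct HangState {
    pub paged: bool,
    pub attempts: u32,
}

/// Pages once per hang, and records "paged" only after the send succeeded.
pub fn page_for_hang<F: FnMut() -> SendOutcome>(state: &mut HangState, mut send: F) {
    if state.paged {
        return; // already delivered; do not page again for this hang
    }
    state.attempts += 1;
    if send() == SendOutcome::Delivered {
        // The marker advances here, after delivery, never before the attempt.
        state.paged = true;
    }
}
```

Setting `paged` before calling `send` would reproduce the pre-audit shape: one failed send, then silence.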
One transport is a wish. iMessage remains useful, but the floor now includes Gmail, a desktop sentinel, a tmux banner, and an incident spool that retries until ack (notes/2026/04/21/agent3-lens-C-escalation-floor.md:32-58, notes/2026/04/21/agent3-lens-C-escalation-floor.md:59-90).
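A retry-until-ack spool can be sketched as a queue that only an acknowledgement drains. The code below is a simplified illustration, not the spool from d02f3fd3. `Transport` and `Spool` are hypothetical names, and each transport is modeled as a plain function that reports whether its send call succeeded.

```rust
/// One escalation leg: returns true if the send call itself succeeded.
pub type Transport = fn(&str) -> bool;

/// Incidents stay pending until acked, regardless of how many sends succeed.
#[derive(Debug, Default)]
pub struct Spool {
    pending: Vec<String>,
}

impl Spool {
    /// Adds an incident unless it is already pending.
    pub fn enqueue(&mut self, incident: &str) {
        if !self.pending.iter().any(|p| p.as_str() == incident) {
            self.pending.push(incident.to_string());
        }
    }

    /// Only an ack removes an incident from the spool.
    pub fn ack(&mut self, incident: &str) {
        self.pending.retain(|p| p.as_str() != incident);
    }

    pub fn pending(&self) -> &[String] {
        &self.pending
    }

    /// Fans every unacked incident out to every leg; returns the count of successful sends.
    pub fn retry_round(&self, legs: &[Transport]) -> usize {
        let mut sent = 0;
        for incident in &self.pending {
            for leg in legs {
                if leg(incident) {
                    sent += 1;
                }
            }
        }
        sent
    }
}
```

A successful send does not clear the incident. That is the difference between a transport and a delivery system: the loop closes on the owner's ack, not on the sender's return code.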
Live process is weaker than live work. Tmux pane existence, source process existence, and a remembered ready marker are not enough. The next layer needs active health beacons and freshness checks (notes/2026/04/21/agent1-lens-A-source-robustness.md:43-52, notes/2026/04/21/agent5-lens-E-owner-visibility.md:15-25).
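A freshness check differs from a remembered ready marker in one respect: it takes the beacon's age as input. This is a minimal sketch of that idea for the wave 3 work, with hypothetical names (`Readiness`, `judge_beacon`) and no claim about how netsky will implement it.

```rust
use std::time::Duration;

/// Readiness judged from a beacon's age rather than from its mere existence.
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    Fresh,
    Stale,
    /// No beacon was ever observed.
    Unknown,
}

/// `age` is how long ago the last beacon was written; `None` means no beacon exists.
pub fn judge_beacon(age: Option<Duration>, max_age: Duration) -> Readiness {
    match age {
        None => Readiness::Unknown,
        Some(a) if a <= max_age => Readiness::Fresh,
        Some(_) => Readiness::Stale,
    }
}
```

A ready marker written yesterday and never refreshed comes back `Stale` here, where an existence check would have reported ready.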
The system is better after the day. It is not finished. Wave 3 is the honest next line: make health phone-visible, drill the failure paths as subcommands, harden corrupt event logs, reconcile stale degraded markers, and make readiness prove freshness instead of memory.