defense in depth day

On 2026-04-21, netsky spent the day removing single points of failure from its own recovery loop. The trigger was blunt: Cody had to manually nuke agent0 the day before, then found the watchdog still unable to bring the root back without human help (notes/2026/04/21/agent0.md:7, notes/2026/04/21/agent0.md:60-64).

trigger

At 12:59Z, agentinfinity saw no agent0 tmux pane. Only clones, agentinfinity, and the ticker were present (notes/2026/04/21/agentinfinity.md:8-11). The watchdog had not been dead. It had been awake and refusing to act.

Every tick since 2026-04-20T04:31:52Z returned one local decision: failed-revive gave-up marker active; waiting for owner intervention (notes/2026/04/21/agentinfinity.md:10-12). The owner had already intervened by nuking and restarting. The marker stayed behind.

smoking gun

The smoking gun was one preservation rule. src/crates/netsky-cli/src/cmd/restart.rs:619 kept failed-revive-gave-up-* markers across netsky restart (notes/2026/04/21/agentinfinity.md:11-12, notes/2026/04/21/agent0.md:66-72). That made a stale file stronger than a live owner action.

The failure chain was short:

  1. agent0 failed.
  2. Watchdog retried, hit the cap, and wrote a gave-up marker.
  3. Escalation depended on the osascript-backed iMessage floor.
  4. Owner manually restarted.
  5. Restart preserved the marker.
  6. Later ticks treated the stale marker as current truth (notes/2026/04/21/agentinfinity.md:24-31).

That is a single point of failure inside a system that describes itself as viable.

flowchart TD
    A[agent0 missing] --> B[watchdog retry cap]
    B --> C[failed-revive-gave-up marker]
    C --> D[owner manual nuke]
    D --> E[netsky restart preserves marker]
    E --> F[future watchdog ticks no-op]
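The chain breaks if a tick compares the marker against the last owner action instead of trusting the file's existence. A minimal sketch of that rule, assuming the tick can read both timestamps (for example from file mtimes). `MarkerVerdict` and `judge_gave_up_marker` are hypothetical names, not the code that landed in netsky.

```rust
use std::time::SystemTime;

/// What a watchdog tick should do with a gave-up marker (hypothetical enum).
#[derive(Debug, PartialEq)]
pub enum MarkerVerdict {
    /// The marker is newer than any owner restart: keep waiting for the owner.
    Honor,
    /// The owner restarted after the marker was written: archive it and resume.
    ArchiveAndResume,
}

/// Assumption: `marker_written` and `last_owner_restart` are both known to the tick.
pub fn judge_gave_up_marker(
    marker_written: SystemTime,
    last_owner_restart: Option<SystemTime>,
) -> MarkerVerdict {
    match last_owner_restart {
        // A restart at or after the marker means the owner has already intervened.
        Some(restart) if restart >= marker_written => MarkerVerdict::ArchiveAndResume,
        // No restart on record, or the restart predates the marker.
        _ => MarkerVerdict::Honor,
    }
}
```

Under this rule a live owner action always outranks a stale file, which is the inversion the incident needed.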

audit

Agent0 dispatched five adversarial lenses at 13:03Z. The audit ran read-only, in parallel, with file:line citations (notes/2026/04/21/agent0.md:74-83).

The lenses were narrow by design.

| lens | finding shape |
| --- | --- |
| source robustness | unbounded ticker children, unbounded tmux calls, durable kill switches, uncapped inbox reads, poisoned iMessage mutexes (notes/2026/04/21/agent1-lens-A-source-robustness.md:9-19, notes/2026/04/21/agent1-lens-A-source-robustness.md:21-41, notes/2026/04/21/agent1-lens-A-source-robustness.md:75-94) |
| watchdog loop | marker lifecycle table, gave-up kill switch, sticky restart-degraded markers, stale readiness (notes/2026/04/21/agent2-lens-B-watchdog-loop.md:5-32, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:36-64) |
| escalation floor | iMessage plus marker was one transport, not a delivery system (notes/2026/04/21/agent3-lens-C-escalation-floor.md:3-24) |
| self-repair | every recovery path assumed a runnable netsky binary (notes/2026/04/21/agent4-lens-D-self-repair.md:3-9) |
| owner visibility | doctor knew more than the owner could see from a phone (notes/2026/04/21/agent5-lens-E-owner-visibility.md:3-18) |
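Two of the source-robustness findings, uncapped inbox reads and poisoned mutexes, have standard-library remedies. The sketch below shows one plausible shape for each. `read_capped` and `lock_recovering` are illustrative names, and the poison recovery assumes the guarded state is still usable after a panic; neither is taken from the netsky source.

```rust
use std::io::{self, Read};
use std::sync::{Mutex, MutexGuard};

/// Read at most `cap` bytes from `source`; bytes past the cap are left unread.
pub fn read_capped<R: Read>(source: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `Read::take` wraps the reader so it reports EOF after `cap` bytes.
    source.take(cap).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Lock a mutex even if a previous holder panicked while holding it.
/// Assumption: the protected value is safe to keep using after that panic.
pub fn lock_recovering<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    // A poisoned lock still carries the guard; `into_inner` hands it back.
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Both helpers trade strictness for liveness: an oversized inbox is truncated instead of read whole, and one panicked holder does not disable the lock for every later caller.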

The preemptive review from 2026-04-20 had already named the same class of failure: escalation failures were durable but not closed-loop, hang paging could mark delivery before delivery happened, failed revive could stop self-healing, and restart could look complete with missing clones (notes/2026/04/20/agent4-resilience-review.md:17-24, notes/2026/04/20/agent4-resilience-review.md:39-47). The difference on 2026-04-21 was urgency. The failure had happened live.

fixes

Seventeen landed changes formed the first two waves. Wave 3 was already scoped while this post was drafted.

| wave | landed changes |
| --- | --- |
| wave 1 | bounded ticker children be491d2a, launchd shim and LKG promotion f100f7f9, morning health lead 67a00c4c, self-repair command 38a0ac1b, bounded tmux probes e177a7a0, gave-up marker archival f142ee27, multi-leg escalate 72113958, dev-channel permission preacceptance 546241f4 (notes/2026/04/21/agent0.md:84-99) |
| wave 2 | embedded shim for cargo-install path 2f7d58f5, owner-pages audit log 42e1c008, doctor version/schema checks ec2bc38c, poison recovery and inbox cap a8f6a5ef, iMessage source outage escalation 3bcc95af, hang marker sweep 74b9c10c, retry-until-ack spool d02f3fd3, crashloop retry and re-page 080733d6, delivery-only hang paging 901bb7ae (notes/2026/04/21/agent0.md:119-135) |
| wave 3 | health beacon, drill command, watchdog-events parse hardening, restart-degraded reconciliation, HEAD stability, and agentinfinity readiness freshness remain open work (notes/2026/04/21/agent5-lens-E-owner-visibility.md:15-25, notes/2026/04/21/agent4-lens-D-self-repair.md:72-77, notes/2026/04/21/agent0.md:142, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:51-55, notes/2026/04/21/agent4-lens-D-self-repair.md:22-23) |

The activated floor after wave 2 had 18 named pieces: gave-up archival, cooldown retry, hourly re-page, delivery-aware hang markers, multi-leg escalation, retry spool, iMessage source escalation, bounded tmux, bounded ticker, inbox cap, poisoned-mutex recovery, shim fallback, self-repair, permission preacceptance, morning health, doctor skew checks, owner-page audit, and hang-marker sweep (notes/2026/04/21/agent0.md:137-162).

proof

The proof was a live shim drill. The command made the live binary path invalid and removed Cargo from PATH:

PATH=/usr/bin:/bin \
NETSKY_LIVE_BIN=/tmp/does-not-exist \
bin/netsky-watchdog-shim status

The audited owner page recorded the result:

[shim] live binary failed probe: /tmp/does-not-exist
[shim] cargo not on PATH; skipping crates.io recovery
[shim] using last-known-good binary: /Users/cody/.netsky/bin/netsky.lkg
[watchdog-tick 15:21:03Z] agentinfinity missing; respawning

That record landed in ~/.netsky/state/owner-pages.jsonl:25. The shim exited with code 0. The downstream tick saw claude missing because the drill intentionally restricted PATH, but the LKG binary had executed and entered the watchdog tick (~/.netsky/state/owner-pages.jsonl:25).

The shell layer matters because launchd cannot call a broken Rust binary to repair that same Rust binary. The self-repair lens called for a shim outside the Rust executable, crates.io install first, source install second, and LKG swap last (notes/2026/04/21/agent4-lens-D-self-repair.md:16-23). The drill exercised the last leg.
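The shim itself lives outside the Rust executable, so the following is only a model of the fallback ordering, not the shim. It assumes a source install also needs cargo, which the drill output suggests but does not state. `Probe`, `Recovery`, and `recovery_plan` are hypothetical names.

```rust
/// One recovery leg, in the order the self-repair lens proposed (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum Recovery {
    UseLive,
    InstallFromCratesIo,
    InstallFromSource,
    SwapInLastKnownGood,
}

/// Facts the shim can probe without running the netsky binary itself.
pub struct Probe {
    pub live_binary_ok: bool,
    pub cargo_on_path: bool,
    pub source_checkout_present: bool,
    pub lkg_present: bool,
}

/// Ordered list of legs to attempt; an empty list means nothing is left to try.
pub fn recovery_plan(p: &Probe) -> Vec<Recovery> {
    if p.live_binary_ok {
        return vec![Recovery::UseLive];
    }
    let mut plan = Vec::new();
    if p.cargo_on_path {
        plan.push(Recovery::InstallFromCratesIo);
        // Assumption: building from source needs cargo too.
        if p.source_checkout_present {
            plan.push(Recovery::InstallFromSource);
        }
    }
    if p.lkg_present {
        plan.push(Recovery::SwapInLastKnownGood);
    }
    plan
}
```

With the drill's conditions (live binary invalid, cargo off PATH, LKG present) the plan collapses to the LKG swap alone, matching the leg the drill exercised.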

lessons

Markers are control plane objects. Every marker needs an owner, reader, stale rule, clear path, doctor severity, and re-page path (notes/2026/04/21/agent2-lens-B-watchdog-loop.md:98-100).
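One way to make that checklist enforceable is a registry row per marker, with a check that flags unset properties. This is a sketch under assumptions: the struct, its field names, and the severity levels are illustrative, not netsky's schema.

```rust
/// Doctor severity for a marker that is present (hypothetical levels).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Severity {
    Info,
    Warn,
    Critical,
}

/// One registry row per marker; field names are illustrative.
#[derive(Debug, Clone)]
pub struct MarkerSpec {
    pub name: &'static str,
    pub owner: &'static str,
    pub reader: &'static str,
    pub stale_after_secs: Option<u64>,
    pub clear_path: &'static str,
    pub doctor_severity: Severity,
    pub repage_after_secs: Option<u64>,
}

/// Names of the properties a spec leaves unset.
/// Severity is always present because the type requires it.
pub fn missing_fields(spec: &MarkerSpec) -> Vec<&'static str> {
    let mut missing = Vec::new();
    if spec.owner.is_empty() {
        missing.push("owner");
    }
    if spec.reader.is_empty() {
        missing.push("reader");
    }
    if spec.stale_after_secs.is_none() {
        missing.push("stale rule");
    }
    if spec.clear_path.is_empty() {
        missing.push("clear path");
    }
    if spec.repage_after_secs.is_none() {
        missing.push("re-page path");
    }
    missing
}
```

A gave-up marker declared with no stale rule and no clear path would fail this check before it could outlive an owner restart.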

Delivery success is not delivery attempt. The hang-paged marker now advances only after a page succeeds because the pre-audit code could suppress future pages after a failed send (notes/2026/04/20/agent4-resilience-review.md:21-23, notes/2026/04/21/agent2-lens-B-watchdog-loop.md:63-64).
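The rule reduces to gating the marker on the send result. A minimal sketch, assuming the send function returns a `Result`; `HangPageState` and `page_once` are hypothetical names, not the landed code.

```rust
/// Tracks whether the owner has actually been paged about the current hang.
#[derive(Debug, Default)]
pub struct HangPageState {
    pub paged: bool,
    pub failed_attempts: u32,
}

impl HangPageState {
    /// Try one page through `send`. The paged flag advances only on success,
    /// so a failed send leaves the next tick free to try again.
    /// Returns true if a page was delivered by this call.
    pub fn page_once<F>(&mut self, send: F) -> bool
    where
        F: FnOnce() -> Result<(), String>,
    {
        if self.paged {
            // Already delivered for this hang; do not page twice.
            return false;
        }
        match send() {
            Ok(()) => {
                self.paged = true;
                true
            }
            Err(_) => {
                self.failed_attempts += 1;
                false
            }
        }
    }
}
```

The pre-audit failure mode corresponds to setting `paged` before calling `send`, which silences every later tick after a single failed attempt.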

One transport is a wish. iMessage remains useful, but the floor now includes Gmail, a desktop sentinel, a tmux banner, and an incident spool that retries until ack (notes/2026/04/21/agent3-lens-C-escalation-floor.md:32-58, notes/2026/04/21/agent3-lens-C-escalation-floor.md:59-90).
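A sketch of how several legs and a retry-until-ack spool can fit together. It assumes each retry pass fans out to every leg and that only an owner ack removes an incident; the source does not say whether the real escalation fans out or tries legs in order. `Leg` and `Spool` are hypothetical types.

```rust
use std::collections::BTreeMap;

/// A transport leg: a name plus a send function that reports success.
pub struct Leg<'a> {
    pub name: &'a str,
    pub send: Box<dyn Fn(&str) -> bool + 'a>,
}

/// Incidents stay in the spool until the owner acknowledges them.
#[derive(Default)]
pub struct Spool {
    pending: BTreeMap<u64, String>,
}

impl Spool {
    pub fn enqueue(&mut self, id: u64, summary: &str) {
        self.pending.insert(id, summary.to_string());
    }

    /// Owner ack removes the incident; returns false for unknown ids.
    pub fn ack(&mut self, id: u64) -> bool {
        self.pending.remove(&id).is_some()
    }

    /// One retry pass: send every pending incident through every leg.
    /// Returns, per incident, the names of the legs that accepted it.
    pub fn retry_pass<'a>(&self, legs: &[Leg<'a>]) -> Vec<(u64, Vec<&'a str>)> {
        self.pending
            .iter()
            .map(|(id, summary)| {
                let delivered: Vec<&'a str> = legs
                    .iter()
                    .filter(|leg| (leg.send)(summary.as_str()))
                    .map(|leg| leg.name)
                    .collect();
                (*id, delivered)
            })
            .collect()
    }
}
```

A leg accepting the message does not clear the incident. Only `ack` does, so a page that was sent but never seen keeps retrying on later passes.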

Live process is weaker than live work. Tmux pane existence, source process existence, and a remembered ready marker are not enough. The next layer needs active health beacons and freshness checks (notes/2026/04/21/agent1-lens-A-source-robustness.md:43-52, notes/2026/04/21/agent5-lens-E-owner-visibility.md:15-25).
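A freshness check can be as small as comparing the last beacon time against a maximum age. The sketch below assumes a beacon timestamp exists somewhere the watchdog can read; the beacon is wave 3 work, so `Readiness`, `readiness`, and the threshold are all hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Result of a freshness check on an agent's last health beacon.
#[derive(Debug, PartialEq)]
pub enum Readiness {
    Fresh,
    Stale { age: Duration },
    NeverSeen,
    /// The beacon timestamp is later than `now`.
    ClockSkew,
}

/// Ready means a beacon was seen within `max_age`, not that one was ever seen.
pub fn readiness(last_beacon: Option<SystemTime>, now: SystemTime, max_age: Duration) -> Readiness {
    let beacon = match last_beacon {
        Some(t) => t,
        None => return Readiness::NeverSeen,
    };
    match now.duration_since(beacon) {
        Ok(age) if age <= max_age => Readiness::Fresh,
        Ok(age) => Readiness::Stale { age },
        // `duration_since` errors when `beacon` is later than `now`.
        Err(_) => Readiness::ClockSkew,
    }
}
```

A remembered ready marker answers "was it ever ready"; this answers "was it ready recently", which is the question a watchdog tick needs.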

The system is better after the day. It is not finished. Wave 3 is the honest next line: make health phone-visible, drill the failure paths as subcommands, harden corrupt event logs, reconcile stale degraded markers, and make readiness prove freshness instead of memory.