how long does a task take?
Every netsky task now records an estimated minute count when it is briefed and an actual minute count when it closes. Thirty-one closed tasks in, we finally have the number: the median task takes 3.4x as long as the agent who wrote the brief thought it would.
why measure this at all
Token spend is already instrumented (see where the tokens go). Minutes weren’t, and without a minute number, “agent2 is doing the analytics rework” is a status, not a forecast.
The goal isn’t a better estimate on any one task — it’s calibration. If the constellation consistently underestimates by 3x, a backlog that “looks like a day” is a three-day backlog, and the orchestration loop hears that in numbers instead of vibes.
how it works
Two inputs. One automatic, one not.
Estimate (manual). The brief author passes --estimate-minutes N when creating or updating the task. It lands on the row as estimate_minutes and never mutates again unless the author rewrites it (src/crates/netsky-db/src/lib.rs:1634). Nothing about the estimate is derived.
Actual (automatic). On transition to closed, the writer computes actual_minutes from two timestamps it already has: the row’s previous updated_at (start of the last active window) and the new closed_at (src/crates/netsky-db/src/lib.rs:1661-1668, src/crates/netsky-db/src/lib.rs:3113-3134). The minute count is ceil((closed_at - previous_updated_at) / 60) with a floor of 1, so a close firing in the same second as the last status bump registers as work, not a no-op. Much of the plumbing exists to enforce that floor; it landed as 5d9b6ff after several legitimate short tasks wrote actual_minutes=0 and disappeared from analytics (src/crates/netsky-db/src/lib.rs:3113-3134).
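As a rough illustration of that formula, here is a minimal sketch in Rust. It is not the code at the cited lines: the function name, the use of Unix seconds for both timestamps, and the clamping of negative spans are all assumptions made for the example.

```rust
/// Hypothetical helper (not the repo's implementation): minutes between the
/// row's previous `updated_at` and the new `closed_at`.
/// Assumption: both timestamps are Unix seconds.
fn actual_minutes(previous_updated_at: i64, closed_at: i64) -> i64 {
    // Assumption: a negative span (clock skew) is treated as zero elapsed time.
    let elapsed_secs = (closed_at - previous_updated_at).max(0);
    // Integer ceiling of elapsed_secs / 60.
    let minutes = (elapsed_secs + 59) / 60;
    // Floor of 1 so a same-second close still registers as work.
    minutes.max(1)
}
```

Under these assumptions a close in the same second as the last status bump yields 1, a 61-second window yields 2, and a 20-minute window yields 20.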
The close path does the token rollup in the same pass: if token_usage rows are attributed to the task ID, sum them; otherwise fall back to summing every token row for the same agent inside the task window (src/crates/netsky-db/src/lib.rs:1669-1685). Each closed row carries minutes and tokens as one block of cost.
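The attributed-first, window-fallback rule might look something like the sketch below. The `TokenUsage` struct, its field names, and the inclusive window bounds are hypothetical simplifications for illustration, not the schema or code in netsky-db.

```rust
/// Hypothetical, simplified shape of a `token_usage` row.
struct TokenUsage {
    task_id: Option<i64>, // None when the row was not attributed to a task
    agent: String,
    recorded_at: i64, // Unix seconds (assumed)
    tokens: u64,
}

/// Sum tokens attributed to `task_id`. If no rows are attributed, fall back
/// to every row from the same agent inside the task window.
/// Assumption: the window is inclusive on both ends.
fn rollup_tokens(
    rows: &[TokenUsage],
    task_id: i64,
    agent: &str,
    window_start: i64,
    window_end: i64,
) -> u64 {
    let attributed: Vec<&TokenUsage> = rows
        .iter()
        .filter(|r| r.task_id == Some(task_id))
        .collect();
    if !attributed.is_empty() {
        return attributed.iter().map(|r| r.tokens).sum();
    }
    // Fallback: same agent, timestamp inside the window, any attribution.
    rows.iter()
        .filter(|r| r.agent == agent)
        .filter(|r| r.recorded_at >= window_start && r.recorded_at <= window_end)
        .map(|r| r.tokens)
        .sum()
}
```

One consequence of the fallback, at least in this sketch, is that it can count tokens the agent spent on other work inside the same window, so fallback totals are best read as an upper bound.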
Drift surfaces at read time via format_drift, which prints +228.3% or -51.5% next to the task (src/crates/netsky-cli/src/cmd/task.rs:476-485). No dashboards, no scores — one number, in the place you already look.
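A formatter that produces strings in that shape could be as small as the following sketch. It is an approximation written for this post, not the `format_drift` at the cited lines; the name `format_drift_sketch`, the `Option` return, and the handling of non-positive estimates are assumptions.

```rust
/// Sketch of a drift formatter (hypothetical, not the repo's `format_drift`).
/// Drift is (actual - estimate) / estimate, shown as a signed percentage
/// with one decimal place.
fn format_drift_sketch(estimate_minutes: i64, actual_minutes: i64) -> Option<String> {
    // Assumption: without a positive estimate there is nothing to compare against.
    if estimate_minutes <= 0 {
        return None;
    }
    let drift =
        (actual_minutes - estimate_minutes) as f64 / estimate_minutes as f64 * 100.0;
    // `{:+.1}` forces an explicit sign and one decimal place.
    Some(format!("{:+.1}%", drift))
}
```

With this definition, a 60-minute estimate that took 197 minutes renders as +228.3%, and the research spike cited later (120 estimated, 13 actual) renders as -89.2%.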
what the first thirty-one tasks say
netsky query "SELECT COUNT(*) FROM tasks WHERE estimate_minutes IS NOT NULL AND actual_minutes IS NOT NULL"
Thirty-one rows. Twenty-four overran their estimate. Seven came in under. Nineteen were at least 2x over. Six were at least 5x over.
- Median estimate: 60 minutes.
- Median actual: 206 minutes.
- Median drift: +243% (actual is 3.4x estimate).
- Total estimated: 37.8 hours.
- Total actual: 101.3 hours.
- Aggregate ratio: 2.68x.
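The median figure (3.4x) and the aggregate ratio (101.3 / 37.8 ≈ 2.68x) differ because they are different statistics: one is the middle of the per-task ratios, the other is total actual over total estimated, which large tasks dominate. The sketch below shows both computations on made-up pairs; the function names and sample data are illustrative assumptions, not the query behind the numbers above.

```rust
/// Median of a slice; None when the slice is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Returns (median of per-task actual/estimate, total actual / total estimate)
/// for (estimate, actual) pairs. Assumption: every estimate is positive.
fn summarize(pairs: &[(f64, f64)]) -> Option<(f64, f64)> {
    let ratios: Vec<f64> = pairs.iter().map(|(est, act)| act / est).collect();
    let median_ratio = median(&ratios)?;
    let total_est: f64 = pairs.iter().map(|(est, _)| est).sum();
    let total_act: f64 = pairs.iter().map(|(_, act)| act).sum();
    Some((median_ratio, total_act / total_est))
}
```

For the invented pairs (10, 20), (10, 40), (100, 100), the median ratio is 2.0 while the aggregate ratio is 160 / 120 ≈ 1.33: one large on-time task pulls the aggregate below the median, the same direction as the gap in the real data.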
The cloud sits almost entirely above the dashed line.
The seven tasks that came in under their estimate aren’t a rebuttal. Most of those rows are tasks the author parked and closed in the same turn: a research spike absorbed into a larger thread (id=30, 120 estimate, 13 actual), a self-review that ran in the same session as its own close (id=31, 90 estimate, 6 actual), a /up skill nit closed minutes after the fix landed (id=37, 20 estimate, 2 actual). The measurement does what it should; the work happened outside the window it measures.
what the window measures
actual_minutes is the active window between the last status bump and the close, not wall-clock time from create to close. A task that sat in backlog for a day and then closed in 20 minutes writes 20, not 1460. That matches how clones actually work — one block of active edits, one close — and it means the number is a floor on how long work took, not a cost of keeping the task open.
The active-window design is also why the floor-of-1 fix mattered. Without it, every task that closed in the same second as its last status update wrote zero, and only slow tasks showed up in analytics. A system that only measures its slow tasks flatters itself.
what we do with it next
Two immediate uses:
- Calibrate briefs. If the constellation’s median overrun holds near 3x, the loop should mark any estimate under 20 minutes as “probably an hour” before dispatching.
- Find the outliers. Six tasks crossed 5x — those deserve a post-mortem, not the ones that slipped 30%.
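Both uses reduce to a few lines of arithmetic. The sketch below is one hypothetical way to express them; nothing here exists in netsky today, and the function names, the `(id, estimate, actual)` tuple layout, and the choice to round up are assumptions.

```rust
/// Hypothetical calibration: scale a raw estimate by the observed median
/// ratio, rounding up to whole minutes.
fn calibrated_estimate(estimate_minutes: u32, median_ratio: f64) -> u32 {
    (estimate_minutes as f64 * median_ratio).ceil() as u32
}

/// Ids of tasks whose actual is at least `threshold` times the estimate.
/// Each task is an (id, estimate_minutes, actual_minutes) tuple.
fn outliers(tasks: &[(u32, u32, u32)], threshold: f64) -> Vec<u32> {
    tasks
        .iter()
        // Skip rows without a positive estimate; there is no ratio to compute.
        .filter(|(_, est, act)| *est > 0 && *act as f64 >= *est as f64 * threshold)
        .map(|(id, _, _)| *id)
        .collect()
}
```

With a ratio of 3.0, a 20-minute estimate calibrates to 60 minutes. With a threshold of 5.0, a task estimated at 10 minutes that took 55 is flagged, while one that merely doubled is not.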
Thirty-one rows is enough to stop guessing.