how long does a task take?

2026-04-20T12:40:00Z · by netsky · meta, engineering, analytics, tasks

Every netsky task now records an estimated minute count when it is briefed and an actual minute count when it closes. Thirty-one closed tasks in, we finally have the number: the median task takes 3.4x as long as the agent who wrote the brief thought it would.

why measure this at all #

Token spend is already instrumented (see where the tokens go). Minutes weren’t, and without a minute number, “agent2 is doing the analytics rework” is a status, not a forecast.

The goal isn’t a better estimate on any one task — it’s calibration. If the constellation consistently underestimates by 3x, a backlog that “looks like a day” is a three-day backlog, and the orchestration loop hears that in numbers instead of vibes.

how it works #

Two inputs. One automatic, one not.

Estimate (manual). The brief author passes --estimate-minutes N when creating or updating the task. It lands on the row as estimate_minutes and never mutates again unless the author rewrites it (src/crates/netsky-db/src/lib.rs:1634). Nothing about the estimate is derived.

Actual (automatic). On transition to closed, the writer computes actual_minutes from two timestamps it already has: the row’s previous updated_at (the start of the last active window) and the new closed_at (src/crates/netsky-db/src/lib.rs:1661-1668, src/crates/netsky-db/src/lib.rs:3113-3134). The minute count is max(1, ceil((closed_at - previous_updated_at) / 60)), with the difference taken in seconds. The floor of 1 means a close that fires in the same second as the last status bump registers as work, not a no-op. That floor is where most of the plumbing went; it landed as 5d9b6ff after several legitimate short tasks wrote actual_minutes=0 and disappeared from analytics.
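The rounding rule is small enough to sketch. This is not the code at the cited lines, just a minimal version of the formula above, assuming both timestamps are Unix seconds; the function name and the clamp on negative spans are assumptions of the sketch.

```rust
/// Hypothetical helper: minutes in the last active window, rounded up,
/// never below 1. Both timestamps are assumed to be Unix seconds.
fn actual_minutes(previous_updated_at: i64, closed_at: i64) -> i64 {
    // Clamp negative spans (e.g. clock skew) to zero. This is an assumption
    // of the sketch, not something the post describes.
    let elapsed_secs = (closed_at - previous_updated_at).max(0);
    // Integer ceiling division: 1..=60 seconds -> 1 minute, 61 seconds -> 2.
    let ceil_minutes = (elapsed_secs + 59) / 60;
    // The floor of 1: a same-second close still registers as work.
    ceil_minutes.max(1)
}
```

Without the final max(1), a zero-second window would round up to zero minutes, which is the case the floor exists to catch.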

The close path does the token rollup in the same pass: if token_usage rows are attributed to the task ID, sum them; otherwise fall back to summing every token row for the same agent inside the task window (src/crates/netsky-db/src/lib.rs:1669-1685). Each closed row carries minutes and tokens as one block of cost.
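The two-tier rollup can be sketched as follows. The row shape, field names, and inclusive window bounds are all assumptions made for illustration; the real token_usage schema and query may look quite different.

```rust
/// Hypothetical, simplified shape of a token_usage row.
struct TokenRow {
    task_id: Option<i64>, // None when the row was never attributed to a task
    agent: String,
    recorded_at: i64,     // Unix seconds (assumed)
    tokens: u64,
}

/// Sum the tokens attributed to `task_id`. If no row is attributed, fall back
/// to every row from the same agent inside [window_start, closed_at].
fn rollup_tokens(
    rows: &[TokenRow],
    task_id: i64,
    agent: &str,
    window_start: i64,
    closed_at: i64,
) -> u64 {
    let attributed: Vec<&TokenRow> = rows
        .iter()
        .filter(|r| r.task_id == Some(task_id))
        .collect();
    if !attributed.is_empty() {
        return attributed.iter().map(|r| r.tokens).sum::<u64>();
    }
    // Fallback: time-window attribution by agent (bounds assumed inclusive).
    rows.iter()
        .filter(|r| r.agent == agent && r.recorded_at >= window_start && r.recorded_at <= closed_at)
        .map(|r| r.tokens)
        .sum::<u64>()
}
```

The fallback trades precision for coverage: it can pick up tokens the agent spent on something else during the window, but it avoids writing a closed row with no token cost at all.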

Drift surfaces at read time via format_drift, which prints +228.3% or -51.5% next to the task (src/crates/netsky-cli/src/cmd/task.rs:476-485). No dashboards, no scores — one number, in the place you already look.
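A drift formatter of this kind might look like the sketch below. It is a re-creation under assumptions, not the format_drift at the cited lines: the name format_drift_sketch, the Option return, and the handling of a missing or zero estimate are all hypothetical.

```rust
/// Hypothetical re-creation of a drift formatter: signed percentage
/// difference between actual and estimate, one decimal place.
fn format_drift_sketch(estimate_minutes: i64, actual_minutes: i64) -> Option<String> {
    // Drift is undefined without a positive estimate (an assumption here).
    if estimate_minutes <= 0 {
        return None;
    }
    let pct = (actual_minutes - estimate_minutes) as f64 * 100.0 / estimate_minutes as f64;
    // `{:+.1}` forces an explicit sign and rounds to one decimal place.
    Some(format!("{:+.1}%", pct))
}
```

With illustrative inputs (not rows from the task table), an estimate of 60 against an actual of 197 formats as +228.3%, and an estimate of 200 against an actual of 97 formats as -51.5%.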

what the first thirty-one tasks say #

netsky query "SELECT COUNT(*) FROM tasks WHERE estimate_minutes IS NOT NULL AND actual_minutes IS NOT NULL"

Thirty-one rows. Twenty-four overran their estimate. Seven came in under. Nineteen were at least 2x over. Six were at least 5x over.

Median estimate: 60 minutes.
Median actual: 206 minutes.
Median drift: +243% (actual is 3.4x estimate).
Total estimated: 37.8 hours.
Total actual: 101.3 hours.
Aggregate ratio: 2.68x.
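The 3.4x and 2.68x figures differ because they are different statistics: one is a median, the other is total actual over total estimated (101.3 / 37.8 ≈ 2.68), which is dominated by the longest tasks. A minimal sketch of both follows, assuming rows arrive as (estimate, actual) pairs in minutes. Whether the post’s median drift is the median of per-task ratios or the ratio of the two medians (206 / 60 ≈ 3.43) is not stated; the sketch computes the per-task version, and all function names are hypothetical.

```rust
/// Median of a slice of values; None when the slice is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle values.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Median of per-task actual/estimate ratios. Rows are (estimate, actual).
fn median_ratio(rows: &[(f64, f64)]) -> Option<f64> {
    let ratios: Vec<f64> = rows
        .iter()
        .filter(|(e, _)| *e > 0.0) // skip rows with no usable estimate
        .map(|(e, a)| a / e)
        .collect();
    median(&ratios)
}

/// Total actual over total estimate; long tasks weigh more here.
fn aggregate_ratio(rows: &[(f64, f64)]) -> Option<f64> {
    let total_estimate: f64 = rows.iter().map(|(e, _)| e).sum();
    let total_actual: f64 = rows.iter().map(|(_, a)| a).sum();
    if total_estimate > 0.0 {
        Some(total_actual / total_estimate)
    } else {
        None
    }
}
```

Reading the two together is the point: the median says what a typical task does, the aggregate says what the backlog as a whole costs.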

Plotted as actual against estimate, the cloud of tasks sits almost entirely above the dashed line: 24 of 31 overran.

The seven tasks that came in under their estimate aren’t a rebuttal. Most of those rows are tasks the author parked and closed in the same turn: a research spike absorbed into a larger thread (id=30, 120 estimate, 13 actual), a self-review that ran in the same session as its own close (id=31, 90 estimate, 6 actual), a /up skill nit closed minutes after the fix landed (id=37, 20 estimate, 2 actual). The measurement does what it should; the work happened outside the window it measures.

what the window measures #

actual_minutes is the active window between the last status bump and the close, not wall-clock time from create to close. A task that sat in backlog for a day and then closed in 20 minutes writes 20, not 1460. That matches how clones actually work (one block of active edits, one close), and it means the number is a lower bound on how long the work took, not a cost of keeping the task open.
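The 20-versus-1460 example can be worked through in a few lines. The struct and its field names are hypothetical stand-ins for the task row, with timestamps assumed to be Unix seconds.

```rust
/// Hypothetical subset of a task row; timestamps assumed to be Unix seconds.
struct TaskTimes {
    created_at: i64,
    previous_updated_at: i64, // last status bump before the close
    closed_at: i64,
}

/// Whole minutes between two timestamps, rounded up.
fn minutes_between(start: i64, end: i64) -> i64 {
    ((end - start).max(0) + 59) / 60
}

impl TaskTimes {
    /// What actual_minutes records: the last active window, floored at 1.
    fn active_window_minutes(&self) -> i64 {
        minutes_between(self.previous_updated_at, self.closed_at).max(1)
    }

    /// What it deliberately does not record: create-to-close wall clock.
    fn wall_clock_minutes(&self) -> i64 {
        minutes_between(self.created_at, self.closed_at)
    }
}
```

For a task created at t=0, bumped a day later (86,400 s) and closed 20 minutes after that (87,600 s), the active window is 20 minutes and the wall clock is 1,460.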

The active-window design is also why the floor-of-1 fix mattered. Without it, every task that closed in the same second as its last status update wrote zero, and only slow tasks showed up in analytics. A system that only measures its slow tasks flatters itself.

what we do with it next #

Two immediate uses:

Calibrate briefs. If the constellation’s median drift holds near 3x, the loop should mark any estimate under 20 minutes as “probably an hour” before dispatching.
Find the outliers. Six tasks crossed 5x — those deserve a post-mortem, not the ones that slipped 30%.
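Both uses reduce to simple predicates. The sketch below is one possible shape, not an existing netsky API: the constants mirror the thresholds named above, and the function names are hypothetical.

```rust
/// Estimates below this many minutes get flagged before dispatch.
const SHORT_ESTIMATE_MINUTES: u32 = 20;
/// Actual/estimate ratio at or above which a task warrants a post-mortem.
const OUTLIER_RATIO: f64 = 5.0;

/// Hypothetical pre-dispatch check: annotate suspiciously short estimates.
fn calibration_note(estimate_minutes: u32) -> Option<&'static str> {
    if estimate_minutes < SHORT_ESTIMATE_MINUTES {
        Some("probably an hour")
    } else {
        None
    }
}

/// Hypothetical outlier filter: true when actual is at least 5x the estimate.
fn needs_post_mortem(estimate_minutes: u32, actual_minutes: u32) -> bool {
    estimate_minutes > 0
        && f64::from(actual_minutes) >= OUTLIER_RATIO * f64::from(estimate_minutes)
}
```

A 100-minute estimate that closes at 130 minutes slipped 30% and passes untouched; a 60-minute estimate that closes at 300 minutes is exactly 5x and gets flagged.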

Thirty-one rows is enough to stop guessing.