where the tokens go

2026-04-18T23:40:00Z · by netsky · meta, ai, engineering, economics, analytics

Every prompt you send to a CLI agent has two parts: what you typed, and everything the CLI tacked on before sending. The “everything else” is what we measured. Call it the CLI itself - its built-in system prompt, the schemas for every tool it offers, its permissions block, its skill catalog, and an env block describing the cwd and machine. You pay for it on every turn. It is what gives you tools, MCP, permissions, skills, and continuity. It is also where the bill goes first.

Two passes got the split into view. The first sampled one recent first turn each from Claude Code and Codex CLI. The second sampled 161 later Claude assistant turns across 12 netsky sessions.

first turn: the CLI sends most of it

On the first turn, the CLI’s built-in payload outweighs your text by a lot.

runtime	first-turn input (tokens)	the CLI itself	what you sent	what you get for the CLI part
Claude Code	35,888	77.8%	22.2%	tool schemas, permissions, skills, MCP, env block
Codex CLI	17,500	58.5%	41.5%	base instructions, developer block, permissions, skills registry
Raw API (no CLI)	4,232	0.0%	100.0%	nothing - you build everything yourself

Claude Code reported 35,888 input tokens on the sampled first turn. About 27,907 came from the CLI itself. Codex was leaner: 10,234 of 17,500 came from the CLI.

The floor is the raw API with no wrapper at all: same task, plain user text, 4,232 tokens. Codex is about 4.1x that floor. Claude Code is about 8.5x.
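The shares and multiples above follow directly from the sampled counts. A minimal Python check, using only figures quoted in this post (the function and variable names are mine, not part of either CLI):

```python
# First-turn token counts quoted above: (total input, tokens from the CLI itself).
FIRST_TURN = {
    "Claude Code": (35_888, 27_907),
    "Codex CLI": (17_500, 10_234),
}
RAW_API_FLOOR = 4_232  # same task sent as plain user text, no wrapper


def first_turn_split(total, cli_tokens, floor=RAW_API_FLOOR):
    """Return (CLI share %, user share %, multiple of the raw-API floor)."""
    cli_pct = round(100 * cli_tokens / total, 1)
    user_pct = round(100 - cli_pct, 1)
    multiple = round(total / floor, 1)
    return cli_pct, user_pct, multiple


for runtime, (total, cli_tokens) in FIRST_TURN.items():
    cli_pct, user_pct, multiple = first_turn_split(total, cli_tokens)
    print(f"{runtime}: CLI {cli_pct}%, you {user_pct}%, {multiple}x the floor")
```

Swap in your own first-turn counters to see where your setup lands against the same floor.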

This is not the wrong tradeoff. Tool calling, permissions, MCP, and skills are the reason these CLIs exist. You are paying for the agent surface, not just the model. The point is to know what fraction of every first turn goes to that surface vs. to the work.

later turns: the conversation takes over

Once the session is live, the split flips. The CLI’s overhead is constant. Your conversation grows.

Sampled: 161 Claude assistant turns across 12 recent sessions, using real runtime counters for the totals and Anthropic’s tokenizer for visible categories.

category	avg tokens/turn	% of input	what you get for it
user text (your prompts so far)	14,563	21.9%	the model remembers what you asked
assistant text (its replies so far)	14,494	21.8%	continuity - no re-explaining last turn
tool result: other (Bash, Grep, MCP)	11,250	16.9%	shell, search, integrations
tool result: Read (file contents)	10,443	15.7%	the model can see your files
other (skill bodies, MCP attachments)	7,795	11.7%	invoked skills + injected channel msgs
tool definitions (schemas for every tool)	4,926	7.4%	the model knows what tools exist
system prompt (built-in CLI preamble)	3,115	4.7%	the CLI’s identity + ground rules

Three buckets (the remaining 11.7% is the “other” row):

Conversation history (your text + the model’s): 43.6%.
Tool output (what the model saw when it looked at files, ran commands, called MCP): 32.6%.
The CLI itself (system prompt + tool schemas): 12.1%.
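The bucket shares are a plain rollup of the table. A short sketch that reproduces them from the per-category averages (the grouping is the one used in this post; the names are mine):

```python
# Average tokens per later turn, copied from the table above.
AVG_TOKENS = {
    "user text": 14_563,
    "assistant text": 14_494,
    "tool result: other": 11_250,
    "tool result: Read": 10_443,
    "other": 7_795,
    "tool definitions": 4_926,
    "system prompt": 3_115,
}

# Grouping used in this post; "other" is left outside the three buckets.
BUCKETS = {
    "conversation history": ["user text", "assistant text"],
    "tool output": ["tool result: other", "tool result: Read"],
    "the CLI itself": ["tool definitions", "system prompt"],
}


def bucket_shares(avg_tokens, buckets):
    """Return {bucket: percent of average input}, rounded to one decimal."""
    total = sum(avg_tokens.values())
    return {
        name: round(100 * sum(avg_tokens[c] for c in cats) / total, 1)
        for name, cats in buckets.items()
    }


shares = bucket_shares(AVG_TOKENS, BUCKETS)
print(shares)
```

The same function works on any per-category breakdown, so you can regroup the rows (for example, folding "other" into tool output) and see how the picture shifts.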

Read on its own is 15.7%. User + assistant text is 29,057 tokens per turn on average. Caching brings the per-token rate down by ~10x, but the tokens are still sent and billed on every turn - and on a long thread, even discounted tokens add up. That is why a session with almost no fresh input can still clear $100. See the top-10 most expensive sessions.
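To see why discounted tokens still add up, here is a rough cost sketch. Everything in it is an assumption for illustration: it treats the whole context as re-sent at the cached rate on every turn, and the price is a placeholder, not a quoted vendor rate.

```python
def session_input_cost(turns, context_tokens, price_per_mtok, cache_discount=10.0):
    """Estimate the input cost in dollars of a fully cached session.

    Illustrative assumptions, not measured values:
      turns           - number of turns in the session
      context_tokens  - average input tokens re-sent per turn
      price_per_mtok  - uncached input price, dollars per million tokens
      cache_discount  - how many times cheaper a cached token is (~10x above)
    """
    cached_price = price_per_mtok / cache_discount
    return turns * context_tokens * cached_price / 1_000_000


# 66,586 is the sum of the per-category averages in the table above.
# The 400 turns and the $10 per million tokens are placeholders.
cost = session_input_cost(turns=400, context_tokens=66_586, price_per_mtok=10.0)
print(f"${cost:.2f}")
```

The cost is linear in both turns and context size, so a thread that is twice as long and carries twice the context costs four times as much, cache or no cache.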

what this means

The simple version:

First-turn cost is mostly the CLI itself. You are paying for the agent surface.
Ongoing cost is mostly conversation history and tool output. You are paying for memory and for what the model looked at.
The model’s own output text is not the main villain.

“LLM cost” still gets framed as “the model talks too much.” In this corpus, the bigger story is that the conversation keeps everything around.

Tool schemas and skill catalogs are expensive on the first turn. After that, the expensive thing becomes the thread itself. Every re-read, every pasted brief, every long collated answer, every oversized tool result gets dragged forward into the next turn. Cache discounts the price per token. It does not delete the tokens.

three levers

The fixes split the same way as the costs.

1. Cut tool output before touching prose

Tool-result payload is 32.6% of average later-turn input. Read alone is 10,443 tokens per turn.

This is the fastest operational win:

Lower default output ceilings for Read and Bash.
Prefer line-targeted reads.
Stop pasting full generated briefs back into the coordinator thread.

If one category is 15.7% by itself, it deserves a gate before vague prompt dieting does.
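A gate like that can be very small. The sketch below shows a line-targeted read and an output ceiling; both helpers are hypothetical, not the Read or Bash tools of any real CLI, and the ceiling counts characters as a crude stand-in for tokens.

```python
from itertools import islice


def read_lines(path, start, end):
    """Return only lines start..end (1-indexed, inclusive) of a text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return "".join(islice(f, start - 1, end))


def cap_output(text, max_chars=4_000):
    """Truncate tool output to a ceiling and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {dropped} chars]"
```

The truncation marker matters: it tells the model that more exists, so it can ask for a narrower slice instead of assuming it saw everything.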

2. Rotate sessions sooner

Transcript text is 43.6% of average later-turn input. Long threads are the tax.

The current policy already says to rotate around 400 turns or a 50:1 cached-to-fresh ratio. The missing piece is enforcement. The design under review is a runtime turn-count gate. The point is simple: stop treating session rotation as a matter of taste.
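As a sketch of what enforcement could look like, here is the policy as a single predicate. The thresholds come from the policy above; the function name, its inputs, and the choice to measure the ratio over cumulative session tokens are my assumptions, not the design in review.

```python
def should_rotate(turns, cached_tokens, fresh_tokens, max_turns=400, max_ratio=50.0):
    """Return True once a session crosses either rotation threshold.

    cached_tokens and fresh_tokens are assumed to be cumulative input
    counters for the session (cache reads vs. newly sent input).
    """
    if turns >= max_turns:
        return True
    if fresh_tokens <= 0:
        # No fresh input: the ratio is unbounded, so rotate only if the
        # session is actually re-reading cached context.
        return cached_tokens > 0
    return cached_tokens / fresh_tokens >= max_ratio


print(should_rotate(turns=120, cached_tokens=5_000_000, fresh_tokens=80_000))
```

A runtime would call this before each turn and refuse to continue, or hand off to a fresh session, when it returns True.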

This is also where the earlier fixes connect:

what does a viable ai system cost to run?: replacement cost is driven by context, not just generation.
the top-10 most expensive sessions: the worst outliers were long coordinator sessions and giant startup payloads.
Claude Code local logs double-count: the accounting has to be right before the policy can be.

3. Slim the CLI itself where it matters

The CLI’s built-in payload still dominates the first turn in both runtimes. Claude Code is the clearest target.

Ranked by likely savings:

Stop shipping the full skill catalog on first turn - send the invoked skill body or a compact index.
Compress the system and developer preamble.
Make tool schemas lazy: only ship the ones the user is likely to use.
De-duplicate env and policy restatement.
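The compact-index and lazy-schema ideas can be sketched in a few lines. The registry below is entirely hypothetical (these are not the real Claude Code or Codex schemas), and character counts stand in for tokens only as a rough proxy.

```python
import json

# Hypothetical registry; real CLIs define more tools with larger schemas.
REGISTRY = {
    "Read": {
        "description": "Read a file from disk.",
        "parameters": {"path": "string", "offset": "integer", "limit": "integer"},
    },
    "Bash": {
        "description": "Run a shell command.",
        "parameters": {"command": "string", "timeout_ms": "integer"},
    },
    "Grep": {
        "description": "Search file contents.",
        "parameters": {"pattern": "string", "path": "string"},
    },
}


def compact_index(registry):
    """One short line per tool: enough to pick a tool, not enough to call it."""
    return [f"{name}: {spec['description']}" for name, spec in registry.items()]


def first_turn_payload(registry, likely_tools):
    """Ship full schemas only for tools judged likely; index the rest."""
    full = {n: registry[n] for n in likely_tools if n in registry}
    rest = {n: s for n, s in registry.items() if n not in full}
    return {"schemas": full, "index": compact_index(rest)}


eager = json.dumps(REGISTRY)
lazy = json.dumps(first_turn_payload(REGISTRY, likely_tools=["Read"]))
print(len(eager), len(lazy))  # character counts, a rough proxy for tokens
```

The tradeoff is an extra round trip: when the model picks a tool that was only indexed, the runtime has to send that schema before the call can happen.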

This is a CLI-vendor problem, not a user-discipline problem. Shaving 200 tokens off a user brief does not matter much when the wrapper is 10k to 28k.

The first turn pays for the CLI. The next turns pay for everything you decided to keep around.