You didn't buy reasoning. You bought compute, and the parts that work are the parts a verifier checks.
Open your last API bill for a reasoning model and read it like an itemized receipt. There's the prompt. There's the answer. And then there's the line item that dwarfs them both: thousands of "thinking" tokens, the model muttering hmm, wait, let me reconsider, aha — billed at the per-token rate.
AI labs call it reasoning. This is where the money is made. Let's understand why output tokens are more expensive (5x) than input tokens.
This post is about three things:
- what that line item actually is,
- why it costs what it costs,
- and the useful part — why some things built on top of it (tool calls, coding agents) are dramatically more reliable than the raw model.
Where the money goes
The standard recipe for a "reasoning model" isn't mysterious. Take a transformer. Post-train it with reinforcement learning that rewards exactly one thing: whether the final answer is correct.
The model learns to emit a long stream of intermediate tokens before committing, because — empirically — emitting that stream raises the hit rate.
Crucially, in the usual setup no optimization pressure is applied to the content of those tokens. The reward checks the destination, not the route.
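To see what "checks the destination, not the route" means in practice, here is a minimal sketch of an outcome-only reward. The function name and the exact-match comparison are assumptions made for illustration, not any lab's actual training code.

```python
def outcome_reward(trace: str, final_answer: str, reference: str) -> float:
    """Score a trajectory by its final answer only.

    `trace` is accepted but never inspected: that is the whole point.
    Exact string match is an illustrative stand-in for a real checker.
    """
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


# Same answer, very different "reasoning": the reward cannot tell them apart.
sensible = outcome_reward("17 + 25 = 42", "42", "42")
nonsense = outcome_reward("hmm, wait, aha, bananas", "42", "42")
# A clean trace does not rescue a wrong answer.
wrong = outcome_reward("17 + 25 = 42", "41", "42")
```

Under this kind of reward, a sensible trace and a nonsensical one earn the same score as long as the answer matches.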
So what are you buying when you buy those tokens? Serial computation. Each token is another forward pass, another slice of budget for the model to condition on its own output before it has to answer. That is a real, valuable thing. The label says thinking. The meter is measuring compute.
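Here is a rough sketch of that receipt as arithmetic. The prices and token counts are made up for illustration and are not any provider's real rates. The one assumption that matters is that the hidden thinking tokens are billed at the output rate, like the visible answer.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    reasoning_tokens: int  # hidden "thinking" tokens
    answer_tokens: int     # visible final answer


def itemize(usage: Usage, input_price: float, output_price: float) -> dict:
    """Return the dollar cost of each line item.

    Prices are per million tokens. Assumption: reasoning tokens are
    billed at the output rate.
    """
    per_token_in = input_price / 1_000_000
    per_token_out = output_price / 1_000_000
    items = {
        "prompt": usage.prompt_tokens * per_token_in,
        "reasoning": usage.reasoning_tokens * per_token_out,
        "answer": usage.answer_tokens * per_token_out,
    }
    items["total"] = sum(items.values())
    return items


# Hypothetical request: short prompt, short answer, long hidden trace.
usage = Usage(prompt_tokens=800, reasoning_tokens=6_000, answer_tokens=300)
bill = itemize(usage, input_price=2.0, output_price=10.0)  # made-up prices, 5x ratio
reasoning_share = bill["reasoning"] / bill["total"]
```

With these invented numbers the thinking line is over 90% of the bill, which is the shape the receipt metaphor describes.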
Nuts and bolts of reasoning tokens
The narration isn't reasoning
If the middle of the bill were a faithful transcript of the model's reasoning, three things wouldn't be true. They are.
- You're billed for tokens you can't even see or reuse. OpenAI's o-series and Anthropic's thinking models generate long internal streams, charge you for them as reasoning tokens, then hide them and show only a sanitized summary. If those tokens were a clean window into reasoning, hiding them would be a strange product decision.
- The trace is not an audit trail. Anthropic tested its own model by slipping hints into prompts and checking whether the trace admitted using them. Often the model used the hint to land on its answer and never mentioned it, reverse-engineering a tidy justification for a conclusion reached another way. Measured faithfulness came out around 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1, with reveal rates frequently below 20%. A beautiful step-by-step trace is not evidence the answer is right. For more, see last year's post, "You can not trust COT".
- Longer isn't "thinking harder." A common RL setup smears the final reward evenly across every intermediate token, which creates a structural incentive to emit more tokens regardless of whether they help. The industry celebrated this as "it learned to think longer".
In reality, if you plot answer accuracy against trace length, you will see something like the chart below.
Two distinct bugs cause the trouble: unfaithfulness (the trace isn't an honest record of what drove the answer) and semantic emptiness (the content needn't even be correct for the trace to help).
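The hint test from the list above boils down to a simple ratio: of the runs where the model used the planted hint, how many traces admit it? The sketch below assumes a hypothetical record format (`used_hint`, `trace_mentions_hint`) and invented toy data. It is not Anthropic's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class HintTrial:
    used_hint: bool            # the answer followed the planted hint
    trace_mentions_hint: bool  # the visible trace admits relying on it


def faithfulness_rate(trials: list[HintTrial]) -> float:
    """Fraction of hint-using trials whose trace admits using the hint."""
    used = [t for t in trials if t.used_hint]
    if not used:
        raise ValueError("no trial used the hint; the rate is undefined")
    return sum(t.trace_mentions_hint for t in used) / len(used)


# Invented toy data: four hint-using trials, only one admits it.
# The fifth trial ignored the hint, so it is excluded from the ratio.
trials = [
    HintTrial(used_hint=True, trace_mentions_hint=True),
    HintTrial(used_hint=True, trace_mentions_hint=False),
    HintTrial(used_hint=True, trace_mentions_hint=False),
    HintTrial(used_hint=True, trace_mentions_hint=False),
    HintTrial(used_hint=False, trace_mentions_hint=False),
]
rate = faithfulness_rate(trials)
```

The denominator only counts runs where the hint actually drove the answer, so a model that ignores hints is not penalized for staying silent about them.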
Stop chasing honest traces and thinking logs.
Treat the tokens as what the invoice says: paid test-time compute that raises your odds of a correct answer.
If the tokens aren't reasoning, why does post-training work?
Because training quietly compiles a verifier's signal into the model. The system runs a generate-test loop — propose many trajectories, an automatic checker keeps the ones that land on correct answers, nudge the model toward those.
Do it enough and verification is internalized into plain generation. "Self-improvement" is really incrementally absorbing a checker — a far more tractable thing to engineer than honesty.
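A toy version of that generate-test loop might look like the sketch below. The "model" and checker are stand-ins invented for illustration, and the step that actually nudges the weights toward the kept samples is left out.

```python
import random
from typing import Callable


def collect_verified(
    propose: Callable[[random.Random], tuple[str, int]],
    check: Callable[[int], bool],
    n_samples: int,
    seed: int = 0,
) -> list[tuple[str, int]]:
    """Sample trajectories and keep those whose final answer passes the checker."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        trace, answer = propose(rng)
        if check(answer):  # only the answer is checked, never the trace
            kept.append((trace, answer))
    return kept


# Toy stand-in for a model: guesses 17 + 25, sometimes off by one.
def toy_propose(rng: random.Random) -> tuple[str, int]:
    guess = 42 + rng.choice([-1, 0, 0, 1])
    return (f"hmm, wait... I get {guess}", guess)


# Toy stand-in for an automatic checker.
def toy_check(answer: int) -> bool:
    return answer == 17 + 25


# In a real pipeline, `kept` would become the data the model is trained toward.
kept = collect_verified(toy_propose, toy_check, n_samples=50)
```

Every surviving trace says "hmm, wait...", and that filler gets reinforced along with the correct answer, because the checker never looked at it.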
If verification is the active ingredient, the productive moves are to optimize directly for solution accuracy and to keep a real verifier in the loop (generate with the model, validate with a sound external checker) rather than dressing the tokens up to read like human reasoning.
So what should you do?
- Verify the output, not the narration. For anything you can't independently check, wrap a real verifier around the answer instead of trusting a fluent trace.
- Don't read length as effort. Spend the compute on genuinely hard problems, not on easy ones where you're paying for mumbling that changes nothing.
- Price in hidden tokens as a cost of compute.
By now you probably have a few burning questions. Let's try to answer them.
Why tool calls actually work
Here's the constructive half of the argument, and it's the part that should change how you build. A tool call works because it carries its own verifier. Calling an API, running code, hitting search — each is a structured action with a consequence the environment hands back: a result, or an error. That return value is a checker sitting in the loop.
The model's narration about the call ("I'll search for X because…") is still just intermediate tokens, subject to every unfaithfulness caveat above. But the call's correctness isn't judged by that narration — it's judged by the world.
So "traces are unfaithful" and "tool calls work" aren't in tension; they're the same point from two sides.
Unverified text is untrustworthy; verified actions are grounded. Tool use lives entirely on the grounded side — and it's typically trained to outcomes (did the call succeed, did the task complete), the same outcome-based recipe, applied to something the environment can actually score.
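As a rough sketch of what "the environment hands back a result or an error" looks like in code: the wrapper and the `divide` tool below are hypothetical names for illustration, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None


def call_tool(tool: Callable[..., Any], **kwargs: Any) -> ToolResult:
    """Run a tool and report what the environment handed back: a value or an error."""
    try:
        return ToolResult(ok=True, value=tool(**kwargs))
    except Exception as exc:  # the environment rejecting the call is the signal
        return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


# Hypothetical tool for illustration.
def divide(numerator: float, denominator: float) -> float:
    return numerator / denominator


# Whatever the model narrated before these calls, the outcome is decided here.
good = call_tool(divide, numerator=10, denominator=4)
bad = call_tool(divide, numerator=10, denominator=0)
```

The `ok` flag comes from running the call, not from the model's description of it.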
Where it fails is precisely where the verifier is absent: an agent that writes "I've confirmed this works" without running anything, or fabricates a tool result it never received. That's the unfaithfulness resurfacing in the gaps between grounded actions.
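One way to close that gap is to treat a claimed tool result as grounded only if it matches a record of what actually ran. The sketch below assumes a hypothetical harness-side log keyed by call id. The names are illustrative, not a real framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class CallLog:
    """Outputs of tool calls that actually executed, keyed by call id."""
    results: dict[str, str] = field(default_factory=dict)

    def record(self, call_id: str, output: str) -> None:
        self.results[call_id] = output

    def is_grounded(self, call_id: str, claimed_output: str) -> bool:
        """True only if the call ran and the claim matches what came back."""
        return self.results.get(call_id) == claimed_output


log = CallLog()
log.record("call-1", "3 passed, 0 failed")  # written by the harness, not the model

honest = log.is_grounded("call-1", "3 passed, 0 failed")
fabricated = log.is_grounded("call-2", "all tests pass")  # this call never ran
misquoted = log.is_grounded("call-1", "all tests pass")   # ran, but returned something else
```

The log has to be written by the harness that executes the calls. A log the model writes itself is just more narration.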
Why coding agents work — Plan vs Execute
Code is the best case for this entire framework, because code ships with the richest stack of cheap, sound verifiers anywhere: it compiles or it doesn't, the type checker accepts or rejects, the tests pass or fail, the runtime returns or throws, the linter flags.
Every one is an automatic grader. That's not by chance — reasoning models improved most on code and math precisely because those are the domains that have formal verifiers.
So split the two artifacts. The narration before the code ("I'll use a hashmap for O(1) lookups…") is intermediate tokens — trust it the way you'd trust any trace, which is to say cautiously.
The code itself is the verifiable thing, and the tests are the verifier. A gorgeous rationale around code that fails its suite is worth nothing; terse code that passes a strong suite is grounded no matter what the narration said.
This maps almost perfectly onto plan mode and execute mode:
Plan mode produces text. A strategy, a step list, a design. Nothing checks a plan. It can read immaculate and be wrong, internally inconsistent, or diverge from what actually gets executed. The plan→execution gap is the code-agent version of the trace→answer gap — the pure unfaithfulness zone.
Execute mode takes grounded actions. It writes files, runs commands, runs the tests, and gets real results back. This is where verification happens.
And the agentic loop (generate code, run tests, read the failure, patch, rerun) is the generate-test loop running at inference time, with the verifier sitting outside the model as a real test runner. It's the same mechanism that post-training compiles into the weights, made external and explicit. It's also why an agent with a good test suite crushes one-shot generation: you've externalized the checker.
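The loop itself is short. In the sketch below, `toy_model` and `toy_test_runner` are stand-ins invented for illustration: the fake model only fixes its off-by-one after it sees a failure message, which is the behavior the loop is meant to exploit.

```python
from typing import Callable, Optional


def agent_loop(
    propose: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> tuple[Optional[str], int]:
    """Generate, test, feed the failure back, and repeat.

    `propose` receives the last failure message (None on the first round)
    and returns candidate source. `run_tests` returns None on success or a
    failure message. Returns (passing source or None, rounds used).
    """
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = propose(feedback)
        feedback = run_tests(candidate)
        if feedback is None:
            return candidate, round_number
    return None, max_rounds


# Toy stand-in for the model: wrong first, patched once it has feedback.
def toy_model(feedback: Optional[str]) -> str:
    if feedback is None:
        return "def last(xs):\n    return xs[len(xs)]\n"
    return "def last(xs):\n    return xs[len(xs) - 1]\n"


# Toy stand-in for a test runner; exec is in-process, so sandbox it in real use.
def toy_test_runner(source: str) -> Optional[str]:
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["last"]([1, 2, 3]) == 3
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


solution, rounds = agent_loop(toy_model, toy_test_runner)
```

If the budget runs out, the loop returns `None` instead of the last failing candidate, so an unverified answer never leaves it looking like a verified one.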
The plan is a scaffold, not a contract. Don't read a clean plan as a correctness guarantee. It earns its keep like filler tokens do — compute plus a prompt augmentation that raises the odds the executable output is correct, giving the verifier something good to select.
Your leverage is the verifier, not the prose. Code quality tracks how good your tests and type checks are, far more than how good the plan reads. No tests means you're trusting unverified text — back on the ungrounded side.
Watch the spots where the checker is absent. The failures that bite are "done, all tests pass" when nothing ran, fabricated output, or drift from the stated plan. The countermeasure is boring and correct: make it actually run the build and the tests. Don't accept "I've verified this"; verify it.
Why LLM-as-verifier is popular when traces are lies
Verification is easier than generation.
Checking whether a candidate answer satisfies some criteria is a lower-bandwidth task than producing the answer from scratch. This "generator–verifier gap" means an LLM judge can be meaningfully better than chance even when the generator is unreliable — so best-of-N, reranking, and reward models extract real signal by sampling many and selecting. A noisy checker above 50% still lifts expected quality at scale.
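The "above 50%" claim can be checked with a few lines of Bayes' rule. This sketch assumes a deliberately simple model that is not from the post: each candidate is correct with probability `p_correct`, and the judge labels correct and incorrect candidates with the same accuracy, independently of everything else. Real judges are messier.

```python
def accepted_precision(p_correct: float, judge_accuracy: float) -> float:
    """P(candidate is correct | judge accepted it), under the toy model above."""
    true_accepts = p_correct * judge_accuracy
    false_accepts = (1 - p_correct) * (1 - judge_accuracy)
    return true_accepts / (true_accepts + false_accepts)


baseline = 0.30  # made-up generator accuracy

# A coin-flip judge adds nothing: accepted answers are as good as unfiltered ones.
coin_flip_judge = accepted_precision(baseline, 0.50)
# A judge that is right 70% of the time lifts the accepted pool well above baseline.
noisy_judge = accepted_precision(baseline, 0.70)
```

Under these assumptions, filtering helps exactly when the judge beats a coin flip, and the lift grows with judge accuracy. Correlated errors between generator and judge, which this model ignores, would shrink it.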
Most tasks have no sound verifier at all.
You can compile code and run tests, but you can't compile an essay, a summary's faithfulness, or "is this a helpful answer." For all the fuzzy domains — which is most of them — a noisy LLM judge is the only scalable checker that exists. Human eval doesn't scale; formal verification is impossible.
Popularity is driven by necessity and the generator–verifier asymmetry, not by soundness.
Conclusion
Reasoning models shifted a large chunk of cost into inference, and every hard query now spins up a long, expensive stream of test-time compute that dominates your invoice.
We spent two years calling that compute thinking, and that framing bred false confidence, over-trusted traces, and research chasing the narration instead of the mechanism.
Accounting is less romantic and more useful.
You are not buying a mind at work. You are buying serial computation, metered by the token, that raises your odds of a correct answer and tells you a nice story along the way.
It is clearly worth paying for wherever there's a real verifier at the end of the run — a tool that returns, a test that passes.
For code, there almost always can be one. So: pay for the compute, put a verifier at the end, and don't tip extra for the story.