Wednesday, 24 June 2026

The Thinking Behind Thinking Tokens

You didn't buy reasoning. You bought compute, and the parts that work are the parts a verifier checks.

Open your last API bill for a reasoning model and read it like an itemized receipt. There's the prompt. There's the answer. And then there's the line item that dwarfs them both: thousands of "thinking" tokens, the model muttering hmm, wait, let me reconsider, aha — billed at the per-token rate. 

AI labs call it reasoning. This is where the money is made. Let's understand why output tokens cost more (about 5x) than input tokens.


This post is about three things:

- what that line item actually is, 

- why it costs what it costs, 

- and the useful part — why some things built on top of it (tool calls, coding agents) are dramatically more reliable than the raw model. 


Where the money goes

The standard recipe for a "reasoning model" isn't mysterious. Take a transformer. Post-train it with reinforcement learning that rewards exactly one thing: whether the final answer is correct. 

The model learns to emit a long stream of intermediate tokens before committing, because — empirically — emitting that stream raises the hit rate. 

Crucially, in the usual setup no optimization pressure is applied to the content of those tokens. The reward checks the destination, not the route.
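A toy sketch of what "checks the destination, not the route" means in practice. The function names and the two traces are made up for illustration, and real RL pipelines are far more involved; the only point is that the trace never enters the reward.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def per_token_credit(trace_tokens: list[str], reward: float) -> list[float]:
    """Broadcast the scalar reward to every intermediate token, helpful or not."""
    return [reward] * len(trace_tokens)


# Two hypothetical traces that end in the same final answer.
careful_trace = "17 * 3 = 51 , 51 + 4 = 55".split()
rambling_trace = "hmm wait let me reconsider aha hmm wait the answer is".split()

# The trace is not an input to the reward, so the two routes are indistinguishable.
r_careful = outcome_reward("55", reference="55")
r_rambling = outcome_reward("55", reference="55")

print(r_careful, r_rambling)
print(per_token_credit(rambling_trace, r_rambling))
```

Every token in the rambling trace gets the same credit as every token in the careful one, which is all the sketch is meant to show.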



So what are you buying when you buy those tokens? Serial computation. Each token is another forward pass, another slice of budget for the model to condition on its own output before it has to answer. That is a real, valuable thing. The label says thinking. The meter is measuring compute.
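To make the receipt concrete, here is a minimal sketch of how such a bill adds up. The prices and token counts are placeholders, not any vendor's rate card; the only things taken from this post are that thinking tokens are billed like output and that output costs about 5x input.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int     # what you sent
    thinking_tokens: int   # intermediate tokens, billed as output
    answer_tokens: int     # the visible answer


# Hypothetical prices in dollars per million tokens (assumption, not a real rate card).
INPUT_PRICE_PER_M = 2.0
OUTPUT_PRICE_PER_M = 5 * INPUT_PRICE_PER_M  # the ~5x output premium mentioned above


def itemize(usage: Usage) -> dict[str, float]:
    """Split one request's cost into prompt, thinking, and answer line items."""
    return {
        "prompt": usage.prompt_tokens * INPUT_PRICE_PER_M / 1_000_000,
        "thinking": usage.thinking_tokens * OUTPUT_PRICE_PER_M / 1_000_000,
        "answer": usage.answer_tokens * OUTPUT_PRICE_PER_M / 1_000_000,
    }


# Hypothetical request: short prompt, long hidden trace, short answer.
receipt = itemize(Usage(prompt_tokens=1_000, thinking_tokens=8_000, answer_tokens=500))
total = sum(receipt.values())
for item, cost in receipt.items():
    print(f"{item:>8}: ${cost:.4f} ({cost / total:.0%} of total)")
```

With these made-up numbers the thinking line is larger than the other two combined, which is the shape of the receipt described at the top of the post.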


Nuts and Bolts of Reasoning Tokens

Before we go deeper, let's understand why it works.



The narration isn't reasoning

If the middle of the bill were a faithful transcript of the model's reasoning, three things wouldn't be true. They are.

- You're billed for tokens you can't even see or reuse. OpenAI's o-series and Anthropic's extended thinking generate long internal streams, charge you for them as reasoning tokens, then hide them and show only a sanitized summary. If those tokens were a clean window into reasoning, hiding them would be a strange product decision.

- The trace is not an audit trail. Anthropic tested its own model by slipping hints into prompts and checking whether the trace admitted using them. Often the model used the hint to land on its answer and never mentioned it, reverse-engineering a tidy justification for a conclusion reached another way. Measured faithfulness came out around 25% for Claude 3.7 Sonnet and 39% for DeepSeek R1, with reveal rates frequently below 20%. A beautiful step-by-step trace is not evidence the answer is right. For more, see last year's post, "You can not trust COT".

- Longer isn't "thinking harder." A common RL setup smears the final reward evenly across every intermediate token, which creates a structural incentive to emit more tokens regardless of whether they help. The industry celebrated this as "it learned to think longer."

In reality, if you plot answer accuracy against trace length, you will see something like the chart below.



Two distinct bugs cause issues: unfaithfulness (the trace isn't an honest record of what drove the answer) and semantic emptiness (the content needn't even be correct for the trace to help).


Stop chasing honest traces and thinking logs

Treat the tokens as what the invoice says: paid test-time compute that raises your odds of a correct answer.

If the tokens aren't reasoning, why does post-training work? 

Because training quietly compiles a verifier's signal into the model. The system runs a generate-test loop — propose many trajectories, an automatic checker keeps the ones that land on correct answers, nudge the model toward those. 

Do it enough and verification is internalized into plain generation. "Self-improvement" is really incrementally absorbing a checker — a far more tractable thing to engineer than honesty.

If verification is the active ingredient, the productive moves are to optimize directly for solution accuracy and to keep a real verifier in the loop (generate with the model, validate with a sound external checker) rather than dressing the tokens up to read like human reasoning.
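The generate-test loop is easy to sketch. Below, a random guesser stands in for the model and exact division stands in for the sound checker; both are toy assumptions chosen so the example runs on its own. In training, the kept samples would be used to nudge the model; at inference, they are simply the answers you can trust.

```python
import random


def propose(rng: random.Random, n: int) -> int:
    """Stand-in for a model: guess a factor of n, usually wrongly."""
    return rng.randint(2, n - 1)


def verify(n: int, candidate: int) -> bool:
    """Sound external checker: a real factor divides n exactly."""
    return 1 < candidate < n and n % candidate == 0


def generate_and_test(n: int, budget: int, seed: int = 0) -> list[int]:
    """Sample `budget` candidates and keep only those the checker accepts."""
    rng = random.Random(seed)
    candidates = [propose(rng, n) for _ in range(budget)]
    return [c for c in candidates if verify(n, c)]


kept = generate_and_test(n=91, budget=200)
print(f"kept {len(kept)} of 200 proposals: {sorted(set(kept))}")
```

Nothing about the guesser's "reasoning" matters here. The quality of what survives is set entirely by the checker.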

So what should you do?

  • Verify the output, not the narration. For anything you can't independently check, wrap a real verifier around the answer instead of trusting a fluent trace.

  • Don't read length as effort. Spend the compute on genuinely hard problems, not on easy ones where you're paying for mumbling that changes nothing.

  • Price in hidden tokens as a cost of compute.


As you read through this post, a few burning questions will come up. Let's try to answer them.

Why tool calls actually work

Here's the constructive half of the argument, and it's the part that should change how you build. A tool call works because it carries its own verifier. Calling an API, running code, hitting search — each is a structured action with a consequence the environment hands back: a result, or an error. That return value is a checker sitting in the loop.

The model's narration about the call ("I'll search for X because…") is still just intermediate tokens, subject to every unfaithfulness caveat above. But the call's correctness isn't judged by that narration — it's judged by the world. 

So "traces are unfaithful" and "tool calls work" aren't in tension; they're the same point from two sides. 

Unverified text is untrustworthy; verified actions are grounded. Tool use lives entirely on the grounded side — and it's typically trained to outcomes (did the call succeed, did the task complete), the same outcome-based recipe, applied to something the environment can actually score.

Where it breaks is precisely where the verifier is absent: an agent that writes "I've confirmed this works" without running anything, or fabricates a tool result it never received. That's the unfaithfulness resurfacing in the gaps between grounded actions.
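One way to close that gap is to keep a ledger of what the environment actually returned and to accept a claim only if it points at a recorded, successful call. The sketch below is illustrative: `ToolLedger` and `is_grounded` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    call_id: int
    ok: bool
    value: Any = None
    error: Optional[str] = None


@dataclass
class ToolLedger:
    """Records what the environment actually returned for each call."""
    results: dict = field(default_factory=dict)

    def run(self, tool: Callable[..., Any], *args: Any) -> ToolResult:
        call_id = len(self.results) + 1
        try:
            result = ToolResult(call_id, ok=True, value=tool(*args))
        except Exception as exc:  # the environment said no
            result = ToolResult(call_id, ok=False, error=repr(exc))
        self.results[call_id] = result
        return result

    def is_grounded(self, claimed_call_id: int) -> bool:
        """A claim counts only if a matching call ran and succeeded."""
        recorded = self.results.get(claimed_call_id)
        return recorded is not None and recorded.ok


ledger = ToolLedger()
good = ledger.run(int, "42")         # the call succeeds
bad = ledger.run(int, "forty-two")   # the call raises ValueError

print(ledger.is_grounded(good.call_id))  # a call that really succeeded
print(ledger.is_grounded(bad.call_id))   # a call that ran but failed
print(ledger.is_grounded(99))            # a call the agent never made
```

The narration can say anything it likes. Only entries in the ledger count as evidence.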

Why coding agents work — Plan vs Execute

Code is the best case for this entire framework, because code ships with the richest stack of cheap, sound verifiers anywhere: it compiles or it doesn't, the type checker accepts or rejects, the tests pass or fail, the runtime returns or throws, the linter flags. 

Every one is an automatic grader. That's not by chance — reasoning models improved most on code and math precisely because those are the domains that have formal verifiers.

So split the two artifacts. The narration before the code ("I'll use a hashmap for O(1) lookups…") is intermediate tokens — trust it the way you'd trust any trace, which is to say cautiously. 

The code itself is the verifiable thing, and the tests are the verifier. A gorgeous rationale around code that fails its suite is worth nothing; terse code that passes a strong suite is grounded no matter what the narration said.

This maps almost perfectly onto plan mode and execute mode:

  • Plan mode produces text. A strategy, a step list, a design. Nothing checks a plan. It can read immaculate and be wrong, internally inconsistent, or diverge from what actually gets executed. The plan→execution gap is the code-agent version of the trace→answer gap — the pure unfaithfulness zone.

  • Execute mode takes grounded actions. It writes files, runs commands, runs the tests, and gets real results back. This is where verification happens.

And the agentic loop (generate code, run tests, read the failure, patch, rerun) is the generate-test loop running at inference time, with the verifier sitting outside the model as a real test runner. It's the training-time mechanism described earlier, made external and explicit rather than compiled into the weights. It's also why an agent with a good test suite crushes one-shot generation: you've externalized the checker.
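Here is that loop in miniature. A fixed list of patches stands in for the model's successive attempts (an assumption made so the example is deterministic), and a tiny test function plays the role of the external test runner.

```python
from typing import Optional

# Stand-in for a model: each entry is the next attempt at the same function.
CANDIDATE_PATCHES = [
    "def median(xs):\n    return xs[len(xs) // 2]",                    # forgets to sort
    "def median(xs):\n    s = sorted(xs)\n    return s[len(s) // 2]",  # wrong for even length
    "def median(xs):\n    s = sorted(xs)\n    m = len(s) // 2\n"
    "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2",
]


def run_tests(source: str) -> Optional[str]:
    """External verifier: execute the code and return a failure message, or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        median = namespace["median"]
        assert median([3, 1, 2]) == 2, "odd-length case failed"
        assert median([4, 1, 3, 2]) == 2.5, "even-length case failed"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def agent_loop(patches: list[str]) -> Optional[int]:
    """Try patches in order; return the 1-based round that passed, or None."""
    for round_no, source in enumerate(patches, start=1):
        failure = run_tests(source)
        if failure is None:
            return round_no
        # A real agent would feed this failure text into the next generation.
        print(f"round {round_no} failed: {failure}")
    return None


passed_on = agent_loop(CANDIDATE_PATCHES)
print(f"tests passed on round {passed_on}")
```

The first two attempts look plausible and would survive a casual read. They are rejected only because something actually ran them.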



  • The plan is a scaffold, not a contract. Don't read a clean plan as a correctness guarantee. It earns its keep like filler tokens do — compute plus a prompt augmentation that raises the odds the executable output is correct, giving the verifier something good to select.

  • Your leverage is the verifier, not the prose. Code quality tracks how good your tests and type checks are, far more than how good the plan reads. No tests means you're trusting unverified text — back on the ungrounded side.

  • Watch the spots where the checker is absent. The failures that bite are "done, all tests pass" when nothing ran, fabricated output, or drift from the stated plan. The countermeasure is boring and correct: make it actually run the build and the tests. Don't accept "I've verified this"; verify it.
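The boring countermeasure fits in a few lines: run the check yourself and trust only the exit code. The two commands below are stand-ins so the sketch is self-contained; in a real repository you would pass your own build or test command.

```python
import subprocess
import sys


def actually_ran_and_passed(command: list[str], timeout_s: int = 60) -> bool:
    """Run the check ourselves and trust only the process exit code."""
    completed = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return completed.returncode == 0


agent_claim = "Done, all tests pass."  # unverified text, deliberately ignored below

# Stand-in commands (assumption): one check that passes, one that fails.
passing = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing = [sys.executable, "-c", "assert 1 + 1 == 3"]

print(actually_ran_and_passed(passing))
print(actually_ran_and_passed(failing))
```

Note that `agent_claim` is never consulted. That is the design choice.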


Why LLM-as-verifier is popular when traces are lies

Verification is easier than generation. 

Checking whether a candidate answer satisfies some criteria is a lower-bandwidth task than producing the answer from scratch. This "generator–verifier gap" means an LLM judge can be meaningfully better than chance even when the generator is unreliable — so best-of-N, reranking, and reward models extract real signal by sampling many and selecting. A noisy checker above 50% still lifts expected quality at scale.
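A small simulation shows the effect. All numbers are assumptions picked for illustration: a generator that is right 30% of the time, and a judge that labels each candidate correctly 70% of the time, independently of the others.

```python
import random


def selection_accuracy(p_gen: float, judge_acc: float, n: int,
                       trials: int = 5000, seed: int = 0) -> float:
    """Fraction of trials where the selected candidate is actually correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        candidates = [rng.random() < p_gen for _ in range(n)]  # True = correct
        accepted = []
        for is_correct in candidates:
            judge_is_right = rng.random() < judge_acc
            verdict = is_correct if judge_is_right else not is_correct
            if verdict:
                accepted.append(is_correct)
        pool = accepted or candidates  # fall back if the judge rejects everything
        hits += rng.choice(pool)
    return hits / trials


baseline = selection_accuracy(p_gen=0.3, judge_acc=0.7, n=1)
best_of_8 = selection_accuracy(p_gen=0.3, judge_acc=0.7, n=8)
print(f"single sample: {baseline:.2f}, best-of-8 with noisy judge: {best_of_8:.2f}")
```

Under these assumptions an accepted candidate is correct with probability 0.3 x 0.7 / (0.3 x 0.7 + 0.7 x 0.3) = 0.5, so the noisy filter lifts accuracy from about 0.3 toward 0.5. The judge is far from sound, and it still helps.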

Most tasks have no sound verifier at all. 

You can compile code and run tests, but you can't compile an essay, a summary's faithfulness, or "is this a helpful answer." For all the fuzzy domains — which is most of them — a noisy LLM judge is the only scalable checker that exists. Human eval doesn't scale; formal verification is impossible.

Popularity is driven by necessity and the generator–verifier asymmetry, not by soundness.

Conclusion

Reasoning models shifted a large chunk of cost into inference, and every hard query now spins up a long, expensive stream of test-time compute that dominates your invoice. 

We spent two years calling that compute thinking, and that framing bred false confidence, over-trusted traces, and research chasing the narration instead of the mechanism.

Accounting is less romantic and more useful. 

You are not buying a mind at work. You are buying serial computation, metered by the token, that raises your odds of a correct answer and tells you a nice story along the way. 

It is clearly worth paying for wherever there's a real verifier at the end of the run — a tool that returns, a test that passes. 

For code, there almost always can be one. So: pay for the compute, put a verifier at the end, and don't tip extra for the story.


Tuesday, 26 May 2026

Old Wine, New Bottle

How AI companies spent billions rediscovering something TCS / Infosys / Accenture / Thoughtworks figured out decades ago, and why the pattern was always going to repeat.

The AI industry loves a good origin myth.

The latest one is the "Forward Deployed Engineer", a term that conjures images of elite technologists parachuting into Fortune 500 war rooms, building bespoke AI systems, bending the arc of enterprise history.

Palantir made it famous. Now every AI company with a sales problem has one.

There's just one issue. This job is 40 years old. And in the last few weeks, two of the most valuable AI companies in the world made it official.

OpenAI launched the "OpenAI Deployment Company" — $4 billion, backed by TPG, Bain Capital, and Brookfield — explicitly to send forward-deployed engineers into enterprises.

Anthropic also launched its own $1.5 billion equivalent. 

Both dressed up in the language of transformation. 

Job Description Hasn't Changed Much




Every Generation of Complex Software Has Spawned This Layer


The human layer required to bridge "works in demo" to "works in production" never goes away.

What Challenges Do LLM Companies Face?


Real Reason for FDE: 6X money on the table

According to Sequoia Capital, for every $1 spent on software, $6 is spent on services.

Source: https://sequoiacap.com/

OpenAI and Anthropic looked at that ratio and made a simple decision: we are currently capturing the $1. We want the $6. That's the actual strategic next step. 

Not innovation. Not a new deployment model. Pure margin expansion into an adjacent market that their own product creates demand for.

The remarkable thing is that OpenAI's DeployCo includes McKinsey, Bain & Company, and Capgemini as investors. Three of the world's most powerful consulting firms wrote checks to fund their own potential disintermediation. 

It is the same calculation Accenture made when it partnered with Salesforce in 2005 instead of fighting it.

Value Tier Structure Looks Familiar Too



The only structural difference: consulting companies billed the client directly for those hours. FDE costs are bundled into the SaaS ACV and called "customer success."

The labor is identical. The accounting is different.

Why Not Just Let the AI Do It?



This is a hard question for the AI industry, and one it doesn't want to answer directly. If the models are as capable as advertised, why do you need humans to implement them?

Claude Code and similar tools are genuinely compressing the coding portion of implementation work. A junior FDE spending three days writing glue code is being replaced by an agent in three hours. That's a real gain, and everyone understands this.

But the hardest part of implementation was never the code. It's knowing that the CRM data is unreliable because the sales team doesn't update it. It's navigating the political disagreement between the CFO and the CTO about who owns the AI governance question. It's being the accountable human in the room when something breaks in production.

Enterprises aren't ready to give AI agents the access, the trust, or the accountability that implementation requires. That's not a model capability problem. It's an institutional and legal problem. And institutional problems take longer to solve than technical ones.

Who is winning?


FDE: You're doing genuinely valuable work. But you're not a new species. You're the latest instance of a role that has existed since enterprise software existed. The role matters; it is just getting new packaging.


If you're McKinsey or Bain: You just funded the company that is supposed to make you obsolete. History will decide whether that was a savvy hedge or a very expensive mistake.

Sunday, 10 May 2026

From git clone to llm clone

No Software Is Safe

When Linus Torvalds shipped the first version of Git in 2005, he solved the coordination problem of distributed software development. git clone became the foundational primitive of modern open source — a single command that collapsed the distance between "knowing software exists" and "having the software." Before Git, replication required permission, proximity, and manual effort. After Git, replication became free.

We are at an equivalent inflection point. The primitive is different and the implications are more intense. The new command is not git clone <repo>. It does not require the source code. It does not require a repository. It does not require permission from anyone. It requires only a public interface, a test suite, and a frontier model with a feedback loop.


# The old world
$ git clone https://github.com/vercel/next.js
# Requires: public source · open licence · maintainer permission

# The new world
$ llm clone https://nextjs.org
→ Observing public API surface...
→ Generating test suite from documentation...
→ Running 800 agent sessions against correctness oracle...
→ vinext v0.1 ready. Cost: $1,100. Time: 1 week.

# Source code not required. Licence not required. Permission: irrelevant.

Cloudflare ran this command. Anthropic's own agents ran a version of it to build a C compiler capable of compiling the Linux kernel. A pair of developers ran it in the middle of the night on Anthropic's own flagship product, after Anthropic accidentally left the source in a public Cloudflare R2 bucket. The market watched all of this happen in real time and drew the correct conclusion. Nearly a trillion dollars of software market capitalisation was repriced in six weeks.

This post is about what "no software is safe" actually means for how we think about building and defending technology businesses.

Four Months That Changed Everything

The events are discrete but their meaning is cumulative. Taken individually, each looks like an interesting technical demonstration. Taken together, they constitute proof of a new capability regime — and the market treated them accordingly, selling off nearly a trillion dollars in software equity between January and March 2026.

February 5, 2026
Anthropic: 16 agents build a C compiler

16 Claude Opus 4.6 agents, running in parallel Docker containers on a shared Git repository, produce a 100,000-line Rust-based C compiler capable of compiling Linux 6.9 on x86, ARM, and RISC-V. Cost: $20,000. Time: 2 weeks. No human wrote a line of the compiler. The binding constraint was not model intelligence — it was the test harness and GCC oracle that let the agents self-correct.

~February 2026
Cloudflare: Next.js rebuilt in one engineering week

One Cloudflare engineer, using OpenCode and Opus 4.5, rebuilds Next.js as vinext — a Cloudflare Workers-native runtime. Cost: ~$1,100 in tokens. Time: 1 week. 800 agent sessions. No access to Vercel's source code. The specification was Next.js's own public documentation and observable API surface. They ship a migration skill alongside it, so the clone can clone itself into customer codebases.

February 20, 2026
Anthropic launches Claude Code Security → Cyber stocks crash

Claude Code Security, using Opus 4.6, identifies over 500 vulnerabilities in production open-source codebases — bugs undetected for decades. The market immediately reprices the entire cybersecurity sector. CrowdStrike -8%, Okta -9.2%, Zscaler -5.5%, Cloudflare -8.1%, SailPoint -9.4%. The Global X Cybersecurity ETF closes at its lowest since November 2023.

March 27, 2026
Claude Mythos leaked → Second cyber crash

A draft blog post describing Anthropic's next model, Mythos — described internally as "far ahead of any other AI model in cyber capabilities" — is found in a publicly accessible content management cache. Cyber stocks crash again: CrowdStrike -7%, Palo Alto -6%, Zscaler -4.5%, Okta and SentinelOne -3% each. Analysts: "We read this as having the potential to become the ultimate hacking tool."

March 31, 2026 — 04:23 UTC
Anthropic leaks Claude Code. Developers clone it before dawn.

Claude Code v2.1.88 ships to npm with a 59.8MB source map pointing to a public ZIP on Anthropic's own Cloudflare R2 bucket. 512,000 lines of TypeScript, 1,906 files, exposed. Two developers spend the night using OpenAI's Codex to perform a clean-room Python rewrite — claw-code — and push it before sunrise. It reaches 110,000 stars and 100,000 forks. Likely the fastest-growing GitHub repository in history.

Wall Street Understood Before the Engineers Did

Mr Market is reactive, emotion-prone, and frequently wrong about timing. But it is extremely sensitive to structural shifts in the economics of entire industries. The software sell-off that began in January 2026 and accelerated through February and March was not panic. It was the correct pricing of a structural shift that the industry had been discussing for years but the market had not yet priced in.

The trigger was not a single event. It was a sequence, each one confirming the same thesis from a different angle. First came Claude Cowork on January 12 — an agent platform that replaced entire categories of knowledge work software. The S&P 500 Software and Services Index began a sustained decline that wiped roughly a trillion dollars in market value in its first six weeks.



If an AI can autonomously perform legal document review, contract compliance, and financial analysis, the per-seat subscription fees that LegalZoom and Thomson Reuters charge are no longer defensible. If an AI can rebuild Next.js in a week for $1,100, the switching cost moat that Vercel built over a decade is no longer defensible.


The "AI won't replace SaaS" camp is not entirely wrong. Enterprise systems of record — the databases, payroll systems, compliance infrastructure — survive not because AI cannot understand them but because they encode institutional trust and regulatory accountability that cannot be repriced in a weekend. But the middle layer of software — the workflow tools, the reporting layers, the task-specific applications whose only moat was "it would take a team six months to build this" — that layer is the one the market is correctly repricing to zero.


How llm clone Actually Works

The metaphor of llm clone as a primitive deserves unpacking, because the power of the primitive comes from understanding exactly what it does and does not require.

git clone requires source code. The repository must be public or you must be authorized. The clone is bit-for-bit identical to the original. You get the implementation, the history, and the architecture as the author intended it.


llm clone requires none of those things. It requires only a *specification of correctness* — which, for almost every piece of successful software, is freely available in the form of public documentation, observable API behaviour, and user-facing functionality. The clone is not bit-for-bit identical. It is *behaviourally equivalent* — it passes the same tests, produces the same outputs, satisfies the same user needs. The implementation is different. The moat is gone.





The three concrete examples each demonstrate a different variant of this primitive. 

The C compiler was a specification clone — the spec was industry-standard C, and GCC was the oracle. 

Vinext was an interface clone — the spec was Next.js's public API documentation and observable routing behaviour. 

Claw-code was a source-assisted clone — they had the leaked TypeScript, but they deliberately did not copy it, using an agent to produce a clean-room Python rewrite that was legally distinct. 

Three different inputs. Same technique. All three produced working software.
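What "behaviourally equivalent against an oracle" means can be sketched as differential testing: feed the same random inputs to the original and to the clone, and record any disagreement. The two sort functions below are toy stand-ins for a reference implementation and an independently written clone; the real cases used far larger harnesses.

```python
import random
from typing import Callable


def reference_sort(xs: list[int]) -> list[int]:
    """Oracle: stands in for the original system's observable behaviour."""
    return sorted(xs)


def clone_sort(xs: list[int]) -> list[int]:
    """Independently written insertion sort: different code, same behaviour."""
    out: list[int] = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def differential_test(oracle: Callable, candidate: Callable,
                      trials: int = 500, seed: int = 0) -> list[list[int]]:
    """Return every random input on which candidate and oracle disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 12))]
        if oracle(list(xs)) != candidate(list(xs)):
            mismatches.append(xs)
    return mismatches


print(len(differential_test(reference_sort, clone_sort)))      # the clone vs the oracle
print(len(differential_test(reference_sort, lambda xs: xs)))   # a broken "clone"
```

The harness never looks at the clone's source. It only needs the oracle's outputs, which is why a public interface is enough.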

Company That Proved the Theorem, Then Demonstrated It on Itself

There is a particular flavour of irony that only happens in Silicon Valley.

Anthropic spent the first quarter of 2026 methodically proving that LLMs can clone any sufficiently observed software system. They published the C compiler research. Their agents helped build vinext. Their Code Security product crashed the cybersecurity market by demonstrating that proprietary vulnerability detection could be commoditised. 

Then on March 31st, they accidentally demonstrated all of this on their own most valuable product.

Within four hours of the source being public, the community had done something that now stands as the defining event of the year. 

Two developers — two people, ten OpenClaw accounts, one MacBook Pro — fed the leaked architectural patterns into OpenAI's Codex and began a clean-room Python rewrite. 

The entire process was orchestrated end-to-end by an agent workflow. 

They did not copy the TypeScript. They used the architecture as a specification and let the model generate a behaviourally equivalent implementation in a different language. They pushed it before dawn. By end of day it had 110,000 stars.

Anthropic's own CEO had stated that significant portions of Claude Code were written by Claude. If the code was not written by humans, Anthropic's copyright claim over it is legally murky. The torrents of code are seeded. The Python and Rust ports are live.

No Software Is Safe. Here Is What That Actually Means.

The phrase "no software is safe" requires careful unpacking, because it is easy to misread it as hyperbole. It is not. It is a precise technical claim with a specific scope, and understanding that scope is important for thinking clearly about what happens next.

The claim is this: any software whose correctness can be defined by a test suite and whose interface is publicly observable is now within reach of an agent team with a well-designed scaffold. The cost of such a clone is no longer a function of how many engineers the original vendor employed or how many years of institutional knowledge are baked into the codebase. It is a function only of token cost and the quality of the test harness. Both of those are trending to zero.





What the matrix reveals is a harsh reality for most of the software industry. The vast majority of B2B SaaS products sit in the bottom-right quadrant. They have public APIs, documented behaviour, and well-understood correctness criteria, because that is what makes them useful to customers. The same properties that make software legible to users make it clonable by agents.

The products that survive in this regime are not those with the most sophisticated code. They are those whose value is not primarily in the code at all. 

The payroll system that processes $10 billion annually survives not because its code is unclonable but because switching it requires regulatory re-certification, contractual unwinding, and institutional trust built over years of not losing anyone's payslip. 

Databases survive because AI applications need a reliable, governable database underneath the agent layer, and established databases have a decade of operational credibility that a weekend clone does not.

The cybersecurity vendor who can demonstrate human accountability for a missed detection survives in a way that an LLM-generated signature database does not.

The Doubling Clock Is Already Running

Everything described above is the current state. The trajectory is what should focus the mind. METR, the Model Evaluation and Threat Research organisation, published research showing that the length of task an AI agent can complete autonomously doubles approximately every 196 days. Roughly every six months, an agent can work on tasks twice as long before requiring human intervention.
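The doubling arithmetic is worth doing by hand once. The sketch below takes the 196-day figure quoted above as given and computes how much the task horizon grows over a chosen period; the periods in the loop are just examples.

```python
DOUBLING_DAYS = 196  # figure quoted in the text above


def horizon_multiplier(days_elapsed: float, doubling_days: float = DOUBLING_DAYS) -> float:
    """Growth factor of the autonomous task horizon after `days_elapsed` days."""
    return 2 ** (days_elapsed / doubling_days)


for label, days in [("6 months", 182), ("1 year", 365), ("2 years", 730)]:
    print(f"{label:>9}: {horizon_multiplier(days):.1f}x")
```

On this figure alone, one year is a little under two doublings. Any faster compression in cost or time has to come from elsewhere, such as cheaper tokens or a better test harness.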

The C compiler took 16 agents and two weeks. Vinext took one engineer and one week. Claw-code took two developers and one night. These are not the same task — claw-code had the advantage of an architectural specification in the form of the leaked source. But the cost and time compression is directional: each successive clone in 2026 was faster and cheaper than the last.

If the doubling clock holds, by early 2027 the tasks that took a week in early 2026 will take a day. The tasks that took a month will take a week. The tasks that required 16 agents and $20,000 will require one agent and $200. The frontier of what is clonable will advance steadily rightward and upward on the matrix, eating into the "clonable soon" quadrant and shrinking the region that was ever genuinely safe.

This is not a doomsday claim. Newspapers were not destroyed by the internet — they were structurally weakened, consolidated, and the value migrated to platforms and aggregators. Software will not be destroyed by LLM cloning. The value will migrate. 

Saturday, 2 May 2026

Building on Rented Ground

On April 22, 2026, Anthropic changed a checkbox on a pricing page. No announcement. No email. No deprecation notice. Just a quiet edit — and overnight, Claude Code disappeared from the $20/month Pro plan.

It was reversed within hours. Most people treated it as a story about corporate miscommunication, a PR stumble, a test that went sideways. They moved on.

They shouldn't have.

Because the real story wasn't about Anthropic's pricing page. It was about how many engineering teams had built critical workflows on a foundation they didn't own — and didn't realize it until the ground shifted beneath them.


"The risk in your AI stack isn't a hallucination or a model failure. It's a subscription terms change you'll learn about on Reddit."





What Actually Happened

The incident unfolded in a matter of hours. Here is the sequence as it was reported:

~00:00 - CHANGE Anthropic updates claude.com/pricing silently. Claude Code checkbox removed from Pro plan.

~01:30 - DETECT AI industry observers notice diff in pricing page. Screenshots circulate on X.

~02:00 - AMPLIFY Reddit, HN, Twitter catch fire. OpenAI execs begin posting mockery.

~03:00 - RESPONSE Anthropic Head of Growth posts: "~2% of new prosumer signups. Existing users unaffected."

~06:00 - REVERT Pricing page reverted. Claude Code reinstated on Pro plan.

ongoing - DAMAGE Trust eroded. Competitors capitalizing. Structural pricing question unresolved.


The reversal was fast. But a revert doesn't undo the lesson. For a few hours, a significant portion of new signups were being shown a world where Claude Code costs $100/month minimum. That world could come back, announced properly and with a transition period, and next time it won't be reversible.


This Isn't New. It's a Pattern.

Every few years, a platform changes the rules and developers who built on it are left scrambling. The details change. The shape of the problem doesn't.


The common thread across every incident: developers had no contractual protection, no SLA, and no contingency. They had a subscription and an assumption.


Structural Problem Nobody Wants to Solve

Anthropic's head of growth explained the economics, or I should say tokenomics: engagement per subscriber has climbed dramatically. Plans weren't built for agentic, long-running workloads. The flat-rate subscription model, inherited from SaaS, is fundamentally mismatched with AI agent usage patterns.

Think about it. A $20/month Pro plan made sense when you were chatting with an AI. It does not make sense when your agent is running for four hours, consuming thousands of tokens per minute, generating code, calling tools, iterating on failures.


"Flat-rate subscriptions were designed for human usage. Agents are not human. They don't sleep, they don't get tired, and they don't know when to stop."


The math will force a reckoning. The only question is whether the next change comes with a quiet pricing page edit or a proper migration path.
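A back-of-envelope version of that math, with every number except the $20 plan and the four-hour session being an assumption chosen for illustration (the token rate, the metered price, and the number of sessions are not vendor figures):

```python
PLAN_PRICE_PER_MONTH = 20.0   # from the text
SESSION_HOURS = 4             # from the text
TOKENS_PER_MINUTE = 5_000     # assumption: "thousands of tokens per minute"
PRICE_PER_M_TOKENS = 10.0     # assumption: blended metered price in dollars
SESSIONS_PER_MONTH = 22       # assumption: one session per working day


def tokens_per_session(hours: float, tokens_per_minute: int) -> int:
    """Total tokens consumed by one agent session."""
    return int(hours * 60 * tokens_per_minute)


def metered_monthly_cost(sessions: int, tokens: int, price_per_m: float) -> float:
    """What the same usage would cost if billed per token."""
    return sessions * tokens * price_per_m / 1_000_000


session_tokens = tokens_per_session(SESSION_HOURS, TOKENS_PER_MINUTE)
monthly = metered_monthly_cost(SESSIONS_PER_MONTH, session_tokens, PRICE_PER_M_TOKENS)

print(f"tokens per session: {session_tokens:,}")
print(f"metered equivalent: ${monthly:.2f}/month vs ${PLAN_PRICE_PER_MONTH:.2f} flat")
print(f"ratio: {monthly / PLAN_PRICE_PER_MONTH:.1f}x")
```

Change the assumptions and the ratio moves, but it is hard to pick plausible agent workloads where the metered cost lands anywhere near the flat price.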




What Engineers Should Do Now


1. Audit your dependency surface

Map every AI-powered step in your critical workflows. For each one, ask: if this feature became 5x more expensive tomorrow, what breaks? If the answer is "a lot," that's your highest-priority risk.

2. Treat AI subscriptions like third-party APIs

You wouldn't build a payment flow directly on top of a vendor with no fallback and no SLA monitoring. Don't do it with AI tools either. Abstract the dependency. Write to an interface, not a product.

3. Maintain a contingency model

Keep a working integration with at least one alternative — Codex, Gemini, a self-hosted model. It doesn't need to be production-ready. It needs to be runnable in under a day if your primary vendor changes the rules.

4. Watch the economics, not just the product

When a vendor's subscription plan is obviously mispriced relative to their compute costs, a correction is coming. The only variable is how much warning you'll get. Anthropic's plans were priced for chat. They're now being used for agents. That gap closes one way or another.
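"Write to an interface, not a product" and "maintain a contingency model" can share one small piece of code. The sketch below is a minimal illustration: the class names are hypothetical and the `complete` methods are stubs where real vendor or local-model calls would go.

```python
from typing import Protocol


class CodeAssistant(Protocol):
    """The interface your workflow depends on, instead of a specific product."""
    name: str

    def complete(self, prompt: str) -> str: ...


class PrimaryVendor:
    name = "primary"

    def __init__(self, available: bool = True) -> None:
        self.available = available

    def complete(self, prompt: str) -> str:
        if not self.available:  # simulates a plan or pricing change
            raise RuntimeError("plan no longer includes this feature")
        return f"[primary] {prompt}"  # stub: a real vendor call would go here


class LocalFallback:
    name = "local"

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"  # stub: a self-hosted model call would go here


def complete_with_fallback(providers: list[CodeAssistant], prompt: str) -> str:
    """Try each provider in order and return the first successful answer."""
    errors = []
    for provider in providers:
        try:
            return provider.complete(prompt)
        except Exception as exc:
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


print(complete_with_fallback([PrimaryVendor(), LocalFallback()], "refactor foo()"))
print(complete_with_fallback([PrimaryVendor(available=False), LocalFallback()], "refactor foo()"))
```

The fallback does not have to match the primary on quality. It has to exist, and the calling code must not care which one answered.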


When You're Blocked or Priced Out: Your Real Options

If Claude Code moves to $100/month and you're an indie developer, a small team in an emerging market, or a startup watching burn — you may simply not be able to follow. Or you may be blocked for a different reason entirely: your company's security policy prohibits sending code to third-party APIs, your region is geo-restricted, or a vendor suspends your account without warning.

In any of these scenarios, "wait for Anthropic to fix it" is not a strategy. 




Open-weight models can be mapped to hardware specs, from a laptop to a multi-GPU setup.




Trade-offs You Need To Understand

Local models are not free. The cost shifts from monthly subscription to upfront hardware and ongoing operational overhead. You trade vendor pricing risk for infrastructure complexity. A team that's never run inference locally will spend real engineering time getting it right — model loading, quantization choices, context length limits, prompt formatting differences between model families.
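The quantization choice is mostly arithmetic. As a rough sketch, the memory needed for the weights alone is parameters times bits per weight divided by eight; this ignores the KV cache, activations, and runtime overhead, so treat it as a lower bound. The 32B size is used here only as an example.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"32B model at {bits}-bit: about {weight_memory_gb(32, bits):.0f} GB for weights")
```

That gap between 16-bit and 4-bit is the difference between needing several GPUs and fitting on one high-memory machine, which is why the quantization decision comes before almost everything else.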

Speed is also a genuine constraint. A 32B model on consumer hardware produces tokens noticeably slower than a hosted frontier model. For interactive coding workflows this matters. For batch pipelines or async agents, it matters less.

And frontier capability still lives in the cloud — for now. For the most complex architectural reasoning, novel algorithm design, or nuanced refactoring of large codebases, hosted frontier models still hold an edge. The question is whether your workload actually requires frontier, or whether you've been paying frontier prices for tasks that a local 32B handles just fine.


"Most teams don't need frontier models for 80% of their coding tasks. They need frontier models because they never audited what they actually need."


The Ground Will Keep Moving

Anthropic reversed within hours this time. The backlash was real and fast, and they weren't ready for it. But the underlying pressure — agentic usage consuming far more compute than flat-rate plans can absorb — has not gone away. It's building.

At some point, the economics will force a real repricing. And when that happens, it won't be reversed in an afternoon.

The teams that will weather it are the ones building with portability in mind today. Not because they distrust Anthropic specifically, but because they understand the nature of the ground they're building on.

You don't own the model. You don't own the pricing. You don't own the feature set. What you own is your abstraction layer, your fallback strategy, and your ability to move.


Friday, 17 April 2026

Stop Reaching for Agents

Every week I see another team announce they're "building an agent" for a problem that a single well-written prompt would solve. A few weeks later, they're debugging a loop where the model keeps calling the wrong tool, blowing through tokens, and producing answers worse than the one-shot baseline they skipped past.

This is the default failure mode of LLM engineering right now. The industry keeps pushing toward the flashiest pattern on the menu, and teams keep mistaking complexity for capability. The truth is boringly simple: the right pattern is almost always the simplest one that works, and you should have to be forced up the ladder, not invited.


A framework you can use

Think of LLM patterns as rungs on a ladder. Each rung adds capability, but also adds cost, latency, failure modes, and debugging surface area. You climb only when the rung below genuinely can't do the job.



Rung 1 — Single prompt. Zero-shot or few-shot. One call, one answer. This is your starting point for every task, without exception. Modern frontier models are astonishingly capable in a single call, and most teams underestimate how far good prompting alone can take them. 

Examples: classifying emails as urgent/normal/spam, drafting a reply to a customer message, summarizing a meeting transcript into action items, extracting fields from a contract into JSON.
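
A minimal sketch of Rung 1 for the email example: a few-shot prompt builder and a parser for the one-word reply. The example emails and the `unknown` fallback label are assumptions for illustration, and the actual model call is left out because it depends on your provider.

```python
FEW_SHOT_EXAMPLES = [
    ("Server is down, customers can't check out!", "urgent"),
    ("Minutes from Tuesday's sync attached.", "normal"),
    ("You have won a free cruise, click here.", "spam"),
]
LABELS = {"urgent", "normal", "spam"}


def build_classification_prompt(email: str) -> str:
    """Assemble one prompt: instruction, labeled examples, then the new email."""
    lines = ["Classify the email as urgent, normal, or spam. Reply with one word.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Email: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Email: {email}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_label(raw: str) -> str:
    """Normalize the model's reply; anything off-menu becomes 'unknown'."""
    label = raw.strip().lower().rstrip(".")
    return label if label in LABELS else "unknown"


print(build_classification_prompt("Payment page returns a 500 error."))
```

One call, one answer, and the only code around it is string assembly and a defensive parse.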





Rung 2 — Structured prompting and chain-of-thought. When the model gets answers wrong because it's skipping reasoning steps or producing messy output, you don't need a new architecture. You need better instructions. Ask it to think step by step, give it a structure to fill in, show it examples of the reasoning you want. This fixes more problems than people expect. 

Examples: math word problems where the model jumps to a wrong answer, multi-criteria decisions like "should we approve this expense" where you want the reasoning shown, data extraction tasks where output format matters.





Rung 3 — Retrieval-augmented generation (RAG). When the model doesn't know something — your internal docs, fresh data, domain-specific knowledge — bolt on retrieval. You're not changing how the model thinks, just what it has access to. RAG is often mistakenly treated as the default for any knowledge-heavy task; it's the default only when the knowledge genuinely isn't in the weights.

Examples: answering questions from your company's internal wiki, a legal research tool grounded in a specific case database, a coding assistant that needs to reference your private API documentation, a support bot that cites current policy docs.
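
A toy sketch of the retrieval rung. It swaps in keyword overlap where a real system would typically use embedding search, and the `DOCS` passages are invented for illustration. The point is the shape: retrieval changes what goes into the prompt, not how the model is called.

```python
import re

# Hypothetical policy passages standing in for an internal knowledge base.
DOCS = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "Hardware carries a one-year limited warranty.",
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank passages by shared words with the question (a stand-in for vector search)."""
    q = tokenize(question)
    ranked = sorted(DOCS.values(), key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question: str) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


print(build_grounded_prompt("How long does shipping take?"))
```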








Rung 4 — Workflows. Prompt chaining, routing, parallelization. You use these when the task has distinct sub-tasks that you can enumerate in advance. Classify the input, then draft, then check. Or: run these three analyses in parallel and synthesize. The defining feature of a workflow is that you wrote down the steps. The model fills in each one, but the control flow is yours. 

Examples: a translation pipeline that translates, then checks for cultural appropriateness, then adjusts tone. A customer inquiry system that first routes the message to sales/support/billing, then dispatches to a handler tuned for that category. A document analyzer that extracts entities, sentiment, and topics in parallel, then synthesizes a report. A content moderation flow where a draft is generated, then evaluated against policy, then revised if flagged.
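
A sketch of the routing example as a workflow. `fake_model` is a hypothetical stand-in that uses keyword rules where an LLM call would go, and the handler prompts are invented. What matters is that the branch lives in ordinary code: the model fills in each step, but it never decides what the next step is.

```python
from typing import Callable

Model = Callable[[str], str]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: keyword rules instead of an LLM."""
    text = prompt.lower()
    if text.startswith("route:"):
        if "invoice" in text or "charge" in text:
            return "billing"
        if "price" in text or "demo" in text:
            return "sales"
        return "support"
    return f"draft reply for -> {prompt}"


HANDLER_PROMPTS = {
    "sales": "You are a sales rep. Reply to: {message}",
    "support": "You are a support engineer. Reply to: {message}",
    "billing": "You are a billing specialist. Reply to: {message}",
}


def handle_inquiry(message: str, model: Model = fake_model) -> tuple[str, str]:
    # Step 1: classify the message. The code, not the model, owns the branch.
    category = model(f"route: {message}").strip().lower()
    if category not in HANDLER_PROMPTS:
        category = "support"  # fixed default for unexpected router output
    # Step 2: dispatch to the handler prompt tuned for that category.
    reply = model(HANDLER_PROMPTS[category].format(message=message))
    return category, reply


print(handle_inquiry("I was charged twice on my invoice"))
```

Because the control flow is fixed, a bad router output degrades to a known default instead of sending the system somewhere unplanned.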















Rung 5 — Agents. An LLM in a loop with tools, deciding what to do next. You use this when the path genuinely isn't knowable in advance — the model has to observe, decide, act, observe again. Agents are powerful and they're the right answer for some problems, but they're expensive, slow, and the hardest pattern to debug. If you can write down the steps, you don't need an agent; you need a workflow.

Examples: a coding assistant that explores an unfamiliar codebase to fix a bug, where the next file to open depends on what it just read. An open-ended research task where findings from one search determine the next query. A browser agent completing a multi-step booking where page contents dictate the next click. Incident response where the diagnostic path branches based on what each check reveals.




Rung 6 — Fine-tuning. Last resort. Use it when prompting has plateaued, you have a stable task, and you have real data. Fine-tuning trades flexibility for performance on a narrow distribution, and the maintenance cost is real. Most teams who think they need fine-tuning actually need better prompts or better retrieval. 

Examples: matching a very specific brand voice across millions of generated product descriptions, a narrow classification task with labeled data where prompting plateaus below required accuracy, replicating a structured output format that few-shot examples can't reliably produce.


Decision questions

Instead of picking a pattern, ask these questions in order and let them pick for you:

Does the model know enough? If the task requires knowledge the model doesn't have — private documents, today's data, niche domain details — you need RAG. If it has the knowledge, skip this rung.

Can one prompt do it well? Try it before assuming it can't. You'd be surprised how often "I need a multi-step pipeline" turns into "actually, one prompt with good structure handles it." If a single prompt works, ship it.

Can I write down the steps in advance? This is the workflow-vs-agent line, and it's the most important question in the whole framework. If you can enumerate the steps — even if there are branches — you want a workflow. Hardcode the control flow, let the model handle each step. You get deterministic behavior, easier debugging, lower cost, and predictable latency. 

Agents are for when the steps genuinely can't be known ahead of time.






Do the steps depend on each other? Sequential steps become a prompt chain. Independent steps run in parallel. Steps that depend on the input type get a router at the front.

Is quality inconsistent? Add an evaluator-optimizer loop — one model generates, another critiques, the first revises. This is often the right fix before reaching for anything more complex.



Have I plateaued on everything else? Only then does fine-tuning enter the conversation.
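
The questions above can be sketched as a small decision function. The field names and returned labels are hypothetical, and the precedence between router and parallel workflows is an assumption, since the questions do not rank them.

```python
from dataclasses import dataclass


@dataclass
class Task:
    model_has_knowledge: bool
    one_prompt_works: bool
    steps_known_in_advance: bool
    steps_independent: bool = False
    branches_on_input_type: bool = False
    quality_inconsistent: bool = False
    plateaued_on_everything: bool = False


def recommend(task: Task) -> list[str]:
    """Walk the questions in order and collect the patterns they point to."""
    plan: list[str] = []
    if not task.model_has_knowledge:
        plan.append("external knowledge")  # retrieval, or a direct data query
    if task.one_prompt_works:
        plan.append("single prompt")
    elif not task.steps_known_in_advance:
        plan.append("agent")
    elif task.branches_on_input_type:
        plan.append("router workflow")
    elif task.steps_independent:
        plan.append("parallel workflow")
    else:
        plan.append("prompt chain")
    if task.quality_inconsistent:
        plan.append("evaluator-optimizer loop")
    if task.plateaued_on_everything:
        plan.append("fine-tuning")
    return plan


print(recommend(Task(model_has_knowledge=False, one_prompt_works=True,
                     steps_known_in_advance=True)))
```

Note that "agent" is reachable only when both a single prompt and a written-down set of steps have already been ruled out.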


Why the simple-first bias matters

There are three practical reasons the ladder approach beats jumping straight to complex patterns, and they compound.




The first is cost. Every additional LLM call, every tool invocation, every agent loop iteration adds to your token spend. A workflow with three sequential calls costs roughly 3x a single prompt. An agent that takes ten loops to converge costs at least 10x, and typically more, because each iteration resends the accumulated history. And that's when it converges. In production, cost differences of 10–100x between patterns are common.
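
The multipliers can be made concrete with a rough token count. The figures (1,000 tokens per call, 500 tokens of new history per loop) are illustrative assumptions, and the sketch assumes the agent resends its accumulated history on every iteration, which is common but not universal: prompt caching or history truncation changes the picture.

```python
def chain_tokens(calls: int, tokens_per_call: int) -> int:
    """Fixed workflow: each call sends roughly the same amount of context."""
    return calls * tokens_per_call


def agent_tokens(loops: int, base_tokens: int, growth_per_loop: int) -> int:
    """Agent loop: iteration i resends the base prompt plus i earlier steps."""
    return sum(base_tokens + i * growth_per_loop for i in range(loops))


single = chain_tokens(1, 1000)
workflow = chain_tokens(3, 1000)
agent = agent_tokens(10, 1000, 500)

print(f"workflow: {workflow / single:.1f}x a single prompt")
print(f"agent:    {agent / single:.1f}x a single prompt")
```

Under these assumptions the ten-loop agent lands at 32.5x rather than 10x, because the growing history is paid for again on every iteration.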

The second is reliability. Every LLM call has some failure rate. Chain five calls together and you compound those failures. Agents, which can loop arbitrarily, compound them worst of all. Simpler patterns have fewer places to fail and fewer places where a failure cascades.
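
The compounding is easy to quantify under two simplifying assumptions: calls fail independently, and any single failure sinks the whole chain. The 95% per-call success rate is an illustrative figure, not a measurement.

```python
def chain_success(per_call_success: float, calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independence."""
    return per_call_success ** calls


for calls in (1, 3, 5, 10):
    print(f"{calls:>2} calls: {chain_success(0.95, calls):.1%} end-to-end success")
```

At 95% per call, a five-call chain succeeds about 77% of the time and a ten-call chain about 60% of the time.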

The third is debuggability. When a single-prompt system gives a bad answer, you change the prompt. When an agent gives a bad answer, you stare at a 40-step trace trying to figure out which decision went sideways, whether the tool returned the wrong thing, whether the model misread the tool output, whether the loop should have terminated earlier. The complexity you added to solve the problem becomes the problem.



Worked examples

Customer support from product docs. A team reaching for the hot pattern might build an agent: it plans a query strategy, searches docs, reads pages, decides whether to search again, drafts an answer, self-critiques, and revises. Lots of moving parts. Impressive demo.

The ladder approach asks the questions instead. Does the model know your docs? No — so you need retrieval. Can one prompt do it well once the docs are retrieved? Usually yes: "Here's the user's question, here are the relevant doc passages, answer using only the passages." Can you write down the steps? Yes: retrieve, then answer. That's a two-step chain. No agent, no loop, no self-critique — unless measurement shows you actually need them.

Nine times out of ten, the two-step chain ships faster, costs a fraction as much, is easier to debug, and performs as well as or better than the agent.

The tenth case — where questions are genuinely open-ended and require multi-hop reasoning across documents — is where an agent might earn its keep. But you discover that by measuring, not by assuming.

Generating weekly sales reports. Someone pitches an agent that gathers data, analyzes it, and writes the narrative. But walk through the questions. Does the model know your sales data? No — but you don't need RAG either; you need a direct query to your database. Can one prompt do it well? Almost: given the raw numbers, a single prompt can produce a decent narrative. Can you write down the steps? Completely: pull the data, format it, ask the model to write the narrative, optionally ask a second call to check the numbers match. That's a fixed workflow, not an agent. You know exactly what happens every Monday at 9am.
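
A sketch of that fixed workflow. The sales figures and `fake_narrative_model` are made up for illustration. One deliberate swap: where the text suggests an optional second model call to check the numbers, this sketch uses a deterministic regex check instead, which is cheaper but only catches numbers that do not appear in the data at all.

```python
import re
from typing import Callable


def pull_sales_data() -> dict[str, int]:
    """Stand-in for a database query; the figures are invented."""
    return {"north": 1200, "south": 950, "west": 1430}


def format_table(data: dict[str, int]) -> str:
    return "\n".join(f"{region}: {amount}" for region, amount in data.items())


def fake_narrative_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call that writes the narrative."""
    return "West led the week with 1430, followed by North at 1200 and South at 950."


def numbers_match(narrative: str, data: dict[str, int]) -> bool:
    """True if every number quoted in the narrative appears in the source data."""
    quoted = {int(n) for n in re.findall(r"\d+", narrative)}
    return quoted <= set(data.values())


def weekly_report(model: Callable[[str], str] = fake_narrative_model) -> str:
    data = pull_sales_data()                  # step 1: pull
    table = format_table(data)                # step 2: format
    narrative = model(                        # step 3: narrate
        f"Write a short sales summary from these figures:\n{table}"
    )
    if not numbers_match(narrative, data):    # step 4: check
        raise ValueError("narrative quotes a number that is not in the data")
    return narrative


print(weekly_report())
```

Every run executes the same four steps in the same order, which is what makes the Monday job predictable.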

Debugging a failing test in an unfamiliar codebase. Now the agent is justified. Does the model know the codebase? No. Can one prompt do it well? No — the model needs to look at actual code. Can you write down the steps? This is where it breaks down. The next file to open depends on what the last file contained. The error might be in the test, the code under test, a shared dependency, or a config file. You can't enumerate the path because the path depends on what's found along the way. This is the shape of a problem that actually needs an agent: genuine dynamic exploration, not a pipeline dressed up in a loop.


The habit to build

When you pick up a new LLM task, resist the impulse to architect. Start at the bottom of the ladder. Write the simplest prompt that could plausibly work, run it on real examples, and see what breaks. Let the failures tell you which rung to climb to. The specific failure mode — "it doesn't know our product," "it skips reasoning steps," "it can't decide which analysis to run" — maps cleanly onto the next rung.

This is a less glamorous way to build, but it's how you end up with systems that actually work in production. The goal isn't to use the most sophisticated pattern. The goal is to solve the problem with as little machinery as possible, because every piece of machinery is something that can go wrong at 3am.

Start simple. Climb only when you're forced to. Ship.