Friday, 17 April 2026

Stop Reaching for Agents

Every week I see another team announce they're "building an agent" for a problem that a single well-written prompt would solve. A few weeks later, they're debugging a loop where the model keeps calling the wrong tool, blowing through tokens, and producing answers worse than the one-shot baseline they skipped past.

This is the default failure mode of LLM engineering right now. The industry keeps pushing toward the flashiest pattern on the menu, and teams keep mistaking complexity for capability. The truth is boringly simple: the right pattern is almost always the simplest one that works, and you should have to be forced up the ladder, not invited.


A framework you can use

Think of LLM patterns as rungs on a ladder. Each rung adds capability, but also adds cost, latency, failure modes, and debugging surface area. You climb only when the rung below genuinely can't do the job.

Rung 1 — Single prompt. Zero-shot or few-shot. One call, one answer. This is your starting point for every task, without exception. Modern frontier models are astonishingly capable in a single call, and most teams underestimate how far good prompting alone can take them. 

Examples: classifying emails as urgent/normal/spam, drafting a reply to a customer message, summarizing a meeting transcript into action items, extracting fields from a contract into JSON.
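At this rung the entire "architecture" is a template and one call. A minimal sketch of the email-triage example, where `call_model` is a hypothetical stand-in for whichever LLM client you use:

```python
# Rung 1 sketch: one prompt template, one call, one answer.
# `call_model` is a hypothetical stand-in for your actual LLM client.

TRIAGE_PROMPT = """Classify the email below as exactly one of: urgent, normal, spam.

Email:
{email}

Reply with the single label only."""

def build_triage_prompt(email: str) -> str:
    """Fill the template; this is all the scaffolding a Rung 1 task needs."""
    return TRIAGE_PROMPT.format(email=email)

# label = call_model(build_triage_prompt("Production is down for all EU customers"))
```

Constraining the output ("the single label only") is doing real work here: it makes the response trivially parseable without any extra machinery.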


Rung 2 — Structured prompting and chain-of-thought. When the model gets answers wrong because it's skipping reasoning steps or producing messy output, you don't need a new architecture. You need better instructions. Ask it to think step by step, give it a structure to fill in, show it examples of the reasoning you want. This fixes more problems than people expect. 

Examples: math word problems where the model jumps to a wrong answer, multi-criteria decisions like "should we approve this expense" where you want the reasoning shown, data extraction tasks where output format matters.
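The upgrade from Rung 1 is purely in the instructions, which a sketch makes concrete. The structure below is one plausible shape, not a canonical format:

```python
# Rung 2 sketch: same single call as Rung 1, but with explicit reasoning
# structure. No new architecture, just better instructions.

def cot_prompt(problem: str) -> str:
    return (
        "Solve the problem below. Think step by step before answering.\n\n"
        f"Problem: {problem}\n\n"
        "Use this structure:\n"
        "1. Restate what is being asked.\n"
        "2. List the relevant quantities.\n"
        "3. Work through the calculation.\n"
        "Then finish with a line of the form: ANSWER: <final answer>\n"
    )
```

The fixed `ANSWER:` line also gives you a stable field to extract, which handles the messy-output half of this rung at the same time.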


Rung 3 — Retrieval-augmented generation (RAG). When the model doesn't know something — your internal docs, fresh data, domain-specific knowledge — bolt on retrieval. You're not changing how the model thinks, just what it has access to. RAG is often mistakenly treated as the default for any knowledge-heavy task; it's the default only when the knowledge genuinely isn't in the weights.

Examples: answering questions from your company's internal wiki, a legal research tool grounded in a specific case database, a coding assistant that needs to reference your private API documentation, a support bot that cites current policy docs.
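The control flow of RAG is small enough to sketch in full. Here keyword overlap stands in for a real vector store; production retrieval would use embeddings, but the shape is identical: score passages, take the top k, stuff the winners into a grounded prompt.

```python
# RAG toy sketch: keyword-overlap retrieval standing in for a vector store.
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query; keep the top k."""
    return sorted(passages, key=lambda p: len(tokens(query) & tokens(p)),
                  reverse=True)[:k]

def grounded_prompt(query: str, passages: list[str]) -> str:
    """Build the prompt: retrieved context first, then the question."""
    context = "\n\n".join(retrieve(query, passages))
    return (f"Answer using ONLY the passages below.\n\n"
            f"Passages:\n{context}\n\nQuestion: {query}")

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
```

Note that the model itself is untouched; you're only changing what lands in its context window.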


Rung 4 — Workflows. Prompt chaining, routing, parallelization. You use these when the task has distinct sub-tasks that you can enumerate in advance. Classify the input, then draft, then check. Or: run these three analyses in parallel and synthesize. The defining feature of a workflow is that you wrote down the steps. The model fills in each one, but the control flow is yours. 

Examples: a translation pipeline that translates, then checks for cultural appropriateness, then adjusts tone. A customer inquiry system that first routes the message to sales/support/billing, then dispatches to a handler tuned for that category. A document analyzer that extracts entities, sentiment, and topics in parallel, then synthesizes a report. A content moderation flow where a draft is generated, then evaluated against policy, then revised if flagged.
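The routing example can be sketched with the model stubbed out. `fake_model` below is a placeholder for a real LLM call; the point of the sketch is that the dispatch table, not the model, owns the control flow:

```python
# Workflow sketch: route, then dispatch to a category-specific handler.
# `fake_model` is a stub standing in for a real LLM call.

def fake_model(prompt: str) -> str:
    # Stub: crude keyword logic where a real system would call an LLM.
    if "Route" in prompt:
        return "billing" if "invoice" in prompt.lower() else "support"
    return f"[drafted reply for: {prompt[:40]}]"

def route(message: str) -> str:
    return fake_model(f"Route this message to sales/support/billing: {message}")

HANDLERS = {
    "billing": lambda m: fake_model(f"Billing tone. Reply to: {m}"),
    "support": lambda m: fake_model(f"Support tone. Reply to: {m}"),
    "sales":   lambda m: fake_model(f"Sales tone. Reply to: {m}"),
}

def handle(message: str) -> str:
    # We wrote down the steps; the model only fills in each one.
    return HANDLERS[route(message)](message)
```

Because the steps are hardcoded, a bad answer is immediately attributable: either the router misclassified or the handler drafted badly, and you can test each in isolation.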


Rung 5 — Agents. An LLM in a loop with tools, deciding what to do next. You use this when the path genuinely isn't knowable in advance — the model has to observe, decide, act, observe again. Agents are powerful and they're the right answer for some problems, but they're expensive, slow, and the hardest pattern to debug. If you can write down the steps, you don't need an agent; you need a workflow.

Examples: a coding assistant that explores an unfamiliar codebase to fix a bug, where the next file to open depends on what it just read. An open-ended research task where findings from one search determine the next query. A browser agent completing a multi-step booking where page contents dictate the next click. Incident response where the diagnostic path branches based on what each check reveals.
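The loop's skeleton is simple even though its behavior isn't. In this sketch `decide` is scripted to stand in for the model's choice of next tool, and the in-memory filesystem is obviously hypothetical; the structural points are the observe-decide-act cycle and the hard step budget:

```python
# Agent-loop skeleton: observe, decide, act, repeat, with a budget cap.
# `decide` is scripted here; in a real agent an LLM picks the next tool call.

def read_file(path: str) -> str:
    fake_fs = {"test.py": "import helpers", "helpers.py": "BUG = True"}
    return fake_fs.get(path, "<not found>")

TOOLS = {"read_file": read_file}

def decide(history):
    """Scripted stand-in for the model: return (tool, arg), or None to stop."""
    if not history:
        return ("read_file", "test.py")        # start at the failing test
    last_observation = history[-1][2]
    if "helpers" in last_observation:
        return ("read_file", "helpers.py")     # follow the import it just saw
    return None                                # nothing left to chase: stop

def run_agent(max_steps: int = 5):
    history = []                               # list of (tool, arg, observation)
    for _ in range(max_steps):                 # cap the loop; agents must terminate
        step = decide(history)
        if step is None:
            break
        tool, arg = step
        history.append((tool, arg, TOOLS[tool](arg)))
    return history
```

Notice that the second tool call couldn't have been written down in advance: it exists only because of what the first observation contained. That dependency is the whole justification for the pattern.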


Rung 6 — Fine-tuning. Last resort. Use it when prompting has plateaued, you have a stable task, and you have real data. Fine-tuning trades flexibility for performance on a narrow distribution, and the maintenance cost is real. Most teams who think they need fine-tuning actually need better prompts or better retrieval. 

Examples: matching a very specific brand voice across millions of generated product descriptions, a narrow classification task with labeled data where prompting plateaus below required accuracy, replicating a structured output format that few-shot examples can't reliably produce.
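Most of the engineering at this rung is data preparation, not modeling. A sketch of shaping labeled examples into chat-style JSONL records; the `{"messages": [...]}` shape matches common fine-tuning APIs, but treat the exact schema as an assumption to check against your provider's documentation:

```python
# Fine-tuning data-prep sketch: labeled examples -> chat-format JSONL.
# The exact record schema is provider-specific; verify before uploading.
import json

def to_record(text: str, label: str) -> str:
    return json.dumps({
        "messages": [
            {"role": "system", "content": "Classify the ticket."},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]
    })

examples = [("Refund not received", "billing"), ("App crashes on launch", "bug")]
jsonl = "\n".join(to_record(text, label) for text, label in examples)
```

The quality and stability of this file matters more than anything else at this rung, which is why "you have real data" is a precondition, not a nice-to-have.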


Decision questions

Instead of picking a pattern, ask these questions in order and let them pick for you:

Does the model know enough? If the task requires knowledge the model doesn't have — private documents, today's data, niche domain details — you need RAG. If it has the knowledge, skip this rung.

Can one prompt do it well? Try it before assuming it can't. You'd be surprised how often "I need a multi-step pipeline" turns into "actually, one prompt with good structure handles it." If a single prompt works, ship it.

Can I write down the steps in advance? This is the workflow-vs-agent line, and it's the most important question in the whole framework. If you can enumerate the steps — even if there are branches — you want a workflow. Hardcode the control flow, let the model handle each step. You get deterministic behavior, easier debugging, lower cost, and predictable latency. 

Agents are for when the steps genuinely can't be known ahead of time.


Do the steps depend on each other? Sequential steps become a prompt chain. Independent steps run in parallel. Steps that depend on the input type get a router at the front.
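The two non-router shapes are small enough to sketch directly; `step` below is a stub standing in for an LLM call:

```python
# Dependency shapes: a sequential chain vs. an independent parallel fan-out.
from concurrent.futures import ThreadPoolExecutor

def step(name: str, text: str) -> str:
    return f"{name}({text})"  # stub for a model call

def chain(text: str) -> str:
    # Sequential: each step consumes the previous step's output.
    return step("check", step("draft", text))

def fan_out(text: str) -> list[str]:
    # Independent: the three analyses don't read each other, so run them
    # concurrently; map() preserves input order in the results.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda name: step(name, text),
                             ["entities", "sentiment", "topics"]))
```

The fan-out buys latency, not cost: you still pay for three calls, you just stop waiting for them one at a time.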

Is quality inconsistent? Add an evaluator-optimizer loop — one model generates, another critiques, the first revises. This is often the right fix before reaching for anything more complex.
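The evaluator-optimizer loop can be sketched with both model calls stubbed out; `generate` and `evaluate` are placeholders for two LLM calls, and the retry budget is the part worth copying:

```python
# Evaluator-optimizer sketch: one "model" drafts, another critiques, the
# first revises, until the critic passes or the retry budget runs out.

def generate(task, feedback=None):
    # Stub generator: a real call would incorporate the critic's feedback.
    draft = f"Draft for: {task}"
    return draft + " (revised)" if feedback else draft

def evaluate(draft):
    # Stub critic: returns feedback if the draft fails, None if it passes.
    return None if "(revised)" in draft else "Too terse; expand."

def refine(task, max_rounds=3):
    draft = generate(task)
    for _ in range(max_rounds):            # always bound the loop
        feedback = evaluate(draft)
        if feedback is None:
            return draft                   # critic is satisfied
        draft = generate(task, feedback)
    return draft                           # budget exhausted: ship best effort
```

Note this is still a workflow, not an agent: the loop shape is fixed and bounded, only the content of each pass varies.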

Have I plateaued on everything else? Only then does fine-tuning enter the conversation.


Why the simple-first bias matters

There are three practical reasons the ladder approach beats jumping straight to complex patterns, and they compound.


The first is cost. Every additional LLM call, every tool invocation, every agent loop iteration multiplies your token spend. A workflow with three sequential calls costs 3x a single prompt. An agent that takes ten loops to converge costs 10x — and that's when it converges. In production, cost differences of 10–100x between patterns are common.

The second is reliability. Every LLM call has some failure rate. Chain five calls together and you compound those failures. Agents, which can loop arbitrarily, compound them worst of all. Simpler patterns have fewer places to fail and fewer places where a failure cascades.

The third is debuggability. When a single-prompt system gives a bad answer, you change the prompt. When an agent gives a bad answer, you stare at a 40-step trace trying to figure out which decision went sideways, whether the tool returned the wrong thing, whether the model misread the tool output, whether the loop should have terminated earlier. The complexity you added to solve the problem becomes the problem.



Worked examples

Customer support from product docs. A team reaching for the hot pattern might build an agent: it plans a query strategy, searches docs, reads pages, decides whether to search again, drafts an answer, self-critiques, and revises. Lots of moving parts. Impressive demo.

The ladder approach asks the questions instead. Does the model know your docs? No — so you need retrieval. Can one prompt do it well once the docs are retrieved? Usually yes: "Here's the user's question, here are the relevant doc passages, answer using only the passages." Can you write down the steps? Yes: retrieve, then answer. That's a two-step chain. No agent, no loop, no self-critique — unless measurement shows you actually need them.

Nine times out of ten, the two-step chain ships faster, costs a fraction as much, is easier to debug, and performs as well or better than the agent. 

The tenth case — where questions are genuinely open-ended and require multi-hop reasoning across documents — is where an agent might earn its keep. But you discover that by measuring, not by assuming.

Generating weekly sales reports. Someone pitches an agent that gathers data, analyzes it, and writes the narrative. But walk through the questions. Does the model know your sales data? No — but you don't need RAG either; you need a direct query to your database. Can one prompt do it well? Almost: given the raw numbers, a single prompt can produce a decent narrative. Can you write down the steps? Completely: pull the data, format it, ask the model to write the narrative, optionally ask a second call to check the numbers match. That's a fixed workflow, not an agent. You know exactly what happens every Monday at 9am.
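That Monday-morning workflow is short enough to sketch end to end. `query_sales_db` and `call_model` are stubs standing in for your actual database query and LLM client:

```python
# Weekly-report workflow sketch: two deterministic steps, then one LLM call.
# `query_sales_db` and `call_model` are stubs for your DB and LLM client.

def query_sales_db() -> dict:
    # Stub: a real implementation runs a SQL query against your warehouse.
    return {"week": "2026-W16", "revenue": 42000, "deals_closed": 7}

def format_numbers(data: dict) -> str:
    # Deterministic formatting -- no model involved, nothing to hallucinate.
    return "\n".join(f"{key}: {value}" for key, value in data.items())

def call_model(prompt: str) -> str:
    return f"[narrative based on:\n{prompt}]"  # stub for the one LLM call

def weekly_report() -> str:
    data = query_sales_db()          # step 1: pull the data
    table = format_numbers(data)     # step 2: format it
    return call_model(f"Write a short sales narrative from:\n{table}")
```

Only the final step is probabilistic; the numbers themselves never pass through a model unchecked, which is exactly what you want from a report that runs unattended.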

Debugging a failing test in an unfamiliar codebase. Now the agent is justified. Does the model know the codebase? No. Can one prompt do it well? No — the model needs to look at actual code. Can you write down the steps? This is where it breaks down. The next file to open depends on what the last file contained. The error might be in the test, the code under test, a shared dependency, or a config file. You can't enumerate the path because the path depends on what's found along the way. This is the shape of a problem that actually needs an agent: genuine dynamic exploration, not a pipeline dressed up in a loop.


The habit to build

When you pick up a new LLM task, resist the impulse to architect. Start at the bottom of the ladder. Write the simplest prompt that could plausibly work, run it on real examples, and see what breaks. Let the failures tell you which rung to climb to. The specific failure mode — "it doesn't know our product," "it skips reasoning steps," "it can't decide which analysis to run" — maps cleanly onto the next rung.

This is a less glamorous way to build, but it's how you end up with systems that actually work in production. The goal isn't to use the most sophisticated pattern. The goal is to solve the problem with as little machinery as possible, because every piece of machinery is something that can go wrong at 3am.

Start simple. Climb only when you're forced to. Ship.
