Friday, 17 April 2026

Stop Reaching for Agents

Every week I see another team announce they're "building an agent" for a problem that a single well-written prompt would solve. A few weeks later, they're debugging a loop where the model keeps calling the wrong tool, blowing through tokens, and producing answers worse than the one-shot baseline they skipped past.

This is the default failure mode of LLM engineering right now. The industry keeps pushing toward the flashiest pattern on the menu, and teams keep mistaking complexity for capability. The truth is boringly simple: the right pattern is almost always the simplest one that works, and you should have to be forced up the ladder, not invited.


A framework you can use

Think of LLM patterns as rungs on a ladder. Each rung adds capability, but also adds cost, latency, failure modes, and debugging surface area. You climb only when the rung below genuinely can't do the job.



Rung 1 — Single prompt. Zero-shot or few-shot. One call, one answer. This is your starting point for every task, without exception. Modern frontier models are astonishingly capable in a single call, and most teams underestimate how far good prompting alone can take them. 

Examples: classifying emails as urgent/normal/spam, drafting a reply to a customer message, summarizing a meeting transcript into action items, extracting fields from a contract into JSON.
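The email-classification example above is one prompt and one call. A minimal sketch, with `call_llm` as a stand-in for whatever provider SDK you actually use:

```python
# Rung 1 sketch: one call, one answer. `call_llm` is a stub standing in
# for a real provider SDK call (OpenAI, Anthropic, etc.).
def call_llm(prompt: str) -> str:
    return "urgent"  # stub for illustration; swap in a real API call

def classify_email(email_body: str) -> str:
    prompt = (
        "Classify the email below as exactly one of: urgent, normal, spam.\n"
        "Reply with only the label.\n\n"
        f"Email:\n{email_body}"
    )
    return call_llm(prompt).strip().lower()

label = classify_email("Server is down and customers can't check out!")
assert label in {"urgent", "normal", "spam"}
```

The whole system is one function. That is the baseline every other rung has to beat.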





Rung 2 — Structured prompting and chain-of-thought. When the model gets answers wrong because it's skipping reasoning steps or producing messy output, you don't need a new architecture. You need better instructions. Ask it to think step by step, give it a structure to fill in, show it examples of the reasoning you want. This fixes more problems than people expect. 

Examples: math word problems where the model jumps to a wrong answer, multi-criteria decisions like "should we approve this expense" where you want the reasoning shown, data extraction tasks where output format matters.
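For the expense-approval example, "better instructions" can be as simple as a template that demands visible reasoning and a fixed output shape. A sketch (the template wording and limit are illustrative):

```python
# Rung 2 sketch: same single call, better instructions. The template asks
# for step-by-step reasoning and a fixed output structure the caller can parse.
APPROVAL_TEMPLATE = """Decide whether to approve this expense.

Think step by step:
1. Is the amount within the {limit} limit?
2. Is the category allowed under policy?
3. Is a receipt attached?

Then answer in exactly this format:
REASONING: <your steps>
DECISION: APPROVE or REJECT

Expense: {expense}"""

def build_approval_prompt(expense: str, limit: str = "$500") -> str:
    return APPROVAL_TEMPLATE.format(expense=expense, limit=limit)

prompt = build_approval_prompt("$320 team dinner, receipt attached")
assert "DECISION:" in prompt
```

The fixed `DECISION:` line matters: it turns a free-text answer into something downstream code can parse reliably.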





Rung 3 — Retrieval-augmented generation (RAG). When the model doesn't know something — your internal docs, fresh data, domain-specific knowledge — bolt on retrieval. You're not changing how the model thinks, just what it has access to. RAG is often mistakenly treated as the default for any knowledge-heavy task; it's the default only when the knowledge genuinely isn't in the weights.

Examples: answering questions from your company's internal wiki, a legal research tool grounded in a specific case database, a coding assistant that needs to reference your private API documentation, a support bot that cites current policy docs.
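The shape of RAG is retrieve, then ground the prompt in what you retrieved. A toy sketch using keyword overlap as the retriever; a real system would use embeddings and a vector store, but the structure is the same:

```python
# Rung 3 sketch: naive keyword retrieval plus a grounded prompt.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    # Score each doc by word overlap with the query; real systems use embeddings.
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_rag_prompt(question: str, passages: list[str]) -> str:
    context = "\n---\n".join(passages)
    return (
        "Answer the question using ONLY the passages below. "
        "If the answer isn't there, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
passages = retrieve("how long do refunds take", docs)
prompt = build_rag_prompt("How long do refunds take?", passages)
```

Note what didn't change: the model is still answering in one call. Only its inputs got smarter.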








Rung 4 — Workflows. Prompt chaining, routing, parallelization. You use these when the task has distinct sub-tasks that you can enumerate in advance. Classify the input, then draft, then check. Or: run these three analyses in parallel and synthesize. The defining feature of a workflow is that you wrote down the steps. The model fills in each one, but the control flow is yours. 

Examples: a translation pipeline that translates, then checks for cultural appropriateness, then adjusts tone. A customer inquiry system that first routes the message to sales/support/billing, then dispatches to a handler tuned for that category. A document analyzer that extracts entities, sentiment, and topics in parallel, then synthesizes a report. A content moderation flow where a draft is generated, then evaluated against policy, then revised if flagged.
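The routing example above can be sketched in a few lines. The defining feature shows up in the code: the control flow is ordinary Python you wrote, and the model only fills in each step (`call_llm` is a stub):

```python
# Rung 4 sketch: hardcoded control flow, model fills in each step.
def call_llm(prompt: str) -> str:
    return "support"  # stub standing in for a real LLM call

HANDLERS = {
    "sales":   "You are a sales rep. Draft a helpful reply:\n{msg}",
    "support": "You are a support engineer. Draft a troubleshooting reply:\n{msg}",
    "billing": "You are a billing specialist. Draft a reply:\n{msg}",
}

def handle_inquiry(msg: str) -> str:
    # Step 1 (a step you wrote down): classify the message.
    route = call_llm(f"Route to sales, support, or billing:\n{msg}").strip().lower()
    if route not in HANDLERS:   # validate model output between steps
        route = "support"       # safe default
    # Step 2: dispatch to the handler tuned for that category.
    return call_llm(HANDLERS[route].format(msg=msg))

reply = handle_inquiry("My app crashes on login")
```

The validation between steps is the payoff: because you own the control flow, you can check the model's output at every boundary.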















Rung 5 — Agents. An LLM in a loop with tools, deciding what to do next. You use this when the path genuinely isn't knowable in advance — the model has to observe, decide, act, observe again. Agents are powerful and they're the right answer for some problems, but they're expensive, slow, and the hardest pattern to debug. If you can write down the steps, you don't need an agent; you need a workflow.

Examples: a coding assistant that explores an unfamiliar codebase to fix a bug, where the next file to open depends on what it just read. An open-ended research task where findings from one search determine the next query. A browser agent completing a multi-step booking where page contents dictate the next click. Incident response where the diagnostic path branches based on what each check reveals.
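The minimal agent shape is a bounded loop where the model picks the next action and each result feeds the next decision. A sketch with an illustrative tool and a stubbed `decide` step (in a real agent, `decide` is an LLM call that sees the history):

```python
# Rung 5 sketch: observe, decide, act, observe again. The tool set and the
# `decide` stub are illustrative, not a real framework.
def decide(history: list[str]) -> tuple[str, str]:
    # Stand-in for an LLM call that returns (tool_name, argument).
    return ("done", "") if len(history) >= 2 else ("read_file", "test_login.py")

TOOLS = {"read_file": lambda path: f"<contents of {path}>"}

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    history = [f"task: {task}"]
    for _ in range(max_steps):          # always bound the loop
        tool, arg = decide(history)
        if tool == "done":
            break
        history.append(f"{tool}({arg}) -> {TOOLS[tool](arg)}")
    return history

trace = run_agent("fix the failing login test")
```

Even this toy version shows where the cost and debugging pain come from: the trace is the system, and every loop iteration is another LLM call.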




Rung 6 — Fine-tuning. Last resort. Use it when prompting has plateaued, you have a stable task, and you have real data. Fine-tuning trades flexibility for performance on a narrow distribution, and the maintenance cost is real. Most teams who think they need fine-tuning actually need better prompts or better retrieval. 

Examples: matching a very specific brand voice across millions of generated product descriptions, a narrow classification task with labeled data where prompting plateaus below required accuracy, replicating a structured output format that few-shot examples can't reliably produce.
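Fine-tuning work starts with data, not code. A sketch of building chat-format training examples in the JSONL shape most providers accept; the exact field names vary by provider, so check yours, and the example content here is invented:

```python
# Rung 6 sketch: one training example per line, chat format.
import json

def to_example(user_input: str, branded_output: str) -> str:
    return json.dumps({
        "messages": [
            {"role": "system", "content": "Write product copy in AcmeCo voice."},
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": branded_output},
        ]
    })

lines = [to_example("waterproof hiking boots", "Meet the trail head-on...")]
with open("train.jsonl", "w") as f:   # one JSON object per line
    f.write("\n".join(lines))
```

If you can't fill a file like this with hundreds of stable, high-quality examples, you aren't ready for this rung.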


Decision questions

Instead of picking a pattern, ask these questions in order and let them pick for you:

Does the model know enough? If the task requires knowledge the model doesn't have — private documents, today's data, niche domain details — you need RAG. If it has the knowledge, skip this rung.

Can one prompt do it well? Try it before assuming it can't. You'd be surprised how often "I need a multi-step pipeline" turns into "actually, one prompt with good structure handles it." If a single prompt works, ship it.

Can I write down the steps in advance? This is the workflow-vs-agent line, and it's the most important question in the whole framework. If you can enumerate the steps — even if there are branches — you want a workflow. Hardcode the control flow, let the model handle each step. You get deterministic behavior, easier debugging, lower cost, and predictable latency. 

Agents are for when the steps genuinely can't be known ahead of time.






Do the steps depend on each other? Sequential steps become a prompt chain. Independent steps run in parallel. Steps that depend on the input type get a router at the front.
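Of the three shapes above, the parallel one is the easiest to get wrong by accident (teams often run independent steps sequentially). A sketch of independent analyses fanned out with a thread pool, where `analyze` stands in for one LLM call per aspect:

```python
# Independent steps run in parallel; a chain would await each in order.
from concurrent.futures import ThreadPoolExecutor

def analyze(aspect: str, doc: str) -> str:
    return f"{aspect}: ok"  # stub; real version is one LLM call each

def analyze_document(doc: str) -> dict[str, str]:
    aspects = ["entities", "sentiment", "topics"]
    with ThreadPoolExecutor() as pool:
        # Each aspect is independent, so the calls overlap in wall-clock time.
        results = pool.map(lambda a: analyze(a, doc), aspects)
    return dict(zip(aspects, results))

report = analyze_document("Q3 revenue grew 12% on strong EU demand.")
```

Latency here is roughly the slowest single call, not the sum of all three.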

Is quality inconsistent? Add an evaluator-optimizer loop — one model generates, another critiques, the first revises. This is often the right fix before reaching for anything more complex.
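The evaluator-optimizer loop is small enough to sketch directly. Both roles are stubs standing in for LLM calls; the important parts are the bounded rounds and the explicit acceptance condition:

```python
# Generate, critique, revise, with a hard cap on rounds.
def generate(task: str, feedback: str = "") -> str:
    return f"draft for {task}" + (" (revised)" if feedback else "")  # stub

def critique(draft: str) -> str:
    # Returns an empty string to accept; otherwise returns feedback. Stub.
    return "" if "revised" in draft else "tone too informal"

def generate_with_review(task: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if not feedback:          # empty critique means accept
            break
        draft = generate(task, feedback)
    return draft

final = generate_with_review("welcome email")
```

This is still a workflow, not an agent: you wrote the loop, its exit condition, and its cap.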



Have I plateaued on everything else? Only then does fine-tuning enter the conversation.


Why the simple-first bias matters

There are three practical reasons the ladder approach beats jumping straight to complex patterns, and they compound.




The first is cost. Every additional LLM call, every tool invocation, every agent loop iteration multiplies your token spend. A workflow with three sequential calls costs 3x a single prompt. An agent that takes ten loops to converge costs 10x — and that's when it converges. In production, cost differences of 10–100x between patterns are common.

The second is reliability. Every LLM call has some failure rate. Chain five calls together and you compound those failures. Agents, which can loop arbitrarily, compound them worst of all. Simpler patterns have fewer places to fail and fewer places where a failure cascades.

The third is debuggability. When a single-prompt system gives a bad answer, you change the prompt. When an agent gives a bad answer, you stare at a 40-step trace trying to figure out which decision went sideways, whether the tool returned the wrong thing, whether the model misread the tool output, whether the loop should have terminated earlier. The complexity you added to solve the problem becomes the problem.



Worked examples

Customer support from product docs. A team reaching for the hot pattern might build an agent: it plans a query strategy, searches docs, reads pages, decides whether to search again, drafts an answer, self-critiques, and revises. Lots of moving parts. Impressive demo.

The ladder approach asks the questions instead. Does the model know your docs? No — so you need retrieval. Can one prompt do it well once the docs are retrieved? Usually yes: "Here's the user's question, here are the relevant doc passages, answer using only the passages." Can you write down the steps? Yes: retrieve, then answer. That's a two-step chain. No agent, no loop, no self-critique — unless measurement shows you actually need them.

Nine times out of ten, the two-step chain ships faster, costs a fraction as much, is easier to debug, and performs as well or better than the agent. 

The tenth case — where questions are genuinely open-ended and require multi-hop reasoning across documents — is where an agent might earn its keep. But you discover that by measuring, not by assuming.

Generating weekly sales reports. Someone pitches an agent that gathers data, analyzes it, and writes the narrative. But walk through the questions. Does the model know your sales data? No — but you don't need RAG either; you need a direct query to your database. Can one prompt do it well? Almost: given the raw numbers, a single prompt can produce a decent narrative. Can you write down the steps? Completely: pull the data, format it, ask the model to write the narrative, optionally ask a second call to check the numbers match. That's a fixed workflow, not an agent. You know exactly what happens every Monday at 9am.
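That Monday-9am workflow fits in a page. The data and the stubbed narrative here are invented; `fetch_sales` stands in for a direct database query, and only one step is an LLM call:

```python
# Fixed workflow: pull data, format it, narrate. No retrieval, no agent.
def fetch_sales() -> dict[str, float]:
    return {"revenue": 125_000.0, "deals_closed": 14}  # stub for a SQL query

def format_numbers(data: dict[str, float]) -> str:
    return "\n".join(f"{k}: {v:,.0f}" for k, v in data.items())

def call_llm(prompt: str) -> str:
    return "Revenue reached 125,000 across 14 closed deals."  # stub

def weekly_report() -> str:
    data = fetch_sales()              # step 1: pull the data
    table = format_numbers(data)      # step 2: format it deterministically
    narrative = call_llm(             # step 3: the only LLM call
        f"Write a short sales summary from these numbers:\n{table}"
    )
    return f"{table}\n\n{narrative}"

report = weekly_report()
```

Everything except the narrative is deterministic, which is exactly why you know what happens every Monday at 9am.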

Debugging a failing test in an unfamiliar codebase. Now the agent is justified. Does the model know the codebase? No. Can one prompt do it well? No — the model needs to look at actual code. Can you write down the steps? This is where it breaks down. The next file to open depends on what the last file contained. The error might be in the test, the code under test, a shared dependency, or a config file. You can't enumerate the path because the path depends on what's found along the way. This is the shape of a problem that actually needs an agent: genuine dynamic exploration, not a pipeline dressed up in a loop.


The habit to build

When you pick up a new LLM task, resist the impulse to architect. Start at the bottom of the ladder. Write the simplest prompt that could plausibly work, run it on real examples, and see what breaks. Let the failures tell you which rung to climb to. The specific failure mode — "it doesn't know our product," "it skips reasoning steps," "it can't decide which analysis to run" — maps cleanly onto the next rung.

This is a less glamorous way to build, but it's how you end up with systems that actually work in production. The goal isn't to use the most sophisticated pattern. The goal is to solve the problem with as little machinery as possible, because every piece of machinery is something that can go wrong at 3am.

Start simple. Climb only when you're forced to. Ship.

Friday, 10 April 2026

Agents Don't Speak It

This post continues Broken Promise of Agile and Agile Manifesto in the Age of AI-Agentic Software Development.

What happens to sprint planning, standups, retros, and bug reports when the team building your software isn't human.

Jeff Bezos had a simple heuristic for team size: if two pizzas can't feed the team, the team is too big. It was never really about pizza. It was about communication overhead — the invisible tax that grows quadratically as you add people. Small teams move fast because coordination is cheap.

Now imagine replacing those six engineers with six AI agents. No standups. No Slack threads at midnight. No pushback during planning. They just run: weekdays, weekends, 24/7.

Sounds like a superpower. It isn't — or rather, it isn't straightforwardly one. The coordination problems don't disappear. They move, they concentrate, and they become invisible in ways that human teams never were.




Fundamental Difference

When you manage six engineers, you get a huge amount of coordination intelligence for free.


By the way, what is coordination intelligence?

Coordination intelligence is the ability to self-organize around incomplete information without being told to — noticing collisions, resolving ambiguity, pushing back before work goes wrong. In human teams it emerges for free from social context: reputation, embarrassment, shared history.


Engineers notice when two people are working on the same thing. They push back on bad estimates. They carry context from last month's decision. They feel embarrassed when they ship something broken. That embarrassment is load-bearing infrastructure.

Agents have none of this. An agent will accept any scope you give it, work confidently in the wrong direction for hours, produce six internally-consistent but mutually-incompatible outputs — and report back with no signal that anything went wrong.

Six agents don't reduce management overhead. They concentrate it into a single engineer's head.


What Happens to Each Ceremony

Sprint Planning

Planning with engineers is a negotiation. Engineers push back. That pushback is annoying — it's also your earliest warning system. Agents don't negotiate. They accept any scope. Without pushback you'll consistently over-assign, and agents won't tell you — they'll just produce something confidently wrong at scale.

Sprint planning stops being about capacity negotiation. It becomes context package design. 


Let's expand on what a context package is.

Context package design is the discipline of deciding exactly what an agent needs to know, what it must not know, and where its work begins and ends — so it can complete a task correctly without asking questions, without drifting into adjacent scope, and without conflicting with what other agents are building in parallel.


For each task: what does this agent need to know? What must it explicitly not know? Where does it hand off, and to whom? The role shifts from breaking down stories to writing intelligence mission briefs.
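One way to make those questions concrete is to force every task through a brief with explicit scope boundaries. A sketch; the field names are illustrative, not any standard:

```python
# A "context package" per task: what the agent sees, where its scope ends,
# and who receives the handoff. Illustrative structure only.
from dataclasses import dataclass

@dataclass
class ContextPackage:
    task: str
    needs_to_know: list[str]     # docs, decisions, interfaces in scope
    must_not_know: list[str]     # adjacent systems it should never touch
    done_when: str               # explicit completion criterion
    hands_off_to: str            # next agent or human reviewer

    def brief(self) -> str:
        return (
            f"TASK: {self.task}\n"
            f"CONTEXT: {'; '.join(self.needs_to_know)}\n"
            f"OUT OF SCOPE: {'; '.join(self.must_not_know)}\n"
            f"DONE WHEN: {self.done_when}\n"
            f"HAND OFF TO: {self.hands_off_to}"
        )

pkg = ContextPackage(
    task="Add rate limiting to /login",
    needs_to_know=["auth module interface", "redis client config"],
    must_not_know=["billing service", "frontend code"],
    done_when="tests in test_auth.py pass",
    hands_off_to="human reviewer",
)
```

Writing the "must not know" list is where the discipline actually bites: it is what stops parallel agents from drifting into each other's scope.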

Daily Standup

Standups exist to catch invisible blockers early through human signal — tone, hesitation, the "I'll figure it out" when someone won't. Agents don't have tone. The standup equivalent becomes a health-check dashboard: are agents producing output? Did any contradict each other? Is any stuck in a tool-call loop? Status collection becomes anomaly detection.

Sprint Review

In a normal review, the engineer who built the feature explains its edge cases. Knowledge transfers. Pride is a quality signal. With agents, the output exists but nobody fully understands it. An agent can produce 600 lines of passing code and the engineer who prompted it cannot explain every architectural decision.


| Ceremony | Human purpose | With agents, becomes | Verdict |
|---|---|---|---|
| Sprint Planning | Capacity negotiation + pushback | Context package design — precise briefs, explicit scope boundaries | mutates |
| Daily Standup | Catching invisible blockers via tone + signal | Anomaly detection dashboard — traces, diffs, loop detection | mutates |
| Sprint Review | Demo + informal knowledge transfer | Comprehension gate — a human must own and explain the output | intensifies |
| Retrospective | Processing human failures via memory | Prompt autopsy — full trace replay, brief quality analysis | mutates |
| Bug Triage | Assign → investigate → fix with moral ownership | Re-ownership ritual before fix — someone must read the whole module | intensifies |
| Code Review | Peer knowledge transfer + quality gate | Review wall — agents outpace human review capacity almost immediately | breaks |

Human tech debt traces back to a decision. Agent tech debt has no author intent — only output.


Bug Reports

Bug reported → assigned to whom? The agent that wrote the buggy code no longer exists. The bug might live in the interaction between two agents' outputs — nobody's fault in isolation. If you assign the fix to another agent without human comprehension in between, you risk entering a patch spiral.




Code Review

Agents generate PRs faster than a single human can review them. The review wall hits almost immediately. What options do you have?



New Pattern Emerging

Every Agile ceremony was designed to solve a human coordination problem. When you replace engineers with agents, those problems don't disappear — they move up the stack to the one or two humans managing the agents. Those humans now carry the full cognitive load that was previously distributed across a team of six.



The skill of managing agents isn't delegation. It's context architecture — what each agent knows, when, and in what form.


Agile solved for human limits: attention, memory, communication bandwidth. Agents don't have those limits. But they have different ones — context window coherence, statelessness, silent failure, no social accountability. We don't yet have a name for the ceremonies that solve for those.

The teams who figure out the new paradigm first will ship faster — not because they have more agents, but because they've rebuilt the coordination layer from scratch for the laws of physics that actually govern them.


Friday, 3 April 2026

The Great Claude Code Leak, and Under the Hood: Claude Code | Codex | Gemini

The accidental leak of the Claude Code source code on April 1, 2026, provided an unprecedented look into Anthropic's agentic architecture. With thousands of mirrors now circulating online, the industry has a rare opportunity to analyze the prompt design decisions and tool-use frameworks that power high-end coding agents. This is the ideal moment for a comparative study of how leading AI companies structure their internal developer workflows.

What the system prompts of Codex CLI, Gemini CLI, and Claude Code reveal about each team's theory of AI reliability — and what that means if you're building agents yourself.



Links to the full prompts are included below, if you're eager to read those first.



Every system prompt is a natural-language program — a list of instructions encoding a theory of how an AI agent becomes reliable.

When OpenAI, Google, and Anthropic each built their flagship coding CLI tools, they made the same bet differently: that there exists a root cause for agent failure, and that the right prompt addresses it at the root.

Reading the published system prompt structures for Codex CLI, Gemini CLI, and Claude Code side by side, what emerges is not a feature comparison. It is three distinct philosophies of control.

OpenAI says: give the model a coherent identity and it will make coherent decisions. 

Google says: give the model explicit operational procedures and the decisions follow. 

Anthropic says: enumerate what the model must never do and the safety boundary itself becomes the guarantee.

Every company building on top of these models will face the same architectural choice. Understanding what the frontier labs chose — and why — is a prerequisite for making that choice well.

Identity, Process, and Constraint as Design Primitives

Codex CLI's prompt is dominated by persona construction. Personality, values, interaction style, escalation behavior — the overwhelming share of prompt surface area is spent answering: who is this agent? The implicit theory is that a model with a coherent, well-specified identity will produce coherent behavior by inference. Tell the model it is pragmatic, rigorous, and respectful; that it values clarity over cleverness; that it should challenge bad requirements rather than silently comply — and the specific behaviors emerge from that character.

Gemini CLI takes the opposite approach. The prompt allocates most of its weight to operational procedures: context efficiency strategies, search-and-read patterns, development lifecycle phases (Research → Strategy → Execution), sub-agent orchestration instructions. The model's identity is thin. The workflow is thick. The implicit theory is that reliable outputs come from constraining the action space rather than shaping the decision-making self.

Claude Code occupies a different axis entirely. The heaviest sections are not about who the agent is, nor about how it should work — they are about what it must not do. Blast radius. Reversibility. No destructive operations. Explicit OWASP threat categories. The theory here is that agent reliability is a negative property: an agent is trustworthy to the degree that it cannot cause harm, not to the degree that it has good values or follows good procedures.




OpenAI Bets on Identity




The Codex CLI prompt reads less like an instruction manual and more like a character sheet for a fictional software engineer. It specifies personality traits (pragmatic, communicative), professional values (clarity, rigor), and crucially — an escalation philosophy. The agent is explicitly told when to push back: when it detects a bad tradeoff, when requirements seem underspecified, when the pragmatic path diverges from the literal ask.

This is the most sophisticated model of human collaboration in any of the three tools. Most agent prompts tell the model what to do. Codex tells it when to refuse, and how. That is a fundamentally different relationship with the user — it treats the engineer as a peer whose judgment can be wrong, not as a principal whose instructions are commands.



The escalation section — "challenge, pragmatic mindset, tradeoff" — is load-bearing in a way that is easy to miss. It encodes a theory of collaboration: the agent's job is not to execute instructions but to contribute judgment. This is what separates a coding tool from a coding collaborator. OpenAI made that choice deliberately, and it is visible in the prompt structure.


There is a notable anomaly in the Codex prompt: the frontend tasks section, which specifically mentions bold choices, surprising colors, and visual creativity. For a CLI tool targeting professional engineers, this is unusual. It suggests one of two things: either OpenAI designed Codex for a broader creative audience than the command line implies, or the frontend callout reflects the team's belief that creative judgment — not just technical execution — is a property the agent should possess by default.

The editing constraints are instructive in their specificity. Don't amend commits. Apply patches rather than rewrites. Maintain good code comments. These are not general principles — they are the learned lessons of a team that watched models cause damage in codebases and back-encoded the failure modes into the prompt. The specificity is a learning from failure.

Full Prompt is available @ Codex System Prompt

Google Bets on Process




Where Codex builds a person, Gemini CLI builds a workflow. The prompt is structured around phases and patterns: how to search efficiently, how to read large codebases without exhausting context, when to spawn sub-agents, how the development lifecycle should flow from research through strategy to execution. Identity is thin. The word "pragmatic" does not appear. What appears instead is an explicit context budget awareness that no other tool's prompt contains.

The "Context Efficiency" section — strategic tool use, estimated context usage — is the tell. This is an infrastructure concern bleeding into the prompt layer. Google is aware that Gemini's context, however large, is a finite and expensive resource, and they have encoded context management as a first-class concern for the agent itself. The model is being asked to reason about its own resource consumption in real time.


When a company encodes "estimate context usage" into an agent's operating principles, it is admitting something: context window economics are not solved at the infrastructure layer, so they are being delegated to the agent layer. This is a runtime concern being pushed into the prompt. It is not elegant, but it is honest.

The Development Lifecycle section — Research → Strategy → Execution — is the most ambitious design choice in any of the three prompts. It tries to impose a thinking structure on the model: don't execute before you understand, don't implement before you have a strategy. Most tools treat the agent as reactive; Gemini CLI tries to make it deliberate. Whether a model actually follows this structure in practice is a different question. As a design intention, it is the clearest signal that Google is trying to build a thinking partner rather than a code-generation endpoint.

The sub-agents section is equally revealing. Gemini CLI explicitly models itself as an orchestrator: codebase investigation, CLI help, and generalist tasks are treated as separable concerns that can be delegated to specialized sub-agents. This is an architectural declaration — that the right model of AI-assisted development is multi-agent, not monolithic, and the prompt structure should reflect that from the start.

Full Prompt is available @ Gemini System Prompt

Anthropic Bets on Constraint



Claude Code's prompt has a different texture from the other two. It is not warmer or colder — it is more cautious in its diction. The language of the operations sections borrows from risk management: blast radius, reversibility, local change scope, no destructive operations. These are not metaphors. They are explicit categories that the agent is meant to evaluate before acting. The implicit model is that every action the agent takes should be assessed for its damage potential before execution, not after.

The capitalized IMPORTANT section — for security and URLs — is itself a prompt engineering technique, not merely a content category. Anthropic knows that models attend to capitalization and structural salience. Labeling a section IMPORTANT is a way of increasing the probability that the model treats its contents as non-negotiable rather than advisory. This is a team that knows how the sausage is made, and they are using that knowledge inside the prompt itself.



No other tool's prompt contains the phrase "blast radius." The use of weapons-of-war language for file operations is not accidental. It encodes a severity calibration: deleting the wrong file is not an inconvenience to be apologized for, it is a detonation. The vocabulary shapes how the model weights consequences, not just which actions it permits.


The security vulnerabilities section is the most technically specific of any prompt section across all three tools. Command injection, XSS, SQL injection, OWASP Top 10.  Anthropic is not asking the agent to "be security-conscious." They are naming threat classes and expecting the agent to recognize them in context. The implicit assumption is that a model trained on enough security literature can pattern-match against named vulnerabilities in real code, and the prompt's job is to activate that capability rather than describe it from scratch.

The Compressed Conversation section — handling context limits and context window overflow — is an admission that long-running agentic sessions will hit memory boundaries, and the agent needs a recovery behavior rather than silent degradation. This is operational visibility: the prompt accounts for the session not fitting in the window, which is a runtime failure mode that most prompts ignore entirely.

Full Prompt is available @ Claude Code System Prompt


What the Surface Area Reveals



Three Design Choices You Should Consider

If you are building an AI product that involves an agent taking actions — writing code, modifying files, calling APIs — these three prompts are good reference implementations. They are proofs of three different product bets, each with predictable failure modes.

The identity approach fails gracefully in ambiguous situations but fails badly at the capability ceiling. A model with a well-specified persona makes sensible judgment calls when the instructions run out. But persona is not a substitute for operational procedure in repetitive, high-stakes workflows. When the agent needs to search a large codebase efficiently, knowing it is "pragmatic" does not help. You need the grep patterns.




For most AI products, the right prompt architecture layers all three: a thin identity layer to establish tone and judgment defaults, a procedure layer for the high-frequency operational paths, and a constraint layer for the actions where failure is not recoverable. The mistake is choosing one and applying it universally. Each layer serves a different failure mode.

The process approach fails at novel tasks. If the agent's workflow is Research → Strategy → Execution, and the user asks for something that doesn't fit that shape, the agent either forces the task into the wrong template or falls back to undefined behavior. Procedures are brittle at their boundaries. This is the same critique Rich Hickey makes of complected code — when the procedure and the judgment are tangled, changing one breaks the other.

The constraint approach fails at capability, by design. An agent that is maximally conservative about blast radius, reversibility, and destructive operations will refuse or seek permission at the moments when an experienced engineer would just act. The safety guarantee comes with a throughput cost. For consumer-facing products, this is the right trade. For developer tools used by people who understand the risk, it may be too conservative.
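The layered architecture argued for above can be sketched as plain composition: thin identity first, procedures for the hot paths, hard constraints last so they read as non-negotiable. The layer contents here are illustrative, not any lab's actual prompt:

```python
# Layered system prompt: identity + procedure + constraint, composed in order.
IDENTITY = "You are a pragmatic coding assistant. Prefer clarity over cleverness."

PROCEDURES = """Before editing:
1. Search for existing usages of the symbol you are changing.
2. Read the surrounding module before writing new code."""

CONSTRAINTS = """IMPORTANT:
- Never run destructive commands (rm -rf, DROP TABLE, force push).
- Keep every change reversible; prefer small diffs."""

def build_system_prompt(*layers: str) -> str:
    return "\n\n".join(layers)

system_prompt = build_system_prompt(IDENTITY, PROCEDURES, CONSTRAINTS)
```

Keeping the layers as separate strings also means each one can grow its own incident log without tangling the others.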

One structural observation that cuts across all three: none of these prompts is static, and many instructions are added at runtime.

The specificity of Codex's editing constraints, Gemini's context efficiency instructions, and Claude Code's OWASP threat categories all bear the fingerprints of post-hoc repair — lessons learned from watching models fail in production, back-encoded into the prompt. The prompt is not a design document. It is a running incident log, formatted as instructions.

Every overly specific rule is a failure that happened once.

If you want to understand what problems a team has actually encountered with their agent, read the most specific sections of their system prompt. The level of specificity is directly proportional to the pain the team faced while building the tool.

So what is the story for each model?



The prompts are archives of expensive mistakes, and reading them carefully is the cheapest form of safety research available.



Thursday, 2 April 2026

AI Engineering Terms You Will Memorize and Then Forget

The LLM era has a reliable product cycle: someone coins a term, someone more famous endorses it, the internet credits the endorser, and LinkedIn does the rest.



The cycle is embarrassingly simple.



We have done this at least three times in four years. Each time with complete sincerity. Each time with worse naming.

The progression is instructive. Prompt engineering at least had the decency to describe what you were actually doing — writing prompts. 

Context engineering was already stretching it; the word "engineering" is doing considerable structural load-bearing for what Shopify CEO Tobi Lutke accurately described on June 18, 2025 as "the art of providing all the context for the task to be plausibly solvable by the LLM."

One week later, Andrej Karpathy endorsed the term with a longer explanation. 

This is the naming mythology rewriting itself in real time, which is either poetic or depressing depending on your tolerance for how information actually spreads.

Harness engineering arrived in February 2026, formalized by Mitchell Hashimoto and documented by an OpenAI team who used it to describe building a million lines of agent-generated code. The OpenAI post is worth reading carefully, because what it actually describes — at length, with genuine insight — is: write config files, enforce architectural patterns with linters, and run cleanup jobs to remove code the agents wrote badly. In previous decades we called this maintenance. We did not issue certifications for it.

A brief taxonomy, in ascending order of naming audacity

Prompt engineering (2020 onward). You wrote better prompts. The model got better and started understanding worse prompts. The practice quietly became "just describe what you want" and nobody held a funeral. Peak era: elaborate system prompts explaining that the model was a helpful assistant who should be honest. The model already knew. The prompt was load-bearing only for the engineer's confidence.

Context engineering (June 2025). Tobi Lutke named it. Karpathy amplified it. The internet misattributed it. The underlying activity — deciding what information the model needs to do its job — predates LLMs by however long humans have been briefing other humans before asking them to do things. The new contribution was giving that activity a name that sounds like it requires a degree. Within weeks, "context engineering" had a LangChain blog post, a Hugging Face explainer, an Anthropic guide, and a GitHub repository with a biological metaphor that progressed from atoms to neural fields. The speed from named to over-explained remains a record.

Harness engineering (February 2026). The OpenAI post describes three engineers spending every Friday cleaning up what they called "AI slop" before they automated that cleanup into a recurring agent task. This is a genuinely useful observation. It is not a new discipline of engineering. It is a description of what happens when you run software in production and things go wrong, which has been happening since software existed in production.


AI slop is low-quality, mass-produced digital content generated by artificial intelligence that lacks human effort, meaning, or artistic value.



Why this keeps happening, and who benefits

The naming game is not accidental, and it is not innocent. A new term does three things simultaneously: it creates a hiring category, it creates a product category, and it creates a reason to hold a conference. All three monetize faster than the underlying practice matures.



Vendors benefit most cleanly. The company whose engineer names the discipline owns the default tooling search. OpenAI documented harness engineering using Codex. OpenAI sells Codex. The educational content is the distribution strategy wearing a lab coat. 

This is just how technology markets work. The observation worth making is that it works every single time, on an accelerating schedule, with no apparent ceiling on how quickly a named practice can generate a certification program.

Engineers benefit more ambiguously. A new title justifies a salary band and provides a legible identity in a market where "I work with AI" is too broad to be useful. 

The cost is that the identity couples to practices with an expiry date. 

The prompt engineer of 2022 discovered this. The context engineer of 2025 is discovering it now. The harness engineer of 2026 has perhaps eighteen months before the runtime absorbs the harness and the job title requires a new noun.


Every AI engineering term names a gap between what the model can currently do and what the business currently needs. The gap is real. The engineering discipline named after it is temporary. These are compatible facts that the certification market prefers not to foreground.





Predictions — clearly labeled as such

What follows are forward-looking extrapolations of terms likely to emerge soon. I present them early so that, if they catch the attention of influential voices or gain traction publicly, they can spark a self-reinforcing cycle, from hiring categories to product ecosystems to conferences, ultimately unlocking significant value. And who knows, it might even make me a bit famous.

The next term will probably be verification engineering, describing the practice of checking whether agent output is correct before it causes a problem in production. This is currently called testing. It will get renamed when a sufficiently publicized production failure is traced to inadequate output validation from an autonomous agent, and when that failure generates enough LinkedIn posts to constitute a discourse.

After that, something like decomposition engineering — the practice of breaking high-level goals into units of work that agents can handle without producing nonsense. The OpenAI harness post describes this as their team's primary job: working "depth-first, breaking down larger goals into smaller building blocks." 

This activity already has a name in project management. It will get a new name when someone publishes a paper showing that agent output quality correlates more strongly with task decomposition quality than with model selection. The paper will be correct. The naming will still be funny.

At some point the industry will notice that the "prior art" column and the "AI engineering term" column describe the same activities, and software engineering — which has existed since the 1960s — will be quietly declared to have been context-harness-verification engineering all along. A retrospective blog post will be written. It will get many LinkedIn reposts.


What actually transfers

Why did I write this satire?

Building reliable systems around non-deterministic components is genuinely hard. The existing vocabulary of software engineering does not perfectly fit — a prompt is not a function, a context window is not a database, an agent failure is not a standard exception. The gap in vocabulary is real, and new terms can be useful even when the underlying activity is old.

The mistake is not naming practices. The mistake is mistaking the name for the skill. The engineer who understands why a system should behave a certain way — and can specify, verify, and maintain that behavior across model versions — will be fine regardless of what the current term is. The engineer who has mastered the current term and little else will be retraining when the next model ships.

The models will keep improving. The limitations will keep shifting. The terms will keep coming. And somewhere, a course will always be ready before the ink on the blog post is dry.



Friday, 27 March 2026

Similarity is not Relevance

There is a subtle confusion baked into every LLM-powered system in production today, and it is responsible for a larger fraction of failures than most teams realize. The confusion is this: we have built systems optimized for similarity, and we have shipped them as if they deliver relevance. They do not, and the difference is not academic.

Similarity is a geometric property. Two things are similar when they are close to each other in some metric space — cosine distance between embeddings, edit distance between strings, perplexity under a language model. It is computable, differentiable, and entirely indifferent to purpose. Relevance, by contrast, is teleological. Something is relevant if it advances a goal, reduces uncertainty, or changes what you should do next. Relevance is defined relative to an intention. Similarity is blind to intention.
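A toy sketch makes the geometric point concrete. Bag-of-words vectors stand in for learned embeddings here (an assumption for illustration; real systems use dense embeddings, but the geometry is the same idea): two requests with opposite intentions can still sit close together in the metric space.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using toy bag-of-words vectors.
    Entirely indifferent to what either text is *for*."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Two requests that are geometrically close but demand opposite actions:
a = "cancel my subscription immediately"
b = "renew my subscription immediately"
print(cosine_similarity(a, b))  # → 0.75 (three of four words shared)
```

The score is high because the surface vocabulary overlaps; the one word that differs is the one that carries the entire intention.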

Every major component of modern LLM stacks — retrieval, generation, alignment — is built on similarity. When they fail, they fail for the same reason: they found what was close, not what was needed.

The model is always correct about what is similar. It has no native mechanism for knowing what is needed.


Similarity ≠ Relevance

The autocomplete that always answers the wrong question


Consider a developer working on a distributed payment service. She types a function signature for retry logic with exponential backoff and asks the coding assistant to complete it. The assistant produces a clean, syntactically valid implementation — well-formatted, documented, handling the common cases. It looks exactly like the retry logic that appears in ten thousand open-source repositories.

What the assistant has done is retrieve and synthesize code that is maximally similar to retry logic in its training distribution. What the developer needed was retry logic that respects the idempotency contract already established elsewhere in the codebase, coordinates with the circuit-breaker state that her colleague committed last week, and avoids the cascading retry storm that their incident review identified two sprints ago. None of that information lives in the local similarity neighborhood of "exponential backoff implementation."

The assistant solved the problem it could measure. It optimized for syntactic and semantic proximity to known good code. But relevance in this context is defined by the surrounding system — the architecture decisions, the failure post-mortems, the implicit contracts between services. These are irreducibly contextual. They do not compress into an embedding.
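For illustration, here is the kind of textbook backoff an assistant plausibly produces (a hedged sketch, not any particular assistant's output; `call_with_backoff` and its parameters are hypothetical names). Everything the payment service actually needed lives in the comments, outside the similarity neighborhood of the code itself.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Textbook exponential backoff with jitter: the shape that appears in
    ten thousand repositories, and the shape the assistant reproduces."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # What this generic version cannot know, and the codebase needs:
            # - is fn() idempotent under the service's payment contract?
            # - is the shared circuit breaker already open?
            # - will many clients retrying in sync recreate the retry storm
            #   from the incident review two sprints ago?
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The function is clean, correct, and well within the training distribution. Nothing about it is wrong in isolation; everything about it is undetermined by isolation.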


The deeper problem is that similarity-based retrieval actively misleads by presenting confident outputs. A retrieved chunk with cosine similarity 0.91 feels authoritative. The developer accepts it, integrates it, and the failure surfaces in production three months later — not as an obvious crash, but as a subtle degradation under specific load patterns.

The similarity score was high; the relevance was near zero.



Fancy Words, Empty Head

Email generation is where the similarity-relevance gap is most visible and least discussed, because the outputs feel so undeniably correct. 

You ask the model to draft a follow-up email to a client who missed a deadline. It produces something professional, appropriately apologetic, clear in its next-step request, and tonally calibrated to business correspondence. 

Every sentence resembles what a senior professional would write in this situation.

But "resembles" is exactly the problem. 

The model has matched the surface pattern of the email genre. It does not know that this particular client is two weeks from the end of their annual contract and the conversation has been quietly tense since a pricing dispute in Q3. 

It does not know that the missed deadline was likely caused by a restructuring on their side that your account manager mentioned in passing on Slack. 

It does not know that a direct ask for a new timeline would land badly right now, while an offer of support would open the door. The relevant email is defined by that relational history, not by similarity to the genre of follow-up emails.

The model produces text that is similar in form to a good email. It has no mechanism for knowing whether the email is good for this situation.


What gets produced is a document that would earn an A in a business writing course and accomplish nothing — or worse, accelerate a deteriorating relationship by applying a generic professional register to a moment that required something specific. The failure is invisible because the output is fluent. Fluency is a similarity property. 

It measures proximity to well-formed text. It says nothing about whether the text does the right work in the right moment.

This is where RLHF compounds the problem. Human raters, presented with the email during training, reward it — because it looks like good professional writing. 

The model is trained to produce outputs that humans rate as high quality in isolated evaluation. But isolated evaluation cannot capture relational context. 

The model gets better at producing emails that resemble good emails. The gap between resemblance and genuine utility quietly widens.


When grouping by proximity destroys meaning


Clustering is the case that most directly exposes the architectural assumption underneath the whole stack. When you cluster documents, support tickets, or customer feedback using LLM embeddings, you are grouping by geometric proximity in the embedding space. The algorithm puts similar things together. This is, on its face, exactly what clustering should do.

Except that the purpose of clustering is never geometry. The purpose is always analytical — you are trying to understand the structure of a problem, identify actionable segments, or surface patterns that inform a decision. And those analytical goals define what "same group" means, independently of what "similar text" means.

A support ticket that reads "the dashboard is slow" and a ticket that reads "the API is timing out" might be semantically distant in embedding space — different vocabulary, different technical register, different surface description. But if both are caused by the same database query bottleneck, they belong in the same bucket for the engineering team. Conversely, two tickets that both say "I can't log in" might be superficially identical but one is a password reset issue and one is an account suspension, and routing them to the same team is actively harmful.
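The divergence can be shown with a toy sketch, again using bag-of-words cosine as a stand-in for embedding similarity (an assumption for illustration; the ticket texts and cause labels are invented). The two login tickets are geometrically identical while the two database-bottleneck tickets are geometrically distant — the exact opposite of the grouping the engineering team needs.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity; a stand-in for embedding distance."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

tickets = {
    "t1": "the dashboard is slow",
    "t2": "the api is timing out",  # same root cause as t1 (db bottleneck)
    "t3": "i can not log in",       # root cause: password reset
    "t4": "i can not log in",       # root cause: account suspension
}
root_cause = {"t1": "db", "t2": "db", "t3": "password", "t4": "suspension"}

# Surface similarity puts the two login tickets together (score 1.0)
# and splits the two db-bottleneck tickets apart:
assert bow_cosine(tickets["t3"], tickets["t4"]) > bow_cosine(tickets["t1"], tickets["t2"])
# The grouping that matters for routing (by cause) disagrees with the geometry:
assert root_cause["t3"] != root_cause["t4"] and root_cause["t1"] == root_cause["t2"]
```

No amount of tuning the distance metric fixes this, because the cause labels are not in the text; they live in the system the text is about.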



Similarity clusters by surface. Relevance clusters by cause. The right grouping depends on what you intend to do with the groups.


The geometry of the embedding space does not know that your goal is actionable routing. It knows word co-occurrence patterns. Sometimes those align. When the stakes are low, the alignment is good enough. When you are making resource allocation decisions, prioritizing engineering work, or segmenting customers for intervention, the gap between what is similar and what is relevant determines whether the analysis was worth running.

The seductive part is that similarity-based clusters look coherent. The topics within each cluster feel related. The outputs pass a plausibility check. But plausibility is another similarity property — it measures whether the output resembles something true. It does not measure whether the groupings actually serve the analytical purpose for which the clustering was run.




Three faces, one failure

Across all three cases — the code that fits the genre but breaks the system, the email that sounds right but says the wrong thing, the clusters that are coherent but not actionable — the failure has identical structure. 

The model or the pipeline optimized for a measurable proxy (syntactic similarity, surface fluency, geometric proximity) and produced an output that scores well on that proxy. 

The proxy and the goal coincided in the average training case. They diverged in the specific deployment case. The system had no way to detect the divergence.

This is not a hallucination problem. The outputs in all three cases can be entirely accurate in a narrow sense. The code is syntactically correct. The email is factually unobjectionable. The clusters are internally coherent. The failure is not falseness — it is misalignment between what was optimized and what was needed.


What this means practically is that the verification burden sits entirely with the human in the loop. Every LLM output comes pre-packaged with high confidence and fluent presentation — both similarity properties — and zero signal about whether it is relevant to the specific situation at hand. The engineer must know the system well enough to see past the fluent implementation. The account manager must know the client well enough to see past the professional tone. The analyst must know the business well enough to see past the coherent clusters. The AI provides the shape of an answer. Relevance is still a human judgment.


Conclusion

None of this means the tools are not useful. Similarity to good outputs is a genuinely valuable prior. 

A coding assistant that produces implementations similar to idiomatic, working code accelerates the developer who knows the system. 

An email assistant that produces text similar to professional correspondence accelerates the writer who knows the relationship.

The similarity machinery handles the generic, leaving the expert to handle the specific.

The error is the frame — treating outputs that scored high on similarity as if they had been evaluated for relevance. 

They have not been. They cannot be, because relevance requires the deployment context that was absent at training time. 

The model is excellent at finding what is close. Determining whether what is close is what is needed remains, stubbornly, a problem for the human who knows what is needed.

Confusion about this distinction is a major issue. It is the source of an entire category of quiet, confident, professionally formatted failures.