Friday, 13 March 2026

The Tail at Scale


When engineers talk about latency, they almost always talk about typical latency: the mean, the P50, the median experience. It's a comfortable metric — it responds to optimization, it's easy to visualize, and it makes dashboards look good. The trouble is that in any non-trivial distributed system, the median is nearly irrelevant to whether your system actually feels fast.

A 2013 paper from Google, The Tail at Scale, reframed the entire conversation. The insight was simple: in a system where a single user request fans out to hundreds of backend machines, the response time is determined not by the average machine but by the slowest machine in the fan-out. The 99th percentile is not a corner case. It is, structurally, the common case for any sufficiently large request graph.

This is the founding observation behind Google's latency engineering philosophy — and it reshapes how you think about almost every architectural decision in a shared environment.

"In a large enough system, the tail is the average. The stragglers are not noise — they are the signal."

Why Shared Environments Are Inherently Hostile to Latency

A shared environment — whether it's a multi-tenant cluster, a distributed storage layer, or a cloud runtime shared across teams — introduces a class of latency that has nothing to do with your code. It comes from contention: for CPU time, for memory bandwidth, for network queues, for disk I/O. These are resources that your workload competes for with processes it has no visibility into and no control over.

The result is what the paper calls "variability amplification." Even a single machine exhibiting transient slowness — a GC pause, a cache eviction storm, a background compaction job — introduces latency that propagates through the system in ways that are entirely disproportionate to the duration of the original event. A 50ms hiccup on a single shard becomes a 200ms tail for every request that happened to touch that shard during that window.



This is the fundamental problem. And the conventional response — "profile and optimize the slow path" — doesn't work, because the slowness is not in the code. It's in the environment. You cannot optimize away a garbage collector running on a machine you don't control, or a noisy neighbor saturating the memory bus two NUMA nodes away.


Hedged Requests

The first and most important pattern the paper describes is the hedged request. The idea is counterintuitive enough that it's worth stating plainly: rather than waiting for a slow server to respond, you send the same request to a second server after a short delay — and take whichever response arrives first.

The delay is critical. You don't want to double your load by default. Instead, you observe your system's typical P95 response time and use that as the hedge threshold. If a request hasn't completed within that window, you issue an identical request to a different replica and race them. The moment either one responds, you cancel the other.
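
As a sketch of the mechanics (a minimal Python asyncio version, where the replica coroutines are hypothetical stand-ins for real RPCs and `hedge_delay` is your measured P95), the race looks roughly like this:

```python
import asyncio

async def hedged_request(primary_call, hedge_call, hedge_delay):
    """Issue a hedge to a second replica if the primary exceeds the
    P95-derived delay, then race both and cancel the loser.
    `primary_call` and `hedge_call` are zero-argument coroutine functions
    (hypothetical stand-ins for real replica RPCs)."""
    primary = asyncio.create_task(primary_call())
    try:
        # Fast path: the primary answers within the hedge window.
        return await asyncio.wait_for(asyncio.shield(primary), hedge_delay)
    except asyncio.TimeoutError:
        # Slow path: past P95, so race an identical request on another replica.
        hedge = asyncio.create_task(hedge_call())
        done, pending = await asyncio.wait(
            {primary, hedge}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # the moment either responds, cancel the other
        return done.pop().result()
```

With a 50ms hedge window, most requests never trigger the slow path at all; a request stuck on a slow replica is answered by the hedge instead of waiting out the straggler.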



The practical effect is remarkable. Measurements at Google showed that hedging could reduce 99.9th percentile latency by an order of magnitude while increasing load by only a few percent — because most requests don't trigger the hedge at all, and those that do are precisely the ones stuck on slow replicas.

Hedged requests trade a small amount of extra load for a large reduction in tail latency. Load amplification is bounded by the fraction of requests that exceed your hedge threshold — which, by definition, is a small minority if you set the threshold near P95.

What makes this pattern powerful in shared environments specifically is that it sidesteps the cause of slowness entirely. You don't need to know why Replica 1 is slow. You don't need to detect it, alert on it, or drain it. You just race around it.

Tied Requests and Cancellation

Hedged requests have a subtle problem: if both replicas are fast, you've wasted work on both. The tied request pattern refines this by introducing coordination between the two requests. When you issue a hedge, you attach a "cancellation token" that the replicas share. Whichever replica starts processing the request first notifies the other to cancel, and proceeds alone.

This is particularly valuable when requests are expensive to process — when the work itself consumes significant CPU or I/O on the backend. Instead of duplicating work silently, tied requests minimize wasted computation by communicating intent across the request boundary.

The implementation requires some infrastructure: replicas need to be aware of each other's state for a given request, which typically means either a shared coordination layer or an out-of-band cancellation channel. In Google's architecture, this was handled via internal RPC cancellation propagation. In most systems, you can approximate it with request-scoped context cancellation — the Go context.Context model being a modern analogue of this idea.
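
In-process, the claim step can be approximated with a shared token. This is a toy sketch (a real system needs a cross-replica cancellation channel, which a `threading.Lock` obviously is not), but it shows the contract: exactly one of the tied pair does the expensive work.

```python
import threading

class CancellationToken:
    """Shared claim for one logical request; a hypothetical in-process
    stand-in for a cross-replica cancellation channel."""
    def __init__(self):
        self._lock = threading.Lock()
        self._claimed_by = None

    def try_claim(self, replica_id):
        # Exactly one replica wins the claim; the tied peer sees False
        # and abandons the duplicate work before it starts.
        with self._lock:
            if self._claimed_by is None:
                self._claimed_by = replica_id
            return self._claimed_by == replica_id

def tied_replica(replica_id, token, work, results):
    if not token.try_claim(replica_id):
        return  # peer already started processing: cancel the duplicate
    results.append((replica_id, work()))
```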


Micro-Partitioning and Fine-Grained Load Balancing

Both hedging and tying are reactive: they respond to latency after it has occurred. A complementary, proactive strategy is micro-partitioning — dividing work into far more partitions than you have machines, so that load imbalance can be corrected by reassigning small virtual partitions between machines rather than resharding the entire keyspace.

The intuition is straightforward. If you have 100 machines and partition your keyspace into 100 shards, a hot key on one shard means one machine is overloaded and there's nowhere to move it without a full reshard. If instead you have 10,000 virtual partitions distributed across 100 machines, a hot partition can be migrated to a less-loaded machine in seconds, with minimal disruption.
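
A minimal sketch of the principle, using a hypothetical `PartitionMap` with CRC32 key hashing in place of a real placement service:

```python
import zlib

class PartitionMap:
    """Many virtual partitions over few machines, so a hot partition can be
    reassigned without resharding the whole keyspace. Hypothetical sketch."""
    def __init__(self, n_partitions, machines):
        self.n_partitions = n_partitions
        # Round-robin initial placement of virtual partitions onto machines.
        self.assignment = {p: machines[p % len(machines)]
                           for p in range(n_partitions)}

    def partition_of(self, key):
        # Stable hash: a key always maps to the same virtual partition.
        return zlib.crc32(key.encode()) % self.n_partitions

    def machine_for(self, key):
        return self.assignment[self.partition_of(key)]

    def migrate(self, partition, target_machine):
        # Moving one partition touches 1/n_partitions of the keyspace,
        # not an entire machine's worth of state.
        self.assignment[partition] = target_machine
```

The point is the ratio: with 10,000 virtual partitions over 100 machines, migrating one hot partition moves roughly 0.01% of the keyspace.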



This is less a trick and more a structural principle: partition granularity determines your ability to respond to imbalance. Google's Bigtable uses this extensively — tablet splits are designed to be cheap precisely so that hot tablets can be redistributed across tablet servers without downtime.

Good Citizens and Background Throttling

In any shared environment, there are two classes of work: foreground requests with latency SLOs that users directly feel, and background work — compaction, replication sync, index rebuilds, garbage collection — that has no user-visible deadline but consumes the same physical resources. The conflict between these two classes is one of the most consistent sources of latency spikes in production systems.

The solution is conceptually simple: background tasks must be "good citizens." They should yield CPU and I/O to foreground work when demand is detected. In practice, this means implementing throttle mechanisms that observe system load indicators — request queue depth, disk I/O wait, CPU steal — and automatically back off when those indicators cross a threshold.
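
One way such a throttle might look, assuming request queue depth is the load indicator and using a multiplicative-increase, additive-decrease backoff (the specific constants here are illustrative, not prescriptive):

```python
class BackgroundThrottle:
    """Back off background work when a foreground load indicator crosses a
    threshold; reclaim cycles during slack periods. Illustrative sketch."""
    def __init__(self, queue_depth_limit=10, max_delay=5.0):
        self.queue_depth_limit = queue_depth_limit
        self.max_delay = max_delay
        self.delay = 0.0  # backoff (seconds) applied between work units

    def observe(self, foreground_queue_depth):
        if foreground_queue_depth > self.queue_depth_limit:
            # Foreground is queuing: back off multiplicatively.
            self.delay = min(self.max_delay, max(0.1, self.delay * 2))
        else:
            # Slack period: decay the backoff additively and earn cycles back.
            self.delay = max(0.0, self.delay - 0.1)
        return self.delay
```

A background job would call `observe()` between work units and sleep for the returned delay, so its aggressiveness tracks foreground demand in real time.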

Google's approach to this problem includes priority queues in their RPC layer, where foreground traffic can preempt background work mid-execution. Bigtable's compaction scheduler monitors foreground request rates and adjusts compaction aggressiveness in real time. The principle is that background jobs should "earn" their CPU time during slack periods, not consume it as a fixed entitlement.

What's important here is that this isn't optional in a shared environment — it's a contract. If your background jobs don't throttle, you are imposing your latency cost on every other workload sharing your infrastructure. In large organizations, this becomes a coordination problem: the team running the nightly reindex doesn't know which other team's latency SLO they're violating at 2am.

Selective Replication of Hot Data

The patterns above all treat slowness as something to route around or absorb. This final pattern takes a different approach: eliminate the bottleneck entirely for the data that matters most.

In most systems, data access follows a power-law distribution. A small number of keys — a viral post, a high-traffic configuration value, a globally shared counter — account for a disproportionate fraction of reads. These hot items are precisely the ones most likely to create queuing delays, cache evictions, and server-level saturation.

The solution is selective, on-the-fly replication of hot items. Rather than replicating everything uniformly, the system detects hot keys — through access frequency monitoring or explicit client hints — and creates additional in-memory replicas across multiple servers. Reads are then distributed across those replicas, reducing per-server load for the items that need it most.
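
A rough sketch of the detection-plus-replication loop, with a made-up `HotKeyReplicator` and a simple access counter standing in for real frequency monitoring:

```python
import zlib
from collections import Counter

class HotKeyReplicator:
    """Widen the replica set for keys whose observed access frequency
    crosses a hotness threshold. Hypothetical sketch."""
    def __init__(self, servers, hot_threshold=1000, extra_replicas=3):
        self.servers = servers
        self.hot_threshold = hot_threshold
        self.extra_replicas = extra_replicas
        self.counts = Counter()

    def replicas_for(self, key):
        self.counts[key] += 1
        home = zlib.crc32(key.encode()) % len(self.servers)
        if self.counts[key] < self.hot_threshold:
            return [self.servers[home]]  # cold key: single home replica
        # Hot key: fan reads out across additional servers too.
        width = min(1 + self.extra_replicas, len(self.servers))
        return [self.servers[(home + i) % len(self.servers)]
                for i in range(width)]
```

A read path would pick one server at random from the returned list, so a hot key's load spreads across its widened replica set while cold keys stay cheap.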



This pattern is now standard in systems like Memcached (Facebook's lease mechanism was a direct response to this problem), Redis Cluster, and modern distributed caches. The underlying principle — don't treat all data as equally hot, and adapt replication depth to observed access patterns — generalizes far beyond caching.


Latency-Aware Load Balancing

Round-robin load balancing assumes that all backend servers are equivalent and equally available. In a shared environment, this assumption fails constantly. A server experiencing memory pressure or CPU saturation will accept requests at the same rate as a healthy one — queuing them invisibly while the client believes load is distributed evenly. The result is that round-robin actively routes traffic into latency holes it cannot see.

Latency-aware load balancing corrects this by making routing decisions based on observed response times rather than theoretical capacity. The client maintains a rolling measurement of each backend's recent latency and biases requests toward the faster ones. The simplest version is the "power of two choices" algorithm: rather than picking randomly from all backends, pick two at random and route to whichever has the lower current latency. The probabilistic gain is disproportionate to the cost — two random samples are enough to avoid the worst servers most of the time.
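
A minimal version of the algorithm, assuming each client keeps a per-backend exponentially weighted moving average of observed latencies (the names and the smoothing constant are illustrative):

```python
import random

def pick_backend(latency_ewma, rng=random):
    """Power of two choices over client-local latency estimates:
    sample two backends, route to the one that currently looks faster."""
    a, b = rng.sample(list(latency_ewma), 2)
    return a if latency_ewma[a] <= latency_ewma[b] else b

def record_latency(latency_ewma, backend, observed_ms, alpha=0.2):
    # Rolling per-backend estimate via an exponentially weighted moving
    # average; slow responses raise the estimate, steering traffic away.
    latency_ewma[backend] = ((1 - alpha) * latency_ewma[backend]
                             + alpha * observed_ms)
```

Note that a degraded backend can only win a two-way race against another degraded backend, which is why two samples are enough to avoid the worst servers most of the time.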



The elegance of this approach is that it requires no central coordinator and no global view of server health. Each client maintains its own local latency measurements independently. The collective effect of many clients doing this converges on a system-wide load distribution that naturally isolates slow servers — without any explicit health-checking infrastructure.

Google's gRPC and Envoy proxy both implement variants of this. Netflix's Ribbon client-side load balancer added latency-based weighting as a core feature after observing that round-robin was systematically directing traffic into degraded nodes during partial cluster failures.

Request Deadline Propagation

Every distributed request has an implicit budget: the maximum time the user is willing to wait before the response becomes useless. A search result that arrives after the user has navigated away is not a slow success — it is a waste of resources that could have been spent on a fresher request. Yet most systems treat their internal RPC calls as if they exist outside of time, with no awareness of how much of the outer deadline has already been consumed.

Deadline propagation makes the remaining time budget explicit and transmits it across every service boundary. When a frontend handler receives a request with a 200ms SLO and spends 30ms doing authentication, the downstream RPC it issues should carry a deadline of 170ms — not an unconstrained call that could block for seconds. Each hop in the call graph receives a shrinking time window, and each service is expected to abandon work and return an error rather than continuing once that window closes.
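
The same contract can be sketched in Python with an explicit deadline object threaded through the call graph (a simplified analogue of Go's context; the handler names here are hypothetical):

```python
import time

class Deadline:
    """A shrinking time budget, passed explicitly down the call graph."""
    def __init__(self, budget_seconds):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

    def check(self):
        # Abandon work and return an error rather than finishing an
        # answer that will never be seen.
        if self.remaining() <= 0:
            raise TimeoutError("deadline exceeded")

def handle_request(deadline):
    deadline.check()
    authenticate(deadline)          # consumes part of the shared budget
    return fetch_backend(deadline)  # inherits whatever budget remains

def authenticate(deadline):
    deadline.check()
    time.sleep(0.01)  # hypothetical auth cost

def fetch_backend(deadline):
    deadline.check()
    # A real RPC layer would encode deadline.remaining() on the outgoing call.
    return "ok"
```

Each hop checks the same shrinking window instead of carrying its own independently configured timeout, which is exactly the invariant described above.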

Deadline propagation transforms timeout from a per-hop configuration into a system-wide invariant. Instead of each service having its own independently configured timeout — which can add up to far more than the user's actual patience — the deadline flows through the entire call graph as a shared, decrementing constraint.

Without deadline propagation, a slow backend continues burning CPU on a request whose answer will never be seen. The frontend has already returned an error to the user, but the downstream services don't know this — they keep working, consuming resources that could serve other requests. With deadline propagation, a cancelled frontend request immediately cancels the entire downstream tree. The work stops the moment it becomes irrelevant.



Go's context.Context is the most widely adopted implementation of this idea in modern systems. Passing a context with a deadline through every function call is the idiomatic Go way of expressing exactly this contract. The Dapper tracing system and gRPC's deadline mechanism implement the same principle at the RPC layer.


Probabilistic Early Completion

There is a class of read-heavy workload where the question "what is the correct answer?" is less important than "what is a good enough answer, returned quickly?" Search ranking, recommendation feeds, autocomplete suggestions, approximate analytics — in each of these, the value of the response degrades gradually with quality, not catastrophically. A slightly stale recommendation list is almost as useful as a fresh one. A search result that includes 98% of relevant documents is indistinguishable from one with 100%, from the user's perspective.

Probabilistic early completion exploits this tolerance by allowing a request to return as soon as it has gathered "enough" signal, rather than waiting for every shard to respond. The coordinator tracks how many responses have arrived and, once a statistically sufficient fraction of shards have replied, returns the aggregated result rather than waiting for the stragglers. The remaining responses, when they eventually arrive, are discarded.

The fraction required is a tunable parameter that encodes the application's quality-vs-latency tradeoff. Setting it at 90% means the request finishes when 9 of 10 shards have responded — the one slow shard no longer determines the outcome. The quality loss is bounded by the fraction omitted, and in practice for approximate workloads the loss is negligible while the latency gain is substantial.
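
The coordinator logic is compact. A sketch in Python's asyncio, where the shard calls are hypothetical coroutines standing in for shard RPCs:

```python
import asyncio

async def query_shards(shard_calls, quorum_fraction=0.9):
    """Return once `quorum_fraction` of shards have replied; cancel
    stragglers. `shard_calls` are zero-argument coroutine functions
    (hypothetical stand-ins for shard RPCs)."""
    needed = max(1, int(len(shard_calls) * quorum_fraction))
    tasks = [asyncio.create_task(call()) for call in shard_calls]
    results = []
    for fut in asyncio.as_completed(tasks):
        results.append(await fut)
        if len(results) >= needed:
            break  # statistically sufficient: stop waiting on the tail
    for task in tasks:
        task.cancel()  # stragglers' answers would be discarded anyway
    return results
```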

Probabilistic early completion only makes sense when partial results are semantically valid. It is appropriate for search, recommendations, aggregated metrics, and autocomplete. It is inappropriate for financial transactions, inventory updates, authentication checks, or any computation where partial data produces incorrect rather than merely approximate output.

Overload Admission Control

All of the patterns discussed so far are concerned with how an individual request navigates a slow or overloaded system. This one operates at a different level: preventing the system from accepting more work than it can complete within latency bounds in the first place.

The counterintuitive observation is that queueing is not a buffer — it is a latency amplifier. When a service is operating at capacity and accepts additional requests into a queue, those requests do not get served "slightly later." They get served much later, because every subsequent request must wait behind everything already in the queue. The 99th percentile latency of a system at 95% utilization can be ten times worse than the same system at 80% utilization, even though the throughput difference is modest.



Admission control accepts this reality and acts on it. Rather than allowing the queue to grow unbounded during traffic spikes, the system measures current utilization — active request count, queue depth, recent latency percentiles — and explicitly rejects incoming requests when those indicators cross a threshold. The rejected requests receive an immediate error rather than a delayed one. From the client's perspective, a fast rejection is often preferable to a timeout: it can retry against a different backend, fail fast, or serve from cache, rather than hanging indefinitely.
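
The core mechanism fits in a few lines. A sketch using in-flight request count as the utilization signal (real systems typically combine several indicators, as noted above):

```python
class AdmissionController:
    """Reject excess work immediately instead of queuing it into a
    latency hole. Illustrative sketch."""
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self):
        # Past the threshold, a fast "no" beats a slow timeout: the
        # client can retry elsewhere, fail fast, or fall back to cache.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1
```

A request handler would call `try_admit()` on arrival, return an immediate error on `False`, and `release()` when the request completes.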

Google's internal systems use a technique called "client-side throttling" where the client itself participates in admission control: it tracks its own recent reject rate and probabilistically drops requests before sending them, reducing load on an already-stressed backend without requiring the backend to process and reject each request individually. Netflix's Concurrency Limits library implements a similar adaptive mechanism based on TCP congestion control algorithms — treating the request queue like a network pipe and backing off as soon as it detects queuing delay increasing.
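
Client-side throttling can be sketched directly from that description. The rejection probability max(0, (requests - K*accepts) / (requests + 1)) follows the formulation published in Google's SRE book; K tunes how many wasted backend rejections you tolerate before the client starts dropping locally:

```python
import random

class ClientThrottle:
    """Client-side adaptive throttling: probabilistically drop requests
    before sending them, based on the recent accept ratio. Sketch of the
    formulation from Google's SRE book."""
    def __init__(self, k=2.0):
        self.k = k
        self.requests = 0  # requests attempted recently
        self.accepts = 0   # requests the backend accepted

    def should_send(self, rng=random):
        p_reject = max(0.0, (self.requests - self.k * self.accepts)
                             / (self.requests + 1))
        return rng.random() >= p_reject

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1
```

While the backend accepts everything, the rejection probability stays at zero; as the accept ratio falls below 1/K, the client sheds load without the backend having to process each rejection.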


Latency SLO Budgeting Across Teams

The patterns above are all technical. This last one is organizational, but its absence makes every technical pattern less effective.

In a large engineering organization, a single user-facing latency SLO — say, P99 < 300ms — is actually a composite of dozens of internal service SLOs. The frontend has a budget. The auth service has a budget. The ranking service has a budget. The storage layer has a budget. When these budgets are implicit, undocumented, or uncoordinated, teams make local decisions that are individually reasonable but collectively catastrophic. The auth team tightens its internal retry logic, adding 20ms to every call. The indexing team adds a synchronous cache warm-up step. Neither change violates any documented contract, and neither team knows what the other did. The cumulative effect is a P99 regression that shows up in the frontend SLO and takes weeks to attribute.

Latency budgeting makes these implicit contracts explicit. Each service in a call graph is assigned a latency budget — its maximum allowed contribution to the end-to-end P99 — derived from the top-level SLO. Changes that affect that budget require coordination across the services that share the call path. The budget is measured, reported, and treated as a first-class engineering constraint, like memory or CPU quota.

This is less a distributed systems pattern and more a systems-thinking pattern. Latency is a shared resource in the same way that bandwidth or storage is a shared resource. The only difference is that it is invisible until the moment it fails, at which point attribution is painful and slow. Making latency budgets explicit — even approximately — transforms latency from an emergent surprise into a managed constraint.

Embracing Stochastic Reality

What's important about these patterns, taken together, is what they have in common. None of them attempt to eliminate variability from the system. None of them assume that the environment can be made deterministic, or that every machine can be made equally fast, or that background noise can be suppressed. They all start from the premise that variability is irreducible — that shared environments will always produce straggler events — and design around it rather than against it.

"You cannot engineer your way to a deterministic distributed system. You can only engineer your way to one that degrades gracefully in the face of guaranteed non-determinism."

The old engineering intuition was that good infrastructure means predictable infrastructure — every component behaves the same way every time. The new intuition is that good infrastructure means resilient infrastructure — every request completes within acceptable bounds, regardless of what any individual component is doing.

Each pattern acknowledges that something will go wrong and designs so that something going wrong in one place cannot become everything going wrong everywhere.

The question worth sitting with, for any system you're currently building, is not "what is our average latency?" It is: "when something goes wrong on one machine, where does that pain go?" If the answer is "it propagates to every user touching that partition," you have a tail latency problem, and the patterns above are where the solution starts.

Further reading: The ideas in this post are drawn from The Tail at Scale — Communications of the ACM, February 2013. Worth reading in full if any of this resonated.

Nobody Bought It For You

 Why the AI your company just mandated was designed for a boardroom slide, not your workflow — and how the whole machine keeps spinning.

There's a moment every knowledge worker knows. You open the new tool your company has mandated — the AI assistant, the workflow platform, the "intelligent" something-or-other — and within thirty seconds you understand that nobody who bought this has ever used it. Not once. Not even to check.

This is not an accident.





The Buyer Problem

The people who buy software and the people who use software are, in most companies, completely different people with completely different problems. 

The buyer's problems are:

- How do I look like I'm modernizing?
- How do I get the board off my back about AI?
- How do I justify my budget?

These are legitimate career problems. They just have nothing to do with whether the tool is good.

So when a B2B AI vendor walks into a boardroom, they are not selling a product. They are selling a story that the executive can then retell upward. "We've implemented an AI strategy." 

The software is almost incidental — it's a prop in a presentation that hasn't been written yet.

This is why enterprise AI demos are always flawless. They're not showing you the product. They're giving you the slide.

The Adoption Gap

The word "adoption" exists to handle the gap between purchase and reality. When a company buys an AI tool and nobody uses it, the vendor does not say "nobody is using this." They say adoption is a journey. 

They offer change management resources. They suggest lunch-and-learns. 

Adoption is the word that lets everyone pretend the gap between what was promised and what is happening is a people problem rather than a product problem.

And the buyer accepts this framing enthusiastically, because the alternative is admitting they spent several hundred thousand dollars on a demo.

Ask the vendor to show you an everyday user — not a champion, not an executive sponsor — who would be visibly annoyed if the tool was taken away tomorrow. Most can't. They'll show you a champion who championed it, an executive who approved it, a case study from a company you can't contact.


What "It Works" Means

Here is what "it works" means when a CTO says it about an AI product: it does not crash during the board presentation. That's largely it. Whether it saves anyone time, whether the outputs are accurate, whether the people supposedly using it have quietly found workarounds — none of that is tracked with any rigor, because tracking it rigorously creates accountability, and accountability is the enemy of momentum.

This is why AI vendors love talking about time saved. Not measured time. Surveyed time. They ask employees "do you feel like you save time?" after a three-week rollout, and employees — who have correctly identified that this tool is their manager's priority — say yes. 

This number goes in the case study. The case study goes on the website. The website convinces the next buyer.


Built for the Wrong Audience

The smart vendors learned early that the sale is not to the user, it's to the person one or two levels above the user. So they built products optimized for that audience. Beautiful dashboards showing utilization metrics. Reporting features that surface "AI activity" in a way that looks great in a quarterly review. Integrations with the tools executives actually look at.

The product for the user is often an afterthought dressed up in good fonts. This is why so many AI writing tools produce text that is technically fluent and completely hollow. 

The executive doesn't read the output — they read the metric that says outputs were generated. 

The number goes up. This is, within the logic of the system, a success.

Not New, Just Louder

None of this is unique to AI, of course. It's the standard lifecycle of any enterprise software category in its gold rush phase. Collaboration tools, digital transformation platforms, data analytics suites — they all went through the same arc. The vocabulary changes but the structure is identical: a buzzword creates board-level anxiety, vendors rush in to sell relief from that anxiety, and the actual workers are handed something that was built for a pitch deck.

What's different with AI is the stakes feel higher, so the anxiety is more acute, so the purchasing is more frantic, so the gap between promise and reality is wider, so the rationalizations have to be more elaborate. The lunch-and-learns are longer. The change management consultants are more expensive. The case studies are more breathless.

That's the product. That's what was bought. Not for you — for the story about you.

Wednesday, 25 February 2026

Mechanical Sympathy 2.0: From Software Tuning to Model-as-Silicon

A Toronto startup called Taalas hardwired an LLM into transistors and got 16,000 tokens per second. That number sounds like a benchmark. It's actually a paradigm shift hiding in plain sight.




Every decade or so, the computing industry reaches a point where it keeps solving the wrong problem. In the 1990s, the industry optimized instruction pipelines endlessly while memory latency quietly became the actual bottleneck — what computer architects called the "memory wall." The solution wasn't a faster CPU. It was a different architectural philosophy: caches, NUMA awareness, locality of reference.

We are building more powerful general-purpose accelerators for an increasingly specific workload, while the actual barriers — latency and cost per inference — remain stubbornly high. A startup called Taalas just walked through a door that everyone else assumed was locked.

Their idea sounds almost offensive in its simplicity. Instead of building a better computer to run AI models, they asked: what if the model itself became the computer? Not metaphorically. Literally. They etched the weights of Llama 3.1-8B directly into silicon — one weight, one multiply, one transistor. The result is a chip that does exactly one thing and does it at 16,000 tokens per second per user. That's not a 2× improvement. It's an order of magnitude beyond what Nvidia, Cerebras, and Groq can achieve on the same model.

The Abstraction Tax We Stopped Noticing

To understand why this matters, consider what happens every time a GPU runs inference. You have a general-purpose parallel compute engine. On top of that sits CUDA. On top of that, a deep learning framework. On top of that, a model serving system. On top of that, the model itself — with weights loaded from High Bandwidth Memory that sits physically separated from the compute units, connected by a bandwidth-constrained bus. Every layer of that stack has a cost: power, latency, engineering complexity. The HBM stacked on modern AI chips consumes significant power just shuttling weights back and forth. The chip doesn't know it's running a transformer. It learns this at runtime, through software.




Taalas eliminated the entire middle of that stack. Their HC1 chip — built on TSMC's N6 process at 815mm² — stores model weights in the transistors themselves using a mask ROM fabric. Compute and memory collapse into the same physical location. The von Neumann bottleneck, the memory wall that has haunted computer architects for forty years, simply doesn't exist. There is no bus to saturate. There is no data to move. The multiply happens where the weight lives.

What's striking is not just the performance number, but the power consumption story. Ten HC1 chips running continuous inference consume 2.5 kilowatts. An equivalent GPU setup for the same throughput would demand significantly more power and require liquid cooling, custom packaging, and HBM stacks. Taalas runs in standard air-cooled racks. If this scales, it doesn't just change AI economics — it changes where AI can physically run.

The Flexibility-Performance Corner Nobody Explored

The obvious objection is flexibility. An HC1 chip runs exactly one model: Llama 3.1-8B. Update the model, retape the chip. In a field where frontier models are replaced every few months, betting on dedicated silicon seems reckless. This is exactly why nobody went down this path before. The assumption — reasonable until recently — was that AI was changing so fast that any specialized hardware would be obsolete before it paid for itself.

"Nobody went into this corner because everybody felt AI was changing so rapidly that it would be a massively risky thing to do. But we wanted to see what's hiding in that corner."
— Ljubisa Bajic, CEO, Taalas

But Taalas found something in that corner. Two things changed that make their bet less reckless than it appears. First, a growing subset of model families — the Llamas, the DeepSeeks, the Qwens — are stabilizing into production workhorses. Enterprises aren't running the frontier model of the week. They're running fine-tuned versions of models that are already 6–12 months old, because that's what their workflows are validated against. Second, Taalas' retaping cycle is two months, not two years. They only customize two metal layers on an otherwise fixed chip — borrowing an idea from structured ASICs of the early 2000s. The base chip is permanent; only the weight layer changes. Order a chip for your deployment window, run it until the model evolves, retape. If the cost per inference drops by 1,600×, you can absorb a faster hardware refresh cycle and still come out far ahead.




What Does Sub-Millisecond Inference Unlock?

Here is where it gets interesting — and where most analysis of Taalas misses the bigger story. The coverage tends to frame this as "Nvidia competitor" or "cheaper inference." Both are true but both are underselling it. The more important question is: what categories of software become possible when inference is effectively free and instantaneous?

Think about agentic AI systems the way we think about database transactions. Today, every call to an LLM is expensive enough that you architect your system to minimize them — a prompt here, a structured output there, careful chain design. It's the equivalent of designing around the cost of disk I/O in the 1980s. Every application decision was shaped by that constraint. When memory got cheap enough, the constraint dissolved, and entire new software paradigms emerged. In-memory databases. Real-time analytics. Applications that would have been unthinkable when you had to plan every memory access became trivial. Sub-millisecond, near-zero-cost inference does the same thing for AI-native applications.

A coding agent that can spawn 100 parallel reasoning threads to explore different implementation approaches — and complete all of them in the time a single GPU call takes today — is not just a faster version of Copilot. It's a different class of tool. Voice interfaces that feel genuinely instantaneous rather than simulated-fast-typing change the interaction model entirely. IoT devices that run inference locally, on-chip, without cloud round-trips, enable entirely new application categories: real-time translation in earbuds, continuous monitoring in industrial settings, robotic perception loops that don't wait for a network packet.



The Architecture of Deployment Is About to Flip

There is a deeper structural implication here that I haven't seen discussed elsewhere. The current AI deployment model is highly centralized. You train at hyperscale data centers, you serve from hyperscale data centers, and latency is a tax you pay for accessing that centralized intelligence. This isn't a choice — it's physics. GPU clusters consume hundreds of kilowatts. You run them where power is cheap and cooling is achievable. Everything else connects via API.

Ten Taalas HC1 chips at 2.5 kilowatts fit in a standard rack. Not a special power-zone rack. Not a liquid-cooled custom installation. A standard rack. Scale this to their second-generation silicon and frontier models, and suddenly the economics of edge inference look very different. A hospital running inference on-premise. A factory running quality control loops locally. A telecom running inference at the edge of the network. None of these require a supercomputer. They require a box that draws a few kilowatts and delivers sub-millisecond responses.

The historical parallel that comes to mind is the minicomputer revolution. In the 1960s, computing was centralized by necessity — mainframes were expensive and power-hungry, and only institutions could afford them. The minicomputer didn't just make computing cheaper. It redistributed computing into departments, into labs, into engineering teams that previously had to submit batch jobs and wait. The same shift happened again with workstations, again with PCs, again with smartphones. Each wave moved intelligence closer to the point of use, and each wave unlocked applications that were inconceivable at the previous scale. Taalas, if their roadmap holds, is proposing that AI inference can make that same journey — from hyperscale data center to edge server to eventually embedded device.

What the Risk Profile Actually Looks Like

The hardwired approach carries genuine risks that deserve an honest look. The model specificity is not a minor caveat — it's the central bet. If the Llama family fades and a new architecture dominates, chips hardwired for the old model have limited residual value. The two-month retaping cycle is fast by traditional ASIC standards, but in an AI field where significant model releases happen monthly, it still represents a lag. There's also the question of whether Taalas' approach scales to frontier models. The HC1 runs an 8B parameter model — valuable for production workloads, but well below the frontier. Their second-generation silicon targets a mid-size reasoning model, and frontier capability is planned for later in 2026. That progression is the one to watch.

There's also a market dynamics question. Cloud providers don't necessarily want their customers achieving this kind of cost reduction on inference. Lower inference costs are great for consumers but threaten the economics of API businesses. Whether hyperscalers will adopt Taalas chips, build competing specialized silicon, or simply let GPU clusters continue to dominate through inertia — that's an open question with real strategic stakes.

And yet, the technical result is hard to argue with. 16,000 tokens per second per user. $0.0075 per million tokens. 250 watts per chip. These aren't paper benchmarks — the chip exists, developers can apply for access today. A Toronto company with 25 employees and $219 million raised has produced results that make the GPU stack look architecturally mismatched for this workload.
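Those figures are internally consistent enough to check on the back of an envelope. A quick sanity check in Python — the throughput and power numbers are from the announcement, while the electricity price is my own illustrative assumption:

```python
# Figures from the announcement; the grid price is an assumption for illustration.
tokens_per_sec = 16_000       # per user
watts_per_chip = 250          # per chip
usd_per_kwh = 0.10            # assumed electricity price

seconds = 1_000_000 / tokens_per_sec            # 62.5 s per million tokens
kwh = watts_per_chip * seconds / 3_600_000      # ~0.0043 kWh per million tokens
energy_usd = kwh * usd_per_kwh                  # ~$0.0004 in energy

print(f"{seconds:.1f} s, {kwh:.4f} kWh, ${energy_usd:.5f} per million tokens")
```

Under that assumed price, raw energy is only around 6% of the quoted $0.0075 per million tokens — leaving room for silicon amortization and margin, which is what you'd expect if the number is real rather than a loss-leader.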

The Lesson Worth Carrying Forward

The lesson I take from Taalas isn't about AI chips specifically. It's about the value of inhabiting the corners of solution spaces that everyone else has deemed too risky to explore. The GPU path is rational. General-purpose compute is flexible, quickly amortized, and continuously improved by massive R&D budgets. The structured ASIC path looks irrational until you do the physics carefully enough to see that the entire software stack you're preserving with all that flexibility is itself the bottleneck. Taalas didn't find a new physics. They found a corner in the design space where the tradeoffs that seemed unacceptable from the outside look entirely acceptable from the inside — because the gains are large enough to absorb the rigidity.

For software engineers building AI-powered systems today, the practical implication is this: the inference cost model you're designing around right now is not a law of nature. It's an artifact of the current hardware generation. If Taalas' approach — or the pressure it creates on incumbent vendors — succeeds in driving inference costs down by one or two orders of magnitude, the right architectural choices for AI-native applications will look completely different in 18 months. The applications that seem economically impossible today — agents that think in parallel, voice interactions that feel truly instant, intelligence embedded in every device — are not science fiction. They're just waiting for the infrastructure to catch up with the ambition.

The model, it turns out, can become the machine. That changes more than inference costs.

Tuesday, 17 February 2026

No One Builds a Search Engine in a Weekend

A solo developer spent a weekend building an AI agent. Two million people used it within weeks. OpenAI and Meta immediately came knocking. Try imagining this story with Google Search. You can't. That's the entire problem with the AI lab business model.

Here is a thought experiment. Imagine a developer spends a weekend building a new search engine. It gets 196,000 GitHub stars. Two million people use it every week. Google sends an acquisition offer within the month. Impossible, right? The infrastructure alone — the crawlers, the index spanning hundreds of billions of pages, the query-serving infrastructure that returns results in under 200 milliseconds at global scale — takes years and billions of dollars to assemble. A weekend project cannot replicate it. The moat is structural, physical, and time-locked.

Now run the same thought experiment with the App Store. A developer can build an app that sits on top of the App Store. They cannot build a replacement App Store in a weekend. The payment rails, the developer trust relationships, the OS-level integration, the review infrastructure — none of this is replicable. Apple's moat is not the quality of any individual app. It is the platform that makes apps possible at all.

Peter Steinberger spent a weekend in November 2025 building OpenClaw — an AI agent framework that could control your computer, browse the web, run shell commands, manage your email, and post to social platforms autonomously. Within weeks it had 196,000 GitHub stars and 2 million weekly users. Both Meta and OpenAI sent acquisition offers. OpenAI won the acqui-hire. Steinberger is now inside Sam Altman's operation, tasked with building the next generation of personal agents.

The gap between those two thought experiments is the entire story of why AI labs, for all their astronomical valuations, are operating on sand rather than bedrock.



What Made Google and Apple Unassailable

Google's search moat has three layers that compound on each other. First is the index — years of crawling the web, storing and ranking hundreds of billions of documents, building the infrastructure that makes real-time query response possible at global scale. Second is the feedback loop — two decades of user query data that trained ranking algorithms no competitor can replicate from scratch. Third is distribution — default search agreements with browser makers and device manufacturers that cost Google approximately $26 billion in 2021 alone, just to maintain the default position. A weekend developer cannot replicate any one of these layers, let alone all three simultaneously. The moat is not one wall; it is three walls reinforcing each other.

Apple's App Store moat is different but equally structural. It is not the quality of Apple's own apps — it is the OS-level trust relationship with the device. Every app on an iPhone exists inside Apple's permission system. Developers build on Apple's infrastructure, follow Apple's rules, pay Apple's cut, and cannot distribute outside Apple's channel without jailbreaking the device. The moat is not about any particular capability. It is about controlling the ground on which all capabilities are built.

Now look at what Steinberger actually built. OpenClaw is an interface layer — a framework for issuing instructions to AI models and executing the outputs. It required no proprietary infrastructure. It required no exclusive data. It required only OpenAI's and Anthropic's API keys, which any developer can obtain in minutes. The entire product sat on top of infrastructure that the AI labs themselves made openly available, then immediately disrupted the market position those same labs were trying to establish. Steinberger did not build a moat. He exposed the absence of one.

Why Anthropic's Reaction Revealed Everything

When OpenClaw was still named ClawdBot — a name chosen to ride the momentum of Claude, the Anthropic model that many developers were using to power it — Anthropic's response was to threaten legal action over the name. This forced Steinberger to rename the project twice, eventually landing on OpenClaw after checking with Sam Altman that the name was acceptable.

Read that sequence again carefully. A solo developer builds the most viral open-source AI agent framework of late 2025, powered substantially by Anthropic's own Claude model, and Anthropic's first move is to send a cease-and-desist letter about a name.

The name threat was not really about trademark law. It was about Claude Code. Anthropic had spent significant resources building Claude Code as its flagship agent-developer product — the agentic interface that would cement Claude's relationship with the engineering community. OpenClaw, running on Claude's API, was demonstrating better viral product dynamics than Claude Code's official launch. ClawdBot's very name threatened to create confusion in exactly the market segment Anthropic was trying to own: developers building with agentic AI. Anthropic looked at a solo developer capturing their intended market and reached for a lawyer instead of a product manager.

When the most viral agent experience is built on your model and you respond with a trademark letter, you have revealed that you believe your moat is your brand — not your technology, not your distribution, not your platform. That is a very thin moat.

Google does not threaten developers who build search-adjacent products. It doesn't need to. No search-adjacent product has ever threatened to replace Google Search because the infrastructure required to replace it doesn't fit in a weekend project. When your competitive position is genuinely structural, you don't respond to open-source alternatives with legal letters. You respond by noting that the alternative needs your infrastructure to function and cannot survive without it. Anthropic could not make that response. The agent ran fine without Anthropic's blessing — it just needed the API key.

The Specific Thing AI Labs Cannot Build

Every AI lab in 2026 will tell you their moat is their model. Benchmark performance, the training runs that cost hundreds of millions of dollars, the research teams producing capabilities no open-source alternative has yet matched. This argument has surface plausibility and a fatal flaw.

The flaw is that OpenClaw was explicitly model-agnostic. It ran on Claude, GPT-5, Gemini, Grok, and local models via Ollama. The most viral agent interface of early 2026 was architected from day one to treat every frontier model as a commodity interchangeable with every other. Steinberger himself committed to keeping it model-agnostic even after joining OpenAI. If the product that captured 2 million weekly users doesn't care which model it runs on, what is the model moat actually protecting?

Structural Comparison

Google built a search product that requires years, billions, and global infrastructure to replicate. Apple built a distribution platform that requires OS-level trust to compete with. OpenAI and Anthropic built frontier models, then watched a developer spend a weekend building the interface layer that users actually wanted — using their APIs — and had to acquire or threaten him.

The difference is not capability. It is whether the moat lives in the product or in the infrastructure beneath the product.

Google and Apple are not threatened by weekend projects because their moats are below the application layer. The search index is below any search interface. The App Store payment rail is below any app. Whatever you build on top cannot replace what is underneath. AI labs have the opposite problem: their most defensible asset — the frontier model — is exposed at the API level to anyone with a credit card. Everything built on top of that API, every interface layer, every agent framework, every product that users actually interact with, is up for grabs every weekend.

What a Real AI Moat Would Look Like

This is not an argument that AI labs are worthless or that the frontier model is irrelevant. It is an argument about what kind of moat is durable versus what kind evaporates the moment a motivated developer has a good weekend.

A durable AI moat would look like Google's: infrastructure that is physically impossible to replicate quickly. The Stargate project — OpenAI's $500 billion joint venture with Oracle and SoftBank to build dedicated AI infrastructure — is a bet in this direction.

If running capable agents at mass scale requires compute infrastructure only a handful of players can afford to build, then the compute becomes the moat the way the search index is Google's moat. But this is an infrastructure bet, not a model bet. OpenAI is effectively betting that the future of AI advantage looks more like owning a power grid than owning a better algorithm.

A durable AI moat would also look like Apple's: owning the OS-level relationship with the device, such that no agent framework can operate without your permission. Microsoft comes closest to this with Windows and the enterprise stack. Google has it with Android. Apple has it most completely with iOS. 

AI labs that sit inside these platforms — OpenAI's ChatGPT integration with Apple Intelligence, Anthropic's enterprise agreements — are paying for distribution access rather than building it. They are tenants in someone else's moat.

What is conspicuously absent from every major AI lab's current strategy is the thing that made Google and Apple truly unassailable: a proprietary feedback loop that improves with use and cannot be transferred to a competitor. 

Google's search gets better with every query because the query data belongs to Google. Apple's App Store gets stronger with every app because developer relationships belong to Apple's ecosystem. 

Every time someone uses ChatGPT or Claude, the interaction data could theoretically compound into better models — but the API-first distribution model means that a large portion of actual usage happens through third-party interfaces, with the data relationship owned ambiguously or not at all. 

Steinberger's 2 million weekly OpenClaw users were generating interaction data that told you something profound about how humans actually want to use agents. That data lived with OpenClaw, not with the model providers whose APIs were processing the requests.

Conclusion

The OpenClaw acquisition is not primarily a story about a talented developer getting a well-deserved outcome. It is a story about what happens when the product layer of a technology platform is structurally undefended.

Peter Steinberger could build OpenClaw in a weekend because the infrastructure he needed was all openly available, cheaply accessible, and deliberately designed to be used by anyone. 

The labs built it that way intentionally — API-first distribution was the fastest path to revenue and adoption. But API-first distribution is also moat-last distribution. Every interface you don't control is an OpenClaw waiting to happen.

Google has never had to acquire a weekend search project because no weekend search project could threaten Google Search. The index is not for sale. The feedback loop is not accessible. The distribution agreements are not replicable. The moat is below the level where weekend projects operate.

AI labs have built their products at the level where weekend projects operate. That is, right now, their most significant strategic vulnerability — and no acquisition, however well-timed, changes the underlying architecture.

Steinberger asked Sam Altman whether naming the project "OpenClaw" was acceptable. Altman said yes. 

The most revealing detail in this entire story is not that OpenAI acquired the project. It is that the founder of the project felt he needed to ask the CEO of OpenAI for naming permission, and got it, and still had 2 million weekly users and full negotiating leverage with both Meta and OpenAI.

That is what the absence of a structural moat looks like in practice: you are powerful enough to threaten the biggest AI company in the world from a weekend project, and polite enough to check if the name is okay first.