Wednesday, 28 January 2026

Why Your AI Coworker Will Never Understand Your Code

The Uncomfortable Pattern Everyone's Ignoring

When Claude Code first started autocompleting my code with uncanny accuracy, I felt what every engineer feels: a flash of obsolescence. Here's a system that writes cleaner boilerplate than I do, recalls API signatures I've forgotten, and implements patterns faster than I can type them. Then I asked it to help with a custom thread-synchronization algorithm I was designing - something that had never been written before - and watched it confidently generate a total mess.

The pattern repeats everywhere. LLMs write SQL queries brilliantly until you need to optimize for your specific data distribution. They explain React patterns perfectly until you're debugging a novel state management approach. They're simultaneously geniuses and confused grade schoolers, and everyone's pretending this jaggedness is just a scaling problem waiting to be solved.

Andrej Karpathy finally said what we've been avoiding: we're not building animals that learn from reality. We're summoning ghosts - digital entities distilled from humanity's text corpus, optimized for mimicry rather than understanding. And if Yann LeCun is right, the architecture itself guarantees they can never become anything else.

This isn't an argument to stop using LLMs. It's a recognition that the tools transforming software engineering today have a ceiling we need to see clearly, because your career decisions should account for what AI will never do, not just what it's learning to automate.

What Karpathy Actually Said About Ghosts

The metaphor landed because it captured an asymmetry everyone building with LLMs feels but struggles to articulate. Karpathy's framing is black and white: "today's frontier LLM research is not about building animals. It is about summoning ghosts."

The distinction is architectural. Animals learn through dynamic interaction with reality. The "AGI machine" concept envisions systems that form hypotheses, test them against the world, experience consequences, and adapt.

There's no massive pretraining stage of imitating internet webpages. There's no supervised finetuning where actions are teleoperated by other agents. 

Animals observe demonstrations but their actions emerge from consequences, not imitation.

Ghosts are different. They're "imperfect replicas, a kind of statistical distillation of humanity's documents with some sprinkle on top." 

The optimization pressure is fundamentally different: human neural nets evolved for tribal survival in jungle environments - optimization against physical reality with life-or-death consequences.

LLM neural nets are optimized for imitating human text, collecting rewards on math puzzles, and getting upvotes on LMSYS Arena. These pressures produce different species of intelligence.

This creates what Karpathy calls "jagged intelligence" - expertise that doesn't follow biological patterns because it wasn't shaped by biological constraints. 

An LLM can explain quantum field theory while failing basic common sense about physical objects. It writes elegant code for standard patterns while generating nonsense for novel architectures.

The jaggedness isn't a bug - it's the signature of learning from corpus statistics rather than reality.

LeCun's Mathematical Doom Argument

While Karpathy's ghost metaphor describes the phenomenology, Yann LeCun argues the architecture itself is mathematically doomed. His position isn't "we need better training" - it's "autoregressive generation can't work for genuine intelligence."

The core argument is this: imagine the space of all possible token sequences as a tree. Every token you generate has options - branches in the tree. 

Within this massive tree exists a much smaller subtree corresponding to "correct" answers. 

Now imagine a probability e that any given token takes you outside that correct subtree. Once you leave, you can't return - errors accumulate. The probability that a sequence of length n remains correct is (1-e)^n.



This is exponential decay. Even if you make e small through training, you cannot eliminate it entirely. Over sufficiently long sequences, autoregressive generation inevitably diverges from correctness. You can delay the problem but you cannot solve it architecturally.
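To make the decay concrete, here's a quick back-of-the-envelope sketch in Python (illustrative numbers, not LeCun's; the independent-error assumption is his simplification):

# Probability that an n-token sequence never leaves the "correct" subtree,
# assuming an independent per-token error rate e.
def p_correct(e, n):
    return (1 - e) ** n

for e in (0.001, 0.01):
    for n in (100, 1000, 10000):
        print(f"e={e}, n={n}: {p_correct(e, n):.2e}")

# e=0.001: ~0.90 at n=100, ~0.37 at n=1000, ~4.5e-05 at n=10000
# e=0.01:  ~0.37 at n=100, ~4.3e-05 at n=1000, effectively 0 at n=10000

Even a 1-in-1000 per-token slip makes a 10,000-token derivation almost certainly wrong somewhere.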

LeCun's critics point out the math assumes independent errors, which isn't true - modern LLMs use context to self-correct. They note that LLMs routinely generate coherent thousand-token responses, which seems impossible under exponential decay. Recent research shows errors concentrate at sparse "key tokens" (5-10% of total) representing critical semantic junctions, not uniformly across all tokens.

But LeCun's deeper point stands: the autoregressive constraint means sequential commitment without exploring alternatives before acting.

The Lookback vs Lookahead Distinction

To be precise about what "autoregressive" actually constrains: LLMs have full backward attention - at each token, the model attends to ALL previous tokens in the context. This is fundamental to how transformers work. They're constantly "looking back."

What they don't do is lookahead during generation:

Standard Autoregressive Generation:
Token 1: Generate → COMMIT (can't change later)
Token 2: Generate → COMMIT  
Token 3: Generate → COMMIT
...each decision is final upon generation

Compare this to search-based planning (like AlphaGo):

Consider move A → simulate outcome → score: 0.6
Consider move B → simulate outcome → score: 0.8  
Consider move C → simulate outcome → score: 0.4
Choose B (explored before committing)

Chess analogy: Standard LLM generation is like being forced to move immediately after seeing the board position, without considering "if I move here, opponent does this, then I..." Human planning involves internally simulating multiple futures before committing to action. Autoregressive generation commits token-by-token without exploring alternative continuations.
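Here's a toy sketch of the difference in plain Python (propose_tokens and score are made-up stand-ins for a model and a world-model evaluator, not any real API):

import random

def propose_tokens(prefix, k=3):
    # Stand-in for a model's top-k next-token candidates.
    return [f"tok{i}" for i in range(k)]

def score(sequence):
    # Stand-in for an evaluator that simulates and judges a continuation.
    return random.random()

def greedy_decode(prefix, steps=5):
    # Autoregressive style: commit to each token the moment it's chosen.
    for _ in range(steps):
        prefix = prefix + [propose_tokens(prefix)[0]]   # final upon generation
    return prefix

def lookahead_decode(prefix, steps=5):
    # Search style: simulate each candidate, score it, then commit to the best.
    for _ in range(steps):
        best = max(propose_tokens(prefix), key=lambda t: score(prefix + [t]))
        prefix = prefix + [best]
    return prefix

The second loop is doing, crudely, what the AlphaGo-style comparison above describes: explore before committing.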

What Will People Say to This Argument?

Modern techniques add planning on top of base generation. Chain-of-thought generates reasoning tokens first but still commits sequentially. 

Beam search keeps multiple candidates but is exponentially expensive for deep exploration. 
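A minimal beam search sketch (toy helpers with hypothetical names) shows both why it helps and why it doesn't scale: it hedges across beam_width partial sequences per step, but genuinely exploring d steps of a vocabulary of size V would mean V**d continuations.

def beam_search(initial, expand, score, beam_width=3, depth=5):
    # Keep only the top beam_width partial sequences at each step.
    beams = [list(initial)]
    for _ in range(depth):
        candidates = [seq + [tok] for seq in beams for tok in expand(seq)]
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]

# Exhaustive deep exploration is hopeless: V=50_000, d=10 gives
# 50_000**10, roughly 1e47 possible continuations.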

OpenAI's o1 reportedly uses tree search during inference, which IS genuine lookahead - a significant architectural addition beyond pure autoregressive generation.

LeCun's claim isn't that these improvements are impossible. It's that they're band-aids on an architecture that doesn't naturally support the kind of internal world simulation that characterizes animal intelligence.

Four Gaps That Can't Be Trained Away

LeCun identifies four characteristics of intelligent behavior that LLMs fundamentally lack: understanding the physical world, persistent memory, the ability to reason, and the ability to plan. But the deepest issue is what they're optimized for.

Consider what this means for scientific reasoning. Scientists don't generate hypotheses by pattern-matching previous hypotheses - they observe phenomena, form novel explanations, design experiments to falsify them, observe results that surprise them, and refine their models. 

Every step involves interaction with ground truth that can prove you wrong.

LLMs have no mechanism for this. 

- They can't run an experiment and be surprised. 

- They can't observe results that contradict their predictions and update based on physical evidence.

Every token is inference from prior tokens in a corpus that only contains what was already discovered and written down. You cannot discover novel physics from a corpus that only contains known physics.

This explains why LLMs excel at code but struggle with physical reasoning. Code operates in "a universe that is limited, discrete, deterministic, and fully observable" - the state space is knowable and verification is programmatic. 

Physical reality is continuous, partially observable, probabilistic, and full of phenomena we haven't documented. 

Animals navigate this effortlessly because they learn directly from it. LLMs can only learn from our linguistic shadows of it.

When LeCun states "Auto-Regressive LLMs can't plan (and can't really reason)," he's not being provocative - he's describing an architectural constraint. 

Even chain-of-thought prompting doesn't fix this because it's "converting the planning task into a memory-based (approximate) retrieval." You're not teaching reasoning - you're teaching corpus-level pattern matching about what reasoning looks like. This is exactly why the prompt behaves like a hyperparameter and why the model is so sensitive to it.

The LLM learns to generate text that resembles reasoning steps because that's what appears in the training data, not because it's internally simulating multiple future scenarios and choosing the best path.

Why Code Works But Reality Doesn't

I've noticed an interesting pattern with GitHub Copilot/Claude Code. When implementing a standard REST API or writing React components, the suggestions are good - often exactly what I was about to type. 

When debugging a distributed systems issue or architecting a novel state management approach, the suggestions become actively unhelpful, confidently wrong in ways that would break production.

The difference isn't random. Standard patterns exist extensively in training corpora - GitHub is full of REST APIs and React components.

The LLM has seen thousands of implementations and learned the statistical regularities of how these patterns manifest in code.

It's not understanding the requirements and generating a solution; it's recognizing "this looks like a REST endpoint" and retrieving an approximate match from the distribution of REST endpoints in its training data.

For novel code that deviates from conventions, this breaks down. When I was building custom thread synchronization, the models repeatedly failed because they kept pattern-matching to standard practices - adding defensive try-catch statements, turning a focused implementation into a bloated production framework.

They couldn't understand my actual intent because they don't understand intent at all. They understand corpus statistics.

This is why code works better than general reasoning for LLMs: code is verifiable, domains are closed, and common patterns dominate the training data. You can build benchmarks with programmatic correct answers. 

You can use Reinforcement Learning from Verifiable Rewards (RLVR) because verification is automatic. But this success doesn't generalize to open-ended domains where ground truth isn't programmatically checkable.
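A hedged sketch of what "verification is automatic" means in practice (a hypothetical harness, assuming pytest is installed; not any lab's actual RLVR pipeline): the reward is literally whether generated code passes its tests.

import os, subprocess, tempfile

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    # Write the candidate and its tests to a scratch directory, run pytest,
    # and return a binary reward. No human judgment in the loop.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(candidate_code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        result = subprocess.run(["python", "-m", "pytest", "-q", "test_solution.py"],
                                cwd=tmp, capture_output=True, timeout=60)
        return 1.0 if result.returncode == 0 else 0.0

No equivalent oracle exists for "is this the right architecture for our workload" or "is this system secure against attacks we haven't imagined."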

The strategic question for engineers: which parts of your work are "standard patterns well-represented in training corpora" versus "novel architecture requiring genuine understanding"?

The first category is being automated at 100x speed. The second isn't just hard for current LLMs - it may be architecturally impossible for them.

What Animals Have That Ghosts Never Will

LeCun's alternative to LLMs is the Joint Embedding Predictive Architecture (JEPA), which inverts the paradigm entirely. Instead of predicting the next token in pixel/word space, JEPA learns to predict in latent representation space - building an internal world model that captures structural regularities while ignoring unpredictable details.

Key insight: most of reality's information is noise. When you watch a video of someone throwing a ball, the exact trajectory is predictable from physics but the precise pixel values (lighting, shadows, texture) contain high entropy. 

Generative models waste capacity modeling all this unpredictable detail. JEPA learns representations that "choose to ignore details of the inputs that are not easily predictable" and focus on "low-entropy, structural aspects" - like the parabolic arc, not the exact RGB values.
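A minimal sketch of that idea in PyTorch (my simplification for illustration, not Meta's implementation; the no-grad target encoder stands in for the EMA target encoder used in practice):

import torch
import torch.nn as nn

class ToyJEPA(nn.Module):
    def __init__(self, obs_dim=512, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.predictor = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                       nn.Linear(latent_dim, latent_dim))

    def loss(self, visible, future):
        z_visible = self.encoder(visible)
        with torch.no_grad():
            z_future = self.encoder(future)   # target latent, gradient-free
        z_pred = self.predictor(z_visible)
        # The error lives in latent space: unpredictable pixel detail
        # (lighting, texture) never has to be reconstructed.
        return ((z_pred - z_future) ** 2).mean()

At inference time, a large latent prediction error on a clip is the kind of "surprise" signal the V-JEPA physics experiments below report.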

This mirrors biological learning. An infant knocking objects off a table learns gravity not by memorizing pixel sequences but by building an abstract model: "objects fall downward." 

The model ignores irrelevant details (color, texture, lighting) and captures the physical law. 

No books required, no 170,000 years of reading - just observation and interaction.

Meta's V-JEPA demonstrates this works. When tested on physics violations (objects floating mid-air, collisions with impossible outcomes), it showed higher surprise/prediction error than state-of-the-art generative models. 

It acquired common-sense physics from raw video by building an actual world model, not by memorizing corpus statistics about how people describe physics.

The architectural difference matters because it determines what's learnable. LLMs learn "what humans wrote about the world" - a tiny, biased, lossy compression of it.



JEPA-style models can learn "how the world actually works" through observation. The first hits a data ceiling when you've processed all available text. The second has access to reality's infinite bandwidth.



Architecture Is The Constraint

LeCun's prediction is bold: "within three to five years, no one in their right mind would use" autoregressive LLMs. 

His position is that better systems will appear, "but they will be based on different principles. They will not be auto-regressive LLMs."

This isn't incrementalism. It's claiming the entire paradigm is a dead end for genuine intelligence. The reason is architectural: 

LLMs and humans play by completely different rules. One is a master of compression, and the other is a master of adaptation.

Simply feeding more data to this compression beast will only make it bigger and stronger, but it won't make it evolve into an adaptive hunter.

Consider what this means for the current race toward AGI via scaling. If LeCun is right, we're optimizing along a dimension that can't reach the target. Better compression of human text gets you better mimicry, not understanding. 

Larger context windows let you mimic longer documents, not think longer thoughts. 

RLVR on verifiable domains like code and math creates better pattern-matchers for those domains, not general reasoners.

The counterargument sounds convincing: LLMs keep surprising us with emergent capabilities. GPT-4 does things GPT-3 couldn't, and Claude Sonnet 4 does things GPT-4 struggled with.

Maybe there's no architectural ceiling, just insufficient scale. Maybe chain-of-thought reasoning plus tool use plus larger context windows eventually produces something indistinguishable from genuine intelligence.

LeCun's response: show me the world model. Show me the system that can watch a video, form a novel hypothesis about what happens next, be surprised when it's wrong, and update its model of reality.

Autoregressive text generation can't do this by construction - it has no mechanism for ground truth interaction, no ability to be surprised by reality rather than corpus statistics.

What This Actually Means For Your Career

The practical implication isn't abandoning LLMs - they're extraordinarily useful for what they actually are. It's recognizing their ceiling so your skill development accounts for what AI will never automate.

Here's the framework: 

LLMs excel at problems where (1) the solution space is well-represented in training corpora, (2) verification is possible through execution or programmatic checking, and (3) the domain is closed and discrete.

They struggle where (1) the problem is genuinely novel, (2) correctness requires understanding beyond pattern-matching, or (3) the domain involves continuous physical reality or open-ended reasoning.

This creates a clear dividing line in engineering work:

Automatable (pattern-matching sufficient): Standard CRUD implementations, boilerplate reduction, API integration following documentation, test generation for known patterns, code explanation and documentation, refactoring for style consistency.

Not automatable (understanding required): Novel algorithm design, distributed systems debugging with emergent behavior, performance optimization for specific workload characteristics, architectural decisions balancing tradeoffs, security reasoning about attack surfaces, integration of fundamentally new technologies.

The difference isn't difficulty - it's whether success requires recognizing patterns in existing code versus forming and testing novel hypotheses about system behavior. 

One is corpus retrieval, the other is scientific method.

For career strategy, this suggests investing in skills that require building world models: understanding how systems actually behave under load, why certain architectural patterns create subtle failure modes, what tradeoffs matter for your specific context. 

These aren't pattern-matching problems. They're "I need to understand this system well enough to predict what happens in scenarios I haven't seen."

The engineers who thrive won't be those who resist AI tools. They'll be those who understand exactly which problems LLMs can solve (letting them automate aggressively) versus which problems require genuine understanding (where they need to think like scientists, not pattern-matchers). 

The tools are getting better at mimicry. But mimicry isn't understanding, and the architecture guarantees it never will be.

If Karpathy's right that we're summoning ghosts, and LeCun's right that ghosts can never become animals, then the question isn't "how do I compete with AI?"

It's "which problems require animal intelligence?" Those problems aren't going away. They might be the only ones that matter.
