Friday, 27 March 2026

Similarity is not Relevance

There is a subtle confusion baked into every LLM-powered system in production today, and it is responsible for a larger fraction of failures than most teams realize. The confusion is this: we have built systems optimized for similarity, and we have shipped them as if they deliver relevance. They do not, and the difference is not academic.

Similarity is a geometric property. Two things are similar when they are close to each other in some metric space — cosine distance between embeddings, edit distance between strings, perplexity under a language model. It is computable, differentiable, and entirely indifferent to purpose. Relevance, by contrast, is teleological. Something is relevant if it advances a goal, reduces uncertainty, or changes what you should do next. Relevance is defined relative to an intention. Similarity is blind to intention.
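The "close in some metric space" claim can be made concrete in a few lines. This sketch computes cosine similarity between two hand-made vectors; the vectors are invented for illustration, not real embeddings. The point is that the number measures only geometric alignment and carries no information about what either vector is for.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = orthogonal. Purely geometric; indifferent to purpose.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two illustrative vectors. They could "mean" anything; the score
# says only how aligned they are, not whether one answers the other.
v1 = [0.9, 0.1, 0.3]
v2 = [0.8, 0.2, 0.4]
score = cosine_similarity(v1, v2)  # high alignment, whatever the intent
```

Everything downstream of a retrieval system is built on a number like `score`; the intention that defines relevance never enters the computation.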

Every major component of modern LLM stacks — retrieval, generation, alignment — is built on similarity. When they fail, they fail for the same reason: they found what was close, not what was needed.

The model is always correct about what is similar. It has no native mechanism for knowing what is needed.


Similarity ≠ Relevance

The autocomplete that always answers the wrong question


Consider a developer working on a distributed payment service. She types a function signature for retry logic with exponential backoff and asks the coding assistant to complete it. The assistant produces a clean, syntactically valid implementation — well-formatted, documented, handling the common cases. It looks exactly like the retry logic that appears in ten thousand open-source repositories.

What the assistant has done is retrieve and synthesize code that is maximally similar to retry logic in its training distribution. What the developer needed was retry logic that respects the idempotency contract already established elsewhere in the codebase, coordinates with the circuit-breaker state that her colleague committed last week, and avoids the cascading retry storm that their incident review identified two sprints ago. None of that information lives in the local similarity neighborhood of "exponential backoff implementation."

The assistant solved the problem it could measure. It optimized for syntactic and semantic proximity to known good code. But relevance in this context is defined by the surrounding system — the architecture decisions, the failure post-mortems, the implicit contracts between services. These are irreducibly contextual. They do not compress into an embedding.
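What the assistant produces looks something like the sketch below: a perfectly reasonable generic pattern, here with full jitter. This is a minimal illustration, not any particular assistant's output. Notice what is absent: nothing in it consults an idempotency contract, a circuit-breaker state, or a retry budget, because none of those live in the similarity neighborhood of "exponential backoff."

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Generic exponential backoff with full jitter -- the pattern a
    similarity-driven assistant reproduces from thousands of repos.
    Nothing here knows about the surrounding system's contracts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Jitter de-synchronizes concurrent retries, but still does
            # nothing about idempotency or cascading retry storms.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The code is correct in isolation. Its relevance to this payment service depends entirely on constraints it cannot see.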


The deeper problem is that similarity-based retrieval actively misleads by presenting confident outputs. A retrieved chunk with a cosine similarity of 0.91 feels authoritative. The developer accepts it, integrates it, and the failure surfaces in production three months later, not as an obvious crash but as a subtle degradation under specific load patterns.

The similarity score was high; the relevance was near zero.



Fancy Words, Empty Head

Email generation is where the similarity-relevance gap is most visible and least discussed, because the outputs feel so undeniably correct. 

You ask the model to draft a follow-up email to a client who missed a deadline. It produces something professional, appropriately apologetic, clear in its next-step request, and tonally calibrated to business correspondence. 

Every sentence resembles what a senior professional would write in this situation.

But "resembles" is exactly the problem. 

The model has matched the surface pattern of the email genre. It does not know that this particular client is two weeks from the end of their annual contract and the conversation has been quietly tense since a pricing dispute in Q3. 

It does not know that the missed deadline was likely caused by a restructuring on their side that your account manager mentioned in passing on Slack. 

It does not know that a direct ask for a new timeline would land badly right now, while an offer of support would open the door. The relevant email is defined by that relational history, not by similarity to the genre of follow-up emails.

The model produces text that is similar in form to a good email. It has no mechanism for knowing whether the email is good for this situation.


What gets produced is a document that would earn an A in a business writing course and accomplish nothing — or worse, accelerate a deteriorating relationship by applying a generic professional register to a moment that required something specific. The failure is invisible because the output is fluent. Fluency is a similarity property. 

It measures proximity to well-formed text. It says nothing about whether the text does the right work in the right moment.

This is where RLHF compounds the problem. Human raters, presented with the email during training, reward it — because it looks like good professional writing. 

The model is trained to produce outputs that humans rate as high quality in isolated evaluation. But isolated evaluation cannot capture relational context. 

The model gets better at producing emails that resemble good emails. The gap between resemblance and genuine utility quietly widens.


When grouping by proximity destroys meaning


Clustering is the case that most directly exposes the architectural assumption underneath the whole stack. When you cluster documents, support tickets, or customer feedback using LLM embeddings, you are grouping by geometric proximity in the embedding space. The algorithm puts similar things together. This is, on its face, exactly what clustering should do.

Except that the purpose of clustering is never geometry. The purpose is always analytical — you are trying to understand the structure of a problem, identify actionable segments, or surface patterns that inform a decision. And those analytical goals define what "same group" means, independently of what "similar text" means.

A support ticket that reads "the dashboard is slow" and a ticket that reads "the API is timing out" might be semantically distant in embedding space — different vocabulary, different technical register, different surface description. But if both are caused by the same database query bottleneck, they belong in the same bucket for the engineering team. Conversely, two tickets that both say "I can't log in" might be superficially identical but one is a password reset issue and one is an account suspension, and routing them to the same team is actively harmful.
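The ticket example can be made concrete with a toy sketch. The vectors and cause labels below are invented for illustration (real embeddings are high-dimensional and learned): geometric grouping puts the two "I can't log in" tickets together and separates the two database tickets, while grouping by root cause does the opposite.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical tickets: a 2-D stand-in for an embedding, plus the actual
# root cause -- which the geometry cannot see.
tickets = [
    {"text": "the dashboard is slow", "vec": [0.9, 0.1], "cause": "db_bottleneck"},
    {"text": "the API is timing out", "vec": [0.1, 0.9], "cause": "db_bottleneck"},
    {"text": "I can't log in",        "vec": [0.5, 0.5], "cause": "password_reset"},
    {"text": "I can't log in",        "vec": [0.5, 0.5], "cause": "account_suspended"},
]

# Similarity view: identical surface text is maximally close,
# while the two tickets sharing a root cause are far apart.
login_similarity = cosine(tickets[2]["vec"], tickets[3]["vec"])  # near 1.0
db_similarity = cosine(tickets[0]["vec"], tickets[1]["vec"])     # low

# Relevance view: bucket by cause, which is what routing actually needs.
by_cause = defaultdict(list)
for t in tickets:
    by_cause[t["cause"]].append(t["text"])
```

The embedding-based grouping and the cause-based grouping disagree on every ticket in this toy, which is exactly the failure mode described above.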



Similarity clusters by surface; relevance clusters by cause. The right grouping depends on what you intend to do with the groups.


The geometry of the embedding space does not know that your goal is actionable routing. It knows word co-occurrence patterns. Sometimes those align. When the stakes are low, the alignment is good enough. When you are making resource allocation decisions, prioritizing engineering work, or segmenting customers for intervention, the gap between what is similar and what is relevant determines whether the analysis was worth running.

The seductive part is that similarity-based clusters look coherent. The topics within each cluster feel related. The outputs pass a plausibility check. But plausibility is another similarity property — it measures whether the output resembles something true. It does not measure whether the groupings actually serve the analytical purpose for which the clustering was run.




Three faces, one failure

Across all three cases — the code that fits the genre but breaks the system, the email that sounds right but says the wrong thing, the clusters that are coherent but not actionable — the failure has identical structure. 

The model or the pipeline optimized for a measurable proxy (syntactic similarity, surface fluency, geometric proximity) and produced an output that scores well on that proxy. 

The proxy and the goal coincided in the average training case. They diverged in the specific deployment case. The system had no way to detect the divergence.

This is not a hallucination problem. The outputs in all three cases can be entirely accurate in a narrow sense. The code is syntactically correct. The email is factually unobjectionable. The clusters are internally coherent. The failure is not falseness — it is misalignment between what was optimized and what was needed.


What this means practically is that the verification burden sits entirely with the human in the loop. Every LLM output comes pre-packaged with high confidence and fluent presentation — both similarity properties — and zero signal about whether it is relevant to the specific situation at hand. The engineer must know the system well enough to see past the fluent implementation. The account manager must know the client well enough to see past the professional tone. The analyst must know the business well enough to see past the coherent clusters. The AI provides the shape of an answer. Relevance is still a human judgment.


Conclusion

None of this means the tools are not useful. Similarity to good outputs is a genuinely valuable prior. 

A coding assistant that produces implementations similar to idiomatic, working code accelerates the developer who knows the system. 

An email assistant that produces text similar to professional correspondence accelerates the writer who knows the relationship.

The similarity machinery handles the generic, leaving the expert to handle the specific.

The error is the frame — treating outputs that scored high on similarity as if they had been evaluated for relevance. 

They have not been. They cannot be, because relevance requires the deployment context that was absent at training time. 

The model is excellent at finding what is close. Determining whether what is close is what is needed remains, stubbornly, a problem for the human who knows what is needed.

Confusion about this distinction is not a minor issue. It is the source of an entire category of quiet, confident, professionally formatted failures.

