
Friday, 27 March 2026

Similarity is not Relevance

There is a subtle confusion baked into every LLM-powered system in production today, and it is responsible for a larger fraction of failures than most teams realize. The confusion is this: we have built systems optimized for similarity, and we have shipped them as if they deliver relevance. They do not, and the difference is not academic.

Similarity is a geometric property. Two things are similar when they are close to each other in some metric space — cosine distance between embeddings, edit distance between strings, perplexity under a language model. It is computable, differentiable, and entirely indifferent to purpose. Relevance, by contrast, is teleological. Something is relevant if it advances a goal, reduces uncertainty, or changes what you should do next. Relevance is defined relative to an intention. Similarity is blind to intention.
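
To make the geometric point concrete, here is a minimal Python sketch (with made-up toy vectors rather than real embeddings) of what a similarity score actually computes. Nothing in the calculation refers to a goal or an intention.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Purely geometric: the angle between two vectors, nothing about purpose.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of a query and two documents.
query      = np.array([0.9, 0.1, 0.3])  # "how should I retry a failed payment?"
doc_close  = np.array([0.8, 0.2, 0.4])  # generic exponential-backoff article
doc_useful = np.array([0.1, 0.9, 0.7])  # your team's idempotency post-mortem

print(cosine_similarity(query, doc_close))   # high score: "similar"
print(cosine_similarity(query, doc_useful))  # lower score, despite being more relevant
```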

Every major component of modern LLM stacks — retrieval, generation, alignment — is built on similarity. When they fail, they fail for the same reason: they found what was close, not what was needed.

The model is always correct about what is similar. It has no native mechanism for knowing what is needed.


Similarity ≠ Relevance








The autocomplete that always answers the wrong question


Consider a developer working on a distributed payment service. She types a function signature for retry logic with exponential backoff and asks the coding assistant to complete it. The assistant produces a clean, syntactically valid implementation — well-formatted, documented, handling the common cases. It looks exactly like the retry logic that appears in ten thousand open-source repositories.

What the assistant has done is retrieve and synthesize code that is maximally similar to retry logic in its training distribution. What the developer needed was retry logic that respects the idempotency contract already established elsewhere in the codebase, coordinates with the circuit-breaker state that her colleague committed last week, and avoids the cascading retry storm that their incident review identified two sprints ago. None of that information lives in the local similarity neighborhood of "exponential backoff implementation."

The assistant solved the problem it could measure. It optimized for syntactic and semantic proximity to known good code. But relevance in this context is defined by the surrounding system — the architecture decisions, the failure post-mortems, the implicit contracts between services. These are irreducibly contextual. They do not compress into an embedding.


The deeper problem is that similarity-based retrieval actively misleads by presenting confident outputs. A retrieved chunk with cosine similarity 0.91 feels authoritative. The developer accepts it, integrates it, and the failure surfaces in production three months later — not as an obvious crash, but as a subtle degradation under specific load patterns. 

The similarity score was high; the relevance was near zero.



Fancy Word, Empty Head

Email generation is where the similarity-relevance gap is most visible and least discussed, because the outputs feel so undeniably correct. 

You ask the model to draft a follow-up email to a client who missed a deadline. It produces something professional, appropriately apologetic, clear in its next-step request, and tonally calibrated to business correspondence. 

Every sentence resembles what a senior professional would write in this situation.

But "resembles" is exactly the problem. 

The model has matched the surface pattern of the email genre. It does not know that this particular client is two weeks from the end of their annual contract and the conversation has been quietly tense since a pricing dispute in Q3. 

It does not know that the missed deadline was likely caused by a restructuring on their side that your account manager mentioned in passing on Slack. 

It does not know that a direct ask for a new timeline would land badly right now, while an offer of support would open the door. The relevant email is defined by that relational history, not by similarity to the genre of follow-up emails.

The model produces text that is similar in form to a good email. It has no mechanism for knowing whether the email is good for this situation.


What gets produced is a document that would earn an A in a business writing course and accomplish nothing — or worse, accelerate a deteriorating relationship by applying a generic professional register to a moment that required something specific. The failure is invisible because the output is fluent. Fluency is a similarity property. 

It measures proximity to well-formed text. It says nothing about whether the text does the right work in the right moment.

This is where RLHF compounds the problem. Human raters, presented with the email during training, reward it — because it looks like good professional writing. 

The model is trained to produce outputs that humans rate as high quality in isolated evaluation. But isolated evaluation cannot capture relational context. 

The model gets better at producing emails that resemble good emails. The gap between resemblance and genuine utility quietly widens.


When grouping by proximity destroys meaning


Clustering is the case that most directly exposes the architectural assumption underneath the whole stack. When you cluster documents, support tickets, or customer feedback using LLM embeddings, you are grouping by geometric proximity in the embedding space. The algorithm puts similar things together. This is, on its face, exactly what clustering should do.

Except that the purpose of clustering is never geometry. The purpose is always analytical — you are trying to understand the structure of a problem, identify actionable segments, or surface patterns that inform a decision. And those analytical goals define what "same group" means, independently of what "similar text" means.

A support ticket that reads "the dashboard is slow" and a ticket that reads "the API is timing out" might be semantically distant in embedding space — different vocabulary, different technical register, different surface description. But if both are caused by the same database query bottleneck, they belong in the same bucket for the engineering team. Conversely, two tickets that both say "I can't log in" might be superficially identical but one is a password reset issue and one is an account suspension, and routing them to the same team is actively harmful.
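
A minimal sketch of this, using invented embedding vectors in place of a real embedding model, shows how k-means groups by geometry alone:

```python
import numpy as np
from sklearn.cluster import KMeans

tickets = [
    "the dashboard is slow",   # root cause: database query bottleneck
    "the API is timing out",   # root cause: the same database bottleneck
    "I can't log in",          # root cause: password reset needed
    "I can't log in",          # root cause: account suspension
]

# Invented vectors; a real pipeline would call an embedding model here.
embeddings = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.8],
    [0.1, 0.8, 0.9],
    [0.1, 0.8, 0.9],
])

# K-means groups strictly by geometric proximity in the embedding space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(list(zip(tickets, labels)))
# With these toy vectors the two database-bottleneck tickets land in different
# clusters, while the timeout ticket groups with the login tickets: the
# geometry tracked vocabulary, not root cause.
```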



Similarity clusters by surface. Relevance clusters by cause. The right grouping depends on what you intend to do with the groups.


The geometry of the embedding space does not know that your goal is actionable routing. It knows word co-occurrence patterns. Sometimes those align. When the stakes are low, the alignment is good enough. When you are making resource allocation decisions, prioritizing engineering work, or segmenting customers for intervention, the gap between what is similar and what is relevant determines whether the analysis was worth running.

The seductive part is that similarity-based clusters look coherent. The topics within each cluster feel related. The outputs pass a plausibility check. But plausibility is another similarity property — it measures whether the output resembles something true. It does not measure whether the groupings actually serve the analytical purpose for which the clustering was run.




Three faces, one failure

Across all three cases — the code that fits the genre but breaks the system, the email that sounds right but says the wrong thing, the clusters that are coherent but not actionable — the failure has identical structure. 

The model or the pipeline optimized for a measurable proxy (syntactic similarity, surface fluency, geometric proximity) and produced an output that scores well on that proxy. 

The proxy and the goal coincided in the average training case. They diverged in the specific deployment case. The system had no way to detect the divergence.

This is not a hallucination problem. The outputs in all three cases can be entirely accurate in a narrow sense. The code is syntactically correct. The email is factually unobjectionable. The clusters are internally coherent. The failure is not falseness — it is misalignment between what was optimized and what was needed.


What this means practically is that the verification burden sits entirely with the human in the loop. Every LLM output comes pre-packaged with high confidence and fluent presentation — both similarity properties — and zero signal about whether it is relevant to the specific situation at hand. The engineer must know the system well enough to see past the fluent implementation. The account manager must know the client well enough to see past the professional tone. The analyst must know the business well enough to see past the coherent clusters. The AI provides the shape of an answer. Relevance is still a human judgment.


Conclusion

None of this means the tools are not useful. Similarity to good outputs is a genuinely valuable prior. 

A coding assistant that produces implementations similar to idiomatic, working code accelerates the developer who knows the system. 

An email assistant that produces text similar to professional correspondence accelerates the writer who knows the relationship.

The similarity machinery handles the generic, leaving the expert to handle the specific.

The error is the frame — treating outputs that scored high on similarity as if they had been evaluated for relevance. 

They have not been. They cannot be, because relevance requires the deployment context that was absent at training time. 

The model is excellent at finding what is close. Determining whether what is close is what is needed remains, stubbornly, a problem for the human who knows what is needed.

Confusion about this distinction is a major issue. It is the source of an entire category of quiet, confident, professionally-formatted failures.


Sunday, 22 March 2026

Induced Demand Loop: Anthropic Sells You the Problem, Then the Solution

Anthropic built Claude Code to write your software. They have done an impressive job of making it one of the most preferred agentic coding tools. It aims to ensure you generate good code the first time, or with shorter iteration loops.

Now it sells Claude to review what Claude wrote. The snake has found its tail — and this is not an accident.


There is a pattern in business history that feels, the first time you notice it, like a conspiracy. A company creates a category of problem, then creates the solution, then collects rent from the gap between the two. 

Security consultancies who audited the systems they also architected. 

ERP vendors who sold implementation services for the complexity they introduced. 

Management consultants who institutionalized the inefficiencies they were paid to eliminate.

The AI era has produced its own version of this. It is more elegant than the historical ones — structurally self-reinforcing in a way the older models could only approximate. And Anthropic, with the quiet launch of code review as a product category following Claude Code, has demonstrated the loop with unusual clarity.

First, They Shipped the Generator

Claude Code is, at its core, an autonomous coding agent. It reads your codebase, writes implementations, refactors modules, scaffolds tests, and submits pull requests with the confidence of a senior engineer who has never experienced the social cost of a bad review. It is fast, tireless, and cheap. It is also — and this matters — statistically wrong in ways that are difficult to detect without reading every line it produces.



The product was sold, correctly, as a productivity multiplier. The pitch was straightforward: software engineering is bottlenecked on implementation speed, and Claude Code removes that bottleneck. Ship faster. Do more with fewer engineers. The implementation is no longer the hard part.

What this framing quietly omitted was the second-order effect. If you remove the implementation bottleneck, you do not get the same system running faster — you get a different system running under entirely new constraints. The bottleneck shifts. And the new bottleneck, almost inevitably, is verification.


The speed of generation outpaces the speed of comprehension. Code review was already the slowest lane on the engineering highway. Claude Code just added ten more lanes of traffic.

Every line that Claude Code writes must be read by someone who understands it well enough to sign off on it. That person is, in most organizations, increasingly rare. 

The engineers who remain after a round of AI-enabled headcount reduction are the ones reviewing output, not producing it. They were already stretched. Now they are reviewing five times as much code per day. Quality degrades. Bugs ship. Technical debt accumulates at the speed of token generation.


Then, They Shipped the Reviewer

The code review product is the second half of the loop. It reads the code — implicitly, the code that Claude Code wrote — and identifies issues, suggests improvements, flags security concerns, enforces architectural consistency. It is, in essence, an AI that reviews the output of a different AI trained by the same company, sold to the same customer, billed on the same invoice.

The symmetry is so clean it almost obscures the mechanism. But the mechanism is precise: Claude Code created the supply of unreviewed code. Code review created the demand for reviewing it. The company captures value on both ends of the transaction. The customer pays twice for a problem they did not have before they adopted the first product.


The Pattern, Precisely

This is not identical to the older consulting-firm model, where the problem was manufactured through advice. Here, the problem is an emergent property of the product itself. Claude Code does not intend to create review debt — it simply does, structurally, as a consequence of its own efficiency. It is the rational response to a real problem. The fact that the same company profits from both sides is not malfeasance. It is alignment.



This is what I call the induced demand pattern — AI tools that structurally generate the conditions for their own expansion. The code generation category is the clearest instance yet. Generate more code, create more review surface, sell more review tooling, use that revenue to train better generation models, which generate more code. The loop is not just self-sustaining. It is self-accelerating.


Why the Snake Eats Its Own Tail

The ancient image of a serpent consuming itself was originally a symbol of cyclical renewal. The snake does not die; it feeds itself, perpetually. This is an accurate metaphor for what Anthropic has constructed.

The model that reviews the code learns from what it reviews. The patterns it flags become training signal for the model that writes the code next time. The review product improves the generation product, which increases the volume of code requiring review, which expands the market for the review product. There is no exterior — no part of this loop that does not feed back into the loop itself.






Compare this to the classical tech platform flywheel, where more users attract more sellers who attract more users. That loop is linear in its dependencies — it requires external participants at every node. The AI coding loop is tighter. The only external participant is the engineer, and even the engineer's role is progressively compressed as each generation of the model improves. The loop internalizes its own demand generation.


Implication for Engineers

The engineer who adopts Claude Code and then adopts the code review product has not automated away two separate problems. They have enrolled in a subscription to a problem-solution pair that is jointly managed by a vendor whose revenue depends on both sides of it remaining necessary. This is not a reason to reject the tools — the productivity gains are real, and the competitive pressure to adopt them is overwhelming. But it is a reason to be precise about what is actually happening.

The skills that used to be valuable in this workflow — the ability to write clean code quickly, to hold an architectural pattern in your head while implementing it — are being hollowed out from below. The skills that survive this compression are the ones at the top of the evaluation chain: the ability to read code written by someone else (or something else) and judge it accurately. The ability to know what a correct system feels like before you have built it. The ability to detect subtle errors in logic that no statistical model will flag because no statistical model has ever understood what the code is supposed to do.


The review product is not your ally in this dynamic. It is a product that profits most when the gap between what gets generated and what is actually correct remains large enough to require continuous attention.

This is the tension that no product announcement will name directly. Code review tooling, like all automated verification, has an incentive structure that is subtly misaligned with actually closing the verification gap. 

A perfect reviewer would put itself out of business. A profitable reviewer finds just enough to flag that you keep paying — while the deeper architectural drift, the slow divergence between what the system does and what it should do, accumulates beneath the surface of any automated check.


What the Pattern Predicts

If the induced demand pattern holds — and structurally, I believe it will — the next several years of AI developer tooling will follow a predictable shape. Every tool that accelerates a phase of the engineering lifecycle will create a corresponding tool that manages the debt that acceleration produces. Test generation will be followed by test quality analysis. Documentation generation will be followed by documentation accuracy verification. Architecture suggestion will be followed by architecture review.

Each pair will be sold by the same vendors, or by vendors whose incentives are structurally identical. Each pair will be presented as the solution to a problem, while quietly sustaining the conditions that make the problem recur. The stack will grow upward, each layer extracting value from the gap created by the layer below it.

The engineers who navigate this without becoming permanently dependent on it are the ones who maintain a clear model of what the system is supposed to do — not just what it currently does. That model is not a product. It cannot be sold, automated, or subscribed to. It is built slowly, through exposure to consequences, through the experience of being wrong in ways that matter and learning why.

Judgment compounds. Skills depreciate.

Human Judgment as a Cloud Function

Anthropic is not cynically manufacturing problems. The induced demand here is emergent, not engineered. But emergent does not mean neutral. The structure rewards continued dependence, punishes the development of in-house evaluation capability, and gradually transfers the judgment function — the most valuable thing an engineering team possesses — to a vendor whose model of your system is forever incomplete.

The snake eats its tail. The tail grows back. The snake is always hungry.



Friday, 11 April 2025

Model Context Protocol - Old wine in new bottle

 

Inspirations Behind the Model Context Protocol

Model Context Protocol appears to draw inspiration from several established protocols and architectural patterns in software engineering. Some of the key inspirations and the concepts MCP has adopted from them:



The brilliance of MCP is in how it combines these inspirations into a cohesive protocol specifically designed for the unique challenges of LLM context integration. Rather than reinventing the wheel, it takes established patterns that have proven successful in other domains and adapts them to the emerging requirements of AI applications.

What makes MCP unique is its focus on the specific needs of LLM applications, including:

  • Clear security boundaries for sensitive data
  • Standardised resource descriptions optimised for LLM consumption
  • Bidirectional sampling capabilities that enable agentic patterns

This combination of established patterns with AI-specific requirements creates a protocol that feels familiar to developers while addressing the novel challenges of LLM integration.



What is the Model Context Protocol?

Model Context Protocol is a JSON-RPC based protocol designed to standardize communication between AI models and external systems. It enables AI models to access contextual information, tools, and resources from different providers through a unified interface. MCP essentially serves as a bridge, allowing models to extend their capabilities beyond their core training.
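
Under the hood these are ordinary JSON-RPC 2.0 messages. The following is an illustrative sketch of what a tool-listing exchange could look like; the field names are simplified, so consult the official MCP schema for the exact structure.

```python
import json

# Illustrative JSON-RPC 2.0 exchange, loosely following the MCP shape.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",   # client asks the server which tools it exposes
    "params": {},
}

response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "inputSchema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ]
    },
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```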

Key Parties in MCP

MCP involves two primary parties:

  1. Client - Typically represents the AI model or the application hosting the model. The client initiates connections, makes requests for information, and may also receive requests from the server.
  2. Server - Provides resources, tools, and contextual information that the client can access. Servers can be specialized providers of specific functionality or broader ecosystem components.





Core Concepts in MCP



Key Parties in MCP



MCP Workflow



How do tools work

Tools are the key abstraction in MCP. They connect the LLM to the real world, giving it the ability to pull in information from outside and to take actions.
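
To see the abstraction without any SDK machinery, here is a plain-Python sketch (not the official MCP SDK) of how a server might dispatch a tools/call-style request to a local function; the tool name and weather data are invented for illustration.

```python
def get_weather(city: str) -> dict:
    # Pretend external lookup; in reality this would call a weather API.
    return {"city": city, "temperature_c": 21, "condition": "cloudy"}

TOOLS = {"get_weather": get_weather}

def handle_tool_call(request: dict) -> dict:
    # Dispatch a tools/call-style JSON-RPC request to the matching function.
    name = request["params"]["name"]
    arguments = request["params"].get("arguments", {})
    result = TOOLS[name](**arguments)
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

call = {
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Pune"}},
}
print(handle_tool_call(call))
```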



Tool matching 

You might wonder what happens before tool invocation: how does the LLM select which tool to use? This can be very specific to the underlying LLM, but it will use some algorithm based on a few core ideas.
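
As an illustration only, and not what any particular model actually does, one plausible strategy is to rank the tool descriptions against the user's request and surface the top candidate to the model:

```python
def score(query: str, description: str) -> float:
    # Crude lexical Jaccard overlap; a real system would use embeddings,
    # or simply let the model read all tool descriptions in its context.
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q | d)

tools = {
    "get_weather": "Get the current weather for a city",
    "create_ticket": "Create a support ticket in the issue tracker",
    "query_orders": "Look up a customer's recent orders",
}

query = "what is the weather in Pune today"
ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
print(ranked[0])   # -> "get_weather" surfaces as the candidate tool
```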


Conclusion

Model Context Protocol (MCP) has captured widespread attention, highlighted by Google's recent Agent-2-Agent protocol release. The buzz around this is palpable, with LLM tools and companies making significant investments, anticipating it as the next major leap in Generative AI with the potential to unlock numerous use cases for working with Large Language Models.

While MCP undoubtedly solves an important integration challenge for LLMs, the fundamental question remains: what capabilities will these MCP servers or other implementations actually expose in terms of manipulating and enriching the interaction with LLMs? If these capabilities address only trivial or low-impact problems, our focus should arguably be on leveraging these transformative technologies to build truly innovative ("zero to one") capabilities that fundamentally change how we work with LLMs, rather than simply creating new interfaces for existing ones. 

So, the answer to my question about "old wine in a new bottle" is yes. The bottle is indeed shiny, creating a strong desire for it, much like the latest tech gadget.

Thursday, 3 April 2025

Teaching LLMs to Reason: The Journey from Basic Prompting to Self-Generated Examples

In recent years, Large Language Models (LLMs) have made remarkable strides in their ability to reason—to break down complex problems, apply logic systematically, and arrive at well-justified conclusions. This post explores the fascinating evolution of reasoning mechanisms in LLMs, tracking the progression from basic pattern-matching to sophisticated reasoning techniques that approach human-like problem-solving abilities.




The evolution of reasoning in Large Language Models from pattern matching to advanced reasoning techniques

The Major Breakthroughs in LLM Reasoning

Date | Research | Key Innovation | Impact
---- | -------- | -------------- | ------
Jan 2022 | Chain-of-Thought Prompting (Wei et al.) | Breaking problems into explicit steps | Doubled performance on complex reasoning tasks
March 2022 | Self-Consistency (Wang et al.) | Multiple reasoning paths with majority voting | +10-18% improvement across reasoning tasks
Nov 2022 | LLMs as Prompt Engineers (Zhou et al.) | Models generating and optimizing their own prompts | Outperformed human-crafted prompts
March 2024 | Analogical Reasoning (ICLR 2024) | Self-generated examples for new problems | Eliminated need for human-created examples




Reasoning Challenge in LLMs

Early LLMs excelled at pattern recognition but struggled with multi-step reasoning. When faced with complex problems requiring logical deduction or mathematical calculation, these models would often:

  • Jump directly to incorrect conclusions
  • Fail to break down problems into manageable steps
  • Show inconsistent reasoning abilities
  • Struggle with problems requiring more than one or two logical steps

Gap between pattern matching in traditional LLMs and the requirements of multi-step reasoning tasks


This limitation wasn't surprising. Traditional training objectives didn't explicitly reward step-by-step reasoning—they simply encouraged models to predict the next token
based on patterns in their training data.

Chain-of-Thought: The Breakthrough

The introduction of Chain-of-Thought (CoT) prompting by Wei et al. in 2022 marked a pivotal moment in LLM reasoning capabilities.

This technique demonstrated that large language models could perform complex reasoning when prompted to show their work.

How Chain-of-Thought Works

CoT prompting exists in two primary forms:

Few-Shot CoT: Providing explicit examples that include intermediate
reasoning steps

Zero-Shot CoT: Simply instructing the model to "think step by step"
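
A minimal sketch of the zero-shot variant, where `call_llm` is a placeholder for whichever model API you use:

```python
def build_zero_shot_cot(question: str) -> str:
    # The only change from a plain prompt is the trailing instruction.
    return f"{question}\nLet's think step by step."

question = (
    "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?"
)
prompt = build_zero_shot_cot(question)
# answer = call_llm(prompt)   # placeholder: send the prompt to your model
print(prompt)
```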

Key Findings About Chain-of-Thought

The research on Chain-of-Thought revealed several important insights:

Reasoning as an Emergent Ability
CoT reasoning is an emergent capability that appears only in sufficiently large models (typically ~100B+ parameters).

Dramatic Performance Improvements
On complex reasoning tasks like GSM8K (math word problems), performance more than doubled for large models using CoT prompting.

No Fine-tuning Required
This capability was achieved through prompting alone, without model modifications.

Enabling Multi-step Problem Solving
CoT allows models to break complex problems into manageable chunks.


Self-Consistency: Enhancing Chain-of-Thought

While CoT represented a breakthrough, it still had limitations. The follow-up research by Wang et al. (2022) on "Self-Consistency" addressed a
critical weakness: reliance on a single reasoning path.

The Self-Consistency Approach

Rather than generating a single chain of thought, Self-Consistency does the following (a code sketch follows the list):
  1. Samples multiple diverse reasoning paths for the same problem
  2. Lets each path reach its own conclusion
  3. Takes the most consistent answer across all paths as the final answer
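
A minimal sketch of that sampling-and-voting loop; `call_llm` is a placeholder for your model API, and the answer extraction assumes the chain ends with "Answer: <value>":

```python
import re
from collections import Counter

def self_consistency(question: str, call_llm, n_samples: int = 10) -> str:
    # Sample several chains of thought at non-zero temperature, then vote.
    prompt = f"{question}\nLet's think step by step. End with 'Answer: <value>'."
    answers = []
    for _ in range(n_samples):
        completion = call_llm(prompt, temperature=0.8)  # diverse reasoning paths
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # The most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```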




This approach mimics how humans gain confidence in solutions—when multiple different
approaches lead to the same answer, we trust that result more.


LLMs as Analogical Reasoners

The next evolution in LLM reasoning came from understanding these models as analogical reasoners, introduced in research presented at ICLR 2024.
This approach mirrors how humans tackle unfamiliar problems—by recalling similar challenges we've solved before.

The Analogical Prompting Method

Analogical prompting instructs LLMs to do the following (a prompt sketch follows the list):

  1. Self-generate relevant examples related to the current problem
  2. Generate high-level conceptual knowledge about the problem domain
  3. Apply this knowledge to solve the original problem
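
A sketch of what such a prompt can look like; the wording paraphrases the idea rather than quoting the paper's exact template:

```python
def build_analogical_prompt(problem: str) -> str:
    return (
        f"Problem: {problem}\n\n"
        "Instructions:\n"
        "1. Recall three relevant and distinct problems; for each, describe it "
        "and explain its solution.\n"
        "2. Summarize any high-level knowledge or techniques that apply here.\n"
        "3. Using those examples and techniques, solve the original problem "
        "step by step.\n"
    )

print(build_analogical_prompt(
    "What is the area of a square whose diagonal is 10 units long?"
))
```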



Key Advantages of Self-Generated Examples

This approach offers several benefits:

No manual labeling needed: Unlike few-shot CoT, no human needs to create examples

Problem-specific relevance: The examples are tailored to each specific problem type

Adaptability across domains: The technique works across mathematics, coding, and other domains

Implementation simplicity: Everything happens in a single prompt


From Reasoning to Meta-Reasoning: LLMs as Prompt Engineers

The most fascinating development is the discovery that LLMs can function as their own prompt engineers. Research by Zhou et al. on "Automatic Prompt Engineering" (APE)
demonstrates that LLMs can generate and optimize instructions for other LLMs to follow.




This creates a meta-reasoning capability where (a code sketch follows the list):

  1. One LLM generates candidate instructions based on examples
  2. These instructions are tested on their effectiveness
  3. The best-performing instructions are selected
  4. The process iterates toward optimal prompting strategies
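
A compact sketch of that loop, with `call_llm` as a placeholder, a small labelled dev set assumed, and plain exact-match accuracy as the score:

```python
def ape_search(dev_set, call_llm, n_candidates: int = 8) -> str:
    # One model proposes candidate instructions from input/output pairs...
    meta_prompt = (
        "I gave a friend an instruction. Based on these input/output pairs, "
        "what was the instruction?\n"
        + "\n".join(f"Input: {x}\nOutput: {y}" for x, y in dev_set[:5])
    )
    candidates = [call_llm(meta_prompt, temperature=1.0) for _ in range(n_candidates)]

    # ...each candidate is scored on the dev set, and the best one survives.
    def accuracy(instruction: str) -> float:
        hits = sum(
            call_llm(f"{instruction}\n\nInput: {x}\nOutput:").strip() == str(y)
            for x, y in dev_set
        )
        return hits / len(dev_set)

    return max(candidates, key=accuracy)
```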

The Evolution of Reasoning Prompts

Through this research, we've seen a remarkable progression in the prompts used to elicit reasoning:

Basic CoT: Let's think step by step

Refined CoT: Let's work this out in a step by step way to be sure we have the right answer

Analogical CoT: Recall three relevant problems and their solutions followed by problem-solving

APE-generated prompts: Complex, automatically optimized instructions

Implications for AI Development

These advances in LLM reasoning have profound implications:

Emergent Capabilities: Reasoning appears to emerge at certain model scales, suggesting other cognitive abilities might similarly emerge with scale.

Human-Like Problem Solving: The success of analogical reasoning and self-consistency suggests LLMs might be modeling aspects of human cognition more
closely than previously thought.

Reduced Need for Fine-Tuning: Many reasoning improvements come from better prompting rather than model modifications, potentially reducing the computational
costs of improvement.

Meta-Learning Potential: LLMs' ability to generate effective prompts for themselves hints at meta-learning capabilities that could lead to more autonomous
AI systems.

Conclusion

The evolution of reasoning in LLMs—from simple pattern matching to chain-of-thought to analogical reasoning and beyond—represents one of the most exciting trajectories
in AI research. These advances have not only improved performance on benchmark tasks but have
also deepened our understanding of how these models function.

As research continues, we can expect further refinements in how we elicit reasoning from LLMs, potentially unlocking even more sophisticated
problem-solving capabilities.

The boundary between pattern recognition and true reasoning continues to blur, bringing us closer to AI systems that can tackle the full spectrum of human reasoning tasks.

What's particularly exciting is that many of these techniques are accessible to practitioners today through careful prompt engineering, making advanced reasoning capabilities
available without requiring specialized model training or massive computational resources.

Welcome to inference-time compute, a new market that is being created. This should give you an idea of the DeepSeek moment :-)

Sunday, 7 July 2024

Top large language models to watch

The LLM landscape is exploding! With the immense potential of large language models, competition is fierce as companies race to develop the most powerful and innovative models. Training these models presents a lucrative business opportunity, attracting major players and startups alike.

Keeping track of the leaders is challenging. The LLM space is highly competitive, making it difficult to identify a single frontrunner. New versions are released constantly, pushing the boundaries of what's possible. While some might see this as a race to the bottom, it's more accurate to view it as rapid innovation that will ultimately benefit everyone.


Top companies as of July 2024





The diagram above has two groups: one for commercial models and one for hybrid (commercial/open-weights) models.

Commercial

OpenAI

This is the poster child of LLMs, with its series of GPT* models. It was the first provider of consumer LLMs at large scale.



GPT-4o is the flagship model, and all models are available via API. OpenAI is very well funded, with Microsoft behind it.

More details about the models can be found at Open AI Model 

The research paper about the GPT-4 model is available at 

GPT-4 Technical Report 

 GPT 1.0

GPT 2.0

Language Models are Few-Shot Learners

Evaluating Large Language Models Trained on Code

Amazon

Amazon has a family of models called "Titan". The Amazon Titan family incorporates Amazon's 25 years of experience innovating with AI and machine learning across its business. Amazon Titan foundation models (FMs) provide customers with a breadth of high-performing image, multimodal, and text model choices, via a fully managed API.


More details about the models can be found at Amazon Models

No research papers are available about Amazon's LLM model details. It is all kept proprietary to maintain a competitive edge.


Anthropic

Anthropic was co-founded by former OpenAI employees. 


Anthropic's latest offering, Claude 3.5 Sonnet, has generated significant buzz. This powerful language model builds upon their previous success with Claude 3 Opus and is claimed to outperform OpenAI's GPT-4o, particularly in coding tasks.
Anthropic is also very well funded; Amazon and Google are major investors.

More details about the models can be found at Anthropic Models

Anthropic models are based on an OpenAI-style architecture, but the company is focused on a few research principles such as AI as a systematic science, safety, and scaling.

One of the popular research papers from Anthropic is mapping-mind-language-model

MosaicML

MosaicML, co-founded by an MIT alumnus and a professor, made deep-learning models faster and more efficient. It was acquired by Databricks. 

Mosaic Pretrained Transformers (MPT) are GPT-style models with some special features -- Flash Attention for efficiency, ALiBi for context length extrapolation, and stability improvements to mitigate loss spikes.


More details about the models can be found at mosaic ml

Some popular research papers are Train Short, Test Long and Flash attention


InflectionAI

Inflection AI focuses on developing a large language model (LLM) for personal use called Inflection.



Not many details are available about how the model was trained, but they claim it is the world's most empathetic Large Language Model (LLM).

More details about the model can be found at inflection-2-5


Hybrid/Open Source

Google

Google is the inventor of the famous paper Attention Is All You Need, which became the kernel of all the LLMs we see today. 
Google had been releasing LLMs to the community before ChatGPT arrived; BERT was one of the first models based on the Transformer encoder and became the foundation for many LLMs that we see.







Google offers large language models (LLMs) across a spectrum of availability: some models are fully commercial and accessible only through APIs, while others are released with open weights.

The Gemini family exemplifies this, with variants like Ultra, Pro (introduced in v1.5), Flash, and Nano catering to different needs in terms of size and processing power.

In contrast, Gemma is Google's open-source LLM family. It's designed for developers and researchers and comes in various sizes (e.g., Gemma 2B and 7B) for flexibility


Lots of reading material is available from Google on its LLMs and the Gemma models; some of the popular ones are 


Meta

Meta builds the Llama series of models. These are open source, and Meta designed Llama to be efficient, achieving good performance while being trained on publicly available datasets.



Llama 3 is the most recent and state of the art. These models are trained by Meta and made available via various hosting platforms. Llama 3 is extended by other vendors like Gradient, Nvidia, Dolphin, etc.

Details about the model are available at llama3

Meta has published lots of papers since the first version of the model; some of the popular ones are 




Mistral

Mistral is a French company, and they release their model weights under Apache 2.0.
Mistral strives to create efficient models that require less computational power compared to some competitors. This makes them more accessible to a wider range of users.

Mistral's innovation is around Grouped Query Attention (GQA). Some of the recent models are based on a Mixture of Experts.




More details about the models are available at Mistral models



DataBricks

Databricks is building open-source models that are based on MoE. The most recent and state-of-the-art model is DBRX.





Details about the model are available at introducing-dbrx-new-state-art-open-llm


Some of the popular research papers are 

Cohere

Cohere is a Canadian company. They build a model called Command R, a state-of-the-art RAG-optimized model designed to tackle enterprise-grade workloads.



More details about the model can be found at Command-R

Some of the popular research papers are RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

 

Microsoft

While Microsoft leverages OpenAI's powerful GPT-4 language models for some functionalities, they've also made significant contributions to open-source AI with the Phi-3 family of models.

Phi-3 models are a type of small language model (SLM), specifically designed for efficiency and performance on mobile devices and other resource-constrained environments.



More details about the model can be found at phi-3

Some of the popular research papers related to the Phi series of models are Textbooks Are All You Need, Textbooks Are All You Need II, and the Phi-3 Technical Report


Conclusion

We are witnessing an interesting time where many large language models (LLMs) are available for building apps, accessible to both consumers and developers. Predicting the dominant player is difficult due to the rapidly changing landscape.

One key concept to grasp is that the GENAI stack is multifaceted. Foundation models are just one layer, and they can be quite expensive due to hardware requirements. Training a foundation model can easily cost millions of dollars, making it difficult for companies to maintain a competitive edge.

As software engineers, we need to leverage this technology by selecting the best model for each specific use case. Defining "best" can be subjective, and the answer often depends on various factors.

Here's a crucial consideration: while using the top-performing LLM might be tempting, it's vital to maintain a flexible architecture. This allows you to easily switch to newer LLMs, similar to how we switch between databases or other vendor-specific technologies.
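
One minimal way to keep that flexibility, sketched below with invented class and function names, is to code against a small interface and hide each vendor's SDK behind an adapter:

```python
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIClient:
    def complete(self, prompt: str) -> str:
        ...  # wrap the OpenAI SDK call here

class AnthropicClient:
    def complete(self, prompt: str) -> str:
        ...  # wrap the Anthropic SDK call here

def summarize(doc: str, llm: LLMClient) -> str:
    # Application code depends only on the interface, so swapping vendors
    # is a one-line change where the client is constructed.
    return llm.complete(f"Summarize the following document:\n{doc}")
```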

In the next part of this blog, I'll explore the inference side of LLMs, a fascinating area that will ultimately determine the return on investment (ROI) for companies making significant investments in this technology.