Large language models seem to become more varied, complex and accessible by the day. Meta’s recent Llama 3.2 release came just as options like Microsoft’s Phi 3 and Google’s Gemma 2 began to get into their stride. The opportunities, businesses are frequently told, are limitless.

There’s some truth to this claim. It’s no exaggeration to say that AI has the potential to transform almost every industry it touches. Many companies, however, find themselves paralysed by the sheer breadth of options available to them. How can enterprises possibly know which models are best suited for their needs?

The pitfalls of traditional metrics 

Most readily available metrics are inadequate for business purposes. At best, they're often irrelevant and, at worst, misleading. This isn't usually deliberate – many standard evaluations are conducted with great scientific rigour. The problem is that many metrics simply aren't testing for businesses' specific use cases.

A common measurement is a model's BLEU (Bilingual Evaluation Understudy) score. Initially designed for machine translation, BLEU has been applied far more widely in recent years. It judges a model's output by how closely it matches reference material. This can be useful, but counterproductive in certain contexts. An enterprise aiming to develop a lifelike chatbot, for example, requires a degree of creativity and flexibility from its automated advisor; strict adherence to reference texts, therefore, is not conducive to producing human-like interactions.
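To illustrate the mismatch, here is a minimal sketch using the open-source sacrebleu package; the replies below are invented examples rather than real chatbot output:

```python
# A rough sketch: BLEU rewards word-for-word overlap with a reference reply,
# so a natural but differently worded answer scores poorly.
import sacrebleu  # pip install sacrebleu

reference = ["I'm sorry to hear that. Let me check the status of your order right away."]

# A stilted reply that copies the reference almost verbatim.
literal_reply = "I'm sorry to hear that. Let me check the status of your order right away."
# A friendlier, equally valid reply phrased differently.
natural_reply = "That sounds frustrating - give me a moment and I'll look up your order for you."

print(sacrebleu.sentence_bleu(literal_reply, reference).score)  # close to 100
print(sacrebleu.sentence_bleu(natural_reply, reference).score)  # far lower, despite being a good answer
```

A chatbot judged solely on the second number would look like a failure, even though most customers would prefer its answer.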

Perplexity is another widely used evaluation metric that, despite its popularity in academia, often falls short for businesses. It measures how well a model predicts sample text, but enterprises are usually more interested in how well models can understand and offer insights, often from highly specialised content.
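The calculation itself is simple: perplexity is the exponential of the average negative log-likelihood a model assigns to each token. A toy sketch, using made-up token probabilities rather than output from a real model:

```python
# Perplexity from per-token log-probabilities: lower means the model found
# the text less "surprising". The numbers here are illustrative only.
import math

token_log_probs = [-0.9, -1.4, -0.3, -2.1, -0.7]  # log p(token | preceding tokens)

avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(round(perplexity, 2))
```

Nothing in that number reflects whether the model can extract a useful insight from a contract or a lab report.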

As a result, many businesses aren't getting the best out of large language models. They may wonder why models that achieve outstanding BLEU or perplexity scores aren't performing as required. This is unsurprising – after all, use cases can vary hugely between businesses, even within the same industry. Additionally, models often need to understand complex concepts and niche jargon. Rather than relying on misleading metrics, enterprises should be equipped with the tools and skills to evaluate models based on what matters to them.

The data dilemma: synthetic or specialised?

Misleading metrics aren't the only challenge decision-makers face when selecting the right AI models. The majority of open-source large language models (LLMs) are trained primarily on synthetic data, usually generated by more advanced models such as GPT-4. This presents several hurdles to enterprises looking to deploy AI effectively.

One immediately obvious problem is the feedback loop this creates. Models trained on data generated by other LLMs run the risk of simply mimicking patterns from the original model. This is a major barrier to picking up an understanding of nuanced concepts.

Worse, this can be extremely difficult to spot right away. Evaluation metrics can show these models performing well when, in fact, they're simply aping the style and amplifying the biases of the model that generated the synthetic data. The result? Businesses may end up relying on models built on poor-quality data for extended periods without even realising it.

To be truly effective for most enterprises, LLMs need training on bespoke, domain-specific data, in addition to data from their industry as a whole. Models trained in this way will invariably perform far better on specialised tasks. This, however, is far from straightforward: it requires high-quality data and the technical expertise to use it effectively.

All operational elements – from infrastructure considerations to scalability – must be carefully considered before a company deploys an AI in the real world. (Image: Shutterstock)

RAG and context sensitivity evaluation

Context sensitivity evaluation within a retrieval-augmented generation (RAG) setting can help cut through these challenges and give businesses a better idea of how AI models will actually perform. RAG is a technique for improving the reliability of generative AI models by fetching facts from external sources and supplying them to the model as extra context; context sensitivity evaluation then judges how well a model follows that context.

The key is how well models incorporate the provided context into their outputs, and different models vary considerably in their ability to do so. Models that do well in synthetic environments may perform poorly in the face of complicated business communications.
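As a rough illustration, a context sensitivity check can be as simple as feeding each candidate model a retrieved passage that deliberately overrides general knowledge and measuring how often the answer follows the passage. The sketch below assumes a generate(prompt) callable wrapping whichever model is under test; the function name and test case are hypothetical:

```python
# A minimal sketch of context sensitivity evaluation in a RAG setting.
# `generate` is a hypothetical placeholder for whichever model is being assessed.
from typing import Callable

def context_sensitivity_score(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of answers grounded in the retrieved context rather than the model's prior knowledge."""
    hits = 0
    for case in cases:
        prompt = (
            "Answer using only the context below.\n"
            f"Context: {case['context']}\n"
            f"Question: {case['question']}"
        )
        answer = generate(prompt).lower()
        # Count the answer as grounded if it repeats the fact stated in the context.
        if case["expected_from_context"].lower() in answer:
            hits += 1
    return hits / len(cases)

# A test case where the retrieved document deliberately overrides general knowledge,
# so a context-sensitive model must follow the document, not its training data.
cases = [{
    "context": "Per the 2024 policy update, refunds are processed within 14 days.",
    "question": "How long do refunds take?",
    "expected_from_context": "14 days",
}]
```

Substring matching is a crude grounding check – in practice teams often use a second model as a judge – but even this level of testing exposes differences that BLEU or perplexity never will.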

Phi models, for example, occasionally deviate from given instructions – a benefit for some creative tasks, but unacceptable where strict regulation or safety considerations are paramount. Gemma models are considered reliable for a number of general applications but often stumble on tasks requiring deep knowledge and specialised skills. Llama models, on the other hand, perform well under context sensitivity evaluation and are notable for their ability to handle complex tasks requiring contextual knowledge and extended reasoning. For organisations in legal or medical fields, a thorough understanding of complex topics is essential, and applications like technical customer support also benefit from this kind of consistency and extended reasoning.

Towards an evaluation framework for large language models

How, then, can enterprises develop best practices for evaluating AI models? Traditional performance metrics may serve some business use cases well, but for many they are not sufficient. To judge accurately how effective a model might be, most businesses need a more comprehensive approach.

Models should be assessed against tailored scenarios that map directly to a business's particular needs. All operational elements – from infrastructure considerations to scalability – must be carefully weighed. For regulated industries handling sensitive customer and client information, this granular detail is even more vital, as choosing the wrong model can create data-protection risks.

Once a model is deployed, frequent monitoring ensures that any problems are spotted and dealt with as soon as possible. This is another way to go beyond initial benchmark scores, and it also sharpens a business's ability to create its own tests. Using real-world scenarios rather than standardised tests will almost always give enterprises a more accurate picture of how useful a model will be for their needs.
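That monitoring does not need to be elaborate. A minimal sketch, assuming the team records a pass/fail result each time the deployed model is checked against one of its real-world scenario tests (the class name and thresholds below are illustrative):

```python
# Tracks a rolling pass rate on business-specific scenario tests and flags drops.
from collections import deque

class EvalMonitor:
    def __init__(self, window: int = 50, alert_threshold: float = 0.9):
        self.results = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def needs_attention(self) -> bool:
        # Alert once the window is full and the rolling pass rate dips too low.
        return len(self.results) == self.results.maxlen and self.pass_rate() < self.alert_threshold

monitor = EvalMonitor(window=50, alert_threshold=0.9)
monitor.record(True)  # e.g. the model handled a real support scenario correctly
print(monitor.pass_rate())
```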

With so many models on offer, the AI landscape can feel overwhelming for businesses. However, it would be a mistake to simply pick the most convenient or well-known model based on traditional benchmarks. No single evaluation approach will suit every business. A systematic approach that harnesses domain-specific data and real-world scenarios, guided by clearly laid-out business requirements, can allow firms to go beyond the most widely used benchmarks – and finally unlock the potential of AI for their company.

Victor Botev is the CTO and co-founder of Iris.ai.
