
Conventional wisdom dictates that AI should be making life easier, and agentic AI should be making it easier still. However, some users are finding that experimenting with the latest LLMs only highlights the elegance and simplicity of earlier models.
That, at least, is the experience of Ian Makgill, founder of the global tender search site Open Opportunities. The firm, which Makgill set up as a central hub for businesses to explore government and public sector procurement opportunities worldwide, churns through approximately 10,000 documents a day from local and national administrations.
What’s more, those documents aren’t always designed to be easily digestible. Language is, naturally, an issue, but many are also inconsistent in their terminology, resisting machine analysis and the imposition of search protocols.
“They’re not trying to be seen,” says Makgill. “They’re just putting stuff out to tick a box.” Making this material more accessible is, he says, “traditional, difficult, fiddly work.”
This all offers plenty of opportunities for his firm to add value for its users. And it seems obvious that AI could help.
In fact, it’s been obvious for years. Makgill and his team were early adopters of machine learning. Back in 2019, they began using Google’s groundbreaking BERT language model to classify documents. This early LLM was awe-inspiring when it launched in 2018, its base model boasting over 100 million parameters. And, as Makgill explains, BERT was optimized for the sort of tasks his firm carries out. Open Opportunities was able to fine-tune it further, because “it was an open source model, and it was small enough to put into our infrastructure.”
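To give a flavour of that workflow, the sketch below shows the kind of fine-tuning Makgill describes, using the open source Hugging Face transformers library. The bert-base-uncased checkpoint, the category labels, and the toy training rows are illustrative assumptions, not Open Opportunities’ actual setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical tender categories and toy rows standing in for
# Open Opportunities' real labelled corpus.
LABELS = ["construction", "it-services", "consultancy"]
rows = {
    "text": [
        "Invitation to tender: resurfacing of district roads",
        "Request for proposals: managed IT support services",
        "Framework agreement for management consultancy",
    ],
    "label": [0, 1, 2],
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# Tokenize each document, truncating to BERT's 512-token context window.
dataset = Dataset.from_dict(rows).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-tenders", num_train_epochs=3),
    train_dataset=dataset,
)
trainer.train()
```

Because the whole model fits comfortably on commodity hardware, this kind of task-specific tuning is exactly what “small enough to put into our infrastructure” makes possible.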
But, of course, successive models have since raised the game, or at least expectations of what can be achieved with AI. So, for the last 18 months, Makgill and his team have been experimenting with newer tools and models.
The problem is, “they behave strangely, and they come back with hallucinations. They come back with references that don’t exist in your database.”
One project used retrieval-augmented generation (RAG) to underpin a chatbot designed to translate natural language into queries for the Open Opportunities database. Getting it to work properly turned translating language into queries into such a convoluted process “that actually, you’d rather have filled in a form.”
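The pattern itself is simple to state, even if it proved hard to make reliable. Below is a minimal sketch of the natural-language-to-query step only, with a hypothetical tender schema and a generic `llm` callable standing in for whichever model the team used; the guardrail at the end rejects the kind of invented references Makgill describes.

```python
import json

# Hypothetical schema for a tender database.
SCHEMA = "tenders(id, title, buyer, country, cpv_code, value_eur, deadline)"

PROMPT = """You translate user questions into JSON query filters for a
tender database with this schema:
{schema}
Return ONLY JSON with keys: filters (list of field/op/value), limit.
Question: {question}"""

ALLOWED_FIELDS = {"title", "buyer", "country", "cpv_code",
                  "value_eur", "deadline"}

def natural_language_to_query(question: str, llm) -> dict:
    # `llm` is any callable taking a prompt string and returning text;
    # in production it would wrap the chosen model's API.
    raw = llm(PROMPT.format(schema=SCHEMA, question=question))
    query = json.loads(raw)
    # Guardrail: reject fields the model hallucinated into existence.
    for f in query.get("filters", []):
        if f["field"] not in ALLOWED_FIELDS:
            raise ValueError(f"Model invented field: {f['field']}")
    return query
```

Every guardrail like this adds another layer of translation and validation, which is how the pipeline ends up more convoluted than the form it was meant to replace.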

Part of the problem, says Makgill, was “the intent of the model, the ambition of the model, which is always to serve the user.” That eagerness seemed to come at the expense of accuracy around counting, or resulted in references that didn’t actually exist in the database. As Makgill puts it, “if you’re doing data analysis and you can’t count, you’ve got a problem.”
What does the model want to do?
Another project sought to use GenAI to build scripts to collect data, in effect an agentic AI-style application. “We thought, okay, we’ll give you a list of sources,” recalls Makgill. “You can go out and write bots that collate those sources, provided we give you the right inputs.”
However, it turned out the model could only be instructed to write one script at a time. That was “OK,” says Makgill, but testing and evaluating the code meant “we really weren’t saving a huge amount of time.”
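In outline, that one-script-at-a-time loop looks something like the sketch below. The source URLs, the prompt, and the stub model call are all hypothetical, and as Makgill notes, the real work came after generation.

```python
def stub_llm(prompt: str) -> str:
    # Stand-in for a real model API call.
    return "# generated scraper code would appear here\n"

# Placeholder sources, not Open Opportunities' actual list.
SOURCES = [
    "https://example.gov/tenders",
    "https://example.org/contracts",
]

def generate_scraper(source_url: str, llm) -> str:
    prompt = (
        "Write a Python script that downloads tender notices from "
        f"{source_url} and emits one JSON object per notice."
    )
    return llm(prompt)

for url in SOURCES:
    script = generate_scraper(url, llm=stub_llm)
    # Each script still has to be saved, run, and debugged by hand,
    # which is where the promised time savings eroded.
    print(script)
```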
Where the AI did prove useful was in debugging. “Using it as a pair programming example meant that we were spending less time on fiddly problem-solving challenges. But you couldn’t give it a whole task.”
Another effort to use LLMs such as Claude and ChatGPT for charting also came to nought.
“One of the first problems we found is that we would get different charts for the same data,” he says. Another problem was inconsistent formatting, for example in axis labelling or scales.
The team looked to solve this by defining a template and then telling the LLM what data to pour into it to dynamically create new charts. “That was kind of fine, except that we found that it was trying to rewrite the code,” recalls Makgill. “Some of the nuances that you want that a human brings to a data visualization were just getting lost.”
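The shape of that fix is worth spelling out: keep every formatting decision inside a fixed rendering function, and ask the model only for a data payload. Here is a sketch, assuming matplotlib and hypothetical field names.

```python
import matplotlib.pyplot as plt

def render_tender_chart(title: str, labels: list[str],
                        values: list[float]) -> None:
    # The template owns every formatting decision (axis labels, scale,
    # layout), so the model has nothing to rewrite.
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel("Contract value (EUR m)")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    fig.savefig("chart.png")

# The LLM is asked only for the data payload, e.g.:
payload = {"title": "Top buyers, Q1",
           "labels": ["NHS", "MoD"],
           "values": [120.5, 98.2]}
render_tender_chart(**payload)
```

The trouble Makgill describes is that the model would not stay inside that contract: instead of just supplying the payload, it kept trying to rewrite the rendering code itself.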
Not all of Open Opportunities’ LLM experiments failed. “We’re introducing dynamic translation, which is really successful,” says Makgill. “We’re doing more enhancements of the data.”
But overall, he says, to get reliable outputs “the tasks that we give to AI have to be incredibly simple.”
Time will, perhaps, solve some of these problems. Michael Azoff, chief analyst at Omdia’s Cloud and Data Center Research Practice, says the evidence suggests that LLMs are continually improving. This might not get us to AGI any time soon, says Azoff, but “it will be highly disruptive and useful in the enterprise application space.”
The problem of hallucinations is best reduced through context awareness and savvy prompt engineering, he adds.
Azoff says that it is difficult to get hard data on agentic AI systems. “But,” he says, “my impression is that a lot of agentic AI is still in R&D.”
For Makgill, while LLMs may ultimately deliver on their promise, “no one knows because no one understands what it’s doing under the hood.” As far as he is concerned, BERT still reigns supreme: for Open Opportunities’ purposes at least, it’s “still better than everything that has been created since.”