WHO urges caution over generative AI in healthcare

The World Health Organisation (WHO) has issued a stark warning against the use of artificial intelligence, particularly large language models, without taking appropriate precautions over bias and misdiagnosis risks. The warning comes as large tech companies tout the benefits to healthcare from their large language AI models. Google has published a version of its PaLM 2 model specifically for healthcare and OpenAI says GPT-4, the model behind ChatGPT, passed a string of medical exams.

Companies are exploring ways to embed AI in healthcare to help with diagnosis and treatment. (Photo: Zapp2Photo/Shutterstock)

Users are turning to AI to help in their diagnosis before visiting a doctor, and there are limited trials of using generative AI tools in therapy, the WHO says. But the problem is that if the data used to train the model lacks diversity, it could result in misdiagnosis or bias against certain groups. It could also lead to misuse if used by people not trained to understand the results.

The WHO says it is enthusiastic about the potential large language models hold in supporting healthcare professionals, patients, researchers and scientists, particularly in improving access to health information, as a decision-support tool and in enhancing diagnostic capacity. But it warns that the risks “must be examined carefully.”

“There is concern that caution that would normally be exercised for any new technology is not being exercised consistently with LLMs,” the organisation warned. “This includes widespread adherence to key values of transparency, inclusion, public engagement, expert supervision, and rigorous evaluation.”

The speed of adoption is one of the key risks highlighted by the WHO. OpenAI’s ChatGPT was released in November last year and within four months became one of the fastest growing consumer applications in history. It has sparked a revolution in the tech industry with vendors rushing to incorporate generative AI tools into their software.

Google released a version of its new PaLM 2 large language model, known as Med-PaLM 2, in April. The company said: “Industry-tailored LLMs, like Med-PaLM 2, are part of a burgeoning family of generative AI technologies that have the potential to significantly enhance healthcare experiences.”

Microsoft, a major investor in OpenAI, had its research division put GPT-4 to the test against a series of US medical exams. “Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score by over 20 points and outperforms earlier general-purpose models as well as models specifically fine-tuned on medical knowledge,” the researchers wrote.

This suggests, according to Microsoft’s researchers, that there are “potential uses of GPT-4 in medical education, assessment, and clinical practice”, though they added that any such use would have to come “with appropriate attention to challenges of accuracy and safety.”

LLMs for healthcare need rigorous testing and review

But the WHO said: “Precipitous adoption of untested systems could lead to errors by health-care workers, cause harm to patients, erode trust in AI and thereby undermine (or delay) the potential long-term benefits and uses of such technologies around the world.”

Its main concerns are around the data used to train the model, particularly the risk that it could be biased and generate misleading or inaccurate information. This could pose risks to health, equity and inclusiveness in healthcare, the organisation warned. It was also concerned that LLMs often produce “hallucinations”, or inaccurate information that sounds legitimate to someone unfamiliar with the subject matter.

Other concerns include the use of training data not gathered with appropriate consent, particularly sensitive health information, and the potential to generate convincing disinformation that could appear to be reliable health content. The WHO recommends policy-makers put patient safety and protections at the heart of any legislation on the use of LLMs. It also proposes that clear evidence be presented and measured before LLMs are approved for widespread use in routine health care and medicine.

A study of the ethics of LLMs in medicine by Hanzhou Li of Emory University School of Medicine found that the technology’s use, regardless of the model or approach, raises “crucial ethical issues” around trust, bias, authorship, equitability and privacy. “Although it is undeniable that this technology has the power to revolutionise medicine and medical research, being mindful of its potential consequences is essential,” Li wrote.

Writing in a paper published last month in medical journal The Lancet, Li said: “An outright ban on the use of this technology would be short-sighted. Instead, establishing guidelines that aim to responsibly and effectively use LLMs is crucial.”

Regulatory oversight likely for AI in healthcare

The final evaluation of whether AI in healthcare is safe is likely to come down to regulators. Where there is a high risk, or if something is classed as a medical device, it would normally need to go through a series of trials before it could be used in diagnosis or any aspect of healthcare.

In the UK, the Medicines and Healthcare products Regulatory Agency (MHRA) published a blog post in March on the potential of these models, and chatbots like Bard or ChatGPT, as medical tools. It found that a general-purpose chatbot not intended for use in diagnosis is unlikely to be a medical device. “However, LLMs that are developed for, or adapted, modified or directed toward specifically medical purposes are likely to qualify as medical devices,” wrote Johan Ordish, head of software and AI at the MHRA.

The guidance goes further than this, though: even if a model hasn’t been specifically designed or adapted for medical use, where a developer simply makes the claim that it “could” be used for medical purposes, it would likely qualify as a medical device. It isn’t clear if OpenAI’s boast that GPT-4 passed medical exams comes under this rule or whether the claim about medical ability would need to be more explicit.

Ordish wrote that the regulation of LLMs, particularly in the medical field, would be complex, in part due to difficulties in documentation. But they would not be exempt from the medical-device requirements if it was found they were being used for healthcare and promoted as usable for it.

“The MHRA remains open-minded about how best to assure LLMs but any medical device must have evidence that it is safe under normal conditions of use and performs as intended, as well as comply with other applicable requirements of medical device regulation,” he explained. “We are committed to working collaboratively with all our stakeholders to find solutions where possible, and to keep communicating our regulatory updates to developers.”