More than half of all ChatGPT responses to questions about software engineering are wrong, according to a new study by Purdue University. The researchers also found that users preferred answers to programming questions generated by ChatGPT over those posted by human users on Stack Overflow 39% of the time, despite the errors the AI-generated answers contained. One expert told Tech Monitor that programmers who continue to rely on ChatGPT to solve their coding dilemmas are putting their professional reputations at risk.
OpenAI launched its chatbot in November 2022, initially based on the GPT-3.5 large language model. It has since added a premium version with access to GPT-4, code interpretation and third-party plugins. OpenAI’s models also power GitHub Copilot, the Microsoft-owned coding assistant that is widely used by developers.
The Purdue study is the first to comprehensively examine the characteristics and usability of ChatGPT’s answers to the kinds of programming questions regularly shared online. The team had the chatbot respond to 517 questions previously posted to Stack Overflow, each of which already had a correct answer from a human user of the platform.
Stack Overflow banned responses generated by the AI in December 2022, shortly after the chatbot’s launch, as its popularity rapidly increased. At the time, it described answers by ChatGPT as superficially good but consistently incorrect. “The posting of answers created by ChatGPT and other generative AI technologies is substantially harmful to the site and to users who are asking questions and looking for correct answers,” a Stack Overflow spokesperson explained.
OpenAI has made incremental improvements to the platform and its underlying models since the first release, most notably with GPT-4, but its answers are still not reliably accurate. Stack Overflow has also since embraced AI, although as a way to categorise its content rather than to answer questions.
The new study found that in around half of the incorrect answers, ChatGPT had failed to grasp the concepts behind the question. “Even when it can understand the question, it fails to show an understanding of how to solve the problem,” the authors wrote. “[It] often focuses on the wrong part of the question or gives high-level solutions without fully understanding the minute details of a problem.” The researchers also found that its limited reasoning capacity led it to generate solutions, code and formulas without thinking through the consequences.
Users prefer ChatGPT answers on software
OpenAI has since added a code interpreter to ChatGPT, which allows the AI to run the code it creates in a sandbox to check for errors and assess the quality of its output. This, in turn, allows it to verify the final response, make changes and present a more accurate solution. However, this feature remains in beta, and is only available to ChatGPT Plus subscribers.
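OpenAI has not published how the code interpreter works internally, but the generate-run-revise loop the feature implies can be sketched in outline. In the minimal Python sketch below, `generate_code` is a hypothetical placeholder for a model API call, and a subprocess with a timeout stands in for the isolated execution environment the real feature uses:

```python
import subprocess
import sys
import tempfile
from typing import Callable


def run_in_sandbox(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run candidate code in a separate interpreter process and capture its output.

    A subprocess with a timeout is only a crude stand-in for the real feature,
    which executes code in an isolated, network-restricted environment.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )


def answer_with_verification(
    prompt: str,
    generate_code: Callable[[str], str],  # hypothetical model call, not a real API
    max_attempts: int = 3,
) -> str:
    """Generate code, execute it, and feed any error back to the model to revise."""
    code = generate_code(prompt)
    for _ in range(max_attempts):
        result = run_in_sandbox(code)
        if result.returncode == 0:
            return code  # the code ran cleanly, so present it as the answer
        # Otherwise, ask the model to revise in light of the runtime error.
        code = generate_code(
            f"{prompt}\n\nThe previous attempt failed with:\n{result.stderr}"
        )
    return code  # return the last attempt even if it never ran cleanly
```

The design point is that the model’s output is treated as untrusted until it has been executed: a clean run is evidence, though not proof, of correctness, while a runtime error becomes new context for the next generation attempt.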
Despite its obvious drawbacks, and the fact that 77% of its responses are wordier than those from human contributors, many users still rely on ChatGPT to answer their burning questions about code. “ChatGPT answers are still preferred 39.34% of the time due to their comprehensiveness and well-articulated language style,” the authors wrote. “Our result implies the necessity of close examination and rectification of errors in ChatGPT, [while] at the same time creating awareness among its users of the risks associated with seemingly correct ChatGPT answers.”
Owen Morris, director of enterprise architecture at IT intelligence company Doherty Associates, told Tech Monitor that AI offers numerous benefits but also disadvantages that users should weigh before turning to platforms like ChatGPT. “Tools like ChatGPT offer insights based on the data on which they’re trained (including crawls of the internet and other sources) and will retain their biases, so human involvement remains essential for accuracy and value addition,” says Morris. “It’s important to remember to make use of your team so that they can contribute their own domain-specific knowledge and data to enhance the models’ applicability.”
He warned that without human oversight to contextualise and critically evaluate the responses generated by ChatGPT about software, “there’s a considerable risk that you’ll incorporate incorrect or harmful information into your work, jeopardising its quality and, more widely, your professional reputation.”