as Judge P. Kevin Castel put it, ChatGPT produced a text filled with “bogus judicial decisions, with bogus quotes and bogus internal citations”. Similarly, when computer science researchers tested ChatGPT’s ability to assist in academic writing, they found that it was able to produce surprisingly comprehensive and sometimes even accurate text on biological subjects given the right prompts. But when asked to produce evidence for its claims, “it provided five references dating to the early 2000s. None of the provided paper titles existed, and all provided PubMed IDs (PMIDs) were of different unrelated papers” (Alkaissi and McFarland, 2023). These errors can “snowball”: when the language model is asked to provide evidence for or a deeper explanation of a false claim, it rarely checks itself; instead it confidently produces more false but normal-sounding claims (Zhang et al., 2023). The accuracy problem for LLMs and other generative AIs is often referred to as the problem of “AI hallucination”: the chatbot seems to be hallucinating sources and facts that don’t exist. These inaccuracies are referred to as “hallucinations” in both technical (OpenAI, 2023) and popular contexts (Weise & Metz, 2023).
These errors are pretty minor if the only point of a chatbot is to mimic human speech or communication. But the companies designing and using these bots have grander plans: chatbots could replace Google or Bing searches with a more user-friendly conversational interface (Shah & Bender, 2022; Zhu et al., 2023), or assist doctors or therapists in medical contexts (Lysandrou, 2023). In these cases, accuracy is important and the errors represent a serious problem.
One attempted solution is to hook the chatbot up to some sort of database, search engine, or computational program that can answer the questions that the LLM gets wrong (Zhu et al., 2023). Unfortunately, this doesn’t work very well either. For example, when ChatGPT is connected to Wolfram Alpha, a powerful piece of mathematical software, it improves moderately in answering simple mathematical questions. But it still regularly gets things wrong, especially for questions which require multi-stage thinking (Davis & Aaronson, 2023). And when connected to search engines or other databases, the models are still fairly likely to provide fake information unless they are given very specific instructions, and even then things aren’t perfect (Lysandrou, 2023). OpenAI has plans to rectify this by training the model to do step-by-step reasoning (Lightman et al., 2023), but this is quite resource-intensive, and there is reason to doubt that it will completely solve the problem; nor is it clear that the result will be a large language model, rather than some broader form of AI.
One reason solutions such as connecting the LLM to a database don’t work is that, if the models are trained on the database, then the words in the database affect the probability that the chatbot will add one or another word to the line of text it is generating. But this will only make it produce text similar to the text in the database; doing so will make it more likely that it reproduces the information in the database, but by no means ensures that it will.
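To make the point concrete, here is a minimal toy sketch, not the architecture of any actual system: the tokens and probabilities below are invented for illustration. Training on a database only raises the likelihood of database-like continuations; sampling from the resulting distribution can still produce text that contradicts the database.

```python
import random

# Toy illustration (not a real LLM): suppose the database says a drug was
# approved in 2019. Training merely shifts the next-token distribution
# toward database-like text; it does not store the fact itself.
next_token_probs = {
    "2019": 0.55,  # the continuation the database supports
    "2017": 0.25,  # plausible-looking but wrong
    "2021": 0.20,  # plausible-looking but wrong
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Sample a continuation in proportion to its estimated likelihood."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Each completion is drawn from the distribution, so the wrong years
# still appear a substantial fraction of the time.
print([sample_next_token(next_token_probs) for _ in range(10)])
```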
On the other hand, the LLM can also be connected to the database by allowing it to consult the database, in a way similar to the way it consults or talks to its human interlocutors. In this way, it can use the outputs of the database as text which it responds to and builds on. Here’s one way this can work: when a human interlocutor asks the language model a question, it can translate the question into a query for the database. Then it takes the response of the database as an input and builds a text from it to provide back to the human questioner. But this can misfire too, as the chatbot might ask the database the wrong question, or misinterpret its answer (Davis & Aaronson, 2023). As Davis and Aaronson report, “GPT-4 often struggles to formulate a problem in a way that Wolfram Alpha can accept or that produces useful output.” This is not unrelated to the fact that when the language model generates a query for the database or computational module, it does so in the same way it generates text for humans: by estimating the likelihood that some output “looks like” the kind of thing the database will respond to.
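The pattern just described can be made concrete in a short sketch. This is not the code of any actual system: generate_text and query_database, and the toy data they use, are invented stand-ins for a next-token predictor and an external tool such as Wolfram Alpha. The point of the sketch is that each step is just more text generation, so a badly formed query quietly propagates into the final answer.

```python
# Toy stand-ins for the consult-the-database pattern described above.
TOY_DATABASE = {"population of france": "68 million (2024 estimate)"}

def generate_text(prompt: str) -> str:
    """Toy 'LLM': returns whatever continuation looks most plausible here."""
    if prompt.startswith("Rewrite as a database query:"):
        # The model guesses at a query; nothing checks it is the right one.
        return prompt.removeprefix("Rewrite as a database query:").strip().lower()
    # When composing the final answer, it simply builds on the result text.
    return "Based on the lookup: " + prompt.split("Database result:")[-1].strip()

def query_database(query: str) -> str:
    """Toy external tool: answers only queries it recognizes."""
    return TOY_DATABASE.get(query, "no result found")

def answer_with_database(user_question: str) -> str:
    query = generate_text(f"Rewrite as a database query: {user_question}")
    result = query_database(query)  # the tool answers whatever it was asked
    return generate_text(f"{user_question}\nDatabase result: {result}")

print(answer_with_database("Population of France"))       # query happens to match
print(answer_with_database("How many people in France"))  # query misfires: no result found
```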
One might worry that these failed methods for improving the accuracy of chatbots are connected to the inapt metaphor of AI hallucinations. If the AI is misperceiving or hallucinating sources, one way to rectify this would be to put it in touch with real rather than hallucinated sources. But attempts to do so have failed.
The problem here isn’t that large language models hallucinate, lie, or misrepresent the world in some way. It’s that they are not designed to represent the world at all; instead, they are designed to convey convincing lines of text. So when they are provided with a database of some sort, they use this, in one way or another, to make their responses more convincing. But they are not in any real way attempting to convey or transmit the information in the database. As Chirag Shah and Emily Bender put it: “Nothing in the design of language models (whose training task is to predict words given context) is actually designed to handle arithmetic, temporal reasoning, etc. To the extent that they sometimes get the right answer to such questions is only because they happened to synthesize relevant strings out of what was in their training data. No reasoning is involved […] Similarly, language models are prone to making stuff up […] because they are not designed to express some underlying set of information in natural language; they are only manipulating the form of language” (Shah & Bender, 2022). These models aren’t designed to transmit information, so we shouldn’t be too surprised when their assertions turn out to be false.