Addressing AI hallucinations with retrieval-augmented generation

The hallucinations of large language models are mainly a result of deficiencies in the dataset and training. These can be mitigated with retrieval-augmented generation and real-time data.

Artificial intelligence is poised to be perhaps the most impactful technology of modern times. The recent advances in transformer technology and generative AI have demonstrated a potential to unlock innovation and ingenuity at scale.

However, generative AI has shortcomings that can significantly hinder its adoption and limit the value it creates. As generative AI models grow in complexity and capability, they also present unique challenges, including the generation of outputs that are not grounded in the input data.

These so-called “hallucinations” are instances when models produce outputs that, though coherent, might be detached from factual reality or from the input’s context. This article will briefly survey the transformative effects of generative AI, examine the shortcomings and challenges of the technology, and discuss the techniques available to mitigate hallucinations.

The transformative effect of generative AI 

Generative AI models use a complex computing process known as deep learning to identify patterns in large sets of data and then use this information to create new, convincing outputs. The models do this by incorporating machine learning techniques known as neural networks, which are loosely inspired by the way the human brain processes and interprets information and then learns from it over time.

Generative AI models like OpenAI’s GPT-4 and Google’s PaLM 2 have the potential to accelerate innovations in automation, data analysis, and user experience. These models can write code, summarize articles, and even help diagnose diseases. However, the viability and ultimate value of these models depend on their accuracy and reliability. In critical sectors like healthcare, finance, or legal services, reliable accuracy is of paramount importance. But for all users, the challenges of accuracy and reliability need to be addressed to unlock the full potential of generative AI.

Shortcomings of large language models

LLMs are fundamentally probabilistic and non-deterministic. They generate text based on the likelihood of a particular sequence of words appearing next. LLMs have no notion of knowledge; much like a recommendation engine, they rely solely on the patterns learned from the corpus of data they were trained on. The text they generate usually follows the rules of grammar and semantics, but it is chosen only because it is statistically consistent with the prompt, not because it has been checked against facts.
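To make the probabilistic point concrete, here is a minimal sketch of how a next word might be drawn from a probability distribution. The candidate words and their scores are made up for illustration, and `softmax` and `sample_next_token` are hypothetical helper names, not part of any real LLM API. A real model works over a vocabulary of many thousands of tokens, but the principle is the same: the most likely word usually wins, yet a less likely (and possibly wrong) word is sometimes chosen.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(candidates, scores, temperature=1.0, rng=random):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical scores for completing "The capital of France is ..."
candidates = ["Paris", "Lyon", "Berlin"]
scores = [4.0, 1.5, 1.0]

probs = softmax(scores)
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(candidates, scores, rng=rng) for _ in range(1000)]

for word, p in zip(candidates, probs):
    print(f"{word}: p={p:.3f}, sampled {samples.count(word)} times")
```

Raising the `temperature` argument flattens the distribution, which makes unlikely words more frequent. That is useful for creative tasks and risky for factual ones.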

This probabilistic nature of the LLM can be both a strength and a weakness. If the goal is to produce a correct answer or make critical decisions based on the response, then hallucination is bad and could even be damaging. However, if the goal is a creative endeavor, then an LLM can be used to foster artistic creativity to produce art, storylines, and scripts relatively quickly.

However, regardless of the goal, not being able to trust an LLM’s output can have serious consequences. It not only erodes trust in the capabilities of these systems but also significantly diminishes the impact that AI can have on accelerating human productivity and innovation.

Ultimately, AI is only as good as the data it is trained on. The hallucinations of an LLM are mainly a result of deficiencies in the dataset and training, including the following.

  • Overfitting: Overfitting occurs when a model learns the training data too well, including its noise and outliers. Excessive model complexity, noisy training data, or insufficient training data can all lead to overfitting. The result is low-quality pattern recognition: the model fails to generalize to new data, which leads to classification and prediction errors, factually incorrect output, output with a low signal-to-noise ratio, or outright hallucinations.
  • Data quality: The mislabelling and miscategorization of data used for training can play a significant role in hallucinations. Biased data or the lack of relevant data can in fact lead to outputs of the model that may seem accurate but could prove to be harmful, depending on the decision-making scope of the model recommendations. 
  • Data sparsity: Data sparsity or the need for fresh or relevant data is one of the significant problems that leads to hallucinations and hinders the adoption of generative AI in enterprises. Refreshing data with the latest content and contextual data can help reduce hallucinations and biases. 

Addressing hallucinations in large language models

There are several ways to address hallucinations in LLMs, including techniques like fine-tuning, prompt engineering, and retrieval-augmented generation (RAG).

  • Fine-tuning refers to retraining the model with domain-specific datasets so that it more accurately generates content relevant to the domain. Retraining or fine-tuning the model, however, takes time, and without continuous training the data can quickly become outdated. Retraining models also comes with a significant cost burden.
  • Prompt engineering aims to help the LLM produce high-quality results by providing more descriptive and clarifying features in the input to the LLM as a prompt. Giving the model additional context and grounding it in truth makes it less likely to hallucinate.
  • Retrieval-augmented generation (RAG) is a framework that focuses on grounding the LLMs with the most accurate, up-to-date information. By feeding the model with facts from an external knowledge repository in real time, you can improve the LLM responses. 
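The grounding step shared by prompt engineering and RAG can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production recipe: `build_grounded_prompt` is a hypothetical helper, the passages are invented sample data, and the retrieval step that would supply them is assumed to have already happened. The resulting string would be sent to whichever LLM you use.

```python
def build_grounded_prompt(question, passages):
    """Combine retrieved passages and the user's question into one prompt.

    The instruction asks the model to stay within the supplied context,
    which is the grounding idea described above.
    """
    context = "\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented passages standing in for facts fetched from a knowledge repository.
passages = [
    "Order 1042 shipped on Tuesday and is due to arrive on Friday.",
    "Standard shipping takes three business days.",
]
question = "When will order 1042 arrive?"

prompt = build_grounded_prompt(question, passages)
print(prompt)
```

Numbering the passages is a design choice that lets the model, or a reviewer, point back to the source of each claim.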

Retrieval-augmented generation and real-time data

Retrieval-augmented generation is one of the most promising techniques for improving the accuracy of large language models. RAG coupled with real-time data has been shown to significantly reduce hallucinations.

RAG enables organizations to leverage LLMs with proprietary and contextual data that is fresh. In addition to mitigating hallucinations, RAG helps language models produce more accurate and contextually relevant responses by enriching the input with context-specific information. Fine-tuning is often impractical in a corporate setting, but RAG provides a low-cost, high-yield alternative for delivering personalized, well-informed user experiences.

To make RAG more effective, it is necessary to combine it with an operational data store that can store data in the native language of LLMs, that is, as high-dimensional mathematical vectors called embeddings that encode the meaning of the text. When a user asks a question, the database transforms the query into a numerical vector as well. The vector database can then be searched for text with a similar meaning, regardless of whether that text includes the same terms as the query.
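The lookup described here is commonly implemented as a nearest-neighbor search over embeddings. The sketch below assumes cosine similarity as the distance measure and uses tiny hand-written three-dimensional vectors in place of real embeddings, which would come from an embedding model and have hundreds or thousands of dimensions. The names `cosine_similarity` and `top_k` and the sample texts are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors by angle: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Made-up embeddings: text -> vector. A vector database would store these.
index = {
    "Refunds are issued within 14 days of a return.": [0.9, 0.1, 0.0],
    "Our office is closed on public holidays.": [0.1, 0.8, 0.2],
    "Passwords must be rotated every 90 days.": [0.0, 0.2, 0.9],
}

# Made-up embedding for the query "How do I get my money back?"
# The query shares no keywords with the refund text, yet its vector is close.
query_vector = [0.85, 0.15, 0.05]

results = top_k(query_vector, index, k=2)
print(results)
```

A real vector database replaces the linear scan in `top_k` with an approximate nearest-neighbor index so that the search stays fast over millions of vectors.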

A database that is highly available, performant, and capable of storing and querying massive amounts of unstructured data using semantic search is a critical component of the RAG process.

Rahul Pradhan is VP of product and strategy at Couchbase, provider of a leading modern database for enterprise applications. Rahul has 20 years of experience leading and managing both engineering and product teams focusing on databases, storage, networking, and security technologies in the cloud.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

Copyright © 2023 IDG Communications, Inc.