I read a lot of AI-related content that ranges from ‘It’s the most important invention in the history of humanity’ to ‘it doesn’t work, and it never will’. I confess that I lean towards the former, but the one argument that I don’t agree with is that hallucinations are the reason AI / Natural Language Processing isn’t ready for prime time.
We humans have been conditioned over time to use things that don’t work reliably, but somehow, since we’re all in it together, it becomes acceptable. As evidence, here are a few other not-ready-for-prime-time technologies that most of us rely on every day:
Bluetooth
Most online meetings begin with at least one participant fighting with audio issues, many of which are Bluetooth headphone related. We’ve all had Bluetooth headphones switch from laptop to phone to tablet (or even the car if someone nearby starts it) and back again, as though our headphones are sentient and doing it for their own amusement, just to torture us. Is Bluetooth secure? No. Is it deterministic? No. Have I ever mischievously streamed my music onto my kid’s Bluetooth speakers when they’re in their room, only because I can, and for my own amusement? Yes, yes I have.
Cell Phone Calls
The Verizon “Can you hear me now” campaign was over 20 years ago, but here we are. Cell phone calls are thoroughly unreliable, and the improvements to the system to make data faster (5G) seem to negatively impact simple voice calling. You know that spot on your commute where the calls have dropped for the past 15 years? It’s still there. How about the suspicious green box in your neighborhood that makes your calls cut out? Still going strong.
I’ve recently resorted to FaceTime audio or Microsoft Teams calling as a more reliable platform for voice communications (yes, I just used ‘Teams’ and ‘reliable’ in the same sentence and I regret nothing).
Automated Attendants (IVR)
Nothing gets the heart pumping and the blood boiling like the bottomless pit — unsolvable maze — I don’t want any of those options — I SAID OPERATOR! auto attendant. All you want to know is why it says your package has been delivered but you can’t find it on the front steps, next to the garage, under the deck, or in the mailbox. If you could ever see a bar chart of the most common responses from these systems, “sorry, I didn’t get that” would be the tallest bar by a wide margin, yet we’re forced to use them every day by corporations that save money by employing them instead of humans. Quick aside: you need not look any deeper than this example to figure out why AI will replace human jobs.
In fact, combine all three of these: Use your cellphone and Bluetooth headphones to contact a call center and then let’s talk about AI hallucinations as the reason AI will not be widely adopted.
The unfortunate reality is that our expectations about things working are now permanently set at ‘sometimes’ for the vast majority of systems that we rely on in our daily lives. This is why I just can’t get on board with the notion that an error rate of greater than zero is disqualifying for AI/NLP.
At work, and in my personal life, anything I write of any consequence is checked by someone I trust. I make mistakes every day but, thankfully, I have other humans to help me be clear about what I’m trying to say, use acceptable grammar, and avoid a tone that is inadvertently offensive. We should expect the same when anything is generated by AI.
Hallucinations will remain a fact of life when using AI for the foreseeable future but, in the meantime, there are meaningful ways to reduce them, including:
Effective Prompt Creation
The ever-increasing context window means you have the opportunity to provide template examples of the desired format and content of the responses. This can help reduce hallucinations as well as make the responses more deterministic. It is not a great approach for more creative prompts, but what’s a hallucination when you’re asking the LLM to be whimsical? In addition, adding instructions like ‘only provide factual responses’ and ‘provide citations for all information provided’ can reduce hallucinations and make the human review process faster and easier.
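To make that concrete, here is a minimal sketch of such a prompt, assuming a simple question-answering use case. The example records, the guardrail wording, and the build_prompt helper are all illustrative stand-ins, not a prescribed format; wire the resulting string into whichever model client you actually use.

```python
# Illustrative only: pair few-shot template examples with explicit
# factuality instructions so the model sees both the desired format
# and the ground rules before it sees the question.

TEMPLATE_EXAMPLES = """\
Example 1
Question: In what year was the Hubble Space Telescope launched?
Answer: 1990. Citation: NASA Hubble mission overview.

Example 2
Question: Who wrote "On the Origin of Species"?
Answer: Charles Darwin. Citation: First edition, John Murray, 1859.
"""

GUARDRAILS = (
    "Only provide factual responses. "
    "Provide a citation for every piece of information. "
    "If you cannot verify something, say 'I don't know' rather than guessing."
)

def build_prompt(question: str) -> str:
    """Assemble a prompt that shows the model the desired format and rules."""
    return (
        f"{GUARDRAILS}\n\n"
        f"Follow the format of these examples exactly:\n\n{TEMPLATE_EXAMPLES}\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_prompt("When was the first transatlantic telegraph cable completed?"))
```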
A well-tuned Retrieval Augmented Generation (RAG) or Knowledge Graph system
Leveraging well-structured data in a well-constructed system can materially reduce hallucinations. It takes time and work, but the results can be worth it. RAG is the perfect example of ‘five minutes to proof of concept / five months to production’, but once you understand your data set and how best to interact with it, performance will improve.
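As a rough illustration of the pattern (not a production recipe), here is a toy RAG sketch. The in-memory corpus and the keyword-overlap retriever are stand-ins for the embedding model and vector store a real system would use; the point is simply that the model is told to answer only from retrieved context.

```python
# Illustrative only: retrieve the most relevant passages from your own data,
# then constrain the model to answer from that context alone.

CORPUS = [
    "Order #1234 was marked delivered on June 3 by the regional carrier.",
    "Packages marked delivered may arrive up to 24 hours after the scan.",
    "Refund requests require the order number and the delivery address.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Rank passages by naive word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(question: str) -> str:
    """Stuff the retrieved passages into the prompt as the only allowed source."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_rag_prompt("Why does my package say delivered when I can't find it?"))
```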
Agents
One of the reasons hallucinations have been so obvious to users is that they are talking (kind of) directly to the model. The increasing capability and productization of Agents that provide specialized functionality, memory, additional context, and a specifically crafted prompt template reduce the flexibility afforded to users and increase the quality of the responses. Adding specific context to each prompt, with or without the user’s knowledge, can improve the resulting inference and reduce hallucinations.
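Here is a minimal sketch of that idea, assuming a hypothetical SupportAgent that wraps each user message in a crafted template along with its memory and some injected context; none of these names come from a particular framework.

```python
# Illustrative only: the user never talks to the raw model. The agent adds a
# role, conversation memory, and extra context around every message first.

from dataclasses import dataclass, field

@dataclass
class SupportAgent:
    role: str = "You are a shipping-support assistant. Cite the order record you used."
    memory: list[str] = field(default_factory=list)  # prior turns, summaries, etc.
    context: str = "Known facts: order #1234 was scanned as delivered on June 3."

    def remember(self, turn: str) -> None:
        """Store a prior turn so later prompts carry the conversation history."""
        self.memory.append(turn)

    def build_prompt(self, user_message: str) -> str:
        """Combine role, memory, and injected context with the user's message."""
        history = "\n".join(self.memory) or "(no prior conversation)"
        return (
            f"{self.role}\n\nConversation so far:\n{history}\n\n"
            f"{self.context}\n\nUser: {user_message}\nAssistant:"
        )

if __name__ == "__main__":
    agent = SupportAgent()
    agent.remember("User: My package says delivered but it isn't here.")
    print(agent.build_prompt("Can you check where it was left?"))
```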
Multi-agent Frameworks
Using systems like AgentZero, CrewAI, AutoGen, or OpenAI’s Swarm presents opportunities to leverage one AI agent to check the work of another. Assigning the last agent(s) in a chain the responsibility to fact-check, research assertions, and verify that all provided citations and URLs actually lead to the information referenced is a quick way to improve the quality of the output. This is different from ‘please check your work’, which isn’t likely to produce great results. Instead, take the opportunity to create a new task with something akin to ‘please research the following information for factuality, provide findings for your research, and provide a list of anything that can not be verified’.
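Framework specifics vary, so here is a framework-agnostic sketch of that drafting/verification hand-off. The call_model function is a placeholder for whichever client or multi-agent framework you actually use; only the shape of the verifier’s task comes from the description above.

```python
# Illustrative only: one agent drafts, a second agent is handed an explicit
# fact-checking task rather than a vague 'please check your work'.

VERIFIER_TASK = (
    "Please research the following information for factuality, provide "
    "findings for your research, and provide a list of anything that can "
    "not be verified, including any citations or URLs that do not lead "
    "to the information referenced."
)

def call_model(prompt: str) -> str:
    """Placeholder: route the prompt to your model or agent framework."""
    raise NotImplementedError("Wire this to your LLM client of choice.")

def draft_then_verify(user_request: str) -> str:
    """Chain a drafting step and a verification step."""
    draft = call_model(f"Draft a response to the following request:\n{user_request}")
    review = call_model(f"{VERIFIER_TASK}\n\nDraft to review:\n{draft}")
    return review
```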
You still need a human in the loop at the end but, as mentioned, that’s true of human generated content as well.
These are short-term fixes that are likely to be replaced in time by improvements in the models and the ongoing research into better approaches to reducing hallucinations. However, even if the technology remained exactly as it is today, it is already incredibly useful and viable for many tasks despite its occasional foibles.
Ultimately, discussions about hallucinations will become less interesting, much in the way we don’t discuss our recently dropped phone calls, or how frequent travelers no longer discuss travel delays. At present, however, I still hear many people on web conferences talk about funny examples of AI making mistakes, but only after they fumble for a minute or two to get their headphones connected.
~~ If you disagree with this article, or to hear this menu again, please press pound or hang up. ~~