So, you’re interested in implementing AI in your organization, but you’re concerned about the reports and studies showing that the majority of businesses fail to obtain a return on their investment, or, worse, end up significantly harming their business?
Well, you’ve come to the right place. In this article, I will explain why the majority of businesses fail in their AI implementation and what you can do to avoid becoming one of them.
One of the most common reasons for AI project failure is a misunderstanding of the capabilities, strengths, and weaknesses of the technology.
First, let’s look back at traditional IT. Its main strength is its reliability: based on the same set of inputs, you always get the same output.
However, its primary disadvantage is a lack of flexibility. If you want the system to behave even slightly differently, you need developers to make changes to the software, which is both costly and time-consuming.
AI’s strengths and weaknesses are essentially the opposite. It has the ability to adapt and remain flexible without requiring manual code changes, and it can handle use cases that are simply impossible for traditional IT.
On the other hand, AI will occasionally generate "wrong" outputs that appear perfectly plausible but are factually incorrect. These errors are commonly known as "hallucinations."
Unlike traditional IT, which is deterministic, AI is probabilistic. This means there is always a non-zero chance of it producing an error, even when it seems completely confident.
These hallucinations can significantly reduce the productivity gains from using AI, and in some cases, cause brand damage or even legal liabilities. Therefore, the most critical aspect of implementing AI in your organization is finding ways to reduce the rate at which they occur.
There are two vital factors that must be taken into consideration:

- The hallucination rate of each individual AI operation.
- The number of AI operations chained together in a single use case, because errors compound across steps.
For example, assume you have a process that requires 10 AI operations for a single use case, and each of those operations has a 10% hallucination rate. The overall failure rate for your use case would be approximately 65%. This is known as compound failure. (For more details on this problem, see this article: The Math That's Killing Your AI Agent).
In these scenarios, reducing the hallucination rate becomes even more critical. However, on the bright side, having 10 steps means you have 10 opportunities to implement safeguards. Every small reduction in the error rate at each step is highly effective, as it significantly reduces the overall compounding effect.
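The compounding effect described above is easy to verify yourself. A minimal sketch, assuming the steps fail independently:

```python
def compound_failure_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` independent AI
    operations fails, given a per-step error rate."""
    return 1 - (1 - per_step_error) ** steps

# 10 steps at a 10% per-step error rate: ~65% overall failure.
print(round(compound_failure_rate(0.10, 10), 3))  # 0.651

# Cutting each step's error rate in half to 5% takes the
# compound failure rate down to ~40%.
print(round(compound_failure_rate(0.05, 10), 3))  # 0.401
```

This is why small per-step improvements pay off so heavily: the reduction is multiplied across every step in the chain.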
Prompt Engineering

Prompt engineering is the most commonly used method to improve AI results because it is the easiest to implement.
However, it is also quite limited and can be very "fragile." Any model change, such as moving to a new version of the same model, can render previous prompt optimizations useless. Because of this, while your team should perform some level of prompt engineering, you should keep it limited.
Automated Validation

This will be one of your most powerful capabilities in the fight against hallucinations.
You should implement as much logic as possible to automatically validate the outputs from AI models. For example, you can check that an output matches a required format or schema, that numeric values fall within plausible ranges, or that facts and references the model cites actually exist in your source data.
Additionally, when a validation fails, the system should automatically and transparently retry the prompt. These retries should include specific details on why the previous attempt failed, though you should always set a maximum number of retry attempts to avoid infinite loops.
While there will be limits to how much automatic validation you can perform, and in some cases, it may not be possible at all, you should strive to implement as much as you can.
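The validate-and-retry loop described above can be sketched as follows. This is a minimal illustration: `call_model` and `validate` are placeholders you would replace with your own model client and output checks.

```python
# Sketch of an automatic validation loop with bounded retries.
# `call_model` and `validate` are parameters so you can plug in your
# own model client and output checks (schema, format, range rules...).

def prompt_with_validation(prompt, call_model, validate, max_retries=3):
    """Call the model, validate its output, and retry with an
    explanation of the failure. `validate` returns None on success
    or a human-readable description of what went wrong."""
    last_error = None
    for _ in range(max_retries):
        full_prompt = prompt
        if last_error is not None:
            # Tell the model why the previous attempt was rejected.
            full_prompt += f"\n\nYour previous answer was rejected: {last_error}"
        output = call_model(full_prompt)
        last_error = validate(output)
        if last_error is None:
            return output
    # Bounded retries: never loop forever on a stubborn failure.
    raise RuntimeError(f"Still failing after {max_retries} attempts: {last_error}")
```

Note that the failure reason is fed back into the retry prompt, and the loop is capped so a persistent failure surfaces as an error instead of spinning forever.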
Manual Validation
Also known as "Human-in-the-Loop," this is (similarly to automated validation) an extremely important technique to mitigate hallucinations.
However, it is also the most expensive way to reduce errors, and this cost must be factored into your return-on-investment (ROI) calculations. Data that has been corrected by a human can be stored alongside the original inputs to serve as "high-quality data" for use in more advanced approaches later on.
Another strategy for manual validation is to run multiple AI models against the same prompt and compare the results. Any divergences between the models are flagged as potential errors that require a re-prompt. This approach is not effective in every case, and it significantly increases your AI model costs. You can, however, mitigate these costs by combining a commercial SaaS model, such as Gemini or Claude, with a more cost-effective on-premise model.
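The cross-model comparison can be sketched like this. The model callables are hypothetical stand-ins for, say, a commercial SaaS model paired with a cheaper on-premise one:

```python
# Sketch of cross-model validation: run the same prompt through two
# models and flag any divergence for a re-prompt or human review.

def normalize(text):
    """Cheap comparison key: ignore case and whitespace differences.
    Real systems may need semantic comparison instead."""
    return " ".join(text.lower().split())

def cross_check(prompt, primary_model, secondary_model):
    """Return (answer, agreed). When agreed is False, treat the
    answer as a potential hallucination: re-prompt or escalate
    to a human reviewer."""
    answer = primary_model(prompt)
    agreed = normalize(answer) == normalize(secondary_model(prompt))
    return answer, agreed
```

Exact string comparison is deliberately naive here; for free-form text you would compare extracted facts or use an embedding-based similarity check instead.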
Context Engineering

This is the technique of dynamically adding specific, relevant information to a prompt. The most common approach for this is RAG (Retrieval-Augmented Generation).
The main advantage of RAG is that it is extremely easy to use once the necessary infrastructure and data integrations are implemented. However, it has several significant downsides:

- Retrieval quality is hard to guarantee: the system may return irrelevant chunks or miss the information the model actually needs.
- Splitting documents into chunks can strip away surrounding context, making the retrieved text misleading.
- The retrieval index must be kept in sync with your source data, or the model will confidently answer from stale information.
Another approach is to build custom logic that retrieves only the specific information you know will be needed for each prompt. Additionally, including relevant examples, such as high-quality data captured during the human validation step, can be very effective at reducing hallucinations in appropriate use cases.
Important Note: AI models have a tendency to "forget" or "ignore" information when a prompt is too large. This means context engineering can sometimes backfire, so you must be careful and closely monitor how the models handle your prompts.
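The "custom logic" alternative to generic RAG can be sketched as below. The knowledge base, topic names, and example store are hypothetical in-memory stand-ins for whatever systems hold your data:

```python
# Sketch of custom context engineering: instead of a generic RAG
# pipeline, look up only the facts you know each prompt needs and
# prepend a few human-validated examples.

KNOWLEDGE = {
    "refund_policy": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

# High-quality Q/A pairs captured during human validation.
VALIDATED_EXAMPLES = [
    ("Can I return an item after two weeks?",
     "Yes - returns are accepted within 30 days of purchase."),
]

def build_prompt(question, topics):
    """Assemble a prompt containing only the facts relevant to
    `topics`, plus validated few-shot examples."""
    facts = "\n".join(f"- {KNOWLEDGE[t]}" for t in topics if t in KNOWLEDGE)
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in VALIDATED_EXAMPLES)
    return (f"Answer using only the facts below.\n"
            f"Facts:\n{facts}\n\n"
            f"Examples:\n{examples}\n\n"
            f"Q: {question}\nA:")
```

Because only the facts tagged as relevant are included, the prompt stays small, which also helps with the "forgetting" problem noted above.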
Model Fine-Tuning

Last but definitely not least is model fine-tuning.
As mentioned earlier, AI models hallucinate when you prompt them with data that is not similar to the data they were trained on. The most fundamental way to address this is to add your specific data to the model. This is the same approach that AI companies use to improve their own models.
Fine-tuning involves taking large quantities of high-quality, relevant data and combining it with a base model to create a new, specialized version that includes your specific information. However, obtaining this data can often be challenging within an organization.
There are two primary approaches you can use to remediate this:

- Reuse the human-corrected, high-quality data you captured during manual validation as training examples.
- Generate synthetic training data, for example by having a stronger model produce candidate examples that humans then review.
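As a minimal sketch of the data-preparation side, the human-corrected records captured during manual validation can be converted into a JSONL fine-tuning dataset. The record field names (`input`, `corrected_output`) are illustrative, not a fixed schema:

```python
import json

# Sketch: convert human-corrected records captured during manual
# validation into a JSONL fine-tuning dataset (one JSON object per
# line, in the common prompt/completion shape).

def to_finetune_jsonl(records):
    """Serialize corrected records as JSONL training data."""
    return "\n".join(
        json.dumps({"prompt": r["input"], "completion": r["corrected_output"]})
        for r in records
    )
```

The exact output format depends on the fine-tuning service or framework you use, but the principle is the same: every human correction becomes a training example that nudges the model toward your data.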
Because AI models are "black boxes" for all practical purposes, a lot of the "engineering" around them involves educated guesses. Many "prompt best practices" are incorrect or frequently taken out of context.
Therefore, it is vital to be data-driven when using AI. You should capture as many metrics as possible to monitor the error and hallucination rates on an ongoing basis. It is also highly recommended to use A/B testing techniques to evaluate prompt changes or to compare how different AI models behave in each specific use case.
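A minimal A/B harness for prompt variants might look like this. All names are illustrative; the point is simply to route requests across variants, record validation outcomes, and compare error rates:

```python
import random
from collections import defaultdict

# Sketch of a minimal A/B harness for prompt variants: route each
# request to a variant, record whether its output passed validation,
# and report per-variant error rates for ongoing monitoring.

class PromptABTest:
    def __init__(self, variants):
        self.variants = list(variants)
        self.stats = defaultdict(lambda: {"calls": 0, "errors": 0})

    def pick(self):
        """Choose a variant for the next request (uniformly at random)."""
        return random.choice(self.variants)

    def record(self, variant, failed):
        """Log one observation: did this variant's output fail validation?"""
        s = self.stats[variant]
        s["calls"] += 1
        s["errors"] += int(failed)

    def error_rates(self):
        """Observed error rate per variant."""
        return {v: s["errors"] / s["calls"]
                for v, s in self.stats.items() if s["calls"]}
```

In production you would also want significance testing before declaring a winner, but even raw per-variant error rates beat guessing at which prompt works better.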
One issue we've found is that implementing all of these techniques can be rather time-consuming and requires highly skilled engineers.
Because of that, we've decided to build a product designed to make all of these techniques much easier to implement: Aeon AI Manager.
We've already implemented a subset of these features, which have proven to provide very significant benefits to AI implementations. However, we will only release the product publicly once we've built all the features required to cover every area this article discusses.
That being said, we are providing access to a private release of this software to a few companies to gather feature feedback, so if you are interested in being one of them, please contact us.