Like soufflés in the oven and the weather on Mount Everest, generative AI (GenAI) calls for vigilance. Companies get that, but their precautions aren’t always preventive. Many organizations assume that humans in the loop will catch any problems and, fail-safe in place, they’ll deploy GenAI carefree. Yet while human oversight is crucial for mitigating the risks of GenAI, it’s still only one part of a solution. And the typical approach—assigning people to review output—carries risks of its own. The problem is that organizations rely on human oversight without designing human oversight. They have good intent but lack a good model.
That model isn’t elusive. But it has several components that must be designed alongside the GenAI system.
Human oversight works best—which means it actually works—when it is combined with system design elements and processes that make it easier for people to identify and escalate potential problems. It also needs to be paired with other key ingredients of GenAI vigilance, including testing and evaluation, clear articulation of use cases (to ensure that GenAI systems don’t deviate from their intended use), and response planning. Getting this right means thinking about human oversight at the product conception and design stage, when organizations are building a business case for a GenAI solution. Tacking it on during implementation—or worse, just prior to deployment—is too late.
A Fail-Safe That Often Fails
One of the unique traits of GenAI is that it can err in the same way humans err: by creating offensive content, demonstrating bias, and exposing sensitive data, for instance. So having humans check the output would seem a logical countermeasure. But there are a number of reasons why simply putting a human in the loop isn’t the fail-safe that organizations envision.
- Automation Bias. In effect, success breeds complacency. Humans will review initial outputs, see no errors, and quickly come to trust the system. Appraisals become cursory or even nonexistent. Consider something as commonplace as GPS navigation. A driver may have entered addresses hundreds, even thousands of times, and in each case the system successfully directed them to the destination. So when the system takes them on an unpaved path, without an expected landmark in sight, there’s a natural inclination to trust that the technology “knows what it’s doing” and keep taking those turns. We’ve all heard the stories: drivers navigating to the water’s edge or past it; trips that go on and on. Often there’s a simple explanation, such as an address that matches multiple locations (think 15 Main Street) or road work not yet reflected in the software. But the system’s track record created overconfidence in its capabilities. And vital interventions never happened.
- Missing Context. GenAI systems often produce output without any additional information, such as supporting evidence. This lack of context can make it hard for reviewers to determine whether the answer is accurate or appropriate. As a result, reviewers face two choices: conduct additional research, canceling out any efficiency gains from the system, or, more likely, accept the output at face value if it seems generally correct. Evaluation based on vibes rather than facts isn’t a viable risk mitigation approach.
- Lack of Counterevidence. Even when systems provide supporting evidence, that information explains only why the output might be correct. Few systems also present counterfactual evidence. So, while reviewers see the case in favor of the output, they don’t see the case against it. Consider a GenAI solution that reviews whether permit applications are complete. The system produces a “yes” response because it sees that an application and the two required supporting documents have been filed. What the system should also surface is that one of those supporting documents may itself be incomplete.
- A Disincentive Structure. GenAI is often employed to drive efficiency. Systems like copilots, chatbots, and customer self-service tools are all about simplifying processes and boosting productivity. But thoroughly evaluating GenAI output takes time—in many cases more than the system’s designers envisioned when they set efficiency targets. Managers are often held to these targets, creating pressure on teams, intended or not, to keep the efficiencies coming. Concerned about the negative repercussions of slowing things down, people are likely to perform only cursory reviews of system outputs.
- Escalation Roadblocks. Many GenAI systems lack mechanisms to flag responses that the user believes are incorrect, leading to uncertainty about what to do next. But an even bigger problem, perhaps, is the general skepticism that often permeates the review process. There’s an assumption that the system is right and anyone claiming otherwise had better have a rock-solid case. Reviewers often need to take tedious administrative steps to justify their belief that the response is wrong. And that’s assuming an escalation process even exists. Some organizations may have a policy requiring that users accept the output.
- Misunderstanding GenAI Capabilities. Knowledge about how GenAI works—its capabilities and limits—can vary wildly, even within a single organization. Hype, combined with the evolving nature of GenAI, often skews perceptions. As a result, many users view it as an almost magical technology and will second-guess themselves before questioning the GenAI system when they see a result that doesn’t seem quite right.
- Focusing on Accuracy but Not Scope. More than with other technologies, GenAI can be taken “off track” if designers focus on what the system should do but not also on what it should not do. Output might be technically correct yet still “bad” if the system deviates from its intended use. Reviewers, however, aren’t always briefed on use case boundaries. Their focus is on accuracy. So out-of-scope responses often go unquestioned.
Human Oversight by Design
Together, these factors paint a pretty grim picture of human oversight. And that’s the point: simply telling people to watch over the AI isn’t a solution. Eventually, a problem will go unnoticed—and then attract all too much notice. When that happens, the “we had human oversight” defense isn’t going to fly with shareholders, customers, or the news crews camped outside the office.
Don’t count on it flying with regulators or the courts, either. In December 2023, the Court of Justice of the European Union issued a ruling in a case related to assessing the creditworthiness of individuals. The court found that decisions to approve or deny credit applications, ostensibly made by humans, were effectively automated, as the humans routinely relied solely on algorithm-generated scores. This was a textbook case of automation bias. Put simply: the oversight was meaningless; the score was all that mattered.
Meaningful oversight requires more than putting humans in the loop. Companies need to treat oversight as an integral part of GenAI, not an add-on. They need to integrate it into the system’s design and surrounding business processes and develop the procedures and organizational culture that enable people to identify problems—and do something about them. This may seem an onerous task, but in our experience, some best practices can guide the way.
Define a process around human oversight. Guidelines are better than vibes. Without a structured rubric to evaluate system outputs, human reviewers are often left to rely on hunches and intuition. That may work well for detectives on British television, but it’s less than ideal for GenAI oversight. What should humans be looking for in the output? What are the red flags to consider? The idea is to develop a cookbook for the person interacting with the system that shows them, step by step, how to thoughtfully evaluate results. Like a recipe, the process is well defined and can be performed in a consistent way from one person to the next.
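To make the cookbook concrete, the rubric can be captured as a short, structured checklist that every reviewer works through in the same order. Below is a minimal sketch in Python; the questions and red-flag categories are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    question: str   # what the reviewer checks
    red_flag: str   # what a failing answer signals

# Illustrative rubric; real questions would come from the use case owners.
REVIEW_RUBRIC = [
    RubricItem("Is every factual claim supported by a cited source?", "possible hallucination"),
    RubricItem("Does the output stay within the documented use case?", "scope deviation"),
    RubricItem("Is the output free of personal or sensitive data?", "privacy exposure"),
    RubricItem("Would the output read as fair to an affected party?", "bias risk"),
]

def run_review(answers: dict) -> list:
    """Given the reviewer's yes/no answer to each question, return the red flags raised."""
    return [item.red_flag for item in REVIEW_RUBRIC if not answers.get(item.question, False)]
```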
It’s also important to specify, clearly and precisely, reviewer qualifications. Evaluators should have experience that’s relevant to the output. For example, for a system that simplifies insurance claims processing—which includes technical tasks like assessing repair and replacement estimates—a skilled claims adjuster should handle the review.
Human reviewers also need an effective way to escalate errors. Simplicity works best. Companies should design steps that not only enable reporting and response but spark and accelerate it. In our experience, organizations that successfully scale AI devote 70% of their effort to people and processes. For human oversight, this means identifying issues that may hinder error escalation, whether it’s metrics, policies, incentives, or all of the above. And it means designing GenAI solutions in a human-centered way, engaging with users to create features that facilitate output review, such as a “report” button built into the user interface. Finally, the process should include a roadmap for how the organization will respond, in terms of escalation, remediation, and communication, when a reviewer flags a potential GenAI failure.
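To illustrate how lightweight the reporting mechanism can be, here is a minimal sketch of a one-click “report” flow. The names and fields are hypothetical; the point is that flagging an output should take seconds, with justification gathered afterward.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class EscalationTicket:
    ticket_id: str
    output_id: str      # which GenAI response is being flagged
    reviewer_id: str
    reason: str         # brief note on why the reviewer doubts the output
    created_at: str

def report_output(output_id: str, reviewer_id: str, reason: str) -> EscalationTicket:
    """One click creates the ticket; remediation and communication then follow the roadmap."""
    return EscalationTicket(
        ticket_id=str(uuid.uuid4()),
        output_id=output_id,
        reviewer_id=reviewer_id,
        reason=reason,
        created_at=datetime.now(timezone.utc).isoformat(),
    )
```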
Design systems to give evidence for and against outputs. GenAI development teams are valuable partners in human oversight. By considering how to drive the review process as they make design decisions, developers can improve the accuracy and efficiency of evaluations. We’ve found that one of the most powerful enablers is context. To that end, GenAI systems should generate a simple summary of both “for” and “against” evidence, giving reviewers a clearer way to decide whether to accept or flag output. There’s an added bonus with this approach: when users ask a GenAI system to make the case, pro and con, for its response, the quality of that response tends to improve.
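One simple way to implement this is to have the system argue both sides of its own answer before a reviewer sees it. The sketch below assumes a placeholder `generate` function standing in for whatever model call the system already makes; the prompt wording is illustrative.

```python
# `generate` is a placeholder for the system's existing model-call function.
EVIDENCE_PROMPT = """You produced the following answer:

{answer}

1. List the strongest evidence from the source material that SUPPORTS this answer.
2. List any evidence or gaps that argue AGAINST this answer.
Keep each list to three bullets and cite the source passage for each point."""

def evidence_summary(answer: str, generate) -> str:
    """Return a short for/against summary a reviewer can scan before accepting or flagging."""
    return generate(EVIDENCE_PROMPT.format(answer=answer))
```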
Track response rejection rates. Human oversight shouldn’t work in isolation but rather as part of a holistic risk mitigation solution. A key component of this integrated approach is a comprehensive test-and-evaluation (T&E) process, one that leverages the strengths of both humans and automation. Robust T&E gives product development teams a good view of a system’s accuracy. Once the GenAI system is in the field, organizations should compare the in-use rejection rate with the rejection rate observed during T&E. Dramatically different rates may indicate that human oversight isn’t working properly (or, just as critically, they may reveal issues with system performance). For example, if you expect the system to produce incorrect answers 20% of the time but reviewers are flagging only 5% of the output, the disparity likely indicates automation bias, pressure to review outputs quickly, or one of the other factors that hinder human oversight.
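A rough sketch of that comparison, using the 20%-versus-5% example above; the tolerance band is an illustrative assumption.

```python
def oversight_health(flagged: int, reviewed: int, expected_error_rate: float,
                     tolerance: float = 0.5) -> str:
    """Compare the in-use rejection rate with the error rate observed during T&E."""
    observed = flagged / reviewed
    if observed < expected_error_rate * (1 - tolerance):
        return (f"Under-flagging ({observed:.0%} vs. {expected_error_rate:.0%} expected): "
                "check for automation bias or time pressure.")
    if observed > expected_error_rate * (1 + tolerance):
        return (f"Over-flagging ({observed:.0%} vs. {expected_error_rate:.0%} expected): "
                "check for system performance drift.")
    return f"In line with T&E ({observed:.0%} vs. {expected_error_rate:.0%} expected)."

# T&E predicted ~20% incorrect answers; reviewers in the field flag only 5%.
print(oversight_health(flagged=50, reviewed=1000, expected_error_rate=0.20))
```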
Establish a quality control process. Oversight can also be enhanced by regularly assessing the quality of human reviews (are they identifying correct and incorrect outputs accurately?) and looking for evidence of automation bias (are reviewers actually assessing the outputs?). One quality control technique is to introduce intentional errors every so often. If a reviewer fails to flag the response as incorrect, they’re alerted that it was a test and that they failed to catch the error. These periodic nudges are often enough to ensure that reviewers carefully evaluate outputs and that automation bias doesn’t take over. One caveat: organizations need to strike a careful balance, using these test cases in just the right measure. Too many, and the GenAI system’s efficiency gains will suffer. Too few, and the process may have little impact on reviewer performance.
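A minimal sketch of the seeding mechanic, assuming a hypothetical review queue; the seeding rate is an arbitrary example to be tuned against the efficiency trade-off described above.

```python
import random

def build_review_queue(real_outputs: list, seeded_errors: list, seed_rate: float = 0.02) -> list:
    """Mix a small number of intentionally incorrect outputs into the review queue."""
    n_seeds = max(1, int(len(real_outputs) * seed_rate))
    queue = real_outputs + random.sample(seeded_errors, min(n_seeds, len(seeded_errors)))
    random.shuffle(queue)
    return queue

def catch_rate(reviewed_queue: list) -> float:
    """Share of seeded errors the reviewer actually flagged."""
    seeds = [item for item in reviewed_queue if item.get("is_seeded_error")]
    return sum(item.get("flagged", False) for item in seeds) / len(seeds) if seeds else 1.0
```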
Build review time into the business case. Company leaders—and, ultimately, reviewers—often come to see oversight as a drain on a GenAI solution’s value potential. Evaluations take time, they slow down the works, and they keep organizations from realizing all the gains they anticipated. By factoring review time into the solution’s business case, companies set more realistic expectations for value. This makes it less likely that oversight gets thrown under the bus. And if the value potential is lower as a result, that’s okay, too, as the richer cost-benefit analysis helps companies better prioritize which solutions to develop. An up-front analysis also lets leaders know early on when a business case no longer closes. That way, they can cut bait before making an investment they won’t recoup.
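A back-of-the-envelope version of that calculation might look like the following; every figure is a placeholder to be replaced with the organization’s own estimates.

```python
# Placeholder figures for a hypothetical GenAI business case.
minutes_saved_per_task  = 12     # drafting time the GenAI system removes
review_minutes_per_task = 4      # time a qualified reviewer needs to check the output
tasks_per_year          = 50_000
cost_per_minute         = 1.20   # fully loaded labor cost, in dollars

gross_value = minutes_saved_per_task * tasks_per_year * cost_per_minute
review_cost = review_minutes_per_task * tasks_per_year * cost_per_minute
net_value   = gross_value - review_cost

# If net_value no longer clears the investment hurdle, leaders learn that before building.
print(f"Gross: ${gross_value:,.0f}  Review cost: ${review_cost:,.0f}  Net: ${net_value:,.0f}")
```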
Take a risk-differentiated approach. Human oversight is time well spent—usually. An all-in, always-on, leave-no-output-unturned approach can cancel out all efficiency gains that a GenAI system can bring. But by designing oversight in a risk-differentiated way, companies can strike the right balance between review and efficiency. Differentiation can take different forms. It might be based on the purpose of the system (some systems, for instance, may need more review than others) or, drilling down deeper, the specific decision at play, with more review required for higher-risk output.
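One way to express such a policy is a routing rule that maps the use case’s risk level and the impact of the specific decision to a review depth. The tiers below are illustrative assumptions, not a recommended scheme.

```python
def review_requirement(use_case_risk: str, decision_impact: str) -> str:
    """Map risk level ("low"/"medium"/"high") and decision impact to a review depth."""
    if use_case_risk == "high" or decision_impact == "high":
        return "full human review before release"
    if use_case_risk == "medium" or decision_impact == "medium":
        return "sampled human review (e.g., 1 in 10 outputs)"
    return "automated checks only, with periodic spot audits"

print(review_requirement(use_case_risk="medium", decision_impact="high"))
```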
Leverage GenAI for oversight. In areas like data management, GenAI is already proving to be its own enabler. Perhaps the same could be true for oversight. For example, system designers could build in a self-assessment capability. In effect, the GenAI system critiques itself, providing an assessment of confidence in its answer (say, through a confidence score), with lower certainty triggering more human review. GenAI-based review systems offer the advantages of speed, scale, and immunity to disincentives (they don’t worry about the consequences of pushing the “escalate” button). Organizations that think carefully about how to combine GenAI-based and human reviewers may find themselves with the most effective, most efficient kind of oversight.
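A minimal sketch of confidence-gated escalation, assuming the system can attach a confidence score to each answer; the 0.7 threshold is an arbitrary example.

```python
def route_output(answer: str, self_confidence: float, threshold: float = 0.7) -> str:
    """Send low-confidence answers to a human; release the rest under standard sampling."""
    if self_confidence < threshold:
        return f"ESCALATE to human reviewer (confidence {self_confidence:.2f})"
    return f"Release with standard sampling (confidence {self_confidence:.2f})"

print(route_output("The permit application is complete.", self_confidence=0.55))
```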
Educate users. Robust human oversight is fueled by knowledge: not only evidence for and against the output but also an understanding of the strengths and limitations of GenAI technologies and the implications of a system’s different risks. Educating users also means sharing results from the T&E phase and providing insight into when the system performs well and when there tend to be gaps. And it means articulating—clearly and precisely—the purpose of a GenAI system or use case, so reviewers can identify not only inaccurate results, but also deviations from the intended function. Organizations should ensure that each GenAI system has a system card—documentation summarizing capabilities, intended use, limitations, risks, and T&E results—and make it easily accessible to all users.
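A system card can be as simple as a small, structured record kept alongside the system. The sketch below follows the fields listed above; the structure and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SystemCard:
    name: str
    intended_use: str
    out_of_scope: list
    known_limitations: list
    key_risks: list
    te_results_summary: str   # headline findings from test and evaluation
    last_updated: str

# Hypothetical card for the claims-processing example mentioned earlier.
claims_card = SystemCard(
    name="Claims Intake Assistant",
    intended_use="Pre-check incoming insurance claims for completeness and summarize them.",
    out_of_scope=["Deciding whether a claim is approved or denied"],
    known_limitations=["Struggles with handwritten supporting documents"],
    key_risks=["Completeness checks can miss partially filled documents"],
    te_results_summary="Roughly one in five outputs required correction during T&E.",
    last_updated="2025-01-01",
)
```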
Human oversight helps keep GenAI’s value coming and its perils at bay. But it only works when it is carefully designed, not casually delegated. Companies that get oversight right make it an integral component of both GenAI systems design and risk mitigation. They foster vigilance where and when it matters most. And they empower reviewers to say something when they see something. GenAI systems aren’t perfect; neither are humans. But with robust oversight, both technology and people can realize their potential—safely, reliably, and fully.