You Won’t Get GenAI Right If You Get Human Oversight Wrong

By Steven Mills, Noah Broestl, and Anne Kleppe
Article | 12 min read

Key Takeaways

Generative AI presents risks, but the go-to solution—humans reviewing the output—isn’t as straightforward as executives think. Oversight needs to be designed, not delegated.
  • Human review is often undermined by automation bias, escalation roadblocks and disincentives, and evaluations steered by vibes rather than guidelines.
  • Oversight works when organizations integrate it into product design, instead of tacking it on at launch, and pair it with other components of GenAI vigilance, such as testing and evaluation.
  • Best practices include developing a structured rubric to evaluate outputs, designing GenAI systems to provide evidence for—and against—their responses, and taking a risk-differentiated approach, where organizations strike a balance between review and efficiency.

Like soufflés in the oven and the weather on Mount Everest, generative AI (GenAI) calls for vigilance. Companies get that, but their precautions aren’t always preventive. Many organizations assume that humans in the loop will catch any problems and, fail-safe in place, they’ll deploy GenAI carefree. Yet while human oversight is crucial for mitigating the risks of GenAI, it’s still only one part of a solution. And the typical approach—assigning people to review output—carries risks of its own. The problem is that organizations rely on human oversight without designing human oversight. They have good intent but lack a good model.

That model isn’t elusive. But it has several components that must be designed alongside the GenAI system.

Human oversight works best—which means it actually works—when it is combined with system design elements and processes that make it easier for people to identify and escalate potential problems. It also needs to be paired with other key ingredients of GenAI vigilance, including testing and evaluation, clear articulation of use cases (to ensure that GenAI systems don’t deviate from their intended use), and response planning. Getting this right means thinking about human oversight at the product conception and design stage, when organizations are building a business case for a GenAI solution. Tacking it on during implementation—or worse, just prior to deployment—is too late.


A Fail-Safe That Often Fails

One of the unique traits of GenAI is that it can err in the same way humans err: by creating offensive content, demonstrating bias, and exposing sensitive data, for instance. So having humans check the output would seem a logical countermeasure. But there are a number of reasons why simply putting a human in the loop isn’t the fail-safe that organizations envision.

Evaluation based on vibes rather than facts isn’t a viable risk mitigation approach.

Human Oversight by Design

Together, these factors paint a pretty grim picture of human oversight. And that’s the point: simply telling people to watch over the AI isn’t a solution. Eventually, a problem will go unnoticed—and then attract all too much notice. When that happens, the “we had human oversight” defense isn’t going to fly with shareholders, customers, or the news crews camped outside the office.


Don’t count on it flying with regulators or the courts, either. In December 2023, the Court of Justice of the European Union issued an opinion in a case related to assessing the creditworthiness of individuals. The court found that decisions to approve or deny credit applications, ostensibly made by humans, were effectively automated, as the humans routinely relied solely on algorithm-generated scores. This was a textbook case of automation bias. Put simply: the oversight was meaningless; the score was all that mattered.

Meaningful oversight requires more than putting humans in the loop. Companies need to treat oversight as an integral part of GenAI, not an add-on. They need to integrate it into the system’s design and surrounding business processes and develop the procedures and organizational culture that enable people to identify problems—and do something about them. This may seem an onerous task, but in our experience, some best practices can guide the way.


Define a process around human oversight. Guidelines are better than vibes. Without a structured rubric to evaluate system outputs, human reviewers are often left to rely on hunches and intuition. That may work well for detectives on British television, but it’s less than ideal for GenAI oversight. What should humans be looking for in the output? What are the red flags to consider? The idea is to develop a cookbook for the person interacting with the system that shows them, step by step, how to thoughtfully evaluate results. Like a recipe, the process is well defined and can be performed in a consistent way from one person to the next.
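To make the idea concrete, here is a minimal sketch of such a rubric encoded as data, so that every reviewer applies the same criteria in the same order. The criteria, questions, and blocking rules below are purely illustrative assumptions, not a recommended standard.

```python
# A minimal sketch of a structured review rubric, encoded as data so every
# reviewer applies the same criteria in the same order. The criteria and
# blocking rules are illustrative placeholders.

RUBRIC = [
    # (criterion id, question the reviewer must answer, blocking?)
    ("factual_accuracy", "Are all factual claims verifiable against source data?", True),
    ("policy_compliance", "Does the output comply with company policy and regulation?", True),
    ("sensitive_data", "Is the output free of personal or confidential data?", True),
    ("tone", "Is the tone appropriate for the audience?", False),
    ("relevance", "Does the output stay within the system's intended use case?", True),
]

def evaluate(reviewer_answers: dict[str, bool]) -> str:
    """Map a reviewer's yes/no answers onto an accept/flag decision."""
    for criterion, question, blocking in RUBRIC:
        passed = reviewer_answers.get(criterion, False)  # unanswered counts as a failure
        if not passed and blocking:
            return f"FLAG: failed blocking criterion '{criterion}'"
    return "ACCEPT"

print(evaluate({
    "factual_accuracy": True,
    "policy_compliance": True,
    "sensitive_data": False,   # reviewer spotted leaked data
    "tone": True,
    "relevance": True,
}))  # -> FLAG: failed blocking criterion 'sensitive_data'
```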

It’s also important to specify, clearly and precisely, reviewer qualifications. Evaluators should have experience that’s relevant to the output. For example, for a system that simplifies insurance claims processing—which includes technical tasks like assessing repair and replacement estimates—a skilled claims adjuster should handle the review.

Human reviewers also need an effective way to escalate errors. Simplicity works best. Companies should design steps that not only enable reporting and response but also encourage and accelerate them. In our experience, organizations that successfully scale AI devote 70% of their effort to people and processes. For human oversight, this means identifying issues that may hinder error escalation, whether the culprit is metrics, policies, incentives, or all of the above. It also means designing GenAI solutions in a human-centered way, engaging with users to create features that facilitate output review, such as a “report” button built into the user interface. Finally, the process should include a roadmap for how the organization will respond, in terms of escalation, remediation, and communication, when a reviewer flags a potential GenAI failure.
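As a rough illustration, that one-click escalation path might boil down to a structured report that routes by severity. The severity tiers and queue names in this sketch are hypothetical placeholders, not a prescribed workflow.

```python
# A sketch of a "report" action that captures what responders need and
# routes by severity. Tiers and queue names are hypothetical.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EscalationReport:
    output_id: str
    reviewer: str
    reason: str      # which rubric criterion failed
    severity: str    # "low" | "high" | "critical"
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def route(report: EscalationReport) -> str:
    """Decide who responds; speed matters more as severity rises."""
    if report.severity == "critical":
        return "page-incident-response"   # immediate human response
    if report.severity == "high":
        return "risk-team-queue"          # same-day triage
    return "weekly-review-backlog"        # batched review

report = EscalationReport("out-4821", "adjuster-07", "sensitive_data", "critical")
print(route(report))  # -> page-incident-response
```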

Design systems to give evidence for and against outputs. GenAI development teams are valuable partners in human oversight. By considering how to drive the review process as they make design decisions, developers can improve the accuracy and efficiency of evaluations. We’ve found that one of the most powerful enablers is context. To that end, GenAI systems should generate a simple summary of both “for” and “against” evidence, giving reviewers a clearer way to decide whether to accept or flag output. There’s an added bonus with this approach: when users ask a GenAI system to make the case, pro and con, for its response, the quality of that response tends to improve.
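One simple way to implement the pattern is a second model call that asks for the pro and con evidence after the answer is produced, so the reviewer sees both side by side. In this sketch, call_model is a stand-in for whatever model API an organization uses; it is not a real library call.

```python
# A sketch of the "evidence for and against" pattern: after producing an
# answer, the system is prompted a second time to argue both sides.
# `call_model` is a placeholder, not a real API.

def call_model(prompt: str) -> str:
    raise NotImplementedError("replace with your model API")

EVIDENCE_PROMPT = """You previously answered the question below.

Question: {question}
Your answer: {answer}

List the strongest evidence FOR this answer and the strongest evidence
AGAINST it, each as short bullet points, citing the source material where
possible. Do not revise the answer."""

def answer_with_evidence(question: str) -> dict:
    answer = call_model(question)
    evidence = call_model(EVIDENCE_PROMPT.format(question=question, answer=answer))
    # The reviewer sees all three fields together before accepting or flagging.
    return {"question": question, "answer": answer, "evidence": evidence}
```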

Track response rejection rates. Human oversight shouldn’t work in isolation but rather as part of a holistic risk mitigation solution. A key component of this integrated approach is a comprehensive test-and-evaluation (T&E) process, one that leverages the strengths of both humans and automation. Robust T&E gives product development teams a good view of a system’s accuracy. Once the GenAI system is in the field, organizations should compare the in-use rejection rate with the rejection rate observed during T&E. Dramatically different rates may indicate that human oversight isn’t working properly (or, just as critically, it may reveal issues with system performance). For example, if you expect the system to produce incorrect answers 20% of the time but reviewers are flagging only 5% of the output, the disparity likely indicates automation bias, pressure to review outputs quickly, or one of the other factors that hinder human oversight.
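A simple monitoring check captures the idea, using the figures above (an expected 20% rejection rate versus an observed 5%). The tolerance band here is an arbitrary assumption; a production monitor might use a formal statistical test instead.

```python
# A sketch of the rejection-rate check: alert when the field rejection rate
# strays too far from the T&E baseline. The 50% tolerance is illustrative.

def rejection_rate_alert(expected: float, observed: float, tolerance: float = 0.5) -> bool:
    """Alert when observed strays more than `tolerance` (as a fraction of
    the expected rate) from the T&E baseline, in either direction."""
    return abs(observed - expected) > tolerance * expected

expected_rate = 0.20   # from test and evaluation
observed_rate = 0.05   # flagged by reviewers in production
if rejection_rate_alert(expected_rate, observed_rate):
    print("Investigate: possible automation bias, rushed review, or model drift")
```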

Establish a quality control process. Oversight can also be enhanced by regularly assessing the quality of human reviews (are they identifying correct and incorrect outputs accurately?) and looking for evidence of automation bias (are reviewers actually assessing the outputs?). One quality control technique is to introduce intentional errors every so often. If a reviewer fails to flag the response as incorrect, they’re alerted that this was a test, and they failed to catch the error. These periodic nudges are often enough to ensure that reviewers will carefully evaluate outputs and that automation bias doesn’t take over. One caveat: organizations need to strike a careful balance, using these test cases in just the right measure. Too many, and the GenAI system’s efficiency gains will suffer. Too few, and the process may have little impact on reviewer performance.
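In code, the technique might look like the following sketch, where known-bad outputs are interleaved into a reviewer's queue at a small, configurable rate and misses are tallied. The 2% injection rate is an illustrative assumption, reflecting the balance described above.

```python
# A sketch of seeded-error quality control: plant known-bad outputs in the
# review queue and measure how many the reviewer catches.

import random

CANARY_RATE = 0.02   # fraction of reviewed items that are planted errors (illustrative)

def build_queue(real_outputs: list[str], canaries: list[str], rng=random) -> list[tuple[str, bool]]:
    """Return (output, is_canary) pairs with canaries randomly interleaved."""
    queue = [(o, False) for o in real_outputs]
    n_canaries = max(1, int(len(real_outputs) * CANARY_RATE))
    for canary in rng.sample(canaries, min(n_canaries, len(canaries))):
        queue.insert(rng.randrange(len(queue) + 1), (canary, True))
    return queue

def canary_catch_rate(decisions: list[tuple[bool, bool]]) -> float:
    """decisions: (reviewer_flagged, is_canary) pairs. Return share of canaries caught."""
    canaries = [flagged for flagged, is_canary in decisions if is_canary]
    return sum(canaries) / len(canaries) if canaries else 1.0
```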

Build review time into the business case. Company leaders—and, ultimately, reviewers—often come to see oversight as a drain on a GenAI solution’s value potential. Evaluations take time, they slow down the works, and they keep organizations from realizing all the gains they anticipated. By factoring review time into the solution’s business case, companies set more realistic expectations for value. This makes it less likely that oversight gets thrown under the bus. And if the value potential is lower as a result, that’s okay, too, as the richer cost-benefit analysis helps companies better prioritize solutions to develop. An up-front analysis also lets leaders know early on if a business case no longer closes. That way, they can cut bait before making an investment they won’t recoup.
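A back-of-the-envelope calculation shows how review time changes the math. All figures below are hypothetical; the point is that review cost appears in the calculation from day one rather than eroding value later.

```python
# A sketch of folding review cost into a GenAI business case. All inputs
# are hypothetical.

outputs_per_year = 100_000
savings_per_output = 4.00        # value of automating each output, in dollars
review_rate = 0.30               # share of outputs a human reviews
minutes_per_review = 3
reviewer_cost_per_hour = 60.00

gross_value = outputs_per_year * savings_per_output
review_cost = outputs_per_year * review_rate * (minutes_per_review / 60) * reviewer_cost_per_hour
net_value = gross_value - review_cost

print(f"Gross value: ${gross_value:,.0f}")   # $400,000
print(f"Review cost: ${review_cost:,.0f}")   # $90,000
print(f"Net value:   ${net_value:,.0f}")     # $310,000 -- does the case still close?
```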

Take a risk-differentiated approach. Human oversight is time well spent—usually. An all-in, always-on, leave-no-output-unturned approach can cancel out the efficiency gains a GenAI system brings. But by designing oversight in a risk-differentiated way, companies can strike the right balance between review and efficiency. Differentiation can take several forms. It might be based on the purpose of the system (some systems, for instance, may need more review than others) or, drilling down deeper, on the specific decision at play, with more review required for higher-risk output.
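In practice, risk differentiation can be as simple as review sampling rates that scale with the stakes of a decision. The tiers, rates, and examples in this sketch are placeholders each organization would set for itself.

```python
# A sketch of risk-differentiated review: sampling rates scale with the
# stakes of the decision. Tiers and rates are illustrative.

import random

REVIEW_RATES = {
    "high":   1.00,   # e.g., claim denials: every output reviewed
    "medium": 0.25,   # e.g., claim summaries: one in four sampled
    "low":    0.05,   # e.g., internal drafts: spot checks only
}

def needs_human_review(risk_tier: str, rng=random) -> bool:
    return rng.random() < REVIEW_RATES[risk_tier]

print(needs_human_review("high"))   # always True
print(needs_human_review("low"))    # True ~5% of the time
```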


Leverage GenAI for oversight. In areas like data management, GenAI is already proving to be its own enabler. Perhaps the same could be true for oversight. For example, systems designers could build in a self-assessment capability. In effect, the GenAI system critiques itself, providing an assessment of confidence in its answer (say, through a confidence score), with lower certainty triggering more human review. GenAI-based review systems offer the advantages of speed, scale, and immunity to disincentives (they don’t worry about the consequences of pushing the “escalate” button). Organizations that think carefully about how to combine GenAI-based and human reviewers may find themselves with the most effective, most efficient kind of oversight.
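A minimal sketch of this routing logic follows. Here, self_assess stands in for a second, critique-style model call, and the 0.7 threshold is an arbitrary assumption that would be tuned against T&E data.

```python
# A sketch of GenAI-assisted oversight: the system critiques its own answer,
# and low self-assessed confidence routes the output to a human.

CONFIDENCE_THRESHOLD = 0.7   # arbitrary; tune against T&E data

def self_assess(question: str, answer: str) -> float:
    """Placeholder: ask the model to score its own answer from 0 to 1."""
    raise NotImplementedError("replace with a critique call to your model API")

def dispatch(question: str, answer: str) -> str:
    confidence = self_assess(question, answer)
    if confidence < CONFIDENCE_THRESHOLD:
        return "route-to-human-review"   # uncertain: a person decides
    return "auto-release"                # confident: sampled QC still applies
```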


Educate users. Robust human oversight is fueled by knowledge: not only evidence for and against the output but also an understanding of the strengths and limitations of GenAI technologies and the implications of a system’s different risks. Educating users also means sharing results from the T&E phase and providing insight into when the system performs well and when there tend to be gaps. And it means articulating—clearly and precisely—the purpose of a GenAI system or use case, so reviewers can identify not only inaccurate results, but also deviations from the intended function. Organizations should ensure that each GenAI system has a system card—documentation summarizing capabilities, intended use, limitations, risks, and T&E results—and make it easily accessible to all users.
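A system card can even be machine-readable, so the same document that educates users can drive tooling. The fields and values in this sketch are illustrative, echoing the elements listed above rather than prescribing a schema.

```python
# A minimal sketch of a machine-readable system card. All values are
# illustrative placeholders.

SYSTEM_CARD = {
    "name": "claims-summary-assistant",
    "intended_use": "Summarize auto-insurance claims for licensed adjusters",
    "out_of_scope": ["final approval/denial decisions", "customer-facing messages"],
    "known_limitations": ["struggles with handwritten documents",
                          "repair estimates beyond $50k under-tested"],
    "risks": ["hallucinated damage details", "exposure of policyholder data"],
    "t_and_e_results": {"accuracy": 0.92, "expected_rejection_rate": 0.20},
    "last_evaluated": "2024-06-01",
}
```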


Human oversight helps keep GenAI’s value coming and its perils at bay. But it only works when it is carefully designed, not casually delegated. Companies that get oversight right make it an integral component of both GenAI systems design and risk mitigation. They foster vigilance where and when it matters most. And they empower reviewers to say something when they see something. GenAI systems aren’t perfect; neither are humans. But with robust oversight, both technology and people can realize their potential—safely, reliably, and fully.

Authors

Managing Director &amp; Partner<br/>Chief AI Ethics Officer

Steven Mills

Managing Director & Partner
Chief AI Ethics Officer
Washington, DC

Partner and Associate Director, Responsible AI

Noah Broestl

Partner and Associate Director, Responsible AI
Brooklyn

Managing Director & Partner

Anne Kleppe

Managing Director & Partner
Berlin
