Key Takeaways

Even with comprehensive testing and evaluation, the risk of system failure with GenAI will never be zero. Organizations need to quickly sense and respond when failures inevitably occur. They should prepare an extensive escalation response strategy that addresses how to:
  • Detect failures and communicate them to the organization, users, and regulators as necessary. 
  • Debug and correct the issue. 
  • Create contingency plans to address issues that are not easily fixed. 


For all its powerful potential, generative AI (GenAI) can generate incorrect outputs, produce harmful or offensive content, and expose organizations to new security vulnerabilities. Before launching GenAI-powered services, organizations should conduct comprehensive testing and evaluation to identify and mitigate these risks. That evaluation should rely on human testers augmented by automated platforms.

But even with comprehensive testing and evaluation, the risk of system failure with GenAI will never be zero. GenAI systems are complex, and the results they generate are nondeterministic. Humans cannot trace input to output through every step of the system. Residual risks, including those that were never anticipated or identified, will always remain and could materialize at any time. 

Traditional predict-and-control approaches, such as testing and evaluation, will fall short. Organizations will need to quickly sense and respond to failures when they occur. They will need a comprehensive monitoring, escalation, response, and recovery strategy.

Why GenAI Is Prone to Failure

Two strengths of GenAI create novel challenges for product owners and senior leaders.

  • General Capabilities. GenAI systems can perform a wide range of tasks, including many that are still emerging. This range is a feature, not a bug. And while these capabilities are a key source of value, they make it impossible to anticipate every risk for a GenAI product. Even if it were possible to map the entire risk landscape, the cost to thoroughly test for every identified risk would be prohibitive. 
  • Nondeterministic Output. GenAI systems produce subtly different answers to the same question. While this feature enables engaging and creative responses, it also means the system will sometimes produce answers that are inconsistent with one another or that differ materially from the intentions of the product owners, as the sketch after this list illustrates. The volume of testing needed to detect these behaviors is often time- or cost-prohibitive.
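
To make the point concrete, the toy sketch below mimics token sampling with a weighted random choice. It is an illustration of sampling-driven variation only, not how any particular model is implemented; the vocabulary and weights are invented for the example.

```python
import random

# Toy illustration only: GenAI systems sample each output token from a
# probability distribution, so the same prompt can yield different answers.
adjectives = ["reliable", "innovative", "affordable", "unproven"]
weights = [0.4, 0.3, 0.2, 0.1]

for _ in range(2):
    # Two runs of the "same prompt" can produce different completions,
    # including low-probability ones the product owner never intended.
    print("Our product is", random.choices(adjectives, weights=weights, k=1)[0])
```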

Residual risks related to these two challenges of GenAI will always exist and are significantly larger than the risks of systems that produce more predictable and consistent responses. As the complexity of GenAI systems grows, product owners must also be vigilant for these and other unanticipated or unidentified risks.

Residual risks, including those that were never anticipated or identified, will always remain and could materialize at any time.

How to Address the Reality of Failure

Organizations need to be prepared for their GenAI systems to fail. For example, what will you do if your systems disparage a competitor’s products, make a libelous comment, or produce responses that violate regulations?  

Although it’s important to test your GenAI systems prior to deployment, that won’t be sufficient for effectively managing risks. Organizations need a response plan for every product that identifies how they will:

Detect failures. Organizations should continually monitor system performance. When a system is operating outside normal parameters—for example, by generating suspiciously long outputs—it could indicate a failure. Users and employees need a mechanism to report problems, and companies should monitor social media chatter for issues. Periodic red teaming and testing can also help surface nuanced or emerging issues that would otherwise go undetected.
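
As a concrete illustration, the sketch below flags responses whose length deviates sharply from a rolling baseline. It is a minimal example, not a production monitor; real deployments would track many more signals (latency, refusal rates, toxicity scores, user reports), and the window size and threshold here are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

class OutputLengthMonitor:
    """Flag responses whose length deviates sharply from recent history."""

    def __init__(self, window: int = 500, z_threshold: float = 4.0):
        self.lengths = deque(maxlen=window)  # rolling window of recent output lengths
        self.z_threshold = z_threshold

    def check(self, response_text: str) -> bool:
        """Record one response; return True if it looks anomalous."""
        n = len(response_text)
        anomalous = False
        if len(self.lengths) >= 30:  # wait for a reasonable baseline first
            mu, sigma = mean(self.lengths), stdev(self.lengths)
            if sigma > 0 and abs(n - mu) / sigma > self.z_threshold:
                anomalous = True  # e.g., a suspiciously long output
        self.lengths.append(n)
        return anomalous
```

In use, a service would call `check` on every response and route any flagged interaction into the escalation process described below.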

Communicate any problems to the organization and to users. Relevant stakeholders, from leadership to development teams, need to know about a potential failure as soon as it occurs. There should be push mechanisms (email and messaging alerts) and pull mechanisms (real-time dashboards) so that internal stakeholders can respond quickly and consistently. Different organizations will take different approaches to whether and how they inform their users of a failure. Some companies fix the issue but do not tell users. Others fix the error and notify the users who brought the failure to their attention. Still others notify all users. All these approaches are valid. An ad hoc, seat-of-the-pants response is not. Companies should build a coordinated communication strategy for both internal and external stakeholders into their response plans.
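
One way to operationalize the push-and-pull pattern is sketched below: a single alert record is dispatched to push channels and recorded in a store that dashboards read from. The `FailureAlert` structure and the `send`/`append` interfaces are hypothetical placeholders for whatever messaging tools and dashboards an organization already runs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FailureAlert:
    product: str
    severity: str  # e.g., "low", "high", "critical"
    summary: str
    detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def dispatch(alert: FailureAlert, push_channels, dashboard_store) -> None:
    """Send the alert over push channels and record it for pull-based access.

    push_channels: objects exposing a hypothetical send(message) method
    (email, chat, paging). dashboard_store: any append-able store that a
    real-time dashboard reads from.
    """
    for channel in push_channels:  # push: proactive notification
        channel.send(f"[{alert.severity}] {alert.product}: {alert.summary}")
    dashboard_store.append(alert)  # pull: stakeholders check on demand
```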

Notify regulators and other critical stakeholder groups. Emerging AI regulations require disclosures of noncompliance. Before launching any GenAI system, organizations need to agree on the senior leaders who should be involved and a formal disclosure process. When a failure that warrants disclosure occurs, a well-executed outreach to regulators and other stakeholders instills confidence and demonstrates a commitment to doing the right thing.

Debug and correct the issue. System logs, debugging tools, and other instrumentation can help uncover the root causes of an issue. Once an issue has been identified, it must be fixed without degrading overall quality or introducing new risks. Organizations then need a plan to thoroughly test the updated system prior to deployment. They should establish a clear order of operations and decision rights. And they need an overall framework for deciding when to implement a temporary point solution and when a system needs to be taken offline for a more comprehensive fix.
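
Below is a minimal sketch of the kind of logging that makes root-cause analysis possible, assuming a Python service: each interaction is written as a structured record keyed by a request ID. The field names and the `genai.audit` logger name are illustrative assumptions, not a standard.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai.audit")

def log_interaction(prompt: str, response: str, model_version: str) -> str:
    """Write a structured audit record so a reported failure can be traced
    back to the exact prompt, response, and model version involved."""
    request_id = str(uuid.uuid4())
    logger.info(json.dumps({
        "request_id": request_id,
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
    }))
    return request_id  # surface this ID to users so failure reports can cite it
```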

What will you do if your systems disparage a competitor’s products, make a libelous comment, or produce responses that violate regulations?  

Plan for operational resilience. As organizations increasingly depend on GenAI and other systems for core business processes, they need contingency/resiliency plans to continue operating even at degraded levels. If, for example, GenAI is handling three-quarters of customer service responses, what is the plan if the GenAI system needs to be taken down for several weeks to fix an issue? These plans are especially critical in highly automated areas where the workforce has been reduced or redeployed. After accounting for the costs of operational resilience, the proposed solution may no longer be attractive.
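
At the code level, graceful degradation for the customer service example might look like the sketch below: a kill switch routes traffic to a human queue when the GenAI system has been taken offline, and individual request failures fall back the same way. The `genai_client`, `human_queue`, and `kill_switch` interfaces are hypothetical stand-ins for an organization's actual systems.

```python
def handle_customer_query(query: str, genai_client, human_queue, kill_switch) -> str:
    """Route queries to GenAI in normal operation; degrade gracefully
    to a human queue when the system is disabled or a request fails."""
    if kill_switch.is_active():  # system taken offline for a fix
        return human_queue.enqueue(query)
    try:
        return genai_client.answer(query)
    except Exception:
        # Per-request failure: fall back rather than surface an error.
        return human_queue.enqueue(query)
```

The design choice worth noting is that the fallback path must be staffed and tested in advance; a queue no one is working is not a contingency plan.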

As organizations increasingly depend on GenAI and other systems for core business processes, they need contingency/resiliency plans to continue operating even at degraded levels.

Just as with cybersecurity issues, organizations should not wait until they experience a GenAI product failure to figure out how to minimize the fallout. Executives and teams need to develop a response plan during the initial stages of product design. A thoughtful sense-and-respond approach can change a product failure from a crisis to a controlled event that passes quickly without brand damage, loss of customer trust, or regulatory infractions.

Testing and evaluation and escalation response planning are not either/or components of a responsible product development strategy. They are complementary approaches, and both are necessary. Yet given the rush to adopt GenAI, many organizations may not be giving either the attention it deserves to address the systemic risks of GenAI.
