The False Hope of Explainable AI

Many folks view explainability as a crucial next step for AI, but a new Lancet paper from a team of AI heavyweights argues that explainability might do more harm than good in the short term, and that AI stakeholders would be better off increasing their focus on validation.

The Old Theory – For as long as we’ve been covering AI, really smart and well-intentioned people have warned about the “black-box” nature of AI decision making and forecasted that explainable AI will lead to more trust, less bias, and greater adoption.

The New Theory – These black-box concerns and explainable AI forecasts might be logical, but they aren’t currently realistic, especially for patient-level decision support. Here’s why:

  • Explainability methods describe how AI systems work, not how decisions are made
  • AI explanations can be unreliable and/or superficial
  • Most medical AI decisions are too complex to explain in an understandable way
  • Humans over-trust computers, so explanations can hurt their ability to catch AI mistakes
  • AI explainability methods (e.g. heat maps) require human interpretation, risking confirmation bias
  • Explainable AI adds more potential error sources (AI tool + AI explanation + human interpretation)
  • Although we still can’t fully explain how acetaminophen works, we don’t question whether it works, because we’ve tested it extensively

The Explainability Alternative – Until suitable explainability methods emerge, the authors call for “rigorous internal and external validation of AI models” to make sure AI tools are consistently making the right recommendations. They also advised clinicians to remain cautious when referencing AI explanations and warned that policymakers should resist making explainability a requirement. 
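To make the validation idea concrete, here is a minimal sketch of comparing a tool's sensitivity and specificity on an internal test set versus an external one. It is not the paper's method. The function names and the tiny example datasets are made up for illustration, and a real validation study would use far larger samples and confidence intervals.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 = finding present)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(labels, predictions):
        if truth and pred:
            counts["tp"] += 1
        elif truth and not pred:
            counts["fn"] += 1
        elif not truth and pred:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity); None where the denominator is zero."""
    c = confusion_counts(labels, predictions)
    positives = c["tp"] + c["fn"]
    negatives = c["tn"] + c["fp"]
    sensitivity = c["tp"] / positives if positives else None
    specificity = c["tn"] / negatives if negatives else None
    return sensitivity, specificity


# Hypothetical data: ground truth and AI output at the training site (internal)
# and at a different site (external). These numbers are invented.
internal_labels = [1, 1, 1, 1, 0, 0, 0, 0]
internal_preds = [1, 1, 1, 0, 0, 0, 0, 1]
external_labels = [1, 1, 1, 1, 0, 0, 0, 0]
external_preds = [1, 0, 0, 0, 0, 0, 1, 1]

internal_result = sensitivity_specificity(internal_labels, internal_preds)
external_result = sensitivity_specificity(external_labels, external_preds)
print("internal (sens, spec):", internal_result)
print("external (sens, spec):", external_result)
```

A gap between the two results, like the one in this invented data, is the kind of signal external validation is meant to surface, and it does not depend on any explanation of how the model reached its outputs.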

Explainability’s Short-Term Role – Explainability definitely still has a role in AI safety, as it’s “incredibly useful” for model troubleshooting and systems audits, which can improve model performance and identify failure modes or biases.

The Takeaway – It appears we might not be close enough to explainable AI to make it a part of short-term AI strategies, policies, or procedures. That might be hard to accept for the many people who view the need for AI explainability as undebatable, and it makes AI validation and testing more important than ever.

Who Owns AI Evaluation and Monitoring?

Imaging AI evaluation and monitoring just became even hotter topics, following a particularly revealing Twitter thread and a pair of interesting new papers.

Rads Don’t Work for AI – A Mayo Clinic Florida neuroradiologist took his case to Twitter after an FDA-approved AI tool missed 6 of 7 hemorrhages in a single shift and he was asked to make extra clicks to help improve the algorithm. No AI tool is perfect, but many folks commenting on this thread didn’t take kindly to the idea of being asked to do pro bono work to improve an algorithm that they already paid for.

AI Takes Work – A few radiologists with strong AI backgrounds clarified that this “extra work” is intended to inform the FDA about postmarket performance, and that monitoring healthcare tools and providing feedback is indeed part of a physician’s job. They also argued that radiology practices should ensure that they have the bandwidth to monitor AI before deciding to adopt it.

The ACR DSI Gets It – Understanding that “AI algorithms may not work as expected when used beyond the institutions in which they were trained, and model performance may degrade over time,” the ACR Data Science Institute (DSI) released a helpful paper detailing how radiologists can evaluate AI before and during clinical use. In an unplanned nod to the above Twitter thread, the DSI paper also noted that AI evaluation/monitoring is “ultimately up to the end users” although many “practices will not be able to do this on their own.” The good news is the ACR DSI is developing tools to help them.
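As a rough illustration of what low-effort monitoring during clinical use might look like, the sketch below tracks an AI tool's sensitivity over the most recent confirmed-positive cases and raises an alert when it drops below a threshold. This is not the ACR DSI's tooling. The function name, the window size, and the 0.8 threshold are all assumptions chosen for the example, and the sample shift simply mirrors the "flagged 1 of 7" scenario from the Twitter thread.

```python
from collections import deque


def monitor_sensitivity(cases, window=7, threshold=0.8):
    """Return (case_index, sensitivity) alerts for a chronological case list.

    Each case is a (confirmed_positive, ai_flagged) pair, where
    confirmed_positive is the radiologist's final read. Sensitivity is
    computed over the last `window` confirmed-positive cases only.
    """
    recent = deque(maxlen=window)  # AI results on recent confirmed positives
    alerts = []
    for index, (confirmed_positive, ai_flagged) in enumerate(cases):
        if not confirmed_positive:
            continue  # negatives do not affect sensitivity
        recent.append(bool(ai_flagged))
        if len(recent) == window:
            sensitivity = sum(recent) / window
            if sensitivity < threshold:
                alerts.append((index, sensitivity))
    return alerts


# Hypothetical shift: seven confirmed hemorrhages, only the last one flagged by the AI.
shift = [(True, False)] * 6 + [(True, True)]
shift_alerts = monitor_sensitivity(shift)
print(shift_alerts)
```

In this example the alert fires on the seventh positive case, once the window is full. A real deployment would also need to track false positives and decide who reviews the alerts, which is where the bandwidth concerns come in.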

DLIR Needs Evaluation Too – Because measuring whether DL-reconstructed scans “look good” or allow reduced-dose exams won’t avoid errors (e.g. false tumors or removed tumors), a Washington University in St. Louis-led team is developing a framework for evaluating DLIR tools before they are introduced into clinical practice. The new framework comes from some big-name institutions (WUSTL, NIH, FDA, Cleveland Clinic, UBC), all of whom also appear to agree that AI evaluation is up to the users.

The Takeaway – At least among AI insiders, it’s clear that AI users are responsible for algorithm evaluation and monitoring, even if bandwidth is limited and many evaluation/monitoring tools are still being developed. Meanwhile, many AI users (who are crucial for AI to become mainstream) want their FDA-approved algorithms to perform correctly and aren’t eager to do extra work to help improve them. That’s a pretty solid conflict, but it’s also an opportunity for AI vendors who get good at streamlining evaluations and develop low-labor ways to monitor performance.

-- The Imaging Wire team