The False Hope of Explainable AI

Many folks view explainability as a crucial next step for AI, but a new Lancet paper from a team of AI heavyweights argues that explainability might do more harm than good in the short term, and that AI stakeholders would be better off increasing their focus on validation.

The Old Theory – For as long as we’ve been covering AI, really smart and well-intentioned people have warned about the “black-box” nature of AI decision making and forecasted that explainable AI will lead to more trust, less bias, and greater adoption.

The New Theory – These black-box concerns and explainable AI forecasts might be logical, but they aren’t currently realistic, especially for patient-level decision support. Here’s why:

  • Explainability methods describe how AI systems work, not how decisions are made
  • AI explanations can be unreliable and/or superficial
  • Most medical AI decisions are too complex to explain in an understandable way
  • Humans over-trust computers, so explanations can hurt their ability to catch AI mistakes
  • AI explainability methods (e.g. heat maps) require human interpretation, risking confirmation bias
  • Explainable AI adds more potential error sources (AI tool + AI explanation + human interpretation)
  • Although we still can’t fully explain how acetaminophen works, we don’t question whether it works, because we’ve tested it extensively

The Explainability Alternative – Until suitable explainability methods emerge, the authors call for “rigorous internal and external validation of AI models” to make sure AI tools are consistently making the right recommendations. They also advised clinicians to remain cautious when referencing AI explanations and warned that policymakers should resist making explainability a requirement. 

Explainability’s Short-Term Role – Explainability definitely still has a role in AI safety, as it’s “incredibly useful” for model troubleshooting and systems audits, which can improve model performance and identify failure modes or biases.

The Takeaway – It appears we might not be close enough to explainable AI to make it a part of short-term AI strategies, policies, or procedures. That might be hard to accept for the many people who view the need for AI explainability as undebatable, and it makes AI validation and testing more important than ever.

ImageBiopsy Lab & UCB’s AI Alliance

Global pharmaceutical company UCB recently licensed its osteoporosis AI technology to MSK AI startup ImageBiopsy Lab, representing an interesting milestone for several emerging AI business models.

The UCB & ImageBiopsy Lab Alliance – ImageBiopsy Lab will use UCB’s BoneBot AI technology to develop and commercialize a tool that screens CT scans for “silent” spinal fractures to identify patients who should be receiving osteoporosis treatments. The new tool will launch by 2023 as part of ImageBiopsy Lab’s ZOO MSK platform.

UCB’s AI Angle – UCB produces an osteoporosis drug that would be prescribed far more often if detection rates improve (over 2/3 of vertebral fractures are currently undiagnosed). That’s why UCB developed and launched BoneBot AI in 2019 and it’s why the pharma giant is now working with ImageBiopsy Lab to bring it into clinical use.

The PharmaAI Trend – We’re seeing a growing trend of drug and device companies working with AI developers to help increase treatment demand. The list is getting pretty long, including quite a few PharmaAI alliances targeting lung cancer treatment (Aidence & AstraZeneca, Qure.ai & AstraZeneca, Huma & Bayer, Optellum & J&J) and a diverse set of AI alliances with medical device companies (Imbio & Olympus for emphysema, Aidoc & Inari for PE, Viz.ai & Medtronic for stroke).

The Population Health AI Trend – ImageBiopsy Lab’s BoneBot AI licensing is also a sign of AI’s growing momentum in population health, following increased interest from academia and major commercial efforts from Cleerly (cardiac screening) and Zebra Medical Vision (cardiac and osteoporosis screening… so far). This alliance also introduces a new type of population health AI beneficiary (pharma companies), in addition to risk holders and patients.

The Takeaway – ImageBiopsy Lab and UCB’s new alliance didn’t get a lot of media attention last week, but it tells an interesting story about how AI business models are evolving beyond triage, and how those changes are bringing some of healthcare’s biggest names into the imaging AI arena.

Who Owns AI Evaluation and Monitoring?

Imaging AI evaluation and monitoring just became even hotter topics, following a particularly revealing Twitter thread and a pair of interesting new papers.

Rads Don’t Work for AI – A Mayo Clinic Florida neuroradiologist took his case to Twitter after an FDA-approved AI tool missed 6 of 7 hemorrhages in a single shift and he was asked to make extra clicks to help improve the algorithm. No AI tool is perfect, but many folks commenting on this thread didn’t take kindly to the idea of being asked to do pro-bono work to improve an algorithm that they already paid for. 

AI Takes Work – A few radiologists with strong AI backgrounds clarified that this “extra work” is intended to inform the FDA about postmarket performance, and that monitoring healthcare tools and providing feedback is indeed part of a physician’s job. They also argued that radiology practices should ensure that they have the bandwidth to monitor AI before deciding to adopt it.

The ACR DSI Gets It – Understanding that “AI algorithms may not work as expected when used beyond the institutions in which they were trained, and model performance may degrade over time,” the ACR Data Science Institute (DSI) released a helpful paper detailing how radiologists can evaluate AI before and during clinical use. In an unplanned nod to the above Twitter thread, the DSI paper also noted that AI evaluation/monitoring is “ultimately up to the end users,” although many “practices will not be able to do this on their own.” The good news is the ACR DSI is developing tools to help them.

DLIR Needs Evaluation Too – Because measuring whether DL-reconstructed scans “look good” or allow reduced dosage exams won’t avoid errors (e.g. false tumors or removed tumors), a Washington University in St. Louis-led team is developing a framework for evaluating DLIR tools before they are introduced into clinical practice. The new framework comes from some big-name institutions (WUSTL, NIH, FDA, Cleveland Clinic, UBC), all of whom also appear to agree that AI evaluation is up to the users.

The Takeaway – At least among AI insiders it’s clear that AI users are responsible for algorithm evaluation and monitoring, even if bandwidth is limited and many evaluation/monitoring tools are still being developed. Meanwhile, many AI users (who are crucial for AI to become mainstream) want their FDA-approved algorithms to perform correctly and aren’t eager to do extra work to help improve them. That’s a pretty solid conflict, but it’s also a silver lining for AI vendors who get good at streamlining evaluations and develop low-labor ways to monitor performance.

Aidoc & Riverain’s Platform Partnership

Aidoc and Riverain Technologies announced a new partnership that will make Riverain’s ClearRead CT and ClearRead Xray solutions available on the Aidoc platform, while advancing the companies’ respective platform strategies. 

The Chest AI Package – In addition to offering Riverain’s AI tools individually, Aidoc will provide them as part of an ‘integrated chest AI package’ that also includes Aidoc’s modules for PE, incidental PE, and rib fractures. 

Riverain’s Platform Push – Riverain has amassed a solid network of AI marketplace and OEM partners over the years, and it now appears to be expanding its channel to complementary AI vendors. Riverain’s new Aidoc alliance comes just a few weeks after a similar partnership with Volpara that combines ClearRead CT with the Volpara Lung platform.

Aidoc’s Platform Portfolio – After years of building out its homegrown AI portfolio (7 products) and customer base (600 health centers), Aidoc is evolving into an AI platform company. Over the last year, Aidoc has assembled a solid AI portfolio that combines its own triage products with solutions that it doesn’t offer (Imbio, Icometrix, Subtle Medical, Riverain), allowing its clients to expand their AI stack without overhauling their infrastructure with each new tool.

The Takeaway – We’re at an interesting time in the AI space with a small handful of diversified AI players (e.g. Aidoc, Qure.ai), a group of focused category leaders (e.g. Riverain w/ thoracic, ScreenPoint w/ mammography), and an AI customer base that would prefer not to support multiple AI infrastructures. Although marketplaces also solve this problem, it’s easy to see how complementary vendor partnerships like these could play a growing role in how AI is delivered going forward.

Aidoc and Subtle Medical’s End-to-End Alliance

Aidoc and Subtle Medical launched an interesting new partnership that will make Subtle’s image acquisition / enhancement software available on the Aidoc AI platform.

End-to-End Partnership – The addition of SubtlePET and SubtleMR to the Aidoc AI platform will create what Aidoc called an “end-to-end” solution and “the first joint offering of AI for both image acquisition and triage.” Some folks might mistake that to mean that they will create new combined image acquisition+triage solutions, but they won’t be specifically linked (Aidoc doesn’t have MRI or PET tools yet anyway).

Aidoc, a Platform Company – Aidoc seems to be increasingly positioning itself as an AI platform company, which is an understandable strategy given users’ need for comprehensive / consistent AI workflows. Aidoc’s initial partnerships also allow the triage-focused vendor to offer a far more comprehensive value proposition (Subtle for acquisition, icometrix for stroke analysis/assessment).

Subtle Upsides – The alliance introduces Subtle Medical to Aidoc’s sizable list of clients (used at >500 medical centers, a high profile partnership w/ RP), and adds to Subtle’s current alliances with AI marketplace vendors (e.g. Blackford, Nuance, Incepto) and complementary solutions companies (e.g. Cortechs.ai).

The Takeaway – Although AI platform alliance stories don’t usually earn a spot at the top of The Imaging Wire, this alliance is pretty notable given what it suggests about Aidoc’s AI platform strategy and about the growing trend towards complementary AI alliances. It’s also a nice way for Subtle Medical to expand its reach.

Bad AI Goes Viral

A recent mammography AI study review quickly evolved from a “study” to a “story” after a single tweet from Eric Topol (to his 521k followers), calling mammography AI’s accuracy “very disappointing” and prompting a new flow of online conversations about how far imaging AI is from achieving its promise. However, the bigger “story” here might actually be how much AI research needs to evolve.

The Study Review: A team of UK-based researchers reviewed 12 digital mammography screening AI studies (n = 131,822 women). The studies analyzed DM screening AI’s performance when used as a standalone system (5 studies), as a reader aid (3 studies), or for triage (4 studies).

The AI Assessment: The biggest public takeaway was that 34 of the 36 AI systems (94%) evaluated in three of the studies were less accurate than a single radiologist, and all were less accurate than the consensus of two or more radiologists. They also found that AI modestly improved radiologist accuracy when used as a reader aid and eliminated around half of negative screenings when used for triage (but also missed some cancers).

The AI Research Assessment: Each of the reviewed studies was “of poor methodological quality,” all were retrospective, and most studies had high risks of bias and high applicability concerns. Unsurprisingly, these methodology-focused assessments didn’t get much public attention.

The Two Takeaways: The authors correctly concluded that these 12 poor-quality studies found DM screening AI to be inaccurate, and called for better quality research so we can properly judge DM screening AI’s actual accuracy and most effective use cases (and then improve it). However, the takeaway for many folks was that mammography screening AI is worse than radiologists and shouldn’t replace them, which might be true, but isn’t very scientifically helpful.

Unsupervised COVID AI

MGH’s new pix2surv AI system can accurately predict COVID outcomes from chest CTs, and it uses an unsupervised design that appears to solve some major COVID AI training and performance challenges.

Background – COVID AI hasn’t exactly earned the best reputation (short history + high annotation labor > leading to bad data > creating generalization issues), limiting most real world COVID analysis to logistic regression.

Designing pix2surv – pix2surv’s weakly unsupervised design and use of a generative adversarial network avoid these COVID AI pitfalls. It was directly trained with CTs from MGH’s COVID workflow (no labeling, no supervised training) and accurately estimates patient outcomes directly from their chest CTs.

pix2surv Performance – pix2surv accurately predicted the time of each patient’s ICU admission or death and applied the same analysis to stratify patients into high and low-risk groups. More notably, it “significantly outperformed” current laboratory tests and image-based methods with both predictions.

Applications – The MGH researchers believe pix2surv can be expanded to other COVID use cases (e.g. predicting Long COVID), as well as “other diseases” that are commonly diagnosed in medical images and might be hindered by annotation labor.

The Takeaway – pix2surv will require a lot more testing, and its chance of maintaining this type of performance across other sites and diseases might be a longshot (at least right away). However, pix2surv’s streamlined training and initial results are notable, and it would be very significant if a network like this were able to bring pattern-based unsupervised AI into clinical use.

Veye Validation

A team of Dutch radiologists analyzed Aidence’s Veye Chest lung nodule detection tool, finding that it works “very well,” while outlining some areas for improvement.

The Study – After using Veye Chest for 1.5 years, the researchers analyzed 145 chest CTs with the AI tool and compared its performance against three radiologists’ consensus reads, finding that:

  • Veye Chest detected 130 nodules (80 true positives, 11 false negatives, 39 false positives)
  • That’s 88% sensitivity, a mean of 1.04 false positives per scan, and 95% negative predictive value
  • The radiologists and Veye Chest had different size measurements for 23 nodules
  • Veye Chest tended to overestimate nodule size (bigger than rads w/ 19 of the 23)
  • Veye Chest and the rads’ nodule composition measurements had a 95% agreement rate
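
For readers who want to check the math, the 88% sensitivity figure follows directly from the true positive and false negative counts above. Here's a minimal Python sketch (the function and variable names are ours for illustration, not from the study):

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of real nodules the tool caught: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


# Counts as reported in the study summary above.
veye_sensitivity = sensitivity(true_positives=80, false_negatives=11)
print(f"{veye_sensitivity:.0%}")  # prints "88%"
```

The false-positives-per-scan and negative predictive value figures depend on counting details that aren't listed in this summary, so the sketch doesn't try to reproduce them.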

The Verdict – The researchers found that Veye Chest “performs very well” and matched Aidence’s specifications. They also noted that the tool is “not good enough to replace the radiologist” and its nodule size overestimations could lead to unnecessary follow-up exams.

The Takeaway – This is a pretty positive study, considering how poorly many recent commercial AI studies have gone and understanding that no AI vendor would dare propose that their AI tools “replace the radiologist.” Plus, it provides the feedback that Aidence and other AI developers need to keep getting better. Given the lack of AI clinical evidence, let’s hope we see a lot more studies like this.
