The Case for Algorithmic Audits

A new Lancet Digital Health study could have become one of the many “AI rivals radiologists” papers that we see each week, but it instead served as an important lesson that traditional performance tests might not prove that AI models are actually safe for clinical use.

The Model – The team developed their proximal femoral fracture detection DL model using 45.7k frontal X-rays performed at Australia’s Royal Adelaide Hospital (w/ 4,861 fractures).

The Validation – They then tested it against a 4,577-exam internal validation set (w/ 640 fractures), including a 400-exam reader subset (w/ 200 fractures) that five radiologists also interpreted, and against an 81-image external validation set from Stanford.

The Results – All three tests produced results that a typical study might have viewed as evidence of high performance:

  • The model outperformed the five radiologists (0.994 vs. 0.969 AUCs)
  • It beat the best performing radiologist’s sensitivity (95.5% vs. 94.5%) and specificity (99.5% vs 97.5%)
  • It generalized well with the external Stanford data (0.980 AUC)

The Audit – Despite the strong results, a follow-up audit revealed that the model might make some predictions for the wrong reasons, suggesting that it is unsafe for clinical deployment:

  • One false negative X-ray included an extremely displaced fracture that human radiologists would catch
  • X-rays featuring abnormal bones or joints had a 50% false negative rate, far higher than the reader set’s overall false negative rate (2.5%)
  • Salience maps showed that AI decisions were almost never based on the outer region of the femoral neck, even with images where that region was clinically relevant (but it still often made the right diagnosis)
  • The model scored a high AUC with the Stanford data, but showed a substantial model operating point shift
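
An operating point shift like the one above is easy to miss if a study only reports AUC. Here's a minimal sketch of the check, using purely synthetic scores (the distributions, thresholds, and helper functions below are illustrative, not from the study): fix the decision threshold on internal validation data, then re-measure sensitivity and specificity at that same threshold on the external data.

```python
import numpy as np

def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when flagging scores >= threshold."""
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    tn = np.sum(~preds & (labels == 0))
    return tp / np.sum(labels == 1), tn / np.sum(labels == 0)

def threshold_for_sensitivity(scores, labels, target_sens):
    """Lowest threshold achieving the target sensitivity on this set."""
    pos_scores = np.sort(scores[labels == 1])[::-1]
    k = int(np.ceil(target_sens * len(pos_scores))) - 1
    return pos_scores[k]

# Illustrative synthetic data: the external score distribution is shifted down,
# which preserves ranking (AUC) but moves the operating point.
rng = np.random.default_rng(0)
internal_labels = rng.integers(0, 2, 1000)
internal_scores = rng.normal(internal_labels * 2.0, 1.0)
external_labels = rng.integers(0, 2, 500)
external_scores = rng.normal(external_labels * 2.0 - 0.8, 1.0)

t = threshold_for_sensitivity(internal_scores, internal_labels, 0.95)
print("internal:", sens_spec_at(internal_scores, internal_labels, t))
print("external:", sens_spec_at(external_scores, external_labels, t))
# External sensitivity drops at the internally-chosen threshold, even though
# discrimination (AUC) barely changes — the signature of an operating point shift.
```

The point of the sketch: a site adopting a model can run this check on a modest local sample before go-live, rather than trusting the vendor's published threshold.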

The Case for Auditing – Although the study might not have started with this goal, it ended up becoming an argument for more sophisticated preclinical auditing. It even led to a separate paper outlining their algorithmic auditing process, which among other things suggested that AI users and developers should co-own audits.

The Takeaway

Auditing generally isn’t the most exciting topic in any field, but this study shows that it’s exceptionally important for imaging AI. It also suggests that audits might be necessary for achieving the most exciting parts of AI, like improving outcomes and efficiency, earning clinician trust, and increasing adoption.

Radiology’s AI ROI Mismatch

A thought-provoking JACR editorial by Emory’s Hari Trivedi MD suggests that AI’s slow adoption rate has little to do with its quality or clinical benefits, and a lot to do with radiology’s misaligned incentives.

After interviewing 25 clinical and industry leaders, the radiology professor and co-director of Emory’s HITI Lab detailed the following economic mismatches:

  • Private Practices value AI that improves radiologist productivity, allowing them to increase reading volumes without equivalent increases in headcount. That makes triage or productivity-focused AI valuable, but gives them no economic justification to purchase AI that catches incidentals, ensures follow-ups, or reduces unnecessary biopsies.
  • Academic centers or hospitals that own radiology groups have far more to gain from AI products that detect incidental/missed findings and then drive internal admissions, referrals, and procedures. That means their highest-ROI AI solutions often drive revenue outside of the radiology department, while creating more radiologist labor.
  • Community hospital emergency departments value AI that allows them to discharge or treat emergency patients faster, although this often doesn’t economically benefit their radiology departments or partner practices.
  • Payor/provider health systems (e.g. the VA, Intermountain, Kaiser) can be open to a broad range of AI, but they especially value AI that reduces costs by avoiding unnecessary tests or catching early signs of diseases.


The Takeaway

People tend to paint imaging AI with a wide brush (AI is… all good, all bad, a job stealer, or the future) and we’ve seen a similar approach to AI adoption barrier editorials (AI just needs… trust, reimbursements, integration, better accuracy, or the killer app). However, even if each of these adoption barriers is solved, it’s hard to see how AI could achieve widespread adoption if the groups paying for AI aren’t economically benefiting from it.

Because of that, Dr. Trivedi encourages vendors to develop AI that provides “returns” to the same groups that make the “investments.” That might mean that few AI products achieve widespread adoption on their own, but a diverse group of specialized AI products achieve widespread use across all radiology settings.

Creating a Cancer Screening Giant

A few days after shocking the AI and imaging center industries with its acquisitions of Aidence and Quantib, RadNet’s Friday investor briefing revealed a far more ambitious AI-enabled cancer screening strategy than many might have imagined.

Expanding to Colon Cancer – RadNet will complete its AI screening platform by developing a homegrown colon cancer detection system, estimating that its four AI-based cancer detection solutions (breast, prostate, lung, colon) could screen for 70% of cancers that are imaging-detectable at early stages.

Population Detection – Once its AI platform is complete, RadNet plans to launch a strategy to expand cancer screening’s role in population health, while making prostate, lung, and colon cancer screening as mainstream as breast cancer screening.

Becoming an AI Vendor – RadNet revealed plans to launch an externally-focused AI business that will lead with its multi-cancer AI screening platform, but will also create opportunities for RadNet’s eRAD PACS/RIS software. There are plenty of players in the AI-based cancer detection arena, but RadNet’s unique multi-cancer platform, significant funding, and training data advantage would make it a formidable competitor.

Geographic Expansion – RadNet will leverage Aidence and Quantib’s European presence to expand its software business internationally, as well as into parts of the US where RadNet doesn’t own imaging centers (RadNet has centers in just 7 states).

Imaging Center Upsides – RadNet’s cancer screening AI strategy will of course benefit its core imaging center business. In addition to improving operational efficiency and driving more cancer screening volumes, RadNet believes that the unique benefits of its AI platform will drive more hospital system joint ventures.

AI Financials – The briefing also provided rare insights into AI vendor finances, revealing that DeepHealth has been running at a $4M-$5M annual loss and adding Aidence / Quantib might expand that loss to $10M-$12M (seems OK given RadNet’s $215M EBITDA). RadNet hopes its AI division will become cash flow neutral within the next few years as revenue from outside companies ramps up.

The Takeaway

RadNet has very big ambitions to become a global cancer screening leader and significantly expand cancer screening’s role in society. Changing society doesn’t come fast or easy, but a goal like that reveals how much emphasis RadNet is going to place on developing and distributing its AI cancer screening platform going forward.

IBM Sells Watson Health

IBM is selling most of its Watson Health division to private equity firm Francisco Partners, creating a new standalone healthcare entity and giving both companies (IBM and the former Watson Health) a much-needed fresh start. 

The Details – Francisco Partners will acquire Watson Health’s data and analytics assets (including imaging) in a deal that’s rumored to be worth around $1B and scheduled to close in Q2 2022. IBM is keeping its core Watson AI tech and will continue to support its non-Watson healthcare clients.

Francisco’s Plans – Francisco Partners seems optimistic about its new healthcare company, revealing plans to maintain the current Watson Health leadership team and help the company “realize its full potential.” That’s not always what happens with PE acquisitions, but Francisco Partners has a history of growing healthcare companies (e.g. Availity, Capsule, GoodRx, Landmark Health) and there are a lot of upsides to Watson Health (good products, smart people, strong client list, a bargain M&A multiple, seems ideal for splitting up).

A Necessary Split – Like most Watson Health stories published over the last few years, news coverage of this acquisition overwhelmingly focused on Watson Health’s historical challenges. However, that approach seems lazy (or at least unoriginal) and misses the point that this split should be good news for both parties. IBM now has another $1B that it can use towards its prioritized hybrid cloud and AI platform strategy, and the new Watson Health company can return to growth mode after several years of declining corporate support.

Imaging Impact – IBM and Francisco Partners’ announcements didn’t place much focus on Watson Health’s imaging business, but it seems like the imaging group will also benefit from Francisco Partners’ increased support and by distancing itself from a brand that’s lost its shine. Even losing the core Watson AI tech should be ok, given that the Merge PACS team has increasingly shifted to a partner-focused AI strategy. That said, this acquisition’s true imaging impact will be determined by where the imaging group lands if/when Francisco Partners decides to eventually split up and sell Watson Health’s various units.

The Takeaway – The IBM Watson Health story is a solid reminder that expanding into healthcare is exceptionally hard, and it’s even harder when you wrap exaggerated marketing around early-stage technology and high-multiple acquisitions. Still, there’s plenty of value within the former Watson Health business, which now has an opportunity to show that value.

Right Diagnoses, Wrong Reasons

An AJR study shared new evidence of how X-ray image labels influence deep learning decision making, while revealing one way developers can address this issue.

Confounding History – Although already well known by AI insiders, label and laterality-based AI shortcuts made headlines last year when they were blamed for many COVID algorithms’ poor real-world performance. 

The Study – Using 40k images from Stanford’s MURA dataset, the researchers trained three CNNs to detect abnormalities in upper extremity X-rays. They then tested the models for detection accuracy and used a heatmap tool to identify the parts of the images that the CNNs emphasized. As you might expect, labels played a major role in both accuracy and decision making.

  • The model trained on complete images (bones & labels) achieved an 0.844 AUC, but based 89% of its decisions on the radiographs’ laterality/labels.
  • The model trained without labels or laterality (only bones) detected abnormalities with a higher 0.857 AUC and attributed 91% of its decision to bone features.
  • The model trained with only laterality and labels (no bones) still achieved an 0.638 AUC, showing that AI interprets certain labels as a sign of abnormalities. 
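
The authors' curation fix (covering labels before training) can be sketched as a simple preprocessing step. This is a hypothetical minimal version that assumes the laterality marker sits in a known corner of the radiograph; a real pipeline would locate labels automatically (e.g. with OCR or a detector) rather than hard-coding a region.

```python
import numpy as np

def mask_corner_label(image, frac=0.15, fill=0):
    """Blank a corner region where laterality/technologist labels often sit.

    `frac` is the fraction of each dimension to cover; the top-right corner
    is an assumption for this sketch, not a general rule.
    """
    h, w = image.shape[:2]
    masked = image.copy()
    masked[: int(h * frac), -int(w * frac):] = fill
    return masked

# Toy 8-bit "radiograph" with a bright marker burned into the top-right corner.
xray = np.full((256, 256), 40, dtype=np.uint8)
xray[5:30, 220:250] = 255  # simulated "L" marker

clean = mask_corner_label(xray, frac=0.15)
print(clean[5:30, 220:250].max())  # → 0, marker region blanked
print(clean[100, 100])             # → 40, anatomy untouched
```

Screening a trained CNN for this shortcut works the same way in reverse: run inference on label-only and bone-only crops, as the study did, and compare AUCs.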

The Takeaway – Labels are just about as common on X-rays as actual anatomy, and it turns out that they could have an even greater influence on AI decision making. Because of that, the authors urged AI developers to address confounding image features during the curation process (potentially by covering labels) and encouraged AI users to screen CNNs for these issues before clinical deployment.

The False Hope of Explainable AI

Many folks view explainability as a crucial next step for AI, but a new Lancet paper from a team of AI heavyweights argues that explainability might do more harm than good in the short-term, and AI stakeholders would be better off increasing their focus on validation.

The Old Theory – For as long as we’ve been covering AI, really smart and well-intentioned people have warned about the “black-box” nature of AI decision making and forecasted that explainable AI will lead to more trust, less bias, and greater adoption.

The New Theory – These black-box concerns and explainable AI forecasts might be logical, but they aren’t currently realistic, especially for patient-level decision support. Here’s why:

  • Explainability methods describe how AI systems work, not how decisions are made
  • AI explanations can be unreliable and/or superficial
  • Most medical AI decisions are too complex to explain in an understandable way
  • Humans over-trust computers, so explanations can hurt their ability to catch AI mistakes
  • AI explainability methods (e.g. heat maps) require human interpretation, risking confirmation bias
  • Explainable AI adds more potential error sources (AI tool + AI explanation + human interpretation)
  • Although we still can’t fully explain how acetaminophen works, we don’t question whether it works, because we’ve tested it extensively

The Explainability Alternative – Until suitable explainability methods emerge, the authors call for “rigorous internal and external validation of AI models” to make sure AI tools are consistently making the right recommendations. They also advised clinicians to remain cautious when referencing AI explanations and warned that policymakers should resist making explainability a requirement. 
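
As a concrete illustration of that internal/external validation loop, here's a minimal discrimination check on synthetic scores. The `auc` helper and the score distributions are illustrative assumptions, not from the paper; the point is simply that a measured external drop is actionable in a way an unreliable explanation isn't.

```python
import numpy as np

def auc(scores, labels):
    """Pairwise AUC: probability a random positive outscores a random
    negative, counting ties as half. O(n_pos * n_neg), fine for a sketch."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative internal vs. external validation with synthetic scores;
# the external site gets weaker class separation to mimic a generalization gap.
rng = np.random.default_rng(1)
y_int = rng.integers(0, 2, 800)
s_int = rng.normal(y_int * 1.5, 1.0)
y_ext = rng.integers(0, 2, 300)
s_ext = rng.normal(y_ext * 0.7, 1.0)

print(f"internal AUC: {auc(s_int, y_int):.3f}")
print(f"external AUC: {auc(s_ext, y_ext):.3f}")  # drop flags a generalization gap
```

In practice this would run on each deployment site's own labeled sample, and again periodically after go-live, which is exactly the kind of monitoring the authors favor over mandated explanations.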

Explainability’s Short-Term Role – Explainability definitely still has a role in AI safety, as it’s “incredibly useful” for model troubleshooting and systems audits, which can improve model performance and identify failure modes or biases.

The Takeaway – It appears we might not be close enough to explainable AI to make it a part of short-term AI strategies, policies, or procedures. That might be hard to accept for the many people who view the need for AI explainability as undebatable, and it makes AI validation and testing more important than ever.

ImageBiopsy Lab & UCB’s AI Alliance

Global pharmaceutical company UCB recently licensed its osteoporosis AI technology to MSK AI startup ImageBiopsy Lab, representing an interesting milestone for several emerging AI business models.

The UCB & ImageBiopsy Lab Alliance – ImageBiopsy Lab will use UCB’s BoneBot AI technology to develop and commercialize a tool that screens CT scans for “silent” spinal fractures to identify patients who should be receiving osteoporosis treatments. The new tool will launch by 2023 as part of ImageBiopsy Lab’s ZOO MSK platform.

UCB’s AI Angle – UCB produces an osteoporosis drug that would be prescribed far more often if detection rates improved (over 2/3 of vertebral fractures are currently undiagnosed). That’s why UCB developed and launched BoneBot AI in 2019 and it’s why the pharma giant is now working with ImageBiopsy Lab to bring it into clinical use.

The PharmaAI Trend – We’re seeing a growing trend of drug and device companies working with AI developers to help increase treatment demand. The list is getting pretty long, including quite a few PharmaAI alliances targeting lung cancer treatment (Aidence & AstraZeneca, Qure.ai & AstraZeneca, Huma & Bayer, Optellum & J&J) and a diverse set of AI alliances with medical device companies (Imbio & Olympus for emphysema, Aidoc & Inari for PE, Viz.ai & Medtronic for stroke).

The Population Health AI Trend – ImageBiopsy Lab’s BoneBot AI licensing is also a sign of AI’s growing momentum in population health, following increased interest from academia and major commercial efforts from Cleerly (cardiac screening) and Zebra Medical Vision (cardiac and osteoporosis screening… so far). This alliance also introduces a new type of population health AI beneficiary (pharma companies), in addition to risk holders and patients.

The Takeaway – ImageBiopsy Lab and UCB’s new alliance didn’t get a lot of media attention last week, but it tells an interesting story about how AI business models are evolving beyond triage, and how those changes are bringing some of healthcare’s biggest names into the imaging AI arena.

Who Owns AI Evaluation and Monitoring?

Imaging AI evaluation and monitoring just became even hotter topics, following a particularly revealing Twitter thread and a pair of interesting new papers.

Rads Don’t Work for AI – A Mayo Clinic Florida neuroradiologist took his case to Twitter after an FDA-approved AI tool missed 6 of 7 hemorrhages in a single shift and he was asked to make extra clicks to help improve the algorithm. No AI tool is perfect, but many folks commenting on this thread didn’t take kindly to the idea of being asked to do pro-bono work to improve an algorithm that they already paid for. 

AI Takes Work – A few radiologists with strong AI backgrounds clarified that this “extra work” is intended to inform the FDA about postmarket performance, while monitoring healthcare tools and providing feedback is indeed physicians’ job. They also argued that radiology practices should ensure that they have the bandwidth to monitor AI before deciding to adopt it.

The ACR DSI Gets It – Understanding that “AI algorithms may not work as expected when used beyond the institutions in which they were trained, and model performance may degrade over time” the ACR Data Science Institute (DSI) released a helpful paper detailing how radiologists can evaluate AI before and during clinical use. In an unplanned nod to the above Twitter thread, the DSI paper also noted that AI evaluation/monitoring is “ultimately up to the end users” although many “practices will not be able to do this on their own.” The good news is the ACR DSI is developing tools to help them.

DLIR Needs Evaluation Too – Because measuring whether DL-reconstructed scans “look good” or allow reduced dosage exams won’t avoid errors (e.g. false tumors or removed tumors), a Washington University in St. Louis-led team is developing a framework for evaluating DLIR tools before they are introduced into clinical practice. The new framework comes from some big-name institutions (WUSTL, NIH, FDA, Cleveland Clinic, UBC), all of whom also appear to agree that AI evaluation is up to the users.

The Takeaway – At least among AI insiders it’s clear that AI users are responsible for algorithm evaluation and monitoring, even if bandwidth is limited and many evaluation/monitoring tools are still being developed. Meanwhile, many AI users (who are crucial for AI to become mainstream) want their FDA-approved algorithms to perform correctly and aren’t eager to do extra work to help improve them. That’s a pretty solid conflict, but it’s also a silver lining for AI vendors who get good at streamlining evaluations and develop low-labor ways to monitor performance.

Bad AI Goes Viral

A recent mammography AI study review quickly evolved from a “study” to a “story” after a single tweet from Eric Topol (to his 521k followers), calling mammography AI’s accuracy “very disappointing” and prompting a new flow of online conversations about how far imaging AI is from achieving its promise. However, the bigger “story” here might actually be how much AI research needs to evolve.

The Study Review: A team of UK-based researchers reviewed 12 digital mammography screening AI studies (n = 131,822 women). The studies analyzed DM screening AI’s performance when used as a standalone system (5 studies), as a reader aid (3 studies), or for triage (4 studies).

The AI Assessment: The biggest public takeaway was that 34 of the 36 AI systems (94%) evaluated in three of the studies were less accurate than a single radiologist, and all were less accurate than the consensus of two or more radiologists. They also found that AI modestly improved radiologist accuracy when used as a reader aid and eliminated around half of negative screenings when used for triage (but also missed some cancers).

The AI Research Assessment: Each of the reviewed studies was “of poor methodological quality,” all were retrospective, and most studies had high risks of bias and high applicability concerns. Unsurprisingly, these methodology-focused assessments didn’t get much public attention.

The Two Takeaways: The authors correctly concluded that these 12 poor-quality studies found DM screening AI to be inaccurate, and called for better quality research so we can properly judge DM screening AI’s actual accuracy and most effective use cases (and then improve it). However, the takeaway for many folks was that mammography screening AI is worse than radiologists and shouldn’t replace them, which might be true, but isn’t very scientifically helpful.

Unsupervised COVID AI

MGH’s new pix2surv AI system can accurately predict COVID outcomes from chest CTs, and it uses an unsupervised design that appears to solve some major COVID AI training and performance challenges.

Background – COVID AI hasn’t exactly earned the best reputation (short history + high annotation labor → bad data → generalization issues), limiting most real world COVID analysis to logistic regression.

Designing pix2surv – pix2surv’s weakly unsupervised design and use of a generative adversarial network avoids these COVID AI pitfalls. It was trained directly on CTs from MGH’s COVID workflow (no labeling, no supervised training) and accurately estimates patient outcomes directly from their chest CTs.

pix2surv Performance – pix2surv accurately predicted the time of each patient’s ICU admission or death and applied the same analysis to stratify patients into high and low-risk groups. More notably, it “significantly outperformed” current laboratory tests and image-based methods with both predictions.

Applications – The MGH researchers believe pix2surv can be expanded to other COVID use cases (e.g. predicting Long COVID), as well as “other diseases” that are commonly diagnosed in medical images and might be hindered by annotation labor.

The Takeaway – pix2surv will require a lot more testing, and its chance of maintaining this type of performance across other sites and diseases might be a longshot (at least right away). However, pix2surv’s streamlined training and initial results are notable, and it would be very significant if a network like this was able to bring pattern-based unsupervised AI into clinical use.
