Bad AI Goes Viral

A recent review of mammography AI studies quickly evolved from a “study” to a “story” after a single tweet from Eric Topol (to his 521k followers) called mammography AI’s accuracy “very disappointing” and prompted a new wave of online conversations about how far imaging AI is from achieving its promise. However, the bigger “story” here might actually be how much AI research needs to evolve.

The Study Review: A team of UK-based researchers reviewed 12 digital mammography screening AI studies (n = 131,822 women). The studies analyzed DM screening AI’s performance when used as a standalone system (5 studies), as a reader aid (3 studies), or for triage (4 studies).

The AI Assessment: The biggest public takeaway was that 34 of the 36 AI systems (94%) evaluated in three of the studies were less accurate than a single radiologist, and all were less accurate than the consensus of two or more radiologists. They also found that AI modestly improved radiologist accuracy when used as a reader aid and eliminated around half of negative screenings when used for triage (but also missed some cancers).

The AI Research Assessment: Each of the reviewed studies was “of poor methodological quality,” all were retrospective, and most had a high risk of bias and high applicability concerns. Unsurprisingly, these methodology-focused assessments didn’t get much public attention.

The Two Takeaways: The authors correctly concluded that these 12 poor-quality studies found DM screening AI to be inaccurate, and called for better quality research so we can properly judge DM screening AI’s actual accuracy and most effective use cases (and then improve it). However, the takeaway for many folks was that mammography screening AI is worse than radiologists and shouldn’t replace them, which might be true, but isn’t very scientifically helpful.

Unsupervised COVID AI

MGH’s new pix2surv AI system can accurately predict COVID outcomes from chest CTs, and it uses an unsupervised design that appears to solve some major COVID AI training and performance challenges.

Background – COVID AI hasn’t exactly earned the best reputation (short history + high annotation labor > leading to bad data > creating generalization issues), limiting most real world COVID analysis to logistic regression.

Designing pix2surv – pix2surv’s weakly unsupervised design and use of a generative adversarial network avoid these COVID AI pitfalls. It was trained on CTs from MGH’s COVID workflow (no labeling, no supervised training) and accurately estimates patient outcomes directly from their chest CTs.

pix2surv Performance – pix2surv accurately predicted the time of each patient’s ICU admission or death and applied the same analysis to stratify patients into high and low-risk groups. More notably, it “significantly outperformed” current laboratory tests and image-based methods with both predictions.

Applications – The MGH researchers believe pix2surv can be expanded to other COVID use cases (e.g. predicting Long COVID), as well as “other diseases” that are commonly diagnosed in medical images and might be hindered by annotation labor.

The Takeaway – pix2surv will require a lot more testing, and its chance of maintaining this type of performance across other sites and diseases might be a longshot (at least right away). However, pix2surv’s streamlined training and initial results are notable, and it would be very significant if a network like this were able to bring pattern-based unsupervised AI into clinical use.

Veye Validation

A team of Dutch radiologists analyzed Aidence’s Veye Chest lung nodule detection tool, finding that it works “very well,” while outlining some areas for improvement.

The Study – After using Veye Chest for 1.5 years, the researchers analyzed 145 chest CTs with the AI tool and compared its performance against three radiologists’ consensus reads, finding that:

  • The comparison tallied 130 nodule findings (80 true positives, 11 false negatives, 39 false positives)
  • That’s 88% sensitivity, a 1.04 mean FP per-scan rate, and 95% negative predictive value
  • The radiologists and Veye Chest had different size measurements for 23 nodules
  • Veye Chest tended to overestimate nodule size (bigger than rads w/ 19 of the 23)
  • Veye Chest and the rads’ nodule composition measurements had a 95% agreement rate
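
For readers who want to check the math, here is a minimal Python sketch of how the 88% sensitivity follows from the counts above. It assumes sensitivity is defined the usual way, as true positives divided by all reference-standard nodules (true positives plus false negatives). The function and variable names are ours for illustration, not from the study. The 1.04 FP-per-scan rate and 95% NPV depend on denominators this summary doesn't spell out, so they aren't recomputed here.

```python
# Counts as reported in the bullet list above.
true_positives = 80
false_negatives = 11


def sensitivity(tp: int, fn: int) -> float:
    """Share of reference-standard nodules that the tool flagged."""
    return tp / (tp + fn)


veye_sensitivity = sensitivity(true_positives, false_negatives)
print(f"Sensitivity: {veye_sensitivity:.1%}")  # prints "Sensitivity: 87.9%"
```

That 87.9% rounds to the 88% quoted in the study.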

The Verdict – The researchers found that Veye Chest “performs very well” and matched Aidence’s specifications. They also noted that the tool is “not good enough to replace the radiologist” and its nodule size overestimations could lead to unnecessary follow-up exams.

The Takeaway – This is a pretty positive study, considering how poorly many recent commercial AI studies have gone and understanding that no AI vendor would dare propose that their AI tools “replace the radiologist.” Plus, it provides the feedback that Aidence and other AI developers need to keep getting better. Given the lack of AI clinical evidence, let’s hope we see a lot more studies like this.


-- The Imaging Wire team