Grading AI Report Quality

One of the most exciting new use cases for medical AI is in generating radiology reports. But how can you tell whether the quality of a report generated by an AI algorithm is comparable to that of a radiologist?

In a new study in Patterns, researchers propose a technical framework for automatically grading the output of AI-generated radiology reports, with the ultimate goal of producing AI-generated reports that are indistinguishable from those of radiologists. 

Most radiology AI applications so far have focused on developing algorithms to identify individual pathologies on imaging exams. 

  • While this is useful, helping radiologists streamline the production of their main output – the radiology report – could have a far greater impact on their productivity and efficiency. 

But existing tools for measuring the quality of AI-generated narrative reports are limited and don’t match up well with radiologists’ evaluations. 

  • To improve that situation, the researchers applied several existing automated metrics for analyzing report quality and compared them to radiologists’ scores, seeking to better understand where these metrics fall short. 

Not surprisingly, the automated metrics fell short in several ways, failing to fully capture report errors such as falsely predicted findings, omitted findings, and incorrectly described finding location or severity. 

  • These shortcomings point out the need for better scoring systems for gauging AI performance. 

The researchers therefore proposed a new metric for grading AI-generated report quality, called RadGraph F1, and a new methodology, RadCliQ, to predict how well an AI report would measure up to radiologist scrutiny. 

  • RadGraph F1 and RadCliQ could be used in future research on AI-generated radiology reports, and to that end the researchers have made the code for both metrics available as open source.
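
To give a flavor of what metrics like these measure, here is a highly simplified sketch of the entity-overlap F1 idea. It is not the researchers’ open-source RadGraph F1 or RadCliQ code: the keyword-based extract_entities helper, the CLINICAL_TERMS list, and the example reports are illustrative stand-ins, whereas the actual metric relies on a trained model to extract clinical entities and relations.

```python
# Highly simplified illustration of an entity-overlap F1 for report grading.
# This is NOT the actual RadGraph F1 implementation: real entity extraction
# uses a trained model, while this toy version just matches a keyword list.

CLINICAL_TERMS = {"effusion", "pneumothorax", "consolidation", "cardiomegaly", "edema"}

def extract_entities(report: str) -> set:
    """Hypothetical stand-in for a clinical entity extractor."""
    words = {w.strip(".,").lower() for w in report.split()}
    return words & CLINICAL_TERMS

def entity_f1(generated: str, reference: str) -> float:
    """F1 overlap between entities found in a generated vs. a reference report."""
    gen, ref = extract_entities(generated), extract_entities(reference)
    if not gen and not ref:
        return 1.0  # neither report mentions a tracked entity: treat as agreement
    true_pos = len(gen & ref)
    precision = true_pos / len(gen) if gen else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one matched entity, one hallucinated, one omitted -> F1 = 0.5
print(entity_f1(
    "Small right pleural effusion. Possible consolidation.",
    "Small right pleural effusion. Mild cardiomegaly.",
))
```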

Ultimately, the researchers envision the construction of generalist medical AI models that could perform multiple complex tasks, such as conversing with radiologists and physicians about medical images. 

  • Another use case could be applications that are able to explain imaging findings to patients in everyday language. 

The Takeaway

It’s a complex and detailed paper, but the new study is important because it outlines the metrics that can be used to teach machines how to generate better radiology reports. Given the imperative to improve radiologist productivity in the face of rising imaging volume and workforce shortages, this could be one more step on the quest for the Holy Grail of AI in radiology.

Does ‘Automation Neglect’ Limit AI’s Impact?

Radiologists ignored AI suggestions in a new study because of “automation neglect,” a phenomenon in which humans place less weight on algorithmic recommendations than on their own judgment. The findings raise questions about whether AI really should be used as a collaborative tool by radiologists. 

How radiologists use AI predictions has become a growing area of research as AI moves into the clinical realm. Most use cases see radiologists employing AI in a collaborative role as a decision-making aid when reviewing cases. 

But is that really the best way to use AI? In a paper published by the National Bureau of Economic Research, researchers from Harvard Medical School and MIT explored how effectively radiologists perform when assisted by AI, in particular its impact on diagnostic quality.

They ran an experiment in which they manipulated radiologist access to predictions from the CheXpert AI algorithm for 324 chest X-ray cases, and then analyzed the results. They also assessed radiologist performance with and without clinical context. The 180 radiologists participating in the study were recruited from US teleradiology firms, as well as from a health network in Vietnam. 

It was expected that AI would boost radiologist performance, but instead accuracy remained unchanged:

  • AI predictions were more accurate than two-thirds of the radiologists
  • Yet, AI assistance failed to improve the radiologists’ diagnostic accuracy, as readers underweighted AI findings by 30% compared to their own assessments
  • Radiologists took 4% longer to interpret cases when either AI or clinical context was added
  • Adding clinical context to cases had a bigger impact on radiologist performance than adding AI interpretations

The findings show automation neglect can be a “major barrier” to human-AI collaboration. Interestingly, the new article seems to run counter to a previous study finding that radiologists who received incorrect AI results were more likely to follow the algorithm’s suggestions – against their own judgment. 

The Takeaway

The authors themselves admit the new findings are “puzzling,” but they do have intriguing ramifications. In particular, the researchers suggest that there may be limitations to the collaborative model in which humans and AI work together to analyze cases. Instead, it may be more effective to assign AI exclusively to certain studies, while radiologists work without AI assistance on other cases.

Can You Believe the AI Hype?

Can you believe the hype when it comes to marketing claims made for AI software? Not always. A new review in JAMA Network Open suggests that marketing materials for one-fifth of FDA-cleared AI applications don’t agree with the language in their regulatory submissions. 

Interest in AI for healthcare has exploded, creating regulatory challenges for the FDA due to the technology’s novelty. This has left many AI developers guessing how they should comply with FDA rules, both before and after products get regulatory clearance.

This creates the possibility of discrepancies between the products the FDA has cleared and how AI firms promote them. To investigate further, researchers from NYU Langone Health analyzed content from 510(k) clearance summaries and accompanying marketing materials for 119 AI- and machine learning (ML)-enabled devices cleared from November 2021 to March 2022. Their findings included:

  • Overall, AI/ML marketing language was consistent with 510(k) summaries for 80.67% of devices
  • Language was considered “discrepant” for 12.61% and “contentious” for 6.72% 
  • Most of the AI/ML devices surveyed (63.03%) were developed for radiology use; these had a slightly higher rate of consistency (82.67%) than the entire study sample

The authors provided several examples illustrating when AI/ML firms went astray. In one case labeled as “discrepant,” a developer touted the “cutting-edge AI and advanced robotics” in its software for measuring and displaying cerebral blood flow with ultrasound. But the product’s 510(k) summary never discussed AI capabilities, and the algorithm isn’t included on the FDA’s list of AI/ML-enabled devices.

In another case labeled as “contentious,” marketing materials for an ECG mapping software application mentioned that it included computational modeling and was a smart device, but required users to request a pamphlet from the developer for more information.

The Takeaway 

So, can you believe the AI hype? This study shows that most of the time you can, with a consistency rate of 80.67% – not bad for a field as new as AI (a fact acknowledged in an invited commentary on the paper). But the study’s authors suggest that “any level of discrepancy is important to note for consumer safety.” And for a technology that already has trust issues, it’s probably best that developers not push the envelope when it comes to marketing.

AI Investment Shift

VC investment in the AI medical imaging sector has shifted notably in the last couple of years, with money moving to later-stage companies, says a new report from UK market intelligence firm Signify Research. The report offers a fascinating look at an industry where almost $5B has been raised since 2015. 

Total Funding Value Drops – Both investors and AI independent software vendors (ISVs) have noticed reduced funding activity, and that’s reflected in the Signify numbers. VC funding of imaging AI firms fell 32% in 2022, to $750.4M, down from a peak of $1.1B in 2021.

Deal Volume Declines – The number of deals getting done has also fallen, to 42 deals in 2022, off 30% compared to 60 in 2021. In imaging AI’s peak year, 2020, 95 funding deals were completed. 

VC Appetite Remains Strong – Despite the declines, VCs still have a strong appetite for radiology AI, but funding has shifted from smaller early-stage deals to larger, late-stage investments. 

HeartFlow Deal Tips Scales – The average deal size has spiked this year to date, to $27.6M, compared to $17.9M in 2022, $18M in 2021, and $7.9M in 2020. Much of the higher 2023 number is driven by HeartFlow’s huge $215M funding round in April; Signify analyst Sanjay Parekh, PhD, told The Imaging Wire he expects the average deal value to fall to $18M by year’s end.

The Rich Get Richer – Much of the funding has been concentrated in a dozen or so AI companies that have each raised over $100M. Big winners include HeartFlow (over $650M), as well as Cleerly, Shukun Technology, and Viz.ai (over $250M). Signify’s $100M club is rounded out by Aidoc, Cathworks, Keya Medical, Deepwise Shenrui, Imagen Technologies, Perspectum, Lunit, and Annalise.ai.

US and China Dominate – On a regional basis, VC funding is going to companies in the US (almost $2B) and China ($1.1B). Following them are Israel ($513M), the UK ($310M), and South Korea ($255M).  

The Takeaway 

Signify’s report shows the continuation of trends seen in previous years that point to a maturing market for medical imaging AI. As with any such market, winners and losers are emerging, and VCs are clearly being selective about choosing which horses to put their money on.

Radiology Puts ChatGPT to Work

ChatGPT has taken the world by storm since the AI technology was first introduced in November 2022. In medicine, radiology is taking the lead in putting ChatGPT to work to address the specialty’s many efficiency and workflow challenges. 

Both ChatGPT and its newest iteration, GPT-4, are forms of AI known as large language models – essentially neural networks that are trained on massive volumes of unlabeled text and are able to learn on their own how to predict the structure and syntax of human language. 

A flood of papers has appeared in just the last week or so investigating ChatGPT’s potential:

  • ChatGPT could be used to improve patient engagement with radiology providers, such as by creating layperson reports that are more understandable, or by answering patient questions in a chatbot function, says an American Journal of Roentgenology article.
  • ChatGPT offered up accurate information about breast cancer prevention and screening to patients in a study in Radiology. But ChatGPT also gave some inappropriate and inconsistent recommendations – perhaps no surprise given that many experts themselves often disagree on breast screening guidelines.
  • ChatGPT was able to produce a report on a PET/CT scan of a patient – including technical terms like SUVmax and TNM stage – without special training, found researchers writing in Journal of Nuclear Medicine.
  • In another paper published in Radiology, GPT-4 translated free-text radiology reports into structured reports that better lend themselves to standardization and data extraction for research. Best of all, the service cost 10 cents a report (a rough sketch of this kind of API call follows below).
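
As a rough illustration of that last item, the sketch below shows how a free-text report might be handed to GPT-4 for restructuring via the OpenAI Python SDK. The prompt wording, the section headings, and the example report are assumptions for illustration only, not the method or prompt used in the Radiology paper.

```python
# Rough sketch: prompting GPT-4 to restructure a free-text chest X-ray report.
# The prompt and section names are illustrative, not taken from the cited study.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

free_text_report = (
    "Heart size normal. No focal consolidation, pneumothorax, or effusion. "
    "Degenerative changes of the thoracic spine."
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # favor deterministic, conservative output
    messages=[
        {
            "role": "system",
            "content": (
                "Convert free-text chest X-ray reports into a structured report "
                "with the sections Heart, Lungs, Pleura, Bones, and Impression. "
                "Do not add findings that are not in the original text."
            ),
        },
        {"role": "user", "content": free_text_report},
    ],
)

print(response.choices[0].message.content)
```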

Where is all this headed? A review article on AI in medicine in New England Journal of Medicine gave the opinion – often stated in radiology – that AI has the potential to take over mundane tasks and give health professionals more time for human-to-human interactions. 

The review’s authors compared the arrival of ChatGPT to the onset of digital imaging in radiology in the 1990s, and offered a tantalizing vision of a future in which chatbots like ChatGPT and GPT-4 replace outdated technologies like X-ray file rooms and lost images – remember those?

The Takeaway

Radiology’s embrace of ChatGPT and GPT-4 is heartening given the specialty’s initially skeptical response to AI in years past. As the most technologically advanced medical specialty, radiology is a fitting leader in putting this transformative technology to work – as it did with digital imaging.

Understanding AI’s Physician Influence

We spend a lot of time exploring the technical aspects of imaging AI performance, but little is known about how physicians are actually influenced by the AI findings they receive. A new Scientific Reports study addresses that knowledge gap, perhaps more directly than any other research to date. 

The researchers provided 233 radiologists (experts) and internal and emergency medicine physicians (non-experts) with eight chest X-ray cases each. The CXR cases featured correct diagnostic advice, but were manipulated to show different advice sources (generated by AI vs. by expert rads) and different levels of advice explanations (only advice vs. advice w/ visual annotated explanations). Here’s what they found…

  • Explanations Improve Accuracy – When the diagnostic advice included annotated explanations, both the IM/EM physicians and radiologists’ accuracy improved (+5.66% & +3.41%).
  • Non-Rads with Explainable Advice Rival Rads – Although the IM/EM physicians performed far worse than rads when given advice without explanations, they were “on par with” radiologists when their advice included explainable annotations (see Fig 3).
  • Explanations Help Radiologists with Tough Cases – Radiologists gained “limited benefit” from advice explanations with most of the X-ray cases, but the explanations significantly improved their performance with the single most difficult case.
  • Presumed AI Use Improves Accuracy – When advice was labeled as AI-generated (vs. rad-generated), accuracy improved for both the IM/EM physicians and radiologists (+4.22% & +3.15%).
  • Presumed AI Use Improves Expert Confidence – When advice was labeled as AI-generated (vs. rad-generated), radiologists were more confident in their diagnosis.

The Takeaway

This study provides solid evidence supporting the use of visual explanations, and bolsters the increasingly popular theory that AI can have the greatest impact on non-experts. It also revealed that physicians trust AI more than some might have expected, to the point where physicians who believed they were using AI made more accurate diagnoses than they would have if they were told the same advice came from a human expert.

However, more than anything else, this study seems to highlight the underappreciated impact of product design on AI’s clinical performance.

CXR AI’s Screening Generalizability Gap

A new European Radiology study detailed a commercial CXR AI tool’s challenges when used for screening patients with low disease prevalence, bringing more attention to the mismatch between how some AI tools are trained and how they’re applied in the real world.

The researchers used an unnamed commercial AI tool to detect abnormalities in 3k screening CXRs sourced from two healthcare centers (2.2% w/ clinically significant lesions), and had four radiology residents read the same CXRs with and without AI assistance, finding that the AI:

  • Produced a far lower AUROC than in its other studies (0.648 vs. 0.77–0.99)
  • Achieved 94.2% specificity, but just 35.3% sensitivity
  • Detected 12 of 41 pneumonia cases, 3 of 5 tuberculosis cases, and 9 of 22 tumors (a quick arithmetic check on these counts follows below)
  • Only “modestly” improved the residents’ AUROCs (0.571–0.688 vs. 0.534–0.676)
  • Added 2.96 to 10.27 seconds to the residents’ average CXR reading times
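
As a quick sanity check, the detection counts in the list reproduce the reported sensitivity, assuming the three listed finding types account for all of the clinically significant lesions (an assumption on our part; the study summary doesn’t say so explicitly):

```python
# Back-of-the-envelope check: do the listed detection counts reproduce the
# reported 35.3% sensitivity? Assumes pneumonia, tuberculosis, and tumors
# make up all clinically significant lesions, which the summary doesn't confirm.
detected = {"pneumonia": 12, "tuberculosis": 3, "tumor": 9}
present = {"pneumonia": 41, "tuberculosis": 5, "tumor": 22}

true_positives = sum(detected.values())       # 24 lesions flagged by the AI
all_positives = sum(present.values())         # 68 lesions present

sensitivity = true_positives / all_positives  # 24 / 68
print(f"Sensitivity: {sensitivity:.1%}")      # -> Sensitivity: 35.3%

# 68 positives among ~3,000 screening CXRs is also consistent with the
# reported ~2.2% prevalence of clinically significant lesions.
```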

The researchers attributed the AI tool’s “poorer than expected” performance to differences between the data used in its initial training and validation (high disease prevalence) and the study’s clinical setting (high-volume, low-prevalence, screening).

  • More notably, the authors pointed to these results as evidence that many commercial AI products “may not directly translate to real-world practice,” urging providers facing this kind of training mismatch to retrain their AI or change their thresholds, and calling for more rigorous AI testing and trials.

These results also inspired lively online discussions. Some commenters cited the study as proof of the problems caused by training AI with augmented datasets, while others contended that the AI tool’s AUROC still rivaled the residents and its “decent” specificity is promising for screening use.

The Takeaway

We cover plenty of studies about AI generalizability, but most have explored bias due to patient geography and demographics, rather than disease prevalence mismatches. Even if AI vendors and researchers are already aware of this issue, AI users and study authors might not be, placing more emphasis on how vendors position their AI products for different use cases (or how they train them).

Guerbet’s Big AI Investment

Guerbet took a big step towards advancing its AI strategy, acquiring a 39% stake in French imaging software company Intrasense, and revealing ambitious future plans for their combined technologies.

Through Intrasense, Guerbet gains access to a visualization and AI platform and a team of AI integration experts to help bring its algorithms into clinical use. The tie-up could also create future platform and algorithm development opportunities, and allow the expansion of their technologies across Guerbet’s global installed base.

The €8.8M investment (€0.44/share, a 34% premium) could turn into a €22.5M acquisition, as Guerbet plans to file a voluntary tender offer for all remaining shares.

Even though Guerbet is a €700M company and Intrasense is relatively small (~€3.8M 2022 revenue, 67 employees on LinkedIn), this seems like a significant move given Guerbet’s increasing emphasis on AI:

What Guerbet lacked until now (especially since ending its Merative/IBM alliance) was a future AI platform – and Intrasense should help fill that void. 

If Guerbet acquires Intrasense it would continue the recent AI consolidation wave, while adding contrast manufacturers to the growing list of previously-unexpected AI startup acquirers (joining imaging center networks, precision medicine analytics companies, and EHR analytics firms). 

However, contrast manufacturers could play a much larger role in imaging AI going forward, considering the high priority that Bayer is placing on its Calantic AI platform.

The Takeaway

Guerbet has been promoting its AI ambitions for several years, and this week’s Intrasense investment suggests that the French contrast giant is ready to transition from developing algorithms to broadly deploying them. That would take a lot more work, but Guerbet’s scale and imaging expertise make it worth keeping an eye on if you’re in the AI space.

Prioritizing Length of Stay

A new study out of Cedars Sinai provided what might be the strongest evidence yet that imaging AI triage and prioritization tools can shorten inpatient hospitalizations, potentially bolstering AI’s economic and patient care value propositions outside of the radiology department.

The researchers analyzed patient length of stay (LOS) before and after Cedars Sinai adopted Aidoc’s triage AI solutions for intracranial hemorrhage (Nov 2017) and pulmonary embolism (Dec 2018), using 2016-2019 data from all inpatients who received noncontrast head CTs or chest CTAs.

  • ICH Results – Among Cedars Sinai’s 1,718 ICH patients (795 after ICH AI adoption), average LOS dropped by 11.9% from 10.92 to 9.62 days (vs. -5% for other head CT patients).
  • PE Results – Among Cedars Sinai’s 400 patients diagnosed with PE (170 after PE AI adoption), average LOS dropped by a massive 26.3% from 7.91 to 5.83 days (vs. +5.2% for other CCTA patients). 
  • Control Results – Control group patients with hip fractures saw smaller LOS decreases during the respective post-AI periods (-3% & -8.3%), while hospital-wide LOS seemed to trend upward (-2.5% & +10%).

The Takeaway

These results were strong enough for the authors to conclude that Cedars Sinai’s LOS improvements were likely “due to the triage software implementation.” 

Perhaps more importantly, some could also interpret these LOS reductions as evidence that Cedars Sinai’s triage AI adoption also improved its overall patient care and reduced its inpatient operating costs, given how these LOS reductions were likely achieved (faster diagnosis & treatment), the typical association between long hospital stays and negative outcomes, and the significant impact that inpatient stays have on hospital costs.

Prostate MR AI’s Experience Boost

A new European Radiology study showed that Siemens Healthineers’ AI-RAD Companion Prostate MR solution can improve radiologists’ lesion assessment accuracy (especially less-experienced rads), while reducing reading times and lesion grading variability. 

The researchers had four radiologists (two experienced, two inexperienced) assess lesions in 172 prostate MRI exams, with and without AI support, finding that AI-RAD Companion Prostate MR improved:

  • The less-experienced radiologists’ performance, significantly (AUCs: 0.66 to 0.80 & 0.68 to 0.80)
  • The experienced rads’ performance, modestly (AUCs: 0.81 to 0.86 & 0.81 to 0.84)
  • Overall PI-RADS category and Gleason score correlations (r = 0.45 to 0.57)
  • Median reading times (157 to 150 seconds)

The study also highlights Siemens Healthineers’ emergence as an AI research leader, leveraging its relationship and funding advantages over AI-only vendors, and its potentially greater focus on AI research than its OEM peers, to become one of imaging AI’s most-published vendors.

The Takeaway

Given the role that experience plays in radiologists’ prostate MRI accuracy, and noting prostate MRI’s historical challenges with variability, this study makes a solid case for AI-RAD Companion Prostate MR’s ability to improve rads’ diagnostic performance (without slowing them down). It’s also a reminder that Siemens Healthineers is serious about supporting its homegrown AI portfolio through academic research.
