Radiology Puts ChatGPT to Work

ChatGPT has taken the world by storm since the AI technology was first introduced in November 2022. In medicine, radiology is taking the lead in putting ChatGPT to work to address the specialty’s many efficiency and workflow challenges. 

Both ChatGPT and GPT-4, the newest model underpinning it, are forms of AI known as large language models – neural networks trained on massive volumes of unlabeled text that learn to predict the structure and syntax of human language. 

A flood of papers has appeared in just the last week or so investigating ChatGPT’s potential:

  • ChatGPT could be used to improve patient engagement with radiology providers, such as by creating layperson reports that are more understandable, or by answering patient questions in a chatbot function, says an American Journal of Roentgenology article.
  • ChatGPT offered up accurate information about breast cancer prevention and screening to patients in a study in Radiology. But ChatGPT also gave some inappropriate and inconsistent recommendations – perhaps no surprise given that many experts themselves often disagree on breast screening guidelines.
  • ChatGPT was able to produce a report on a patient’s PET/CT scan – including technical terms like SUVmax and TNM stage – without special training, found researchers writing in the Journal of Nuclear Medicine.
  • GPT-4 translated free-text radiology reports into structured reports that lend themselves better to standardization and data extraction for research, in another paper published in Radiology. Best of all, the conversion cost about 10 cents per report.
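
For a sense of what that kind of free-text-to-structured conversion looks like in practice, here’s a minimal sketch using the OpenAI Python client. The prompt, JSON schema, and model name are illustrative assumptions, not the Radiology paper’s actual pipeline.

```python
# Minimal sketch: converting a free-text radiology report into a structured one
# via an LLM API. The prompt, schema, and model choice are illustrative
# assumptions, not the pipeline used in the Radiology paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FREE_TEXT_REPORT = """CT chest without contrast. Lungs: 6 mm nodule right upper lobe.
No pleural effusion. Heart size normal. Impression: indeterminate pulmonary nodule."""

PROMPT = (
    "Convert the following radiology report into JSON with the keys "
    "'modality', 'findings' (list of strings), and 'impression':\n\n"
    + FREE_TEXT_REPORT
)

response = client.chat.completions.create(
    model="gpt-4",                                   # model choice is an assumption
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,                                   # deterministic output for extraction
)
print(response.choices[0].message.content)
```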

Where is all this headed? A review article on AI in medicine in the New England Journal of Medicine gave the opinion – often stated in radiology – that AI has the potential to take over mundane tasks and give health professionals more time for human-to-human interactions. 

The authors compared the arrival of ChatGPT to the onset of digital imaging in radiology in the 1990s, offering a tantalizing future in which chatbots like ChatGPT and GPT-4 make today’s outdated workflows as obsolete as x-ray film file rooms and lost images – remember those?

The Takeaway

Radiology’s embrace of ChatGPT and GPT-4 is heartening given the specialty’s initially skeptical response to AI in years past. As arguably the most technologically advanced medical specialty, radiology is a fitting leader in putting this transformative technology to work – just as it was with digital imaging.

RadNet’s Path to AI Profit

There are plenty of bold forecasts about imaging AI’s long-term potential, but short-term projections of when AI startups will reach profitability are rarely disclosed and almost never bold. That’s why RadNet’s quarterly investor calls are proving to be such a valuable bellwether for the business of AI, and its latest briefing was no exception.

RadNet entered the AI arena with its 2020 acquisition of DeepHealth (~$20M) and solidified its AI presence in early 2022 by acquiring Aidence and Quantib (~$85M), but its AI business generated just $4.4M in revenue and booked a $24.9M pre-tax loss in 2022. 

Those numbers are likely typical for similar-sized AI companies. However, RadNet’s path towards AI revenue growth and breakeven operations might outpace most of its peers.

  • Looking to 2023, RadNet forecasts that its AI revenue will roughly quadruple to between $16M and $18M, while its Adjusted EBITDA loss narrows to between $9M and $11M.
  • By 2024, RadNet expects its AI division to generate $25M to $30M in revenue, allowing it to achieve AI profitability for the first time.

So how exactly is RadNet going to achieve 6x AI revenue growth and reach profitability within just two years? Patients are going to pay for it. 

RadNet expects its new direct-to-patient Enhanced Breast Cancer Detection (EBCD) service to generate between $11M and $13M in 2023 revenue, representing up to 72% of RadNet’s overall AI revenue and driving much of its AI profitability improvements. And EBCD’s nationwide rollout won’t be complete until Q3.

RadNet’s 2024 AI revenue and profit improvements will again rely on “substantial” EBCD growth, with some help from its Aidence and Quantib operations. Those improvements would offset delayed AI efficiency benefits that RadNet has “yet to really realize” due in part to slow radiologist adoption.

The Takeaway

The fact that RadNet expects to become one of imaging’s largest and most profitable AI companies within the next two years might not be surprising. However, RadNet’s reliance on patient payments to drive that growth is astounding, and it’s something to keep an eye on as AI vendors and radiology groups work on their own AI monetization strategies.

Radiology NLP’s Efficiency and Accuracy Potential

The last week brought two high-profile studies underscoring radiology NLP’s potential to improve efficiency and accuracy, showing how the language-based technology can give radiologists a reporting head start and allow them to enjoy the benefits of AI detection without the disruptions.

AI + NLP for Nodule QA – A new JACR study detailed how Yale New Haven Hospital combined AI and NLP to catch and report more incidental lung nodules in emergency CT scans without disrupting radiologists mid-shift. The quality assurance program used a CT AI algorithm to detect suspicious nodules and an NLP tool to analyze radiology reports, flagging only the cases that the AI marked as suspicious but the NLP tool marked as negative (a minimal sketch of this flagging rule appears after the list below).

  • The AI/NLP program processed 19.2k CT exams over an 8-month period, flagging just 50 cases (0.26%) for a second review.
  • Those flagged cases led to 34 reporting changes and 20 patients receiving follow-up imaging recommendations. 
  • Just as notably, this semi-autonomous process helped rads avoid “thousands of unnecessary notifications” for non-emergent nodules.
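
Here’s that discordance rule as a minimal Python sketch. The field names, labels, and data structure are illustrative assumptions rather than Yale’s actual implementation.

```python
# Flag only the exams where the CT AI found a suspicious nodule but the NLP
# read of the radiology report did not mention one (AI-positive / NLP-negative).
# Field names and example records are illustrative assumptions.
def flag_for_second_review(exams):
    flagged = []
    for exam in exams:
        ai_suspicious = exam["ai_nodule_detected"]          # CT AI algorithm output
        report_notes_nodule = exam["nlp_nodule_in_report"]  # NLP report-parser output
        if ai_suspicious and not report_notes_nodule:
            flagged.append(exam["accession"])
    return flagged

exams = [
    {"accession": "A1", "ai_nodule_detected": True,  "nlp_nodule_in_report": True},
    {"accession": "A2", "ai_nodule_detected": True,  "nlp_nodule_in_report": False},  # discordant
    {"accession": "A3", "ai_nodule_detected": False, "nlp_nodule_in_report": False},
]
print(flag_for_second_review(exams))  # -> ['A2'] gets a second review
```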

NLP Auto-Captions – JAMA highlighted an NLP model that automatically generates free-text captions describing CXR images, streamlining the radiology report writing process. A Shanghai-based team trained the model using 74k unstructured CXR reports labeled for 23 different abnormalities, and tested with 5,091 external CXRs alongside two other caption-generating models.

  • The NLP captions reduced radiology residents’ reporting times compared to when they used a normal captioning template or a rule-based captioning model (283 vs. 347 & 296 seconds), especially with abnormal exams (456 vs. 631 & 531 seconds). 
  • The NLP-generated captions also proved to be most similar to radiologists’ final reports (mean BLEU scores: 0.69 vs. 0.37 & 0.57; on 0-1 scale).
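
For readers unfamiliar with BLEU, it measures n-gram overlap between a generated caption and a reference text on a 0-to-1 scale. A minimal sketch using NLTK follows; the tokenization and smoothing choices are assumptions, not the paper’s exact evaluation protocol.

```python
# BLEU sketch: n-gram overlap between a generated caption and the radiologist's
# final wording. Tokenization and smoothing here are illustrative choices only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "no focal consolidation pleural effusion or pneumothorax".split()
candidate = "no focal consolidation or pleural effusion".split()

score = sentence_bleu(
    [reference],                                      # list of reference token lists
    candidate,                                        # hypothesis token list
    smoothing_function=SmoothingFunction().method1,   # avoids zero scores on short texts
)
print(round(score, 2))  # 0-1 scale; higher means closer to the reference report
```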

The Takeaway

These are far from the first radiology NLP studies, but the fact that these implementations improved efficiency (without sacrificing accuracy) or improved accuracy (without sacrificing efficiency) deserves extra attention at a time when trade-offs are often expected. Also, considering that everyone just spent the last month marveling at what ChatGPT can do, it might be a safe bet that even more impressive language and text-based radiology solutions are on the way.

Understanding AI’s Physician Influence

We spend a lot of time exploring the technical aspects of imaging AI performance, but little is known about how physicians are actually influenced by the AI findings they receive. A new Scientific Reports study addresses that knowledge gap, perhaps more directly than any other research to date. 

The researchers provided 233 radiologists (experts) and internal and emergency medicine physicians (non-experts) with eight chest X-ray cases each. The CXR cases featured correct diagnostic advice, but were manipulated to show different advice sources (generated by AI vs. by expert rads) and different levels of advice explanations (only advice vs. advice w/ visual annotated explanations). Here’s what they found…

  • Explanations Improve Accuracy – When the diagnostic advice included annotated explanations, both the IM/EM physicians and radiologists’ accuracy improved (+5.66% & +3.41%).
  • Non-Rads with Explainable Advice Rival Rads – Although the IM/EM physicians performed far worse than rads when given advice without explanations, they were “on par with” radiologists when their advice included explainable annotations (see Fig 3).
  • Explanations Help Radiologists with Tough Cases – Radiologists gained “limited benefit” from advice explanations with most of the X-ray cases, but the explanations significantly improved their performance with the single most difficult case.
  • Presumed AI Use Improves Accuracy – When advice was labeled as AI-generated (vs. rad-generated), accuracy improved for both the IM/EM physicians and radiologists (+4.22% & +3.15%).
  • Presumed AI Use Improves Expert Confidence – When advice was labeled as AI-generated (vs. rad-generated), radiologists were more confident in their diagnosis.

The Takeaway

This study provides solid evidence supporting the use of visual explanations, and bolsters the increasingly popular theory that AI can have the greatest impact on non-experts. It also revealed that physicians trust AI more than some might have expected – to the point where physicians who believed advice came from AI made more accurate diagnoses than when they were told the same advice came from a human expert.

However, more than anything else, this study seems to highlight the underappreciated impact of product design on AI’s clinical performance.

Acute Chest Pain CXR AI

Patients who arrive at the ED with acute chest pain (ACP) syndrome end up receiving a series of often-negative tests, but a new MGB-led study suggests that CXR AI might make ACP triage more accurate and efficient.

The researchers trained three ACP triage models using data from 23k MGH patients to predict acute coronary syndrome, pulmonary embolism, aortic dissection, and all-cause mortality within 30 days. 

  • Model 1: Patient age and sex
  • Model 2: Patient age, sex, and troponin or D-dimer positivity
  • Model 3: CXR AI predictions plus Model 2
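
As a rough illustration of how nested feature sets like these can be compared, here’s a sketch using a plain logistic regression on synthetic data. Treating the CXR AI output as just one more tabular input to Model 3 is an assumption about the setup, not the authors’ actual modeling code.

```python
# Compare three nested ACP triage models, each adding features to the last.
# Synthetic data and a plain logistic regression are used purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 15, n)
sex = rng.integers(0, 2, n)
biomarker_pos = rng.integers(0, 2, n)                # troponin or D-dimer positivity
ai_score = rng.uniform(0, 1, n)                      # stand-in for the CXR AI prediction
logit = 0.03 * (age - 60) + 0.6 * biomarker_pos + 2.5 * (ai_score - 0.5)
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # any 30-day ACP outcome (synthetic)

feature_sets = {
    "Model 1": np.column_stack([age, sex]),
    "Model 2": np.column_stack([age, sex, biomarker_pos]),
    "Model 3": np.column_stack([age, sex, biomarker_pos, ai_score]),
}
for name, X in feature_sets.items():
    proba = LogisticRegression(max_iter=1000).fit(X, outcome).predict_proba(X)[:, 1]
    print(name, "AUC:", round(roc_auc_score(outcome, proba), 2))
```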

In internal testing with 5.7k MGH patients, Model 3 predicted which patients would experience any of the ACP outcomes far more accurately than Models 2 and 1 (AUCs: 0.85 vs. 0.76 vs. 0.62), while maintaining performance across patient demographic groups.

  • At a 99% sensitivity threshold, Model 3 would have allowed 14% of the patients to skip additional cardiovascular or pulmonary testing (vs. Model 2’s 2%).

In external validation with 22.8k Brigham and Women’s patients, poor AI generalizability caused Model 3’s performance to drop dramatically, while Models 2 and 1 maintained their performance (AUCs: 0.77 vs. 0.76 vs. 0.64). However, fine-tuning with BWH’s own images significantly improved the performance of the CXR AI model (AUC from 0.67 to 0.74) and Model 3 (AUC from 0.77 to 0.81).

  • At a 99% sensitivity threshold, the fine-tuned Model 3 would have allowed 8% of BWH patients to skip additional cardiovascular or pulmonary testing (vs. Model 2’s 2%).
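
The 99%-sensitivity analysis itself is easy to reproduce in spirit: pick the highest score threshold that still catches 99% of true ACP outcomes, then count how many patients fall below it and could, in principle, defer further testing. Here’s a sketch on synthetic scores, not the study’s data.

```python
# Pick the highest threshold that keeps sensitivity >= 0.99, then report the
# fraction of patients scored below it (candidates for deferring further tests).
# Synthetic labels and scores are used for illustration only.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.15, 5000)                        # 1 = any 30-day ACP outcome
scores = np.clip(0.55 * y + rng.normal(0.3, 0.15, 5000), 0, 1)

fpr, tpr, thresholds = roc_curve(y, scores)
meets_99 = tpr >= 0.99                                 # operating points with >=99% sensitivity
threshold = thresholds[meets_99][0]                    # highest qualifying threshold
deferrable = np.mean(scores < threshold)               # patients triaged below the threshold
print(f"threshold = {threshold:.3f}, could defer testing: {deferrable:.1%}")
```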

The Takeaway

Acute chest pain is among the most common reasons for ED visits, but it’s also a major driver of wasted ED time and resources. Considering that most ACP patients undergo CXR exams early in the triage process, this proof-of-concept study suggests that adding CXR AI could improve ACP diagnosis and significantly reduce downstream testing.

Bayer Establishes AI Platform Leadership with Blackford Acquisition

Six months after becoming radiology’s newest AI platform vendor, Bayer accelerated its path towards AI leadership with its acquisition of Blackford Analysis.

The acquisition might prove to be among the most significant in imaging AI’s short history, combining Blackford’s many AI advantages (tech, expertise, relationships) with Bayer’s massive radiology presence and AI ambitions. 

Once the deal closes later this year, Blackford will operate independently under Bayer’s well-established “arm’s length” model, allowing Blackford to preserve its entrepreneurial culture while leveraging Bayer’s “experience, infrastructure and reach” to drive further expansion.

Bayer’s Calantic platform and team will operate separately from Blackford, providing Bayer customers with two distinct AI platforms to choose from, while giving Bayer two ways to drive its AI business forward. 

Although few would have predicted this acquisition, it makes sense given Bayer and Blackford’s relatively long history together and their complementary situations. 

  • Blackford was part of Bayer’s 2019 G4A digital health accelerator class
  • The companies have been working together to develop Calantic since 2020
  • Bayer has big AI goals, but its AI customer base and reputation were not yet established
  • Blackford’s AI customer base and reputation are solid, but it needed a new way to scale and a positive exit for its shareholders

Even fewer would have predicted that imaging contrast vendors would be the driving force behind AI’s next consolidation wave, especially given that Guerbet invested in Intrasense just last week. However, imaging contrast and imaging AI could serve increasingly interrelated (or alternative) roles in the diagnostic process, and there are surely advantages for Bayer and Guerbet in leading both areas.

Speaking of AI consolidation, it appears that all those 2023 AI consolidation forecasts are proving to be correct, while bringing some of radiology’s largest companies into an AI segment that’s historically been dominated by startups. It wouldn’t be surprising if that trend continued.

The Takeaway

Bayer and Blackford have been working on their AI strategies for years, and this acquisition appears to give both companies a much better chance of achieving long-term AI leadership. Considering that AI is still in its infancy and could eventually play a dominant role in radiology (and across healthcare), AI leadership might be a far more significant market position in the future than many can imagine today.

CXR AI’s Screening Generalizability Gap

A new European Radiology study detailed a commercial CXR AI tool’s challenges when used for screening patients with low disease prevalence, bringing more attention to the mismatch between how some AI tools are trained and how they’re applied in the real world.

The researchers used an unnamed commercial AI tool to detect abnormalities in 3k screening CXRs sourced from two healthcare centers (2.2% w/ clinically significant lesions), and had four radiology residents read the same CXRs with and without AI assistance, finding that the AI:

  • Produced a far lower AUROC than in its other studies (0.648 vs. 0.77–0.99)
  • Achieved 94.2% specificity, but just 35.3% sensitivity
  • Detected 12 of 41 pneumonia cases, 3 of 5 tuberculosis cases, and 9 of 22 tumors 
  • Only “modestly” improved the residents’ AUROCs (0.571–0.688 vs. 0.534–0.676)
  • Added 2.96 to 10.27 seconds to the residents’ average CXR reading times

The researchers attributed the AI tool’s “poorer than expected” performance to differences between the data used in its initial training and validation (high disease prevalence) and the study’s clinical setting (high-volume, low-prevalence screening).

  • More notably, the authors pointed to these results as evidence that many commercial AI products “may not directly translate to real-world practice,” urging providers facing this kind of training mismatch to retrain their AI or change their thresholds, and calling for more rigorous AI testing and trials.
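
To see why the prevalence gap matters in concrete terms, plugging the reported sensitivity and specificity into Bayes’ rule shows how the positive predictive value collapses when the same tool moves from a high-prevalence setting to the study’s 2.2%-prevalence screening cohort. This is a back-of-the-envelope sketch; the 30% training prevalence is an assumed figure for comparison.

```python
# Back-of-the-envelope PPV: the same sensitivity/specificity yields a very
# different positive predictive value once prevalence drops to screening levels.
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.353, 0.942            # sensitivity and specificity reported in the study
for prev in (0.30, 0.022):           # assumed high-prevalence setting vs. the screening cohort
    print(f"prevalence {prev:.1%}: PPV = {ppv(sens, spec, prev):.1%}")
# prints roughly 72% vs. 12% - the same model looks far less useful for screening
```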

These results also inspired lively online discussions. Some commenters cited the study as proof of the problems caused by training AI with augmented datasets, while others contended that the AI tool’s AUROC still rivaled the residents and its “decent” specificity is promising for screening use.

The Takeaway

We cover plenty of studies about AI generalizability, but most have explored bias due to patient geography and demographics, rather than disease prevalence mismatches. Even if AI vendors and researchers are already aware of this issue, AI users and study authors might not be, placing more emphasis on how vendors position their AI products for different use cases (and how they train them).

Guerbet’s Big AI Investment

Guerbet took a big step towards advancing its AI strategy, acquiring a 39% stake in French imaging software company Intrasense, and revealing ambitious future plans for their combined technologies.

Through Intrasense, Guerbet gains access to a visualization and AI platform and a team of AI integration experts to help bring its algorithms into clinical use. The tie-up could also create future platform and algorithm development opportunities and support the expansion of their technologies across Guerbet’s global installed base.

The €8.8M investment (€0.44/share, a 34% premium) could turn into a €22.5M acquisition, as Guerbet plans to file a voluntary tender offer for all remaining shares.

Even though Guerbet is a €700M company and Intrasense is relatively small (~€3.8M 2022 revenue, 67 employees on LinkedIn), this seems like a significant move given Guerbet’s increasing emphasis on AI.

What Guerbet lacked until now (especially since ending its Merative/IBM alliance) was a future AI platform – and Intrasense should help fill that void. 

If Guerbet goes on to acquire Intrasense outright, it would continue the recent AI consolidation wave, while adding contrast manufacturers to the growing list of previously unexpected AI startup acquirers (joining imaging center networks, precision medicine analytics companies, and EHR analytics firms). 

However, contrast manufacturers could play a much larger role in imaging AI going forward, considering the high priority that Bayer is placing on its Calantic AI platform.

The Takeaway

Guerbet has been promoting its AI ambitions for several years, and this week’s Intrasense investment suggests that the French contrast giant is ready to transition from developing algorithms to broadly deploying them. That will take a lot more work, but Guerbet’s scale and imaging expertise make it worth keeping an eye on if you’re in the AI space.

Federated Learning’s Glioblastoma Milestone

AI insiders celebrated a massive new study highlighting a federated learning AI model’s ability to delineate glioblastoma brain tumors with high accuracy and generalizability, while demonstrating FL’s potential value for rare diseases and underrepresented populations.

The UPenn-led research team went big: the study’s 71 sites across 6 continents made it the largest FL project to date, its 6,314 patients’ mpMRIs created the biggest glioblastoma (GBM) dataset ever, and its nearly 280 authors were the most we’ve seen in a published study. 

The researchers tested their final GBM FL consensus model twice – first using 20% of the “local” mpMRIs from each site that weren’t used in FL training, and second using 590 “out-of-sample” exams from 6 sites that didn’t participate in FL development.

The FL consensus model achieved significant improvements over an AI model trained on public data when delineating the three main GBM tumor sub-compartments that are most relevant for treatment planning:

  • Surgically targetable tumor core: +33% w/ local, +27% w/ out-of-sample
  • Enhancing tumor: +27% w/ local, +15% w/ out-of-sample
  • Whole tumor: +16% w/ local, +16% w/ out-of-sample
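
For readers new to the technique, the core federated learning loop is simple: each site trains on its own data, only model weights leave the institution, and a central server averages them. The sketch below shows a generic federated-averaging round on a toy linear model; it illustrates the concept, not the study’s actual aggregation scheme.

```python
# Generic federated averaging (FedAvg) on a toy linear model: sites train locally,
# the server averages their weights in proportion to local sample counts.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=10):
    """One round of local training via plain gradient descent on squared error."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(site_weights, site_sizes):
    """Server step: sample-size-weighted average of the sites' weights."""
    return np.average(np.stack(site_weights), axis=0, weights=np.asarray(site_sizes, float))

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])
sites = []
for n in (200, 500, 120):                           # three sites with different data volumes
    X = rng.normal(size=(n, 3))
    sites.append((X, X @ true_w + rng.normal(scale=0.1, size=n)))

global_w = np.zeros(3)
for _ in range(20):                                 # communication rounds
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    global_w = federated_average(local_ws, [len(y) for _, y in sites])
print(np.round(global_w, 2))  # ~[1.5, -2.0, 0.5], learned without pooling raw data
```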

The Takeaway

Federated learning’s ability to improve AI’s performance in new settings/populations while maintaining patient data privacy has become well established in the last few years. However, this study takes FL’s resume to the next level given its unprecedented scope and the significant complexity associated with mpMRI glioblastoma exams, suggesting that FL will bring a “paradigm shift for multi-site collaborations.”

The Mammography AI Generalizability Gap

The “radiologists with AI beat radiologists without AI” trend might have achieved mainstream status in Spring 2020, when the DM DREAM Challenge developed an ensemble of mammography AI solutions that allowed radiologists to outperform rads who weren’t using AI.

The DM DREAM Challenge had plenty of credibility. It was produced by a team of respected experts, combined eight top-performing AI models, and used massive training and validation datasets (144k & 166k exams) from geographically distant regions (Washington state, USA & Stockholm, Sweden).

However, a new external validation study highlighted one problem that many weren’t thinking about back then: ethnic diversity can have a major impact on AI performance, and the majority of women in both datasets were White.

The new study used an ensemble of 11 mammography AI models from the DREAM study (the Challenge Ensemble Model; CEM) to analyze 37k mammography exams from UCLA’s diverse screening program, finding that:

  • The CEM model’s UCLA performance declined from the previous Washington and Sweden validations (AUROCs: 0.85 vs. 0.90 & 0.92)
  • The CEM model improved when combined with UCLA radiologist assessments, but still fell short of the Sweden AI+rads validation (AUROCs: 0.935 vs. 0.942)
  • The CEM + radiologists model also achieved slightly lower sensitivity (0.813 vs. 0.826) and specificity (0.925 vs. 0.930) than UCLA rads without AI 
  • The CEM + radiologists method performed particularly poorly with Hispanic women and women with a history of breast cancer
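
As a rough picture of what an “ensemble plus radiologist” combination involves, here’s a sketch that averages per-exam scores from several models and then folds in the reader’s recall decision as one more input. The synthetic data and the logistic combination are illustrative assumptions, not the DREAM challenge’s actual method.

```python
# Ensemble-plus-reader sketch: average the member models' per-exam scores, then
# combine with the radiologist's recall decision. Synthetic data, illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_exams, n_models = 20_000, 11
cancer = rng.binomial(1, 0.01, n_exams)                         # screening-level prevalence
model_scores = np.clip(
    rng.normal(0.2 + 0.5 * cancer[:, None], 0.2, size=(n_exams, n_models)), 0, 1
)
rad_recall = rng.binomial(1, np.where(cancer == 1, 0.8, 0.08))  # reader recall decision

ensemble = model_scores.mean(axis=1)                            # simple score averaging
X = np.column_stack([ensemble, rad_recall])
combined = LogisticRegression().fit(X, cancer).predict_proba(X)[:, 1]

print("ensemble AUROC:         ", round(roc_auc_score(cancer, ensemble), 2))
print("ensemble + reader AUROC:", round(roc_auc_score(cancer, combined), 2))
```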

The Takeaway

Although generalization challenges and the importance of data diversity are everyday AI topics in late 2022, this follow-up study highlights how big a challenge they can be (regardless of training size, ensemble approach, or validation track record), and underscores the need for local validation and fine-tuning before clinical adoption. 

It also underscores how much we’ve learned in the last three years, as neither the 2020 DREAM study’s limitations statement nor critical follow-up editorials mentioned data diversity among the study’s potential challenges.
