The Mammography AI Generalizability Gap

The “radiologists with AI beat radiologists without AI” trend might have achieved mainstream status in spring 2020, when the DM DREAM Challenge produced an ensemble of mammography AI models that helped radiologists outperform rads who weren’t using AI.

The DM DREAM Challenge had plenty of credibility. It was produced by a team of respected experts, combined eight top-performing AI models, and used massive training and validation datasets (144k & 166k exams) from geographically distant regions (Washington state, USA & Stockholm, Sweden).

However, a new external validation study highlighted one problem that many weren’t thinking about back then. Ethnic diversity can have a major impact on AI performance, and the majority of women in the two datasets were White.

The new study used an ensemble of 11 mammography AI models from the DREAM study (the Challenge Ensemble Model; CEM) to analyze 37k mammography exams from UCLA’s diverse screening program, finding that:

The CEM’s performance on the UCLA exams was lower than in the previous Washington and Sweden validations (AUROCs: 0.85 vs. 0.90 & 0.92)
The CEM improved when combined with UCLA radiologist assessments, but still fell short of the Sweden AI+rads validation (AUROCs: 0.935 vs. 0.942)
The CEM + radiologists model also achieved slightly lower sensitivity (0.813 vs. 0.826) and specificity (0.925 vs. 0.930) than UCLA rads without AI
The CEM + radiologists method performed particularly poorly with Hispanic women and women with a history of breast cancer
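
For readers who want to run this kind of check on their own screening data, here is a minimal sketch of a per-subgroup validation that reports the same three metrics (AUROC, sensitivity, specificity). It is not the study's code. The function names, the 0.5 operating threshold, and the toy rows are all assumptions made up for illustration, and real local validation would also need confidence intervals and far larger samples.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the share of (cancer, non-cancer) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative exam")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def by_subgroup(rows, threshold):
    """rows: iterable of (subgroup, label, score). Returns metrics per subgroup."""
    groups = defaultdict(list)
    for group, label, score in rows:
        groups[group].append((label, score))
    report = {}
    for group, pairs in groups.items():
        labels = [y for y, _ in pairs]
        scores = [s for _, s in pairs]
        area = auroc(labels, scores)  # raises if a class is missing
        sens, spec = sens_spec(labels, scores, threshold)
        report[group] = {"n": len(pairs), "auroc": area,
                         "sensitivity": sens, "specificity": spec}
    return report


# Toy data, invented for illustration only: (subgroup, cancer label, AI score).
ROWS = [
    ("A", 1, 0.9), ("A", 1, 0.8), ("A", 0, 0.3), ("A", 0, 0.1),
    ("B", 1, 0.6), ("B", 1, 0.2), ("B", 0, 0.4), ("B", 0, 0.1),
]

if __name__ == "__main__":
    for group, metrics in by_subgroup(ROWS, threshold=0.5).items():
        print(group, metrics)
```

In the toy rows, subgroup B comes out with a lower AUROC and lower sensitivity than subgroup A, which is the kind of gap a pooled, all-patients metric can hide.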

The Takeaway

Although generalization challenges and the importance of data diversity are everyday AI topics in late 2022, this follow-up study highlights how big a challenge they can be (regardless of training set size, ensemble approach, or validation track record), and underscores the need for local validation and fine-tuning before clinical adoption.

It also underscores how much we’ve learned in the last three years, as neither the 2020 DREAM study’s limitations statement nor critical follow-up editorials mentioned data diversity among the study’s potential challenges.
