The “radiologists with AI beat radiologists without AI” trend might have achieved mainstream status in Spring 2020, when the DM DREAM Challenge developed an ensemble of mammography AI solutions that allowed radiologists to outperform rads who weren’t using AI.
The DM DREAM Challenge had plenty of credibility. It was produced by a team of respected experts, combined eight top-performing AI models, and used massive training and validation datasets (144k & 166k exams) from geographically distant regions (Washington state, USA & Stockholm, Sweden).
However, a new external validation study highlighted one problem that many weren’t thinking about back then. Ethnic diversity can have a major impact on AI performance, and the majority of women in the two datasets were White.
The new study used an ensemble of 11 mammography AI models from the DREAM study (the Challenge Ensemble Model; CEM) to analyze 37k mammography exams from UCLA’s diverse screening program, finding that:
- The CEM model’s UCLA performance declined from the previous Washington and Sweden validations (AUROCs: 0.85 vs. 0.90 & 0.92)
- The CEM model improved when combined with UCLA radiologist assessments, but still fell short of the Sweden AI+rads validation (AUROCs: 0.935 vs. 0.942)
- The CEM + radiologists model also achieved slightly lower sensitivity (0.813 vs. 0.826) and specificity (0.925 vs. 0.930) than UCLA rads without AI
- The CEM + radiologists method performed particularly poorly with Hispanic women and women with a history of breast cancer
The Takeaway
Although generalization challenges and the importance of data diversity are everyday AI topics in late 2022, this follow-up study highlights how big of a challenge they can be (regardless of training size, ensemble approach, or validation track record), and underscores the need for local validation and fine-tuning before clinical adoption.
It also underscores how much we’ve learned in the last three years, as neither the 2020 DREAM study’s limitations statement nor critical follow-up editorials mentioned data diversity among the study’s potential challenges.