A new European Radiology study detailed a commercial CXR AI tool’s challenges when used for screening patients with low disease prevalence, bringing more attention to the mismatch between how some AI tools are trained and how they’re applied in the real world.
The researchers used an unnamed commercial AI tool to detect abnormalities in 3k screening CXRs sourced from two healthcare centers (2.2% w/ clinically significant lesions), and had four radiology residents read the same CXRs with and without AI assistance, finding that the AI:
- Produced a far lower AUROC than reported in the tool's earlier studies (0.648 vs. 0.77–0.99)
- Achieved 94.2% specificity, but just 35.3% sensitivity
- Detected just 12 of 41 pneumonia cases, 3 of 5 tuberculosis cases, and 9 of 22 tumors
- Only “modestly” improved the residents’ AUROCs (0.571–0.688 vs. 0.534–0.676)
- Added 2.96 to 10.27 seconds to the residents’ average CXR reading times
The researchers attributed the AI tool’s “poorer than expected” performance to differences between the data used in its initial training and validation (high disease prevalence) and the study’s clinical setting (high-volume, low-prevalence, screening).
- More notably, the authors pointed to these results as evidence that many commercial AI products “may not directly translate to real-world practice,” urging providers facing this kind of training mismatch to retrain their AI or change their thresholds, and calling for more rigorous AI testing and trials.
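The prevalence mismatch the authors describe can be made concrete with Bayes' rule: at a fixed sensitivity and specificity, positive predictive value falls sharply as disease prevalence drops. Below is a minimal sketch using the study's reported operating point (35.3% sensitivity, 94.2% specificity) and its 2.2% screening prevalence; the 30% figure standing in for a high-prevalence development dataset is an illustrative assumption, not a number from the study.

```python
def ppv(prevalence, sensitivity=0.353, specificity=0.942):
    """Positive predictive value via Bayes' rule.

    Sensitivity and specificity default to the CXR AI tool's
    reported operating point in the screening study.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# At the study's 2.2% screening prevalence, most AI flags are false alarms.
print(f"PPV at 2.2% prevalence: {ppv(0.022):.0%}")  # ~12%
# At an assumed 30% prevalence (a high-prevalence dev setting), PPV recovers.
print(f"PPV at 30% prevalence:  {ppv(0.30):.0%}")   # ~72%
```

The same operating point that looks reasonable on an enriched validation set yields roughly seven false positives for every true positive in a screening population, which is why the authors urge threshold changes or retraining rather than out-of-the-box deployment.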
These results also inspired lively online discussions. Some commenters cited the study as proof of the problems caused by training AI with augmented datasets, while others contended that the AI tool's AUROC still rivaled the residents' and that its "decent" specificity was promising for screening use.
The Takeaway
We cover plenty of studies about AI generalizability, but most have explored bias due to patient geography and demographics, rather than disease prevalence mismatches. Even if AI vendors and researchers are already aware of this issue, AI users and study authors might not be, placing more emphasis on how vendors position their AI products for different use cases (and how they train them).