A new Lancet Digital Health study could have become one of the many “AI rivals radiologists” papers that we see each week, but it instead served as an important lesson that traditional performance tests might not prove that AI models are actually safe for clinical use.
The Model – The team developed their proximal femoral fracture detection DL model using 45.7k frontal X-rays performed at Australia’s Royal Adelaide Hospital (w/ 4,861 fractures).
The Validation – They then tested it against a 4,577-exam internal set (w/ 640 fractures), a 400-exam subset of which (w/ 200 fractures) was also interpreted by five radiologists, and against an 81-image external validation set from Stanford.
The Results – All three tests produced results that a typical study might have viewed as evidence of high performance (a quick sketch of how these metrics are computed follows the list):
- The model outperformed the five radiologists (0.994 vs. 0.969 AUCs)
- It beat the best-performing radiologist’s sensitivity (95.5% vs. 94.5%) and specificity (99.5% vs. 97.5%)
- It generalized well with the external Stanford data (0.980 AUC)
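For readers who want to see what’s behind those headline numbers, here’s a minimal sketch of how AUC, sensitivity, and specificity are computed from binary fracture labels and model probability scores. The labels, scores, and 0.5 threshold are illustrative placeholders, not the study’s actual outputs.

```python
# Illustrative labels and probability scores (not the study's data)
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # 1 = fracture present
y_score = np.array([0.97, 0.10, 0.35, 0.88, 0.30, 0.05, 0.92, 0.40])

auc = roc_auc_score(y_true, y_score)             # threshold-free ranking metric

y_pred = (y_score >= 0.5).astype(int)            # apply an operating point (threshold)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                     # fraction of fractures caught
specificity = tn / (tn + fp)                     # fraction of fracture-free exams cleared

print(f"AUC={auc:.3f}  sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```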
The Audit – Despite the strong results, a follow-up audit revealed that the model might make some predictions for the wrong reasons, suggesting that it is unsafe for clinical deployment:
- One false negative X-ray included an extremely displaced fracture that human radiologists would catch
- X-rays featuring abnormal bones or joints had a 50% false negative rate, far higher than the reader set’s overall false negative rate (2.5%)
- Saliency maps showed that the AI’s decisions were almost never based on the outer region of the femoral neck, even in images where that region was clinically relevant (although it still often made the right diagnosis); see the saliency sketch after this list
- The model scored a high AUC with the Stanford data, but showed a substantial operating point shift (illustrated in the threshold sketch below)
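The saliency analysis above is the kind of audit step that can be reproduced with standard tooling. Below is a minimal Grad-CAM-style sketch in PyTorch showing where a CNN’s prediction comes from on an image; the torchvision ResNet backbone and random input are stand-ins, not the study’s model or data.

```python
# Minimal Grad-CAM-style saliency sketch (PyTorch); backbone and input are placeholders
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()            # stand-in CNN, not the paper's network
feats = {}

# Grab the feature maps from the last conv block (layer4 is ResNet-specific)
handle = model.layer4.register_forward_hook(lambda m, i, o: feats.update(map=o))

x = torch.randn(1, 3, 224, 224)                  # stand-in for a preprocessed X-ray
logits = model(x)
score = logits[0, logits[0].argmax()]            # class score to explain
grads = torch.autograd.grad(score, feats["map"])[0]   # d(score)/d(feature maps)

# Grad-CAM: weight each feature map by its mean gradient, ReLU, upsample, normalize
weights = grads.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["map"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
handle.remove()
# cam[0, 0] can now be overlaid on the image to check whether clinically relevant
# regions (e.g., the femoral neck) actually drive the prediction.
```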
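That last finding is worth unpacking: AUC is threshold-free, so a model can keep a high AUC on external data even while the threshold chosen on internal data no longer delivers the intended sensitivity and specificity. Here’s a minimal sketch with synthetic scores (not the study’s data) showing how a frozen threshold goes stale when the score distribution shifts.

```python
# Synthetic scores illustrating an operating-point shift: the external set keeps the
# same class separation (similar AUC), but every score is shifted, so a threshold
# frozen on internal data no longer behaves as intended.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y = np.r_[np.ones(500), np.zeros(500)]

s_internal = np.r_[rng.normal(2, 1, 500), rng.normal(0, 1, 500)]
s_external = np.r_[rng.normal(1, 1, 500), rng.normal(-1, 1, 500)]   # scores shifted by -1

# Choose the operating point on internal data, targeting ~95% sensitivity
fpr, tpr, thresholds = roc_curve(y, s_internal)
t = thresholds[np.argmax(tpr >= 0.95)]

def sens_spec(y_true, scores, threshold):
    pred = scores >= threshold
    sens = (pred & (y_true == 1)).mean() / (y_true == 1).mean()
    spec = (~pred & (y_true == 0)).mean() / (y_true == 0).mean()
    return sens, spec

for name, s in [("internal", s_internal), ("external", s_external)]:
    print(name, "AUC=%.3f sens=%.2f spec=%.2f" %
          (roc_auc_score(y, s), *sens_spec(y, s, t)))
# AUC barely moves, but sensitivity at the frozen threshold drops well below 95%.
```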
The Case for Auditing – Although the study might not have started with this goal, it ended up becoming an argument for more sophisticated preclinical auditing. It even led to a separate paper outlining the team’s algorithmic auditing process, which among other things suggested that AI users and developers should co-own audits.
The Takeaway
Auditing generally isn’t the most exciting topic in any field, but this study shows that it’s exceptionally important for imaging AI. It also suggests that audits might be necessary for achieving the most exciting parts of AI, like improving outcomes and efficiency, earning clinician trust, and increasing adoption.