One of the most exciting new use cases for medical AI is in generating radiology reports. But how can you tell whether the quality of a report generated by an AI algorithm is comparable to that of a radiologist?
In a new study in Patterns, researchers propose a technical framework for automatically grading the output of AI-generated radiology reports, with the ultimate goal of producing AI-generated reports that are indistinguishable from those of radiologists.
Most radiology AI applications so far have focused on developing algorithms to identify individual pathologies on imaging exams.
- While this is useful, helping radiologists streamline the production of their main output – the radiology report – could have a far greater impact on their productivity and efficiency.
But existing tools for measuring the quality of AI-generated narrative reports are limited and don’t match up well with radiologists’ evaluations.
- To improve that situation, the researchers scored AI-generated reports with several existing automated quality metrics and compared those scores against radiologists' own evaluations, seeking to pinpoint both where AI reports go wrong and where the metrics fail to reflect it.
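As a rough illustration of that comparison, the sketch below checks how strongly a hypothetical automated metric's scores track radiologist-assigned error counts for the same reports. The metric values, error counts, and the choice of Kendall's tau are assumptions made for illustration, not the study's actual data or analysis.

```python
# Illustrative sketch (not the study's code): testing whether an automated
# metric's per-report scores agree with radiologists' error counts.
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-report scores from an automated metric (higher = better)
metric_scores = np.array([0.72, 0.41, 0.88, 0.55, 0.63])

# Hypothetical error counts assigned by radiologists for the same reports
# (higher = worse), e.g. missed or falsely predicted findings
radiologist_errors = np.array([1, 4, 0, 3, 2])

# If the metric tracks radiologist judgment, scores should fall as error
# counts rise, i.e. a strongly negative rank correlation.
tau, p_value = kendalltau(metric_scores, radiologist_errors)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```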
Not surprisingly, the automated metrics fell short in several ways, failing to reliably capture the errors radiologists flag, such as falsely predicted findings, omitted findings, and findings assigned the wrong location or severity.
- These shortcomings underscore the need for better scoring systems to gauge AI performance.
The researchers therefore proposed a new metric for grading AI-generated report quality, called RadGraph F1, and a new methodology, RadCliQ, to predict how well an AI report would measure up to radiologist scrutiny.
- RadGraph F1 and RadCliQ could be used in future research on AI-generated radiology reports, and to that end the researchers have made the code for both metrics available as open source.
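For a sense of what an entity-overlap metric in the spirit of RadGraph F1 measures, here is a minimal sketch that computes an F1 score over clinical entities shared between a generated and a reference report. It assumes the entities have already been extracted (the real metric relies on the RadGraph model to extract clinical entities and relations), so the function and example data are illustrative rather than the researchers' released implementation.

```python
# Minimal sketch of an entity-overlap F1, loosely in the spirit of RadGraph F1.
# Assumes entities were already extracted from each report; the real metric
# uses the RadGraph model and also considers relations between entities.
def entity_f1(generated_entities: set[str], reference_entities: set[str]) -> float:
    """F1 overlap between entities in a generated and a reference report."""
    if not generated_entities or not reference_entities:
        return 0.0
    true_positives = len(generated_entities & reference_entities)
    precision = true_positives / len(generated_entities)
    recall = true_positives / len(reference_entities)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: entities from an AI report vs. a radiologist's report
ai_report = {"cardiomegaly", "pleural effusion", "right lower lobe opacity"}
ground_truth = {"cardiomegaly", "pleural effusion", "atelectasis"}
print(f"Entity F1: {entity_f1(ai_report, ground_truth):.2f}")  # ~0.67
```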
Looking further ahead, the researchers envision generalist medical AI models that could perform multiple complex tasks, such as conversing with radiologists and physicians about medical images.
- Another use case could be applications that are able to explain imaging findings to patients in everyday language.
The Takeaway
It’s a complex and detailed paper, but the new study is important because it outlines the metrics that can be used to teach machines how to generate better radiology reports. Given the imperative to improve radiologist productivity in the face of rising imaging volume and workforce shortages, this could be one more step on the quest for the Holy Grail of AI in radiology.