Addressing Hidden Stratification | EMRs Not Replacing Imaging IT

“If zebras were stone cold killing machines, you might want to exclude zebras first.”

Dr. Luke Oakden-Rayner with an addendum to the old “when you hear hoofbeats, think horses, not zebras” diagnostic advice, suggesting that AI models need to find a way to spot the rare and deadly subtypes regardless of what they “sound” like.

Imaging Wire Sponsors

  • Carestream – Focused on delivering innovation that is life changing – for patients, customers, employees, communities and other stakeholders
  • Focused Ultrasound Foundation – Accelerating the development and adoption of focused ultrasound
  • Medmo – Helping underinsured Americans save on medical scans by connecting them to imaging providers with unfilled schedule time
  • Nuance – AI and cloud-powered technology solutions to help radiologists stay focused, move quickly, and work smarter
  • Pocus Systems – A new Point of Care Ultrasound startup, combining a team of POCUS veterans with next-generation genuine AI technology to disrupt the industry
  • Qure.ai – Making healthcare more accessible by applying deep learning to radiology imaging

The Imaging Wire

Addressing Hidden Stratification

“Medical AI testing is unsafe, and that isn’t likely to change anytime soon.” That’s how Dr. Luke Oakden-Rayner’s latest blog starts out before diving into the prevalence and dangers of hidden stratification in AI testing and how these same challenges could actually lead to regulation improvements.

Hidden Stratification – There can be different subsets within any medical condition that are both clinically and visually distinct (e.g. typical vs. atypical pneumonia, solid vs. subsolid lung tumors) and have fundamentally different patient care implications. However, unlike human doctors, who recognize these visual variations, AI systems struggle with them because they train on coarsely defined class labels and generally don’t account for unique cases in training and testing (thus hiding the stratification).

Distinct Subsets = Inconsistent Performance – Dr. Luke’s main point in this post and his associated pre-print is “these visually distinct subsets can seriously distort the decision making of AI systems, potentially leading to a major difference between performance testing results and clinical utility.”

Clinical Safety ≠ Average Performance – “Being as good as a human on average is not a strong predictor of safety. What matters far more is specifically which cases the models get wrong.” Although AI’s “lack of common sense” has come to be accepted outside of medicine (e.g. identifying dogs in the snow as “wolves”), it’s much more serious in healthcare.

FDA Loophole – The catch is that “average performance” is good enough to gain FDA approval. For example, a model that achieves the same 95% sensitivity/recall as radiologists in a head-to-head reader study would qualify for FDA clearance. That’s fine unless the model was never trained to spot “a rare and visually distinct cancer subtype making up 5% of all disease” that is aggressive and requires quick treatment. Because of this, Dr. Luke suggests it may be valuable for AI models to check for rare/dangerous subtypes first, before diagnosing more common ones.
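The arithmetic behind this loophole is easy to sketch. In the toy Python example below (all counts are invented for illustration), a model that misses every case of a rare subtype making up 5% of disease still posts a radiologist-level 95% overall sensitivity:

```python
# Hypothetical illustration of the "average performance" loophole.
# The counts are invented for this example, not taken from any real study.

def sensitivity(true_pos, false_neg):
    """Sensitivity (recall) = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

# 1,000 positive cases: 950 of a common subtype, 50 of a rare, aggressive one.
common_tp, common_fn = 950, 0   # model catches every common case...
rare_tp, rare_fn = 0, 50        # ...and misses every rare case

overall = sensitivity(common_tp + rare_tp, common_fn + rare_fn)
rare_only = sensitivity(rare_tp, rare_fn)

print(f"Overall sensitivity: {overall:.0%}")         # prints "Overall sensitivity: 95%"
print(f"Rare-subtype sensitivity: {rare_only:.0%}")  # prints "Rare-subtype sensitivity: 0%"
```

The headline number clears the approval bar described above, even though every patient with the rare, aggressive subtype is missed.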

Predicting Failure – One way to avoid poor performance on clinically important subsets is to identify, label, and test model performance on all possible variants. This stratified testing “tells us far more about the safety of this system than the overall or average performance for a medical task.” However, this can be difficult given the limited number of test cases and the many possible subclasses, so developers should identify which subsets are likely to be “underperformers” and specifically target them for further analysis.
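A minimal sketch of what stratified testing might look like in code — the subclass names and counts below are hypothetical, not from the post:

```python
# Stratified testing sketch: report sensitivity per subclass instead of one
# overall metric. Subclass labels and counts here are made up for illustration.
from collections import defaultdict

def stratified_sensitivity(cases):
    """cases: list of (subclass, model_found_disease) for disease-positive exams."""
    tp = defaultdict(int)      # true positives per subclass
    total = defaultdict(int)   # positive cases per subclass
    for subclass, found in cases:
        total[subclass] += 1
        tp[subclass] += found  # bool counts as 0/1
    return {s: tp[s] / total[s] for s in total}

# 100 typical cases (93 caught) and 10 atypical cases (only 2 caught)
test_set = (
    [("typical pneumonia", True)] * 93 + [("typical pneumonia", False)] * 7 +
    [("atypical pneumonia", True)] * 2 + [("atypical pneumonia", False)] * 8
)
by_subclass = stratified_sensitivity(test_set)
print(by_subclass)  # {'typical pneumonia': 0.93, 'atypical pneumonia': 0.2}
```

A respectable aggregate score would hide the atypical subclass, where the hypothetical model fails 8 times out of 10.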

Putting it into Practice – Operationalizing this will be hard at first, as doctors would have to “write out a complete schema for any and all medical AI tasks,” but once created, these schemas would only require rare updates. With a schema complete, developers would have to train their models on its rare and high-risk subtypes in order to achieve FDA clearance.

No Panacea – Since “there will always be subclasses and edge cases that we simply can’t test preclinically,” post-deployment monitoring would also be required, using AI audits to identify the reason for AI errors once in the clinic. The good news is, any errors reviewed through an audit can be folded into the AI schema, making the model more complete.

EMRs Not Replacing Imaging IT

Signify Research reassured the radiology community that, despite EMRs’ increased presence and influence, they will not replace imaging IT any time soon, suggesting that imaging IT’s role is secure because:

  • Imaging’s complexity (numerous modalities / departments / subspecialties, complex workflows) and the 30-year head start that imaging IT solutions have over EMR imaging solutions.
  • Although EMRs have cannibalized some RIS functions (e.g. order entry and scheduling), this shift led to a new wave of standalone radiology products that moved workflow tools back towards radiology PACS.
  • Enterprise imaging strategies have been slow to develop, limiting the cases where enterprise-wide EMR integration is relevant.
  • Even as enterprise imaging matures, Signify argues that it can co-exist with EMRs and even be complementary (e.g. combining radiology software with longitudinal patient data from the EMR).

The Wire

  • UC San Diego Health began clinical evaluations of a new ultrasound-based and ultrasound-guided technique to break up kidney stones, potentially serving as an alternative to shock wave lithotripsy (which requires X-ray guidance and can have complications). Using SonoMotion’s Break Wave technology, the approach applies acoustic energy to specific points in the stone, causing it to fracture into small fragments without damaging surrounding tissue.
  • Xoran Technologies scored $8 million in funding from an NIH matching grant program that it will use to support its xCAT IQ mobile CT imaging system (for cranial OR/ICU imaging) and develop a new robotic / intraoperative imaging system. The NIH will provide $4 million in R&D funding during the next three years that will be matched by $4 million in revenue-based funding from Decathlon Capital Partners (payments based on future revenues, not equity).
  • University of Waterloo researchers, working with the University Health Network and the Vector Institute, developed new AI-based software for pneumothorax (collapsed lung) detection in chest X-rays. The software searches a database of 550k chest X-rays (30k with pneumothorax) to find cases most similar to a patient’s X-ray, correctly identifying 75% of patients with pneumothorax in a recent study (vs. 50% for clinicians using only X-rays). The researchers plan to achieve 90% accuracy and integrate the pneumothorax tool into the Coral Review quality-assurance system used by UHN hospitals within the next year, then expand to other hospitals that use Coral Review if the UHN implementation is successful.
  • Nuance and Microsoft announced an expanded partnership to accelerate the delivery of ambient clinical intelligence (ACI) technologies used in “the exam room of the future where clinical documentation writes itself.” The partnership will combine Nuance’s healthcare speech recognition and processing solutions with Microsoft’s Azure, Azure AI, and Project EmpowerMD Intelligent Scribe Service to develop ACI-based solutions that capture patient-clinician conversations, integrate that data with contextual information from the EHR, and auto-populate the patient’s medical record in the system. Nuance will roll out the technology to an initial set of physician specialties in early 2020.
  • A report from IMV Medical Information Division published on AuntMinnie.com found that although radiation therapy is a stable market (2% avg annual procedure growth, 1.149m patients treated in 2019), RT imaging is growing at a faster pace. CT surpassed X-ray as the main RT imaging tool over the last decade, while the use of PET and MR for treatment planning has doubled during the same period to 30% and 24% of all treatment plans, respectively. Image-guided RT adoption also grew significantly over the last 15 years, from 15% of sites in 2004 to over 90% of RT sites today, with CT or conebeam CT (81% of sites), X-ray (60%), and electronic portal imaging devices (58%) leading IG-RT modalities.
  • NVIDIA and King’s College London are developing a new privacy-enhanced federated learning system that could allow imaging AI developers to train algorithms with data from multiple institutions without exposing private patient or site-specific data. This could help address the siloed nature of current imaging datasets that has challenged AI development until now. A model based on this federated learning system was able to perform brain tumor segmentation comparable to an algorithm trained on data from a centralized system, but without sharing institutional data.
  • A new JAMA study found that changes to CMS reimbursements intended to improve the value of diagnostic cardiovascular tests indeed led to a “considerable” decline in overall and low-value diagnostic CV testing (including imaging), while rates of high-value testing have increased slightly. The study of 5% of Medicare beneficiaries found that overall testing increased between 2000 and 2008 (275/1k patient years to 359/1k) and then declined through 2016 (316/1k), largely driven by changes in low-value tests like testing before low-risk surgeries (2.4% in 2000, 3.8% in 2008, 2.5% in 2016).
  • Intelerad launched its new Odyssey Workflow Solution, which combines Zebra Medical Vision’s “All in one” (AI1) clinical AI engine/apps with Intelerad’s radiology worklist solution, to allow AI-based image review and worklist prioritization within a connected workflow. The companies are targeting AI’s cost barrier, offering Odyssey with a pay-per-study model (vs. the standard up-front flat-fee models) that they believe will encourage AI adoption among a wider range of providers.
  • New research published in the American Journal of Roentgenology found that DBT screening outperforms FFDM screening for breast cancer detection regardless of tumor type, size, or grade of cancer. The retrospective study reviewed DBT (n = 9817) and FFDM (n = 14,180) exams, finding that DBT had a higher cancer detection rate for invasive cancers (2.8 vs 1.3), minimal cancers (2.4 vs 1.2), estrogen receptor–positive invasive cancers (2.6 vs 1.1), and node-negative invasive cancers (2.3 vs 1.1), but a statistically similar ratio of screen-detected invasive cancers to ductal carcinoma in situ (3.0 vs 2.6).
  • Google continued its big-name hiring spree, appointing former Obama health official Karen DeSalvo to its new chief health officer role, where she will advise Google’s healthcare strategy within its Verily life sciences business. DeSalvo joins Google just a few weeks after former FDA commissioner Robert M. Califf came aboard to lead health strategy/policy and just a year after David Feinberg (former Geisinger / UCLA Health president & CEO) was hired to lead Google Health.
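The federated learning approach in the NVIDIA / King’s College London item above can be illustrated with a toy federated-averaging loop. This is a generic sketch of the technique, not their actual system; the linear model, sites, and data below are all made up:

```python
# Toy federated averaging: each site trains on its own private data and shares
# only model weights with the server, never the underlying records.
# Model, sites, learning rate, and data are hypothetical.

def local_update(weights, site_data, lr=0.1):
    """One gradient-descent step on a site's private data (toy linear model)."""
    grad = [0.0] * len(weights)
    for x, y in site_data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    n = len(site_data)
    return [w - lr * g / n for w, g in zip(weights, grad)]

def federated_average(site_weights):
    """Central server averages the site models; raw data never leaves a site."""
    n_sites = len(site_weights)
    return [sum(ws) / n_sites for ws in zip(*site_weights)]

sites = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)],  # site A's private data
    [([1.0, 1.0], 5.0)],                      # site B's private data
]
global_w = [0.0, 0.0]
for _ in range(200):  # 200 communication rounds
    global_w = federated_average([local_update(global_w, d) for d in sites])
print(global_w)  # converges to approximately [2.0, 3.0]
```

Each round, sites fit their local data and the server averages the results, so the shared model benefits from every site’s cases without any site exposing its data.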

The Resource Wire

  • This Carestream Special Report details how providers can get the greatest ROI from their X-ray technology as radiography demands increase and budgets head the other direction.
  • Did you know that imaging patients are most likely to no-show for their procedures on Mondays and Saturdays? By partnering with Medmo, imaging centers can keep their schedules full, despite the inevitable Monday no-shows.
