CSE SPOTLIGHT: RETHINKING THE AORTIC VALVE, A NEW APPROACH TO A COMMON PROBLEM
344 - IMPROVING AORTIC STENOSIS CLASSIFICATION IN ECHOCARDIOGRAPHY: A MULTI-MODAL APPROACH
Saturday, October 26, 2024
2:55 PM – 3:00 PM PT
Room: 211
Background: For patients with aortic stenosis (AS), timely diagnosis is crucial for better outcomes. Diagnosis typically involves Doppler measurements from echocardiography (echo). Existing research on automated AS diagnosis mostly relies on deep learning techniques that use image data alone. However, AS is clinically assessed using a combination of echocardiogram measurements and patient history. In addition, few deep learning studies on AS have investigated how model performance is affected by observer variability in image acquisition and interpretation. To address these gaps, we propose MultiASNet, a multimodal approach that integrates data from echocardiogram reports with B-mode echo cine series during training to provide a more holistic assessment of AS. At test time, MultiASNet uses echo videos exclusively, streamlining the evaluation process. Additionally, we implement co-teaching, a well-established approach for handling noisy classification labels, to improve model robustness and reliability; an illustrative sketch of this step follows.
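The sketch below illustrates the standard co-teaching update referenced above, in which two networks each select their small-loss (likely clean-label) samples and pass them to the peer for its gradient step. It is a minimal, generic example in PyTorch: the model and optimizer names and the keep ratio are illustrative assumptions, not the authors' actual training code.

```python
import torch
import torch.nn.functional as F

def co_teaching_step(model_a, model_b, opt_a, opt_b, x, y, keep_ratio=0.8):
    """One generic co-teaching update (illustrative sketch).

    Each network ranks the batch by its own per-sample loss and hands the
    lowest-loss (presumably clean) samples to the peer network for training.
    """
    n_keep = max(1, int(keep_ratio * len(y)))

    # Per-sample losses, detached: used only to rank samples, not for gradients.
    loss_a = F.cross_entropy(model_a(x), y, reduction="none").detach()
    loss_b = F.cross_entropy(model_b(x), y, reduction="none").detach()

    idx_for_b = torch.argsort(loss_a)[:n_keep]  # samples A deems clean -> train B
    idx_for_a = torch.argsort(loss_b)[:n_keep]  # samples B deems clean -> train A

    opt_a.zero_grad()
    F.cross_entropy(model_a(x[idx_for_a]), y[idx_for_a]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(model_b(x[idx_for_b]), y[idx_for_b]).backward()
    opt_b.step()
```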
Methods and Results: We use clips of the parasternal long-axis and short-axis views at the level of the aortic valve from 2,627 echocardiograms in a tertiary care database. Relevant data from the corresponding reports are converted into tabular form. During training, the model takes both tabular and video data as input. Separate transformer encoders extract hidden tabular and visual spatio-temporal features from the tabular and video inputs, respectively. Tabular features relevant to the video are fused with the visual features using cross-attention, and a shared decoder classifies AS severity from both the video and video-tabular embeddings. The addition of video-tabular embeddings helps align the embedding space with the ground-truth labels. At test time, we skip the cross-attention and use only the video pathway to obtain the classification. During training, we also apply co-teaching to minimize the impact of inaccurate labels through collaborative learning between multiple models.
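As a minimal illustration of this fusion design, the PyTorch sketch below shows video tokens attending to report-derived tabular tokens via cross-attention, with a shared classifier head and a video-only pathway for inference. All module choices, dimensions, and names are assumptions for illustration, not the published MultiASNet architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch of video-tabular fusion with a shared decoder."""

    def __init__(self, dim=256, n_heads=4, n_classes=4):
        super().__init__()
        # Placeholder encoders standing in for the tabular and video transformers.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=2)
        self.tabular_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.decoder = nn.Linear(dim, n_classes)  # shared AS-severity classifier head

    def forward(self, video_tokens, tabular_tokens=None):
        v = self.video_encoder(video_tokens)            # (B, Tv, dim) visual features
        video_logits = self.decoder(v.mean(dim=1))      # video-only pathway (used at test time)
        if tabular_tokens is None:                      # inference: echo video only
            return video_logits
        t = self.tabular_encoder(tabular_tokens)        # (B, Tt, dim) tabular features
        fused, _ = self.cross_attn(query=v, key=t, value=t)  # video queries attend to tabular keys/values
        fused_logits = self.decoder(fused.mean(dim=1))  # video-tabular pathway (training only)
        return video_logits, fused_logits
```

In this sketch, supervising both outputs during training while dropping the tabular branch at inference mirrors the described setup, where report data guide the embedding space but only echo videos are needed for classification.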
MultiASNet’s performance for AS severity classification was evaluated against image-based, video-based, and multimodal models. Compared to all baseline models, MultiASNet demonstrated superior study-level balanced accuracy of 80.4 ± 0.05 and F1-score of 0.80 ± 0.01 (Table 1). This suggests that incorporating noise-robust and multimodal approaches may have the greatest impact at the study level, as both noisy labels and tabular data were collected and assessed in a study-level context.
Conclusion: Our study introduces MultiASNet, a comprehensive framework for improving AS severity classification at the study level. By integrating additional echo parameters and video features through a novel cross-attention mechanism, this approach surpasses the performance of current models and enhances the accuracy of AS severity classification.
Disclosure(s):
Andrea Fung, MSc: No financial relationships to disclose
Victoria Wu, BSc: No financial relationships to disclose