Talk

Physics-informed categorization of data modalities used in AI-assisted drug discovery

Abstract

Recent advances in machine learning models have created an inflection point for molecular design, demanding creative ways of presenting and processing vast amounts of data. In this study, rather than focusing on data processing algorithms, we revisit the molecular representations typically seen in the field. We propose four categories of molecular representations (also called data modalities): sequential, topological, spatial, and temporal, inspired by the categorization of protein structures as primary, secondary, tertiary, and ensemble. Examples are provided from the literature and our own research. For instance, graph representations currently span from topological graphs to structural graphs, depending on whether 3D structures are used to compute node and edge information. Beyond individual categories, multimodal fusion strategies, following the definition of Stahlschmidt et al., have been proposed to combine multiple representations, and they often improve on single-modal methods. OmniMol, for example, is a late fusion strategy that uses representations such as images, text, and graphs. This physics-informed categorization of data modalities is proposed to help researchers systematically extract relevant information from available sources and accelerate molecular drug discovery.
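The distinction between topological and structural graphs can be illustrated with a minimal sketch. This is not code from the talk: the toy water molecule, its approximate coordinates, the distance cutoff, and the function names `topological_edges` and `structural_edges` are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy molecule (water); coordinates are approximate, in angstroms.
atoms = ["O", "H", "H"]
bonds = [(0, 1), (0, 2)]  # covalent bonds only: topological information
coords = [
    (0.000, 0.000, 0.000),   # O
    (0.957, 0.000, 0.000),   # H
    (-0.240, 0.927, 0.000),  # H
]


def topological_edges(bond_list):
    """Build edges from the bond list alone; no 3D information is used."""
    return {tuple(sorted(bond)) for bond in bond_list}


def structural_edges(xyz, cutoff):
    """Build edges between all atom pairs within `cutoff`, keyed by atom
    index pair, with the interatomic distance as the edge feature."""
    edges = {}
    for i in range(len(xyz)):
        for j in range(i + 1, len(xyz)):
            distance = math.dist(xyz[i], xyz[j])
            if distance <= cutoff:
                edges[(i, j)] = distance
    return edges


topo = topological_edges(bonds)
struct = structural_edges(coords, cutoff=2.0)  # assumed cutoff in angstroms
print(sorted(topo))
print(sorted(struct))
```

With this assumed cutoff, the structural graph gains an H-H edge that the bond-based topological graph lacks, and every structural edge carries a distance computed from the 3D coordinates.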
