Workshop paper

Learning Resilient Molecular Representations with Dynamic Multi-Modal Fusion

Abstract

Recent advances in machine learning have transformed molecular property prediction, with large-scale representation models trained on diverse modalities such as SMILES, SELFIES, and graph-based embeddings. While multi-modal fusion offers richer insights than unimodal approaches, traditional fusion methods often assign static importance weights to modalities, leading to redundancy and poor robustness when modalities are missing. We introduce a Dynamic Multi-Modal Fusion framework, a self-supervised approach that adaptively integrates heterogeneous molecular embeddings. The framework employs intra-modal gating for feature selection, inter-modal attention for adaptive weighting, and cross-modal reconstruction to enforce information exchange across modalities. Training is guided by progressive modality masking, enabling the fused representation to remain informative even when some inputs are absent. Preliminary evaluations on the MoleculeNet benchmark show that our method improves reconstruction and modality alignment while outperforming unimodal and naïve fusion baselines on downstream property prediction tasks. These results highlight the importance of dynamic gating, entropy-regularized attention, and reconstruction-driven learning in building robust molecular fusion models.
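
To make the fusion mechanism described above concrete, the sketch below illustrates one plausible way to combine intra-modal gating, inter-modal attention with an entropy regularizer, cross-modal reconstruction, and progressive modality masking. It is a minimal, hedged example, not the paper's actual implementation: all module names, dimensions, and loss terms (e.g., DynamicFusion, progressive_mask, dim=256) are illustrative assumptions.

```python
# Minimal sketch of the fusion ideas described in the abstract.
# All names, dimensions, and loss terms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFusion(nn.Module):
    """Fuses per-modality embeddings via intra-modal gating and
    inter-modal attention, with cross-modal reconstruction heads."""

    def __init__(self, num_modalities: int, dim: int = 256):
        super().__init__()
        # Intra-modal gating: element-wise feature selection per modality.
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid()) for _ in range(num_modalities)]
        )
        # Inter-modal attention: one scalar score per modality, softmax-normalized.
        self.score = nn.Linear(dim, 1)
        # Cross-modal reconstruction: predict each modality from the fused vector.
        self.decoders = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)]
        )

    def forward(self, embeddings, mask):
        # embeddings: list of (batch, dim) tensors, one per modality.
        # mask: (batch, num_modalities) binary tensor; 0 marks a masked/missing modality.
        gated = [g(e) * e for g, e in zip(self.gates, embeddings)]   # intra-modal gating
        stacked = torch.stack(gated, dim=1)                          # (batch, M, dim)
        scores = self.score(stacked).squeeze(-1)                     # (batch, M)
        scores = scores.masked_fill(mask == 0, float("-inf"))        # ignore missing modalities
        attn = F.softmax(scores, dim=1)                              # inter-modal attention weights
        fused = (attn.unsqueeze(-1) * stacked).sum(dim=1)            # (batch, dim)

        # Cross-modal reconstruction loss, computed only on modalities that are present.
        recon = torch.stack([dec(fused) for dec in self.decoders], dim=1)
        target = torch.stack(embeddings, dim=1)
        recon_loss = (F.mse_loss(recon, target, reduction="none").mean(-1) * mask).sum() / mask.sum()

        # Entropy regularizer discourages collapsing all weight onto one modality.
        entropy = -(attn.clamp_min(1e-8).log() * attn).sum(dim=1).mean()
        return fused, recon_loss, entropy


def progressive_mask(batch, num_modalities, drop_prob):
    """Progressive modality masking: drop_prob would be annealed upward during training."""
    mask = (torch.rand(batch, num_modalities) > drop_prob).float()
    # Always keep at least one modality per sample.
    mask[mask.sum(dim=1) == 0, 0] = 1.0
    return mask
```

In this sketch, missing modalities are excluded by setting their attention scores to negative infinity before the softmax rather than by zeroing weights afterwards, so the remaining weights still sum to one; the reconstruction loss is likewise averaged only over observed modalities.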