Workshop paper

A Self-Supervised Framework for Robust Multi-Modal Molecular Representation Learning

Abstract

Molecular property prediction has greatly benefited from learned embeddings such as SMILES-based, SELFIES-based, and graph-derived representations. However, existing approaches often rely on a single modality or naïvely concatenate multiple modalities, which limits robustness and leads to failures under missing-modality conditions. In this work, we propose a novel self-supervised fusion framework, termed dynamic fusion, that dynamically integrates multiple molecular embeddings. The proposed framework employs intra-modal gating for feature selection, inter-modal attention for adaptive weighting, and cross-modal reconstruction to ensure information exchange. Through progressive modality masking during training, the dynamic fusion approach learns to generate fused embeddings that are resilient to missing modalities. We conduct preliminary evaluations of the proposed approach on MoleculeNet benchmarks and demonstrate superior performance in reconstruction, modality alignment, and downstream property prediction compared to unimodal baselines. Our findings highlight the importance of feature-level gating, entropy-regularized attention, and cross-modal reconstruction in achieving robust fusion.
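
To make the abstract's mechanisms concrete, the following is a minimal PyTorch sketch of a fusion module combining intra-modal gating, inter-modal attention, progressive modality masking, and cross-modal reconstruction. It is an illustrative assumption of how these pieces could fit together, not the authors' implementation: all names (DynamicFusion, mask_prob, hidden_dim) are hypothetical, modality embeddings are assumed to be projected to a shared dimension, gating is a per-feature sigmoid, inter-modal weights come from a softmax over learned scores, and missing modalities are simulated by randomly dropping modalities during training.

    # Illustrative sketch only; module and parameter names are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicFusion(nn.Module):
        def __init__(self, input_dims, hidden_dim=256, mask_prob=0.3):
            super().__init__()
            # Project each modality (e.g. SMILES, SELFIES, graph embeddings) to a shared space.
            self.projections = nn.ModuleList([nn.Linear(d, hidden_dim) for d in input_dims])
            # Intra-modal gating: per-feature sigmoid gates for feature selection.
            self.gates = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in input_dims])
            # Inter-modal attention: one learned score per modality, softmax-normalized.
            self.score = nn.Linear(hidden_dim, 1)
            # Cross-modal reconstruction heads: rebuild each modality from the fused vector.
            self.decoders = nn.ModuleList([nn.Linear(hidden_dim, d) for d in input_dims])
            self.mask_prob = mask_prob

        def forward(self, modalities):
            # modalities: list of tensors, each of shape (batch, input_dims[i])
            gated = []
            for x, proj, gate in zip(modalities, self.projections, self.gates):
                h = proj(x)
                gated.append(torch.sigmoid(gate(h)) * h)            # intra-modal gating
            h = torch.stack(gated, dim=1)                           # (batch, n_mod, hidden)

            # Progressive modality masking: randomly drop modalities during training so
            # the fused embedding stays useful when a modality is missing at test time.
            keep = torch.ones(h.size(0), h.size(1), device=h.device)
            if self.training:
                keep = (torch.rand_like(keep) > self.mask_prob).float()
            h = h * keep.unsqueeze(-1)

            # Inter-modal attention: adaptive weights over the available modalities.
            scores = self.score(h).squeeze(-1)                      # (batch, n_mod)
            scores = scores.masked_fill(keep == 0, -1e9)
            weights = F.softmax(scores, dim=1)
            fused = (weights.unsqueeze(-1) * h).sum(dim=1)          # (batch, hidden)

            # Cross-modal reconstruction heads rebuild every modality from the fused
            # vector; an entropy term on `weights` can regularize the attention.
            recons = [dec(fused) for dec in self.decoders]
            entropy = -(weights.clamp_min(1e-8).log() * weights).sum(dim=1).mean()
            return fused, recons, entropy

A training objective in this spirit would combine reconstruction losses between each recons[i] and the corresponding input embedding with the entropy regularizer on the attention weights, so that the model neither ignores nor over-relies on any single modality.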