Workshop paper

Carbon-m1: A Massive, Multi-Modal Synthetic Dataset for Complex Polymeric Materials

Abstract

Development of transformational AI models for polymers has been greatly hindered by the lack of large, comprehensive, multi-modal datasets that are licensed for both research and commercial use. The primary aim of this proposal is to address this unmet need by creating carbon-m1, a massive, multi-modal synthetic dataset for polymers and polymer-containing materials, to be released under an Apache 2.0 license. Carbon-m1 will seek to capture both the critical structural, sequence, and stochastic features of polymers and their characterization data, two crucial elements missing from existing efforts to tackle data challenges in polymer AI development.