IBM at the ACS Fall Meeting
The American Chemical Society's fall meeting will convene thousands of chemistry professionals and technologists to discuss the latest trends and advances in the field. This year's event theme is "harnessing the power of data," and IBM will present papers, posters, and demos of how we create technology that helps chemists and materials scientists harness the power of data.
Topics range from large-scale data ingestion and analysis and the use of foundation models for prediction, generation, and building assistants, to automated chemical synthesis executed in autonomous labs, as well as how these advances are driving research into more sustainable materials and drug discovery.
We invite attendees to visit our booths to speak with IBM Researchers and interact with demonstrations of our work.
For presentation times of workshops, demos, and papers, see the agenda section below. Note: All times are displayed in your local time.
We look forward to seeing you and telling you more about our latest work and career opportunities at IBM Research.
Learn more about Accelerated Discovery here.
Visit us at the IBM Research booth in the exhibitor area to meet IBM Researchers and speak about our work and future job opportunities.
Keep up with emerging research and scientific developments from IBM Research. Subscribe to the Future Forward Newsletter.
Symposium/Session: Algorithm Development and Data Analysis in Chemical Space
Abstract: Advances in deep learning and machine learning models combined with high-throughput experimentation have shown potential to accelerate chemical and materials discovery and highlighted the benefits of AI-assisted research practices. The recent advent of multi-domain and multi-task models trained by self-supervision, so-called foundation models, also holds promise for extending learnt representations across multiple fields, thus counteracting the reduced data availability in certain applications and benefiting from information exchange across domains. We propose extending this approach to chemical sensing. In this context, we leverage transfer learning based on fingerprints pretrained in other domains to model new instrument/sensor data representations. Herein, we demonstrate how the output of a model system comprising an integrated electrochemical sensor array for analysis of multi-component liquids can be encoded as image representations to leverage existing deep learning computer vision models pretrained on large collections of image data. The models effectively extract features from these representations and feed specific model heads to perform downstream tasks. More specifically, the raw potentiometric data from the sensor array is processed to yield a spectral response which is cleaned (moving average and SNV) and transformed into an image representation (Gramian Angular Field). Off-the-shelf features are generated by leveraging pretrained neural networks developed to classify natural images. Dimensionality reduction yields a set of features that are then used to train machine learning classification or regression heads. The pipeline was applied to generate visual fingerprints of multiple beverages, achieving full discrimination of liquid types and enabling class identification (mean accuracy ~95%) on a model dataset comprising 11 Italian wines. The results demonstrate the successful creation of a new representation of the chemical sensing space which achieves performance comparable to domain-specific hand-crafted feature selection. The present contribution represents an example of integrating data processing techniques and publicly available libraries/models to support the transfer of methodologies across domains.
Author(s): Gianmarco Gabrieli, Matteo Manica, Patrick Ruch
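To make the pipeline above concrete, here is a minimal sketch, not the authors' code: synthetic 1-D sensor spectra are encoded as Gramian Angular Field images, features are extracted with an ImageNet-pretrained ResNet standing in for the pretrained vision models mentioned in the abstract, and a small classification head is trained. All data, model choices, and class labels below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): encode 1-D sensor spectra as Gramian
# Angular Field images, extract features with a pretrained vision model, then
# train a lightweight classification head. All data here is synthetic.
import numpy as np
import torch
from torchvision import models
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def gramian_angular_field(x):
    """Gramian Angular Summation Field of a 1-D signal."""
    x = (x - x.min()) / (x.max() - x.min()) * 2 - 1      # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))                   # polar encoding
    return np.cos(phi[:, None] + phi[None, :])           # GASF image

rng = np.random.default_rng(0)
spectra = rng.normal(size=(40, 64))                      # 40 synthetic sensor spectra
labels = rng.integers(0, 3, size=40)                     # 3 hypothetical liquid classes

images = np.stack([gramian_angular_field(s) for s in spectra])

# Off-the-shelf features from an ImageNet-pretrained CNN (a stand-in for the
# pretrained models mentioned in the abstract).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()
with torch.no_grad():
    x = torch.tensor(images, dtype=torch.float32).unsqueeze(1).repeat(1, 3, 1, 1)
    feats = backbone(x).numpy()                          # (40, 512) feature vectors

head = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
head.fit(feats, labels)
print("training accuracy:", head.score(feats, labels))
```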
Symposium/Session: Data-driven Design of Energy Materials
Abstract: An accurate knowledge of potential energy surfaces and local forces is of paramount importance for molecular dynamics (MD) calculations. As the exact solution of the Schrödinger equation for electrons and nuclei quickly becomes impractical with growing system size, approximate methods have been developed in a delicate balance between performance and accuracy. Besides density functional theory and empirical force fields, machine learning has emerged as a novel and effective framework, leading to the family of so-called neural network potentials (NNPs). The success of classical NNPs is nowadays attested by several high-impact scientific works and by the development of dedicated software libraries. At the same time, the quantum mechanical character of the relationship between molecular configurations, energies, and forces immediately raises the question of whether quantum machine learning (QML) methods could provide even greater advantages. Inspired by this idea, our work aims at establishing a direct connection between quantum neural networks (QNNs) and molecular force fields. We carry out this program by designing a dedicated quantum neural network architecture and by applying it to different molecules of growing complexity. The quantum models exhibit a larger effective dimension with respect to classical counterparts, achieving competitive performance. Furthermore, we leverage the recently introduced framework of geometric QML to develop equivariant quantum neural networks that natively respect relevant sets of molecular symmetries upon input of Cartesian coordinates, thus enhancing trainability and generalization power. Notably, QML is now reaching a level of maturity at which the quest for non-trivial candidate problems -- both practically relevant and suitable to showcase quantum advantage over classical counterparts -- becomes of paramount importance. With our present contribution, we not only show that QNNs can adequately serve the purpose of generating molecular force fields, but we also suggest that this may constitute an appealing playground for testing and understanding the potential of QML techniques.
Author(s): Francesco Tacchino, Oriel Kiss, Isabel Nha Minh Le, Sofia Vallecorsa, Ivano Tavernelli
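For readers unfamiliar with neural network potentials, the short sketch below shows the classical idea the abstract builds on: a small network maps coordinates (via simple pairwise-distance descriptors) to an energy, and forces follow from automatic differentiation as F = -dE/dR. The quantum models described in the abstract would replace this network with a quantum neural network; the descriptor and architecture here are illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a classical neural-network potential (NNP): a small MLP maps
# atomic coordinates (through simple pairwise-distance descriptors) to an energy,
# and forces are obtained as the negative gradient via automatic differentiation.
# The abstract's quantum models would replace the MLP with a quantum neural network.
import torch

class TinyNNP(torch.nn.Module):
    def __init__(self, n_atoms):
        super().__init__()
        n_pairs = n_atoms * (n_atoms - 1) // 2
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(n_pairs, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
        )

    def forward(self, coords):                            # coords: (n_atoms, 3)
        diff = coords.unsqueeze(0) - coords.unsqueeze(1)
        dist = diff.norm(dim=-1)
        iu = torch.triu_indices(coords.shape[0], coords.shape[0], offset=1)
        descriptors = 1.0 / dist[iu[0], iu[1]]            # inverse pairwise distances
        return self.mlp(descriptors).squeeze()            # scalar energy

coords = torch.randn(5, 3, requires_grad=True)            # toy 5-atom configuration
model = TinyNNP(n_atoms=5)
energy = model(coords)
forces = -torch.autograd.grad(energy, coords)[0]          # F = -dE/dR
print(energy.item(), forces.shape)
```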
Symposium/Session: Helping Chemists Manage their Data
Abstract: In our evolving society, many problems such as climate change, sustainable energy systems, and pandemics require faster advances. In chemistry, scientific discovery also involves the critical task of assessing risks associated with proposed novel solutions before moving to the experimental stage. Fortunately, recent advances in machine learning and AI have proven successful in addressing some of these challenges. However, there remains a gap in technology that can support end-to-end discovery processes, which seamlessly integrate the vast array of available technologies into a flexible, coherent, and orchestrated system. These applications must manage complex knowledge at scale, enabling subject matter experts (SMEs) to efficiently consume and produce knowledge. Moreover, the discovery of novel functional materials heavily relies on the development of exploration strategies in chemical space. For instance, generative models have gained attention due to their ability to generate vast volumes of novel molecules across material domains. However, the high level of creativity these models exhibit often translates into low viability of the generated candidates. To address these challenges, we propose a workbench framework that facilitates human-AI co-creation, enabling SMEs to reduce time-to-discovery and the associated opportunity costs. This framework relies on a knowledge base with domain and process knowledge and on user-interaction components to acquire knowledge and advise the SMEs. The framework currently supports three main activities: generative modeling, dynamic dataset triage, and risk assessment.
Author(s): Emilio Ashton Vital Brazil, Renato Fontoura De Gusmao Cerqueira, Carlos Raoni De Alencar Mendes, Vinicius Segura, Juliana Jansen Ferreira, Dmitry Zubarev, Kristin Schmidt, Dan Sanders
Symposium/Session: Helping Chemists Manage their Data
Abstract: The FAIR principles aim to enhance the discovery and usage of digital objects by humans and computational agents. They are formulated at a high level and, as such, are interpreted and implemented differently by distinct communities of practice, which often have to collaborate, such as in the context of the use of chemicals in scientific discovery. Practical approaches outlining FAIR-related characteristics of digital objects are few and far between, and most of these are domain-agnostic, i.e., they do not consider scientific communities' varied needs and require specific implementations and combinations for better estimation. Questionnaires have been considered the main mechanism to systematically capture the implementation choices corresponding to each FAIR principle. However, existing questionnaires focus on FAIR assessment using identical questions for distinct communities, i.e., evaluating the digital objects in the same way and usually assuming that the digital objects have passed through a FAIRification process. In other words, they do not aim at characterizing digital objects, which would give a current overview of the properties that most contribute to their FAIRness. This work builds on the FAIR principles while considering distinct proposed metrics and tools for manual, automated, and semi-automated FAIRness assessment, like a questionnaire specifically designed to assess a plurality of interrelated scientific domains and their possible integration. It reports on applying an improved questionnaire aiming to characterize digital objects' properties towards their FAIRification on two materials databases: Materials Cloud and PubChem. We investigate the hypothesis that this questionnaire instills digital objects' characteristics with a richness of details about their current properties and outlines their main elements for FAIRification. We demonstrate that the improved questionnaire is a more suitable tool for both domain specialists and data stewards to investigate digital objects' characteristics and improve on them.
Author(s): Leonardo Guerreiro Azevedo, Julio Cesar Cardoso Tesolin, Gabriel Banaggia, Renato Fontoura De Gusmao Cerqueira
Symposium/Session: Simulation and Data Science Approaches to Design Biologically Relevant Polymers and their Applications
Abstract: In recent years, language models have disrupted multiple application domains, from natural language to chemistry and materials science. Since their inception, they have enabled a revolutionary way to hypothesize the design of novel materials, shown remarkable capabilities in modeling reactivity, and been successfully adopted in automating chemical synthesis planning. This talk will cover our recent research on applying language models to accelerate scientific discovery in chemistry, from small molecules to polymers and proteins. Our methodologies cover textual representations of molecules, natural language, and hybrid representations, which allow leveraging different data modalities to build holistic foundation models. Besides introducing the methodologies, we will also cover various applications of language models for material design and synthesis. By harnessing the power of language models and the growing availability of datasets, we can transform the discovery process at different stages, paving the way for a revolutionary computer-aided approach to designing, optimizing, and validating novel materials.
Author(s): Matteo Manica
Symposium/Session: Advances in Carbon Capture, Utilization, and Storage for a Sustainable Energy Future
Abstract: High-Throughput Computational Screening (HTCS) is an invaluable technique that has been used to sift through the growing number of candidate gas capture and separation materials compiled in databases during the last two decades. The screening workflow typically consists of loading the material structure from a Crystallographic Information File (CIF) and performing Grand Canonical Monte Carlo (GCMC) simulations of the adsorption behavior of molecules of interest. GCMC provides the equilibrated number of molecules that adsorb on each material at a given temperature and pressure. By sweeping a range of pressures at a fixed temperature, one obtains an adsorption isotherm.
In more advanced studies, the simulated isotherms are fed as input to a process-level optimization method that propagates the molecular-level performance metrics to the process scale. The process-level model covers both the equilibrium and kinetics aspects of adsorption, including mass transfer considerations. Sensitivity analysis shows that the process-level performance is heavily influenced by the adsorption kinetics.
In this work, we performed molecular- to process-level screening of ~1000 metal-organic frameworks (MOFs) for carbon capture. We simulated their adsorption isotherms and propagated their process-level performance, leading to a material ranking. We then took the top 10% of materials and investigated with classical Molecular Dynamics (MD) simulations how the adsorbate molecules diffuse into the system. We found that many apparently good carbon capture materials in fact had very low diffusivity, which severely impacts their real-world performance at the process level.
Finally, we propose a computational workflow that treats the diffusivity coefficient as a top-tier metric in HTCS studies going forward to accelerate the discovery of new sustainable materials for carbon capture.
Author(s): Felipe Lopes Oliveira, Rodrigo Neumann Barros Ferreira, Binquan Luan, Ashish B. Mhadeshwar, Jayashree Kalyanaraman, Anantha Sundaram, Joseph M. Falkowski, Jonathan R. Szlachta, Yogesh V. Joshi, Mathias Steiner
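As an illustration of the diffusivity analysis highlighted above, the following sketch estimates a self-diffusion coefficient from a synthetic molecular-dynamics-style trajectory via the Einstein relation; the trajectory, time step, and units are assumptions for demonstration only.

```python
# Minimal sketch (synthetic data): estimate a self-diffusion coefficient from a
# molecular-dynamics-style trajectory via the Einstein relation,
# MSD(t) ~ 6 * D * t in 3-D, using a least-squares fit to the linear regime.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_molecules, dt_ps = 2000, 50, 1.0              # 1 ps between frames
# Unwrapped random-walk trajectory as a stand-in for CO2 positions in a MOF pore.
steps = rng.normal(scale=0.05, size=(n_frames, n_molecules, 3))   # nm per frame
traj = np.cumsum(steps, axis=0)

lags = np.arange(1, 200)
msd = np.array([
    np.mean(np.sum((traj[lag:] - traj[:-lag]) ** 2, axis=-1)) for lag in lags
])                                                        # nm^2, averaged over origins and molecules

slope = np.polyfit(lags * dt_ps, msd, 1)[0]               # nm^2 / ps
D = slope / 6.0                                           # Einstein relation in 3-D
print(f"D ~ {D:.4f} nm^2/ps  ({D * 1e-2:.2e} cm^2/s)")
```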
Symposium/Session: Symposium on Materials for Lithium and Sodium Batteries
Abstract: Lithium-iodine batteries are among a class of next generation conversion-based chemistries that deliver high energy density using abundant, low-cost materials. There are two main obstacles facing these chemistries: the instability of the lithium anode that leads to capacity fade and the low utilization of the active material under practical cell conditions that leads to low specific capacity. In this work we address anode instability through chemical treatment of the surface layer to form a borate rich interphase that protects the lithium from parasitic reactions. We demonstrate that the properties of the treated lithium surface are highly dependent on the treatment environment and require precise tuning to achieve optimal performance. The stabilized lithium surface shows improved capacity retention in lithium-iodine cells at practical mass loadings (above 10 mg/cm²). Further, we explore the relationship between iodine utilization and mass-transport limitations. The results indicate that diffusion limited transport of the dissolved active material is the major source for the reduction of specific capacity with increasing iodine loading. These studies provide design rules for materials discovery to enable stable and high energy density conversion batteries.
Author(s): Murtaza Zohair, Maxwell Giammona, Linda Sundberg, Andy Tek, Anthony Fong, Khanh Nguyen, Vidushi Sharma, Holt Bui, Young-hye Na
Symposium/Session: Chemical Information Across the Chemistry Enterprise
Abstract: Molecular fragmentation has been frequently used for machine learning, molecular modeling, and drug discovery studies. However, current molecular fragmentation tools often produce large fragments that are useful for only a limited set of tasks. Specifically, long aliphatic chains, certain connected ring structures, fused rings, as well as various nitrogen-containing molecular entities often remain intact when using BRICS. With no known methods to solve this issue, we find that the fragments produced by BRICS are inflexible for tasks such as fragment-based machine learning, coarse-graining, and ligand-protein interaction assessment. In this work, we develop a revised BRICS (r-BRICS) module that allows more flexible fragmentation of a wider variety of molecules. We show that r-BRICS generates smaller fragments than BRICS, allowing localized fragment assessments. Furthermore, r-BRICS generates a fragment database with significantly more unique small fragments than BRICS, which is useful for fragment-based drug discovery, submolecular motif identification, and coarse-grained simulations.
Author(s): Leili Zhang
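For context, the snippet below runs RDKit's standard BRICS decomposition on two illustrative molecules: one that standard BRICS fragments readily, and one long-chain example of the kind the abstract notes tends to stay intact. r-BRICS itself is the authors' revised module and is not shown here.

```python
# Minimal sketch using RDKit's standard BRICS decomposition; r-BRICS is the authors'
# revised module and is not shown here.
from rdkit import Chem
from rdkit.Chem import BRICS

examples = {
    # An amide linking two ring systems: standard BRICS cleaves this readily.
    "amide": "c1ccccc1C(=O)NCCc1ccccc1",
    # A long aliphatic chain on a phenol: the kind of motif the abstract notes
    # tends to survive standard BRICS rules as one large fragment.
    "alkylphenol": "CCCCCCCCCCCCc1ccc(O)cc1",
}

for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    fragments = sorted(BRICS.BRICSDecompose(mol))
    print(name, fragments)   # dummy atoms ([n*]) mark attachment points where bonds were cut
```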
Symposium/Session: Past, Present and Future of AI and Predictive Analytics for Chemical Reactions
Abstract: The right solvent is a crucial factor in achieving environmentally friendly, selective chemical reactions with high conversion. While artificial intelligence-based computer-aided synthesis tools are capable of predicting starting materials and reactants for synthesizing a desired product, they often lack the ability to reliably predict reaction conditions such as the appropriate solvent. In this study, we demonstrate that data-driven machine-learning models can reliably predict the correct solvent for a broad spectrum of organic reactions. We extracted single-solvent reactions from two patent-derived datasets, Pistachio and the openly available USPTO dataset. We trained a BERT-based classifier and a random forest in combination with differential reaction fingerprints, achieving a Top-3 accuracy of up to 86.88% for predicting the most commonly used solvent, as well as a reliable prediction of underrepresented classes with an F1-macro score of up to 56.87%. An uncertainty analysis revealed that the models' misclassifications can often be explained by the fact that the reaction class of the reaction in question can be run in multiple solvents. These models are currently undergoing experimental validation in a campaign to test reactions that were successfully run in a solvent that differs from the one predicted by the model, in order to evaluate their real-world applicability. This work highlights the potential of data-driven approaches for addressing key challenges in organic synthesis, demonstrating the practical application of machine learning models in predicting reaction solvents for more efficient and sustainable chemical synthesis.
Author(s): Oliver Schilter, Carlo Baldassari, Teodoro Laino, Philippe Schwaller
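As a rough illustration of the classification setup, and explicitly not the authors' models, the sketch below trains a random forest on simple difference-of-Morgan-fingerprint reaction features as a stand-in for the differential reaction fingerprints and BERT classifier described in the abstract; the reactions and solvent labels are toy examples.

```python
# Minimal sketch of the classification setup (not the authors' models): a random
# forest trained on simple Morgan-fingerprint reaction features as a stand-in for
# the differential reaction fingerprints and BERT classifier in the abstract.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def reaction_features(rxn_smiles, n_bits=1024):
    """Difference of summed Morgan fingerprints: products minus reactants."""
    reactants, _, products = rxn_smiles.split(">")
    def fp_sum(side):
        total = np.zeros(n_bits)
        for smi in side.split("."):
            mol = Chem.MolFromSmiles(smi)
            total += np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits))
        return total
    return fp_sum(products) - fp_sum(reactants)

# Toy single-solvent reactions (reactants>>product) with solvent labels.
data = [
    ("CC(=O)O.OCC>>CC(=O)OCC", "toluene"),
    ("c1ccccc1Br.B(O)(O)c1ccccc1>>c1ccc(-c2ccccc2)cc1", "THF"),
    ("CC(=O)Cl.NCC>>CC(=O)NCC", "DCM"),
]
X = np.array([reaction_features(r) for r, _ in data])
y = [s for _, s in data]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:1]))
```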
Symposium/Session: Quantum Computing for Tackling Challenges in Quantum Chemistry
Abstract: In recent years, quantum computing has emerged as a promising platform for simulating strongly correlated systems in chemistry, for which the standard quantum chemistry methods are either qualitatively inaccurate or too expensive. However, due to the hardware limitations of the available noisy near-term quantum devices, their application is currently limited to small chemical systems. The range of applicability can be extended by means of hybrid classical-quantum embedding approaches, several of which have been put forward, each with different tradeoffs. In this talk, I will present a projection-based embedding method for combining the variational quantum eigensolver (VQE) algorithm, though not limited to it, with density functional theory (DFT). The developed VQE-in-DFT method was recently implemented in Qiskit and used to compute the triple bond breaking process in butyronitrile on an IBM quantum device. Our results show that the developed method is a promising approach for simulating systems with a strongly correlated fragment on a quantum computer. This development as well as its future extensions will benefit many different chemical areas, including computer-aided drug design as well as the study of metalloenzymes with strongly correlated components.
Author(s): Max Rossmannek, Fabijan Pavošević, Angel Rubio, Ivano Tavernelli
Symposium/Session: Enhance your Data - Smart Ways to Metadata and Knowledge Graphs
Abstract: Material discovery processes require knowledge bases that can aid domain experts in deriving new hypotheses and managing experimental data and workflows. Unfortunately, the manual creation of knowledge bases is a labor-intensive, time-consuming, and error-prone activity. The Knowledge Extraction Pipeline (KEP) is a human-in-the-loop pipeline to semi-automatically extract knowledge from scientific literature without the need for exhaustive manual data annotation. KEP is based on the idea that knowledge is extracted from sentences classified as relevant in a given document. It is composed of three tools. The Sentence Selection tool obtains text from PDFs and selects relevant sentences by using a Large Language Model (LLM). After expert curation of these sentences, the Knowledge Extraction tool extracts the desired knowledge by using the LLM's table-creation-from-unstructured-data capability. The expert again has the opportunity to curate the extracted knowledge before the Knowledge Representation tool creates an RDF graph representing the knowledge obtained from the relevant sentences.
The pipeline was applied to find PFAS and their applications in PDFs. Fifteen relevant sentences mentioning PFASs and applications plus 15 irrelevant sentences, together with their classifications, were provided as context to the LLM, which was then able to identify the relevant sentences in the documents. Its accuracy was 85%. Next, the "table creation from unstructured data" LLM use case was applied: 30 relevant sentences were annotated with tabular annotations highlighting the PFASs and applications mentioned in each sentence. These sentences and their annotations were provided as context to the LLM, which was able to provide the same kind of annotation for the other relevant sentences. Its accuracy in this task was 86%. The RDF knowledge base of PFAS and applications was created using the tabular annotations provided by the LLM. The use case demonstrated that KEP extracts relevant knowledge without the need for extensive manual annotation.
Author(s): Viviane Torres Da Silva, Breno William Santos Rezende De Carvalho, Marcelo Archanjo Jose, Sandro Rama Fiorini
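To illustrate the final Knowledge Representation step, here is a small sketch that turns curated (compound, application) rows, assumed to come out of the LLM-based extraction stages, into an RDF graph with rdflib; the namespace, property names, and example rows are hypothetical.

```python
# Minimal sketch of the Knowledge Representation step: turning curated
# (compound, application) rows, assumed to come out of the LLM-based extraction
# stages, into an RDF graph with rdflib. Names and URIs are illustrative only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/kep/")               # hypothetical namespace
rows = [                                                # hypothetical curated output
    {"compound": "PFOA", "application": "non-stick coatings"},
    {"compound": "PFOS", "application": "firefighting foams"},
]

g = Graph()
g.bind("ex", EX)
for row in rows:
    compound = EX[row["compound"]]
    application = EX[row["application"].replace(" ", "_")]
    g.add((compound, RDF.type, EX.PFAS))
    g.add((application, RDF.type, EX.Application))
    g.add((compound, EX.hasApplication, application))
    g.add((compound, RDFS.label, Literal(row["compound"])))

print(g.serialize(format="turtle"))
```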
Symposium/Session: Murphree Award in Industrial Chemistry: Symposium in Honor of Qinghuang Lin
Abstract: The discovery, development, and deployment of new materials provides new business opportunities and drives advances in high-value applications ranging from microelectronics to medicine. Polymer science continues to be perceived as a mature field, since most of the efficient polymer-forming reactions have been exploited. As advances in computational chemistry and AI systems continue, their influence on materials development across multiple length scales, on the creation and understanding of new polymer-forming reactions, on catalyst discovery, and on the modeling of supramolecular assemblies is becoming more pervasive. For example, catalysis is a foundational pillar for sustainable chemical processes; the discovery of highly active, environmentally benign catalytic processes is a central goal of Green Chemistry. Together with Robert Waymouth (Stanford University), we have developed a broad class of highly active, environmentally benign organic catalysts for the synthesis of biodegradable and biocompatible plastics, a development that was largely driven by the convergence of experimental and computational chemistries. Fundamental mechanistic and theoretical investigations have provided new scientific insights into the diversity of mechanistic pathways for organocatalytic polymerization reactions and the opportunities that these new insights have created for the synthesis of well-defined macromolecular architectures. The monomer feedstocks have focused on those from renewable resources such as lactides, lactones, and carbonates, but also on those from petrochemical feedstocks.
The recent advances in catalyst development that span many orders of magnitude over a large palette of monomers offer a unique opportunity for rapid materials discovery, as the future of materials research will be conducted in an AI-driven, automated laboratory. Historically at IBM, materials workflows have been tightly targeted at internal applications, such as lithography and interlayer dielectrics, that move rapidly towards devices. However, the commercialization of new materials in the general application space is traditionally very slow. From the discovery phase to market placement, materials development timelines are labor-intensive and require massive capital expenditure. To overcome this challenge, merging automated synthesis, high-throughput characterization, and predictive AI into a single pipeline offers the opportunity to dramatically accelerate materials development at a fraction of the traditional cost.
Author(s): James Hedrick
Symposium/Session: Chemical informatics (R)evolution: Towards Democratization and Open Science
Abstract: Chemical-structure databases that rely on manual data curation, while remaining an authoritative data source, struggle to scale with the increasing volume of (patent) documents. Publicly accessible chemical-structure databases that contain data processed in an automated fashion therefore become increasingly popular resources for enabling the accelerated discovery of new molecules.
This presentation discusses the chemical-structure database PatCID (Patent Chemical-structure Image Discovery) with more than TBD_number unique molecular structures that are displayed as 2D molecular structure images in documents from patent offices in the United States (USPTO), Europe, Japan (JPO), Korea, and China, published after 1980. We found that to have good coverage of the breadth of the organic chemistry domain, in particular processing of JPO documents in addition to USPTO documents was crucial. This is because, for example, about 70% of JPO patent families in the organic chemistry domain were not extended to the USPTO. The chemical-structure database will be made publicly accessible.
For this chemical-structure database, a new graph-based visual recognition model was developed to convert 2D molecular structure images into a standard machine-readable molecular description. The model comprises a deep keypoint detector and a graph neural network that classifies atoms and bonds. A substantial precision advantage over the often-used OSRA utility and over alternative deep learning approaches was obtained, especially for low-resolution and unconventional images frequently found in documents from patent offices in the Asia-Pacific region. The model architecture and a trained model will be made publicly available on GitHub.
A large-scale chemical-structure database can accelerate the discovery of new molecules. To substantiate this, we will present a methodology that helped discover new cyanine dye molecules. In an automated fashion, options for the cyanine dye's distinct substructures were catalogued and white space in the patent document landscape was identified. This supported chemistry experts, who otherwise would have faced an overwhelming number of documents to digest for knowledge extraction, in their decision making.
Author(s): Ingmar Meijer, Valery Weber, Lucas Morin, Peter Staar, Junta Fuchiwaki, Masataka Hirose
Symposium/Session: Chemical informatics (R)evolution: Towards Democratization and Open Science
Abstract: Grand Canonical Monte Carlo (GCMC) is a widely used method for simulating gas adsorption in nanoporous solids, including metal-organic frameworks (MOFs), zinc-imidazole frameworks (ZIFs), covalent organic frameworks (COFs), and zeolites. In these simulations, the framework-adsorbate interactions are modeled using a classical force field. The van der Waals energy is calculated using the Lennard-Jones potential with specific parameters, while electrostatics are determined using partial atomic charges. It is crucial to select appropriate force field parameters and partial charge assignment schemes, as they can significantly influence the simulation results. Recently, we presented the CRAFTED database, which contains approximately 50,000 simulated isotherms of CO2 and N2 on 690 MOF structures with a systematic selection of different force fields and temperatures. However, the current version of the database does not include purely organic structures, such as covalent organic frameworks (COFs).
Here we present an expansion of the CRAFTED database with simulated adsorption isotherms on 716 COF structures taken from the CURATED database. The simulations were performed for the adsorption of CO2 and N2 with twelve force fields obtained from all possible combinations of Lennard-Jones parameters taken from two models (UFF and DREIDING) and six partial charge schemes (no charge, Qeq, EQeq, DDEC, PACMOF, and MPNN). These simulations were performed at three temperatures (273, 298, and 323 K) and within pressure ranges of 0.001 to 1 bar for N2 and 0.001 to 10 bar for CO2. These new results introduce 51,552 adsorption isotherms (716 structures × 2 gases × 3 temperatures × 12 force fields) into CRAFTED, which doubles its current size and provides a more comprehensive representation of the diversity of reticular materials chemistry.
This expanded and more comprehensive dataset of adsorption isotherms enables a more detailed evaluation of the uncertainty introduced by the choice of force field in both molecular and process-level simulations. This, in turn, aids in the search for simulation schemes that have lower levels of uncertainty, allowing for the development of more accurate computational approaches for the search of new materials that can efficiently capture CO2.
Author(s): Felipe Lopes Oliveira, Conor Cleeton, Rodrigo Neumann Barros Ferreira, Binquan Luan, Lev Sarkisov, Mathias Steiner
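For orientation, the small sketch below enumerates the simulation grid described above, where each combination of gas, temperature, Lennard-Jones model, and partial-charge scheme yields one isotherm per structure; it simply reproduces the isotherm count.

```python
# Small sketch of how the simulation grid in this expansion is enumerated: every
# combination of structure, gas, temperature, Lennard-Jones model, and
# partial-charge scheme corresponds to one simulated isotherm.
from itertools import product

n_cofs = 716
gases = ["CO2", "N2"]
temperatures_K = [273, 298, 323]
lj_models = ["UFF", "DREIDING"]
charge_schemes = ["none", "Qeq", "EQeq", "DDEC", "PACMOF", "MPNN"]

grid = list(product(gases, temperatures_K, lj_models, charge_schemes))
print(len(grid), "conditions per structure")            # 2 * 3 * 2 * 6 = 72
print(n_cofs * len(grid), "isotherms in total")         # 51,552
```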
Symposium/Session: Symposium on Materials for Lithium and Sodium Batteries
Abstract: The advent of data-driven artificial intelligence and machine learning techniques has opened much larger design spaces for composite materials and mixtures with larger numbers of formulants. At the same time, performance benefits have recently been reported for so-called "high entropy" electrolytes with large numbers of formulants, resulting in a more diverse set of solvation structures and enhanced charge transport kinetics in batteries with nonaqueous liquid electrolytes and lithium metal anodes. Halogen-based battery cathodes have also attracted recent interest due to their high-rate capability (>1 mA/cm²) and moderately high (200-400 mAh/g) specific capacity. The performance of halogen cathodes, due to the solution-mediated nature of the conversion reactions involved, is closely coupled to both the electrolyte formulation and the solid-electrolyte interphase (SEI) layer formed on the lithium metal anode. Due to this strong dependence on electrolyte formulation and the many interrelated causal dependencies between cathode, electrolyte, and anode, halogen cathodes are an interesting application area for high-entropy electrolytes and for the use of AI to facilitate a more efficient survey of the relevant design space. In this work, we report on the use of a novel artificial intelligence platform and corresponding web application to survey a space consisting of 4 salts and 4 solvents, resulting in an optimized electrolyte formulation for a cell chemistry using an interhalogen cathode (I-Cl) and a lithium metal anode that outperforms any electrolyte formulation currently reported in the literature for this system. This result shows the potential of high-entropy electrolytes and the promise of AI to facilitate efficient searching of large formulation spaces.
Author(s): Maxwell Giammona, Vidushi Sharma, Tim Erdmann, Khanh Nguyen, Andy Tek
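As a generic illustration of model-guided formulation search, and not the authors' AI platform, the sketch below fits a Gaussian-process surrogate to a few synthetic electrolyte formulations (fractions of 4 salts and 4 solvents) and ranks untested candidate mixtures for the next round of experiments; all data and the acquisition rule are assumptions.

```python
# Generic sketch of model-guided formulation search (not the authors' AI platform):
# a Gaussian-process surrogate is fit to measured cell metrics for a few electrolyte
# formulations (fractions of 4 salts + 4 solvents) and used to rank untested
# candidate mixtures. All data below is synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_components = 8                                        # 4 salts + 4 solvents

def random_formulations(n):
    """Random compositions on the simplex (component fractions summing to 1)."""
    return rng.dirichlet(np.ones(n_components), size=n)

# A handful of "measured" formulations with a synthetic performance score.
X_train = random_formulations(12)
y_train = 1.0 - np.sum((X_train - 1.0 / n_components) ** 2, axis=1) + rng.normal(0, 0.01, 12)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_train, y_train)

# Rank a large pool of untested candidates by an optimistic acquisition score
# (mean + exploration bonus) and propose the best few for the next lab round.
candidates = random_formulations(5000)
mean, std = gp.predict(candidates, return_std=True)
best = np.argsort(mean + 1.0 * std)[::-1][:3]
print(candidates[best].round(3))
```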
The IBM Research booth will highlight recent advances in developing technology for accelerated scientific discovery. Interactive demonstrations will give you hands-on experience with our work.
Symposium/Session: Helping Chemists Manage their Data
Abstract: Learning from laboratory data at scale poses a bottleneck in research workflows. To overcome the bottleneck, we consider and propose solutions to the fundamental questions of who captured the data, what data was captured, when and where the data was captured, why the experiment was carried out, and how.
Electronic laboratory notebooks (ELNs) have traditionally been used to record answers to these questions. However, they impose a heavy burden on the researcher by requiring manual data entry. In addition, the records do not include information about the environment in which the researcher generated the data: for instance, which instruments were used, which software versions were run, and which actions led to the generated data. Thus, there is a need to improve traceability through the prototypical research workflow.
We propose a data management infrastructure that captures a richer representation of experimental workflows to address these concerns. The infrastructure is divided into a primary component deployed in one or more cloud platforms and a minimal component installed on local laboratory instruments. Data generated during a workflow is automatically tied to the corresponding action, thereby improving the reproducibility of experiments and traceability, and automating data entry. Each experimental step, the related workflows, and data are readily accessible through cloud-based services. In turn, the infrastructure provides a framework for systematic and homogeneous data collection to facilitate the application of machine learning (ML) to experimental data. This alleviates the burden of recording experimental data from the researcher while providing a framework for ML tools to gain a richer representation and understanding of the experiments carried out. In addition, the precise tracking of experiments fosters collaboration between researchers, who can exchange workflows and related data in a common format.
Author(s): Amol Thakkar, Andrea Giovannini, Matteo Manica, Alain Vaucher, Patrick Ruch, Teodoro Laino
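A minimal sketch of the kind of record such an infrastructure might capture automatically for each experimental step is shown below; the field names and example values are hypothetical, not the actual schema of the system described above.

```python
# Illustrative sketch only: a minimal record type for the who/what/when/where/why/how
# of an experimental step, of the kind such an infrastructure might capture
# automatically. Field names and values are hypothetical, not the toolkit's schema.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class ExperimentStep:
    who: str                        # researcher or automated agent
    what: str                       # action performed
    why: str                        # hypothesis / purpose of the experiment
    where: str                      # lab or cloud location
    instrument: str                 # which instrument produced the data
    software_version: str           # environment that generated the data
    data_uri: str                   # pointer to the captured raw data
    when: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

step = ExperimentStep(
    who="a.researcher",
    what="Recorded UV-Vis spectrum of sample S-42",
    why="Monitor reaction conversion over time",
    where="lab-3 / cloud-workflow-17",
    instrument="UV-Vis spectrometer #2",
    software_version="acquisition-suite 5.1.0",
    data_uri="s3://experiments/S-42/uvvis/run-001.csv",
)
print(json.dumps(asdict(step), indent=2))
```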
Symposium/Session: Quantum Computing for Tackling Challenges in Quantum Chemistry
Abstract: Molecular orbital (MO) simulation enables us to tackle various chemical problems. However, there are some cases where conventional MO simulation cannot describe the chemical problem accurately, such as isotope effects. Treatment of nuclear quantum effects is essential to tackle these problems. The Multicomponent Molecular Orbital (MCMO) method treats the nuclei quantum mechanically and allows us to include nuclear quantum effects in quantum chemistry simulations, but it requires larger computational resources than conventional MO simulations. Therefore, MCMO simulations are limited to smaller molecular systems than conventional MO simulations. The use of quantum computers for MCMO simulations is appealing. Some studies have investigated MCMO simulation using quantum algorithms; however, they are limited in number. In this work, we investigate the calculation of molecular properties using MCMO simulation with quantum algorithms, including simulations on quantum devices. We computed the dipole moments of the HD molecule, which cannot be calculated using conventional MO methods. A variational quantum eigensolver with a UCCSD ansatz was used for computing the dipole moments. We further combined the MCMO method with the frozen natural orbital approach for virtual space truncation to reduce the problem size for use on near-term quantum devices. We studied the effect of removing translational and rotational motions of the system on the calculation of the dipole moments. We confirmed that removing the rotational motions of the system is essential for accurate computation of the dipole moments of HD.
Author(s): Yukio Kawashima, Tanvi Gujarati, Yuki Orimo, Kenichi L. Ishikawa, Takeshi Sato
Symposium/Session: Advances in Energy and Fuel
Abstract: Lithium-ion batteries (LIBs) have advanced significantly as an essential energy storage solution over the last decade. However, gradual saturation in achievable energy density, flammability, high cost, and the environmental impacts of critical raw materials (e.g., cobalt and nickel) have yet to be addressed. This heightens the need to discover and develop next-generation battery technologies that utilize sustainable materials and deliver improved performance for a range of applications, including electric vehicles and stationary energy storage. This work focuses on the discovery of safe and high-performance liquid electrolytes for next-generation battery systems using data-driven methodologies such as QSAR (Quantitative Structure-Activity Relationship) prediction models and formulants-to-performance mapping deep learning models. Generally, electrolyte constituents are screened based on one or more essential properties such as their stability within the electrochemical window, the solubilities of salts in solvents, and ionic conductivities. Advancements in computational techniques allow easy assessment of these properties for the selection of suitable electrolyte constituents through a high-throughput screening process called computational funneling. In our work, we further expand the electrolyte discovery workflow to the next stage of finding the right composition of electrolyte constituents (solvents, co-solvents, salts) using deep learning, which may otherwise involve high-throughput experimentation in a relatively larger chemical space. It has been reported that exhaustively searching 4-component mixtures of electrolytes would require over a million evaluations. To optimally search the formulation design space for target performance, we adopt a simulation-experiment-AI synergistic approach where the initial battery cell tests used to develop the battery chemistry also serve for model learning and then drive the optimization of new electrolyte formulations, resulting in fewer lab experiments for validation. In this talk, we will discuss our data-driven electrolyte discovery workflow and toolkit, along with a use case demonstration conducted with a next-generation lithium-metal battery based on iodine-conversion chemistries.
Author(s): Vidushi Sharma, Maxwell Giammona, Tim Erdmann, Andy Tek, Khanh Nguyen, Linda Sundberg, Dmitry Zubarev, Young-hye Na
Symposium/Session: Young Industrial Polymer Scientist Award in Honor of Hayley Brown
Abstract: High-value specialty polycarbonates are employed in numerous applications, including resins for 3D additive printing, macromonomers for polyurethanes, surfactants, battery electrolytes, and degradable adhesives for medical applications. Driving the growth in use cases are the significant advancements made over the last two decades in improved methodologies for the synthesis of aliphatic carbonates as well as in upcycling carbon dioxide (CO2) into high-value-added materials. We have recently reported a method to transform 1,3- and 1,5-diols into functional cyclic carbonates without the use of hazardous reagents. Employment of TMEDA was shown to provide selective ring-closure to the cyclic carbonate in the presence of CO2 while minimizing oligomerization and the formation of other byproducts. A series of commercial and synthetic 1,3- and 1,5-diols was employed to generate 6- and 8-membered cyclic carbonates with diverse pendant functional groups for tuning polymer properties. The ability to tune these functional monomers and the subsequent polymers has enabled numerous applications that include drug, gene, and cell delivery as well as the use of the polymers as stand-alone therapeutics, including antimicrobials, anticancer agents, and antiviral therapies packaged as micelles, hydrogels, or coacervates. Specifically, we address the multi-faceted problem of drug resistance as well as other important concerns in disease treatment, exploiting polymer science to develop novel macromolecular therapeutics for treating infectious disease and cancer.
Author(s): James Hedrick
Symposium/Session: Reactivity at the Mineral-Water Interface: Validation through Modeling and Experiments at the Pore Scale
Abstract: Predicting the spatiotemporal evolution of the pore space of subsurface rock formations under fluid-solid interactions has applications in reservoir engineering, oil recovery, and carbon dioxide geological sequestration. The geometry of the reservoir porous space may evolve in time depending on several factors, including geometry characteristics, phasic properties, flow conditions, and underlying coupled pore-scale processes. Common methods to track these geometrical changes comprise transport-reaction simulations that combine fluid transport results with chemical and physical processes at the pore scale. These numerical methods are often limited to small domains and single reactions, as they become impractical and computationally costly for complex spatial domains and multiscale phenomena. To overcome these limitations, in this work, we model the rock pore space geometry, extracted from high-resolution X-ray microtomography images of suitable rocks, as a network of connected capillaries, a sparse graph representation with significantly reduced degrees of freedom with respect to its mesh- or lattice-based counterparts, and assume laminar piston-like flow within each capillary and conservation of mass at each network node. This makes it possible to track the geometry evolution of porous media due to simultaneous pore-scale processes (i.e., erosion, mineral dissolution, and mineral precipitation) by solving the transport equations iteratively to extract pressure and flow rate fields at each point in the network and then computing the volume accumulation or loss within each capillary. Computation of the change in capillary diameter employs phenomenological correlations for those physical and chemical processes at each temporal iteration, adjusting for the distinct time scales of each phenomenon.
Author(s): David Alejandro Lazo Vasquez, Jaione Tirapu Azpiroz, Rodrigo Neumann Barros Ferreira, Manuela Fernandes Blanco Rodriguez, Ronaldo Giro, Matheus Esteves, Ademir Ferreira Da Silva, Benjamin Wunsch, Mariana Del Grande, Mathias Steiner
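To illustrate the network idea, the sketch below sets up a toy four-node capillary network, assigns each capillary a Hagen-Poiseuille conductance, enforces mass conservation at the nodes, and solves the resulting linear system for nodal pressures and flow rates; the geometry, fluid properties, and boundary pressures are toy values, and the geometry-evolution step is only indicated in a comment.

```python
# Minimal sketch of the network idea in the abstract: capillaries as graph edges with
# Hagen-Poiseuille conductance, mass conservation at nodes, and a linear solve for
# nodal pressures. The 4-node network and fluid properties below are toy values.
import numpy as np

mu = 1e-3                                               # fluid viscosity, Pa*s
# Edges: (node_i, node_j, diameter_m, length_m)
edges = [(0, 1, 50e-6, 1e-3), (1, 2, 30e-6, 1e-3), (1, 3, 40e-6, 1e-3), (2, 3, 20e-6, 1e-3)]
n_nodes = 4
inlet, outlet = 0, 3
p_inlet, p_outlet = 2.0e5, 1.0e5                        # boundary pressures, Pa

A = np.zeros((n_nodes, n_nodes))
for i, j, d, L in edges:
    g = np.pi * d**4 / (128.0 * mu * L)                 # Hagen-Poiseuille conductance
    A[i, i] += g; A[j, j] += g
    A[i, j] -= g; A[j, i] -= g

b = np.zeros(n_nodes)
for node, p_fix in [(inlet, p_inlet), (outlet, p_outlet)]:   # fix boundary pressures
    A[node, :] = 0.0; A[node, node] = 1.0; b[node] = p_fix

p = np.linalg.solve(A, b)                               # nodal pressures
for i, j, d, L in edges:
    q = np.pi * d**4 / (128.0 * mu * L) * (p[i] - p[j])
    print(f"edge {i}-{j}: Q = {q:.3e} m^3/s")
# A geometry-evolution step would then update each diameter d from the volume
# gained or lost to dissolution/precipitation over a time increment.
```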
Symposium/Session: Free and Open Source Software: Harnessing the Power of Data
Abstract: Molecular dynamics simulation is well established as a technique contributing to drug and materials discovery. Increasingly important is its use as a data source for training AI models. Scaling the scope and size of such data sets will be key to building foundation models based on large-scale and diverse information. We use an IBM-developed open-source toolkit, the Simulation Toolkit for Scientific Discovery (ST4SD), to automate simulation workflows. These workflows can be readily scaled to take full advantage of traditional high-performance computing and emerging OpenShift clusters. We then show how large-scale simulation data can be digested by graph-based deep neural networks that our team has designed. We build a model for antigen-peptide immunogenicity prediction that outperforms hand-engineered features trained on the same dataset and is further shown to outperform state-of-the-art sequence-based models in the low-data regime.
Author(s): Joseph Morrone, Jeff Weber, Seung Gu Kang, Leili Zhang, Tien Huynh, Wendy Cornell
Symposium/Session: Machine Learning in Chemistry
Abstract: This study relates to the experimental characterization of amine solutions and the use of machine learning methods to classify and predict novel amines for carbon dioxide capture. Carbon capture, utilization, and storage (CCUS) is a critical element of world efforts to mitigate the effects of climate change. Amine solvents have been successfully applied to large-scale CCUS implementations, but material challenges include the cost of regeneration energy, solution degradation, and corrosivity. Improvements in stability, binding capacity, kinetics, and vapor-liquid equilibrium have been achieved through improved chemistry, but accelerating their discovery with machine learning and AI has seen limited exploration. Machine learning provides a promising method for reducing the time and resource burdens of materials development through efficient correlation of structure-property relationships to allow down-selection and focus on promising candidates. Towards demonstrating this, we have developed an end-to-end "discovery cycle" to select new aqueous amines compatible with commercially viable acid gas scrubbing for carbon capture. We combine a simple, rapid laboratory assay for CO2 absorption with a machine learning-based molecular fingerprinting model approach. The prediction process shows 60% accuracy against experiment for both material parameters and 80% for a single parameter on an external test set. The discovery cycle determined several promising amines that were verified experimentally and which had not been applied to carbon capture previously. In the process, we have compiled a large, single-source data set for carbon capture amines and produced an open-source machine learning tool for the identification of amine molecule candidates.
Author(s): Theodore Van Kessel, Benjamin Wunsch, Flaviu Cipcigan, Alexander Harrison, James Mcdonagh, Stamatia Zavitsanou, Stacey Gifford
Symposium/Session: Big Data in Polymer Chemistry
Abstract: Domain-specific languages (DSLs) are used in specific domain areas where their custom syntax and narrowed scope allow for concise, interpretable expression of programming tasks. Despite the extensive use of DSLs across a variety of domains, they are relatively underexplored for knowledge representation and translation tasks within experimental science. Here, we will discuss how the flexibility and expressiveness inherent to DSLs can enable effective representation of experimental data, with a specific focus on experimental polymer data using a DSL termed Chemical Markdown Language (CMDL). We will discuss how the inherent extensibility of a DSL such as CMDL enables straightforward support for a variety of polymer structural representation systems as well as accommodates a multitude of experimental data types. Experimental data represented using CMDL may be seamlessly utilized to develop ML models for materials and catalyst design, which in turn have been validated experimentally. The interoperability of CMDL-enabled platforms and data representations with broader open-source data initiatives within polymer chemistry will also be discussed.
Author(s): Nathaniel Park
Symposium/Session: Emerging Areas and New Methods in Biological Chemistry
Abstract: Protein folding results from an intricate interplay among intrinsic and extrinsic factors such as amino acid sequence and solvation environment. The hydrophobic effect primarily drives water-soluble globular proteins to fold, with the structure subsequently adjusted by side-chain packing, whereas the folding of membrane proteins, immersed in the hydrophobic lipid bilayer, is not well understood. The lack of water inside the lipid bilayer diminishes the hydrophobic effect, while van der Waals packing becomes a crucial driving force. This may imply that the membrane protein interior is tightly packed. Paradoxically, membrane proteins such as channels, transporters, receptors, and enzymes require cavities (i.e., voids, pockets, and pores) for their designated function. How, then, do membrane proteins achieve stability while carrying out their function? How does the hydrophobic lipid bilayer engage in stabilizing membrane proteins and their residue-wise interaction networks? In this presentation, we discuss the dynamic interplay between the lipid bilayer and membrane protein structure and function using molecular dynamics simulations and experiments. Taking the intramembrane protease GlpG of Escherichia coli as our model system, we investigate how the bilayer stabilizes the protein by facilitating residue burial, through a comparative study with a micelle environment. We also show that cavities created in membrane proteins can be stabilized by favorable interactions with surrounding lipid molecules and play a pivotal role in balancing stability and flexibility for function, as probed through cavity-filling mutagenesis.
Author(s): Seung Gu Kang
Symposium/Session: The Herman Mark Award in Honor of Robert Waymouth
Abstract: Current strategies to reduce CO2 emissions are insufficient; both point-source capture and direct air capture (DAC) must be considered to mitigate excessive atmospheric CO2 concentrations. Given the urgency of climate change issues and the immense challenges of developing viable methodologies for CO2 conversion, we posit that understanding structure–property relationships of organic/inorganic molecular reactivity across multiple length scales will lead to the evolution of remarkably efficient transformations of CO2 and revolutionize chemistries to control the fate of this greenhouse gas. Thus, we sought to investigate families of superbases (SBs) that serve as CO2-mitigating agents. This talk will focus on describing the wide-scope reactivity of a family of modular SBs that can be exploited in a variety of chemical transformations of CO2 from dilute and pure gaseous sources as well as in polymerizations. We found that the SBs can form zwitterionic complexes to activate CO2, which can be readily mineralized into metal carbonates. Importantly, the highly reactive nature of SBs renders them widely useful for upcycling CO2 into high-value products.
Author(s): James Hedrick
Symposium/Session: Machine Learning and AI for Organic Chemistry
Abstract: The application of machine learning models in chemistry has made remarkable strides in recent years. From enhancing retrosynthesis and expediting DFT calculations to predicting new drug candidates, the field has seen immense progress. Although there has been increased interest in the field of analytical chemistry, machine learning-based methods have so far not been adopted into everyday use by bench chemists. Of the analytical instruments that are commonly available to the chemist, infrared (IR) spectroscopy has receded in importance with the advent of more powerful structure elucidation tools such as nuclear magnetic resonance (NMR) and liquid chromatography-mass spectrometry (LC/MS). While chemists routinely identify functional groups from IR spectra, obtaining further information from them is challenging. Previous work on applying machine learning to IR spectroscopy has focused on identifying functional groups, and very few attempts at predicting the molecular structure directly have been published. In this work, we introduce a novel machine learning approach to predict the molecular structure directly from the IR fingerprint region. To achieve this, we developed a transformer model trained on IR spectra (400-2000 cm⁻¹) that predicts molecular structures as SMILES strings. In addition, we assessed the impact of appending the chemical formula to the input string, enhancing the accuracy of the model. Given the lack of large and high-quality experimental IR spectra databases, we generated a training set of 650,000 simulated IR spectra using molecular dynamics. Our approach achieved a top-1 accuracy of 29.7% and a top-10 accuracy of 62.8% on a test set sampled from PubChem with a heavy atom count ranging from 6 to 13. The model obtained in this fashion provides a pre-trained model that can be fine-tuned on smaller experimental datasets.
Author(s): Marvin Alberts, Teodoro Laino, Alain Vaucher
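As a purely illustrative view of how a spectrum can be turned into a transformer-ready input, the sketch below bins a synthetic IR spectrum over the 400-2000 cm⁻¹ region, quantizes the intensities into a small token vocabulary, and prepends the chemical formula; this is not the authors' tokenizer, and all parameters are assumptions.

```python
# Illustrative preprocessing only (not the authors' exact tokenizer): bin an IR
# spectrum over the fingerprint region, quantize intensities into a small vocabulary,
# and prepend the chemical formula, yielding a text sequence a transformer could consume.
import numpy as np

def spectrum_to_tokens(wavenumbers, intensities, formula, n_bins=100, n_levels=10):
    lo, hi = 400.0, 2000.0                              # fingerprint region, cm^-1
    edges = np.linspace(lo, hi, n_bins + 1)
    binned = np.zeros(n_bins)
    for k in range(n_bins):
        mask = (wavenumbers >= edges[k]) & (wavenumbers < edges[k + 1])
        binned[k] = intensities[mask].mean() if mask.any() else 0.0
    binned /= binned.max() + 1e-12                      # normalize to [0, 1]
    levels = np.minimum((binned * n_levels).astype(int), n_levels - 1)
    return [formula] + [f"I{v}" for v in levels]        # e.g. ["C6H12O", "I0", "I3", ...]

# Synthetic spectrum with two Gaussian bands standing in for real absorptions.
wn = np.linspace(400, 2000, 1600)
spec = np.exp(-((wn - 1715) / 20) ** 2) + 0.5 * np.exp(-((wn - 1100) / 30) ** 2)
tokens = spectrum_to_tokens(wn, spec, formula="C6H12O")
print(tokens[:12])
```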
Symposium/Session: Machine Learning and AI for Organic Chemistry
Abstract: Machine learning has become a powerful tool in accelerating scientific research. However, the reliability and trustworthiness of machine learning predictions depend on the selection of relevant features used in such models. In this work, we propose a multi-stage systematic approach for selecting molecular descriptors based on their causal relationship with a given property, which not only improves the accuracy of machine learning predictions but also enhances their interpretability. The proposed multi-stage feature selection consists of four blocks: i) feature extraction using Mordred descriptors, ii) data cleaning, iii) causal feature selection per Mordred descriptor module, and iv) general causal feature selection. The Markov blanket algorithm is used to construct the cause-effect graphs between the features and the property to predict, resulting in a sub-dataset with meaningful features and their importance with respect to the target. To evaluate our approach, we selected two challenging tasks: predicting the toxicity and the biodegradability of chemical compounds from molecular descriptors. The results demonstrate that the proposed multi-stage approach outperformed state-of-the-art methods while using significantly fewer features for both tasks. The proposed methodology enhances the interpretability of machine learning predictions, making it easier for experts to identify the most relevant features and to understand the underlying mechanisms that govern the behavior of the studied molecules. This approach can be applied to a wide range of scientific problems, and we believe it will play a key role in advancing the field of machine learning in science.
Author(s): Eduardo Almeida Soares, Karen Fiorella Aquino Gutierrez, Emilio Ashton Vital Brazil, Renato Fontoura De Gusmao Cerqueira
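The sketch below illustrates the first stages of such a pipeline: Mordred descriptors are computed with RDKit, unusable columns are dropped, and features are ranked. Mutual information is used here as a simple stand-in for the Markov blanket causal selection step, and the molecules and target values are toy data.

```python
# Sketch of the first stages of such a pipeline: compute Mordred descriptors with
# RDKit, drop unusable columns, then rank features. Mutual information is a simple
# stand-in for the Markov-blanket causal selection step described in the abstract.
import numpy as np
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors
from sklearn.feature_selection import mutual_info_regression

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCCCC"]
y = np.array([0.2, 0.9, 0.4, 0.3, 0.7])                 # toy target property

mols = [Chem.MolFromSmiles(s) for s in smiles]
calc = Calculator(descriptors, ignore_3D=True)          # i) feature extraction
df = calc.pandas(mols)

df = df.apply(pd.to_numeric, errors="coerce")           # ii) cleaning:
df = df.dropna(axis=1)                                  #    drop descriptors that failed
df = df.loc[:, df.std() > 0]                            #    drop constant columns

mi = mutual_info_regression(df.values, y, random_state=0)   # stand-in for iii)-iv)
top = pd.Series(mi, index=df.columns).nlargest(10)
print(top)
```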
Symposium/Session: Drug Design
Abstract: Rational design of antibody therapeutics represents an important frontier in drug discovery, and designing novel and diverse antibody candidates for immune checkpoint inhibition remains a topic of particular interest in cancer immunotherapy development. I cover efforts to establish and target novel epitopes with antibody therapeutics for immuno-oncology applications, using the immune cell exhaustion-associated receptor TIM-3 as a case study. I discuss the use of molecular dynamics simulations to identify TIM-3 conformational changes in functional contexts, the fitting of those conformational ensembles to X-ray reflectivity data, and the design of antibodies targeting unique functional TIM-3 epitopes with AI-enhanced free energy simulation approaches.
Author(s): Jeff Weber
Symposium/Session: Machine Learning in Chemistry
Abstract: Machine learning models in chemistry have made impressive progress in recent years. From enhancing retrosynthesis and folding proteins to predicting new drug candidates, the field has seen immense advances.[1–3] While the application of machine learning in analytical chemistry has also seen increased attention, machine learning-based methods have so far not been adopted into everyday use by bench chemists. NMR spectroscopy is among the most powerful analytical instruments available to chemists. It can be used to characterise molecular structure, determine complicated stereochemistry, and quantify mixtures. Although chemists regularly use NMR, and numerous programs exist to help process spectra, fully automated structure elucidation remains conceptual in practice. Machine learning may be a valuable tool that could enable automatic structure elucidation. While previous attempts to use machine learning to characterise molecules from spectra have been limited, one successful example involves determining the structure of compounds with up to 10 heavy atoms. However, this model requires high-resolution 1H and 13C NMR data and was trained on simulated spectra. In this work, we introduce a novel machine learning approach to predict the molecular structure directly from the 1H NMR spectrum. To achieve this, we developed a transformer model trained on 1H NMR spectra that predicts molecular structures as SMILES strings. We obtained 1H NMR spectra from the experimental sections of the patent reactions in NextMove's Pistachio dataset.[4] The model takes the chemical formula in addition to the 1H NMR spectrum in text form as input. In contrast to previous work, we include molecules with a heavy atom count from 10 to 35. We trained the model on approximately 750,000 examples. Our model achieves a top-1 accuracy of 21.2% and a top-10 accuracy of 40.7%. The model's accuracy could be further improved by including 13C NMR data and fine-tuning on more detailed NMR data.
Author(s): Marvin Alberts, Federico Zipoli, Alain Vaucher
Symposium/Session: New Concepts in Polymeric Materials
Abstract: The availability of experimental data of high quality and in large quantity is fundamental to unlocking opportunities for accelerating research and development through the application of data science and AI. In this context, we will present automated continuous and discontinuous experimentation setups for the synthesis of polymers and polymeric networks to further drive the adoption of automation. Besides synthesizing a series of 100 distinct block copolymers in 9 minutes, continuous flow reactors also allowed the synthesis of tailored segmented polyurethanes under real-time process monitoring and provided access to controlled ring-opening polymerization on millisecond timescales. Due to their facile assembly and disassembly, continuous flow reactor setups offer reconfigurability and wide customizability. To overcome the resulting need to redevelop automation code, we developed control and simulation software, LabDCS, and will present how LabDCS allows users to build, simulate, and operate chemical plants based on flow reactor hardware. In the second part, we will cover the utilization of a single-channel pipettor for generating a dataset of sol-gel materials and how ensemble models were trained to represent the growth of the sol, which constitutes an important process parameter for industrial clients.
Author(s): Tim Erdmann, Nathaniel Park, Pedro Arrechea, Sarath Swaminathan, Dmitry Zubarev, James Hedrick
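To make the ensemble-modeling idea in the abstract above concrete, here is a minimal Python sketch of fitting an ensemble regressor to relate formulation parameters to a growth response. The feature set and data are invented placeholders, not the sol-gel dataset or the models described in the talk.

```python
# Hypothetical sketch: an ensemble regressor relating sol-gel formulation
# parameters to a sol-growth response. All data below are toy placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy design matrix, standing in for e.g. precursor concentration,
# water ratio, pH, and temperature.
X = rng.uniform(size=(200, 4))
# Toy response, standing in for a measured sol-growth parameter.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean CV R^2: {scores.mean():.2f}")
```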
Symposium/Session: Early Career Investigators
Abstract: Huntington’s disease (HD) is an inherited neurodegenerative disorder caused by the expansion of CAG repeats in the Huntingtin gene. There is a strong correlation between the length of the extended polyglutamine (polyQ) tract within exon 1 of the Huntingtin protein (Htt) and the age of onset of Huntington’s disease. However, the underlying molecular mechanism is still poorly understood. We applied extensive molecular dynamics simulations to study the folding of the pathogenic Htt exon 1 (HttEx1) across different polyQ lengths and different species. By examining the radii of gyration, secondary structures, and residue-residue interactions of HttEx1 across these various sequences, we found that the polyP segments “chaperone” the rest of HttEx1 by forming ad hoc polyP binding grooves. This process elongates the otherwise poorly solvated polyQ domain while shifting its secondary structure propensity from β-strands to α-helices. The chaperoning effect is achieved mostly through transient hydrogen bond interactions between polyP and the rest of HttEx1, resulting in a striking golden ratio of ∼2:1 between the chain lengths of polyQ and polyP.
Author(s): Leili Zhang
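For readers less familiar with the quantity compared across polyQ lengths in the abstract above, the following short Python sketch shows how a mass-weighted radius of gyration is computed for one set of atomic coordinates. The coordinates and masses are random placeholders, not output from the simulations described here.

```python
# Hypothetical sketch: mass-weighted radius of gyration of one structure/frame.
import numpy as np


def radius_of_gyration(coords: np.ndarray, masses: np.ndarray) -> float:
    """Rg = sqrt( sum_i m_i * |r_i - r_com|^2 / sum_i m_i )."""
    com = np.average(coords, axis=0, weights=masses)
    sq_dist = np.sum((coords - com) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))


rng = np.random.default_rng(1)
coords = rng.normal(scale=10.0, size=(500, 3))  # placeholder coordinates (angstroms)
masses = np.full(500, 12.0)                     # placeholder atomic masses
print(f"Rg = {radius_of_gyration(coords, masses):.2f} angstrom")
```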
Symposium/Session: Taking a Deep Dive into Chemical Space
Abstract: The relatively recent emergence of deep learning-based AI has opened the door to the generation of fit-for-purpose molecules for drug discovery and other applications. The advantage of generation over prediction lies in the vast chemical space that can be considered, going beyond molecules known to exist or that have been explicitly imagined or enumerated. Here we determine the size, global diversity, local diversity, and fitness for purpose of molecules generated for a common target using a variety of generative approaches, which vary with respect to the number of targets included in training and the use or absence of 3D protein structural information. The results provide guidance on the use of specific approaches for lead finding or lead optimization.
Author(s): Wendy Cornell
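As one illustration of how the diversity of a generated molecule set can be quantified, the sketch below computes a mean pairwise Tanimoto distance over Morgan fingerprints. This is a generic, assumed metric for illustration; it is not stated to be the specific diversity measure used in the study, and the SMILES are toy examples.

```python
# Hypothetical sketch: diversity of a molecule set via pairwise Tanimoto
# distances of Morgan fingerprints. The SMILES below are toy examples.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "C1CCCCC1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Mean pairwise Tanimoto distance (1 - similarity) as a simple diversity score.
dists = []
for i in range(len(fps) - 1):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
    dists.extend(1.0 - s for s in sims)
print(f"mean pairwise Tanimoto distance: {sum(dists) / len(dists):.2f}")
```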
Symposium/Session: Chemical Data Interoperability, Validation & Evaluation
Abstract: Biodegradability is a crucial factor in assessing the long-term impact of chemicals on the environment. However, experimental testing to determine biodegradability is time-consuming and laborious. To address this issue, in silico approaches such as quantitative structure-activity relationship (QSAR) models are strongly encouraged by legislators. European legislators have incorporated chemical persistency into the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) regulation for the assessment of chemicals; however, only 61% of chemicals produced or imported in quantities of over 1000 tons per year have information on biodegradability. As a potential solution, REACH encourages the use of QSAR models to predict the biodegradability of compounds. To support the development of such models, this work extends the "All-Public set," an aggregated dataset with information on 2830 compounds from various sources. We contribute to this dataset by adding biodegradability information for 3707 new compounds from the ECHA database, resulting in a larger dataset covering 6537 compounds. By providing this larger dataset, we aim to promote the development of more accurate QSAR models for predicting biodegradability, enabling more efficient and effective assessments of the potential impact of chemicals on the environment and facilitating the development of more sustainable and environmentally friendly products.
Author(s): Eduardo Almeida Soares, Victor Shirasuna, Emilio Ashton Vital Brazil, Renato Fontoura De Gusmao Cerqueira
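As a sketch of the kind of QSAR workflow the extended dataset is intended to support, the following Python snippet featurizes SMILES with Morgan fingerprints and trains a random forest classifier on biodegradability labels. The data are toy placeholders, not entries from the All-Public set, and the model choice is an assumption for illustration only.

```python
# Hypothetical sketch: a baseline QSAR pipeline -- Morgan fingerprints as
# descriptors and a random forest classifier for biodegradability labels.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def featurize(smiles_list, n_bits=1024):
    """Convert SMILES strings to fingerprint bit-vectors as a numpy matrix."""
    rows = []
    for s in smiles_list:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.vstack(rows)


# Toy SMILES and labels (1 = readily biodegradable), not from the All-Public set.
smiles = ["CCO", "CCCCCCCC(=O)O", "c1ccc2ccccc2c1", "Clc1ccc(Cl)cc1", "OCC(O)CO", "CCCCCC"]
labels = [1, 1, 0, 0, 1, 0]

X = featurize(smiles)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, labels, cv=3)
print(f"mean CV accuracy: {scores.mean():.2f}")
```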
Symposium/Session: Drug Design: Lightning talks: Novel Workflows and Methods
Abstract: Designing molecules to be active and selective for their targets has always been at the core of drug discovery. While recent advances in generative AI are revolutionizing drug discovery by efficiently representing desired molecular properties in a computationally amenable latent space, the 3D binding structure is arguably the most critical yet most challenging factor to incorporate among those that tailor molecular structure and thereby properties. In our talk, we present a 3D-structure-based generative AI method capable of simultaneously generating active small molecules and their putative binding modes for a given target. Notably, incorporating a 3D network strongly enhanced the structural compatibility of the generated molecules with the target binding pocket, as well as their synthetic feasibility, compared to ligand-based 2D generative modeling. Furthermore, massive docking simulations of the generated molecules recapitulated the co-generated binding modes, and a significant correlation was found between docking pose ranks and contact recovery rates, implying that the model could learn the underlying physics despite not being explicitly trained to do so. Finally, we present the extensibility of our 3D generative approach to generating molecules with specific activity profiles against multiple protein receptors. Overall, our study demonstrates the importance of explicitly including 3D protein information in the molecule generation process for AI-driven drug discovery, rather than using 3D information only to filter generated molecules, providing insight into generative modeling for multiple on- and off-targets.
Author(s): Seung Gu Kang, Jeff Weber, Joseph Morrone, Leili Zhang, Tien Huynh, Wendy Cornell
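To show what a rank correlation between docking pose ranks and contact recovery rates looks like in practice, here is a minimal Python sketch using Spearman's rho. The numbers are invented placeholders purely to demonstrate the calculation; they are not results from the study.

```python
# Hypothetical sketch: rank correlation between docking pose rank and
# contact recovery rate. All values below are illustrative placeholders.
from scipy.stats import spearmanr

pose_ranks = [1, 2, 3, 4, 5, 6, 7, 8]  # docking rank (1 = best scored pose)
contact_recovery = [0.82, 0.75, 0.71, 0.64, 0.58, 0.41, 0.44, 0.30]  # fraction of co-generated contacts reproduced

rho, pvalue = spearmanr(pose_ranks, contact_recovery)
print(f"Spearman rho = {rho:.2f}, p = {pvalue:.3f}")
```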
Abstract: While many exemplary libraries and packages exist for enabling the development of actionable predictive models for polymer chemistry, their effective utilization by subject-matter experts (SMEs) during data generation and representation tasks can be challenging. Here, we explore how domain-specific languages (DSLs) can serve as an intermediate tool to facilitate the effective translation and representation of experimental data for consumption within AI and informatics pipelines. Additionally, by leveraging language-assistance tools present in modern integrated development environments (IDEs), we can significantly reduce the burden of learning and using a DSL in daily research workflows for knowledge capture. Ultimately, DSLs and their use within IDEs can serve as a solution to the knowledge representation and translation tasks required to enable the development or fine-tuning of effective AI models for materials design.
Author(s): Siya Kunde, Stephanie Houde, Dmitry Zubarev, Rachel Bellamy
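To make the DSL-to-pipeline idea in the abstract above more tangible, here is a tiny Python sketch that parses a made-up line notation for polymer records into structured dictionaries that an informatics pipeline could consume. The grammar is invented solely for illustration and is not the DSL referred to in the work.

```python
# Hypothetical sketch: parsing a made-up polymer-record notation into
# structured data. The notation itself is an illustrative assumption.
import re

RECORD = re.compile(
    r"polymer\s+(?P<name>\S+)\s*:\s*monomer=(?P<monomer>\S+)\s+"
    r"Mn=(?P<mn>[\d.]+)kDa\s+dispersity=(?P<pdi>[\d.]+)"
)


def parse(line: str) -> dict:
    """Turn one DSL line into a dictionary with typed numeric fields."""
    m = RECORD.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized record: {line!r}")
    record = m.groupdict()
    record["mn"] = float(record["mn"])
    record["pdi"] = float(record["pdi"])
    return record


print(parse("polymer PLA-01: monomer=lactide Mn=24.5kDa dispersity=1.18"))
# {'name': 'PLA-01', 'monomer': 'lactide', 'mn': 24.5, 'pdi': 1.18}
```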
Symposium/Session: Machine Learning and AI for Organic Chemistry
Abstract: Accelerating the discovery process has become synonymous with the use of artificial intelligence (AI). We can harness the power of AI to help predict outcomes, make decisions, or generate new artifacts guided by desired attributes. It is important to ensure a seamless integration of the technology with the humans in charge, to mitigate loss of control and skill, and to engender trust in the capabilities of the AI system so that optimal results are achieved. In most cases, however, subject-matter experts (SMEs) may be left guessing as to the capabilities of the new system they are asked to use. They may have to spend time familiarizing themselves with a new interface or learn an entirely new skill, such as coding, to launch and use APIs. We bridge this gap through a human-centered approach to the design of such AI systems. We conducted a two-part study with SMEs to understand their needs, wants, and expectations regarding the replacement of PFAS materials and the potential role of an AI assistant. First, we interviewed seven chemists using a think-aloud protocol while they attempted to find a fluorine-free superacid for photolithography using tools of their choice. This was followed by questions about what role an AI assistant could play in helping them achieve the same tasks. We gained insights into chemists’ methodology in tackling the discovery process, the types of tools currently used, and how an AI assistant could fill gaps in current technologies and provide a user-friendly interface that helps experts focus on the innovation process. Next, we organized feedback sessions with six of the same chemists to present storyboards of various design scenarios. The vignettes showcased ideas to support individual as well as collaborative contributions while utilizing a conversational AI assistant to search, generate, visualize, manipulate, and curate solutions.
Author(s): Siya Kunde, Stephanie Houde, Dmitry Zubarev, Rachel Bellamy