“It’s easier to learn a language by trial and error, instead of just by repeating correct sentences,” she said. In the domain of chemistry, as with linguistic errors, these negatives aren’t random — failure is informative because each attempt is based on background knowledge and an educated hypothesis. Both classes of negative reaction give feedback on how to build an eventual successful experiment.
Graziani and her colleagues built on IBM’s pioneering work in applying transformer-based language models to chemical language processing, training their own model with chemical reactions extracted from United States Patent and Trademark Office (USPTO) patents. Its language-modeling core uses the very transformer backbone that has since been scaled up to power state-of-the-art large language models, including IBM’s Granite series. They fine-tuned the model on two chemistry datasets — one with more than 500 well-characterized electrophilic aromatic substitution reactions but no negative data, and one with real-world results including negative data. Negative data could be easily generated for the first set, though, because there are only a few possible unexpected yields for that reaction type.
“Since long before today’s surge in large language models, at IBM we have been pioneers in the use of language models (transformer architectures) for scientific applications,” IBM Research scientist Teodoro Laino added. “Our 2019 paper, one of the first to apply language models in science, became the springboard for the reinforcement learning techniques we now use to extract insights even from unsuccessful experiments.”2
Model fine-tuning was meticulous for the research group, who crafted reward functions to support the use of reinforcement learning from human feedback (or RLHF). This approach is common in machine vision and natural language processing tasks, but not in chemistry. The key was building the reward function in a way that made sense. The difficulty in working with many negative samples and very few positives is that the reward signal is extremely sparse, a hard setting for reinforcement learning.
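The paper's actual reward function isn't reproduced here, but the sparsity problem the team describes can be sketched with a toy simulation (the success rate and trial count below are illustrative assumptions, not figures from the study): when almost every attempted reaction fails, a naive reward of 1 for success and 0 for failure gives the learner almost nothing to work with.

```python
import random

random.seed(42)

# Hypothetical ratio: lab datasets often contain far more failed
# reactions than successful ones.
N_TRIALS = 10_000
SUCCESS_RATE = 0.01  # 1% positives: a sparse-reward regime


def reward(reaction_succeeded: bool) -> float:
    # Naive reward: 1 for a successful reaction, 0 otherwise.
    # With so few positives, most rollouts return zero signal.
    return 1.0 if reaction_succeeded else 0.0


outcomes = [random.random() < SUCCESS_RATE for _ in range(N_TRIALS)]
total_reward = sum(reward(o) for o in outcomes)

print(f"nonzero rewards in {N_TRIALS} trials: {int(total_reward)}")
```

With a reward this sparse, gradient signal arrives on only a tiny fraction of rollouts, which is why the shape of the reward function mattered so much to the group.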
In typical RLHF cases, there is plenty of positive data to teach a model how to identify and predict desirable or undesirable outcomes. But again, not so in the chemistry lab, which has very few breakthroughs amid all the misses. Success came down to the use of vectorial representations of chemical reactions in a latent space, originally optimized for token prediction and then further tuned to discriminate the positives among this large number of negative outcomes.
Re-encoding the latent space to embed the successful reactions closer to each other made it possible to classify positives against negatives. In this new representational space, the task reduces to a simple boundary separation between successful and unsuccessful reactions.
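The re-encoding idea can be illustrated with a toy sketch (the vectors below are synthetic stand-ins, not the model's actual embeddings): once successful reactions cluster tightly in the latent space, even a distance-to-centroid rule draws a clean boundary between them and the scattered failures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for reaction embeddings after re-encoding:
# positives cluster tightly, negatives scatter elsewhere.
positives = rng.normal(loc=1.0, scale=0.1, size=(20, 8))
negatives = rng.normal(loc=-1.0, scale=0.5, size=(200, 8))

# Because positives now sit close together, a single centroid plus a
# radius acts as the decision boundary.
centroid = positives.mean(axis=0)
radius = np.linalg.norm(positives - centroid, axis=1).max() * 1.5


def is_successful(embedding: np.ndarray) -> bool:
    """Classify a reaction embedding by its distance to the positive centroid."""
    return bool(np.linalg.norm(embedding - centroid) <= radius)


# Every clustered positive falls inside the boundary...
assert all(is_successful(p) for p in positives)
# ...while the scattered negatives fall outside it.
assert not any(is_successful(n) for n in negatives)
```

The design point is that the hard work happens in the re-encoding step; once the geometry is right, the classifier itself can be almost trivially simple.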
“Eventually this was the trick,” Graziani said. So when the generative model predicted new reactions unseen in the data, the fine-tuned model could determine where each one should be embedded, and that position informed the guess of whether the predicted reaction would have worked.
Compared to a model trained on USPTO data but not fine-tuned on negative data, the fine-tuned experimental model performed over 10% better at predicting successful reactions. In the test dataset, all possible positives had been described, so it was clear when the model predicted a successful reaction.
One of the challenges with this line of research, said Graziani, is not in persuading chemists of the value of negative reaction data (in fact, numerous perspective articles already champion its importance) but in the scarcity of venues willing to publish it. Aside from a few publicly curated datasets, today’s publishing ecosystem doesn’t offer venues for reporting failed experiments, so models are still starved of the data they need.
This problem is rooted in the incentives of academic publishing and career progression, which prize monumental individual contributions over methodical team efforts. The 2015 reproducibility crisis in the social sciences cast new light on the importance of replications in the scientific literature, but it’s still rare to see a null result splashed across the front page of an academic journal. In this landscape, the new model stands out.
“In experimental cases where you might have an abundance of negatives but very few positive samples,” said Graziani, “our approach is winning in the sense that it unlocks the learning mechanisms that would otherwise remain stuck in simplified tuning.”
1. Negative Chemical Data Boosts Language Models in Reaction Outcome Prediction, Science Advances, 2025
2. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Central Science, 2019