Berkeley Innovation Forum 2025 at IBM Research
- San Jose, CA, USA
IBM is proud to be sponsoring the 42nd International Conference on Machine Learning (ICML).
The ICML is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence known as machine learning. ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics. ICML is one of the fastest growing artificial intelligence conferences in the world. Participants at ICML span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2026 summer internships.
Large language models (LLMs) and Generative AI (GenAI) are at the forefront of frontier AI research and technology. With their rapidly increasing popularity and availability, challenges and concerns about their misuse and safety risks are becoming more prominent than ever. In this talk, we introduce a unified computational framework for evaluating and improving a wide range of safety challenges in generative AI. Specifically, we will show new tools and insights to explore and mitigate the safety and robustness risks associated with state-of-the-art LLMs and GenAI models, including (i) safety risks in fine-tuning LLMs, (ii) LLM jailbreak mitigation, (iii) prompt engineering for safety debugging, and (iv) robust detection of AI-generated content.
Pin-Yu Chen (IBM)
Full Booth Schedule with staff and demos (by time) List of booth demos (by title)
While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. This paper focuses on the scenario where that code patch does not yet exist. Doing so supports two major use-cases. First, it supports TDD (test-driven development), the discipline of "test first, write code later" that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces TDD-Bench-Verified, a benchmark for generating tests from issues, and Otter, an LLM-based solution for this task. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planner. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.
Toufique Ahmed (IBM); Jatin Ganhotra (IBM); Rangeet Pan (IBM); Avi Shinnar (IBM); Saurabh Sinha (IBM); Martin Hirzel (IBM)
Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench provides an actionable first step toward deploying trustworthy web agents at scale.
Ido Levy (IBM); Ben Wiesel (IBM); Sami Marreed (IBM); Alon Oved (IBM); Avi Yaeli (IBM); Segev Shlomov (IBM)
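As a rough illustration of the Completion Under Policy (CuP) idea from the ST-WebAgentBench abstract, the sketch below counts a task toward CuP only when it was completed and no policy was violated. The `TaskResult` record, its fields, and the toy results are hypothetical and are not the benchmark's actual schema or data; the Risk Ratio is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of one agent run (not the benchmark's real schema)."""
    completed: bool
    violations: list = field(default_factory=list)  # names of breached policies


def completion_rate(results):
    """Fraction of tasks the agent finished, ignoring policies."""
    return sum(r.completed for r in results) / len(results)


def completion_under_policy(results):
    """Fraction of tasks finished with zero policy violations."""
    return sum(r.completed and not r.violations for r in results) / len(results)


# Toy, made-up runs: three completions, one of which breached a policy.
results = [
    TaskResult(completed=True),
    TaskResult(completed=True, violations=["user_consent"]),
    TaskResult(completed=False),
    TaskResult(completed=True),
]
print(completion_rate(results), completion_under_policy(results))
```

In this toy set the nominal completion rate counts all three finished tasks, while CuP drops the one that breached a policy, which is the kind of gap the abstract reports.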
Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.
Dmitry Krotov (IBM); Benjamin Hoover (IBM); Parikshit Ram (IBM)
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
Ziyao Wang; Muneeza Azmat (IBM); Ang Li; Raya Horesh (IBM); Mikhail Yurochkin (IBM)
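To make the "easy-to-learn rule" in the CoSD abstract more concrete, here is a minimal sketch of a per-token decision between a draft model and an assistant model. The specific rule (replace when the models disagree, the draft is unsure, and the assistant is more confident by a margin), the thresholds, and the toy proposals are assumptions for illustration, not the paper's exact criterion. The sketch also ignores the re-drafting that would follow a replacement in a real decoding loop.

```python
def fuse_tokens(draft, assistant, min_draft_conf=0.5, margin=0.2):
    """Choose one token per position from two lists of (token, probability).

    Illustrative rule (an assumption): keep the draft token unless the two
    models disagree, the draft probability is below `min_draft_conf`, and the
    assistant probability exceeds the draft probability by more than `margin`.
    """
    fused = []
    for (d_tok, d_p), (a_tok, a_p) in zip(draft, assistant):
        use_assistant = (
            d_tok != a_tok and d_p < min_draft_conf and a_p > d_p + margin
        )
        fused.append(a_tok if use_assistant else d_tok)
    return fused


# Made-up proposals: the draft is unsure only about the last token.
draft = [("The", 0.9), ("capital", 0.8), ("is", 0.9), ("Lyon", 0.3)]
assistant = [("The", 0.9), ("city", 0.4), ("is", 0.9), ("Paris", 0.8)]
print(fuse_tokens(draft, assistant))
```

A hand-written threshold rule like this is easy to inspect, which is one way to read the abstract's claim of greater explainability; a decision tree over the same two probabilities would play the same role.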
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient, accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
Jiarui Lu; Xiaoyin Chen; Stephen Zhewen Lu; Aurelie Lozano (IBM); Vijil Vijil (IBM); Payel Das (IBM); Jian Tang
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that -- all else equal -- estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
Leshem Choshen (IBM); Yang Zhang (IBM); Jacob Andreas
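For readers new to scaling laws, the sketch below shows the basic extrapolation step in its simplest form: fit a power law `loss = a * size ** (-b)` to a few small models by least squares in log-log space, then predict a larger one. The functional form (with no irreducible-loss term), the function names, and the synthetic data points are all assumptions for illustration and are not taken from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-b)


# Synthetic points generated from a known law, only to check the fit.
sizes = [1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b, predict_loss(a, b, 1e10))
```

The abstract's finding about intermediate checkpoints would correspond to adding more (size, loss) points per training run before fitting, rather than one final loss per model.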
Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain over-sensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of 'invariance' into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM's hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
Changsheng Wang; Yihua Zhang; Jinghan Jia; Parikshit Ram (IBM); Dennis Wei (IBM); Yuguang Yao; Soumyadeep Pal; Nathalie Baracaldo Angel (IBM); Sijia Liu
Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to O(|V|^2) token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to O(|V| + |E|), and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.
Chen-chia Chang; Wan-hsuan Lin; Yikang Shen; Yiran Chen; Xin Zhang (IBM)
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention maps. ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 15 other zero-shot interpretability methods on the ImageNet-Segmentation dataset. ConceptAttention works for popular image models and even seamlessly generalizes to video generation. Our work contributes the first evidence that the representations of multi-modal DiTs are highly transferable to vision tasks like segmentation.
Alec Helbling; Tuna Meral; Benjamin Hoover (IBM); Pinar Yanardag; Polo Chau
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely at the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the CE objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating at the token level. Finally, we scale NTL up to 3B parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives. The code is available via: https://tum-ai.github.io/number-token-loss/.
Jonas Zausinger; Lars Pennig; Anamarija Kozina; Sean Sdahl; Julian Sikora; Adrian Dendorfer; Timofey Kuznetsov; Momahad Hagog; Nina Wiedemann; Kacper Chlodny; Vincent Limbach; Anna Ketteler; Thorben Prein; Vishwa Mohan Singh; Michael Morris Danziger (IBM); Jannis Born (IBM)
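The two Number Token Loss flavors can be sketched on a single prediction as follows. This is a simplified reading of the abstract, not the authors' implementation (see their linked code for that). It assumes the only number tokens are the digits 0-9, that the logits are already restricted to those ten tokens, and that the norm-based flavor is a squared error on the expected digit value. All function names are hypothetical.

```python
import math

DIGIT_VALUES = list(range(10))  # assumption: number tokens are the digits 0-9


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ntl_mse(logits, target):
    """Squared error between the expected predicted digit and the true digit."""
    probs = softmax(logits)
    expected = sum(p * v for p, v in zip(probs, DIGIT_VALUES))
    return (expected - target) ** 2


def ntl_wasserstein(logits, target):
    """Wasserstein-1 distance between the predicted digit distribution and a
    point mass on the true digit, which reduces to E|value - target|."""
    probs = softmax(logits)
    return sum(p * abs(v - target) for p, v in zip(probs, DIGIT_VALUES))


def peaked_logits(digit, height=5.0):
    """Toy logits that put most probability mass on one digit."""
    return [height if v == digit else 0.0 for v in DIGIT_VALUES]


# With a true digit of 5, a near miss (4) costs less than a far miss (9),
# which cross-entropy alone would not distinguish.
print(ntl_mse(peaked_logits(4), 5), ntl_mse(peaked_logits(9), 5))
print(ntl_wasserstein(peaked_logits(4), 5), ntl_wasserstein(peaked_logits(9), 5))
```

In training, a term like this would be added to the usual cross-entropy loss on positions whose labels are number tokens.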
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.
Maohao Shen; Guangtao Zeng; Zhenting Qi; Zhang-Wei Hong; Zhenfang Chen (IBM); Wei Lu; Gregory Wornell; Subhro Das (IBM); David Cox (IBM); Chuang Gan (IBM)
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that lead to performance on par with expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness.
Hongli Zhan; Muneeza Azmat; Raya Horesh (IBM); Jessy Li; Mikhail Yurochkin (IBM)
Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models. Code is available at https://github.com/ZhiningLiu1998/TimeFuse.
Zhining Liu; Ze Yang; Xiao Lin; Ruihong Qiu; Tianxin Wei; Yada Zhu (IBM); Hendrik Hamann (IBM); Jingrui He; Hanghang Tong
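The sample-level fusion described in the TimeFuse abstract can be pictured as follows: a fusor maps an input's meta-features to one weight per base model, and the fused forecast is the weighted sum of the base forecasts. The linear-plus-softmax fusor, the function names, and the toy numbers are assumptions for illustration; the authors' actual fusor and meta-features are in their linked repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_forecasts(meta_features, fusor_weights, forecasts):
    """Combine per-model forecasts with input-dependent weights.

    meta_features: list of floats describing one input series.
    fusor_weights: one weight vector per base model (hypothetical linear fusor).
    forecasts: one forecast list per base model, all with the same horizon.
    """
    scores = [
        sum(w * f for w, f in zip(weights, meta_features))
        for weights in fusor_weights
    ]
    alphas = softmax(scores)  # fusion weights, one per model, summing to 1
    horizon = len(forecasts[0])
    fused = [
        sum(alpha * fc[t] for alpha, fc in zip(alphas, forecasts))
        for t in range(horizon)
    ]
    return alphas, fused


# Toy example: an untrained (all-zero) fusor weighs both models equally.
alphas, fused = fuse_forecasts(
    meta_features=[1.0, 0.0],
    fusor_weights=[[0.0, 0.0], [0.0, 0.0]],
    forecasts=[[1.0, 2.0], [3.0, 4.0]],
)
print(alphas, fused)
```

Training the fusor would mean adjusting `fusor_weights` so that, for each input, more weight goes to whichever base model tends to forecast that kind of series well.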
[Red Hat] The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the "importance" of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, where reward models iteratively optimize circuit topologies by evaluating and adjusting designs to ensure adherence to constraints like validity, efficiency, and expected output voltage. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.
Prashanth Vijayaraghavan (IBM); Luyao Shi (IBM); Ehsan Degan (IBM); Vandana Mukherjee (IBM); Xin Zhang (IBM)
We introduce a novel approach for discovering effective degrees of freedom (DOF) in molecular dynamics simulations by mapping the DOF to approximate symmetries of the energy landscape. Unlike most existing methods, we do not require data and rely on knowledge of the forcefield (energy function) and the initial state. We present a scalable symmetry loss function compatible with existing force-field frameworks and a Hessian-based method efficient for smaller systems. Our approach enables systematic exploration of conformational space by connecting structural dynamics to energy landscape symmetries. We apply our method to two systems, Alanine dipeptide and Chignolin, recovering their known important conformations. Our approach can prove useful for efficient exploration in molecular simulations with potential applications in protein folding and drug discovery.
Jeet Mohapatra; Nima Dehmamy (IBM); Csaba Both; Subhro Das (IBM); Tommi Jaakkola
[Red Hat] One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, for which we demonstrate optimality at 4-bits and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
[Red Hat] Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g., attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA.
Rickard Gabrielsson; Jiacheng Zhu; Onkar Bhardwaj (IBM); Leshem Choshen (IBM); Kristjan Greenewald (IBM); Mikhail Yurochkin (IBM); Justin Solomon
This position paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. Measuring this kind of reasoning is very informative in testing the ability of agents with self-consistent reasoning. However, it is important to note the distinction between this and what we actually care about when this self-consistency cannot be taken for granted. We call this functional theory of mind: the ability to adapt to agents in-context following a rational response to predictions about their behavior. We find that top performing open source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind -- even when partner policies are exceedingly simple. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.
Matthew Riemer (IBM); Zahra Ashktorab (IBM); Djallel Bouneffouf (IBM); Payel Das (IBM); Miao Liu (IBM); Justin Weisz (IBM); Murray Campbell (IBM)
Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows.
Michael Sun (IBM); Weize Yuan; Gang Liu; Wojciech Matusik; Jie Chen (IBM)
While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. This paper focuses on the scenario where that code patch does not yet exist. Doing so supports two major use-cases. First, it supports TDD (test-driven development), the discipline of "test first, write code later" that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces TDD-Bench-Verified, a benchmark for generating tests from issues, and Otter, an LLM-based solution for this task. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planner. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.
Toufique Ahmed (IBM); Jatin Ganhotra (IBM); Rangeet Pan (IBM); Avi Shinnar (IBM); Saurabh Sinha (IBM); Martin Hirzel (IBM)
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
Mateo Espinosa Zarlenga; Gabriele Dominici; Pietro Barbiero (IBM); Zohreh Shams; Mateja Jamnik
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
Saurabh Jha (IBM); Rohan Arora (IBM); Yuji Watanabe (IBM); Takumi Yanagawa (IBM); Yinfang Chen; Jackson Clark; Bhavya Bhavya (IBM); Mudit Verma (IBM); Harshit Kumar (IBM); Hirokuni Kitahara (IBM); Noah Zheutlin (IBM); Saki Takano (IBM); Divya Pathak (IBM); Felix George (IBM); Xinbo Wu; Bekir Turkkan (IBM); Gerard Vanloo (IBM); Michael Nidd (IBM); Ting Dai (IBM); Oishik Chatterjee (IBM); Pranjal Gupta (IBM); Suranjana Samanta (IBM); Pooja Aggarwal (IBM); Rong Lee (IBM); Pavankumar Murali (IBM); Jae-wook Ahn (IBM); Debanjana Kar (IBM); Ameet Rahane (IBM); Carlos A. Fonseca (IBM); Amit Paradkar (IBM); Yu Deng (IBM); Pratibha Moogi (IBM); Prateeti Mohapatra (IBM); Naoki Abe (IBM); Chandra Narayanaswami (IBM); Tianyin Xu; Lav Varshney; Ruchi Mahindru (IBM); Anca Sailer (IBM); Larisa Shwartz (IBM); Daby Sow (IBM); Nicholas Fuller (IBM); Ruchir Puri (IBM)
**[Red Hat]** Modern deep neural networks exhibit heterogeneity across numerous layers of various types, such as residuals and multi-head attention, due to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to the heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a 150% speedup over the baselines in end-to-end training time for training Wasserstein GAN on 12+ GPUs.
Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data.
Michael Sun (IBM); Orion Foo; Gang Liu; Wojciech Matusik; Jie Chen (IBM)
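The observation in the abstract above that a DAG's nodes can have many different topological orders can be illustrated with a minimal sketch. This is not the paper's grammar-based method; the helper `topological_orders` and the diamond-shaped example graph are hypothetical, and the brute-force enumeration is only practical for very small graphs.

```python
from itertools import permutations

def topological_orders(nodes, edges):
    """Return every ordering of `nodes` in which each edge (u, v) places u before v."""
    orders = []
    for perm in permutations(nodes):
        position = {node: i for i, node in enumerate(perm)}
        if all(position[u] < position[v] for u, v in edges):
            orders.append(perm)
    return orders

# Hypothetical diamond-shaped DAG: a -> b, a -> c, b -> d, c -> d
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
orders = topological_orders(nodes, edges)

for order in orders:
    print(" -> ".join(order))
```

Even this four-node graph admits more than one valid node sequence, which is the ambiguity that a unique sequential representation is meant to remove.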
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
Yu Wang; Dmitry Krotov (IBM); Yuanzhe Hu; Yifan Gao; Wangchunshu Zhou; Julian Mcauley; Dan Gutfreund (IBM); Rogerio Feris (IBM); Zexue He (IBM)
Bridging the gap between algorithmic precision and human-like risk nuance is essential for crafting multi-agent systems that learn adaptable and strategically intuitive behaviors. We introduce CPT-MADDPG, an extension of the Multi-Agent Deep Deterministic Policy Gradient algorithm, embedding Cumulative Prospect Theory (CPT) value and probability weight transforms into both actor and critic updates. By replacing expected return maximization with rank-dependent Choquet integrals over gains and losses, CPT-MADDPG endows agents with tunable risk profiles, ranging from exploratory, risk-seeking to conservative, loss-averse behaviors, without human intervention. Across competitive pursuit (Simple Tag), cooperative coverage (Simple Spread), and strategic bidding (first-price auctions), we show that risk-seeking parameterized CPT speeds early learning, extreme risk-averse parameterized CPT enforces prudence at a performance cost, transparent utility sharing preserves coordination under heterogeneity, and naive dynamic adaptation destabilizes convergence. In auction settings, learned CPT policies replicate documented overbidding phenomena, with short-term gains followed by long-term losses. Our work demonstrates a principled framework for integrating human-like risk attitudes toward strategic multi-agent deployment.
Sheyan Lalmohammed; Khush Gupta; Alok Shah; Keshav Ramji (IBM)
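The CPT value and probability weight transforms mentioned in the abstract above can be sketched for a discrete lottery as follows. This is a minimal illustration, not the CPT-MADDPG implementation: it assumes the standard Tversky and Kahneman (1992) functional forms and their commonly cited parameter estimates, and the function names `value`, `weight`, and `cpt_value` are hypothetical.

```python
# Assumed parameters: commonly cited Tversky-Kahneman (1992) estimates.
ALPHA = 0.88   # curvature of the value function for gains
BETA = 0.88    # curvature of the value function for losses
LAMBDA = 2.25  # loss-aversion coefficient
GAMMA = 0.61   # probability-weighting exponent for gains
DELTA = 0.69   # probability-weighting exponent for losses

def value(x):
    """Value transform: concave for gains, convex and steeper for losses."""
    return x ** ALPHA if x >= 0 else -LAMBDA * (-x) ** BETA

def weight(p, exponent):
    """Inverse-S probability weighting; overweights small probabilities."""
    p = min(max(p, 0.0), 1.0)  # guard against floating-point drift outside [0, 1]
    return p ** exponent / (p ** exponent + (1 - p) ** exponent) ** (1 / exponent)

def cpt_value(outcomes, probs):
    """Rank-dependent CPT value of a discrete lottery."""
    pairs = sorted(zip(outcomes, probs))
    total = 0.0
    # Losses: accumulate probability from the worst outcome upward.
    cum = 0.0
    for x, p in pairs:
        if x >= 0:
            break
        total += (weight(cum + p, DELTA) - weight(cum, DELTA)) * value(x)
        cum += p
    # Gains: accumulate probability from the best outcome downward.
    cum = 0.0
    for x, p in reversed(pairs):
        if x < 0:
            break
        total += (weight(cum + p, GAMMA) - weight(cum, GAMMA)) * value(x)
        cum += p
    return total

# A fair coin flip between losing and winning 100 has zero expected value,
# but its CPT value is negative under these loss-averse parameters.
fair_bet = cpt_value([-100, 100], [0.5, 0.5])
print(fair_bet)
```

Changing `LAMBDA`, `GAMMA`, or `DELTA` shifts the resulting attitude between risk-seeking and loss-averse, which is the kind of tunable risk profile the abstract describes.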
Prompt engineering for LLMs remains complex, with existing frameworks either hiding complexity behind restrictive APIs or providing inflexible canned patterns that resist customization -- making sophisticated agentic programming challenging. We present the Prompt Declaration Language (PDL), a novel approach to prompt representation that tackles this fundamental complexity by bringing prompts to the forefront, enabling manual and automatic prompt tuning while capturing the composition of LLM calls together with rule-based code and external tools. By abstracting away the plumbing for such compositions, PDL aims at improving programmer productivity while providing a declarative representation that is amenable to optimization. This paper demonstrates PDL's utility through a real-world case study of a compliance agent. Tuning the prompting pattern of this agent yielded up to 4x performance improvement compared to using a canned agent and prompt pattern.
Mandana Vaziri (IBM); Louis Mandel (IBM); Yuji Watanabe (IBM); Hirokuni Kitahara (IBM); Martin Hirzel (IBM); Anca Sailer (IBM)
Recent advances in large reasoning models (LRMs) have enabled strong multi-step reasoning, but existing unlearning methods, designed for standard LLMs, fail to address the unique challenges of LRMs. We present the first systematic study of LRM unlearning and show that conventional methods often leave reasoning traces intact, despite removing final answers. To overcome this, we propose Reasoning-aware Representation Misdirection for Unlearning (R²MU), which suppresses sensitive reasoning traces while preserving general reasoning ability. Experiments show that R²MU significantly reduces reasoning leakage and performs well on both reasoning and safety benchmarks, offering the first principled solution for mitigating reasoning trace leakage in LRM unlearning.
Changsheng Wang; Chongyu Fan; Yihua Zhang; Jinghan Jia; Dennis Wei (IBM); Parikshit Ram (IBM); Nathalie Baracaldo Angel (IBM); Sijia Liu
Large language models (LLMs) have rapidly evolved into powerful engines capable of driving agentic workflows, i.e., autonomous sequences of actions traditionally performed by humans (e.g., booking flights, preparing administrative forms) based on textual and/or visual inputs. Embracing collaborative and federated learning is essential in this context, as these paradigms enable the aggregation of distributed data while preserving user privacy and ensuring regulatory compliance. By keeping data localized, federated approaches allow agentic workflows to continuously learn and adapt from diverse user interactions without exposing sensitive information. This distributed learning framework not only facilitates scalable and personalized improvements but also mitigates biases by incorporating insights from a broad range of environments, ultimately amplifying the transformative potential of agentic workflows for both industry and everyday applications.
Recent commercial deployments, such as OpenAI Operator, highlight the significant impact of agentic workflows on the global economy and daily life. However, these workflows currently face several challenges including imprecise execution (e.g., incorrectly interacting with UI elements), suboptimal tool-use efficiency (e.g., latency in processing), and limitations in adaptive user-agent interactions (e.g., ineffective co-piloting and supervision). Additionally, while agentic workflows generate valuable data from user interactions, the sensitive and localized nature of this data creates hurdles for centralized learning approaches.
Collaborative and federated learning are powerful methodologies to overcome these challenges. They facilitate collective improvement by enabling continuous workflow optimization through distributed updates of the model and prompts without having to share the raw data. These methods also support personalization by tailoring agentic responses to individual user styles and preferences without compromising privacy. Importantly, they maintain strict regulatory compliance by ensuring that sensitive data remains local, which is a critical requirement under emerging legislative frameworks such as the EU AI Act and Canada Bill C-27.
This workshop uniquely focuses on the convergence of collaborative/federated learning with agentic workflows, fostering interdisciplinary research that bridges theoretical foundations, practical implementations, and regulatory considerations.
Alexander Erben; Gauri Joshi; Nicholas Lane; Huan Sun; Shiqiang Wang (IBM); Herbert Woisetschläger
In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic network simulation; 3) A real world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved generalization. These results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
Taku Ito (IBM); Luca Cocchi; Tim Klinger (IBM); Parikshit Ram (IBM); Murray Campbell (IBM); Luke Hearne
While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention over less than 100B tokens, achieving strong performance on a suite of synthetic tasks. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding, operating as content-based landmarks, implicitly compressing preceding context and "caching" it in the meta-token. At inference time, the meta-token points to relevant context, facilitating length generalization. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into their behavior towards length generalization.
Alok Shah; Khush Gupta; Keshav Ramji (IBM); Pratik Chaudhari