Berkeley Innovation Forum 2025 at IBM Research
- San Jose, CA, USA
ICSE, the IEEE/ACM International Conference on Software Engineering, is the premier software engineering conference. IBM Research is excited to sponsor ICSE this year as a Platinum sponsor.
We invite all attendees to visit us during the event at our booth, from Wednesday April 30th to Friday May 2nd.
We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics.
Presentation times of conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: All times are displayed in your local time.
Congratulations to the IBM team on winning the Distinguished Paper Award at ICSE SEIP for ASTER: Natural and Multi-language Unit Test Generation with LLMs.
Learn more about our work in AI for Code.
Visit us at the IBM booth to meet IBM researchers and talk about what it's like to work at IBM and about future job opportunities.
Abstract
Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
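To make the idea of model-model agreement concrete, the following is a minimal sketch that averages a pairwise agreement statistic over several raters. The abstract does not name the statistic the authors use, so Cohen's kappa is an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average kappa over all rater pairs; `ratings` maps rater name -> labels."""
    pairs = list(combinations(ratings.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, a task on which several models reach a high mean pairwise score would be a candidate for LLM annotation, while a low score would suggest keeping human raters. Any threshold is a choice left to the study designer.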
Authors
Abstract
One of the central tasks in software maintenance is being able to understand and develop code changes. Thus, given a natural language description of the desired new operation of a function, a (human or AI) agent might be asked to generate the set of edits to that function to implement the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate a changed description of that function’s new workings. Thus, there is an incentive to train a neural model for change-related tasks. Motivated by this, we offer a new, “natural”, large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit where the code and the associated docstring were changed together. We present the methodology for gathering the dataset, and some sample, challenging (but realistic) tasks where our dataset provides opportunities for both learning and evaluation. We find that current models (specifically Llama 3.1 405B and Mixtral 8x22B) do find these maintenance-related tasks challenging.
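As an illustration of what a coupled change means, the sketch below checks whether both the docstring and the body of a named function differ between two versions of a source file. It assumes Python sources and uses the standard `ast` module. It is not the authors' mining pipeline, and the helper names are hypothetical.

```python
import ast


def split_function(source, name):
    """Return (docstring, dumped body statements) for the function `name`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            doc = ast.get_docstring(node)
            # Drop the docstring statement so that only executable code remains.
            body = node.body[1:] if doc is not None else node.body
            # ast.dump omits line numbers by default, so pure re-layout is ignored.
            return doc, [ast.dump(stmt) for stmt in body]
    raise KeyError(name)


def is_coupled_change(before, after, name):
    """True when both the docstring and the code of `name` changed."""
    doc_before, code_before = split_function(before, name)
    doc_after, code_after = split_function(after, name)
    return doc_before != doc_after and code_before != code_after
```

This version ignores signature changes and comments, and it compares whole files rather than commits. A real miner would also need to walk the repository history.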
Authors
Abstract
Examples in web API specifications can be essential for API testing, API understanding, and even building chat-bots for APIs. Unfortunately, most API specifications lack human-written examples. This paper introduces a novel technique for generating examples for web API specifications. We start from in-context learning (ICL): given an API parameter, use a prompt context containing a few examples from other similar API parameters to call a model to generate new examples. However, while ICL tends to generate correct examples, those lack diversity, which is also important for most downstream tasks. Therefore, we extend the technique to iterated-calls ICL (ICICL): use a few different prompt contexts, each containing a few examples, to iteratively call the model with each context. Our intrinsic evaluation demonstrates that ICICL improves both correctness and diversity of generated examples. More importantly, our extrinsic evaluation demonstrates that those generated examples significantly improve the performance of downstream tasks of testing, understanding, and chat-bots for APIs.
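The iterated-calls idea can be sketched as a loop that queries a model once per prompt context and merges the distinct outputs. This is a simplified illustration, not the authors' implementation: the prompt wording, the `model` callable (any function from a prompt string to a list of strings), and the function names are all assumptions.

```python
from typing import Callable, List, Sequence


def build_prompt(parameter: str, examples: Sequence[str]) -> str:
    """Assemble a few-shot prompt; the wording here is purely illustrative."""
    lines = [f"Parameter: {parameter}", "Examples from similar parameters:"]
    lines += [f"- {example}" for example in examples]
    lines.append("New examples:")
    return "\n".join(lines)


def iterated_icl(
    model: Callable[[str], List[str]],
    parameter: str,
    contexts: Sequence[Sequence[str]],
) -> List[str]:
    """Call the model once per context and keep each distinct output once."""
    merged: List[str] = []
    for examples in contexts:
        for value in model(build_prompt(parameter, examples)):
            if value not in merged:  # preserve first-seen order, drop repeats
                merged.append(value)
    return merged
```

Plain ICL corresponds to passing a single context. Passing several different contexts is what is meant to add diversity, since each call conditions the model on a different set of examples.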
Authors
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Demos to be shown at our booth:
Abstract
Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. In this case, the confidence measure can serve as a basis for rational graduated decision making on how much review and care is needed. Calibration has so far been studied in mostly non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to directly use, use after varying intensity of careful review, or discard model-generated code; thus calibration is vital in generative settings. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are not well-calibrated out of the box. We then show how calibration can be improved, using standard methods such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in Software Engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in Software Engineering.
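For readers unfamiliar with the terms, here is a small, dependency-free sketch of one common calibration measure (expected calibration error) and of Platt scaling, which fits a logistic function that maps raw confidence scores to probabilities. The abstract does not say which calibration metrics the paper reports, so the choice of ECE, the plain gradient-descent fit, and all names and default values are assumptions for illustration.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def fit_platt(scores, correct, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(a, b, score):
    """Map a raw score to a rescaled probability using fitted parameters."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

As the abstract notes, fitting requires held-out samples whose correctness is already known. The fitted parameters are then applied to the confidence scores of new outputs.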
Authors
Abstract
There are many organizations, especially in domains such as banking, insurance, and airlines, that are looking for tools to identify and extract business rules from legacy mainframe code. Existing works have considered execution paths for a single business variable as the granularity of business rules, which limits the identification of complex rules. In our work, we address this limitation and provide a tool called A-COBREX, which implements a novel technique to identify business rules involving multiple business variables from the source code. We evaluated the tool on 27 programs with ground truth annotations. It has a recall of 74.12% and a precision of 62.21% for fuzzy match between ground truth and extracted rules. The screencast is available at https://youtu.be/adriX4q41PA, and the tool at https://github.com/SaravananKrishnan/BRE.
Authors
Abstract
As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as OpenAPI ones has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents—API, dependency, parameter, and value—collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents’ behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.
Authors
Abstract
Large Language Models for Code (or code LLMs) are increasingly gaining popularity and capabilities, offering a wide array of functionalities such as code completion, code generation, code summarization, test generation, code repair, refactoring, translation, and more. To leverage code LLMs to their full potential, developers must provide code-specific contextual information to the models. These are typically derived and distilled using program analysis tools. However, there exists a significant gap: these static analysis tools are often language-specific and come with a steep learning curve, making their effective use challenging. These tools are tailored to specific programming languages, requiring developers to learn and manage multiple tools to cover various aspects of their code base. Moreover, the complexity of configuring and integrating these tools into the existing development environments adds an additional layer of difficulty. This challenge limits the potential benefits that could be gained from the more widespread and effective use of static analysis in conjunction with code LLMs.
In this technical briefing, we present Codellm-Devkit (CLDK)—an open-source library that significantly simplifies the process of performing program analysis at various levels of granularity. As a Python-based library, CLDK offers developers an intuitive and user-friendly interface, making it incredibly easy to provide rich program analysis context to code LLMs. With this library, developers can effortlessly integrate detailed, code-specific insights that enhance the operational efficiency and effectiveness of LLMs in coding tasks. This hands-on session will enable participants to perform static analysis to build LLM-based solutions for coding tasks such as: (1) code generation, (2) code summarization, and (3) test generation across different programming languages. Through practical exercises, developers will gain hands-on experience in enhancing the functionality and applicability of code LLMs using CLDK’s APIs.
Presenters
Abstract
As REST APIs have become widespread in modern web services, comprehensive testing of these APIs has become increasingly crucial. Due to the vast search space consisting of operations, parameters, and parameter values along with their complex dependencies and constraints, current testing tools suffer from low code coverage, leading to suboptimal fault detection. To address this limitation, we present a novel tool, AutoRestTest, which integrates the Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SODG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. AutoRestTest provides a command-line interface and continuous telemetry on successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary results.
Authors
Abstract
Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer from poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness, evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.
Authors