IBM at ICSE 2025

Name: IBM at ICSE 2025
Start: 2025-04-27T12:00:00.000Z
End: 2025-05-04T03:00:00.000Z

Apr 27 to May 3, 2025

Ottawa, Ontario, Canada

This event has ended.

About

ICSE, the IEEE/ACM International Conference on Software Engineering, is the premier software engineering conference. IBM Research is excited to sponsor ICSE this year as a Platinum sponsor. We invite all attendees to visit us during the event at our booth, from Wednesday April 30th to Friday May 2nd.

We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics.

Presentation times of conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: All times are displayed in your local time.

Congratulations to the IBM team for winning the Distinguished Paper Award at ICSE SEIP for ASTER: Natural and Multi-language Unit Test Generation with LLMs.

Read more about ASTER in this IBM Research Blog post.

Learn more about our work in AI for Code.

Read our accepted papers at ICSE 2025

Career opportunities

Visit us at the IBM Booth to meet IBM researchers and talk about what it's like to work at IBM and about future job opportunities.

Current IBM Research open roles
Join our Talent Network and let us know you attended ICSE.

Keep up with emerging research and scientific developments from IBM Research. Subscribe to the Future Forward Newsletter.

Explore all current IBM Research openings

Agenda

Description:
Abstract
Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal to or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
Authors
Toufique Ahmed
Premkumar Devanbu
Christoph Treude
Michael Pradel
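
As a rough illustration of the agreement measurements the abstract above refers to, the sketch below computes Cohen's kappa between two raters. This is an assumption made for illustration only: the abstract does not say which agreement statistic the paper uses, and the function name `cohen_kappa` and the sample labels are hypothetical. The same function can compare a human rater with a model (human-model agreement) or two models with each other (model-model agreement).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    Illustrative helper, not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each rater's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters always use the same single label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = "summary is accurate", 0 = "summary is not accurate".
human = [1, 1, 0, 0]
model = [1, 1, 0, 1]
kappa = cohen_kappa(human, model)
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Under the abstract's proposal, low agreement between models on a task would be a signal that the task may not be suitable for LLM annotation.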
Description:
Abstract
One of the central tasks in software maintenance is being able to understand and develop code changes. Thus, given a natural language description of the desired new operation of a function, a (human or AI) agent might be asked to generate the set of edits to that function to implement the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate a changed description of that function’s new workings. Thus, there is an incentive to train a neural model for change-related tasks. Motivated by this, we offer a new, “natural”, large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit where the code and the associated docstring were changed together. We present the methodology for gathering the dataset, and some sample, challenging (but realistic) tasks where our dataset provides opportunities for both learning and evaluation. We find that current models (specifically Llama 3.1 405B and Mixtral 8x22B) do find these maintenance-related tasks challenging.
Authors
Kunal Suresh Pai
Prem Devanbu
Toufique Ahmed

Upcoming events

Nov 12, 2025
AI Hardware Forum 2025
- 10:00 AM EST
- Yorktown Heights, NY, USA
Oct 22 to Oct 23, 2025
IBM at PyTorch 2025
- 8:30 AM PDT
- San Francisco, CA, USA
Nov 12 to Nov 14, 2025
IBM Quantum Developer Conference 2025
- 9:00 AM EST
- Atlanta, Georgia, USA

View all events
