IBM’s software engineering agent tops the Multi-SWE-bench leaderboard for Java
iSWE-Agent for Java GitHub issue resolution secured the leaderboard’s top two spots: the first entry used a single run with a frontier model, and the second used inference scaling with open models.
Much like an automotive mechanic’s talents would be better spent fixing engines than patching flat tires all day, a software engineer’s value lies in their creative problem-solving skills. Unfortunately, workdays are easily eaten up by debugging, coding, and documentation. A new agentic AI tool from IBM Research aims to offload some of the more monotonous aspects of the software engineer’s job, and two new versions are leading the pack.
IBM’s software engineering (iSWE)-Agent for Java now occupies the number one and two spots on Multi-SWE-bench in the Java category. One entry, based on the frontier model Claude 4.5 Sonnet, sits at the top of the leaderboard, and the other, based on open-source models, is the runner-up. These iSWE agents resolved 33% and 31% of Java issues on the benchmark, respectively.
iSWE-Agent uses two inputs: a Java or Python codebase, and an issue like a bug report or feature request. From these two inputs, it generates a patch intended to resolve the issue. “Most GitHub issues are bug reports, some are feature requests, and iSWE follows the same basic workflow for either one,” said Martin Hirzel, principal research scientist and manager at IBM Research.
Why Java?
While the team’s past work has focused on Python, this new version is built for Java. The reasons for that are twofold. One reason is purely practical: IBM has many Java customers, and more broadly there is a sizable Java developer community for whom an agentic AI tool would save a lot of headaches.
The other, more scientific, reason to focus on Java is that everyone else is focusing on Python, to the point that existing SWE agents are overfitting to it. “The Python leaderboard for issue resolution is kind of saturated, and there’s mounting evidence that the latest frontier models are basically contaminated by seeing this benchmark data in their training,” Hirzel said. In other words, these tools are learning to excel at benchmarking tests, but it’s not clear how that performance extends to the actual tasks they’re designed for. The trend erodes community confidence in the validity of benchmarks for Python SWE agents.
Additionally, there’s an unmet need for better-performing Java SWE agents. Whereas the Python leaderboard is full of agents scoring in the 70 to 80% range for issue resolution, many agents on the Java leaderboards score in the 20s and 30s. There’s room to grow, and Java appears to be the more challenging task.
Developing iSWE agents
iSWE-Agent is IBM’s first proprietary multi-agent SWE system written entirely in-house and built on IBM-native open-source technologies, such as the Prompt Declaration Language (PDL), to mitigate long-term maintenance concerns. What sets iSWE-Agent apart, said IBM Research senior research engineer Jatin Ganhotra, who serves as project lead and architect for iSWE-Agent, is its architectural innovations and specialized Java-aware tools not found in other coding agents. Unlike popular SWE agents that require unfettered shell access, iSWE-Agent uses safer, mostly read-only tools based on IBM’s open-source program analysis toolkit CodeLLM DevKit (CLDK).
To fill the gap in the SWE agent field, researchers took two approaches. In the first, the proprietary iSWE-Agent was paired with a frontier model, Claude 4.5 Sonnet. At its core, iSWE-Agent works through two specialized components: one that pinpoints where changes are needed in the code and why, and another that applies those edits. Instead of relying on traditional methods like shell commands or simple text replacements, iSWE-Agent uses custom-built tools to understand code structure and apply edits, which are then checked for accuracy before they’re finalized. This approach improves reliability and allows the system to handle more complex changes across multiple files.
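For intuition, here is a minimal Python sketch of that localize-then-edit loop, taking an issue and a codebase as inputs and producing patches. Every name in it (EditSite, localize, apply_edits, validates) is an illustrative stand-in, not iSWE-Agent’s actual API, and the keyword matching and validation logic are placeholders for the real structural analysis and accuracy checks.

```python
# Hypothetical localize-then-edit sketch; names and logic are stand-ins,
# not iSWE-Agent's real interfaces.
from dataclasses import dataclass

@dataclass
class EditSite:
    path: str    # file the localizer wants changed
    reason: str  # why the change is needed

def localize(issue: str, files: dict[str, str]) -> list[EditSite]:
    """Stage 1 stand-in: flag files whose text shares terms with the issue."""
    terms = set(issue.lower().split())
    return [EditSite(path, "mentions issue terms")
            for path, src in files.items()
            if terms & set(src.lower().split())]

def validates(source: str) -> bool:
    """Stand-in for the accuracy check (e.g., a parse or compile step)."""
    return bool(source.strip())

def apply_edits(sites: list[EditSite], files: dict[str, str]) -> dict[str, str]:
    """Stage 2 stand-in: draft an edit per site; keep only edits that
    pass validation before they are finalized."""
    patches = {}
    for site in sites:
        candidate = files[site.path] + "\n// TODO: fix - " + site.reason
        if validates(candidate):
            patches[site.path] = candidate
    return patches

issue = "NullPointerException in OrderService.placeOrder"
files = {"OrderService.java": "class OrderService { void placeOrder() { /* NullPointerException */ } }"}
print(apply_edits(localize(issue, files), files))
```

The design point the sketch preserves is the one the article emphasizes: edits are checked before they are finalized, rather than applied blindly.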
The results show up in the numbers. Built on the frontier model Claude 4.5 Sonnet, iSWE-Agent achieved a 33% resolve rate, topping both the frontier model category and the Java leaderboard overall.
The second experiment used open models. iSWE-OpenModels was brought up to competitive performance with a technique called inference scaling. Inference scaling can take various forms, but in essence it is a way to apply more compute at inference time. One method involves generating multiple different outputs, which are then scored and ranked to improve performance. Much of the innovation for iSWE-OpenModels went into honing the model that selects which of these outputs to submit. “By carefully designing the selection component, we get very good performance,” said IBM distinguished research scientist David Kung.
Inference scaling with open models is made practical by the fact that smaller models are cheaper to run, and because they are open source, they can be hosted locally, avoiding the API costs of a frontier-model provider. And for IBM customers concerned with privacy or constrained by industry regulations, open models make it possible to perform inference scaling on premises.
Before the team submitted the two Java iSWE entries to the Multi-SWE-bench leaderboard, the top rank was held by a submission that used Gemini 2.5 Pro, which scored 28.9% for its ability to resolve Java issues. By comparison, another submission using the frontier model Claude 3.7 Sonnet scored 23.4%. “The best submission with open-source models on the leaderboard was using DeepSeek-R1, with 22.7%,” said Kung.
To beat those numbers, Kung and colleagues used inference scaling. To do this, they ran inference on the same issue multiple times. “The reason we do that is, by repeatedly sampling from the agent, we increase the probability that one of the solutions is going to be correct,” said Kung.
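The probability argument behind repeated sampling is easy to make concrete. A minimal sketch, assuming independent runs and an illustrative per-run success rate p (not a measured iSWE-Agent figure):

```python
# If one agent run resolves an issue with probability p, then k independent
# runs produce at least one correct patch with probability 1 - (1 - p)**k.
# p = 0.20 is illustrative only.
p = 0.20
for k in (1, 5, 10, 20):
    print(f"k={k:2d}  P(at least one correct) = {1 - (1 - p) ** k:.3f}")
# k= 1 -> 0.200, k= 5 -> 0.672, k=10 -> 0.893, k=20 -> 0.988
```

The boost only materializes, of course, if the pipeline can then pick the correct patch out of the k samples, which is where the selection components described next come in.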
Getting a good outcome depends on selecting the best patch from the candidates produced by the agent. That’s where the team’s ‘Verifier’ component comes in. In this phase, a scorer LLM (in this case, a fine-tuned version of Qwen2.5-Coder-32B built by the team) assigns a score to each proposed patch, and the five highest-scoring patches move forward.
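A minimal sketch of that scoring step, where score_patch stands in for the fine-tuned scorer model; its interface and criteria here are assumptions, not the team’s published details:

```python
# Hypothetical verifier stage: score every candidate patch, keep the top k.
def score_patch(issue: str, patch: str) -> float:
    """Stand-in scorer: in the real pipeline, a fine-tuned LLM rates the patch."""
    return float(len(patch))  # placeholder heuristic, not a real quality signal

def shortlist(issue: str, patches: list[str], k: int = 5) -> list[str]:
    """Score each candidate patch and return the k highest-scoring ones."""
    return sorted(patches, key=lambda patch: score_patch(issue, patch), reverse=True)[:k]

candidates = [f"patch variant {i} " * i for i in range(1, 9)]
print(shortlist("example issue", candidates))
```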
“The final step in our inference scaling pipeline is based on LLM-as-a-judge that looks at these five candidates and will select a single one,” said Kung. In a tournament-style selection, different combinations of the candidate patches are compared, and at the end of this process, a top contender emerges. In this way, the iSWE-OpenModels agent surpassed all submissions using open models with a 31% resolve rate, achieving performance comparable to that of SWE agents powered by frontier models.
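One plausible reading of that tournament is a single-elimination bracket over the five shortlisted patches; the team’s actual comparison scheme may differ. In this hypothetical sketch, judge() stands in for the LLM-as-a-judge:

```python
# Assumed pairwise-judge interface; single elimination is one possible scheme.
def judge(issue: str, a: str, b: str) -> str:
    """Stand-in judge: the real pipeline asks an LLM which patch better resolves the issue."""
    return a if len(a) >= len(b) else b  # placeholder preference

def tournament(issue: str, candidates: list[str]) -> str:
    """Compare pairs, advance winners, and repeat until one patch remains."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = [judge(issue, pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:  # an odd candidate out gets a bye to the next round
            winners.append(pool[-1])
        pool = winners
    return pool[0]

shortlisted = ["patch A", "patch BB", "patch CCC", "patch DDDD", "patch EEEEE"]
print(tournament("example issue", shortlisted))
```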
In fact, the IBM Research entries pulled ahead in both the frontier-model and open-model categories.
The team plans to build on their findings by working on new versions that they hope will score even higher on the benchmarks, with a goal of solving developers’ Java tasks with AI in much the same way they can today for Python. They are also planning to integrate software engineering agents with incident analysis, to proactively remedy incidents caused by software changes and bugs.