ASTER: Natural and multi-language unit test generation with LLMs
Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit-test generation have been developed. Despite this effort, usable tools exist for very few programming languages. That’s why our team at IBM Research decided to build the Automated Test Case Generator, or ASTER.
ASTER demonstrates that LLM prompting guided by lightweight program analysis can generate high-coverage and natural tests. ASTER implements this approach for Java and Python. Moreover, ASTER incorporates software mocking mechanisms, enabling it to handle complex scenarios involving external dependencies, such as database interactions or web service calls.
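To make the mocking idea concrete, here is a minimal sketch of what a mock-based unit test can look like in Python, using the standard library's unittest.mock. The OrderService class, its fetch_orders dependency, and the test names are hypothetical and chosen only for illustration; this is not output produced by ASTER.

```python
import unittest
from unittest.mock import Mock


class OrderService:
    """Hypothetical class under test; it depends on an external database client."""

    def __init__(self, db):
        self.db = db

    def total_for_customer(self, customer_id):
        # External call that a unit test should not perform for real.
        rows = self.db.fetch_orders(customer_id)
        return sum(row["amount"] for row in rows)


class OrderServiceTest(unittest.TestCase):
    def test_total_sums_all_order_amounts(self):
        # Replace the real database with a mock that returns canned rows.
        db = Mock()
        db.fetch_orders.return_value = [{"amount": 20}, {"amount": 22}]

        service = OrderService(db)

        self.assertEqual(service.total_for_customer("c-1"), 42)
        # Also verify the interaction with the dependency.
        db.fetch_orders.assert_called_once_with("c-1")

    def test_total_is_zero_when_customer_has_no_orders(self):
        db = Mock()
        db.fetch_orders.return_value = []

        self.assertEqual(OrderService(db).total_for_customer("c-2"), 0)
```

Because the database is replaced by a mock, the test runs without any external service and still checks both the computed result and the call made to the dependency.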
To fully understand why we believe this work to be so important, let’s go through the key challenges that developers face while writing unit tests, and explain how ASTER can help.
The challenge of unit test generation
It’s a time-consuming task. Unit tests let developers validate the functionality of the application in a granular fashion. The tests focus on the implementation of each unit, typically a method, function, or procedure. However, writing tests that are maintainable and achieve high coverage can be tedious and time-consuming, which hinders developers’ productivity and lengthens the turnaround time for new or updated enterprise applications.
There’s a lack of “naturalness” in tests generated by conventional approaches. Previous studies [1] have shown that developers find automatically generated tests lacking in “naturalness”: the tests are hard to read and comprehend, cover uninteresting sequences, and contain trivial or ineffective assertions. Automatically generated tests are also known to contain anti-patterns [2] and generally do not resemble the kinds of tests that developers write.
All these factors inhibit the adoption of test generation tools in practice: developers consider the tests generated by these tools hard to maintain and are reluctant to add them to regression test suites without some, and often considerable, rewriting. For instance, Figure 2 shows a couple of examples of tests generated by EvoSuite and CodaMosa. The variable names, test names, and test assertions are not meaningful, and an experienced developer would not write such test cases. In fact, our survey of internal IBM developers revealed that developers do not find such tests useful and, in most cases, would not add them to their regression test suites (Figure 1).
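To illustrate the contrast in naturalness, here is a hypothetical pair of Python tests for the same small function. Neither is actual output from EvoSuite, CodaMosa, or ASTER; the apply_discount function and all test names are invented for this sketch. The first test mimics the opaque naming and weak assertions described above, and the second mimics what a developer would typically write.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100


class MachineStyleTest(unittest.TestCase):
    # Opaque names, arbitrary inputs, and an assertion that checks almost nothing.
    def test_0(self):
        float_0 = 1421.7
        int_0 = 0
        var_0 = apply_discount(float_0, int_0)
        self.assertIsNotNone(var_0)


class DeveloperStyleTest(unittest.TestCase):
    # Descriptive names, representative inputs, and assertions on real behavior.
    def test_applies_percentage_discount_to_price(self):
        self.assertEqual(apply_discount(200, 25), 150)

    def test_rejects_discount_above_100_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200, 120)
```

Both styles pass, but only the second documents the intended behavior and would fail if the discount logic were broken.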
LLM-generated tests often do not compile or run. With LLMs, developers can generate tests that look more natural. But those models have limited access to the application being tested and can hallucinate, so the generated tests very often do not compile or run. To use such tests, developers often have to fix them, which can take considerable effort, in some cases more than writing the tests from scratch.
Lack of ready-to-use test generation tools for multiple programming languages. Despite decades of research on test generation, ready-to-use tools are available for only a few popular programming languages, such as Java, C, and Python, and extending them to support more programming languages requires tremendous effort.
ASTER: Static-analysis-guided pipeline
ASTER addresses these issues with a four-stage pipeline:
1. Preprocessing via static analysis: The system performs a thorough static analysis of the application being tested to extract vital contextual information. This includes identifying method signatures, call hierarchies, and potential dependencies, all of which are crucial for generating meaningful tests.
2. LLM-guided test generation: Leveraging the insights from static analysis, ASTER crafts detailed prompts for LLMs. These prompts guide the models to generate unit tests that are not only syntactically correct but also semantically rich and aligned with human coding practices.
3. Postprocessing and refinement: The generated tests undergo validation to ensure they compile and execute. ASTER iteratively refines these tests, addressing any errors through targeted prompt enhancement.
4. Coverage augmentation: To maximize test effectiveness, ASTER identifies untested code paths and instructs the LLMs to generate additional tests specifically targeting these areas, thereby improving overall code coverage.
Empirical validation and developer feedback
We evaluated ASTER with several models, including IBM’s flagship model, Granite. We tested on Granite-8B, Llama3-8B, Granite-34B, CodeLlama-34B, Llama3-70B, and GPT4-turbo. In terms of datasets, we used a diverse set of projects, starting with projects from the Defects4J dataset, which consists of Java SE applications. We also included open-source and internal applications with Java EE features. In our evaluation, we found:
Advantages of the static-analysis-guided, LLM-based approach. LLM-based test generation guided by static analysis is very competitive with EvoSuite (the best conventional tool in terms of generating high-coverage tests for Java) in coverage achieved for Java SE projects, being slightly lower in some cases (-7%) and considerably higher in other cases (4x-5x).
For Java EE projects, ASTER significantly outperforms EvoSuite (on average, exceeding it by 26.4%, 10.6%, and 18.5% in terms of line, branch, and method coverage achieved) and is able to generate test cases for applications where existing approaches fail to do so.
For Python, ASTER generates tests with higher coverage than CodaMosa (+9.8%, +26.5%, and +22.5%) with all the models.
Smaller models are nearly as performant as their bigger counterparts. Smaller models (in our case, Granite-34B and Llama3-8B) demonstrate competitive performance, with only 0.1%, 6.3%, and 2.7% loss in line, branch, and method coverage, compared to larger models (here, Llama3-70B and GPT-4). Smaller models are more affordable and help address privacy concerns in enterprise settings, which necessitate on-premises solutions and where developers prefer models hosted internally or on local workstations.
To understand developer perspectives on the comprehensibility and usability of ASTER-generated tests compared to tests generated by EvoSuite (or CodaMosa) or written by developers, we conducted an anonymous online survey within IBM. The survey consisted of a set of background questions, followed by a series of focal methods together with two test cases for each method, and a set of questions for each focal method and its test pair. The survey received 161 responses, with participants having various roles, including software developer, QA engineer, principal solution architect, and research scientist. We found that developers prefer ASTER-generated tests over EvoSuite and CodaMosa tests in many respects, with over 70% also willing to add such tests with minor or no changes to their test buckets.
Recognition and what’s next
The paper describing ASTER was accepted to the Software Engineering in Practice track of the 2025 International Conference on Software Engineering (ICSE), a premier venue for software-engineering research, and was recognized with the Distinguished Paper Award at the conference. Future research directions include extending ASTER to other programming languages and levels of testing, creating fine-tuned models for testing to reduce the cost of LLM interactions, and exploring techniques for improving the fault-detection ability of the generated tests.
Part of this work has been incorporated into IBM watsonx Code Assistant. You can read the full paper on ASTER here.
References
1. G. Fraser, M. Staats, P. McMinn, A. Arcuri, and F. Padberg, “Does automated unit test generation really help software testers? A controlled empirical study,” ACM Trans. Softw. Eng. Methodol., Sep. 2015.
2. A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, “Revisiting test smells in automatically generated tests: Limitations, pitfalls, and opportunities,” in 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2020, pp. 523–533.