ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning

Harsha Kokel; Michael Katz; Kavitha Srinivas; Shirin Sohrabi

AAAI 2025

Workshop paper

25 Feb 2025

Abstract

The recently introduced ACPBench dataset provides atomic reasoning tasks required for efficient planning. That dataset poses the tasks in the easiest possible form: boolean or multiple-choice questions, where the model needs to choose the right answer from the provided options. Here, we introduce the other extreme: a generative version of ACPBench, consisting of open-ended questions where the model needs to generate the correct answer. Models that perform well on these tasks could in principle be integrated into a planner or used directly as a policy. We test a variety of models on our tasks and find that, for most tasks, the performance of even the largest models is still subpar. Our experiments show that, with the exception of the simplest task, progression, all tested models score below 60%, indicating that even current frontier models have a long way to go before they can reliably reason about planning.

Conference paper