Reports on the 2015 AAAI Workshop Series
Stefano V. Albrecht, J. Christopher Beck, et al.
AAAI 2015
The recent introduction of the ACPBench dataset provided atomic reasoning tasks required for efficient planning. That dataset posed the tasks in the easiest possible form: Boolean or multiple-choice questions in which the model had to choose the right answer from the provided options. Here, we introduce the other extreme: a generative version of ACPBench, consisting of open-ended questions for which the model must generate the correct answer. Models that perform well on these tasks could in principle be integrated into a planner or used directly as a policy. We test a variety of models on our tasks and find that, for most tasks, the performance of even the largest models is still subpar. Our experiments show that, with the exception of the simplest task, progression, all tested models score below 60%, indicating that even current frontier models have a long way to go before they can reliably reason about planning.