Djallel Bouneffouf, Matthew Riemer, et al.
NeurIPS 2025
This talk focuses on designing and evaluating agentic benchmarks, with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing on the development of AssetOpsBench, we'll discuss practical considerations for measuring agent behavior, task-completion quality, and decision robustness. The session will highlight what works, what doesn't, and what matters most when building benchmarks for agent-based systems.
Jannis Born, Filip Skogh, et al.
NeurIPS 2025
Tian Gao, Amit Dhurandhar, et al.
NeurIPS 2025
Vidushi Sharma, Andy Tek, et al.
NeurIPS 2025