Evaluating large language models (LLMs) is challenging. Running LLMs over a medium- or large-scale corpus can be prohibitively expensive; they are consistently shown to be highly sensitive to prompt phrasing, and it is hard to formulate metrics that differentiate and rank LLMs in a meaningful way. Consequently, results obtained on popular benchmarks such as HELM or MMLU lead to brittle conclusions (Sclar et al., 2024; Mizrahi et al., 2024; Alzahrani et al., 2024). We believe that meaningful, efficient, and robust evaluation is one of the cornerstones of the scientific method, and that achieving it should be a community-wide goal.
In this workshop we seek innovative research relating to the evaluation of LLMs and language generation systems in general. This includes, but is not limited to, robust, reproducible, and efficient evaluation metrics, as well as new approaches for collecting evaluation data that can help better differentiate between systems and understand their current bottlenecks.
To facilitate and spur research in this field, we publish two large datasets of model predictions together with prompts and gold-standard references: DOVE and DataDecide. These datasets go beyond reporting just the accuracy of a model on a given sample: they also record the axes along which the prompt was created that were found to affect performance (instruction template, few-shot examples and their order, delimiters, etc.), any known information about the model (pre-training corpora, type of instruction tuning, different checkpoints, and more), and the annotated gold label. Through these datasets, researchers will be able to investigate key questions such as: Are larger models more robust across different prompting configurations? Are common enumerators (e.g., A/B, 1/2) less sensitive compared to rare ones (e.g., I/IV, #/$)? Which evaluation axes should be prioritized when testing with limited resources? Can we identify patterns distinguishing examples where models show high robustness (consistent answers across configurations) from those where they show low robustness (varying answers)?
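As a rough illustration of how such per-prediction records could be used to study prompt robustness, the sketch below groups predictions by example and measures agreement across prompting configurations. The column names and values are hypothetical and do not reflect the actual DOVE or DataDecide schemas.

```python
# Hypothetical sketch: field names are illustrative only, not the DOVE/DataDecide schema.
import pandas as pd

# Each row is one model prediction under a specific prompting configuration.
records = pd.DataFrame([
    {"model": "model-A", "example_id": 0, "template": "t1", "enumerator": "A/B", "shots": 0, "prediction": "B", "gold": "B"},
    {"model": "model-A", "example_id": 0, "template": "t2", "enumerator": "1/2", "shots": 5, "prediction": "A", "gold": "B"},
    {"model": "model-A", "example_id": 1, "template": "t1", "enumerator": "A/B", "shots": 0, "prediction": "C", "gold": "C"},
    {"model": "model-A", "example_id": 1, "template": "t2", "enumerator": "1/2", "shots": 5, "prediction": "C", "gold": "C"},
])

def consistency(preds: pd.Series) -> float:
    # Fraction of prompting configurations agreeing with the majority answer
    # (1.0 = fully consistent, lower = prompt-sensitive).
    return preds.value_counts(normalize=True).iloc[0]

# Per-example robustness across prompting configurations.
robustness = (
    records.groupby(["model", "example_id"])["prediction"]
    .apply(consistency)
    .rename("consistency")
)

# Accuracy marginalized over all prompting configurations, per model.
accuracy = (
    (records["prediction"] == records["gold"])
    .groupby(records["model"])
    .mean()
    .rename("accuracy")
)

print(robustness)
print(accuracy)
```

The same grouping could be repeated over other recorded axes (enumerator, number of shots, checkpoint) to ask which of them drives most of the variance when evaluation budgets are limited.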