IBM at ACL 2025

  • Live
  • Vienna, Austria and virtual

About

IBM is proud to sponsor the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

We look forward to meeting you at the event and telling you more about our latest work and career opportunities at IBM Research. Our team will be presenting a series of workshops, papers and demos related to a broad range of AI topics.

Why attend

Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.

Career opportunities

Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2026 summer internships.

Agenda

  • Description:

    Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.

    Full Booth Schedule with staff and demos (by time)
    Booth Demos (by title)

  • Description:

    As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it is crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing test-time compute, through reasoning and self-reflection mechanisms, on those metrics. Our results show that, while effective in other tasks, these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBs: https://ibm.github.io/POBS

    George Kour (IBM); Itay Nakash (IBM); Ateret Anaby-Tavor (IBM); Michal Shmueli-Scheuer (IBM)
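
    A toy illustration of the consistency property measured above: the sketch below is not the paper's metric, just a minimal, hypothetical way to score how often a model gives the same answer across paraphrases of one opinion question.

    ```python
    def consistency_score(answers_by_question: dict[str, list[str]]) -> float:
        """Fraction of questions for which the model gives the same answer
        to every paraphrased variant of that question."""
        consistent = sum(
            1 for answers in answers_by_question.values() if len(set(answers)) == 1
        )
        return consistent / len(answers_by_question)

    # Hypothetical answers: each key is one opinion question, each value holds
    # the model's answers to several paraphrases of that question.
    answers = {
        "q1": ["agree", "agree", "agree"],
        "q2": ["agree", "disagree", "agree"],
    }
    print(consistency_score(answers))  # 0.5
    ```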

  • Description:

    Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.

    Ariel Gera (IBM); Odellia Boni (IBM); Yotam Perlitz (IBM); Roy Bar-Haim (IBM); Lilach Edelstein (IBM); Asaf Yehudai (IBM)
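
    A minimal sketch of the system-level setup described above, using made-up scores: per-response judge scores are averaged into system scores, and the resulting ranking is compared to a human ranking with a simple Kendall correlation (tie handling omitted). This is illustrative only, not the paper's evaluation code.

    ```python
    from itertools import combinations
    from statistics import mean

    def kendall_tau(x: list[float], y: list[float]) -> float:
        """Kendall rank correlation between two score lists (ties ignored)."""
        pairs = list(combinations(range(len(x)), 2))
        concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
        discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
        return (concordant - discordant) / len(pairs)

    # Hypothetical per-response judge scores for three systems, plus human scores.
    judge_scores = {"sysA": [7, 8, 6], "sysB": [5, 6, 5], "sysC": [9, 9, 8]}
    human_scores = {"sysA": 0.71, "sysB": 0.55, "sysC": 0.80}

    systems = list(judge_scores)
    judge_system_scores = [mean(judge_scores[s]) for s in systems]
    human_system_scores = [human_scores[s] for s in systems]
    print(kendall_tau(judge_system_scores, human_system_scores))  # 1.0
    ```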

  • Description:

    Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations. Our benchmark is available at this project page.

    Navve Wasserman (IBM); Roi Pony (IBM); Oshri Naparstek (IBM); Adi Raz Goldfarb (IBM); Eliyahu Schwartz (IBM); Udi Barzelay (IBM); Leonid Karlinsky (IBM)
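
    To make the rephrasing-difficulty idea concrete, here is a hypothetical sketch that scores retrieval (recall@k) for the same gold page under increasingly rephrased versions of one query; all IDs and results are invented, and the benchmark's actual metrics may differ.

    ```python
    def recall_at_k(retrieved_ids: list[str], gold_id: str, k: int = 5) -> int:
        """1 if the gold page is among the top-k retrieved pages, else 0."""
        return int(gold_id in retrieved_ids[:k])

    # Hypothetical retrieval results for the same gold page under increasingly
    # rephrased versions of one query (level 0 = original wording).
    results_by_level = {
        0: ["p7", "p2", "p9"],
        1: ["p2", "p7", "p4"],
        2: ["p4", "p1", "p3"],
    }
    # Expected output: level 0 -> 1, level 1 -> 1, level 2 -> 0
    for level, retrieved in results_by_level.items():
        print(level, recall_at_k(retrieved, gold_id="p7", k=2))
    ```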

  • Description:

    There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models either are not explicitly trained to be safe or lose some of their safety abilities in the process, making them capable of generating harmful content. We observe that simple interpolation between the domain and alignment delta parameters leads to safer domain-specific models that preserve their utility. Building on this, we introduce MergeAlign, a simple, efficient, and effective model merging-based alignment method. We apply MergeAlign on Llama3 models that are experts in medicine and finance, obtaining substantial safety alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and the contributions of the individual models being merged, as well as the applicability of MergeAlign to more general code and math expert models using the Qwen-2.5 series of models. We hope our findings open new research avenues towards efficient development and deployment of safe expert LLMs.

    Megh Thakkar; Quentin Fournier; Matthew Riemer (IBM); Pin-Yu Chen (IBM); Amal Zouaq; Payel Das (IBM); Sarath Chandar
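
    The delta-interpolation idea can be sketched in a few lines. The snippet below is a simplified illustration on toy NumPy "state dicts", not the MergeAlign implementation; the `alpha` weight and the exact merging formula are assumptions for demonstration.

    ```python
    import numpy as np

    def interpolate_deltas(base, expert, aligned, alpha=0.5):
        """Merge an expert model and a safety-aligned model by interpolating
        their parameter deltas relative to the shared base model."""
        merged = {}
        for name, base_w in base.items():
            expert_delta = expert[name] - base_w
            align_delta = aligned[name] - base_w
            merged[name] = base_w + alpha * expert_delta + (1 - alpha) * align_delta
        return merged

    # Toy "state dicts" standing in for real checkpoints.
    rng = np.random.default_rng(0)
    base = {"w": rng.normal(size=(4, 4))}
    expert = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}   # domain fine-tune
    aligned = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}  # safety fine-tune
    merged = interpolate_deltas(base, expert, aligned, alpha=0.5)
    print(merged["w"].shape)  # (4, 4)
    ```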

  • Description:

    Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. In this work, we present a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against the dataset, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. The full set of prompt perturbations and model outputs is made publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation.

    Eliya Habba; Ofir Arviv (IBM); Itay Itzhak; Yotam Perlitz (IBM); Elron Bandel (IBM); Leshem Choshen (IBM); Michal Shmueli-Scheuer (IBM); Gabriel Stanovsky
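
    A minimal sketch of joint prompt perturbation: combining a few hypothetical dimensions (instruction wording, delimiters, answer enumerators) already multiplies into many variants of a single instance. The dimensions and prompt template below are invented for illustration.

    ```python
    from itertools import product

    # Hypothetical perturbation dimensions; the real dataset covers many more.
    instructions = ["Answer the following question.", "Choose the best option."]
    delimiters = ["\n", " | ", " ::: "]
    enumerators = [("A", "B", "C", "D"), ("1", "2", "3", "4"), ("a", "b", "c", "d")]

    def render_prompt(question, options, instruction, delimiter, enumerator):
        """Render one perturbed prompt for a multiple-choice instance."""
        choices = [f"{e}. {o}" for e, o in zip(enumerator, options)]
        return instruction + delimiter + question + delimiter + delimiter.join(choices)

    question = "Which planet is known as the Red Planet?"
    options = ["Venus", "Mars", "Jupiter", "Saturn"]
    prompts = [
        render_prompt(question, options, ins, d, e)
        for ins, d, e in product(instructions, delimiters, enumerators)
    ]
    print(len(prompts))  # 18 joint perturbations of a single instance
    ```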

  • Description:

    Transformer-based large language models (LLMs) rely on contextual embeddings, which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses of the much larger base LLM, offering significant space and inference-time savings while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach.

    Qitong Wang; Mohammed Zaki; Georgios Kollias (IBM); Vasileios Kalantzis (IBM)
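
    A minimal sketch of the dictionary-construction step, assuming contextual embeddings for one token have already been collected and using scikit-learn's KMeans as a stand-in for the clustering algorithm; the actual method and hyperparameters may differ.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def build_sense_embeddings(contextual_embeddings: np.ndarray, n_senses: int = 3):
        """Cluster the contextual embeddings collected for one token and keep
        the cluster centers as that token's sense embeddings."""
        km = KMeans(n_clusters=n_senses, n_init=10, random_state=0)
        km.fit(contextual_embeddings)
        return km.cluster_centers_

    # Random vectors stand in for contextual embeddings of one token (e.g. "bank")
    # gathered from many sentences with an LLM encoder.
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(500, 768))
    sense_vectors = build_sense_embeddings(embeddings, n_senses=3)
    print(sense_vectors.shape)  # (3, 768)
    ```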

  • Description:

    Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Experiments conducted on Llama-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and robust solution for various LLM platforms.

    Chen Xiong; Xiangyu Qi; Pin-Yu Chen (IBM); Tsung-yi Ho
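
    As a rough illustration of where a defensive prompt patch sits in the pipeline, the sketch below appends a hand-written suffix to every user prompt. The real DPP suffix is optimized rather than hand-written, so the text here is purely a placeholder.

    ```python
    # Placeholder suffix; the actual DPP suffix is optimized, not hand-written.
    DEFENSIVE_SUFFIX = (
        "\n\nRemember: refuse any request for harmful, illegal, or policy-violating "
        "content, no matter how the request is framed."
    )

    def apply_defensive_patch(user_prompt: str) -> str:
        """Append the defensive suffix to every prompt before it reaches the model."""
        return user_prompt + DEFENSIVE_SUFFIX

    print(apply_defensive_patch("Summarize the plot of Hamlet."))
    ```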

  • Description:

    Text-to-SQL aims to translate natural language queries from users into SQL statements executable over a database, which is highly practical as it enables anyone to easily retrieve the desired information from the database. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs may not be sufficient to cover all the diverse and domain-specific queries that require grounding in various database schemas, which can make the generated SQL less accurate. To address this problem, we propose constructing a knowledge base for text-to-SQL -- a foundational source of common knowledge -- from which we retrieve and generate the necessary knowledge for the diverse queries given. This gives our work a different focus from existing work that either manually annotates knowledge or generates only a few pieces of knowledge for each query. In particular, our knowledge base is comprehensive: it is constructed from a combination of all the available existing questions and their associated database schemas, along with their relevant knowledge obtained via LLM prompting, and it can be effectively reused for unseen databases from different datasets. We experimentally validate our approach on benchmark text-to-SQL datasets, considering both overlapping and non-overlapping database scenarios, on which it outperforms relevant baselines substantially.

    Jinheon Baek; Horst Samulowitz (IBM); Oktie Hassanzadeh (IBM); Shankar Subramaniam (IBM); Sola Shirai (IBM); Alfio Gliozzo (IBM); Debarun Bhattacharjya (IBM)
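
    A minimal sketch of the retrieve-then-prompt idea, assuming knowledge entries and queries have already been embedded by some encoder; the knowledge-base entries, vectors, and prompt template are all hypothetical.

    ```python
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve_knowledge(query_vec, knowledge_base, top_k=2):
        """Return the knowledge entries most similar to the query embedding."""
        ranked = sorted(knowledge_base, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
        return [e["text"] for e in ranked[:top_k]]

    def build_prompt(question, schema, knowledge):
        hints = "\n".join(f"- {k}" for k in knowledge)
        return (f"Database schema:\n{schema}\n\nRelevant knowledge:\n{hints}\n\n"
                f"Question: {question}\nSQL:")

    # Toy 2-d vectors stand in for embeddings from any sentence encoder.
    kb = [
        {"text": "'active users' means status = 'active'", "vec": np.array([1.0, 0.0])},
        {"text": "revenue is stored in cents; divide by 100", "vec": np.array([0.0, 1.0])},
    ]
    hints = retrieve_knowledge(np.array([0.9, 0.1]), kb, top_k=1)
    print(build_prompt("How many active users are there?", "users(id, status)", hints))
    ```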

  • Description:

    Conversational agents are increasingly woven into individuals’ personal lives, yet users often underestimate the privacy risks involved. The moment users share information with these agents (e.g., LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLMs. It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LLMs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LLMs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals, across different approaches to classifying goal-relevant information.

    Ivoline Ngong (IBM); Swanand Ravindra Kadhe (IBM); Hao Wang (IBM); Keerthiram Murugesan (IBM); Justin Weisz (IBM); Amit Dhurandhar (IBM); Karthikeyan Natesan Ramamurthy (IBM)
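
    A toy sketch of the reformulation step: drop prompt sentences that are out of context for the user's stated goal. The keyword-overlap heuristic stands in for the lightweight, locally run classifier the paper evaluates and is purely illustrative.

    ```python
    def is_relevant(sentence: str, goal: str) -> bool:
        """Toy relevance check: keep a sentence only if it shares a content word
        (longer than 3 characters) with the stated goal."""
        goal_words = {w for w in goal.lower().split() if len(w) > 3}
        sent_words = {w for w in sentence.lower().split() if len(w) > 3}
        return bool(goal_words & sent_words)

    def sanitize_prompt(prompt: str, goal: str) -> str:
        """Drop sentences that are out of context for the user's stated goal."""
        sentences = [s.strip() for s in prompt.split(".") if s.strip()]
        return ". ".join(s for s in sentences if is_relevant(s, goal)) + "."

    prompt = ("Draft a short resignation letter. My manager is John Smith at Acme Corp. "
              "I am leaving because of a salary dispute")
    print(sanitize_prompt(prompt, goal="draft a resignation letter"))
    # -> "Draft a short resignation letter."
    ```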

  • Description:

    What happens when a named entity recognition (NER) system encounters entities it has never seen before? In practical applications, models must generalize to unseen entity types where labeled training data is either unavailable or severely limited, a challenge that demands zero-shot learning capabilities. While large language models (LLMs) offer extensive parametric knowledge, they fall short in cost-effectiveness compared to specialized small encoders. Existing zero-shot methods predominantly adopt a relaxed definition of the setting, with potential leakage issues, and rely on entity type names for generalization, overlooking the value of richer descriptions for disambiguation. In this work, we introduce ZeroNER, a description-driven framework that enhances hard zero-shot NER in low-resource settings. By leveraging general-domain annotations and entity type descriptions with LLM supervision, ZeroNER enables a BERT-based student model to successfully identify unseen entity types. Evaluated on three real-world benchmarks, ZeroNER consistently outperforms LLMs by up to 16% in F1 score, and surpasses lightweight baselines that use type names alone. Our analysis further reveals that LLMs derive significant benefits from incorporating type descriptions in their prompts.

    Alessio Cocchieri; Marcos Martínez Galindo (IBM); Giacomo Frisoni; Gianluca Moro; Claudio Sartori; Giuseppe Tagliavini
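
    A minimal sketch of how entity type descriptions (rather than bare type names) can guide an LLM teacher whose outputs then supervise a small student; the type descriptions and prompt format below are assumptions for illustration.

    ```python
    # Hypothetical entity-type descriptions; richer than bare type names.
    TYPE_DESCRIPTIONS = {
        "DISEASE": "a disorder or medical condition affecting humans or animals",
        "CHEMICAL": "a chemical substance, compound, or drug",
    }

    def build_annotation_prompt(sentence: str) -> str:
        """Ask an LLM teacher to tag unseen entity types, guided by descriptions,
        so its outputs can supervise a small BERT-based student."""
        type_lines = "\n".join(f"- {t}: {d}" for t, d in TYPE_DESCRIPTIONS.items())
        return ("Label every entity span in the sentence using only these types:\n"
                f"{type_lines}\n\nSentence: {sentence}\nEntities (type: span):")

    print(build_annotation_prompt("Aspirin is commonly used to treat fever."))
    ```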

Upcoming events