IBM Research at ACL 2023
- Toronto, Canada
IBM is proud to sponsor the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) in Toronto, Canada. We invite all attendees to visit us during the event at our booth in the Exhibition Center of the Westin Harbour Castle.
We look forward to meeting you at the event and telling you more about our latest work and career opportunities at IBM Research. Our team will be presenting a series of workshops, papers and demos related to a broad range of AI topics.
Read our accepted papers at ACL 2023.
For presentation times of workshops, demos, papers, and tutorials see the agenda section below. Note: All times are displayed in your local time.
View the booth demo & staff schedule.
Keep up with emerging research and scientific developments from IBM Research. Subscribe to the Future Forward Newsletter.
We look forward to meeting and seeing you in Toronto!
Visit us in the Harbour Ballroom to meet with IBM researchers and recruiters to discuss future job opportunities or 2024 summer internships.
Featured positions to learn more about at ACL:
ACL Attendees - To further engage, let us know you attended the conference and want to be considered for future Research opportunities here: Submit your information to IBM Research
Sign up to be notified of future openings by joining our Talent Network.
Visit us in the Expo center from 9am - 5pm to talk to researchers, recruiters, and interact with live demos.
Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, we present an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by "translating" the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.
Authors: Guy Dar; Mor Geva; Ankit Gupta (IBM); Jonathan Berant
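To make the zero-pass idea above concrete, here is a minimal, hedged sketch of projecting a single trained parameter vector into embedding space and reading off the vocabulary items it most activates. The model (GPT-2), the layer, and the choice of an MLP output vector are illustrative assumptions, not the authors' exact procedure.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

E = model.transformer.wte.weight                    # (vocab_size, d_model) embedding matrix
v = model.transformer.h[5].mlp.c_proj.weight[42]    # one MLP "value" vector, shape (d_model,)

with torch.no_grad():
    scores = E @ v                                  # project the parameter into embedding space
    top = torch.topk(scores, k=10).indices
print(tok.convert_ids_to_tokens(top.tolist()))      # vocabulary items this parameter promotes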
Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation. We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on understanding what matters for NMT pre-training, as well as pave the way for the development of more efficient and less toxic models.
Authors: Zexue He; Graeme Blackwood (IBM); Rameswar Panda (IBM); Julian Mcauley; Rogerio Feris (IBM)
Zero-shot learning (ZSL) focuses on annotating texts with entities or relations that have never been seen during training. This task has many practical applications due to the lack of labeled data in real-world situations within specific domains. Recent advances in machine learning with large pretrained language models demonstrate significant results in zero-shot learning with numerous novel methods. There is high demand in both industry and the research community for a framework where people with different backgrounds can easily access the latest ZSL methods or pretrained models. In this work, we create a new ZSL framework called Zshot. The main goal of our work is to provide researchers with a framework where they can quickly benchmark and compare different state-of-the-art ZSL methods on standard benchmark datasets included in the framework. Moreover, it is designed to support industry with ready APIs for production under the standard spaCy NLP pipeline. Our API is extensible and evaluable; moreover, we include numerous enhancements such as automatic description generation, boosting accuracy with pipeline ensembling, and visualization utilities available as a spaCy extension.
Authors: Gabriele Picco (IBM); Marcos Martínez Galindo (IBM); Alberto Purpura (IBM); Leopold Fuchs (IBM); Vanessa Lopez (IBM); Lam Thanh Hoang (IBM)
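Since Zshot plugs into the standard spaCy pipeline, the snippet below shows, in general terms, how a zero-shot entity component can be registered and run as a spaCy pipe. It is a hedged, generic sketch using spaCy's public component API, not the actual Zshot interface; the component name and the toy matching logic are assumptions.

import spacy
from spacy.language import Language
from spacy.tokens import Span

@Language.component("toy_zero_shot_linker")
def toy_zero_shot_linker(doc):
    # A real ZSL component would match spans against unseen entity
    # descriptions with a pretrained encoder; here we just tag a fixed phrase.
    ents = []
    for i, token in enumerate(doc):
        if token.text == "IBM":
            ents.append(Span(doc, i, i + 1, label="ORG"))
    doc.ents = ents
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("toy_zero_shot_linker")
doc = nlp("IBM Research presented Zshot at ACL 2023.")
print([(e.text, e.label_) for e in doc.ents])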
To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and CloudResearch workers, their alignment with expert judgments on a subset of the data is not as expected and needs further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.
Authors: Lining Zhang; Simon Mille; Yufang Hou (IBM); Daniel Deutsch; Elizabeth Clark; Yixin Liu; Saad Mahamood; Sebastian Gehrmann; Miruna Clinciu; Khyathi Chandu; João Sedoc
Pretraining has been shown to scale well with compute, data size and data diversity. Combining all of these, multitask mixtures of supervised datasets have been shown to improve performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams.
In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequently, ColD Fusion can create a synergistic loop, where finetuned models and pretrained models keep improving each other. We show that ColD Fusion yields comparable benefits to multitask pretraining by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find that ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 datasets, ColD Fusion outperforms RoBERTa by points on average without any changes to the architecture.
Authors: Shachar Don-Yehiya (IBM); Elad Venezian (IBM); Colin Raffel; Noam Slonim (IBM); Yoav Katz (IBM); Leshem Choshen (IBM)
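The abstract does not spell out the fusion operator, so the sketch below only illustrates the general idea of merging contributor checkpoints without sharing any data, assuming simple parameter averaging; the actual ColD Fusion procedure may differ.

import torch

def fuse_state_dicts(state_dicts):
    """Average contributor checkpoints into one fused model (one collaborative round)."""
    fused = {}
    for key in state_dicts[0]:
        fused[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return fused

# Each contributor finetunes the current fused model on its private data and
# shares only the resulting weights; the fused result seeds the next round.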
Recent work in natural language processing (NLP) has yielded appealing results from scaling; however, using only scale to improve performance means that resource consumption also scales. Resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
Authors: Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro Martins, André Martins, Peter Milder, Colin Raffel, Jessica Forde, Emma Strubell, Edwin Simpson, Noam Slonim, Jesse Dodge, Iryna Gurevych, Niranjan Balasubramanian, Leon Derczynski and Roy Schwartz
Open Information Extraction (OpenIE) has been used in the pipelines of various NLP tasks. Unfortunately, there is no clear consensus on which models to use in which tasks. Muddying things further is the lack of comparisons that take differing training sets into account. In this paper, we present an application-focused empirical survey of neural OpenIE models, training sets, and benchmarks in an effort to help users choose the most suitable OpenIE systems for their applications. We find that the different assumptions made by different models and datasets have a statistically significant effect on performance, making it important to choose the most appropriate model for one's applications. We demonstrate the applicability of our recommendations on a downstream Complex QA application.
Authors: Kevin Pei; Ishan Jindal (IBM); Kevin Chang; Chengxiang Zhai; Yunyao Li
Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models, as it is unclear whether the answer stems from the given non-parametric knowledge or not. This lack of clarity has implications for trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.
Authors: Ella Neeman; Roee Aharoni; Or Honovich; Leshem Choshen (IBM); Idan Szpektor; Omri Abend
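To illustrate the disentanglement setup described above, here is a hypothetical training instance of the kind such a model could be trained on, with a counterfactual passage that contradicts world knowledge; the field names and the example itself are illustrative assumptions, not the authors' data format.

# One illustrative counterfactual instance: the contextual answer must follow
# the (deliberately wrong) passage, the parametric answer the model's memory.
example = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet was written by Mark Twain.",   # counterfactual passage
    "contextual_answer": "Mark Twain",                # grounded in the given context
    "parametric_answer": "William Shakespeare",       # drawn from parametric knowledge
}
print(example["contextual_answer"], "/", example["parametric_answer"])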
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. It contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam
Authors: Arnav Mhaske; Harshit Kedia; Sumanth Doddapaneni; Mitesh M. Khapra; Pratyush Kumar; Rudra Murthy Venkataramana (IBM); Anoop Kunchukuttan
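A minimal sketch of the annotation-projection step described above: entity tags predicted on the English side are copied to the aligned tokens of the Indian-language sentence. The toy tags and alignments are illustrative; the real pipeline runs over the Samanantar corpus with an automatic word aligner.

def project_tags(src_tags, alignments, tgt_len):
    """Copy entity tags across (src_idx, tgt_idx) word-alignment pairs."""
    tgt_tags = ["O"] * tgt_len
    for s, t in alignments:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    return tgt_tags

src_tags = ["B-PER", "I-PER", "O", "B-LOC"]      # automatically tagged English tokens
alignments = [(0, 1), (1, 2), (3, 0)]            # English -> Indic token alignments
print(project_tags(src_tags, alignments, tgt_len=5))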
In the deployment of real-world text classification models, label scarcity is a common problem. As the number of classes increases, this problem becomes even more complex. One way to address this problem is by applying text augmentation methods.
Authors: Adir Rahamim; Guy Uziel (IBM); Esther Goldbraich (IBM); Ateret Anaby-Tavor (IBM)
Text classification datasets from specialised or technical domains are in high demand, especially in industrial applications. However, due to the high cost of annotation, such datasets are usually expensive to create. While Active Learning (AL) can reduce the labeling cost, AL strategies are often only tested on general knowledge domains and tend to use information sources that are not consistent across tasks. We propose Reinforced Active Learning (RAL) to train a Reinforcement Learning policy that utilizes many different aspects of the data and the task in order to select the most informative unlabeled subset dynamically over the course of the AL procedure. We demonstrate the superior performance of the proposed RAL framework compared to strong AL baselines across four intricate multi-class, multi-label text classification datasets taken from specialised domains. In addition, we experiment with a unique data augmentation approach to further reduce the number of samples RAL needs to annotate.
Authors: Lukas Wertz; Jasmina Bogojeska; Katya Mirylenka (IBM); Jonas Kuhn
We propose a method to control the attributes of large language models (LLMs) for the text generation task using Causal ATE scores and counterfactual augmentation. We explore this method in the context of LLM detoxification and propose the Causally Fair Language (CFL) architecture for detoxifying existing pre-trained LLMs in a plug-and-play manner. Our architecture is based on a Structural Causal Model (SCM) that achieves significantly faster training time than many existing detoxification techniques. Further, we achieve state-of-the-art performance in several evaluation metrics using RealToxicityPrompts. Our experiments show that CFL achieves such detoxification without much impact on model perplexity. Using the LM loss over the BOLD dataset, we show that CFL mitigates the unintended bias of other detoxification techniques.
Authors: Rahul Madhavan; Rishabh Garg (IBM); Kahini Wadhawan (IBM); Sameep Mehta (IBM)
Over the past few years, zero-shot prompt-based learning has become a de facto standard in many NLP tasks where training data is unavailable. Particularly for sentiment analysis, much effort has been put into designing high-performing prompt templates. However, two problems exist. First, a large pre-trained language model is often biased towards its training data, leading to poor performance with prompt templates the LM has rarely seen; this problem cannot be resolved by scaling. Second, when it comes to various domains, such as the financial and food domains, re-designing prompt templates by human experts for domain adaptation is required, which is time-consuming and inefficient. To remedy both shortcomings, we propose a simple yet strong data construction method to de-bias prompt templates, yielding a large improvement across different domains, pre-trained language models, and prompt templates. We also demonstrate the advantage of using our domain-agnostic data over in-domain ground-truth data.
Authors: Yang Zhao (IBM); Tetsuya Nasukawa (IBM); Masayasu Muraoka (IBM); Bhatta Bhattacharjee (IBM)
Visit us in the Expo center from 9am - 5pm to talk to researchers, recruiters, and interact with live demos.
Large pre-trained language models based on the transformer architecture have drastically changed the natural language processing (NLP) landscape. However, deploying those models for on-device applications in constrained devices such as smart watches is completely impractical due to their size and inference cost. As an alternative to transformer-based architectures, recent work on efficient NLP has shown that weight-efficient models can reach competitive performance for simple tasks, such as slot filling and intent classification, with model sizes on the order of a megabyte. This work introduces the pNLP-Mixer architecture, an embedding-free MLP-Mixer model for on-device NLP that achieves high weight efficiency thanks to a novel projection layer. We evaluate a pNLP-Mixer model of only two megabytes in size on two multilingual semantic parsing datasets, MTOP and multiATIS. On MTOP, our quantized model achieves 99.2% of the performance of mBERT while using 85x fewer parameters. Our model consistently beats the state of the art for tiny models (pQRNN) of the exact same size by a margin of more than 5%.
Authors: Francesco Fusco (IBM); Damian Pascual; Peter Staar (IBM); Diego Antognini (IBM)
Answering natural language questions using information from tables (TableQA) is of considerable recent interest. In many applications, tables occur not in isolation, but embedded in, or linked to, unstructured text. Often, a question is best answered by matching its parts to either table cell contents or unstructured text spans, and extracting answers from either source. This leads to a new space of TextTableQA problems that was introduced by the HybridQA dataset. Existing adaptations of table representation to transformer-based reading comprehension (RC) architectures fail to tackle the diverse modalities of the two representations through a single system. Training such systems is further challenged by the need for distant supervision. To reduce cognitive burden, training instances usually include just the question and answer, the latter matching multiple table rows and text passages. This leads to a noisy multi-instance training regime involving not only rows of the table, but also spans of linked text. We respond to these challenges by proposing MITQA, a new TextTableQA system that explicitly models the different but closely related probability spaces of table row selection and text span selection. Our experiments indicate the superiority of our approach compared to recent baselines. The proposed method is currently at the top of the HybridQA leaderboard with a held-out test set, achieving a 21% absolute improvement on both EM and F1 scores over previously published results.
Authors: Vishwajeet Kumar (IBM); Yash Gupta; Saneem Chemmengath (IBM); Jaydeep Sen (IBM); Soumen Chakrabarti; Samarth Bharadwaj (IBM); Feifei Pan (IBM)
Text-based reinforcement learning agents have predominantly been neural network-based models with embeddings-based representation, learning uninterpretable policies that often do not generalize well to unseen games. On the other hand, neuro-symbolic methods, specifically those that leverage an intermediate formal representation, are gaining significant attention in language understanding tasks. This is because of their advantages ranging from inherent interpretability, the lesser requirement of training data, and being generalizable in scenarios with unseen data. Therefore, in this paper, we propose a modular, NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract interpretable rules as policies. Our experiments on established text-based game benchmarks show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.
Authors: Subhajit Chaudhury (IBM); Sarath Swaminathan (IBM); Daiki Kimura (IBM); Prithviraj Sen; Keerthiram Murugesan (IBM); Rosario Uceda-Sosa (IBM); Michiaki Tatsubori (IBM); Achille Fokoue (IBM); Pavan Kapanipathi (IBM); Asim Munawar (IBM); Alexander Gray (IBM)
Image-caption pretraining has been quite successfully used for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem – it requires multiple concepts (nouns) from captions to be aligned to several objects in images. To tackle this problem, we go to the roots – the best learner, children. We take inspiration from cognitive science studies dealing with children’s language learning to propose a curriculum learning framework. The learning begins with easy-to-align image caption pairs containing one concept per caption. The difficulty is progressively increased with each new phase by adding one more concept per caption. Correspondingly, the knowledge acquired in each learning phase is utilized in subsequent phases to effectively constrain the learning problem to aligning one new concept-object pair in each phase. We show that this learning strategy improves over vanilla image-caption training in various settings – pretraining from scratch, using a pretrained image or/and pretrained text encoder, low data regime etc.
Authors: Hammad Ayyubi, Rahul Lokesh, Alireza Zareian, Bo Wu and Shih-Fu Chang
Social biases and stereotypes are embedded in our culture in part through their presence in our stories, as evidenced by the rich history of humanities and social science literature analyzing such biases in children stories. Because these analyses are often conducted manually and at a small scale, such investigations can benefit from the use of more recent natural language processing (NLP) methods that examine social bias in models and data corpora. Our work joins this interdisciplinary effort and makes a unique contribution by taking into account the event narrative structures when analyzing the social bias of stories. We propose a computational pipeline that automatically extracts a story's temporal narrative verb-based event chain for each of its characters as well as character attributes such as gender. We also present a verb-based event annotation scheme that can facilitate bias analysis by including categories such as those that align with traditional stereotypes. Through a case study analyzing gender bias in fairy tales, we demonstrate that our framework can reveal bias in not only the unigram verb-based events in which female and male characters participate but also in the temporal narrative order of such event participation.
Authors: Paulina Toro Isaza (IBM); GX Xu (IBM); Toye Oloko (IBM); Yufang Hou (IBM); Nanyun Peng; Dakuo Wang
Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast with static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding, that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach not only can match the quality of sentence encoders in technical domains, but are 5 times smaller and up to 10 times faster, even on high-end GPUs.
Authors: Francesco Fusco (IBM); Diego Antognini (IBM)
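A minimal sketch, under assumed dimensions and architecture, of the reconstruction objective described above: a small character-level encoder is trained to regress onto rows of a large pre-trained embedding matrix. The GRU encoder, 300-dimensional targets, and random toy data are illustrative assumptions, not the authors' exact setup.

import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, out_dim=300):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.gru = nn.GRU(char_dim, 128, batch_first=True)
        self.out = nn.Linear(128, out_dim)

    def forward(self, char_ids):                      # (batch, max_chars)
        h, _ = self.gru(self.emb(char_ids))
        return self.out(h[:, -1])                     # (batch, out_dim)

model = CharEncoder()
loss_fn = nn.MSELoss()
char_ids = torch.randint(0, 256, (8, 12))             # toy batch of character ids
target_vectors = torch.randn(8, 300)                   # matching rows of the large embedding matrix
loss = loss_fn(model(char_ids), target_vectors)        # reconstruction objective
print(loss.item())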
Because of these easily observable trends, we have proposed the SustaiNLP workshop with the goal of promoting more sustainable NLP research and practices, with two main objectives: (1) encouraging development of more efficient NLP models; and (2) providing simpler architectures and empirical justification of model complexity. For both aspects, we will encourage submissions from all topical areas of NLP.
Authors: Marcos Zampieri; Skye Morgan; Kai North; Tharindu Ranasinghe; Austin Simmons; Paridhi Khandelwal; Sara Rosenthal (IBM); Preslav Nakov
Visit us in the Expo center from 9am - 4pm to talk to researchers, recruiters, and interact with live demos.
The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PrimeQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PrimeQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PrimeQA is available at: https://github.com/primeqa.
Authors: Avi Sil (IBM); Jaydeep Sen (IBM); Bhavani Iyer (IBM); Martin Franz (IBM); Kshitij Fadnis (IBM); Mihaela Bornea (IBM); Sara Rosenthal (IBM); Scott McCarley (IBM); Rong Zhang (IBM); Vishwajeet Kumar (IBM); Yulong Li (IBM); Arafat Sultan (IBM); Riyaz Bhat (IBM); Juergen Bross (IBM); Hans Florian (IBM); Salim Roukos (IBM)
This paper studies a new task of federated learning (FL) for semantic parsing, where multiple clients collaboratively train one global model without sharing their semantic parsing data. By leveraging data from multiple clients, the FL paradigm can be especially beneficial for clients that have little training data to develop a data-hungry neural semantic parser on their own. We propose an evaluation setup to study this task, where we re-purpose widely-used single-domain text-to-SQL datasets as clients to form a realistic heterogeneous FL setting and collaboratively train a global model. As standard FL algorithms suffer from the high client heterogeneity in our realistic setup, we further propose a novel LOss Reduction Adjusted Reweighting (Lorar) mechanism to mitigate the performance degradation, which adjusts each client's contribution to the global model update based on its training loss reduction during each round. Our intuition is that the larger the loss reduction, the further away the current global model is from the client's local optimum, and the larger weight the client should get. By applying Lorar to three widely adopted FL algorithms (FedAvg, FedOPT and FedProx), we observe that their performance can be improved substantially on average (4%-20% absolute gain under MacroAvg) and that clients with smaller datasets enjoy larger performance gains. In addition, the global model converges faster for almost all the clients.
Authors: Tianshu Zhang (IBM); Changchang Liu (IBM); Wei-Han Lee (IBM); Yu Su; Huan Sun
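Following the intuition stated above, here is a hedged sketch of the reweighting: each client's contribution is scaled in proportion to its training-loss reduction in the current round. The exact normalization and how the weights enter the aggregation rule are assumptions here, not the paper's precise formulation.

def lorar_weights(loss_before, loss_after):
    """Weight each client by its (non-negative) training-loss reduction this round."""
    reductions = [max(b - a, 0.0) for b, a in zip(loss_before, loss_after)]
    total = sum(reductions) or 1.0
    return [r / total for r in reductions]

# The global update then mixes client deltas with these weights instead of the
# plain FedAvg average: clients whose loss dropped more pull the model further.
print(lorar_weights([1.2, 0.9, 0.5], [0.6, 0.8, 0.45]))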
The recent emergence of Neuro-Symbolic Agent (NeSA) approaches to natural language-based interactions calls for the investigation of model-based approaches. In contrast to the model-free approaches that existing NeSAs take, learning an explicit world model has interesting potential, especially for explainability, which is one of the key selling points of NeSA. To learn useful world models, we leverage one of the recent neuro-symbolic architectures, Logical Neural Networks (LNN). Here, we describe a method that can learn neuro-symbolic world models on the TextWorld-Commonsense set of games. We then show how this can be improved further by adding a proprioception that gives better tracking of the internal logic state and model. Also, the game-solving agent's performance in a TextWorld setting shows a great advantage over the baseline, with an 85% average reduction in steps and a 2.3 average score.
Authors: Don Joven Ravoy Agravante (IBM); Daiki Kimura (IBM); Michiaki Tatsubori (IBM); Asim Munawar (IBM); Alexander Gray (IBM)
The wide applicability and adaptability of large language models (LLMs) has enabled their rapid adoption. While pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance. However, this leads to issues around violation of model licenses, model theft, and copyright infringement. Moreover, recent advances show that generative technology is capable of producing harmful content, which exacerbates the problems of accountability within model supply chains. Thus, we need a method to investigate how a model was trained or a piece of text was generated, and what its source pre-trained model was. In this paper we take a first step toward addressing this open problem by tracing back the origin of a given fine-tuned LLM to its corresponding pre-trained base model. We consider different knowledge levels and attribution strategies, and find that we are able to trace back to the original base model with an AUC of 0.804.
Authors: Myles Foley (IBM); Ambrish Rawat (IBM); Taesung Lee (IBM); Yufang Hou (IBM); Gabriele Picco (IBM); Giulio Zizzo (IBM)
One of the more prominent methods involves using the text-generation capabilities of language models. We propose Text AUgmentation by Dataset Reconstruction (TAU-DR), a novel method of data augmentation for text classification. We conduct experiments on several multi-class datasets, showing that our approach improves the current state-of-the-art techniques for data augmentation.
Authors: Ariel Gera (IBM); Roni Friedman-Melamed (IBM); Ofir Arviv (IBM); Chulaka Gunasekara (IBM); Benjamin Sznajder (IBM); Noam Slonim (IBM); Eyal Shnarch (IBM)
Data drift, the change in model input data, is one of the key factors that lead to machine learning model performance degradation over time. Monitoring drift helps detect these issues and prevent their harmful consequences. Meaningful drift interpretation is a fundamental step towards effective re-training of the model. In this study we propose an end-to-end framework for reliable model-agnostic change-point detection and interpretation in large task-oriented dialog systems, proven effective in multiple customer deployments. We evaluate our approach and demonstrate its benefits with a novel, carefully curated dataset simulating customer requests to a dialog system. We make the data publicly available for the research community.
Authors: Ella Rabinovich (IBM); Matan Vetzler (IBM); Samuel Ackerman (IBM); Ateret Anaby-Tavor (IBM)
Along with the successful deployment of deep neural networks in several application domains, the need to unravel the black-box nature of these networks has seen a significant increase recently. Several methods have been introduced to provide insight into the inference process of deep neural networks. However, most of these explainability methods have been shown to be brittle in the face of adversarial perturbations of their inputs in the image and generic textual domain. In this work we show that this phenomenon extends to specific and important high stakes domains like biomedical datasets. In particular, we observe that the robustness of explanations should be characterized in terms of the accuracy of the explanation in linking a model's inputs and its decisions - faithfulness - and its relevance from the perspective of domain experts - plausibility. This is crucial to prevent explanations that are inaccurate but still look convincing in the context of the domain at hand. To this end, we show how to adapt current attribution robustness estimation methods to a given domain, so as to take into account domain-specific plausibility. This results in our DomainAdaptiveAREstimator (DARE) attribution robustness estimator allowing us to properly characterize the domain-specific robustness of faithful explanations. Next, we provide two methods, adversarial training and FAR training, to mitigate the brittleness characterized by DARE, allowing us to train networks that display robust attributions. Finally, we empirically validate our methods with extensive experiments on three established biomedical benchmarks.
Authors: Adam Ivankay (IBM); Mattia Rigotti (IBM); Pascal Frossard
Nearly all general-purpose neural semantic parsers generate logical forms in a strictly top-down autoregressive fashion. Though such systems have achieved impressive results across a variety of datasets and domains, recent works have called into question whether they are ultimately limited in their ability to compositionally generalize. In this work, we approach semantic parsing from, quite literally, the opposite direction; that is, we introduce a neural semantic parsing generation method that constructs logical forms from the bottom up, beginning from the logical form's leaves. The system we introduce is lazy in that it incrementally builds up a set of potential semantic parses, but only expands and processes the most promising candidate parses at each generation step. Such a parsimonious expansion scheme allows the system to maintain an arbitrarily large set of parse hypotheses that are never realized and thus incur minimal computational overhead. We evaluate our approach on compositional generalization; specifically, on the challenging CFQ dataset and three Text-to-SQL datasets where we show that our novel, bottom-up semantic parsing technique outperforms general-purpose semantic parsers while also being competitive with comparable neural parsers that have been designed for each task.
Authors: Maxwell Crouse (IBM); Pavan Kapanipathi (IBM); Subhajit Chaudhury (IBM); Tahira Naseem (IBM); Ramón Fernandez Astudillo (IBM); Achille Fokoue (IBM); Tim Klinger (IBM)
Key Point Analysis (KPA) has been recently proposed for deriving fine-grained insights from collections of textual comments. KPA extracts the main points in the data as a list of concise sentences or phrases, termed Key Points, and quantifies their prevalence. While key points are more expressive than word clouds and key phrases, making sense of a long, flat list of key points, which often express related ideas in varying levels of granularity, may still be challenging. To address this limitation of KPA, we introduce the task of organizing a given set of key points into a hierarchy, according to their specificity. Such hierarchies may be viewed as a novel type of Textual Entailment Graph. We develop ThinkP, a high quality benchmark dataset of key point hierarchies for business and product reviews, obtained by consolidating multiple expert annotations. We compare different methods for predicting pairwise relations between key points, and for inferring a hierarchy from these pairwise predictions. In particular, for the task of computing pairwise key point relations, we achieve significant gains over existing strong baselines, by applying directional distributional similarity methods to a novel distributional representation of key points, and further boost performance via weak supervision.
Authors: Arie Cattan; Lilach Edelstein (IBM); Yoav Kantor (IBM); Roy Bar-Haim (IBM)
A common practice for Named Entity Recognition (NER) has been to treat the NER problem as a whole. We propose to break the problem into two logical sub-tasks: (1) Span Detection which identifies mention spans from a sentence irrespective of entity type; (2) Span Classification which classifies the spans into their semantic classes. We formulate both sub-tasks as question-answering (QA) problems and build on the BERT architecture. This framework produces two leaner models which can be optimized separately for each sub-task, instead of a large complex model.
Experiments with four cross-domain datasets demonstrate that this pipelined NER approach is effective and time efficient. 2Q-NER outperforms the baselines on OntoNotes5.0, WNUT17 and a cybersecurity dataset and gives on-par performance on BioNLP13CG. In all cases, it achieves a significant reduction in training time. The effectiveness of our system stems from fine-tuning the BERT model twice, separately for span detection and classification.
Authors: Jatin Arora; Youngja Park (IBM)
Transformer-based pre-trained large language models are increasingly popular and widely used in both academic and industrial settings because of their outstanding performance on many academic benchmarks. Nevertheless, there are still concerns about hidden biases in these models that can have adverse effects on certain groups of people, such as discriminatory outcomes or reinforcement of harmful stereotypes. One promising way to inspect and uncover such biases is through visual inspection with a human in the loop. In this paper, we present Finspector, a human-centered visual inspection tool for exploring and comparing biases among foundation models. The goal of the tool is to make it easier to identify potential biases across different bias categories through a set of intuitive visual analytics using log-likelihood scores generated by language models.
Authors: Bum Chul Kwon (IBM); Nandana Mihindukulasooriya (IBM)
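As an illustration of the log-likelihood scores mentioned above, the sketch below computes a pseudo-log-likelihood for a sentence under a masked language model by masking one token at a time. The model choice and the scoring details are assumptions for illustration, not necessarily how Finspector computes its scores.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def pseudo_log_likelihood(sentence):
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Comparing scores of minimally different sentences can surface model preferences.
print(pseudo_log_likelihood("The nurse said she was tired."))
print(pseudo_log_likelihood("The nurse said he was tired."))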
Recently it has been shown that state-of-the-art NLP models are vulnerable to adversarial attacks, where the predictions of a model can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed, and adapted, to the discrete nature of text adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing for language models, have not been studied. In this paper, we study the adversarial robustness provided by various label-smoothing strategies in foundational models for diverse NLP tasks in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
Authors: Yahan Yang; Soham Dan (IBM); Dan Roth; Insup Lee
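A minimal example of the label-smoothing setup studied above, using PyTorch's built-in smoothing; the smoothing value of 0.1 and the toy logits are illustrative choices, not the paper's configuration.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothed targets instead of hard one-hot labels
logits = torch.randn(4, 2)                           # e.g., outputs of a BERT classification head
labels = torch.tensor([0, 1, 1, 0])
loss = loss_fn(logits, labels)                       # discourages over-confident predictions
print(loss.item())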
We present PAIRSPANBERT, a SPANBERT-based pre-trained model specialized for bridging resolution. To this end, we design a novel pre-training objective that aims to learn the contexts in which two mentions are implicitly linked to each other from a large amount of data automatically generated either heuristically or via distant supervision with a knowledge graph. Despite the noise inherent in the automatically generated data, we achieve the best results reported to date on three evaluation datasets for bridging resolution when replacing SPANBERT with PAIRSPANBERT in a state-of-the-art resolver that jointly performs entity coreference resolution and bridging resolution.
Authors: Hideo Kobayashi; Yufang Hou (IBM); Vincent Ng
Human-annotated labels and explanations are critical for training explainable NLP models. However, unlike human-annotated labels whose quality is easier to calibrate (e.g., with a majority vote), human-crafted free-form explanations can be quite subjective, as some recent works have discussed. Before blindly using them as ground truth to train ML models, a vital question needs to be asked: How do we evaluate a human-annotated explanation’s quality? In this paper, we build on the view that the quality of a human-annotated explanation can be measured based on its helpfulness (or impairment) to the ML models’ performance for the desired NLP tasks for which the annotations were collected. In comparison to the commonly used Simulatability score, we define a new metric that can take into consideration the helpfulness of an explanation for model performance at both fine-tuning and inference. With the help of a unified dataset format, we evaluated the proposed metric on five datasets (e.g., e-SNLI) against two model architectures (T5 and BART), and the results show that our proposed metric can objectively evaluate the quality of human-annotated explanations, while Simulatability falls short.
Authors: Bingsheng Yao; Prithviraj Sen (IBM); Lucian Popa (IBM); James Hendler; Dakuo Wang
The MultiCoNER II shared task aims at detecting complex, ambiguous named entities with fine-grained types in a low-context setting. Previous winning systems incorporated external knowledge bases to retrieve helpful contexts. In our submission we additionally propose splitting the NER task into two stages, a Span Extraction step and an Entity Classification step. Our results show that the former does not suffer comparably from the low-context setting, leading to a higher overall performance for an external KB-assisted system. We achieve 3rd place on the multilingual track and an average of 6th place overall.
Authors: Mohab Elkaref (IBM); Nathan Herr (IBM); Shinnosuke Tanaka (IBM); Geeth R De Mel (IBM)
We construct a new dataset in the technology domain, which contains 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance in four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability.
Authors: Vishal Saley; Rocktim Das; Dinesh Raghu (IBM); Mausam
Auditing unwanted social bias in language models (LMs) is inherently hard due to the multi-disciplinary nature of the work. In addition, the rapid evolution of LMs can make benchmarks irrelevant in no time. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness?
We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset BBNLI-next is more challenging than BBNLI: on average, BBNLI-next reduces the accuracy of state-of-the-art NLI models from 95.3%, as observed by BBNLI, to 58.6%. Second, we employ BBNLI-next to showcase the interplay between robustness and bias, and the subtlety in differentiating between the two. Third, we point out shortcomings in current bias scores used in the literature and propose bias measures that take into account pro-/anti-stereotype bias and model brittleness.
We will publicly release the BBNLI-next dataset to inspire research on rapidly expanding benchmarks to keep up with model evolution, along with research on the robustness-bias interplay in bias auditing. (All datasets included in this work are in English only and address US-centered social biases. In the spirit of efficient NLP research, no model training or fine-tuning was performed to conduct this research.) Warning: This paper contains offensive text examples.
Authors: Ioana Baldini Soares (IBM); Chhavi Yadav; Payel Das (IBM); Kush Varshney (IBM)
With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts or what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset totaling 6.5TB, containing 14 million images generated by Stable Diffusion, 1.8 million unique prompts, and hyperparameters specified by real users. We analyze the syntactic and semantic characteristics of prompts. We pinpoint specific hyperparameter values and prompt styles that can lead to model errors and present evidence of potentially harmful model usage, such as the generation of misinformation. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models. DiffusionDB is publicly available at: https://poloclub.github.io/diffusiondb/
Authors: Zijie Wang; Evan Montoya; David Munechika; Haoyang Yang; Benjamin Hoover (IBM); Polo Chau
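For readers who want to explore the dataset, here is a hedged sketch of loading a small DiffusionDB subset through Hugging Face Datasets; the subset name "2m_random_1k" follows the project's documentation and should be treated as an assumption here, as should the field names.

from datasets import load_dataset

# Recent versions of the datasets library may additionally require
# trust_remote_code=True, since DiffusionDB ships a loading script.
dataset = load_dataset("poloclub/diffusiondb", "2m_random_1k")
sample = dataset["train"][0]
print(sample["prompt"])       # the user-written text prompt
print(list(sample.keys()))    # image, prompt, and hyperparameter fields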
We investigate LangID for Brazilian Indigenous Languages (BILs), using the Bible as training data. Our research extends previous work by presenting two analyses of the generalization of Bible-based LangID to non-biblical data. First, with newly collected non-biblical datasets, we show that such a LangID can still provide quite reasonable accuracy for languages with more established writing standards, such as Guarani Mbya and Kaingang, but that accuracy can drop quite drastically depending on the language. Then, we used LangID to get a panorama of what we can expect from applying LangID to a large set of texts, considering about 13M sentences extracted from the Portuguese Wikipedia. The results point out how difficult this task can be, since only 9 sentences were confirmed as correctly classified after manual inspection, and how the lack of handling of other American Indigenous languages can affect the task.
Authors: Paulo Rodrigo Cavalin (IBM); Pedro Henrique Leite Da Silva Pires Domingues (IBM); Julio Nogima (IBM); Claudio Santos Pinhanez (IBM)
Recent years have seen a proliferation of aggressive social media posts, often with real-world consequences for victims. Aggressive behaviour on social media is especially evident during important sociopolitical events such as elections, communal incidents, and public protests. In this paper, we introduce a dataset in English to model political aggression. The dataset comprises public tweets collated across the time-frames of two of the most recent Indian general elections. We manually annotate this data for the task of aggression detection and analyze it for aggressive behaviour. To benchmark the efficacy of our dataset, we perform experiments by fine-tuning pre-trained language models and comparing the results with models trained on an existing but general-domain dataset. Our models consistently outperform the models trained on existing data. Our best model achieves a macro F1-score of on our dataset. We also train models on a combined version of both datasets, achieving the best macro F1-score of on our dataset. Additionally, we create subsets of code-mixed and non-code-mixed data from the combined dataset to observe variations in results due to the Hindi-English code-mixing phenomenon. We publicly release the anonymized data, code, and models for further research.
Authors: Akash Rawat; Nazia Nafis; Dnyaneshwar Bhadane; Diptesh Kanojia; Rudra Murthy Venkataramana (IBM)