Workshop paper

In-Context Bias Propagation in LLM-Based Tabular Data Generation

Abstract

Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in low-resource settings. While prior work has shown the potential of augmenting underrepresented groups to improve downstream performance, these benefits often assume access to a subset of in-context examples that is unbiased and representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild bias in the in-context prompt leads to global statistical distortions. We further introduce an adversarial scenario in which a malicious contributor can inject bias into the synthetic dataset via a subset of in-context samples, ultimately compromising the fairness of downstream classifiers for a targeted subgroup. Our findings define and validate a new vulnerability of LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.