
An invisible watermark to keep tabs on tabular data

An IBM researcher and his colleagues have found a way to embed a secret key in AI-generated tables, allowing AI platforms to more easily protect and track data created with their models.

Building effective AI models can be difficult for enterprises in highly regulated fields because of the red tape involved in accessing sensitive customer data to train these models.

Generative AI has made it easier to create a synthetic alternative — fabricated training data that statistically resembles the shopping patterns, health records, or credit scores of real people but can’t be traced to specific individuals. Enterprises can mine this surrogate synthetic data for insights without putting real customer data at risk.

As synthetic data proliferates, tracking its provenance is becoming increasingly important. Companies that serve AI models may want to prevent data created on their platforms from being stolen to commit fraud or to train a competitor’s models. They may also want to protect themselves in case private or copyrighted information gets inadvertently leaked.

A watermark can embed ownership information into AI-generated content that can be later verified with the help of a watermark extraction and validation tool. Watermarks for AI-generated text and images have been around for about two years now; the technology was recently extended to synthetic tabular data — the workhorse for enterprise AI models — through work by IBM’s Pin-Yu Chen and colleagues at TU Delft, University of Neuchâtel, and University of Turin.

The researchers described their watermarking method for AI-generated tabular data in a spotlight paper presented at ICLR 2025 in Singapore last month.

“If a company’s platform is used to generate tables meant to deceive investors or regulators, that company could be held liable,” said Chen, an expert on AI adversarial testing. “If the synthetic data is automatically watermarked, however, bad actors may think twice about misusing it. Watermarks also allow companies to keep tabs on all the synthetic data they generate.”

Multimodal watermarks

As generative AI changes how we work, standards for disclosing and crediting AI’s assistance are evolving. IBM Research recently published the AI Attribution Toolkit to help content creators write a formal statement describing how they used AI in their work.

Self-reporting standards for AI use are important, said Chen, but they won’t be enough to address the tide of AI-generated content already flooding the internet. “Not everyone will be willing to be upfront about their AI use,” he said. “We need watermarks as well as reliable detectors in the wild.”

AI model providers now have a range of watermarking algorithms to choose from. Some of the most effective ones work by embedding a secret pattern during the generation process. How this signature gets baked in varies by modality.

For large language models, patterns can be inserted by changing how the LLM samples and picks its next token. Watermarking algorithms might, for example, nudge the model to choose more tokens from a sanctioned green list than from a red list. By analyzing the ratio of green to red tokens in a document, a watermark detector can estimate whether it was written by a human or a machine.
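As an illustration only, a toy version of that green-list scheme might hash the previous token to split a vocabulary into green and red halves, then score a document by how often its tokens land in the green half. The vocabulary, hashing choice, and 50/50 split below are assumptions for this sketch, not the exact scheme used by any particular product:

```python
import hashlib
import random

GREEN_FRACTION = 0.5  # assumed share of the vocabulary marked "green" at each step


def green_list(prev_token: str, vocab: list[str]) -> set[str]:
    """Seed a pseudo-random split of the vocabulary with the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * GREEN_FRACTION)])


def green_ratio(tokens: list[str], vocab: list[str]) -> float:
    """Fraction of tokens that fall in the green list of their predecessor."""
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab)
        for i in range(1, len(tokens))
    )
    return hits / max(len(tokens) - 1, 1)


# A watermarking generator would bias sampling toward each step's green list,
# so watermarked text scores well above the unwatermarked baseline (~0.5 here).
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]
print(green_ratio(["the", "cat", "sat", "on", "the", "mat"], vocab))
```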

One drawback of green and red list watermarks is they can degrade the quality of the generated text. To keep watermarked text sounding as natural as possible, and to reduce the chances of it being tampered with, Chen and his colleagues last year introduced a method that inserts two watermarks into each generated token.

Their double watermarking method, Duwak, short for dual watermark, outperformed other leading methods on the human-like quality of the watermarked text, as well as its resistance to tampering. Their complementary extraction and validation tool could also decipher the watermark using a third as many tokens, making it practical for shorter texts.

A different set of algorithms has emerged for watermarking AI-generated images. Today’s most popular image generators use a physics-inspired diffusion process to create realistic yet imaginative pictures.

Statistical noise is gradually added and removed from a seed image to create something new. Watermarks can be added by manipulating the noise sampling process, leaving a secret pattern that can be deciphered with a key while running the process in reverse.
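A heavily simplified sketch of that idea might add a secret pattern to the starting noise and later check for it by correlation. Real systems recover the starting noise by inverting the diffusion process; that inversion step is only mocked here, and the key shape, strength, and threshold are assumptions:

```python
import numpy as np


def embed_key(noise: np.ndarray, key: np.ndarray, strength: float = 0.1) -> np.ndarray:
    """Nudge the initial Gaussian noise toward a secret key pattern."""
    return noise + strength * key


def detect_key(recovered_noise: np.ndarray, key: np.ndarray, threshold: float = 0.05) -> bool:
    """Correlate the (approximately) recovered starting noise with the key."""
    score = float(np.mean(recovered_noise * key))
    return score > threshold


rng = np.random.default_rng(0)
key = rng.standard_normal((64, 64))      # the owner's secret pattern
noise = rng.standard_normal((64, 64))    # diffusion starting noise
watermarked = embed_key(noise, key)

# In practice the detector would run the diffusion process in reverse to
# estimate the starting noise; here we use it directly for illustration.
print(detect_key(watermarked, key))  # True
print(detect_key(noise, key))        # False (with high probability)
```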

Row by row attribution

Chen and his colleagues may be the first to extend watermarking into the process of generating tabular data, rather than tagging the synthetic data itself.

When diffusion models generate an image, they create one holistic latent representation. When they generate a table, however, they form a new representation for each row of data they create. Each row, theoretically, requires its own watermark.

The researchers realized they needed a family of watermarks with patterns close enough to be recognized by the validation tool but varied enough to maintain the statistical properties of the real-world data. Their solution, TabWak, short for tabular watermark, subtly alters the secret pattern in each row.
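The paper’s construction is more sophisticated, but a hypothetical sketch of the per-row idea might derive a slightly different key for every row from one base secret and validate by averaging the per-row signals. All names, sizes, and strengths below are assumptions for illustration, not TabWak’s actual algorithm:

```python
import numpy as np

BASE_SEED = 1234    # assumed owner secret
LATENT_DIM = 16     # assumed size of each row's latent representation
STRENGTH = 0.1      # assumed embedding strength


def row_key(row_index: int) -> np.ndarray:
    """Derive a per-row variant of the secret pattern from the base seed."""
    rng = np.random.default_rng(BASE_SEED + row_index)
    return rng.standard_normal(LATENT_DIM)


def watermark_rows(latents: np.ndarray) -> np.ndarray:
    """Nudge each row's latent toward its own key, keeping the change small."""
    return np.stack([z + STRENGTH * row_key(i) for i, z in enumerate(latents)])


def table_score(latents: np.ndarray) -> float:
    """Average correlation between each row and the key it should carry."""
    return float(np.mean([np.mean(z * row_key(i)) for i, z in enumerate(latents)]))


rng = np.random.default_rng(0)
clean = rng.standard_normal((1000, LATENT_DIM))
marked = watermark_rows(clean)
print(table_score(marked) > table_score(clean))  # True: detectable in aggregate
```

In a scheme like this, the signal in any single row is weak, but it accumulates across many rows, which is why aggregating over the whole table is what makes detection reliable without visibly distorting individual records.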

“We can verify who owns the content, without degrading its usefulness,” said Chen.

Banking is just one of the industries that use synthetic tables to protect customer data while building models. Watermarking these stand-in tables provides an extra layer of security, says former IBM researcher Lydia Y. Chen, who helped create both Duwak and TabWak and now teaches computer science at the University of Neuchâtel.

“If legal disputes arise from the misuse of AI-generated data, it’s important to be able to verify which parties created the tables and should be held responsible,” she said.
