Workshop paper

StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

Abstract

Extracting structured information from text, such as key-value pairs that could augment tabular data, is useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or on documents specific to a given organization. Building such benchmarks through manual annotation is labour-intensive and limits their size and scalability. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth and follows a two-stage “plan-then-execute” pipeline to synthetically generate corresponding natural-language reports. To ensure alignment between the text and its structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring unit and time accuracy. We evaluated the proposed method on 87,881 examples across 50 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models preserve numerical and temporal information with high fidelity, yet this information becomes embedded in narratives that resist automated extraction. We release a comprehensive infrastructure, including datasets, evaluation tools, and baseline extraction systems, to support continued research. Our findings highlight a critical gap: models can generate accurate text but struggle to maintain information accessibility, a key requirement for practical deployment in sectors that demand both accuracy and machine processability.
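As a rough illustration of what an objective extraction check could look like, the sketch below scores how many numeric values from a table row can be recovered verbatim from a generated report. It is not the StructText implementation: the function names `extract_numbers` and `numeric_recall`, the example row, and the regex-based matching are all assumptions made for this example.

```python
import re

# Hypothetical helper names; not taken from the StructText codebase.
# Matches integers and decimals; a leading minus sign is only accepted
# when it does not directly follow a digit (so "2020-2021" yields 2020 and 2021).
NUMBER_RE = re.compile(r"(?<!\d)-?\d+(?:\.\d+)?")


def extract_numbers(text: str) -> set[float]:
    """Return the set of numbers that appear in free text."""
    # Drop thousands separators so "1,200" is read as 1200.
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return {float(match) for match in NUMBER_RE.findall(cleaned)}


def numeric_recall(row: dict, report: str) -> float:
    """Fraction of the row's numeric values that can be found in the report."""
    expected = [
        float(value)
        for value in row.values()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]
    if not expected:
        return 1.0  # nothing numeric to recover
    found = extract_numbers(report)
    hits = sum(1 for value in expected if value in found)
    return hits / len(expected)


# Assumed example data, for illustration only.
row = {"region": "North", "units_sold": 1200, "revenue": 45000.5}
report = "The North region sold 1,200 units, bringing in revenue of 45000.5."
score = numeric_recall(row, report)
```

A report that paraphrases the figures (for example, "about twelve hundred units") would score lower under this check even though it is factually faithful, which is the kind of gap between accuracy and extractability that the abstract describes.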