This is a gentle introduction to our work, which will be presented at CCS 2025. Code and datasets are available on GitHub.
Tabular data is one of the most widely used data structures across science, industry, and governance. As data-driven decision-making becomes the norm, sharing and publishing tabular data is increasingly important. However, sharing data raises serious privacy concerns. A promising solution is data synthesis, which generates synthetic data that mimics the real dataset while preserving privacy. Tabular data synthesis has been studied for years, with algorithms ranging from simple histograms to statistical models (e.g., probabilistic graphical models) to advanced deep learning methods (e.g., diffusion models and large language models). Broadly, synthesis algorithms fall into two categories: differentially private (DP) synthesizers, which offer formal guarantees under differential privacy, and heuristically private (HP) synthesizers, which rely on heuristic protections without formal guarantees.
Although many synthesizers have been proposed, a comprehensive understanding of the strengths and weaknesses of different synthesizers remains elusive. DP synthesizers are rarely compared directly with HP ones, and there’s no standardized way to compare statistical methods against deep generative models. What’s more concerning is that different methods use different evaluation pipelines, making comparisons inconsistent. To address this, we propose a comprehensive evaluation framework that works for both DP and HP synthesizers and assesses three aspects: fidelity, privacy, and downstream utility.
Evaluating generative models is inherently difficult, and unlike images or text, synthetic tabular data cannot be meaningfully assessed through visual inspection.
Many evaluation metrics have been proposed, and different methods adopt different ones, yet each claims to be state-of-the-art.
For example, many DP papers rely only on simple one- or two-way query errors, while HP methods borrow metrics from image and text generation.
Fidelity measures how closely synthetic data matches real data. Unlike images or text, where the joint distribution is evaluated as a whole, the fidelity of tabular data is often better measured through marginal distributions. However, current approaches use different statistical metrics for discrete and continuous attributes, making results hard to compare.
We propose using Wasserstein distance to measure marginal similarity. It works consistently for discrete, continuous, or mixed attributes and provides a meaningful structural similarity metric. Interestingly, we found that under this unified metric, some newer DP synthesizers actually underperform compared to simpler statistical methods. However, without DP constraints, advanced models, especially diffusion-based ones, can achieve impressive fidelity and generate highly realistic data.
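To make this concrete, here is a minimal sketch of how such a marginal-level Wasserstein fidelity score could be computed with scipy. The choices below (one-way marginals, min-max scaling for numeric columns, total variation distance for categorical ones, simple averaging across attributes) are our own simplifications for illustration, not the exact construction in the paper.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance


def marginal_distance(real: pd.Series, synth: pd.Series) -> float:
    """Distance between the one-way marginals of a single attribute (lower is better).

    Numeric columns: 1-Wasserstein distance after min-max scaling, so values are
    comparable across attributes. Categorical columns: total variation distance,
    which is the Wasserstein distance under a 0/1 ground metric. Both choices are
    illustrative simplifications.
    """
    if real.dtype.kind in "ObUS":  # categorical-like columns
        p = real.value_counts(normalize=True)
        q = synth.value_counts(normalize=True)
        support = p.index.union(q.index)
        return 0.5 * float((p.reindex(support, fill_value=0.0)
                            - q.reindex(support, fill_value=0.0)).abs().sum())
    lo, hi = float(real.min()), float(real.max())
    scale = (hi - lo) or 1.0
    return float(wasserstein_distance((real.to_numpy(dtype=float) - lo) / scale,
                                      (synth.to_numpy(dtype=float) - lo) / scale))


def fidelity_score(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> float:
    """Average one-way marginal distance across all attributes (lower is better)."""
    return float(np.mean([marginal_distance(real_df[c], synth_df[c]) for c in real_df.columns]))
```

The same idea extends naturally to two-way marginals by measuring the distance over pairs of attributes, which yields a single comparable score regardless of whether attributes are discrete, continuous, or mixed.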
We first observe that widely used privacy evaluation metrics for HP synthesizers (e.g., Distance to Closest Record (DCR)) are heuristic and do not reliably capture the risk of membership disclosure.
Therefore, we propose a new semantic privacy evaluation metric, based on the principles of differential privacy, that works for all types of synthesizers. Unlike membership inference attacks (MIAs), which measure privacy risk from an adversary’s perspective, our approach directly approximates the leave-one-out distinguishability of every record in the training set and uses the maximum leakage as the privacy risk measure. We present extensive experiments showing that our metric, the Membership Disclosure Score (MDS), can faithfully identify the privacy risks of HP synthesizers and is more effective than MIAs and other heuristic privacy metrics.
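As an illustration of the leave-one-out idea behind MDS (not the paper’s exact estimator), the sketch below trains shadow synthesizers on subsets that include or exclude each record and reports the largest gap in nearest-neighbor distance between the two cases. The `fit_and_sample` callable and the numeric encoding of records are assumed placeholders.

```python
import numpy as np


def closest_distance(record: np.ndarray, synth: np.ndarray) -> float:
    """Distance from one (numerically encoded) record to its nearest synthetic record."""
    return float(np.min(np.linalg.norm(synth - record, axis=1)))


def membership_disclosure_score(train: np.ndarray, fit_and_sample, n_shadow: int = 20,
                                seed: int = 0) -> float:
    """Leave-one-out sketch of a membership disclosure score.

    For every training record, shadow synthesizers are trained with and without
    that record; the gap in nearest-neighbor distance between the two cases
    approximates how distinguishable the record's presence is, and the maximum
    gap over all records is reported as the privacy risk. `fit_and_sample(data)`
    is an assumed user-supplied callable that trains the synthesizer on `data`
    and returns a synthetic table. This follows only the high-level description
    above, not the paper's exact estimator.
    """
    rng = np.random.default_rng(seed)
    n = len(train)
    gaps = np.zeros(n)
    for i in range(n):
        d_in, d_out = [], []
        for _ in range(n_shadow):
            others = rng.choice(np.delete(np.arange(n), i), size=n // 2, replace=False)
            d_out.append(closest_distance(train[i], fit_and_sample(train[others])))
            d_in.append(closest_distance(train[i], fit_and_sample(train[np.append(others, i)])))
        # synthetic data landing much closer to the record when it was in training => leakage
        gaps[i] = np.mean(d_out) - np.mean(d_in)
    return float(np.max(gaps))
```

This sketch also makes clear why such a metric is computationally expensive: every record requires training multiple shadow synthesizers, which is the limitation we return to at the end of this post.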
Synthetic data is often used to train models for downstream tasks, such as machine learning prediction. However, unlike vision or NLP, where standardized benchmark models exist, the performance of tabular prediction can vary widely across different models. There’s still ongoing debate about which model families (e.g., gradient-boosted trees versus deep networks) perform best on tabular data, which makes utility results sensitive to the choice of evaluation model.
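One common way to reduce this sensitivity is to evaluate utility with several downstream models at once. The sketch below shows a train-on-synthetic, test-on-real (TSTR) style evaluation with two illustrative scikit-learn models; the model choices and the macro-F1 metric are our assumptions, not a prescribed benchmark.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def tstr_utility(synth_X, synth_y, real_X_test, real_y_test) -> dict:
    """Train-on-synthetic, test-on-real utility, reported per model family.

    Evaluating with several downstream learners guards against conclusions that
    hinge on one (possibly unrepresentative) model. Assumes features are already
    numerically encoded.
    """
    models = {
        "logreg": LogisticRegression(max_iter=1000),
        "gbdt": GradientBoostingClassifier(),
    }
    return {name: f1_score(real_y_test,
                           model.fit(synth_X, synth_y).predict(real_X_test),
                           average="macro")
            for name, model in models.items()}
```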
We also find that many synthesis algorithms are highly sensitive to their hyperparameters. Achieving optimal performance for each dataset requires careful tuning. However, most synthesizers do not provide a clear methodology for this tuning process. Therefore, we propose a unified model tuning objective that combines our proposed fidelity and utility metrics. We find that this tuning objective can significantly improve the performance of many synthesizers (especially deep learning-based ones) and enables fairer comparisons. We note that our tuning objective is by no means perfect and advocate for researchers to design specific tuning strategies tailored to their own synthesis algorithms.
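A rough sketch of what such a combined tuning objective could look like is shown below, reusing the `fidelity_score` and `tstr_utility` helpers from the earlier sketches. The equal weighting, the `label` column name, and the `fit_and_sample(train_df, **params)` synthesizer interface are illustrative assumptions, not the paper’s exact objective.

```python
import numpy as np


def tuning_objective(train_df, X_test, y_test, fit_and_sample, params, label="label"):
    """Scalar objective (lower is better) for hyperparameter search.

    Combines the marginal-fidelity distance with the downstream utility error
    (1 - mean TSTR F1). `fidelity_score` and `tstr_utility` are the sketches
    defined above; all other names are assumed for illustration.
    """
    synth_df = fit_and_sample(train_df, **params)  # hypothetical synthesizer API
    fid = fidelity_score(train_df.drop(columns=[label]), synth_df.drop(columns=[label]))
    util = tstr_utility(synth_df.drop(columns=[label]), synth_df[label], X_test, y_test)
    return fid + (1.0 - float(np.mean(list(util.values()))))


# A simple search over candidate hyperparameter sets; a library such as Optuna
# could replace the search loop itself:
# best_params = min(candidate_param_sets,
#                   key=lambda p: tuning_objective(train_df, X_test, y_test, fit_and_sample, p))
```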
In this work, we carefully examine the limitations of existing evaluation metrics for tabular data synthesis and propose a systematic and principled framework for assessing the performance of both DP and HP synthesizers. We hope that future work on tabular data synthesis will adopt our proposed evaluation framework to enable better and fairer comparisons, helping to clearly identify the advantages and limitations of new advances in the field.
However, while our paper makes a significant step toward a standardized evaluation and serves as a bridge to connect and compare DP and HP synthesis algorithms, it has its own limitations. For example, the computation of our privacy metric (MDS) is quite expensive, as it requires training many shadow models for an accurate privacy estimation. Nevertheless, we hope our work will inspire future research to design tabular data synthesis algorithms that achieve both high fidelity and strong privacy.