How to assess tabular data synthesis algorithms in a principled way?

This is a gentle introduction to our work, which will be presented at CCS 2025. Code and datasets are available on GitHub.

Introduction

Tabular data is one of the most widely used data structures across science, industry, and governance. As data-driven decision-making becomes the norm, sharing and publishing tabular data is increasingly important. However, sharing data raises serious privacy concerns. A promising solution is data synthesis, which generates synthetic data that mimics the real dataset while preserving privacy. Tabular data synthesis has been studied for years, with algorithms ranging from simple histograms to statistical models (e.g., probabilistic graphical models) to advanced deep learning methods (e.g., diffusion models and large language models). Broadly, synthesis algorithms can be divided into two categories:

  1. Differentially private (DP) synthesizers, which satisfy a formal differential privacy guarantee.
  2. Heuristically private (HP) synthesizers, which offer no formal guarantee and instead rely on heuristic notions of privacy protection.

Proposed Evaluation Process

Although many synthesizers have been proposed, a comprehensive understanding of their strengths and weaknesses remains elusive. DP synthesizers are rarely compared directly with HP ones, and there is no standardized way to compare statistical methods against deep generative models. More concerning, different methods use different evaluation pipelines, making comparisons inconsistent. To address this, we propose a comprehensive evaluation process that works for both DP and HP synthesizers (a minimal code sketch of the pipeline follows the list):

  1. Data preparation. Tabular data usually requires preprocessing before synthesis, and different algorithms use different preprocessing techniques. For instance, statistical methods select low-dimensional marginals to represent distributions compactly, while deep generative models rely on encoding and normalization.
  2. Model tuning. Hyperparameter tuning is crucial for fair comparison but is often poorly documented. We found that proper tuning dramatically affects synthetic data quality. Our paper proposes a simple but effective tuning approach to ensure fair evaluation across different algorithms.
  3. Model training. After tuning, models are trained to generate synthetic data. Some statistical synthesizers don’t have learnable parameters and may skip this phase, while deep models optimize complex objectives.
  4. Model evaluation. The trained model generates synthetic data, which is then evaluated on three key aspects: fidelity, privacy, and utility.
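
To make the process concrete, here is a minimal sketch of the four phases in Python. The interface is an illustrative assumption, not the API of our released code: the synthesizer is assumed to expose fit/sample, and the preprocessing, tuning, and scoring steps are passed in as callables.

```python
# Minimal sketch of the four-phase evaluation pipeline.
# All interfaces here are illustrative placeholders, not our released API.
def evaluate_synthesizer(synth_cls, real_train, real_test,
                         preprocess, tune, fidelity, privacy, utility):
    # 1. Data preparation: algorithm-specific preprocessing
    #    (marginal selection, encoding, normalization, ...).
    prepared = preprocess(real_train)

    # 2. Model tuning: choose hyperparameters with a unified objective
    #    combining fidelity and utility (see "Model Tuning" below).
    best_params = tune(synth_cls, prepared)

    # 3. Model training: fit the synthesizer; purely statistical methods
    #    without learnable parameters effectively skip this step.
    model = synth_cls(**best_params).fit(prepared)

    # 4. Model evaluation: sample synthetic data and score it on the
    #    three axes discussed below.
    synthetic = model.sample(len(real_train))
    return {
        "fidelity": fidelity(real_test, synthetic),
        "privacy": privacy(synth_cls, best_params, real_train),
        "utility": utility(real_train, real_test, synthetic),
    }
```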

What Makes a Good Evaluation Metric?

Evaluating generative models is inherently difficult, and tabular data cannot be assessed qualitatively through visual inspection the way images can. Many evaluation metrics have been proposed, different methods adopt different ones, and each claims to be state-of-the-art. For example, many DP papers rely only on simple one- and two-way query errors, while HP methods borrow fidelity metrics from image and text generation. Even worse, some HP papers use similarity-based privacy metrics that do not align with the principles of differential privacy, leading to misleading conclusions about their privacy risks.

Fidelity Evaluation Metrics

Fidelity measures how closely synthetic data matches real data. Unlike images or text, where joint distributions are natural to evaluate, tabular data is often better measured through marginal distributions. However, current approaches use different statistical metrics for discrete and continuous attributes, making results hard to compare.

We propose using Wasserstein distance to measure marginal similarity. It works consistently for discrete, continuous, or mixed attributes and provides a meaningful structural similarity metric. Interestingly, we found that under this unified metric, some newer DP synthesizers actually underperform compared to simpler statistical methods. However, without DP constraints, advanced models, especially diffusion-based ones, can achieve impressive fidelity and generate highly realistic data.
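
As a concrete illustration, the sketch below computes a per-attribute (one-way marginal) Wasserstein distance with SciPy. The normalization choices are assumptions of this sketch rather than the paper's exact configuration: numeric columns are min-max scaled, and categorical columns use the 0/1 ground metric, under which the Wasserstein distance reduces to total variation.

```python
# Sketch: per-attribute (one-way marginal) Wasserstein distance between
# real and synthetic data. Assumptions of this sketch: numeric columns
# are min-max scaled using the real data's range; categorical columns use
# the 0/1 ground metric, so Wasserstein equals total variation distance.
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def marginal_wasserstein(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    scores = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            lo, hi = real[col].min(), real[col].max()
            span = (hi - lo) or 1.0  # avoid division by zero for constant columns
            scores[col] = wasserstein_distance(
                (real[col] - lo) / span, (synth[col] - lo) / span
            )
        else:
            cats = sorted(set(real[col]) | set(synth[col]))
            p = real[col].value_counts(normalize=True).reindex(cats, fill_value=0)
            q = synth[col].value_counts(normalize=True).reindex(cats, fill_value=0)
            scores[col] = 0.5 * np.abs(p.values - q.values).sum()  # TV distance
    return scores
```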

Privacy Evaluation Metrics

We first observe that the privacy metrics widely used for HP synthesizers (e.g., Distance to Closest Record (DCR)) are syntactic: they are properties of the input/output dataset pair and are independent of the synthesis algorithm itself. Over the last decade, the privacy community has recognized the limitations of such syntactic notions (like k-anonymity) and shifted towards semantic privacy definitions like differential privacy. HP synthesizers evaluated only with syntactic metrics may therefore provide a false sense of privacy. Furthermore, the widely used empirical privacy evaluation, the membership inference attack (MIA), is not effective for all types of tabular data synthesis.
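
To see why DCR is syntactic, note that it can be computed from the two tables alone. The sketch below shows one common formulation (exact definitions vary slightly across papers); nothing about the synthesis algorithm enters the computation.

```python
# Sketch of Distance to Closest Record (DCR), a syntactic privacy metric:
# it only looks at the (real, synthetic) dataset pair, never at the
# synthesis algorithm. Assumes both tables are already numerically encoded
# with the same columns. Uses O(n*m) memory; fine for a small sketch.
import numpy as np

def dcr(real: np.ndarray, synth: np.ndarray) -> np.ndarray:
    # For each synthetic record, the Euclidean distance to its nearest
    # real record; larger values are (heuristically) read as "more private".
    dists = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    return dists.min(axis=1)
```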

Therefore, we propose a new semantic privacy metric, grounded in the principles of differential privacy, that works for all types of synthesizers. Unlike MIA, which measures privacy risk from an adversary's perspective, our approach directly approximates the leave-one-out distinguishability of every record in the training set and uses the maximum leakage as the privacy risk measure. Extensive experiments show that our metric, the Membership Disclosure Score (MDS), faithfully identifies the privacy risks of HP synthesizers and is more effective than MIA and other heuristic privacy metrics.
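
The sketch below conveys the leave-one-out idea only; it is not the paper's exact MDS algorithm. `fit_and_sample` and `neighborhood_density` are hypothetical helpers standing in for shadow-model training and the distinguishability statistic, and the absolute-difference score is merely illustrative.

```python
# Conceptual sketch of leave-one-out distinguishability, the idea behind
# the Membership Disclosure Score (MDS). NOT the paper's exact algorithm:
# `fit_and_sample` and `neighborhood_density` are hypothetical helpers,
# and the distance statistic is only illustrative. `train` is a NumPy array.
import numpy as np

def mds_sketch(train, fit_and_sample, neighborhood_density, n_shadow=8):
    risks = []
    for i in range(len(train)):
        x = train[i]
        loo = np.delete(train, i, axis=0)  # leave record i out
        with_x, without_x = [], []
        for _ in range(n_shadow):  # shadow models on both neighboring datasets
            with_x.append(neighborhood_density(fit_and_sample(train), x))
            without_x.append(neighborhood_density(fit_and_sample(loo), x))
        # A record is at risk if the synthetic output "looks different"
        # depending on whether it was in the training data.
        risks.append(abs(np.mean(with_x) - np.mean(without_x)))
    return max(risks)  # report the worst-case (maximum) leakage
```

As noted in the discussion below, this procedure is expensive because it trains many shadow models per record.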

Utility Evaluation Metrics

Synthetic data is often used to train models for downstream tasks, such as machine learning prediction. However, unlike vision or NLP, where standardized benchmark models exist, the performance of tabular prediction can vary widely across different models. There’s still ongoing debate about whether deep learning even excels in tabular tasks. To mitigate this, we propose measuring the relative performance drop across several ML models when trained on synthetic versus real data. This relative drop serves as our measure of utility preservation.
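
For illustration, the sketch below computes such a relative drop with two scikit-learn classifiers and accuracy as the metric; both choices are placeholders, since the actual evaluation averages over several ML models.

```python
# Sketch: utility as the relative performance drop of downstream models
# trained on synthetic vs. real data. The model list and the accuracy
# metric are illustrative choices, not a fixed benchmark.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def relative_utility_drop(X_real, y_real, X_synth, y_synth, X_test, y_test):
    drops = []
    for make_model in (lambda: LogisticRegression(max_iter=1000),
                       lambda: RandomForestClassifier(n_estimators=200)):
        acc_real = accuracy_score(y_test, make_model().fit(X_real, y_real).predict(X_test))
        acc_synth = accuracy_score(y_test, make_model().fit(X_synth, y_synth).predict(X_test))
        drops.append((acc_real - acc_synth) / acc_real)
    return sum(drops) / len(drops)  # average relative drop; smaller is better
```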

Model Tuning

We also find that many synthesis algorithms are highly sensitive to their hyperparameters. Achieving optimal performance for each dataset requires careful tuning. However, most synthesizers do not provide a clear methodology for this tuning process. Therefore, we propose a unified model tuning objective that combines our proposed fidelity and utility metrics. We found that this tuning objective can significantly improve the performance of many synthesizers (especially deep learning-based ones) and enables fairer comparisons. We note that our tuning objective is by no means perfect and advocate for researchers to design specific tuning strategies tailored to their own synthesis algorithms.
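
As a rough illustration, the sketch below shows what such a tuning loop might look like, assuming a simple random search and an equal weighting of fidelity and utility; the exact objective and search strategy in the paper may differ, and the fit/sample synthesizer interface is a placeholder.

```python
# Sketch of a unified tuning objective combining fidelity and utility.
# The equal weighting (alpha=0.5), the random search, and the fit/sample
# interface are illustrative assumptions, not the paper's exact setup.
import random

def tune(synth_cls, real_train, real_val, search_space,
         fidelity, utility, n_trials=30, alpha=0.5):
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):  # simple random search; any tuner works here
        params = {k: random.choice(v) for k, v in search_space.items()}
        model = synth_cls(**params).fit(real_train)      # placeholder interface
        synthetic = model.sample(len(real_train))
        score = (alpha * fidelity(real_val, synthetic)
                 + (1 - alpha) * utility(real_train, real_val, synthetic))
        if score < best_score:                           # lower is better
            best_params, best_score = params, score
    return best_params
```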

Discussion & Conclusion

In this work, we carefully examine the limitations of existing evaluation metrics for tabular data synthesis and propose a systematic and principled framework for assessing the performance of both DP and HP synthesizers. We hope that future work on tabular data synthesis will adopt our proposed evaluation framework to enable better and fairer comparisons, helping to clearly identify the advantages and limitations of new advances in the field.

However, while our paper takes a significant step toward standardized evaluation and serves as a bridge connecting and comparing DP and HP synthesis algorithms, it has its own limitations. For example, computing our privacy metric (MDS) is quite expensive, as it requires training many shadow models for an accurate privacy estimate. Nevertheless, we hope our work will inspire future research to design tabular data synthesis algorithms that achieve both high fidelity and strong privacy.