Synthetic Data and Significance Tests: Why t-tests are Inappropriate and What to Do Instead

Ray Poynter, 19 June 2025


The statistical trap hiding in synthetic respondents
Discussions about synthetic data are everywhere: bolstering hard-to-reach quotas, creating digital twins, and replacing whole sections of fieldwork. By design, these records replicate the distributions of your original survey, so every synthetic element looks plausible. The problem is that the process duplicates information rather than adding it. Variance shrinks, design effects vanish, and the apparent sample size increases. Run an ordinary t-test on such a file and the p-value will look wonderfully tiny, which means you will make Type I errors (declaring differences significant when they are not).

Classic significance testing assumes every respondent is an independent draw from a real population. Synthetic data matches the means and standard deviations of the original data but adds no new information, so the extra rows violate that independence assumption while inflating the apparent sample size. Market researchers, therefore, need alternative ways of quantifying uncertainty.
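To see the trap in miniature, here is an illustrative Python sketch (the sample data and the padding factor are invented for the example). It runs a standard two-sample t-test on a genuine sample, then on the same file padded with rows resampled from it; the p-value shrinks even though no new information has been added.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two genuine groups with a small, genuinely non-significant difference
group_a = rng.normal(loc=5.0, scale=1.5, size=100)
group_b = rng.normal(loc=5.2, scale=1.5, size=100)

# Ordinary t-test on the real respondents
t_real, p_real = stats.ttest_ind(group_a, group_b)

# "Synthetic" padding: rows drawn to mimic the originals, tripling the file size
synth_a = np.concatenate([group_a, rng.choice(group_a, size=200, replace=True)])
synth_b = np.concatenate([group_b, rng.choice(group_b, size=200, replace=True)])

t_synth, p_synth = stats.ttest_ind(synth_a, synth_b)

print(f"Real sample (n=100 per group):   p = {p_real:.3f}")
print(f"Padded sample (n=300 per group): p = {p_synth:.3f}  (misleadingly smaller)")
```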

Bootstrap & permutation tests: resampling reality, not replicas
A straightforward remedy is to abandon parametric formulas and let the data estimate its own sampling distribution. The bootstrap does precisely this: repeatedly resampling the observed respondents with replacement, recalculating your statistic each time, and using the resulting empirical distribution for confidence intervals and hypothesis tests. Permutation tests follow the same logic, repeatedly shuffling the group labels among the genuine respondents to build a null distribution for the difference you care about. Because only the genuine respondents are resampled, both methods automatically preserve the correct amount of variability.
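As a minimal sketch, assuming the genuine respondents are available as two NumPy arrays (the data below is simulated purely for illustration), a bootstrap confidence interval and a permutation test for a difference in means might look like this:

```python
import numpy as np

rng = np.random.default_rng(7)

# Genuine respondents only; synthetic rows are deliberately excluded
group_a = rng.normal(loc=5.0, scale=1.5, size=120)
group_b = rng.normal(loc=5.4, scale=1.5, size=110)

observed_diff = group_b.mean() - group_a.mean()

# Bootstrap: resample the real respondents with replacement many times
n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    resample_a = rng.choice(group_a, size=len(group_a), replace=True)
    resample_b = rng.choice(group_b, size=len(group_b), replace=True)
    boot_diffs[i] = resample_b.mean() - resample_a.mean()

# Percentile 95% confidence interval for the difference in means
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Observed difference: {observed_diff:.2f}")
print(f"95% bootstrap CI:    [{ci_low:.2f}, {ci_high:.2f}]")

# Permutation test: shuffle group labels among the genuine respondents
pooled = np.concatenate([group_a, group_b])
n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[len(group_a):].mean() - shuffled[:len(group_a)].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Permutation p-value: {p_value:.3f}")
```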

Read more about Bootstrapping here.

Bayesian inference: modelling uncertainty explicitly
Resampling treats the sample as sacrosanct; Bayesian analysis treats parameters as random and embraces model-based uncertainty. In a survey context, hierarchical Bayesian models can represent sampling strata, weighting and small-cell variation, then add an extra layer for the synthetic-data generation process. The output is a posterior distribution for every quantity of interest; you read off a 95 % credible interval instead of hunting for a p-value.
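A full hierarchical model for strata, weights and the synthesis step would typically be built in a probabilistic programming tool such as PyMC or Stan, but even a minimal conjugate example shows the change of mindset. The sketch below (the counts are invented for illustration) puts a flat Beta prior on a single survey proportion and reads a 95% credible interval directly from the posterior:

```python
from scipy import stats

# Genuine survey result (invented for illustration): 180 of 400 respondents agree
successes, n = 180, 400

# Beta-Binomial conjugate model with a flat Beta(1, 1) prior
posterior = stats.beta(1 + successes, 1 + n - successes)

# 95% credible interval read straight from the posterior, no p-value needed
ci_low, ci_high = posterior.ppf([0.025, 0.975])
print(f"Posterior mean agreement: {posterior.mean():.3f}")
print(f"95% credible interval:    [{ci_low:.3f}, {ci_high:.3f}]")

# Probability that agreement exceeds 40%, answered directly from the posterior
print(f"P(agreement > 0.40) = {1 - posterior.cdf(0.40):.3f}")
```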

Read more about Bayesian inference here.

Multiple synthetic datasets & Rubin-Reiter combining rules
Another option keeps familiar frequentist machinery but inflates variances to account for synthesis. Donald Rubin first proposed releasing m independent synthetic datasets, and Jerome Reiter later derived combining rules that, much like multiple imputation, recover valid standard errors when the analyses are pooled.
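A rough sketch of the arithmetic, under the combining rule usually quoted for partially synthetic data (the estimates and variances below are invented for illustration, and the degrees-of-freedom formula follows the version commonly cited for this rule): analyse each of the m synthetic datasets separately, pool the point estimates, and inflate the variance with a between-dataset component.

```python
import numpy as np
from scipy import stats

# Point estimates and their squared standard errors (variances) from each of
# m = 5 independently generated synthetic datasets; numbers invented for illustration
estimates = np.array([3.10, 3.25, 2.95, 3.18, 3.05])   # e.g. a mean difference
within_vars = np.array([0.040, 0.042, 0.039, 0.041, 0.040])

m = len(estimates)
q_bar = estimates.mean()      # pooled point estimate
u_bar = within_vars.mean()    # average within-dataset variance
b = estimates.var(ddof=1)     # between-dataset variance

# Total variance under the partially synthetic combining rule: T = u_bar + b / m
total_var = u_bar + b / m
se = np.sqrt(total_var)

# Approximate degrees of freedom and a 95% interval
df = (m - 1) * (1 + u_bar / (b / m)) ** 2
ci = q_bar + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se

print(f"Pooled estimate: {q_bar:.3f}  SE: {se:.3f}  95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```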

Read more about Rubin-Reiter’s methods here.

Key points for insight professionals

  1. Never treat synthetic rows as extra independent interviews. Doing so guarantees under-estimated standard errors and inflated significance.
  2. Resampling can solve the problem when you still have access to the genuine sample. Bootstrap and permutation tests are simple to implement (at least for data scientists), assumption-light and transparent.
  3. Bayesian models offer flexibility for small or uneven samples and naturally propagate the uncertainty introduced by synthesis.
  4. Multiple synthetic datasets, along with Rubin–Reiter rules, allow you to stay in a familiar frequentist world, provided the supplier can deliver several independent datasets.
  5. Document your choice. Clients and users need to understand that some of the data is synthetic, and they should be aware of the basis for your significance testing.