Synthetic Data is a Reality, But Are You Asking the Right Questions?


Ray Poynter 16 October 2025


At most conferences and events, we are seeing a growing number of examples of Synthetic Data, in its many forms and versions, being used for real projects. However, there still seems to be a body of thought that promotes unscientific criticism of this approach. This sort of criticism is likely to hold some people back from realizing the potential benefits and could cause real commercial damage to those who follow it blindly.

Perhaps I should explain what I mean by unscientific. Here are the key points I am objecting to:

  1. Ignoring experiments. There are now many published studies that show the strengths and weaknesses of synthetic data. Any criticism of whether synthetic data works or not that does not reference the evidence is unscientific.
  2. Failing to recognise the weaknesses of existing approaches. Too often, we see critics complain about synthetic data because it is not perfect, without acknowledging that there are also problems with asking humans. These problems include panel fraud, poor questions, the inability of humans to be reliable witnesses to their own motivations, and a range of biases, including social desirability bias, question-order bias, and acquiescence bias. The test for synthetic data should not be whether it is better than human data, but whether it is as helpful as human data.
  3. Expressing spiritual-like beliefs as facts. When people say AI will never be able to do X, Y and Z, they often state it as though it were a fact. For example, ‘AI will never be able to replicate the nuances a human can detect in a conversation.’ They say this a) without evidence, and b) in contradiction to the expert view about where AI is heading (e.g. AGI – Artificial General Intelligence). In general, phrases like ‘we will always need X’ or ‘Y will never do Z’ are not scientific; they express a belief rather than a reasoned way of thinking.
  4. Supporting views with inappropriate data. We see people quoting examples from the past (pre-Gen AI), citing poor studies (e.g. not conducted by an expert in the AI/synthetic area), or taking a study out of context.

Below is a sample of the many papers and studies published on Synthetic Data.

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction: 2024, Junsol Kim & Byungkyu Lee, https://arxiv.org/pdf/2305.09620

Simulating Human Behavior with AI Agents: 2025, Joon Sung Park et al, https://hai.stanford.edu/assets/files/hai-policy-brief-simulating-human-behavior-with-ai-agents.pdf

Large Language Models Perform as Strong Collaborators, Insight Generators, in AI-Human Hybrid Marketing Research Study: 2024, Wisconsin School of Business, https://business.wisc.edu/news/large-language-models-perform-as-strong-collaborators-insight-generators-in-ai-human-hybrid-marketing-research-study/

LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings: 2025, Benjamin F. Maier et al, https://arxiv.org/pdf/2510.08338

How Synthetic Sample in B2B research enhances data quality: 2024, Newton X, https://www.newtonx.com/article/synthetic-sample-b2b-research-data-quality/#:~:text=Our%20tests%20compared%20three%20independent,for%20the%20same%20audience%20specs

Database Report: Twin-2K-500: A Data Set for Building Digital Twins of over 2,000 People Based on Their Answers to over 500 Questions: 2025, Olivier Toubia et al, https://pubsonline.informs.org/doi/10.1287/mksc.2025.0262

If you have links to papers looking at Synthetic Data, whether positive, negative or mixed, please do email me a link to them. I want to create a longer, more complete list.

There are lots of valid concerns about synthetic data. The key concerns, IMHO, are:

  1. How can buyers judge what they are buying, and how can they compare the risks with other solutions?
  2. How do we evaluate the results from synthetic data? Too much focus at the moment is placed on means. I would like to see something more profound.
  3. How do we ensure that new, good data is being added, and how do we ensure that synthetic data does not become an input into future synthetic data? Feeding synthetic data into the creation of synthetic data can cause drift and collapse, and should be avoided.
  4. Given that standard statistical significance testing does not work with synthetic data, what should we use instead?
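
To illustrate the second concern, here is a minimal Python sketch of why comparing means alone can mislead. The ratings are made up for illustration and do not come from any of the studies listed above. The function names and the choice of total variation distance are my own assumptions; it is one of many possible ways to compare two response distributions, not an established industry standard.

```python
from collections import Counter


def mean(ratings):
    """Arithmetic mean of a list of numeric ratings."""
    return sum(ratings) / len(ratings)


def distribution(ratings, scale=(1, 2, 3, 4, 5)):
    """Share of responses at each point of the rating scale."""
    counts = Counter(ratings)
    n = len(ratings)
    return {point: counts.get(point, 0) / n for point in scale}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions.

    0 means the distributions are identical; 1 means they do not overlap.
    """
    return 0.5 * sum(abs(p[point] - q[point]) for point in p)


# Hypothetical 1-5 purchase-intent ratings, 100 responses each.
# The human sample is spread across the scale; the synthetic sample
# clusters around the midpoint.
human = [1] * 20 + [2] * 10 + [3] * 40 + [4] * 10 + [5] * 20
synthetic = [2] * 10 + [3] * 80 + [4] * 10

human_dist = distribution(human)
synthetic_dist = distribution(synthetic)
tvd = total_variation_distance(human_dist, synthetic_dist)

print(f"Human mean: {mean(human):.2f}, synthetic mean: {mean(synthetic):.2f}")
print(f"Total variation distance: {tvd:.2f}")
```

In this invented example the two means are identical, yet the synthetic sample has no responses at either end of the scale, so the distance between the distributions is large. A check along these lines, applied to real data, is one way to look past the means.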

Esomar recognises that synthetic data is being used; indeed, many papers have been presented at Esomar Conferences on its use. Esomar has issued guidance on certain types of synthetic data and will publish more soon. Its published guidance sets out what Esomar currently says.

Synthetic data in all of its forms (boosts, personas, digital twins, etc.) is being used by a wide range of smart clients. These clients have worked out where the current options can be used and where they should be avoided.

The key benefits these clients are getting are speed and the ability to conduct research that might not have been done otherwise.

I get a sense that synthetic data is already bigger than CATI as a medium. I suspect that in a couple of years we might see 10% to 20% of projects using some level of synthetic data.

Clearly there are things that the current synthetic data can’t do. Repeated trials report that synthetic data is weak at capturing and replaying empathy. Buyers need to be careful: they need to evaluate the systems on offer and check that what they are buying is fit for purpose.