What is a synthetic respondent?
A synthetic respondent is an AI-generated stand-in for a single human survey participant, built to answer a research instrument as if it were a real person. It is the individual record in a study, the unit that fills one row of a dataset. Synthetic respondents are assembled into cohorts and synthetic audiences, the collective units researchers actually report on.
The American Association for Public Opinion Research, in its May 2026 report Responsible AI Integration in Survey Research, gives the working definition: LLMs are used to generate sets of responses to survey instruments, often conditioned on demographic, attitudinal, or contextual information, in order to approximate how members of a target population might answer. AAPOR is careful to note what is being measured. A response generated by an AI system is “a model-based approximation of what a person might say, not a direct observation of human expression.”
How does a synthetic respondent differ from a persona, a cohort, and an audience?
The four terms describe different layers of the same object.
A synthetic respondent is one record. It corresponds to a single completed questionnaire, with answers to every item in the instrument. It is the level at which crosstabs are computed.
A synthetic persona, in the ICC/ESOMAR sense, is the agent that generates the record. The 2025 International Code defines it as “a digital representation of a person generated to mimic the behaviours, preferences, and characteristics of real people or groups.” The persona is the engine. The respondent is the output. In practice, many vendors and researchers collapse the two terms, which is part of why the field is hard to read.
A cohort is a defined grouping above the respondent level: likely voters in a state, buyers of a category, executives in a sector. Cohorts are how studies are sampled and how findings are reported.
A synthetic audience is the full simulated population for a study, assembled from individual respondents to fit a sample frame. Where a single respondent answers one questionnaire, an audience produces the distribution.
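The layering above can be sketched as a small data model. This is a minimal illustration, not any vendor's or standards body's schema: the class names, fields, and the `cohort` and `distribution` helpers are all hypothetical, chosen only to show that a persona generates a record, a respondent is that record, a cohort is a filter over records, and an audience is what yields a distribution.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Persona:
    """The generating agent: the description a model is conditioned on."""
    persona_id: str
    attributes: dict[str, str]


@dataclass
class Respondent:
    """One record: a single completed questionnaire produced by a persona."""
    persona_id: str
    answers: dict[str, str]  # item id -> chosen answer


@dataclass
class Audience:
    """The full simulated population for a study."""
    respondents: list[Respondent] = field(default_factory=list)

    def cohort(self, personas: dict[str, Persona], **criteria: str) -> list[Respondent]:
        """Respondents whose persona matches every criterion, e.g. state='OH'."""
        return [
            r for r in self.respondents
            if all(personas[r.persona_id].attributes.get(k) == v
                   for k, v in criteria.items())
        ]

    def distribution(self, item: str) -> dict[str, float]:
        """Share of respondents giving each answer to one item."""
        counts = Counter(r.answers[item] for r in self.respondents if item in r.answers)
        total = sum(counts.values())
        return {answer: n / total for answer, n in counts.items()}


# Hypothetical three-person audience, for illustration only.
personas = {
    "p1": Persona("p1", {"state": "OH", "likely_voter": "yes"}),
    "p2": Persona("p2", {"state": "OH", "likely_voter": "no"}),
    "p3": Persona("p3", {"state": "PA", "likely_voter": "yes"}),
}
audience = Audience([
    Respondent("p1", {"q1": "agree"}),
    Respondent("p2", {"q1": "disagree"}),
    Respondent("p3", {"q1": "agree"}),
])
ohio_likely = audience.cohort(personas, state="OH", likely_voter="yes")
q1_shares = audience.distribution("q1")
```

Keeping the persona separate from the respondent record mirrors the engine-versus-output distinction: the same persona could in principle generate many records, while each record fills exactly one row.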
AAPOR adds a related point of vocabulary. The report prefers “synthetic responses” over “synthetic samples” because “the method is not a sampling design, but an attempt to estimate what a specified set of respondents would say.” The same logic is why respondent and audience are not interchangeable: the respondent is the unit of estimation, not the unit of design.
Where did the term come from?
AAPOR provides the cleanest source-of-record statement available: “Argyle and colleagues (2023) introduced the concept of ‘silicon samples’ (which we refer to as synthetic respondents) and criteria for assessing ‘algorithmic fidelity’ for LLMs.” That is the terminological lineage, anchored by a standards body.
The practice predates the phrase “synthetic respondent.” Argyle and colleagues at Brigham Young University published “Out of One, Many: Using Language Models to Simulate Human Samples” in Political Analysis in February 2023. They conditioned GPT-3 on thousands of sociodemographic backstories drawn from real U.S. survey respondents and named the result “silicon samples.” They introduced “algorithmic fidelity” to mean the property by which a language model accurately emulates the response distributions of human subgroups when properly conditioned.
In the same month, Brand, Israeli, and Ngwe published a Harvard Business School working paper showing that LLM-generated willingness-to-pay estimates are “realistic and comparable to estimates from human studies.” This is the earliest empirical case for the practice in a market research context.
By June 2025, the ICC/ESOMAR International Code had been revised to define synthetic data and synthetic personas formally, the first time a global market research standards body had done so. Park et al.'s November 2024 paper anchored the methodological literature with a benchmark the field is still calibrating against. AAPOR's May 2026 report locked the terminology in for the public-opinion research community.
Grounded versus prompted: the most important distinction
The single most consequential question about a synthetic respondent is what it is conditioned on. The literature now distinguishes cleanly between two approaches.
Grounded. The respondent is built from real human data: interview transcripts, survey responses, psychometric instruments, behavioral records. The agent is conditioned on individual-level evidence drawn from a specific human or a sampled population.
Prompted. The respondent is described in natural language to a general-purpose LLM, with no grounding in observed human responses. The agent is asked to roleplay a persona defined by demographic descriptors.
Park et al. (2024) is the most precise comparison available. Using a national sample of 1,052 Americans, they built agents from (i) two-hour semi-structured interviews, (ii) structured surveys (GSS and Big Five), or (iii) both sources combined. On held-out General Social Survey items, agent accuracy reached 83% (interview only), 82% (surveys only), and 86% (combined) of participants' own two-week test-retest consistency. Agents prompted only with demographic descriptions reached 74%. AAPOR summarizes the finding directly: interview-based definitions of synthetic respondents “outperformed those from direct specification on survey item accuracy and on personality inventories.”
The framing matters. The 83% to 86% figures are not absolute accuracy against a ground truth. They are accuracy relative to the human self-replication ceiling, the rate at which a real respondent matches their own prior answer when retested. A synthetic respondent cannot be more reliable than the human it is trying to replicate.
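The arithmetic behind a ceiling-relative score can be sketched as follows. This is an illustrative reading of the framing described above, not Park et al.'s published code: the function names and the ten-item answer lists are hypothetical, and the numbers are invented for the example.

```python
def agreement_rate(answers_a, answers_b):
    """Share of items on which two answer lists give the same answer."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


def normalized_accuracy(agent_accuracy, human_retest_consistency):
    """Agent accuracy expressed as a fraction of the human self-replication ceiling."""
    if not 0 < human_retest_consistency <= 1:
        raise ValueError("retest consistency must be in (0, 1]")
    return agent_accuracy / human_retest_consistency


# Hypothetical data: one person's answers to ten items in two waves,
# plus the answers an agent built from that person's data produced.
wave_1 = ["a", "b", "a", "c", "b", "a", "c", "b", "a", "b"]
wave_2 = ["a", "b", "a", "c", "b", "a", "c", "b", "c", "a"]  # differs on 2 items
agent = ["a", "b", "a", "c", "b", "a", "c", "a", "b", "a"]   # differs on 3 items

raw_accuracy = agreement_rate(agent, wave_1)   # 7 of 10 items match
ceiling = agreement_rate(wave_2, wave_1)       # 8 of 10 items match
score = normalized_accuracy(raw_accuracy, ceiling)
print(f"raw={raw_accuracy:.2f} ceiling={ceiling:.2f} normalized={score:.3f}")
```

In this toy case a raw accuracy of 70% becomes 87.5% once divided by the person's own 80% retest consistency, which is why a ceiling-relative figure should never be read as absolute accuracy.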
Grounding alone does not solve everything. Brand, Israeli, and Ngwe (2023) found that fine-tuning LLMs with real prior survey data improves alignment for existing products and features, but the improvement does not transfer to new product categories or to differences between customer segments. Generalization beyond the conditioning data is the open methodological problem.
How are synthetic respondents used, and which uses are riskiest?
AAPOR distinguishes three application categories and ranks them explicitly: (1) pre-field diagnostic testing, (2) post-field augmentation and imputation, and (3) synthetic data collection as a substitute for human respondents. The report states plainly that the third is the riskiest of the core tasks AAPOR considers.
The implication is operational. Using synthetic respondents to stress-test a questionnaire before fielding it carries lower validity risk than using them to impute missing values, which in turn carries lower risk than fielding them as a replacement for human data collection. A research design that uses synthetic respondents at all three levels of the gradient should disclose which level each result rests on.
AAPOR pairs this with a minimum standard. “At a minimum, researchers should be transparent about and clearly distinguish between data derived from human respondents and data generated by AI systems.” That is the floor, not the ceiling.
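One way to make that distinction auditable is to carry provenance on every record. The sketch below is an assumption about how a team might do this, not a format AAPOR prescribes: the `Source` and `Use` enums, the `Record` class, and `disclosure_summary` are hypothetical names, with the three uses taken from the categories listed above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    HUMAN = "human"
    AI_GENERATED = "ai_generated"


class Use(Enum):
    # The three application categories, in the order the report lists them.
    PRE_FIELD_TESTING = 1
    AUGMENTATION_IMPUTATION = 2
    SUBSTITUTE_COLLECTION = 3


@dataclass(frozen=True)
class Record:
    record_id: str
    source: Source
    use: Optional[Use] = None  # required for AI-generated records

    def __post_init__(self):
        # Refuse AI-generated data that does not say which use it serves.
        if self.source is Source.AI_GENERATED and self.use is None:
            raise ValueError(f"record {self.record_id}: AI-generated data needs a declared use")


def disclosure_summary(records):
    """Count records by provenance so a report can state what each result rests on."""
    counts = Counter()
    for record in records:
        if record.source is Source.HUMAN:
            counts["human"] += 1
        else:
            counts[f"ai_generated:{record.use.name.lower()}"] += 1
    return dict(counts)


# Hypothetical dataset mixing human and AI-generated rows.
records = [
    Record("r1", Source.HUMAN),
    Record("r2", Source.HUMAN),
    Record("r3", Source.AI_GENERATED, Use.PRE_FIELD_TESTING),
    Record("r4", Source.AI_GENERATED, Use.SUBSTITUTE_COLLECTION),
]
summary = disclosure_summary(records)
```

Rejecting an untagged AI-generated record at construction time means the disclosure cannot be forgotten later in the pipeline.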
How are synthetic respondents evaluated?
The defensible methods come from traditional survey research and are being adapted, not invented.
Test-retest against human self-replication. A synthetic respondent built from a real person's prior data is asked novel questions, and accuracy is measured against either that person's own later answers or the test-retest consistency rate of comparable humans. This is the method Park et al. (2024) use.
Holdout panel validation. The synthetic population is asked the same questions as a fielded human panel, and the two distributions are compared on toplines and on every relevant crosstab.
Standards conformance. The ICC/ESOMAR Code 2025 introduces explicit transparency and human-oversight expectations. It does not prescribe accuracy thresholds; it requires disclosure of method and purpose.
AAPOR signals where the bar is moving. The evaluation frontier is shifting from “whether an LLM can match the average response to a single item toward the broader challenge of whether synthetic responses can reproduce relationships in the data.” Future assessments will examine whether synthetic responses preserve correlations, joint distributions, subgroup interactions, and multivariate structure, not merely marginal frequencies. A toplines-look-fine evaluation is no longer sufficient.
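A toy example shows why matching marginals is not enough. The sketch below uses invented answers on two five-point items and treats them as numeric, which is itself an assumption; `pearson` is a plain hand-written correlation, and all variable names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numeric answers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two answers")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when an item has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical answers from five respondents to two five-point items.
human_q1 = [1, 2, 3, 4, 5]
human_q2 = [1, 2, 3, 4, 5]      # moves in step with q1
synthetic_q1 = [1, 2, 3, 4, 5]
synthetic_q2 = [3, 5, 1, 4, 2]  # same set of answers, different pairing

# Marginal check: the item means are identical.
topline_gap = abs(mean(synthetic_q2) - mean(human_q2))

# Relationship check: the correlation between the items is not.
human_r = pearson(human_q1, human_q2)
synthetic_r = pearson(synthetic_q1, synthetic_q2)
correlation_gap = abs(synthetic_r - human_r)
print(f"topline gap={topline_gap}, human r={human_r:.2f}, synthetic r={synthetic_r:.2f}")
```

Both items have exactly the same marginal distribution in the human and synthetic data, so a topline comparison reports no gap, yet the relationship between the items has changed from strongly positive to weakly negative.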
What synthetic respondents cannot do well
The peer-reviewed literature and AAPOR's report converge on a consistent list.
Variance compression. Bisbee et al. (2024), publishing in the same journal as Argyle, prompted GPT-3.5 Turbo with ANES persona characteristics and concluded that “sampling by ChatGPT is not reliable for statistical inference: there is less variation in responses than in the real surveys, and regression coefficients often differ significantly from equivalent estimates obtained using ANES data.” Topline averages can look acceptable while the underlying distribution is too narrow.
Prompt drift. The same authors document that “the distribution of synthetic responses varies with minor changes in prompt wording,” and “the same prompt yields significantly different results over a 3-month period.” Reproducibility is not automatic.
Subgroup degradation. Verasight's January 2026 white paper reported that topline synthetic estimates approximated human surveys to within four percentage points, but “error between LLM-generated and human-generated samples ballooned to 10 points on average, and could reach 30 points for the smallest subgroups.”
Failure on marginalized groups and sensitive topics. AAPOR notes that synthetic responses are “especially” weak “for marginalized groups, culturally specific concepts, and emotionally charged topics, where publicly available models are particularly limited because they are fine-tuned to avoid controversial positions.” The report warns that synthetic responses “may also create a false sense of representativeness, smoothing over real-world variability or amplifying existing model biases.”
Identity misportrayal and flattening. A 2025 Nature Machine Intelligence study, run across four LLMs and 3,200 human participants spanning 16 demographic identities, demonstrated empirically that current LLMs “both misportray and flatten the representations of demographic groups.”
Model drift. AAPOR adds a benchmark-stability concern. The models used to generate synthetic responses “are continually retrained, updated, or realigned without notice in ways that can alter their behavior over time.” A published accuracy figure has a shelf life shorter than the model it was measured against.
None of these is fixable by prompt engineering. They are structural properties of current LLMs used as respondents, and they are the reason the methodologically serious literature insists on evaluation rather than assertion.
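Two of the failure modes above, variance compression and subgroup degradation, lend themselves to simple diagnostics. The sketch below is one possible pair of checks, not a method taken from the cited studies: `variance_ratio` and `subgroup_gaps` are hypothetical helpers, and every number is invented for illustration.

```python
from statistics import mean, pvariance


def variance_ratio(synthetic, human):
    """Synthetic variance divided by human variance; values well below 1 suggest compression."""
    human_var = pvariance(human)
    if human_var == 0:
        raise ValueError("human answers show no variation, so the ratio is undefined")
    return pvariance(synthetic) / human_var


def subgroup_gaps(human_pct, synthetic_pct):
    """Absolute gap in percentage points for each subgroup present in both inputs."""
    return {
        group: abs(synthetic_pct[group] - human_pct[group])
        for group in human_pct
        if group in synthetic_pct
    }


# Hypothetical answers to one five-point item from eight respondents.
human_answers = [1, 2, 3, 4, 5, 1, 5, 3]
synthetic_answers = [3, 3, 3, 4, 2, 3, 3, 3]

same_mean = mean(human_answers) == mean(synthetic_answers)  # toplines agree
ratio = variance_ratio(synthetic_answers, human_answers)     # spread does not

# Hypothetical percent agreeing with a statement, overall and by subgroup.
human_pct = {"all adults": 52.0, "age 18-29": 61.0, "rural, age 18-29": 40.0}
synthetic_pct = {"all adults": 50.0, "age 18-29": 55.0, "rural, age 18-29": 58.0}
gaps = subgroup_gaps(human_pct, synthetic_pct)
worst_group = max(gaps, key=gaps.get)
print(f"variance ratio={ratio:.2f}, gaps={gaps}, worst={worst_group}")
```

In this invented case the means match exactly while the synthetic answers carry about a ninth of the human variance, and the gap grows from 2 points overall to 18 points in the smallest subgroup. Reporting both diagnostics alongside toplines is one way to evaluate rather than assert.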
References
American Association for Public Opinion Research Task Force on Responsible AI Integration in Survey Research (2026). Responsible AI Integration in Survey Research. Section 3.1.2 “AI as a Respondent,” pp. 14–17. May 2026.
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis, 31(3), 337–351. DOI: 10.1017/pan.2023.2.
ICC/ESOMAR (2025). International Code on Market, Opinion and Social Research and Data Analytics. June 2025 revision.