The human self-replication ceiling

Filed under: Methodology
Reading time: 6 min read
Last updated: May 19, 2026

When the same person is asked the same survey question on two different occasions, they do not always give the same answer. The rate at which respondents reproduce their own earlier answers is the human self-replication ceiling: the empirical maximum reliability that any survey instrument can extract from a real human population.

This ceiling is the right benchmark for synthetic audiences. Claiming to match human responses with 100% fidelity would mean overfitting to noise, panel-conditioning artifacts, and, increasingly, AI bot contamination in online research panels. A synthetic audience that exceeds the self-replication ceiling is not more accurate than humans. It is less honest about what surveys actually measure.

This page summarizes the empirical literature on human self-replication and derives the population-level ceiling that any responsible synthetic audience evaluation should respect.

How self-replication is measured

The standard method is test-retest reliability: the same respondents answer the same instrument across two or more waves, separated by enough time for memory effects to fade but not enough for genuine attitude change to dominate. Reliability is then quantified using one of several coefficients:

  • Raw accuracy, the percentage of identical responses on retest, used for categorical items.
  • Pearson correlation (r) between waves, used for continuous or ordinal scales.
  • Cohen's kappa (κ) for categorical responses, correcting for chance agreement.
  • Intraclass correlation (ICC) when more than two waves are available.
  • Heise's three-wave model and its descendants, which separate measurement error from genuine attitude change.
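As a rough illustration of the first three coefficients, the sketch below computes raw agreement, Pearson's r, and Cohen's kappa for a two-wave retest. The function names and the ten responses in `wave1_answers` and `wave2_answers` are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any study cited on this page.

```python
from collections import Counter
from math import sqrt


def raw_agreement(wave1, wave2):
    """Share of respondents giving the identical answer in both waves."""
    return sum(a == b for a, b in zip(wave1, wave2)) / len(wave1)


def pearson_r(wave1, wave2):
    """Pearson correlation between two waves of numeric responses."""
    n = len(wave1)
    mean1, mean2 = sum(wave1) / n, sum(wave2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(wave1, wave2))
    ss1 = sum((a - mean1) ** 2 for a in wave1)
    ss2 = sum((b - mean2) ** 2 for b in wave2)
    return cov / sqrt(ss1 * ss2)


def cohens_kappa(wave1, wave2):
    """Agreement corrected for the agreement expected by chance."""
    n = len(wave1)
    observed = raw_agreement(wave1, wave2)
    counts1, counts2 = Counter(wave1), Counter(wave2)
    # Chance agreement: product of the two waves' marginal shares, per category.
    expected = sum(counts1[c] * counts2[c] for c in counts1) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical answers from ten respondents on a three-point item (1, 2, 3).
wave1_answers = [1, 2, 3, 3, 2, 1, 2, 3, 1, 2]
wave2_answers = [1, 2, 3, 2, 2, 1, 3, 3, 1, 1]

agreement = raw_agreement(wave1_answers, wave2_answers)
r = pearson_r(wave1_answers, wave2_answers)
kappa = cohens_kappa(wave1_answers, wave2_answers)

print(f"raw agreement: {agreement:.2f}")
print(f"Pearson r:     {r:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy data, seven of ten respondents repeat their answer, but kappa comes out lower than raw agreement because some of that agreement would be expected by chance alone.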

The three-wave approach, pioneered by Heise (1969) and refined by Alwin and colleagues, is the methodological state of the art. Two-wave estimates tend to understate true reliability because they treat real attitude change as error.
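To make the two-wave bias concrete, here is a minimal sketch of the Heise (1969) estimator. It assumes equal reliability at every wave and change that carries forward only from one wave to the next; under those assumptions reliability is r12 × r23 / r13. The function name and the three input correlations are hypothetical values chosen for illustration.

```python
def heise_three_wave(r12, r23, r13):
    """Split three test-retest correlations into reliability and stability.

    Assumes constant reliability across waves and lag-1 attitude change,
    as in Heise's (1969) model.
    """
    reliability = (r12 * r23) / r13
    stability_12 = r13 / r23  # true-score stability, wave 1 to wave 2
    stability_23 = r13 / r12  # true-score stability, wave 2 to wave 3
    return reliability, stability_12, stability_23


# Hypothetical observed correlations between waves.
reliability, s12, s23 = heise_three_wave(r12=0.60, r23=0.60, r13=0.50)

print(f"reliability:         {reliability:.2f}")
print(f"stability, waves 1-2: {s12:.2f}")
print(f"stability, waves 2-3: {s23:.2f}")
```

With these inputs, the observed two-wave correlation of 0.60 sits below the estimated reliability of 0.72; the difference is attributed to genuine change between waves rather than to measurement error.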

What the research shows

The most recent and rigorous benchmark: Park et al. (2024)

The most directly relevant evidence for synthetic audience evaluation comes from Park, Zou, et al. (2024), “LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals.” The Stanford and Google DeepMind team recruited 1,052 representative US adults, conducted two-hour qualitative interviews with each, and administered the General Social Survey (GSS), the 44-item Big Five Inventory, five economic games, and five social science experiments. Participants retook the same surveys and experiments two weeks later.

Their reported self-consistency rate is the most precise benchmark available:

“For the GSS, the interview-based generative agents predicted participants' responses with an average normalized accuracy of 0.83 (std = 0.11), calculated from a raw accuracy of 65.67% (std = 6.51) divided by participants' self-consistency of 79.53% (std = 8.65).”

On the GSS, real humans match their own prior answer 79.53% of the time when retested two weeks later. This is the empirical individual-level ceiling for synthetic prediction.

Park et al.'s best-performing agent architecture, combining two-hour qualitative interviews with structured survey data, reached 0.86 normalized accuracy on the GSS (raw accuracy of approximately 68% divided by the 79.53% ceiling). This is the current academic state of the art for individual-level prediction grounded in rich self-report data.
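The normalization is a simple ratio, sketched below with the figures quoted above. This uses the reported averages directly, so it is an approximation: a per-participant calculation that is then averaged would not necessarily give exactly the same number. The function name is hypothetical.

```python
def normalized_accuracy(raw_accuracy, self_consistency):
    """Agent accuracy expressed as a fraction of human self-consistency."""
    return raw_accuracy / self_consistency


# Figures quoted in the text above (GSS, two-week retest).
gss_average = normalized_accuracy(0.6567, 0.7953)
gss_best = normalized_accuracy(0.68, 0.7953)  # "approximately 68%" raw accuracy

print(round(gss_average, 2))  # 0.83
print(round(gss_best, 2))     # 0.86
```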

Broader survey methodology literature

The Stanford finding is consistent with decades of survey methodology research.

Hout and Hastings (2016) used GSS three-wave panels from 2006–2014 to estimate reliability for 293 core items. About 21% of items had reliability coefficients above 0.85, another 24% between 0.70 and 0.85, with factual items more reliable than attitudinal ones and the least reliable items being reports of racial stereotypes.

Alwin and Krosnick (1991) estimated reliability for 96 survey attitude measures across five three-wave national reinterview surveys. They established that reliability is systematically predictable from question design. Number of response options, polarity, presence of a middle category, and topic familiarity all matter.

Kim et al. (2015) reported kappa values ranging from 0.44 to 0.93 across 28 items in the Korean Community Health Survey, with habit-based items significantly more reliable than awareness or attitude items.

Enkavi et al. (2019), in a large PNAS study, found that self-report surveys generally show high test-retest reliability, but that estimates in the literature are “highly variable.” There is no single number that summarizes human self-consistency across all instrument types.

Complications that lower the ceiling further

Panel conditioning. Halpern-Manners, Warren, and Torche (2017) demonstrated measurable conditioning effects in the GSS itself. Simply being interviewed in a prior wave changes how respondents answer in subsequent waves. Published reliability estimates may therefore be partly inflated by artifacts of repeated measurement.

Online panel contamination. Online survey panels, the dominant source of human reference data for synthetic audience evaluation, face documented contamination from automated and AI-assisted respondents. This further lowers the effective ceiling that any synthetic system can legitimately target without overfitting to noise.

From individual-level to population-level: deriving the distribution-overlap ceiling

The Stanford finding measures individual-level self-consistency: does the same person give the same answer on retest? Most synthetic audience use cases (message testing, audience reaction forecasting, policy evaluation) depend on population-level distributions: what percentage of the audience chose option A, B, or C.

These are different metrics with different ceilings.

At the individual level, the ceiling is 79.53% (Park et al., 2024). At the population level, the ceiling is higher, because individual answer changes partially cancel out across the population. If 10% of respondents flip from A to B and 10% flip from B to A, the aggregate marginal distribution is unchanged even though 20% of individuals were inconsistent.
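The cancellation example can be checked directly. The sketch below builds a hypothetical panel of 100 respondents on a two-option item, flips 10 of them from A to B and 10 from B to A, and compares individual consistency with the marginal counts. The panel size and the 50/50 starting split are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical panel: 50 respondents choose A and 50 choose B in wave 1.
wave1 = ["A"] * 50 + ["B"] * 50

# Wave 2: 10 of the A respondents flip to B, and 10 of the B respondents flip to A.
wave2 = ["B"] * 10 + ["A"] * 40 + ["A"] * 10 + ["B"] * 40

individual_consistency = sum(a == b for a, b in zip(wave1, wave2)) / len(wave1)
marginals_wave1 = Counter(wave1)
marginals_wave2 = Counter(wave2)

print(individual_consistency)              # 0.8
print(marginals_wave1 == marginals_wave2)  # True
```

One in five individuals changed their answer, yet both waves report the same 50/50 split.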

For a typical GSS-style item with three response options of roughly comparable probability and an individual flip rate of approximately 20%, the resulting aggregate distribution-overlap ceiling is approximately 91%. If you ran the same survey on the same humans twice, the second-wave distribution would overlap with the first-wave distribution by roughly 91% on average, even though one in five individuals gave a different answer.

The working reference ceiling for population-level distribution overlap is therefore 0.91, derived from Park et al.'s individual-level finding. It is the upper bound a synthetic audience can meaningfully target without overfitting to noise.

Methodological note

The translation from individual-level self-consistency to aggregate distribution overlap assumes that individual answer changes are approximately symmetric across response options. If flips are systematic (for example, drift toward a single option over the two-week interval), the cancellation effect is weaker and the true ceiling is somewhat lower than 91%. For most attitude items on well-designed instruments, the symmetric-flip assumption is reasonable. For items measuring rapidly evolving issues, or items affected by external events between waves, it is less so. The 0.91 figure is a working ceiling for general benchmarking, with the understanding that item-specific ceilings vary.
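The sensitivity to the symmetry assumption can be illustrated with a small sketch. It defines overlap as the sum, over options, of the smaller of the two wave shares; that definition is an assumption made here, since this page does not fix a formula. Both transition matrices are hypothetical and are built to give the same 80% individual consistency. The sketch works with expected shares only, ignores sampling noise, and does not attempt to reproduce the 0.91 figure.

```python
def wave2_distribution(p, transition):
    """Expected wave-2 shares given wave-1 shares and a row-stochastic matrix."""
    k = len(p)
    return [sum(p[i] * transition[i][j] for i in range(k)) for j in range(k)]


def individual_consistency(p, transition):
    """Expected share of respondents who repeat their wave-1 answer."""
    return sum(p[i] * transition[i][i] for i in range(len(p)))


def overlap(p, q):
    """Distribution overlap, assumed here to be the sum of per-option minimums."""
    return sum(min(a, b) for a, b in zip(p, q))


# Three options of equal probability in wave 1.
p = [1 / 3, 1 / 3, 1 / 3]

# Symmetric flips: 20% of each group leaves, split evenly over the other options.
symmetric = [[0.8, 0.1, 0.1],
             [0.1, 0.8, 0.1],
             [0.1, 0.1, 0.8]]

# Systematic drift: nobody leaves option A; 30% of B and of C move to A.
drift = [[1.0, 0.0, 0.0],
         [0.3, 0.7, 0.0],
         [0.3, 0.0, 0.7]]

overlap_symmetric = overlap(p, wave2_distribution(p, symmetric))
overlap_drift = overlap(p, wave2_distribution(p, drift))

print(f"consistency (symmetric): {individual_consistency(p, symmetric):.2f}")
print(f"consistency (drift):     {individual_consistency(p, drift):.2f}")
print(f"overlap (symmetric):     {overlap_symmetric:.2f}")
print(f"overlap (drift):         {overlap_drift:.2f}")
```

Both scenarios have the same individual-level consistency, but only the symmetric one leaves the expected marginals intact. Under drift toward a single option, the expected overlap falls to the level of individual consistency itself.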

References

Tufiş, P. A., Alwin, D. F., & Ramírez, D. N. (2024). A Catch-22: The Test–Retest Method of Reliability Estimation. Journal of Survey Statistics and Methodology, 12(4), 1011–1034.

Alwin, D. F., & Krosnick, J. A. (1991). The Reliability of Survey Attitude Measurement. Sociological Methods & Research, 20(1), 139–181.

Kim, S. J., Han, J. A., Kim, Y. H., et al. (2015). Test-retest reliability of health behavior items in the Community Health Survey in South Korea. Epidemiology and Health, 37, e2015045.

Enkavi, A. Z., Eisenberg, I. W., Bissett, P. G., et al. (2019). Large-scale analysis of test-retest reliabilities of self-regulation measures. Proceedings of the National Academy of Sciences, 116(12), 5472–5477.

Halpern-Manners, A., Warren, J. R., & Torche, F. (2017). Panel Conditioning in the General Social Survey. Sociological Methods & Research.

Heise, D. R. (1969). Separating Reliability and Stability in Test-Retest Correlation. American Sociological Review, 34(1), 93–101.

Hout, M., & Hastings, O. P. (2016). Reliability of the Core Items in the General Social Survey: Estimates from the Three-Wave Panels, 2006–2014. Sociological Science, 3, 971–1002.

Park, J. S., Zou, C. Q., Kamphorst, J., et al. (2024). LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals. arXiv:2411.10109.
