Why this works — and what makes it work well.
Forty-six peer-reviewed papers across political science, cognitive psychology, behavioral economics, and AI have tested the proposition that LLMs can simulate human populations. The findings are clear and nuanced, and they directly shaped how AnthroSim was built. Here's the plain-English version.
AI can replicate human responses at scale.
The foundational question — can an AI actually respond the way a specific kind of person would? — has been rigorously tested across political science, behavioral economics, and cognitive psychology. The answer is yes.
In a landmark study, LLMs conditioned on sociodemographic backstories from American National Election Studies data reliably mirrored the complex belief structures of U.S. political subgroups. Separately, classic experiments, including the Ultimatum Game and the Milgram shock experiment, were successfully replicated with AI participants. One analysis found a striking 0.95 correlation between GPT responses and human moral judgments across 464 distinct scenarios.
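That 0.95 figure is a plain Pearson correlation over paired scenario-level judgments. A minimal sketch of the computation, with invented placeholder ratings rather than the study's data:

```python
# Sketch of the correlation analysis behind figures like the 0.95 result:
# pair each scenario's mean human moral judgment with the model's judgment
# on the same scale, then compute Pearson's r. Ratings below are invented
# placeholders, not data from the cited study.
from scipy.stats import pearsonr

# (scenario, mean human rating, model rating) on a shared 1-7 scale
paired = [
    ("lie to protect a friend", 4.2, 4.5),
    ("steal medicine you can't afford", 3.8, 3.6),
    ("break a promise for convenience", 2.1, 2.4),
    ("report a colleague's fraud", 6.0, 5.7),
]

human = [h for _, h, _ in paired]
model = [m for _, _, m in paired]

r, p = pearsonr(human, model)
print(f"Pearson r = {r:.2f} over {len(paired)} scenarios (p = {p:.3f})")
```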
The replication base is now substantial. A 2025 study in Nature Computational Science replicated 156 psychology and management experiments using LLMs and found main effects held in the majority of cases. A review in Science called it a genuine transformation of social science workflows. The field has moved well past "does this work?" and into "how do we do it well?"
Argyle et al., 2023 (Political Analysis) · Aher et al., 2023 (ICML) · Dillion et al., 2023 (Trends in Cognitive Sciences) · Cui et al., 2025 (Nature Computational Science) · Grossmann et al., 2023 (Science)
Shallow personas produce shallow results.
The promising results come with a significant caveat: LLMs given only basic demographic labels (age, gender, political party) consistently fail in predictable ways. Variance gets compressed. Responses cluster toward the moderate, educated, Western center. Answer patterns track option ordering more than semantic content.
In a rigorous replication of Argyle et al.'s results, nearly half of all regression coefficients differed significantly between synthetic and real human data. A separate study found that after correcting for systematic labeling biases, LLM survey responses trend toward uniformly random, indistinguishable from chance. And a simple binary classifier could distinguish AI-generated responses from human ones with near-perfect accuracy.
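The classifier finding is straightforward to picture. A minimal sketch of that style of test, using simulated data that exaggerates the variance compression described above (none of this reproduces the cited study's setup):

```python
# Sketch of the detection test: if a simple classifier separates synthetic
# survey responses from human ones, the synthetic data carries a telltale
# signature. Here that signature is compressed variance, and the "data"
# is simulated purely to show the mechanics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_respondents, n_items = 500, 20

# Human panel: wide spread on a 1-5 Likert scale.
human = rng.normal(3.0, 1.2, (n_respondents, n_items)).clip(1, 5)
# Shallow-persona LLM panel: same average, compressed variance.
synthetic = rng.normal(3.0, 0.4, (n_respondents, n_items)).clip(1, 5)

X = np.vstack([human, synthetic])
y = np.array([0] * n_respondents + [1] * n_respondents)  # 0 human, 1 synthetic

# Two summary features per respondent are enough: mean and spread.
features = np.column_stack([X.mean(axis=1), X.std(axis=1)])
acc = cross_val_score(LogisticRegression(), features, y, cv=5).mean()
print(f"Cross-validated detection accuracy: {acc:.2f}")  # near 1.0
```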
More troublingly, research published in Nature Machine Intelligence in 2025 found that LLMs don't merely underrepresent identity groups — they can actively misportray and flatten them, generating stereotyped caricatures rather than nuanced individuals. Assigning a demographic persona changes not just tone and style but, in some cases, core reasoning patterns and competence. The problem is structural, not statistical noise.
Bisbee et al., 2024 (Political Analysis) · Dominguez-Olmedo et al., 2024 (NeurIPS) · Santurkar et al., 2023 (ICML) · Wang et al., 2025 (Nature Machine Intelligence) · Gupta et al., 2024 (ICLR)
Persona depth is the key variable.
The research points to a clear explanation for why shallow personas fail: there isn't enough for the model to work with. When LLMs are conditioned on rich narrative backstories — detailed personal histories rather than a checklist of demographic labels — simulation fidelity improves dramatically.
Stanford's 2024 study paired LLMs with full qualitative interview transcripts of 1,052 real individuals. The result: AI agents replicated those participants' survey responses 85% as accurately as participants replicated their own answers two weeks later. A separate study found up to 18% improvement in matching real human response distributions when rich backstories replaced sparse demographic tuples — with 27% improvement in response consistency.
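In prompt terms, the difference between the two conditioning styles looks roughly like the sketch below. Both templates are illustrative; neither is taken from the cited papers or from AnthroSim itself.

```python
# Sketch of the two persona-conditioning styles the literature compares.
# The templates and backstory are illustrative, not the papers' materials.

def sparse_persona_prompt(age: int, gender: str, party: str) -> str:
    """Demographic-tuple conditioning: the style shown to compress variance."""
    return (
        f"You are a {age}-year-old {gender} who votes {party}. "
        "Answer the survey question in character."
    )

def rich_persona_prompt(backstory: str) -> str:
    """Narrative-backstory conditioning: the style shown to improve fidelity."""
    return (
        "Below is a first-person life story. Answer the survey question as "
        "this specific person would, consistent with their history and voice.\n\n"
        + backstory
    )

backstory = (
    "I grew up outside Dayton, the first in my family to finish college. "
    "Eleven years as an ER nurse, then a move into hospital administration "
    "that changed how I think about costs and care. My politics don't fit "
    "either party cleanly; I split my last three ballots."
)

print(sparse_persona_prompt(52, "woman", "independent"))
print()
print(rich_persona_prompt(backstory))
```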
Park et al., 2024 · arXiv 2411.10109 · Moon et al., 2024 · EMNLP
Multiple distinct personas multiply realism.
A single AI instance asked to imagine different perspectives produces less diversity than you'd expect. Real diversity of output requires real diversity of persona — each grounded independently, each prompted as a distinct entity.
Research from Purdue's Design Science lab demonstrated that structuring prompts around multiple independent professional personas, especially in sequence, produced significantly more diverse and differentiated outputs than single-prompt approaches. Separately, steerability research found that LLMs given "incongruous personas" (combinations of traits that cut across neat ideological categories, as real people's do) produce more realistic, less stereotyped responses than models left to default to demographic archetypes.
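Operationally, this argues for one fresh, independently grounded conversation per persona rather than a single model asked to juggle many voices. A minimal sketch of the pattern; `call_llm` is a hypothetical stand-in for whatever chat-completion client you use:

```python
# Sketch of independent multi-persona prompting: each persona gets its own
# system prompt and its own conversation, so outputs diverge genuinely
# rather than collapsing toward one model's imagined variety.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion call.
    return f"[response as: {system_prompt.splitlines()[-1][:48]}...]"

personas = [
    "A 58-year-old farm-equipment dealer in rural Nebraska.",
    "A 24-year-old public-health graduate student in Atlanta.",
    "A 41-year-old union electrician in Pittsburgh.",
]

question = "How would a four-day school week affect your community?"

responses = []
for persona in personas:
    # One independent call per persona: no shared conversation state, so
    # persona B never sees, and never averages toward, persona A's answer.
    system = "Adopt this identity fully and answer only as this person:\n" + persona
    responses.append(call_llm(system, question))

for persona, reply in zip(personas, responses):
    print(persona, "->", reply)
```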
Feng, Hélie & Panchal, 2025 (Design Science) · Liu et al., 2024 (Findings of ACL)
Group dynamics are simulatable, too.
Individual responses are one thing. The harder, and more valuable, question is whether AI can simulate what happens when diverse people interact, argue, and update their views. The evidence increasingly says yes.
LLM agent networks have been shown to replicate known patterns of opinion polarization, including when confirmation bias is introduced. Google DeepMind demonstrated LLMs finding and articulating consensus positions that maximized expected approval across groups with opposing moral and political stances. The "Smallville" generative agents paper — now with over 5,000 citations — established the architectural foundation for believable multi-agent social simulation, showing 25 agents autonomously forming relationships, spreading information, and coordinating group activities.
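For a feel of the dynamic these agent networks reproduce, here is a toy numeric version of bounded-confidence opinion updating, where confirmation bias means agents only move toward voices already near their own. The cited studies replace this hand-written update rule with free-text LLM agents; the model below only illustrates the target phenomenon.

```python
# Toy bounded-confidence opinion model: agents update toward interaction
# partners only when disagreement is small (confirmation bias), which drives
# the population into distinct opinion clusters rather than consensus.
# The cited LLM-agent studies replace this numeric rule with conversing agents.
import numpy as np

rng = np.random.default_rng(7)
n_agents, interactions, tolerance, pull = 80, 4000, 0.3, 0.3

opinions = rng.uniform(-1.0, 1.0, n_agents)  # stance on one issue

for _ in range(interactions):
    i, j = rng.choice(n_agents, size=2, replace=False)
    gap = opinions[j] - opinions[i]
    if abs(gap) < tolerance:        # listen only to near-agreement
        opinions[i] += pull * gap   # both parties move toward each other
        opinions[j] -= pull * gap

# Distinct clusters survive because cross-cluster gaps exceed the tolerance.
print("Opinion clusters:", np.unique(np.round(np.sort(opinions), 1)))
```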
Chuang et al., 2024 (Findings of NAACL) · Bakker et al., 2022 (NeurIPS) · Park et al., 2023 (UIST)
Knowing the limits is part of doing it right.
Honest assessment of the literature: LLM-based population simulation has real, documented limitations. Western and liberal skew. Variance compression. Reduced performance outside English-speaking cultural contexts. Difficulty with personas that don't match ideological stereotypes.
The research consensus is that these systems are most valuable for upstream research — hypothesis generation, pilot studies, policy stress-testing, concept validation — rather than as final substitutes for primary human research. That's exactly how AnthroSim is positioned, and why acknowledging these boundaries is a feature, not a weakness.
Bail, 2024 (PNAS) · Sarstedt et al., 2024 (Psychology & Marketing) · von der Heyde et al., 2025 (Social Science Computer Review) · Lin, 2026 (SAGE Open) · Kaiser et al., 2025 (UMAP)
AnthroSim was designed against this research landscape. Every finding above directly shaped a product decision.
- Quantitative grounding over label-dropping. Profiles are derived from real population distributions, addressing the variance compression and demographic skew documented in the shallow-persona research (a sketch of this sampling step follows the list).
- Narrative depth built in. Every profile includes a rich personal history, directly implementing the backstory-based fidelity findings from Moon et al. and Park et al.
- Genuine persona differentiation. The architecture enforces real divergence across political, economic, psychological, and cultural dimensions, not surface variation on a common template.
- Group dynamics as a first-class feature. Multi-participant structured conversations implement the deliberation and consensus patterns validated in the multi-agent literature.
- Positioned for upstream research. AnthroSim is a tool for hypothesis generation, concept testing, and policy stress-testing, exactly where the evidence is strongest.
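As referenced in the first item above, quantitative grounding starts by sampling persona skeletons from observed population distributions rather than letting the model default to its center. A minimal sketch, with invented category weights standing in for real census or survey marginals:

```python
# Sketch of quantitative grounding: draw persona attributes from population
# marginals so the panel matches real distributions instead of clustering
# at the model's default center. Weights below are invented placeholders.
# A production version would sample from joint distributions to preserve
# correlations between attributes, not independent marginals.
import numpy as np

rng = np.random.default_rng(42)

marginals = {
    "age_bracket": (["18-29", "30-44", "45-64", "65+"],
                    [0.21, 0.25, 0.33, 0.21]),
    "education":   (["high school", "some college", "bachelor's", "graduate"],
                    [0.28, 0.26, 0.29, 0.17]),
    "region":      (["Northeast", "Midwest", "South", "West"],
                    [0.17, 0.21, 0.38, 0.24]),
}

def sample_profile() -> dict:
    """Draw one persona skeleton; narrative backstory gets layered on after."""
    return {attr: str(rng.choice(categories, p=weights))
            for attr, (categories, weights) in marginals.items()}

panel = [sample_profile() for _ in range(500)]
print(panel[0])
```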
Full bibliography (46 papers, 2022–2026)
Argyle et al. "Out of One, Many." Political Analysis 31(3): 337–351, 2023. DOI: 10.1017/pan.2023.2
Aher, Arriaga & Kalai. "Using Large Language Models to Simulate Multiple Humans." ICML 2023. PMLR 202: 337–371.
Horton et al. "Large Language Models as Simulated Economic Agents." NBER WP 31122, 2023.
Bail. "Can Generative AI Improve Social Science?" PNAS 121(21): e2314021121, 2024.
Dillion et al. "Can AI Language Models Replace Human Participants?" Trends in Cognitive Sciences 27(7): 597–600, 2023.
Cui et al. "A large-scale replication of scenario-based experiments using LLMs." Nature Computational Science 5: 627–634, 2025. DOI: 10.1038/s43588-025-00840-7.
Grossmann et al. "AI and the transformation of social science research." Science 380(6650): 1108–1109, 2023. DOI: 10.1126/science.adi1778.
Demszky et al. "Using large language models in psychology." Nature Reviews Psychology 2: 688–701, 2023. DOI: 10.1038/s44159-023-00241-5.
Ziems et al. "Can Large Language Models Transform Computational Social Science?" Computational Linguistics 50(1): 237–291, 2024. arXiv: 2305.03514.
Gilardi et al. "ChatGPT outperforms crowd-workers for text-annotation tasks." PNAS 120: e2305016120, 2023. DOI: 10.1073/pnas.2305016120.
Sarstedt et al. "Using Large Language Models to Generate Silicon Samples." Psychology & Marketing 41(6): 1254–1270, 2024.
Sun et al. "Random Silicon Sampling." arXiv: 2402.18144, 2024.
Cao et al. "Specializing LLMs to Simulate Survey Response Distributions." NAACL 2025. arXiv: 2502.07068.
Moon et al. "Virtual Personas via an Anthology of Backstories." EMNLP 2024: 19864–19897.
Hu & Collier. "Quantifying the Persona Effect in LLM Simulations." ACL 2024: 10289–10307.
Feng, Hélie & Panchal. "Enhancing Design Concept Diversity: Multi-Persona Prompting Strategies." Design Science, 2025. DOI: 10.1017/dsj.2025.10037.
Wang et al. "Large language models that replace human participants can harmfully misportray and flatten identity groups." Nature Machine Intelligence 7: 400–411, 2025. DOI: 10.1038/s42256-025-00986-z.
Gupta et al. "Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs." ICLR 2024.
Cheng et al. "Marked Personas: Using Natural Language Prompts to Measure Stereotypes in LLMs." ACL 2023.
Cheng et al. "CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations." EMNLP 2023.
Petrov et al. "Limited Ability of LLMs to Simulate Human Psychological Behaviours." arXiv: 2405.07248, 2024.
Liu, Diab & Fried. "Evaluating LLM Biases in Persona-Steered Generation." Findings of ACL 2024: 9832–9850.
Miehling et al. "Evaluating the Prompt Steerability of Large Language Models." NAACL 2025. arXiv: 2411.12405.
Deshpande et al. "Toxicity in ChatGPT: Analyzing Persona-assigned Language Models." Findings of EMNLP 2023: 1236–1270.
Chuang et al. "Simulating Opinion Dynamics with Networks of LLM-based Agents." Findings of NAACL 2024: 3326–3346.
Bakker et al. "Fine-Tuning Language Models to Find Agreement Among Humans." NeurIPS 2022.
Du et al. "Improving Factuality through Multiagent Debate." ICML 2024. arXiv: 2305.14325.
Piatti et al. "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents." arXiv: 2404.16698, 2024.
Xie et al. "Can Large Language Model Agents Simulate Human Trust Behaviors?" NeurIPS 2024.
Park et al. "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023. DOI: 10.1145/3586183.3606763.
Park et al. "Social Simulacra." UIST 2022. DOI: 10.1145/3526113.3545616.
Park et al. "Generative Agent Simulations of 1,000 People." arXiv: 2411.10109, 2024.
Vezhnevets et al. "Generative Agent-Based Modeling with Concordia." arXiv: 2312.03664, 2023.
Gao et al. "Large Language Models Empowered Agent-based Modeling and Simulation: A Survey." arXiv: 2312.11970, 2024.
Guo et al. "Large Language Model based Multi-Agents: A Survey of Progress and Challenges." arXiv: 2402.01680, 2024.
von der Heyde, Haensch & Wenz. "Vox Populi, Vox AI?" Social Science Computer Review, 2025.
Zhang et al. "Donald Trumps in the Virtual Polls." arXiv: 2411.01582, 2024.
Salvi et al. "AI-personalized persuasion outperforms human debaters in controlled debate experiments." 2025. (900 participants; personalized AI condition significantly outperformed human debaters.)
Zhang et al. "Focus Agent: LLM-Powered Virtual Focus Group." IVA 2024. DOI: 10.1145/3652988.3673918.
Amirova et al. "Framework-Based Qualitative Analysis of LLM Free Responses." PLOS ONE 19(3): e0300024, 2024.
Santurkar et al. "Whose Opinions Do Language Models Reflect?" ICML 2023: 29971–30004.
Durmus et al. "Towards Measuring the Representation of Subjective Global Opinions in LLMs." COLM 2024. arXiv: 2306.16388.
Bisbee et al. "Synthetic Replacements for Human Survey Data?" Political Analysis 32: 401–416, 2024.
Dominguez-Olmedo, Hardt & Mendler-Dünner. "Questioning the Survey Responses of Large Language Models." NeurIPS 2024. arXiv: 2306.07951.
Lin. "Large Language Models as Psychological Simulators: A Methodological Guide." SAGE Open, 2026. DOI: 10.1177/25152459251410153.
Kaiser et al. "Simulating Human Opinions with LLMs: Opportunities and Challenges." UMAP 2025. DOI: 10.1145/3708319.3733685.