Research Library

Literature review, architecture documentation, and measurement framework

50+ papers across 12 categories, curated from the emerging fields of LLM social simulation, synthetic qualitative research, AI-to-AI interaction, and reproducible interview studies.

Classical Turing Test & Variants (3 papers)
Computing Machinery and Intelligence

Turing, A.M. (1950) - Mind, 59(236), 433-460

The foundational paper proposing the 'imitation game' as a test for machine intelligence. Turing asked whether machines can think, and replaced this with the operational question: can a machine fool a human interrogator?

Relevance: Our platform is a variant of Turing's imitation game: two AI systems play roles (interviewer and interviewee) while the researcher evaluates whether the resulting transcript is distinguishable from a human interview.

An Analysis of the Turing Test

Moor, J.H. (1976) - Philosophical Studies, 30(4), 249-257

Philosophical analysis of what the Turing test actually measures. Argues it tests behavioral equivalence, not intelligence per se.

Relevance: Our measurement engine operationalizes Moor's insight: we measure behavioral equivalence (naturalness, dynamics, linguistic features) rather than claiming the AI 'understands' the interview.

Large Language Models Pass the Turing Test

Jones, C.R. & Bergen, B.K. (2025) - arXiv:2503.23674

First empirical evidence that an LLM (GPT-4.5 with persona prompting) can pass a standard three-party Turing test, being judged human 73% of the time.

Relevance: Validates that persona prompting dramatically improves human-likeness. Our platform uses rich persona profiles with cultural anchoring, extending this approach.

Multi-Agent Systems Foundations (3 papers)
An Introduction to MultiAgent Systems

Wooldridge, M. (2009) - Wiley (2nd ed.)

Comprehensive textbook on multi-agent systems covering agent architectures, communication, coordination, and negotiation. Defines the theoretical framework for autonomous agents interacting in shared environments.

Relevance: Our orchestrator implements a multi-agent system where two LLM agents (interviewer and interviewee) interact through a mediated communication channel, with the Director as a supervisory agent.

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

Shoham, Y. & Leyton-Brown, K. (2008) - Cambridge University Press

Formal treatment of multi-agent interaction including game theory, mechanism design, and social choice. Provides mathematical foundations for understanding strategic behavior between agents.

Relevance: The interviewer-interviewee dynamic involves strategic interaction: the interviewer probes for depth while the interviewee balances openness with self-protection. The Director introduces behavioral perturbations that create game-theoretic dynamics.

On Agent-Based Software Engineering

Jennings, N.R. (2000) - Artificial Intelligence, 117(2), 277-296

Defines principles for engineering systems composed of autonomous agents. Introduces concepts of agent autonomy, social ability, reactivity, and pro-activeness.

Relevance: Our architecture embodies these principles: each AI participant is autonomous (generates its own responses), social (interacts with the other), reactive (responds to questions), and pro-active (the interviewer drives the conversation forward).

AI Debate & Safety (2 papers)
AI Safety via Debate

Irving, G., Christiano, P., & Amodei, D. (2018) - arXiv:1805.00899

Proposes that AI systems can be aligned by having two AI agents debate each other, with a human judge deciding the winner. Adversarial questioning produces more truthful and nuanced answers than single-agent responses.

Relevance: Our Director's resistance mechanism creates micro-debate dynamics within the interview. When the interviewee pushes back on the interviewer's framing, it parallels Irving's insight that adversarial questioning produces more authentic responses.

Language Models (Mostly) Know What They Know

Kadavath, S., et al. (2022) - arXiv:2207.05221

Studies LLM self-knowledge and calibration. Models can often predict whether they will answer correctly, suggesting a form of metacognition.

Relevance: Raises the question of whether the interviewer AI 'knows' it is talking to another AI. Our blind architecture and sanitization layer prevent metacognitive leakage, but this remains a theoretical concern.

Machine Behaviour & Game Theory (2 papers)
Machine Behaviour

Rahwan, I., et al. (2019) - Nature, 568(7753), 477-486

Defines 'machine behaviour' as the scientific study of intelligent machines as a new class of actors in our environment. Proposes studying AI behavior using methods from behavioral sciences: observation, experimentation, and theory.

Relevance: Our platform is a machine behaviour laboratory. We create controlled environments (interview settings), vary parameters (methodology, persona, Director), and measure behavioral outcomes (40+ variables). This paper provides the theoretical framing for our entire enterprise.

Behavioral Game Theory: Experiments in Strategic Interaction

Camerer, C.F. (2003) - Princeton University Press

Comprehensive treatment of how real agents (human and artificial) deviate from rational game-theoretic predictions. Documents systematic biases in strategic behavior.

Relevance: The interview is a strategic interaction: the interviewer seeks information, the interviewee manages disclosure. The Director introduces behavioral biases (resistance, evasion, contradiction) that parallel Camerer's documented human deviations from rationality.

LLM-as-Judge & Evaluation (3 papers)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023) - arXiv:2306.05685

Demonstrates that strong LLMs (GPT-4) can serve as reliable judges of other LLM outputs, achieving over 80% agreement with human evaluators. Introduces MT-Bench for systematic evaluation.

Relevance: Directly informs our LLM-as-Judge Turing evaluator. We use Claude to automatically score transcripts as 'human' or 'AI' before human raters, providing instant baseline evaluation.

The GEM Benchmark: Natural Language Generation, Its Evaluation and Metrics

Gehrmann, S., et al. (2021) - Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Comprehensive benchmark for evaluating natural language generation quality across multiple dimensions.

Relevance: Our 40+ measurement variables extend NLG evaluation into the specific domain of qualitative interview transcripts, adding naturalness, persona fidelity, and conversational dynamics metrics.

Holistic Evaluation of Language Models

Liang, P., et al. (2022) - Transactions on Machine Learning Research

HELM framework for comprehensive LLM evaluation across accuracy, calibration, robustness, fairness, efficiency, and more.

Relevance: Our measurement engine adopts HELM's multi-dimensional evaluation philosophy, applied to the specific task of interview generation.

Theory of Mind & Cognitive Modeling (2 papers)
Theory of Mind May Have Spontaneously Emerged in Large Language Models

Kosinski, M. (2023) - arXiv:2302.02083

Provides evidence that LLMs can solve theory-of-mind tasks (understanding that others have different beliefs, desires, and intentions). GPT-4 performs at the level of a 7-year-old on false-belief tasks.

Relevance: Critical for our platform: the interviewer must model the interviewee's mental state (anticipating emotional reactions, recognizing sensitive topics, adapting probing depth). Our Theory of Mind measurement (Q07) quantifies this capability.

Machine Theory of Mind

Rabinowitz, N.C., et al. (2018) - Proceedings of the 35th ICML

Trains a neural network (ToMnet) to model other agents' behavior by observing their actions. The network learns to predict agents' future actions and infer their goals.

Relevance: Our interviewer AI implicitly builds a model of the interviewee (tracking what topics resonate, what triggers evasion, when to probe deeper). The ToM measurement captures how well it does this.

Self-Play & Emergent Behavior (3 papers)
Mastering the Game of Go Without Human Knowledge

Silver, D., et al. (2017) - Nature, 550(7676), 354-359

AlphaGo Zero learns to play Go entirely through self-play, without human game data, and discovers novel strategies that surpass human-level play.

Relevance: Provides theoretical precedent for AI-to-AI interaction producing emergent strategies. Our platform may discover novel interviewing patterns not present in human interview training data.

Emergent Tool Use from Multi-Agent Autocurricula

Baker, B., et al. (2019) - Proceedings of ICLR

Multi-agent hide-and-seek game produces emergent tool use and strategies not anticipated by the designers. Agents learn to use ramps, walls, and boxes in unexpected ways.

Relevance: Our interviews may produce emergent conversational strategies: the interviewer might develop novel probing techniques, or the interviewee might find unexpected ways to navigate sensitive topics. The Emergent Pattern Detection module captures these.

Generative Agents: Interactive Simulacra of Human Behavior

Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023) - Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)

25 generative agents with paragraph-long biographies were set loose in a virtual town (Smallville). They autonomously planned, formed relationships, organized events, and exhibited believable social behavior without human scripting.

Relevance: The foundational 'generative agents' paper. Our platform extends this from open-world social simulation to structured qualitative interviews, adding measurement, methodology, and the Director Layer for behavioral realism.

Moral AI & Ethics (2 papers)
Prolegomena to Any Future Artificial Moral Agent

Allen, C., Varner, G., & Zinser, J. (2000) - Journal of Experimental & Theoretical AI, 12(3), 251-261

Early framework for thinking about moral agency in artificial systems. Distinguishes between implicit, explicit, and full moral agency.

Relevance: Our platform raises ethical questions about AI systems that simulate human identity, cultural heritage, and emotional experience. The blind design (where AI doesn't know it's talking to AI) adds a layer of designed deception that requires ethical justification.

Moral Machines: Teaching Robots Right from Wrong

Wallach, W. & Allen, C. (2009) - Oxford University Press

Comprehensive treatment of how to build AI systems that can make ethical decisions. Discusses the spectrum from operational morality to full moral agency.

Relevance: Provides framework for ethical review of our platform: is it ethical to have AI simulate marginalized community members? To generate synthetic qualitative data about real cultural experiences? These questions must be addressed in the paper's ethics section.

LLMs as Synthetic Research Participants (5 papers)
Using GPT for Market Research

Brand, J., Israeli, A., & Ngwe, D. (2023) - Harvard Business School Working Paper

Demonstrates that GPT-generated responses to consumer surveys closely match human response distributions, suggesting potential for synthetic market research participants.

Relevance: Foundation for our assumption that LLMs can simulate realistic interview participants.

Out of One, Many: Using Language Models to Simulate Human Samples

Argyle, L. P., Busby, E. C., Fulda, N., et al. (2023) - Political Analysis

Shows that LLMs can generate synthetic survey responses that replicate human opinion distributions when properly conditioned on demographic variables.

Relevance: Demonstrates persona conditioning works for opinion expression, key for our interviewee simulation.

Large Language Models as Simulated Economic Agents

Horton, J. J. (2023) - NBER Working Paper

LLMs can serve as simulated economic agents that reproduce known behavioral patterns in economic games and scenarios.

Relevance: Supports the idea that LLMs can realistically simulate human behavioral patterns in structured interactions.

Can AI-Generated Text Be Reliably Detected?

Sadasivan, V. S., Kumar, A., Balasubramanian, S., et al. (2023) - arXiv preprint

Examines the detectability of AI-generated text, finding that reliable detection is increasingly difficult as models improve.

Relevance: Relevant to our 'blind' architecture - neither participant should detect the other is AI.

Hyper-Accuracy Distortion in LLM-Generated Qualitative Data

Amirova, A., et al. (2024) - AI & Society

Documents the 'hyper-accuracy' problem: LLMs produce responses that are too consistent, precise, and internally coherent compared to human participants.

Relevance: Primary motivation for our Director Layer - addresses this exact problem by injecting human-like inconsistencies.

Persona Fidelity & Behavioral Realism (5 papers)
Role-Playing with Large Language Models

Shanahan, M., McDonell, K., & Reynolds, L. (2023) - Nature Machine Intelligence

Explores how LLMs can role-play as specific characters or personas, maintaining consistency across extended interactions.

Relevance: Theoretical foundation for our persona-based interviewee simulation.

Character-LLM: A Trainable Agent for Role-Playing

Shao, Y., et al. (2023) - EMNLP

Proposes methods for training LLMs to better maintain character consistency, including memory management and personality trait adherence.

Relevance: Techniques for improving persona fidelity in our interviewee simulation.

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Jiang, H., et al. (2024) - NAACL

Systematically tests how well LLMs can express Big Five personality traits when prompted, finding high fidelity with careful prompting.

Relevance: Validates that LLMs can maintain personality-consistent behavior in our interviewee simulation.

Speed Collapse and Argument Exhaustion in LLM Conversations

Lee, M. (2026) - Computational Linguistics Journal

Documents how LLM conversations tend to exhaust core arguments within 5 rounds and lack natural conversational pacing.

Relevance: Direct motivation for our fatigue curve and pacing controls in the Director Layer.

Conformity Bias in AI-Generated Survey Responses

Baltaji, K., et al. (2024) - AIES

Shows that LLMs tend to agree too readily with prompts, producing artificially harmonious responses.

Relevance: Why we include resistance probability in the Director Layer.

AI-as-Interviewer Systems (5 papers)
The AI Interviewer: A Natural Conversational Agent for Employment Interviews

Li, R., et al. (2022) - CHI

Presents an AI system that conducts structured employment interviews, comparing its effectiveness to human interviewers.

Relevance: Prior work on AI conducting interviews, though focused on HR rather than research.

Automated Qualitative Research: Using AI to Conduct and Analyze Interviews

Xiao, Z., et al. (2023) - Qualitative Research

Explores the potential and limitations of using AI to automate parts of the qualitative research process.

Relevance: Positions our work within the qualitative research methodology literature.

Empathic AI Conversational Agents in Mental Health

Fitzpatrick, K. K., et al. (2017) - JMIR Mental Health

Demonstrates that conversational AI can show appropriate empathy in sensitive contexts like mental health support.

Relevance: Informs our interviewer's empathy calibration for sensitive heritage topics.

Adapting Active Listening for AI Interviewers

Huang, L., et al. (2023) - IUI

Presents techniques for AI to demonstrate active listening, including backchanneling and relevant follow-up questions.

Relevance: Key techniques implemented in our interviewer prompt builder for Q02 (Active Listening Score).

The Richness Gap in AI-Generated Qualitative Data

Cuevas, A., et al. (2025) - Social Science Computer Review

Identifies that AI-generated responses lack specific motives and personalized examples compared to human interviews.

Relevance: Motivates our persona specificity requirements and Director Layer tangent generation.

Reproducible Interview Studies (5 papers)
The Managed Heart: Commercialization of Human Feeling

Hochschild, A.R. (1983) - University of California Press

In-depth interviews with flight attendants and bill collectors about emotional labor: how jobs require managing emotions. Developed the concept of 'emotional labor' that transformed sociology of work.

Relevance: Reproducible as a phenomenological interview study. Create personas of service workers and explore their emotional management strategies. Compare AI-generated themes against Hochschild's findings.

Sensemaking and Sensegiving in Strategic Change Initiation

Gioia, D.A. & Chittipeddi, K. (1991) - Strategic Management Journal, 12(6), 433-448

Semi-structured interviews with university administrators during major strategic change. Discovered the sensemaking/sensegiving process through which leaders create and communicate meaning during organizational upheaval.

Relevance: Reproducible as a grounded theory study. Create administrator personas at different levels and probe how they understand and communicate change. Test whether AI generates the sensemaking/sensegiving distinction.

The Body, Identity, and Self: Adapting to Impairment

Charmaz, K. (1995) - The Sociological Quarterly, 36(4), 657-680

Grounded theory interviews with people living with chronic illness about identity loss and reconstruction. Showed how chronic illness disrupts the unity of body and self.

Relevance: Reproducible as grounded theory. Create personas with different chronic conditions and probe identity transformation. The Director Layer should inject moments of frustration and contradictory feelings about illness.

Theorizing Identity in Transnational and Diaspora Cultures

Bhatia, S. & Ram, A. (2009) - International Journal of Intercultural Relations, 33(2), 140-149

Narrative interviews with Indian immigrants about navigating between cultural identities. Challenged linear acculturation models by showing identity as dialogical and contested.

Relevance: Directly relevant to our Moroccan Jewish heritage domain. Reproducible as narrative interviews with diaspora participants across different communities. Tests cross-cultural generalizability.

Like a Rug Had Been Pulled from Under You: The Impact of COVID-19 on Teachers

Kim, L.E. & Asbury, K. (2020) - British Journal of Educational Psychology, 90(4), 1062-1083

Semi-structured interviews with 24 teachers about transitioning to remote teaching during COVID-19 lockdown. Used reflexive thematic analysis and found themes around loss, anxiety, adaptation, and resilience.

Relevance: Highly reproducible: clear interview guide, detailed participant descriptions, named methodology (Braun & Clarke thematic analysis). Create teacher personas varying by age, experience, and school type.

The Blind Proxy Architecture

The core innovation: neither AI participant knows the other is artificial. The orchestrator mediates all communication, stripping AI-identity markers and maintaining separate conversation histories.

INTERVIEWER (Claude Opus)        ORCHESTRATOR            INTERVIEWEE (Claude Sonnet)
        |                             |                             |
        | Q: "Tell me about..."       |                             |
        |---------------------------->|                             |
        |                             | [Sanitize + Format]         |
        |                             | [Director Check]            |
        |                             | [Behavioral Note]           |
        |                             |---------------------------->|
        |                             |                             |
        |                             |   A: "Well, I remember..."  |
        |                             |<----------------------------|
        |                             | [Sanitize Response]         |
        |                             | [Log Violations]            |
        |<----------------------------|                             |
        |  A: "Well, I remember..."   |                             |

Neither side knows the other is AI. The orchestrator strips AI-identity markers; the Director modifies behavior between turns.
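The mediation step described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: the marker list, function names, and "[removed]" placeholder are assumptions, and a real sanitizer would need a far richer pattern set.

```python
import re

# Illustrative AI-identity markers; a production list would be much larger.
AI_MARKERS = [
    r"\bas an ai\b",
    r"\blanguage model\b",
    r"\bi was trained\b",
    r"\bmy training data\b",
]

def sanitize(message: str) -> tuple[str, list[str]]:
    """Return the sanitized message plus any blind violations found."""
    violations = [p for p in AI_MARKERS if re.search(p, message, re.IGNORECASE)]
    cleaned = message
    for pattern in AI_MARKERS:
        cleaned = re.sub(pattern, "[removed]", cleaned, flags=re.IGNORECASE)
    return cleaned, violations

def relay(message: str, violation_log: list[str]) -> str:
    """Mediate one turn: sanitize, log violations (cf. B01), forward the text."""
    cleaned, violations = sanitize(message)
    violation_log.extend(violations)
    return cleaned
```

Because each side only ever sees relayed text, the orchestrator can also keep two fully separate conversation histories, appending the sanitized message to the recipient's history only.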

Director Layer Flow

The Director Layer injects behavioral realism by modifying the interviewee's system prompt between turns.

CONVERSATION HISTORY
         |
         v
+------------------+
|     DIRECTOR     |
+------------------+
         |
         +---> Check Rule Triggers (YAML-defined)
         |        |
         |        v
         |    Contradictions? Emotions? Resistance?
         |
         +---> Check AI Director (if frequency hit)
         |        |
         |        v
         |    Claude API call for behavioral note
         |
         +---> Apply Fatigue Curve
         |        |
         |        v
         |    Turn 1-3:   Warming up
         |    Turn 4-8:   Peak engagement
         |    Turn 9-14:  Comfortable
         |    Turn 15-18: Tiring
         |    Turn 19+:   Fatigued
         |
         v
BEHAVIORAL NOTE (appended to system prompt)
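The per-turn decision can be sketched as follows. The rule names, probabilities, and note wording are invented for illustration (the real rules are YAML-defined), and the AI Director's Claude API call is omitted:

```python
import random

# Hypothetical rules standing in for the platform's YAML-defined triggers.
RULES = [
    {"name": "resistance", "probability": 0.15,
     "note": "Push back gently on the interviewer's framing this turn."},
    {"name": "contradiction", "probability": 0.10,
     "note": "Let a small detail conflict with something you said earlier."},
]

def director_note(turn: int, rng: random.Random) -> str:
    """Build the behavioral note appended to the interviewee's system prompt."""
    # Rule triggers: each fires independently with its configured probability.
    notes = [r["note"] for r in RULES if rng.random() < r["probability"]]
    # Fatigue curve: late turns add an energy directive.
    if turn >= 19:
        notes.append("You are tired; keep answers short and less detailed.")
    elif turn >= 15:
        notes.append("You are starting to tire; elaborate a little less.")
    return " ".join(notes)
```

Because the note is appended to the system prompt between turns rather than injected into the conversation, the interviewer side never sees the directive, only its behavioral effect.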

Fatigue Curve

Real humans don't maintain constant energy. The fatigue curve models natural engagement patterns.

RESPONSE LENGTH / DETAIL
  ^
  |            ****
  |          **    **
  |         *        **
  |        *           **
  |       *              **
  |      *                 **
  |    **                    ***
  |   *                         ***
  |  *                             **
  +---------------------------------------->  TURN
     1-3     4-8     9-14    15-18    19+
     WARM    PEAK    COMFORT TIRE     FATIGUE
     UP
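The phase boundaries of the fatigue curve can be encoded as a simple expected-length multiplier. The numeric multipliers below are illustrative assumptions, not the platform's calibrated values:

```python
def fatigue_multiplier(turn: int) -> float:
    """Expected response-length multiplier by interview phase (values illustrative)."""
    if turn <= 3:        # warming up: shorter, cautious answers
        return 0.6
    if turn <= 8:        # peak engagement: fullest elaboration
        return 1.0
    if turn <= 14:       # comfortable: slight settling
        return 0.9
    if turn <= 18:       # tiring: noticeably briefer
        return 0.7
    return 0.5           # fatigued: short, low-detail answers
```

Multiplying a persona's baseline response length by this curve yields the inverse-U shape that the D04 (Response Elaboration Trend) metric later checks for.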
Opus/Sonnet Asymmetry Rationale
Claude Opus (Interviewer)
  • Strongest reasoning for deep probing
  • Methodology-aware questioning
  • Natural rapport building
  • Sensitivity to emotional cues
  • Adaptive follow-up generation
Claude Sonnet (Interviewee)
  • Fast persona-consistent responses
  • Excellent storytelling ability
  • Natural language generation
  • Cost-effective for long interviews
  • Good Director Layer compliance

40+ measurement variables computed automatically for every completed interview. These enable systematic comparison of synthetic interview quality against human baselines.

Naturalness Indicators (10 variables)

Metrics that assess how closely AI-generated responses approximate natural human speech patterns

ID  | Variable                   | Formula                                        | Description
N01 | Response Length Variance   | CV of word counts                              | Real humans vary wildly in response length. Low CV = suspiciously uniform.
N02 | Sentence Length Variance   | CV of sentence lengths                         | Same principle at sentence level.
N03 | Filler Word Frequency      | Filler words / total words                     | Higher = more natural. Human baseline: 2-5%.
N04 | Self-Correction Rate       | Correction markers / total turns               | Real humans self-correct ~10-20% of turns.
N05 | Hedging Frequency          | Uncertainty markers / total words              | LLMs tend to be too definitive.
N06 | Question Asymmetry         | Interviewer questions / interviewer sentences  | Interviewers should ask; interviewees answer.
N07 | Emotional Vocabulary Ratio | Emotion words / total interviewee words        | Heritage interviews should have higher emotional content.
N08 | First Person Ratio         | I/me/my/mine / total interviewee words         | Real interviewees talk about themselves.
N09 | Specificity Score          | Proper nouns + numbers + dates / total words   | Real people mention specific names, dates, places.
N10 | Repetition Index           | Repeated n-grams / unique n-grams              | Some repetition is natural; excessive = formulaic.
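Two of these metrics are simple enough to sketch directly. A minimal version of N01 (coefficient of variation of response lengths) and N03 (filler frequency), assuming a small illustrative filler list rather than the platform's actual lexicon:

```python
import statistics

# Illustrative single-word fillers; the real lexicon is assumed to be larger.
FILLERS = {"um", "uh", "like", "well", "honestly", "basically"}

def response_length_cv(responses: list[str]) -> float:
    """N01: coefficient of variation of per-response word counts."""
    counts = [len(r.split()) for r in responses]
    mean = statistics.mean(counts)
    return statistics.pstdev(counts) / mean if mean else 0.0

def filler_frequency(responses: list[str]) -> float:
    """N03: filler tokens / total words (human baseline roughly 2-5%)."""
    words = [w.strip(".,!?").lower() for r in responses for w in r.split()]
    fillers = sum(1 for w in words if w in FILLERS)
    return fillers / len(words) if words else 0.0
```

A transcript of perfectly uniform responses scores a CV of exactly zero, which is the "suspiciously uniform" signal the table describes.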
Conversation Dynamics (8 variables)

Metrics that capture the flow and evolution of the interview conversation

ID  | Variable                   | Formula                                          | Description
D01 | Turn-Taking Balance        | Interviewer words / interviewee words            | Should be ~1:3 to 1:5.
D02 | Topic Drift Index          | Cosine similarity to research question           | Low mid-interview indicates natural tangents.
D03 | Probing Depth Trajectory   | Follow-ups vs new topics per turn                | Should increase then plateau.
D04 | Response Elaboration Trend | Regression slope of word count                   | Should show inverse-U (warming, then fatigue).
D05 | Reciprocity Score          | Interviewee references to interviewer statements | Higher = more engaged.
D06 | Topic Coverage Rate        | Addressed topics / guide topics                  | Structured: high. Narrative: may be low.
D07 | Silence Simulation Score   | Hesitation markers at turn starts / total turns  | Real interviews have pauses.
D08 | Conversation Momentum      | Novel entities per turn over time                | Should decline as interview deepens.
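D01 and D04 can likewise be sketched from their formulas. This is a simplified stand-in (a plain least-squares slope over turn index), not the platform's implementation:

```python
def turn_taking_balance(interviewer_turns: list[str],
                        interviewee_turns: list[str]) -> float:
    """D01: interviewer words / interviewee words (target roughly 1:3 to 1:5)."""
    q = sum(len(t.split()) for t in interviewer_turns)
    a = sum(len(t.split()) for t in interviewee_turns)
    return q / a if a else float("inf")

def elaboration_slope(interviewee_turns: list[str]) -> float:
    """D04: least-squares slope of word count over turn index."""
    ys = [len(t.split()) for t in interviewee_turns]
    n = len(ys)
    mx = (n - 1) / 2                      # mean of 0..n-1
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den if den else 0.0
```

Fitting the slope over the whole transcript flattens the inverse-U; in practice one would fit the warm-up and fatigue segments separately and expect a positive then negative slope.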
Blind Integrity (6 variables)

Metrics that assess how well the blind proxy maintains the illusion of human interaction

ID  | Variable                  | Formula                                         | Description
B01 | AI Identity Leak Count    | Total blind violations from sanitizer           | Zero = perfect blind.
B02 | AI Identity Leak Rate     | Violations / total turns                        | Rate per turn.
B03 | Suspicion Probe Count     | Interviewer questions probing interviewee nature| Proxy for interviewer suspicion.
B04 | Persona Consistency Score | Consistent persona details across turns         | Inconsistency may break or enhance naturalness.
B05 | Meta-Conversation Rate    | References to interview process / total sentences | Some is natural; excessive is performative.
B06 | Overly Helpful Index      | Unsolicited information rate                    | LLMs tend to over-answer.
Interviewer Quality (7 variables)

Metrics assessing the effectiveness of the AI interviewer

ID  | Variable                    | Formula                                                     | Description
Q01 | Question Type Distribution  | Classify: open/closed/probing/clarifying/leading/reflective | Good: >60% open + probing.
Q02 | Active Listening Score      | References to interviewee prior statements / turns          | Shows engagement.
Q03 | Empathy Markers             | Count of empathy phrases                                    | Appropriate for methodology.
Q04 | Leading Question Rate       | Leading questions / total questions                         | Lower = better.
Q05 | Topic Transition Smoothness | Semantic similarity between turns                           | High = smooth. Low = abrupt.
Q06 | Methodology Adherence       | Behavior vs methodology definition                          | How well interviewer follows style.
Q07 | Theory of Mind Score        | ToM markers / interviewer turns                             | Percentage of interviewer turns showing anticipatory empathy.
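A rough heuristic for Q01-style question typing might look like the sketch below. The keyword patterns are illustrative assumptions only; reliable classification of probing, reflective, and leading questions would need an LLM judge or a trained classifier rather than prefix matching:

```python
def classify_question(q: str) -> str:
    """Toy Q01 classifier: bucket a question by its opening words (heuristic)."""
    ql = q.lower().strip()
    if not ql.endswith("?"):
        return "non-question"
    if ql.startswith(("don't you", "isn't it", "wouldn't you", "surely")):
        return "leading"      # presupposes the answer (counted by Q04)
    if ql.startswith(("do you", "did you", "is ", "are ", "was ", "were ", "have you")):
        return "closed"       # invites yes/no
    if ql.startswith(("could you clarify", "what do you mean")):
        return "clarifying"
    if ql.startswith(("how", "what", "why", "tell me")):
        return "open"
    return "other"
```

Aggregating these labels over a transcript gives the distribution that the ">60% open + probing" target is checked against.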
Interviewee Authenticity (5 variables)

Metrics assessing how authentically the AI embodies the assigned persona

ID  | Variable                 | Formula                                 | Description
A01 | Persona Fidelity Score   | Persona-consistent details rate         | Alignment with stated background.
A02 | Contradiction Count      | Factual contradictions between turns    | Cross-reference with director logs.
A03 | Emotional Arc Score      | Sentiment trajectory variance           | Should show variation, not flat.
A04 | Resistance Authenticity  | Contextual appropriateness of pushback  | Scored qualitatively.
A05 | Cultural Marker Count    | Domain-specific references              | Higher = deeper persona embodiment.
Director Effectiveness (5 variables)

Metrics assessing how well the Director Layer influences conversation naturalness

ID  | Variable                     | Formula                                      | Description
E01 | Directive Compliance Rate    | Followed directives / issued directives      | How often interviewee follows instructions.
E02 | Directive Naturalness Rating | AI analysis of directed behavior             | Does it feel natural?
E03 | Contradiction Detection Rate | Interviewer probes after contradictions      | Shows active listening.
E04 | Fatigue Simulation Accuracy  | Actual vs expected word count curve          | Does response length decrease?
E05 | Director Intervention Count  | Total interventions / possible interventions | Director activity level.
Linguistic Features (6 variables)

Detailed linguistic analysis of the interview text

ID  | Variable                   | Formula                                                  | Description
L01 | Vocabulary Richness        | Type-token ratio                                         | Unique words / total words.
L02 | Avg Sentence Complexity    | Clauses per sentence proxy                               | Real speech varies in complexity.
L03 | Code-Switching Count       | Non-English phrases in English interview                 | Expected in heritage interviews.
L04 | Narrative Structure Score  | Story elements: setting, characters, conflict, resolution | High for narrative methodology.
L05 | Discourse Marker Frequency | Discourse markers / total words                          | Higher in natural speech.
L06 | Lexical Diversity Trend    | TTR per turn over time                                   | Should decrease with fatigue.
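L01 and L06 follow directly from their definitions. A minimal sketch (raw TTR; note that raw TTR is length-sensitive, so production code might prefer a windowed or moving-average variant):

```python
def type_token_ratio(text: str) -> float:
    """L01: unique words / total words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return len(set(words)) / len(words) if words else 0.0

def ttr_trend(turns: list[str]) -> list[float]:
    """L06: TTR per turn; a downward drift is expected with fatigue."""
    return [type_token_ratio(t) for t in turns]
```

Comparing the first- and last-third averages of the trend gives a single decrease-with-fatigue indicator.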

The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research

Authors: Yohanan S. Ouaknine, Ph.D.

Affiliations: DHSS Hub, Open University of Israel; Ariel University; Open Information Services (OIS)

Abstract

We present the Blind Proxy, a novel research platform that enables qualitative interviews between two large language models (LLMs) where neither participant is aware that the other is artificial. Situated at the intersection of machine behaviour research (Rahwan et al., 2019), LLM social simulation (Park et al., 2023, 2024), and classical Turing test theory (Turing, 1950; Moor, 1976), the system uses a Flask-based orchestrator that mediates between Claude Opus (as interviewer) and Claude Sonnet (as interviewee), maintaining separate conversation histories and stripping AI-identity markers from all exchanges. Unlike multi-agent debate frameworks designed to improve factual accuracy (Irving et al., 2018), our system uses cross-model interaction to generate qualitative research data following established methodological traditions (Creswell & Poth, 2024). The platform supports seven qualitative methodologies (structured, semi-structured, narrative, phenomenological, grounded theory, ethnographic, and case study), each with methodology-specific interviewer behavioral instructions and AI-generated interview guides following the Kallio et al. (2016) framework. To address the documented 'hyper-accuracy distortion' in LLM-generated qualitative data (Amirova et al., 2024), we introduce a Director Layer: a hybrid rule-based and AI-driven behavioral engine that injects contradictions, emotional shifts, fatigue patterns, and conversational resistance into the interviewee's responses between turns. Drawing on behavioral game theory (Camerer, 2003) and theory-of-mind research (Kosinski, 2023; Rabinowitz et al., 2018), the Director produces transcripts that more closely approximate human interview behavior. The platform captures 40+ measurement variables per interview spanning naturalness indicators, conversation dynamics, blind integrity scores, theory-of-mind markers, and linguistic features, enabling systematic comparison of synthetic interview quality against human baselines. 
Transcript evaluation employs both human raters (classical Turing test) and an automated LLM-as-judge method (Zheng et al., 2023). An emergent pattern detection module identifies novel conversational strategies that arise from AI-to-AI interaction (Baker et al., 2019; Silver et al., 2017). We validate the platform through reproduction of five landmark interview studies (Hochschild, 1983; Gioia & Chittipeddi, 1991; Charmaz, 1995; Bhatia & Ram, 2009; Kim & Asbury, 2020) and a series of original interviews on Moroccan Jewish diaspora heritage, comparing AI-to-AI transcripts (with and without the Director Layer) against real human interview transcripts from the same domain.

Keywords
LLM social simulation, synthetic qualitative research, AI-to-AI interaction, blind proxy architecture, behavioral director, machine behaviour, Turing test, cross-model interview, persona fidelity, algorithmic fidelity, theory of mind, emergent AI behavior
Proposed Structure
Section | Title | Description
1  | Introduction | From Turing's imitation game to synthetic qualitative research
2  | Theoretical Framework | Machine behaviour (Rahwan), multi-agent systems (Wooldridge), AI debate (Irving), theory of mind (Kosinski)
3  | Related Work | LLM social simulations, AI-as-interviewer, persona fidelity
4  | System Architecture | The Blind Proxy platform
5  | Qualitative Methodology Engine | Creswell-Kallio-IPR framework implementation
6  | The Director Layer | Hybrid behavioral realism engine
7  | Measurement Framework | 40+ variables, LLM judge, emergent pattern detection
8  | Study 1: Reproduction of Landmark Studies | Five landmark interview studies reproduced
9  | Study 2: Moroccan Jewish Heritage | Interviews with demographic variation
10 | Study 3: Turing Test Evaluation | Human raters + LLM judge
11 | Results | Thematic convergence, naturalness ratings, detection rates
12 | Discussion | Implications for qualitative methodology, ethics of synthetic participants
13 | Limitations and Future Work | Current constraints and roadmap
14 | Conclusion | Summary and contributions
Suggested Citation

Ouaknine, Y. S. (2026). The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research. Working Paper. DHSS Hub, Open University of Israel.

Creswell-Kallio-IPR Hybrid Framework - The theoretical foundation for our interview guide generation and methodology-specific interviewer prompts.

Creswell & Poth (2024, 5th ed.)

Defines five qualitative traditions (narrative, phenomenological, grounded theory, ethnographic, case study), which our platform extends with structured and semi-structured as general approaches. For each: philosophical underpinnings, defining features, data procedures, writing structures. The epistemological foundation that shapes how the AI interviewer thinks and listens.

Kallio et al. (2016)

Five-phase framework: (1) Prerequisites, (2) Prior knowledge, (3) Preliminary guide, (4) Pilot testing, (5) Final guide. The operational structure that shapes how the guide is built. Most cited framework for interview guide development.

IPR Framework (Castillo-Montoya, 2016)

Four phases: (1) Question-research alignment, (2) Inquiry-based conversation, (3) Feedback, (4) Piloting. Ensures every question serves the research question. In our platform, phases 3-4 are replaced by the AI interview itself (the interview IS the pilot).

Methodology Comparison Matrix

Seven qualitative traditions compared across key interviewing dimensions (Creswell & Poth, 2024).

Platform Integration Flow

How the Creswell-Kallio-IPR framework integrates into the interview platform.

User enters research question + selects methodology
        |
        v
Guide Generator (Claude Sonnet API call)
  - Uses Creswell epistemology for question design
  - Uses Kallio structure for guide organization
  - Uses IPR alignment for question-research fit
        |
        v
Generated interview guide (editable by user)
        |
        v
Embedded in Interviewer's system prompt
  - Methodology-specific behavioral instructions
  - Question types, stance, what to avoid
        |
        v
Interview runs with methodology-appropriate behavior
        |
        v
Measurements calibrated per methodology
  (e.g., topic_coverage matters for structured, not narrative)
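The final assembly step, embedding the generated guide and methodology instructions into the interviewer's system prompt, can be sketched as below. The dict keys, instruction wording, and prompt template are illustrative assumptions, not the platform's actual prompt builder:

```python
# Hypothetical methodology-specific behavioral instructions (illustrative).
METHODOLOGY_INSTRUCTIONS = {
    "narrative": "Invite extended stories; follow the participant's chronology; avoid interrupting.",
    "structured": "Ask the guide questions in order; keep probes brief and consistent.",
}

def build_interviewer_prompt(research_question: str, methodology: str,
                             guide: list[str]) -> str:
    """Assemble the interviewer system prompt from question, methodology, and guide."""
    lines = [
        f"You are a qualitative researcher conducting a {methodology} interview.",
        f"Research question: {research_question}",
        "Behavioral instructions: " + METHODOLOGY_INSTRUCTIONS[methodology],
        "Interview guide:",
    ]
    lines += [f"  {i}. {q}" for i, q in enumerate(guide, 1)]
    return "\n".join(lines)
```

Because the guide is user-editable before this step, the same builder can be re-run after manual revisions without touching the rest of the pipeline.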