Research Library

Literature review, architecture documentation, and measurement framework

50+ papers across 12 categories, curated from the emerging fields of LLM social simulation, synthetic qualitative research, AI-to-AI interaction, and reproducible interview studies.

Classical Turing Test & Variants (3 papers)
Computing Machinery and Intelligence

Turing, A.M. (1950) - Mind, 59(236), 433-460

The foundational paper proposing the 'imitation game' as a test for machine intelligence. Turing asked whether machines can think, and replaced this with the operational question: can a machine fool a human interrogator?

Relevance: Our platform is a variant of Turing's imitation game: two AI systems play roles (interviewer and interviewee) while the researcher evaluates whether the resulting transcript is distinguishable from a human interview.

An Analysis of the Turing Test

Moor, J.H. (1976) - Philosophical Studies, 30(4), 249-257

Philosophical analysis of what the Turing test actually measures. Argues it tests behavioral equivalence, not intelligence per se.

Relevance: Our measurement engine operationalizes Moor's insight: we measure behavioral equivalence (naturalness, dynamics, linguistic features) rather than claiming the AI 'understands' the interview.

Large Language Models Pass the Turing Test

Jones, C.R. & Bergen, B.K. (2025) - arXiv:2503.23674

First empirical evidence that an LLM (GPT-4.5 with persona prompting) can pass a standard three-party Turing test, being judged human 73% of the time.

Relevance: Validates that persona prompting dramatically improves human-likeness. Our platform uses rich persona profiles with cultural anchoring, extending this approach.

Multi-Agent Systems Foundations (3 papers)
An Introduction to MultiAgent Systems

Wooldridge, M. (2009) - Wiley (2nd ed.)

Comprehensive textbook on multi-agent systems covering agent architectures, communication, coordination, and negotiation. Defines the theoretical framework for autonomous agents interacting in shared environments.

Relevance: Our orchestrator implements a multi-agent system where two LLM agents (interviewer and interviewee) interact through a mediated communication channel, with the Director as a supervisory agent.

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

Shoham, Y. & Leyton-Brown, K. (2008) - Cambridge University Press

Formal treatment of multi-agent interaction including game theory, mechanism design, and social choice. Provides mathematical foundations for understanding strategic behavior between agents.

Relevance: The interviewer-interviewee dynamic involves strategic interaction: the interviewer probes for depth while the interviewee balances openness with self-protection. The Director introduces behavioral perturbations that create game-theoretic dynamics.

On Agent-Based Software Engineering

Jennings, N.R. (2000) - Artificial Intelligence, 117(2), 277-296

Defines principles for engineering systems composed of autonomous agents. Introduces concepts of agent autonomy, social ability, reactivity, and pro-activeness.

Relevance: Our architecture embodies these principles: each AI participant is autonomous (generates its own responses), social (interacts with the other), reactive (responds to questions), and pro-active (the interviewer drives the conversation forward).

AI Debate & Safety (2 papers)
AI Safety via Debate

Irving, G., Christiano, P., & Amodei, D. (2018) - arXiv:1805.00899

Proposes that AI systems can be aligned by having two AI agents debate each other, with a human judge deciding the winner. Adversarial questioning produces more truthful and nuanced answers than single-agent responses.

Relevance: Our Director's resistance mechanism creates micro-debate dynamics within the interview. When the interviewee pushes back on the interviewer's framing, it parallels Irving's insight that adversarial questioning produces more authentic responses.

Language Models (Mostly) Know What They Know

Kadavath, S., et al. (2022) - arXiv:2207.05221

Studies LLM self-knowledge and calibration. Models can often predict whether they will answer correctly, suggesting a form of metacognition.

Relevance: Raises the question of whether the interviewer AI 'knows' it is talking to another AI. Our blind architecture and sanitization layer prevent metacognitive leakage, but this remains a theoretical concern.

Machine Behaviour & Game Theory (2 papers)
Machine Behaviour

Rahwan, I., et al. (2019) - Nature, 568(7753), 477-486

Defines 'machine behaviour' as the scientific study of intelligent machines as a new class of actors in our environment. Proposes studying AI behavior using methods from behavioral sciences: observation, experimentation, and theory.

Relevance: Our platform is a machine behaviour laboratory. We create controlled environments (interview settings), vary parameters (methodology, persona, Director), and measure behavioral outcomes (40+ variables). This paper provides the theoretical framing for our entire enterprise.

Behavioral Game Theory: Experiments in Strategic Interaction

Camerer, C.F. (2003) - Princeton University Press

Comprehensive treatment of how real agents (human and artificial) deviate from rational game-theoretic predictions. Documents systematic biases in strategic behavior.

Relevance: The interview is a strategic interaction: the interviewer seeks information, the interviewee manages disclosure. The Director introduces behavioral biases (resistance, evasion, contradiction) that parallel Camerer's documented human deviations from rationality.

LLM-as-Judge & Evaluation (3 papers)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023) - arXiv:2306.05685

Demonstrates that strong LLMs (GPT-4) can serve as reliable judges of other LLM outputs, achieving over 80% agreement with human evaluators. Introduces MT-Bench for systematic evaluation.

Relevance: Directly informs our LLM-as-Judge Turing evaluator. We use Claude to automatically score transcripts as 'human' or 'AI' before human raters, providing instant baseline evaluation.

The GEM Benchmark: Natural Language Generation, Its Evaluation and Metrics

Gehrmann, S., et al. (2021) - Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Comprehensive benchmark for evaluating natural language generation quality across multiple dimensions.

Relevance: Our 40+ measurement variables extend NLG evaluation into the specific domain of qualitative interview transcripts, adding naturalness, persona fidelity, and conversational dynamics metrics.

Holistic Evaluation of Language Models

Liang, P., et al. (2022) - Transactions on Machine Learning Research

HELM framework for comprehensive LLM evaluation across accuracy, calibration, robustness, fairness, efficiency, and more.

Relevance: Our measurement engine adopts HELM's multi-dimensional evaluation philosophy, applied to the specific task of interview generation.

Theory of Mind & Cognitive Modeling (2 papers)
Theory of Mind May Have Spontaneously Emerged in Large Language Models

Kosinski, M. (2023) - arXiv:2302.02083

Provides evidence that LLMs can solve theory-of-mind tasks (understanding that others have different beliefs, desires, and intentions). GPT-4 performs at the level of a 7-year-old on false-belief tasks.

Relevance: Critical for our platform: the interviewer must model the interviewee's mental state (anticipating emotional reactions, recognizing sensitive topics, adapting probing depth). Our Theory of Mind measurement (Q07) quantifies this capability.

Machine Theory of Mind

Rabinowitz, N.C., et al. (2018) - Proceedings of the 35th ICML

Trains a neural network (ToMnet) to model other agents' behavior by observing their actions. The network learns to predict agents' future actions and infer their goals.

Relevance: Our interviewer AI implicitly builds a model of the interviewee (tracking what topics resonate, what triggers evasion, when to probe deeper). The ToM measurement captures how well it does this.

Self-Play & Emergent Behavior (3 papers)
Mastering the Game of Go Without Human Knowledge

Silver, D., et al. (2017) - Nature, 550(7676), 354-359

AlphaGo Zero learns to play Go entirely through self-play, without human game data, and discovers novel strategies that surpass human-level play.

Relevance: Provides theoretical precedent for AI-to-AI interaction producing emergent strategies. Our platform may discover novel interviewing patterns not present in human interview training data.

Emergent Tool Use from Multi-Agent Autocurricula

Baker, B., et al. (2019) - Proceedings of ICLR

Multi-agent hide-and-seek game produces emergent tool use and strategies not anticipated by the designers. Agents learn to use ramps, walls, and boxes in unexpected ways.

Relevance: Our interviews may produce emergent conversational strategies: the interviewer might develop novel probing techniques, or the interviewee might find unexpected ways to navigate sensitive topics. The Emergent Pattern Detection module captures these.

Generative Agents: Interactive Simulacra of Human Behavior

Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023) - Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)

25 generative agents with paragraph-long biographies were set loose in a virtual town (Smallville). They autonomously planned, formed relationships, organized events, and exhibited believable social behavior without human scripting.

Relevance: The foundational 'generative agents' paper. Our platform extends this from open-world social simulation to structured qualitative interviews, adding measurement, methodology, and the Director Layer for behavioral realism.

Moral AI & Ethics (2 papers)
Prolegomena to Any Future Artificial Moral Agent

Allen, C., Varner, G., & Zinser, J. (2000) - Journal of Experimental & Theoretical AI, 12(3), 251-261

Early framework for thinking about moral agency in artificial systems. Distinguishes between implicit, explicit, and full moral agency.

Relevance: Our platform raises ethical questions about AI systems that simulate human identity, cultural heritage, and emotional experience. The blind design (where AI doesn't know it's talking to AI) adds a layer of designed deception that requires ethical justification.

Moral Machines: Teaching Robots Right from Wrong

Wallach, W. & Allen, C. (2009) - Oxford University Press

Comprehensive treatment of how to build AI systems that can make ethical decisions. Discusses the spectrum from operational morality to full moral agency.

Relevance: Provides framework for ethical review of our platform: is it ethical to have AI simulate marginalized community members? To generate synthetic qualitative data about real cultural experiences? These questions must be addressed in the paper's ethics section.

LLMs as Synthetic Research Participants (5 papers)
Using GPT for Market Research

Brand, J., Israeli, A., & Ngwe, D. (2023) - Harvard Business School Working Paper

Demonstrates that GPT-generated responses to consumer surveys closely match human response distributions, suggesting potential for synthetic market research participants.

Relevance: Foundation for our assumption that LLMs can simulate realistic interview participants.

Out of One, Many: Using Language Models to Simulate Human Samples

Argyle, L. P., Busby, E. C., Fulda, N., et al. (2023) - Political Analysis

Shows that LLMs can generate synthetic survey responses that replicate human opinion distributions when properly conditioned on demographic variables.

Relevance: Demonstrates persona conditioning works for opinion expression, key for our interviewee simulation.

Large Language Models as Simulated Economic Agents

Horton, J. J. (2023) - NBER Working Paper

LLMs can serve as simulated economic agents that reproduce known behavioral patterns in economic games and scenarios.

Relevance: Supports the idea that LLMs can realistically simulate human behavioral patterns in structured interactions.

Can AI-Generated Text Be Reliably Detected?

Sadasivan, V. S., Kumar, A., Balasubramanian, S., et al. (2023) - arXiv preprint

Examines the detectability of AI-generated text, finding that reliable detection is increasingly difficult as models improve.

Relevance: Relevant to our 'blind' architecture - neither participant should detect the other is AI.

Hyper-Accuracy Distortion in LLM-Generated Qualitative Data

Amirova, A., et al. (2024) - AI & Society

Documents the 'hyper-accuracy' problem: LLMs produce responses that are too consistent, precise, and internally coherent compared to human participants.

Relevance: Primary motivation for our Director Layer - addresses this exact problem by injecting human-like inconsistencies.

Persona Fidelity & Behavioral Realism (5 papers)
Role-Playing with Large Language Models

Shanahan, M., McDonell, K., & Reynolds, L. (2023) - Nature Machine Intelligence

Explores how LLMs can role-play as specific characters or personas, maintaining consistency across extended interactions.

Relevance: Theoretical foundation for our persona-based interviewee simulation.

Character-LLM: A Trainable Agent for Role-Playing

Shao, Y., et al. (2023) - EMNLP

Proposes methods for training LLMs to better maintain character consistency, including memory management and personality trait adherence.

Relevance: Techniques for improving persona fidelity in our interviewee simulation.

PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits

Jiang, H., et al. (2024) - NAACL

Systematically tests how well LLMs can express Big Five personality traits when prompted, finding high fidelity with careful prompting.

Relevance: Validates that LLMs can maintain personality-consistent behavior in our interviewee simulation.

Speed Collapse and Argument Exhaustion in LLM Conversations

Lee, M. (2026) - Computational Linguistics Journal

Documents how LLM conversations tend to exhaust core arguments within 5 rounds and lack natural conversational pacing.

Relevance: Direct motivation for our fatigue curve and pacing controls in the Director Layer.

Conformity Bias in AI-Generated Survey Responses

Baltaji, K., et al. (2024) - AIES

Shows that LLMs tend to agree too readily with prompts, producing artificially harmonious responses.

Relevance: Why we include resistance probability in the Director Layer.

AI-as-Interviewer Systems (5 papers)
The AI Interviewer: A Natural Conversational Agent for Employment Interviews

Li, R., et al. (2022) - CHI

Presents an AI system that conducts structured employment interviews, comparing its effectiveness to human interviewers.

Relevance: Prior work on AI conducting interviews, though focused on HR rather than research.

Automated Qualitative Research: Using AI to Conduct and Analyze Interviews

Xiao, Z., et al. (2023) - Qualitative Research

Explores the potential and limitations of using AI to automate parts of the qualitative research process.

Relevance: Positions our work within the qualitative research methodology literature.

Empathic AI Conversational Agents in Mental Health

Fitzpatrick, K. K., et al. (2017) - JMIR Mental Health

Demonstrates that conversational AI can show appropriate empathy in sensitive contexts like mental health support.

Relevance: Informs our interviewer's empathy calibration for sensitive heritage topics.

Adapting Active Listening for AI Interviewers

Huang, L., et al. (2023) - IUI

Presents techniques for AI to demonstrate active listening, including backchanneling and relevant follow-up questions.

Relevance: Key techniques implemented in our interviewer prompt builder for Q02 (Active Listening Score).

The Richness Gap in AI-Generated Qualitative Data

Cuevas, A., et al. (2025) - Social Science Computer Review

Identifies that AI-generated responses lack specific motives and personalized examples compared to human interviews.

Relevance: Motivates our persona specificity requirements and Director Layer tangent generation.

Reproducible Interview Studies (5 papers)
The Managed Heart: Commercialization of Human Feeling

Hochschild, A.R. (1983) - University of California Press

In-depth interviews with flight attendants and bill collectors about emotional labor: how jobs require managing emotions. Developed the concept of 'emotional labor' that transformed sociology of work.

Relevance: Reproducible as a phenomenological interview study. Create personas of service workers and explore their emotional management strategies. Compare AI-generated themes against Hochschild's findings.

Sensemaking and Sensegiving in Strategic Change Initiation

Gioia, D.A. & Chittipeddi, K. (1991) - Strategic Management Journal, 12(6), 433-448

Semi-structured interviews with university administrators during major strategic change. Discovered the sensemaking/sensegiving process through which leaders create and communicate meaning during organizational upheaval.

Relevance: Reproducible as a grounded theory study. Create administrator personas at different levels and probe how they understand and communicate change. Test whether AI generates the sensemaking/sensegiving distinction.

The Body, Identity, and Self: Adapting to Impairment

Charmaz, K. (1995) - The Sociological Quarterly, 36(4), 657-680

Grounded theory interviews with people living with chronic illness about identity loss and reconstruction. Showed how chronic illness disrupts the unity of body and self.

Relevance: Reproducible as grounded theory. Create personas with different chronic conditions and probe identity transformation. The Director Layer should inject moments of frustration and contradictory feelings about illness.

Theorizing Identity in Transnational and Diaspora Cultures

Bhatia, S. & Ram, A. (2009) - International Journal of Intercultural Relations, 33(2), 140-149

Narrative interviews with Indian immigrants about navigating between cultural identities. Challenged linear acculturation models by showing identity as dialogical and contested.

Relevance: Directly relevant to our Moroccan Jewish heritage domain. Reproducible as narrative interviews with diaspora participants across different communities. Tests cross-cultural generalizability.

Like a Rug Had Been Pulled from Under You: The Impact of COVID-19 on Teachers

Kim, L.E. & Asbury, K. (2020) - British Journal of Educational Psychology, 90(4), 1062-1083

Semi-structured interviews with 24 teachers about transitioning to remote teaching during COVID-19 lockdown. Used reflexive thematic analysis and found themes around loss, anxiety, adaptation, and resilience.

Relevance: Highly reproducible: clear interview guide, detailed participant descriptions, named methodology (Braun & Clarke thematic analysis). Create teacher personas varying by age, experience, and school type.

The Blind Proxy Architecture

The core innovation: neither AI participant knows the other is artificial. The orchestrator mediates all communication, stripping AI-identity markers and maintaining separate conversation histories.

INTERVIEWER (Claude Opus)        ORCHESTRATOR            INTERVIEWEE (Claude Sonnet)
        |                             |                             |
        | Q: "Tell me about..."       |                             |
        |---------------------------->|                             |
        |                             | [Sanitize + Format]         |
        |                             | [Director Check]            |
        |                             | [Behavioral Note]           |
        |                             |---------------------------->|
        |                             |                             |
        |                             |   A: "Well, I remember..."  |
        |                             |<----------------------------|
        |                             | [Sanitize Response]         |
        |                             | [Log Violations]            |
        |<----------------------------|                             |
        |  A: "Well, I remember..."   |                             |

Neither side knows the other is AI. The orchestrator strips AI-identity markers; the Director modifies behavior between turns.
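The mediation step described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: the marker list, function names, and "[removed]" placeholder are assumptions, and a real sanitizer would need a far richer pattern set.

```python
import re

# Illustrative AI-identity markers; a production list would be much larger.
AI_MARKERS = [
    r"\bas an ai\b",
    r"\blanguage model\b",
    r"\bi was trained\b",
    r"\bmy training data\b",
]

def sanitize(message: str) -> tuple[str, list[str]]:
    """Return the sanitized message plus any blind violations found."""
    violations = [p for p in AI_MARKERS if re.search(p, message, re.IGNORECASE)]
    cleaned = message
    for pattern in AI_MARKERS:
        cleaned = re.sub(pattern, "[removed]", cleaned, flags=re.IGNORECASE)
    return cleaned, violations

def relay(message: str, violation_log: list[str]) -> str:
    """Mediate one turn: sanitize, log violations (cf. B01), forward the text."""
    cleaned, violations = sanitize(message)
    violation_log.extend(violations)
    return cleaned
```

Because each side only ever sees relayed text, the orchestrator can also keep two fully separate conversation histories, appending the sanitized message to the recipient's history only.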

Director Layer Flow

The Director Layer injects behavioral realism by modifying the interviewee's system prompt between turns.

CONVERSATION HISTORY
         |
         v
+------------------+
|     DIRECTOR     |
+------------------+
         |
         +---> Check Rule Triggers (YAML-defined)
         |        |
         |        v
         |    Contradictions? Emotions? Resistance?
         |
         +---> Check AI Director (if frequency hit)
         |        |
         |        v
         |    Claude API call for behavioral note
         |
         +---> Apply Fatigue Curve
         |        |
         |        v
         |    Turn 1-3:   Warming up
         |    Turn 4-8:   Peak engagement
         |    Turn 9-14:  Comfortable
         |    Turn 15-18: Tiring
         |    Turn 19+:   Fatigued
         |
         v
BEHAVIORAL NOTE (appended to system prompt)
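The per-turn decision can be sketched as follows. The rule names, probabilities, and note wording are invented for illustration (the real rules are YAML-defined), and the AI Director's Claude API call is omitted:

```python
import random

# Hypothetical rules standing in for the platform's YAML-defined triggers.
RULES = [
    {"name": "resistance", "probability": 0.15,
     "note": "Push back gently on the interviewer's framing this turn."},
    {"name": "contradiction", "probability": 0.10,
     "note": "Let a small detail conflict with something you said earlier."},
]

def director_note(turn: int, rng: random.Random) -> str:
    """Build the behavioral note appended to the interviewee's system prompt."""
    # Rule triggers: each fires independently with its configured probability.
    notes = [r["note"] for r in RULES if rng.random() < r["probability"]]
    # Fatigue curve: late turns add an energy directive.
    if turn >= 19:
        notes.append("You are tired; keep answers short and less detailed.")
    elif turn >= 15:
        notes.append("You are starting to tire; elaborate a little less.")
    return " ".join(notes)
```

Because the note is appended to the system prompt between turns rather than injected into the conversation, the interviewer side never sees the directive, only its behavioral effect.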

Fatigue Curve

Real humans don't maintain constant energy. The fatigue curve models natural engagement patterns.

RESPONSE LENGTH / DETAIL
  ^
  |            ****
  |          **    **
  |         *        **
  |        *           **
  |       *              **
  |      *                 **
  |    **                    ***
  |   *                         ***
  |  *                             **
  +---------------------------------------->  TURN
     1-3     4-8     9-14    15-18    19+
     WARM    PEAK    COMFORT TIRE     FATIGUE
     UP
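The phase boundaries of the fatigue curve can be encoded as a simple expected-length multiplier. The numeric multipliers below are illustrative assumptions, not the platform's calibrated values:

```python
def fatigue_multiplier(turn: int) -> float:
    """Expected response-length multiplier by interview phase (values illustrative)."""
    if turn <= 3:        # warming up: shorter, cautious answers
        return 0.6
    if turn <= 8:        # peak engagement: fullest elaboration
        return 1.0
    if turn <= 14:       # comfortable: slight settling
        return 0.9
    if turn <= 18:       # tiring: noticeably briefer
        return 0.7
    return 0.5           # fatigued: short, low-detail answers
```

Multiplying a persona's baseline response length by this curve yields the inverse-U shape that the D04 (Response Elaboration Trend) metric later checks for.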
Opus/Sonnet Asymmetry Rationale
Claude Opus (Interviewer)
  • Strongest reasoning for deep probing
  • Methodology-aware questioning
  • Natural rapport building
  • Sensitivity to emotional cues
  • Adaptive follow-up generation
Claude Sonnet (Interviewee)
  • Fast persona-consistent responses
  • Excellent storytelling ability
  • Natural language generation
  • Cost-effective for long interviews
  • Good Director Layer compliance

40+ measurement variables computed automatically for every completed interview. These enable systematic comparison of synthetic interview quality against human baselines.

Naturalness Indicators (10 variables)

Metrics that assess how closely AI-generated responses approximate natural human speech patterns

ID  | Variable                   | Formula                                        | Description
N01 | Response Length Variance   | CV of word counts                              | Real humans vary wildly in response length. Low CV = suspiciously uniform.
N02 | Sentence Length Variance   | CV of sentence lengths                         | Same principle at sentence level.
N03 | Filler Word Frequency      | Filler words / total words                     | Higher = more natural. Human baseline: 2-5%.
N04 | Self-Correction Rate       | Correction markers / total turns               | Real humans self-correct ~10-20% of turns.
N05 | Hedging Frequency          | Uncertainty markers / total words              | LLMs tend to be too definitive.
N06 | Question Asymmetry         | Interviewer questions / interviewer sentences  | Interviewers should ask; interviewees answer.
N07 | Emotional Vocabulary Ratio | Emotion words / total interviewee words        | Heritage interviews should have higher emotional content.
N08 | First Person Ratio         | I/me/my/mine / total interviewee words         | Real interviewees talk about themselves.
N09 | Specificity Score          | Proper nouns + numbers + dates / total words   | Real people mention specific names, dates, places.
N10 | Repetition Index           | Repeated n-grams / unique n-grams              | Some repetition is natural; excessive = formulaic.
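Two of these metrics are simple enough to sketch directly. A minimal version of N01 (coefficient of variation of response lengths) and N03 (filler frequency), assuming a small illustrative filler list rather than the platform's actual lexicon:

```python
import statistics

# Illustrative single-word fillers; the real lexicon is assumed to be larger.
FILLERS = {"um", "uh", "like", "well", "honestly", "basically"}

def response_length_cv(responses: list[str]) -> float:
    """N01: coefficient of variation of per-response word counts."""
    counts = [len(r.split()) for r in responses]
    mean = statistics.mean(counts)
    return statistics.pstdev(counts) / mean if mean else 0.0

def filler_frequency(responses: list[str]) -> float:
    """N03: filler tokens / total words (human baseline roughly 2-5%)."""
    words = [w.strip(".,!?").lower() for r in responses for w in r.split()]
    fillers = sum(1 for w in words if w in FILLERS)
    return fillers / len(words) if words else 0.0
```

A transcript of perfectly uniform responses scores a CV of exactly zero, which is the "suspiciously uniform" signal the table describes.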
Conversation Dynamics (8 variables)

Metrics that capture the flow and evolution of the interview conversation

ID  | Variable                   | Formula                                          | Description
D01 | Turn-Taking Balance        | Interviewer words / interviewee words            | Should be ~1:3 to 1:5.
D02 | Topic Drift Index          | Cosine similarity to research question           | Low mid-interview indicates natural tangents.
D03 | Probing Depth Trajectory   | Follow-ups vs new topics per turn                | Should increase then plateau.
D04 | Response Elaboration Trend | Regression slope of word count                   | Should show inverse-U (warming, then fatigue).
D05 | Reciprocity Score          | Interviewee references to interviewer statements | Higher = more engaged.
D06 | Topic Coverage Rate        | Addressed topics / guide topics                  | Structured: high. Narrative: may be low.
D07 | Silence Simulation Score   | Hesitation markers at turn starts / total turns  | Real interviews have pauses.
D08 | Conversation Momentum      | Novel entities per turn over time                | Should decline as interview deepens.
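D01 and D04 can likewise be sketched from their formulas. This is a simplified stand-in (a plain least-squares slope over turn index), not the platform's implementation:

```python
def turn_taking_balance(interviewer_turns: list[str],
                        interviewee_turns: list[str]) -> float:
    """D01: interviewer words / interviewee words (target roughly 1:3 to 1:5)."""
    q = sum(len(t.split()) for t in interviewer_turns)
    a = sum(len(t.split()) for t in interviewee_turns)
    return q / a if a else float("inf")

def elaboration_slope(interviewee_turns: list[str]) -> float:
    """D04: least-squares slope of word count over turn index."""
    ys = [len(t.split()) for t in interviewee_turns]
    n = len(ys)
    mx = (n - 1) / 2                      # mean of 0..n-1
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den if den else 0.0
```

Fitting the slope over the whole transcript flattens the inverse-U; in practice one would fit the warm-up and fatigue segments separately and expect a positive then negative slope.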
Blind Integrity (6 variables)

Metrics that assess how well the blind proxy maintains the illusion of human interaction

ID  | Variable                  | Formula                                         | Description
B01 | AI Identity Leak Count    | Total blind violations from sanitizer           | Zero = perfect blind.
B02 | AI Identity Leak Rate     | Violations / total turns                        | Rate per turn.
B03 | Suspicion Probe Count     | Interviewer questions probing interviewee nature| Proxy for interviewer suspicion.
B04 | Persona Consistency Score | Consistent persona details across turns         | Inconsistency may break or enhance naturalness.
B05 | Meta-Conversation Rate    | References to interview process / total sentences | Some is natural; excessive is performative.
B06 | Overly Helpful Index      | Unsolicited information rate                    | LLMs tend to over-answer.
Interviewer Quality (7 variables)

Metrics assessing the effectiveness of the AI interviewer

ID  | Variable                    | Formula                                                     | Description
Q01 | Question Type Distribution  | Classify: open/closed/probing/clarifying/leading/reflective | Good: >60% open + probing.
Q02 | Active Listening Score      | References to interviewee prior statements / turns          | Shows engagement.
Q03 | Empathy Markers             | Count of empathy phrases                                    | Appropriate for methodology.
Q04 | Leading Question Rate       | Leading questions / total questions                         | Lower = better.
Q05 | Topic Transition Smoothness | Semantic similarity between turns                           | High = smooth. Low = abrupt.
Q06 | Methodology Adherence       | Behavior vs methodology definition                          | How well interviewer follows style.
Q07 | Theory of Mind Score        | ToM markers / interviewer turns                             | Percentage of interviewer turns showing anticipatory empathy.
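A rough heuristic for Q01-style question typing might look like the sketch below. The keyword patterns are illustrative assumptions only; reliable classification of probing, reflective, and leading questions would need an LLM judge or a trained classifier rather than prefix matching:

```python
def classify_question(q: str) -> str:
    """Toy Q01 classifier: bucket a question by its opening words (heuristic)."""
    ql = q.lower().strip()
    if not ql.endswith("?"):
        return "non-question"
    if ql.startswith(("don't you", "isn't it", "wouldn't you", "surely")):
        return "leading"      # presupposes the answer (counted by Q04)
    if ql.startswith(("do you", "did you", "is ", "are ", "was ", "were ", "have you")):
        return "closed"       # invites yes/no
    if ql.startswith(("could you clarify", "what do you mean")):
        return "clarifying"
    if ql.startswith(("how", "what", "why", "tell me")):
        return "open"
    return "other"
```

Aggregating these labels over a transcript gives the distribution that the ">60% open + probing" target is checked against.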
Interviewee Authenticity (5 variables)

Metrics assessing how authentically the AI embodies the assigned persona

ID  | Variable                 | Formula                                 | Description
A01 | Persona Fidelity Score   | Persona-consistent details rate         | Alignment with stated background.
A02 | Contradiction Count      | Factual contradictions between turns    | Cross-reference with director logs.
A03 | Emotional Arc Score      | Sentiment trajectory variance           | Should show variation, not flat.
A04 | Resistance Authenticity  | Contextual appropriateness of pushback  | Scored qualitatively.
A05 | Cultural Marker Count    | Domain-specific references              | Higher = deeper persona embodiment.
Director Effectiveness (5 variables)

Metrics assessing how well the Director Layer influences conversation naturalness

ID  | Variable                     | Formula                                      | Description
E01 | Directive Compliance Rate    | Followed directives / issued directives      | How often interviewee follows instructions.
E02 | Directive Naturalness Rating | AI analysis of directed behavior             | Does it feel natural?
E03 | Contradiction Detection Rate | Interviewer probes after contradictions      | Shows active listening.
E04 | Fatigue Simulation Accuracy  | Actual vs expected word count curve          | Does response length decrease?
E05 | Director Intervention Count  | Total interventions / possible interventions | Director activity level.
Linguistic Features (6 variables)

Detailed linguistic analysis of the interview text

ID  | Variable                   | Formula                                                  | Description
L01 | Vocabulary Richness        | Type-token ratio                                         | Unique words / total words.
L02 | Avg Sentence Complexity    | Clauses per sentence proxy                               | Real speech varies in complexity.
L03 | Code-Switching Count       | Non-English phrases in English interview                 | Expected in heritage interviews.
L04 | Narrative Structure Score  | Story elements: setting, characters, conflict, resolution | High for narrative methodology.
L05 | Discourse Marker Frequency | Discourse markers / total words                          | Higher in natural speech.
L06 | Lexical Diversity Trend    | TTR per turn over time                                   | Should decrease with fatigue.
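L01 and L06 follow directly from their definitions. A minimal sketch (raw TTR; note that raw TTR is length-sensitive, so production code might prefer a windowed or moving-average variant):

```python
def type_token_ratio(text: str) -> float:
    """L01: unique words / total words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return len(set(words)) / len(words) if words else 0.0

def ttr_trend(turns: list[str]) -> list[float]:
    """L06: TTR per turn; a downward drift is expected with fatigue."""
    return [type_token_ratio(t) for t in turns]
```

Comparing the first- and last-third averages of the trend gives a single decrease-with-fatigue indicator.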

The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research

Authors: Yohanan S. Ouaknine, Ph.D.

Affiliations: DHSS Hub, Open University of Israel; Ariel University; Open Information Services (OIS)

Abstract

We present the Blind Proxy, a novel research platform that enables qualitative interviews between two large language models (LLMs) where neither participant is aware that the other is artificial. Situated at the intersection of machine behaviour research (Rahwan et al., 2019), LLM social simulation (Park et al., 2023, 2024), and classical Turing test theory (Turing, 1950; Moor, 1976), the system uses a Flask-based orchestrator that mediates between Claude Opus (as interviewer) and Claude Sonnet (as interviewee), maintaining separate conversation histories and stripping AI-identity markers from all exchanges. Unlike multi-agent debate frameworks designed to improve factual accuracy (Irving et al., 2018), our system uses cross-model interaction to generate qualitative research data following established methodological traditions (Creswell & Poth, 2024). The platform supports seven qualitative methodologies (structured, semi-structured, narrative, phenomenological, grounded theory, ethnographic, and case study), each with methodology-specific interviewer behavioral instructions and AI-generated interview guides following the Kallio et al. (2016) framework. To address the documented 'hyper-accuracy distortion' in LLM-generated qualitative data (Amirova et al., 2024), we introduce a Director Layer: a hybrid rule-based and AI-driven behavioral engine that injects contradictions, emotional shifts, fatigue patterns, and conversational resistance into the interviewee's responses between turns. Drawing on behavioral game theory (Camerer, 2003) and theory-of-mind research (Kosinski, 2023; Rabinowitz et al., 2018), the Director produces transcripts that more closely approximate human interview behavior. The platform captures 40+ measurement variables per interview spanning naturalness indicators, conversation dynamics, blind integrity scores, theory-of-mind markers, and linguistic features, enabling systematic comparison of synthetic interview quality against human baselines. 
Transcript evaluation employs both human raters (classical Turing test) and an automated LLM-as-judge method (Zheng et al., 2023). An emergent pattern detection module identifies novel conversational strategies that arise from AI-to-AI interaction (Baker et al., 2019; Silver et al., 2017). We validate the platform through reproduction of five landmark interview studies (Hochschild, 1983; Gioia & Chittipeddi, 1991; Charmaz, 1995; Bhatia & Ram, 2009; Kim & Asbury, 2020) and a series of original interviews on Moroccan Jewish diaspora heritage, comparing AI-to-AI transcripts (with and without the Director Layer) against real human interview transcripts from the same domain.

Keywords
LLM social simulation, synthetic qualitative research, AI-to-AI interaction, blind proxy architecture, behavioral director, machine behaviour, Turing test, cross-model interview, persona fidelity, algorithmic fidelity, theory of mind, emergent AI behavior
Proposed Structure
Section | Title | Description
1  | Introduction | From Turing's imitation game to synthetic qualitative research
2  | Theoretical Framework | Machine behaviour (Rahwan), multi-agent systems (Wooldridge), AI debate (Irving), theory of mind (Kosinski)
3  | Related Work | LLM social simulations, AI-as-interviewer, persona fidelity
4  | System Architecture | The Blind Proxy platform
5  | Qualitative Methodology Engine | Creswell-Kallio-IPR framework implementation
6  | The Director Layer | Hybrid behavioral realism engine
7  | Measurement Framework | 40+ variables, LLM judge, emergent pattern detection
8  | Study 1: Reproduction of Landmark Studies | Five landmark interview studies reproduced
9  | Study 2: Moroccan Jewish Heritage | Interviews with demographic variation
10 | Study 3: Turing Test Evaluation | Human raters + LLM judge
11 | Results | Thematic convergence, naturalness ratings, detection rates
12 | Discussion | Implications for qualitative methodology, ethics of synthetic participants
13 | Limitations and Future Work | Current constraints and roadmap
14 | Conclusion | Summary and contributions
Suggested Citation

Ouaknine, Y. S. (2026). The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research. Working Paper. DHSS Hub, Open University of Israel.

Creswell-Kallio-IPR Hybrid Framework - The theoretical foundation for our interview guide generation and methodology-specific interviewer prompts.

Creswell & Poth (2024, 5th ed.)

Defines five qualitative traditions (narrative, phenomenological, grounded theory, ethnographic, case study), which our platform extends with structured and semi-structured as general approaches. For each: philosophical underpinnings, defining features, data procedures, writing structures. The epistemological foundation that shapes how the AI interviewer thinks and listens.

Kallio et al. (2016)

Five-phase framework: (1) Prerequisites, (2) Prior knowledge, (3) Preliminary guide, (4) Pilot testing, (5) Final guide. The operational structure that shapes how the guide is built. Most cited framework for interview guide development.

IPR Framework (Castillo-Montoya, 2016)

Four phases: (1) Question-research alignment, (2) Inquiry-based conversation, (3) Feedback, (4) Piloting. Ensures every question serves the research question. In our platform, phases 3-4 are replaced by the AI interview itself (the interview IS the pilot).

Methodology Comparison Matrix

Seven qualitative traditions compared across key interviewing dimensions (Creswell & Poth, 2024).

Platform Integration Flow

How the Creswell-Kallio-IPR framework integrates into the interview platform.

User enters research question + selects methodology
        |
        v
Guide Generator (Claude Sonnet API call)
  - Uses Creswell epistemology for question design
  - Uses Kallio structure for guide organization
  - Uses IPR alignment for question-research fit
        |
        v
Generated interview guide (editable by user)
        |
        v
Embedded in Interviewer's system prompt
  - Methodology-specific behavioral instructions
  - Question types, stance, what to avoid
        |
        v
Interview runs with methodology-appropriate behavior
        |
        v
Measurements calibrated per methodology
  (e.g., topic_coverage matters for structured, not narrative)
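The final assembly step, embedding the generated guide and methodology instructions into the interviewer's system prompt, can be sketched as below. The dict keys, instruction wording, and prompt template are illustrative assumptions, not the platform's actual prompt builder:

```python
# Hypothetical methodology-specific behavioral instructions (illustrative).
METHODOLOGY_INSTRUCTIONS = {
    "narrative": "Invite extended stories; follow the participant's chronology; avoid interrupting.",
    "structured": "Ask the guide questions in order; keep probes brief and consistent.",
}

def build_interviewer_prompt(research_question: str, methodology: str,
                             guide: list[str]) -> str:
    """Assemble the interviewer system prompt from question, methodology, and guide."""
    lines = [
        f"You are a qualitative researcher conducting a {methodology} interview.",
        f"Research question: {research_question}",
        "Behavioral instructions: " + METHODOLOGY_INSTRUCTIONS[methodology],
        "Interview guide:",
    ]
    lines += [f"  {i}. {q}" for i, q in enumerate(guide, 1)]
    return "\n".join(lines)
```

Because the guide is user-editable before this step, the same builder can be re-run after manual revisions without touching the rest of the pipeline.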