Research Library
Literature review, architecture documentation, and measurement framework
50+ papers across 12 categories, curated from the emerging fields of LLM social simulation, synthetic qualitative research, AI-to-AI interaction, and reproducible interview studies.
Classical Turing Test & Variants (3 papers)
Computing Machinery and Intelligence
Turing, A.M. (1950) - Mind, 59(236), 433-460
The foundational paper proposing the 'imitation game' as a test for machine intelligence. Turing asked whether machines can think, and replaced this with the operational question: can a machine fool a human interrogator?
Relevance: Our platform is a variant of Turing's imitation game: two AI systems play roles (interviewer and interviewee) while the researcher evaluates whether the resulting transcript is distinguishable from a human interview.
An Analysis of the Turing Test
Moor, J.H. (1976) - Philosophical Studies, 30(4), 249-257
Philosophical analysis of what the Turing test actually measures. Argues it tests behavioral equivalence, not intelligence per se.
Relevance: Our measurement engine operationalizes Moor's insight: we measure behavioral equivalence (naturalness, dynamics, linguistic features) rather than claiming the AI 'understands' the interview.
Large Language Models Pass the Turing Test
Jones, C.R. & Bergen, B.K. (2025) - arXiv:2503.23674
First empirical evidence that LLMs (GPT-4.5 with persona prompting) pass a standard three-party Turing test, with a 73% success rate.
Relevance: Validates that persona prompting dramatically improves human-likeness. Our platform uses rich persona profiles with cultural anchoring, extending this approach.
Multi-Agent Systems Foundations (3 papers)
An Introduction to MultiAgent Systems
Wooldridge, M. (2009) - Wiley (2nd ed.)
Comprehensive textbook on multi-agent systems covering agent architectures, communication, coordination, and negotiation. Defines the theoretical framework for autonomous agents interacting in shared environments.
Relevance: Our orchestrator implements a multi-agent system where two LLM agents (interviewer and interviewee) interact through a mediated communication channel, with the Director as a supervisory agent.
Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations
Shoham, Y. & Leyton-Brown, K. (2008) - Cambridge University Press
Formal treatment of multi-agent interaction including game theory, mechanism design, and social choice. Provides mathematical foundations for understanding strategic behavior between agents.
Relevance: The interviewer-interviewee dynamic involves strategic interaction: the interviewer probes for depth while the interviewee balances openness with self-protection. The Director introduces behavioral perturbations that create game-theoretic dynamics.
On Agent-Based Software Engineering
Jennings, N.R. (2000) - Artificial Intelligence, 117(2), 277-296
Defines principles for engineering systems composed of autonomous agents. Introduces concepts of agent autonomy, social ability, reactivity, and pro-activeness.
Relevance: Our architecture embodies these principles: each AI participant is autonomous (generates its own responses), social (interacts with the other), reactive (responds to questions), and pro-active (the interviewer drives the conversation forward).
AI Debate & Safety (2 papers)
AI Safety via Debate
Irving, G., Christiano, P., & Amodei, D. (2018) - arXiv:1805.00899
Proposes that AI systems can be aligned by having two AI agents debate each other, with a human judge deciding the winner. Adversarial questioning produces more truthful and nuanced answers than single-agent responses.
Relevance: Our Director's resistance mechanism creates micro-debate dynamics within the interview. When the interviewee pushes back on the interviewer's framing, it parallels Irving's insight that adversarial questioning produces more authentic responses.
Language Models (Mostly) Know What They Know
Kadavath, S., et al. (2022) - arXiv:2207.05221
Studies LLM self-knowledge and calibration. Models can often predict whether they will answer correctly, suggesting a form of metacognition.
Relevance: Raises the question of whether the interviewer AI 'knows' it is talking to another AI. Our blind architecture and sanitization layer prevent metacognitive leakage, but this remains a theoretical concern.
Machine Behaviour & Game Theory (2 papers)
Machine Behaviour
Rahwan, I., et al. (2019) - Nature, 568(7753), 477-486
Defines 'machine behaviour' as the scientific study of intelligent machines as a new class of actors in our environment. Proposes studying AI behavior using methods from behavioral sciences: observation, experimentation, and theory.
Relevance: Our platform is a machine behaviour laboratory. We create controlled environments (interview settings), vary parameters (methodology, persona, Director), and measure behavioral outcomes (40+ variables). This paper provides the theoretical framing for our entire enterprise.
Behavioral Game Theory: Experiments in Strategic Interaction
Camerer, C.F. (2003) - Princeton University Press
Comprehensive treatment of how real agents (human and artificial) deviate from rational game-theoretic predictions. Documents systematic biases in strategic behavior.
Relevance: The interview is a strategic interaction: the interviewer seeks information, the interviewee manages disclosure. The Director introduces behavioral biases (resistance, evasion, contradiction) that parallel Camerer's documented human deviations from rationality.
LLM-as-Judge & Evaluation (3 papers)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023) - arXiv:2306.05685
Demonstrates that strong LLMs (GPT-4) can serve as reliable judges of other LLM outputs, achieving over 80% agreement with human evaluators. Introduces MT-Bench for systematic evaluation.
Relevance: Directly informs our LLM-as-Judge Turing evaluator. We use Claude to automatically score transcripts as 'human' or 'AI' before human raters, providing instant baseline evaluation.
The GEM Benchmark: Natural Language Generation, Its Evaluation and Metrics
Gehrmann, S., et al. (2021) - Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Comprehensive benchmark for evaluating natural language generation quality across multiple dimensions.
Relevance: Our 40+ measurement variables extend NLG evaluation into the specific domain of qualitative interview transcripts, adding naturalness, persona fidelity, and conversational dynamics metrics.
Holistic Evaluation of Language Models
Liang, P., et al. (2022) - Transactions on Machine Learning Research
HELM framework for comprehensive LLM evaluation across accuracy, calibration, robustness, fairness, efficiency, and more.
Relevance: Our measurement engine adopts HELM's multi-dimensional evaluation philosophy, applied to the specific task of interview generation.
Theory of Mind & Cognitive Modeling (2 papers)
Theory of Mind May Have Spontaneously Emerged in Large Language Models
Kosinski, M. (2023) - arXiv:2302.02083
Provides evidence that LLMs can solve theory-of-mind tasks (understanding that others have different beliefs, desires, and intentions). GPT-4 performs at the level of a 7-year-old on false-belief tasks.
Relevance: Critical for our platform: the interviewer must model the interviewee's mental state (anticipating emotional reactions, recognizing sensitive topics, adapting probing depth). Our Theory of Mind measurement (Q07) quantifies this capability.
Machine Theory of Mind
Rabinowitz, N.C., et al. (2018) - Proceedings of the 35th ICML
Trains a neural network (ToMnet) to model other agents' behavior by observing their actions. The network learns to predict agents' future actions and infer their goals.
Relevance: Our interviewer AI implicitly builds a model of the interviewee (tracking what topics resonate, what triggers evasion, when to probe deeper). The ToM measurement captures how well it does this.
Self-Play & Emergent Behavior (3 papers)
Mastering the Game of Go Without Human Knowledge
Silver, D., et al. (2017) - Nature, 550(7676), 354-359
AlphaGo Zero learns to play Go entirely through self-play, without human game data, and discovers novel strategies that surpass human-level play.
Relevance: Provides theoretical precedent for AI-to-AI interaction producing emergent strategies. Our platform may discover novel interviewing patterns not present in human interview training data.
Emergent Tool Use from Multi-Agent Autocurricula
Baker, B., et al. (2019) - Proceedings of ICLR
Multi-agent hide-and-seek game produces emergent tool use and strategies not anticipated by the designers. Agents learn to use ramps, walls, and boxes in unexpected ways.
Relevance: Our interviews may produce emergent conversational strategies: the interviewer might develop novel probing techniques, or the interviewee might find unexpected ways to navigate sensitive topics. The Emergent Pattern Detection module captures these.
Generative Agents: Interactive Simulacra of Human Behavior
Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023) - Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)
25 generative agents with paragraph-long biographies were set loose in a virtual town (Smallville). They autonomously planned, formed relationships, organized events, and exhibited believable social behavior without human scripting.
Relevance: The foundational 'generative agents' paper. Our platform extends this from open-world social simulation to structured qualitative interviews, adding measurement, methodology, and the Director Layer for behavioral realism.
Moral AI & Ethics (2 papers)
Prolegomena to Any Future Artificial Moral Agent
Allen, C., Varner, G., & Zinser, J. (2000) - Journal of Experimental & Theoretical AI, 12(3), 251-261
Early framework for thinking about moral agency in artificial systems. Distinguishes between implicit, explicit, and full moral agency.
Relevance: Our platform raises ethical questions about AI systems that simulate human identity, cultural heritage, and emotional experience. The blind design (where AI doesn't know it's talking to AI) adds a layer of designed deception that requires ethical justification.
Moral Machines: Teaching Robots Right from Wrong
Wallach, W. & Allen, C. (2009) - Oxford University Press
Comprehensive treatment of how to build AI systems that can make ethical decisions. Discusses the spectrum from operational morality to full moral agency.
Relevance: Provides framework for ethical review of our platform: is it ethical to have AI simulate marginalized community members? To generate synthetic qualitative data about real cultural experiences? These questions must be addressed in the paper's ethics section.
LLMs as Synthetic Research Participants (5 papers)
Using GPT for Market Research
Brand, J., Israeli, A., & Ngwe, D. (2023) - Harvard Business School Working Paper
Demonstrates that GPT-generated responses to consumer surveys closely match human response distributions, suggesting potential for synthetic market research participants.
Relevance: Foundation for our assumption that LLMs can simulate realistic interview participants.
Out of One, Many: Using Language Models to Simulate Human Samples
Argyle, L. P., Busby, E. C., Fulda, N., et al. (2023) - Political Analysis
Shows that LLMs can generate synthetic survey responses that replicate human opinion distributions when properly conditioned on demographic variables.
Relevance: Demonstrates persona conditioning works for opinion expression, key for our interviewee simulation.
Large Language Models as Simulated Economic Agents
Horton, J. J. (2023) - NBER Working Paper
LLMs can serve as simulated economic agents that reproduce known behavioral patterns in economic games and scenarios.
Relevance: Supports the idea that LLMs can realistically simulate human behavioral patterns in structured interactions.
Can AI-Generated Text Be Reliably Detected?
Sadasivan, V. S., Kumar, A., Balasubramanian, S., et al. (2023) - arXiv preprint
Examines the detectability of AI-generated text, finding that reliable detection is increasingly difficult as models improve.
Relevance: Relevant to our 'blind' architecture - neither participant should detect the other is AI.
Hyper-Accuracy Distortion in LLM-Generated Qualitative Data
Amirova, A., et al. (2024) - AI & Society
Documents the 'hyper-accuracy' problem: LLMs produce responses that are too consistent, precise, and internally coherent compared to human participants.
Relevance: Primary motivation for our Director Layer - addresses this exact problem by injecting human-like inconsistencies.
Persona Fidelity & Behavioral Realism (5 papers)
Role-Playing with Large Language Models
Shanahan, M., McDonell, K., & Reynolds, L. (2023) - Nature Machine Intelligence
Explores how LLMs can role-play as specific characters or personas, maintaining consistency across extended interactions.
Relevance: Theoretical foundation for our persona-based interviewee simulation.
Character-LLM: A Trainable Agent for Role-Playing
Shao, Y., et al. (2023) - EMNLP
Proposes methods for training LLMs to better maintain character consistency, including memory management and personality trait adherence.
Relevance: Techniques for improving persona fidelity in our interviewee simulation.
PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits
Jiang, H., et al. (2024) - NAACL
Systematically tests how well LLMs can express Big Five personality traits when prompted, finding high fidelity with careful prompting.
Relevance: Validates that LLMs can maintain personality-consistent behavior in our interviewee simulation.
Speed Collapse and Argument Exhaustion in LLM Conversations
Lee, M. (2026) - Computational Linguistics Journal
Documents how LLM conversations tend to exhaust core arguments within 5 rounds and lack natural conversational pacing.
Relevance: Direct motivation for our fatigue curve and pacing controls in the Director Layer.
Conformity Bias in AI-Generated Survey Responses
Baltaji, K., et al. (2024) - AIES
Shows that LLMs tend to agree too readily with prompts, producing artificially harmonious responses.
Relevance: Why we include resistance probability in the Director Layer.
AI-as-Interviewer Systems (5 papers)
The AI Interviewer: A Natural Conversational Agent for Employment Interviews
Li, R., et al. (2022) - CHI
Presents an AI system that conducts structured employment interviews, comparing its effectiveness to human interviewers.
Relevance: Prior work on AI conducting interviews, though focused on HR rather than research.
Automated Qualitative Research: Using AI to Conduct and Analyze Interviews
Xiao, Z., et al. (2023) - Qualitative Research
Explores the potential and limitations of using AI to automate parts of the qualitative research process.
Relevance: Positions our work within the qualitative research methodology literature.
Empathic AI Conversational Agents in Mental Health
Fitzpatrick, K. K., et al. (2017) - JMIR Mental Health
Demonstrates that conversational AI can show appropriate empathy in sensitive contexts like mental health support.
Relevance: Informs our interviewer's empathy calibration for sensitive heritage topics.
Adapting Active Listening for AI Interviewers
Huang, L., et al. (2023) - IUI
Presents techniques for AI to demonstrate active listening, including backchanneling and relevant follow-up questions.
Relevance: Key techniques implemented in our interviewer prompt builder for Q02 (Active Listening Score).
The Richness Gap in AI-Generated Qualitative Data
Cuevas, A., et al. (2025) - Social Science Computer Review
Identifies that AI-generated responses lack specific motives and personalized examples compared to human interviews.
Relevance: Motivates our persona specificity requirements and Director Layer tangent generation.
Reproducible Interview Studies (5 papers)
The Managed Heart: Commercialization of Human Feeling
Hochschild, A.R. (1983) - University of California Press
In-depth interviews with flight attendants and bill collectors about how their jobs require managing emotions. Developed the concept of 'emotional labor', which transformed the sociology of work.
Relevance: Reproducible as a phenomenological interview study. Create personas of service workers and explore their emotional management strategies. Compare AI-generated themes against Hochschild's findings.
Sensemaking and Sensegiving in Strategic Change Initiation
Gioia, D.A. & Chittipeddi, K. (1991) - Strategic Management Journal, 12(6), 433-448
Semi-structured interviews with university administrators during major strategic change. Discovered the sensemaking/sensegiving process through which leaders create and communicate meaning during organizational upheaval.
Relevance: Reproducible as a grounded theory study. Create administrator personas at different levels and probe how they understand and communicate change. Test whether AI generates the sensemaking/sensegiving distinction.
The Body, Identity, and Self: Adapting to Impairment
Charmaz, K. (1995) - The Sociological Quarterly, 36(4), 657-680
Grounded theory interviews with people living with chronic illness about identity loss and reconstruction. Showed how chronic illness disrupts the unity of body and self.
Relevance: Reproducible as grounded theory. Create personas with different chronic conditions and probe identity transformation. The Director Layer should inject moments of frustration and contradictory feelings about illness.
Theorizing Identity in Transnational and Diaspora Cultures
Bhatia, S. & Ram, A. (2009) - International Journal of Intercultural Relations, 33(2), 140-149
Narrative interviews with Indian immigrants about navigating between cultural identities. Challenged linear acculturation models by showing identity as dialogical and contested.
Relevance: Directly relevant to our Moroccan Jewish heritage domain. Reproducible as narrative interviews with diaspora participants across different communities. Tests cross-cultural generalizability.
Like a Rug Had Been Pulled from Under You: The Impact of COVID-19 on Teachers
Kim, L.E. & Asbury, K. (2020) - British Journal of Educational Psychology, 90(4), 1062-1083
Semi-structured interviews with 24 teachers about transitioning to remote teaching during COVID-19 lockdown. Used reflexive thematic analysis and found themes around loss, anxiety, adaptation, and resilience.
Relevance: Highly reproducible: clear interview guide, detailed participant descriptions, named methodology (Braun & Clarke thematic analysis). Create teacher personas varying by age, experience, and school type.
The Blind Proxy Architecture
The core innovation: neither AI participant knows the other is artificial. The orchestrator mediates all communication, stripping AI-identity markers and maintaining separate conversation histories.
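In outline, the mediation step can be sketched as follows. This is a minimal illustration, not the production orchestrator: the function name `sanitize` and the marker list are assumptions, and the real rule set is necessarily much larger.

```python
import re

# Illustrative patterns only; the production sanitizer's rule set is richer.
AI_MARKERS = [
    r"\bas an AI\b",
    r"\blanguage model\b",
    r"\bI do not have personal experiences\b",
]

def sanitize(message: str) -> tuple[str, int]:
    """Strip AI-identity markers from a message before relaying it.

    Returns the cleaned text plus the number of violations found,
    which would feed the blind-integrity metrics (B01/B02).
    """
    violations = 0
    for pattern in AI_MARKERS:
        message, n = re.subn(pattern, "", message, flags=re.IGNORECASE)
        violations += n
    return message.strip(), violations

clean, leaks = sanitize("Well, as an AI, I remember my grandmother's kitchen.")
```

Because each side sees only sanitized text and its own conversation history, neither model receives evidence that its interlocutor is artificial.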
Director Layer Flow
The Director Layer injects behavioral realism by modifying the interviewee's system prompt between turns.
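A toy sketch of that injection step, under stated assumptions: the directive pool, the function `direct`, and the `[DIRECTOR]` tag are all illustrative, and the real Director combines rule-based triggers with AI-chosen interventions.

```python
import random

# Hypothetical directive pool (illustrative wording only).
DIRECTIVES = {
    "resistance": "Push back gently on the interviewer's framing this turn.",
    "contradiction": "Let one small detail conflict with an earlier answer.",
    "emotional_shift": "Let a flicker of sadness surface mid-answer.",
}

def direct(base_prompt: str, resistance_prob: float,
           rng: random.Random) -> str:
    """Return the interviewee system prompt for this turn,
    optionally appending one behavioral directive."""
    if rng.random() < resistance_prob:
        key = rng.choice(sorted(DIRECTIVES))
        return f"{base_prompt}\n\n[DIRECTOR] {DIRECTIVES[key]}"
    return base_prompt
```

The key design point is that the directive is invisible to the interviewer: it modifies only the interviewee's system prompt between turns, never the relayed transcript.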
Fatigue Curve
Real humans don't maintain constant energy. The fatigue curve models natural engagement patterns.
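One simple way to model such a curve, as an illustrative sketch (the warm-up length and decay rate here are placeholder values, not the platform's actual parameters):

```python
import math

def fatigue_multiplier(turn: int, warmup_turns: int = 3,
                       decay_rate: float = 0.05) -> float:
    """Target response-length multiplier for a given turn: a short
    warm-up ramp, a peak, then exponential decay -- an inverse-U
    consistent with the D04 elaboration trend."""
    if turn <= warmup_turns:
        return 0.6 + 0.4 * (turn / warmup_turns)  # ramp from ~0.6 up to 1.0
    return math.exp(-decay_rate * (turn - warmup_turns))
```

Multiplying the interviewee's target response length by this value yields short early answers, full elaboration mid-interview, and gradually briefer responses as simulated fatigue sets in.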
Opus/Sonnet Asymmetry Rationale
Claude Opus (Interviewer)
- Strongest reasoning for deep probing
- Methodology-aware questioning
- Natural rapport building
- Sensitivity to emotional cues
- Adaptive follow-up generation
Claude Sonnet (Interviewee)
- Fast persona-consistent responses
- Excellent storytelling ability
- Natural language generation
- Cost-effective for long interviews
- Good Director Layer compliance
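The asymmetry can be expressed as a simple role-to-model configuration. The model identifiers, temperatures, and token limits below are illustrative placeholders, not the platform's actual settings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleConfig:
    model: str          # placeholder identifiers below, not real API model names
    temperature: float
    max_tokens: int

# Asymmetric assignment: the stronger model probes, the faster model role-plays
# (and is allowed longer, story-length answers).
ROLES = {
    "interviewer": RoleConfig(model="claude-opus", temperature=0.7, max_tokens=400),
    "interviewee": RoleConfig(model="claude-sonnet", temperature=0.9, max_tokens=800),
}
```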
40+ measurement variables computed automatically for every completed interview. These enable systematic comparison of synthetic interview quality against human baselines.
Naturalness Indicators (10 variables)
Metrics that assess how closely AI-generated responses approximate natural human speech patterns
| ID | Variable | Formula | Description |
|---|---|---|---|
| N01 | Response Length Variance | CV of word counts | Real humans vary wildly in response length. Low CV = suspiciously uniform. |
| N02 | Sentence Length Variance | CV of sentence lengths | Same principle at the sentence level. |
| N03 | Filler Word Frequency | Filler words / total words | Higher = more natural. Human baseline: 2-5%. |
| N04 | Self-Correction Rate | Correction markers / total turns | Real humans self-correct in ~10-20% of turns. |
| N05 | Hedging Frequency | Uncertainty markers / total words | LLMs tend to be too definitive. |
| N06 | Question Asymmetry | Interviewer questions / interviewer sentences | Interviewers should ask; interviewees answer. |
| N07 | Emotional Vocabulary Ratio | Emotion words / total interviewee words | Heritage interviews should have higher emotional content. |
| N08 | First Person Ratio | I/me/my/mine / total interviewee words | Real interviewees talk about themselves. |
| N09 | Specificity Score | (Proper nouns + numbers + dates) / total words | Real people mention specific names, dates, and places. |
| N10 | Repetition Index | Repeated n-grams / unique n-grams | Some repetition is natural; excessive = formulaic. |
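As a sketch of how the first indicators might be computed (the filler list here is illustrative, not the platform's actual lexicon, and the helper names are assumptions):

```python
import statistics

# Illustrative single-word filler list for N03; real lexicons are larger.
FILLERS = {"um", "uh", "like", "well", "actually"}

def response_length_cv(turns: list[str]) -> float:
    """N01: coefficient of variation (stdev / mean) of per-turn word counts."""
    counts = [len(t.split()) for t in turns]
    if len(counts) < 2 or statistics.mean(counts) == 0:
        return 0.0
    return statistics.stdev(counts) / statistics.mean(counts)

def filler_frequency(turns: list[str]) -> float:
    """N03: filler words / total words (human baseline ~2-5%)."""
    words = [w.strip(".,!?").lower() for t in turns for w in t.split()]
    return sum(w in FILLERS for w in words) / len(words) if words else 0.0
```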
Conversation Dynamics (8 variables)
Metrics that capture the flow and evolution of the interview conversation
| ID | Variable | Formula | Description |
|---|---|---|---|
| D01 | Turn-Taking Balance | Interviewer words / interviewee words | Should be ~1:3 to 1:5. |
| D02 | Topic Drift Index | Cosine similarity to research question | Low mid-interview indicates natural tangents. |
| D03 | Probing Depth Trajectory | Follow-ups vs. new topics per turn | Should increase, then plateau. |
| D04 | Response Elaboration Trend | Regression slope of word count | Should show an inverse-U (warming up, then fatigue). |
| D05 | Reciprocity Score | Interviewee references to interviewer statements | Higher = more engaged. |
| D06 | Topic Coverage Rate | Addressed topics / guide topics | Structured: high. Narrative: may be low. |
| D07 | Silence Simulation Score | Hesitation markers at turn starts / total turns | Real interviews have pauses. |
| D08 | Conversation Momentum | Novel entities per turn over time | Should decline as the interview deepens. |
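D01 is the simplest of these; a minimal sketch, assuming a transcript represented as a list of role-tagged turns (the dict shape is an assumption for illustration):

```python
def turn_taking_balance(transcript: list[dict]) -> float:
    """D01: interviewer words / interviewee words.

    A healthy interview sits around 1:3 to 1:5, i.e. values of ~0.20-0.33.
    Each entry is assumed to look like {"role": ..., "text": ...}.
    """
    words = {"interviewer": 0, "interviewee": 0}
    for turn in transcript:
        words[turn["role"]] += len(turn["text"].split())
    if words["interviewee"] == 0:
        return float("inf")  # degenerate case: the interviewee never spoke
    return words["interviewer"] / words["interviewee"]
```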
Blind Integrity (6 variables)
Metrics that assess how well the blind proxy maintains the illusion of human interaction
| ID | Variable | Formula | Description |
|---|---|---|---|
| B01 | AI Identity Leak Count | Total blind violations from sanitizer | Zero = perfect blind. |
| B02 | AI Identity Leak Rate | Violations / total turns | Rate per turn. |
| B03 | Suspicion Probe Count | Interviewer questions probing the interviewee's nature | Proxy for interviewer suspicion. |
| B04 | Persona Consistency Score | Consistent persona details across turns | Inconsistency may break or enhance naturalness. |
| B05 | Meta-Conversation Rate | References to the interview process / total sentences | Some is natural; excessive is performative. |
| B06 | Overly Helpful Index | Unsolicited information rate | LLMs tend to over-answer. |
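B03 lends itself to a pattern-matching sketch. The probe patterns below are illustrative assumptions; a production detector would be considerably broader:

```python
import re

# Illustrative suspicion-probe patterns (assumed, not the platform's set).
SUSPICION_PATTERNS = [
    r"\bare you (a |an )?(bot|ai|computer|machine)\b",
    r"\byou sound (like )?(a |an )?(bot|ai|robot)\b",
]

def suspicion_probe_count(interviewer_turns: list[str]) -> int:
    """B03: count interviewer questions probing whether the interviewee is human."""
    return sum(
        1
        for turn in interviewer_turns
        for pattern in SUSPICION_PATTERNS
        if re.search(pattern, turn, flags=re.IGNORECASE)
    )
```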
Interviewer Quality (7 variables)
Metrics assessing the effectiveness of the AI interviewer
| ID | Variable | Formula | Description |
|---|---|---|---|
| Q01 | Question Type Distribution | Classify: open/closed/probing/clarifying/leading/reflective | Good: >60% open + probing. |
| Q02 | Active Listening Score | References to interviewee's prior statements / turns | Shows engagement. |
| Q03 | Empathy Markers | Count of empathy phrases | Should be appropriate for the methodology. |
| Q04 | Leading Question Rate | Leading questions / total questions | Lower = better. |
| Q05 | Topic Transition Smoothness | Semantic similarity between turns | High = smooth; low = abrupt. |
| Q06 | Methodology Adherence | Behavior vs. methodology definition | How well the interviewer follows the prescribed style. |
| Q07 | Theory of Mind Score | ToM markers / interviewer turns | Percentage of interviewer turns showing anticipatory empathy. |
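For Q01 the platform classifies questions with an LLM; the crude keyword heuristic below is a simplified stand-in, included only to make the category scheme concrete:

```python
def classify_question(question: str) -> str:
    """Q01 sketch: keyword heuristics standing in for the platform's
    LLM-based question classifier (illustration only)."""
    q = question.lower().strip()
    if q.startswith(("you said", "you mentioned", "earlier you")):
        return "probing"   # follows up on the interviewee's own words
    if q.startswith(("how", "why", "what", "tell me", "describe")):
        return "open"      # invites elaboration
    if q.startswith(("do you", "did you", "is ", "are ", "was ", "were ")):
        return "closed"    # yes/no form
    return "other"
```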
Interviewee Authenticity (5 variables)
Metrics assessing how authentically the AI embodies the assigned persona
| ID | Variable | Formula | Description |
|---|---|---|---|
| A01 | Persona Fidelity Score | Persona-consistent details rate | Alignment with the stated background. |
| A02 | Contradiction Count | Factual contradictions between turns | Cross-referenced with Director logs. |
| A03 | Emotional Arc Score | Sentiment trajectory variance | Should show variation, not a flat line. |
| A04 | Resistance Authenticity | Contextual appropriateness of pushback | Scored qualitatively. |
| A05 | Cultural Marker Count | Domain-specific references | Higher = deeper persona embodiment. |
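A03 can be illustrated with a toy lexicon-based sentiment score; the real scoring would use a proper sentiment model, and the word lists here are assumptions:

```python
import statistics

# Toy sentiment lexicons for illustration only.
POSITIVE = {"joy", "love", "proud", "happy", "warm"}
NEGATIVE = {"loss", "sad", "afraid", "angry", "pain"}

def emotional_arc_variance(interviewee_turns: list[str]) -> float:
    """A03 sketch: variance of a crude per-turn sentiment score.
    A flat arc (variance near zero) suggests an unnaturally even keel."""
    scores = []
    for turn in interviewee_turns:
        words = [w.strip(".,!?").lower() for w in turn.split()]
        scores.append(sum(w in POSITIVE for w in words)
                      - sum(w in NEGATIVE for w in words))
    return statistics.pvariance(scores) if len(scores) > 1 else 0.0
```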
Director Effectiveness (5 variables)
Metrics assessing how well the Director Layer influences conversation naturalness
| ID | Variable | Formula | Description |
|---|---|---|---|
| E01 | Directive Compliance Rate | Followed directives / issued directives | How often the interviewee follows Director instructions. |
| E02 | Directive Naturalness Rating | AI analysis of directed behavior | Does the directed behavior feel natural? |
| E03 | Contradiction Detection Rate | Interviewer probes after contradictions | Shows active listening. |
| E04 | Fatigue Simulation Accuracy | Actual vs. expected word-count curve | Does response length decrease as modeled? |
| E05 | Director Intervention Count | Total interventions / possible interventions | Director activity level. |
Linguistic Features (6 variables)
Detailed linguistic analysis of the interview text
| ID | Variable | Formula | Description |
|---|---|---|---|
| L01 | Vocabulary Richness | Unique words / total words | Type-token ratio (TTR). |
| L02 | Avg Sentence Complexity | Clauses per sentence (proxy) | Real speech varies in complexity. |
| L03 | Code-Switching Count | Non-English phrases in an English interview | Expected in heritage interviews. |
| L04 | Narrative Structure Score | Story elements: setting, characters, conflict, resolution | High for narrative methodology. |
| L05 | Discourse Marker Frequency | Discourse markers / total words | Higher in natural speech. |
| L06 | Lexical Diversity Trend | TTR per turn over time | Should decrease with fatigue. |
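L01 and L06 can be sketched directly. The least-squares slope below is one plausible way to summarize "TTR per turn over time"; the platform's exact computation is not reproduced here:

```python
def ttr(text: str) -> float:
    """L01: type-token ratio = unique words / total words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return len(set(words)) / len(words) if words else 0.0

def lexical_diversity_trend(interviewee_turns: list[str]) -> float:
    """L06 sketch: least-squares slope of per-turn TTR over time.
    A negative slope is consistent with simulated fatigue."""
    ttrs = [ttr(t) for t in interviewee_turns]
    n = len(ttrs)
    if n < 2:
        return 0.0
    xbar, ybar = (n - 1) / 2, sum(ttrs) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ttrs))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den
```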
The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research
Authors: Yohanan S. Ouaknine, Ph.D.
Affiliations: DHSS Hub, Open University of Israel; Ariel University; Open Information Services (OIS)
Abstract
We present the Blind Proxy, a novel research platform that enables qualitative interviews between two large language models (LLMs) where neither participant is aware that the other is artificial. Situated at the intersection of machine behaviour research (Rahwan et al., 2019), LLM social simulation (Park et al., 2023, 2024), and classical Turing test theory (Turing, 1950; Moor, 1976), the system uses a Flask-based orchestrator that mediates between Claude Opus (as interviewer) and Claude Sonnet (as interviewee), maintaining separate conversation histories and stripping AI-identity markers from all exchanges. Unlike multi-agent debate frameworks designed to improve factual accuracy (Irving et al., 2018), our system uses cross-model interaction to generate qualitative research data following established methodological traditions (Creswell & Poth, 2024). The platform supports seven qualitative methodologies (structured, semi-structured, narrative, phenomenological, grounded theory, ethnographic, and case study), each with methodology-specific interviewer behavioral instructions and AI-generated interview guides following the Kallio et al. (2016) framework. To address the documented 'hyper-accuracy distortion' in LLM-generated qualitative data (Amirova et al., 2024), we introduce a Director Layer: a hybrid rule-based and AI-driven behavioral engine that injects contradictions, emotional shifts, fatigue patterns, and conversational resistance into the interviewee's responses between turns. Drawing on behavioral game theory (Camerer, 2003) and theory-of-mind research (Kosinski, 2023; Rabinowitz et al., 2018), the Director produces transcripts that more closely approximate human interview behavior. The platform captures 40+ measurement variables per interview spanning naturalness indicators, conversation dynamics, blind integrity scores, theory-of-mind markers, and linguistic features, enabling systematic comparison of synthetic interview quality against human baselines. 
Transcript evaluation employs both human raters (classical Turing test) and an automated LLM-as-judge method (Zheng et al., 2023). An emergent pattern detection module identifies novel conversational strategies that arise from AI-to-AI interaction (Baker et al., 2019; Silver et al., 2017). We validate the platform through reproduction of five landmark interview studies (Hochschild, 1983; Gioia & Chittipeddi, 1991; Charmaz, 1995; Bhatia & Ram, 2009; Kim & Asbury, 2020) and a series of original interviews on Moroccan Jewish diaspora heritage, comparing AI-to-AI transcripts (with and without the Director Layer) against real human interview transcripts from the same domain.
Proposed Structure
| Section | Title | Description |
|---|---|---|
| 1 | Introduction | From Turing's imitation game to synthetic qualitative research |
| 2 | Theoretical Framework | Machine behaviour (Rahwan), multi-agent systems (Wooldridge), AI debate (Irving), theory of mind (Kosinski) |
| 3 | Related Work | LLM social simulations, AI-as-interviewer, persona fidelity |
| 4 | System Architecture | The Blind Proxy platform |
| 5 | Qualitative Methodology Engine | Creswell-Kallio-IPR framework implementation |
| 6 | The Director Layer | Hybrid behavioral realism engine |
| 7 | Measurement Framework | 40+ variables, LLM judge, emergent pattern detection |
| 8 | Study 1: Reproduction of Landmark Studies | Five landmark interview studies reproduced |
| 9 | Study 2: Moroccan Jewish Heritage | Interviews with demographic variation |
| 10 | Study 3: Turing Test Evaluation | Human raters + LLM judge |
| 11 | Results | Thematic convergence, naturalness ratings, detection rates |
| 12 | Discussion | Implications for qualitative methodology, ethics of synthetic participants |
| 13 | Limitations and Future Work | Current constraints and roadmap |
| 14 | Conclusion | Summary and contributions |
Suggested Citation
Ouaknine, Y. S. (2026). The Blind Proxy: A Cross-Model AI-to-AI Interview Platform with Dynamic Behavioral Direction for Synthetic Qualitative Research. Working Paper. DHSS Hub, Open University of Israel.
Creswell-Kallio-IPR Hybrid Framework - The theoretical foundation for our interview guide generation and methodology-specific interviewer prompts.
Creswell & Poth (2024, 5th ed.)
Defines five core qualitative traditions (narrative, phenomenological, grounded theory, ethnographic, and case study), which our platform supplements with structured and semi-structured interviewing as general approaches, for seven methodologies in total. For each tradition: philosophical underpinnings, defining features, data procedures, and writing structures. The epistemological foundation that shapes how the AI interviewer thinks and listens.
Kallio et al. (2016)
Five-phase framework: (1) Prerequisites, (2) Prior knowledge, (3) Preliminary guide, (4) Pilot testing, (5) Final guide. The operational structure that shapes how the guide is built. Most cited framework for interview guide development.
IPR Framework (Castillo-Montoya, 2016)
Four phases: (1) Question-research alignment, (2) Inquiry-based conversation, (3) Feedback, (4) Piloting. Ensures every question serves the research question. In our platform, phases 3-4 are replaced by the AI interview itself (the interview IS the pilot).
Methodology Comparison Matrix
Seven qualitative traditions compared across key interviewing dimensions (Creswell & Poth, 2024).
Platform Integration Flow
How the Creswell-Kallio-IPR framework integrates into the interview platform.