Abstract: Speaker identification in narrative analysis is a challenging task due to complex dialogues, diverse utterance patterns, and ambiguous character references. Costly and time-intensive manual annotation limits the scalability of high-quality dataset creation.
This study demonstrates a cost-efficient approach to constructing speaker identification datasets by combining small-scale manual annotation with LLM-based labeling. A subset of the data is manually annotated and used as few-shot examples to guide LLM predictions, which are then refined through minimal human corrections.
Our results show that LLMs achieve approximately 90% accuracy on challenging narratives, such as the "Romance of the Three Kingdoms" dataset, and the remaining errors underscore the importance of targeted human corrections. This approach proves effective for constructing scalable and cost-efficient datasets for Japanese and complex narratives.
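The workflow described in the abstract (annotate a small subset, use it as few-shot examples for an LLM, then apply human corrections) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Utterance` class, the prompt wording, the function names, and the toy English examples are all assumptions introduced here, and the LLM call itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """Hypothetical record for one quoted line in a narrative."""
    context: str                    # narrative text surrounding the quote
    quote: str                      # the utterance whose speaker is wanted
    speaker: Optional[str] = None   # gold label if manually annotated


def build_few_shot_prompt(examples: List[Utterance], target: Utterance) -> str:
    """Format annotated examples, then the unlabeled target, as one prompt."""
    parts = ["Identify the speaker of each quoted utterance."]
    for ex in examples:
        parts.append(
            f"Context: {ex.context}\nQuote: {ex.quote}\nSpeaker: {ex.speaker}"
        )
    # The target ends with an open "Speaker:" slot for the model to fill.
    parts.append(f"Context: {target.context}\nQuote: {target.quote}\nSpeaker:")
    return "\n\n".join(parts)


def apply_corrections(predictions: List[str],
                      corrections: Dict[int, str]) -> List[str]:
    """Overwrite model labels with human corrections, keyed by index."""
    return [corrections.get(i, label) for i, label in enumerate(predictions)]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of utterances whose predicted speaker matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data, invented purely for illustration.
annotated = [Utterance("Alice turned to Bob.", "Are you ready?", "Alice")]
target = Utterance("Bob shrugged.", "As ready as ever.")
prompt = build_few_shot_prompt(annotated, target)
```

The prompt would be sent to whichever LLM is in use. The returned labels can then be spot-checked, and any fixes passed to `apply_corrections` before measuring `accuracy` against a held-out annotated set.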
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Narrative Analysis, LLMs, Speaker Identification
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: Japanese
Submission Number: 2346