Keywords: mentorship-focused QA, multilingual question answering, long-form content understanding, educational AI, multi-agent, LLM-based evaluation, low-resource languages
Abstract: Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that encourage reflection and offer guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings.
We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value.
Using MentorQA, we compare Single-Agent, Dual-Agent, retrieval-augmented generation (RAG), and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains on complex topics and in lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in its alignment with human judgments.
Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://anonymous.4open.science/r/MentorQA/.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: commonsense QA; reading comprehension; logical reasoning; open-domain QA; question generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Hindi, Chinese, Romanian
Submission Number: 1049