Abstract: Earlier research on video-based question generation has primarily focused on generating questions about general objects and attributes, often neglecting the complexities of bilingual communication and entity-specific queries. This study addresses these limitations by developing CoQuEST, a multimodal transformer framework that integrates video and textual inputs to generate semantically rich, entity-centric, and information-driven questions in code-mixed Hindi-English (Hinglish). Such a system is particularly significant for multilingual societies, with applications in bilingual education, interactive learning platforms, and conversational agents, while promoting cultural and linguistic relevance. To the best of our knowledge, no large-scale Hinglish code-mixed dataset exists for video-based question generation. To address this gap, we curated a subset of the TVQA dataset and had it annotated by bilingual experts, ensuring fluency, contextual appropriateness, and adherence to code-mixed structure. Empirical evaluation shows that CoQuEST achieves competitive performance (RQUGE: 1.649, BLEU-1: 0.04, CIDEr: 0.29, METEOR: 0.20, Distinct-1: 0.96, Distinct-2: 0.99, ROUGE-L: 0.20, BERT-Score F1: 0.88), validating its practical utility and effectiveness. We make the code and dataset publicly available.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dong_Guo4
Submission Number: 6910