Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

ACL ARR 2025 May Submission5614 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present $\textit{Mahānāma}$, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the $\textit{Mahābhārata}$, the world’s longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of $\textit{Mahānāma}$, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. $\textit{Mahānāma}$ thus provides a unique benchmark for advancing entity resolution, especially in literary domains.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, NLP datasets, datasets for low resource languages, benchmarking, language resources, multilingual corpora
Contribution Types: Data resources
Languages Studied: Sanskrit, English
Submission Number: 5614