Abstract: We present $\textit{Mahānāma}$, a large-scale annotated literary dataset for entity linking and named entity coreference in Sanskrit, a low-resource and morphologically rich language. Derived from the $\textit{Mahābhārata}$, the longest epic in world literature, it consists of 73K verses with 1.09M entity mentions, linked to an English knowledge base for cross-lingual resolution. Unlike previous datasets, $\textit{Mahānāma}$ encompasses a single long-form discourse with comprehensive entity annotations, serving as a unique testbed for end-to-end resolution tasks. The dataset poses challenges due to lexical variation, polysemous names, and long-range entity references. Experiments show that the tested coreference models struggle with entity alignment across the discourse, while the entity linking model performs suboptimally in end-to-end linking. Cross-lingual descriptions and entity types contribute complementarily to disambiguation. $\textit{Mahānāma}$ provides a rich resource for studying entity linking and coreference in literary texts.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, NLP datasets, datasets for low resource languages, benchmarking, language resources, multilingual corpora
Contribution Types: Data resources
Languages Studied: Sanskrit, English
Submission Number: 3853