Keywords: Annotation, Software
TL;DR: Antarlekhaka is a language-agnostic, web-deployable, multi-task annotation tool that offers intuitive and accessible annotation for a comprehensive set of NLP tasks and is already being used in real-world annotation tasks.
Abstract: One of the primary obstacles in the advancement of Natural Language
Processing (NLP) technologies for low-resource languages is the lack of
annotated datasets for training and testing machine learning models. In this
paper, we present \emph{Antarlekhaka}, a tool for manual annotation of a
comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible,
language-agnostic, Web-deployable and supports distributed annotation by
multiple simultaneous annotators. The system provides user-friendly interfaces
for 8 categories of annotation tasks. These, in turn, enable the annotation
of a considerably larger set of NLP tasks. The task categories include two
linguistic tasks not handled by any other tool, namely, sentence boundary
detection and deciding canonical word order, which are important tasks for
text that is in the form of poetry. We propose the idea of \emph{sequential
annotation} based on small text units, where an annotator performs several
tasks related to a single text unit before proceeding to the next unit. The
research applications of the proposed mode of multi-task annotation are also
discussed. Antarlekhaka outperforms other annotation tools in objective
evaluation. It has also been used for two real-life annotation tasks in two
different languages, namely, Sanskrit and Bengali. The tool is available at
\url{https://github.com/Antarlekhaka/code}.
Submission Number: 32