Genre Annotation for the Web: text-external and text-internal perspectives
Abstract: This paper describes a digital curation study aimed at comparing the composition of large Web
corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper
presents a deep learning model suitable for classifying texts from large Web corpora according to a small
number of communicative functions, such as Argumentation or Reporting. Second, it describes the
results of applying the automatic classification model to these corpora and compares their composition.
Finally, the paper introduces a framework for interpreting the results of automatic genre classification
using linguistic features. The framework can help in comparing general reference corpora obtained from
the Web and in comparing corpora across languages.