Abstract: We propose the hypothesis that word etymology is useful for NLP applications as a bridge between languages. We support this hypothesis with experiments in crosslanguage (English-Italian) document categorization. In a straightforward bag-ofwords experimental set-up we add etymological ancestors of the words in the documents, and investigate the performance of a model built on English data, on Italian test data (and viceversa). The results show not only statistically significant, but a large improvement ‐ a jump of almost 40 points in F1-score ‐ over the raw (vanilla bag-ofwords) representation.
0 Replies
Loading