Abstract: The Aranea Project offers a set of comparable corpora for two dozen (mostly European)
languages, providing a convenient dataset for NLP applications that require training on large
amounts of data. The article presents word embedding models trained on the Aranea corpora
and an online interface to query the models and visualize the results. The implementation
is aimed at lexicographic use but can also be useful in other fields of linguistic study,
since the vector space is a plausible model of the semantic space of word meanings. Three
different models are available: one for a combination of part of speech and lemma, one for
raw word forms, and one based on the FastText algorithm, which uses subword vectors and is
therefore not limited to whole or known words when finding semantic relations. The article
describes the interface and the major modes of its functionality; it does not attempt a
detailed linguistic analysis of the presented examples.
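As an illustration of how a subword-based model can return neighbours even for unseen forms, the following minimal sketch trains a FastText model with Gensim (an assumed toolkit and toy corpus for illustration only; the article does not specify how the Aranea models were trained) and queries an out-of-vocabulary form.

```python
# Minimal sketch, not the Aranea training pipeline: Gensim's FastText on a toy corpus.
from gensim.models import FastText

# Hypothetical toy corpus; the actual models are trained on the large Aranea web corpora.
sentences = [
    ["the", "vector", "space", "models", "word", "meanings"],
    ["semantic", "relations", "between", "word", "forms"],
]

# Character n-grams (min_n..max_n) give every word a vector built from subword pieces.
model = FastText(sentences, vector_size=50, min_n=3, max_n=6, min_count=1, epochs=20)

# An out-of-vocabulary or partial form still gets a vector composed from its n-grams,
# so nearest-neighbour queries are not limited to whole or known words.
print(model.wv.most_similar("meaningz", topn=3))
```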