SogouT-16: A New Web Corpus to Embrace IR Research

2017 (modified: 11 Nov 2022), SIGIR 2017, Readers: Everyone
Abstract: Web collections are essential for many areas of Web-based research, such as Web Information Retrieval (IR), Web data mining, and corpus linguistics. However, collecting Web pages at large scale in a lab-based environment is usually expensive and time-consuming, so publicly available collections have become a necessity for this research. In this study, we present a Chinese Web collection, SogouT-16, which is the largest free-of-charge public Chinese Web collection so far. We provide a variety of descriptive characteristics of SogouT-16 and discuss its adoption in We Want Web, a newly designed ad-hoc retrieval task in NTCIR-13. SogouT-16 also provides an online retrieval service and includes a number of auxiliary resources, including a hyperlink structure graph, query logs, and word embeddings. We believe that SogouT-16 will provide new opportunities for novel investigations and applications in IR and other related communities.