FastText and XGBoost Content-Based Classification for Employment Web Scraping

Arkadiusz Talun, Pawel Drozda, Leszek Bukowski, Rafal Scherer

Published: 2020, Last Modified: 16 Oct 2023 | ICAISC (2) 2020 | Readers: Everyone

Abstract: The purpose of this paper is to present the design and results of experiments on a universal, autonomous data extraction (web scraping) system fed by publicly available online job listings. In particular, we present methods for automated crawling, preprocessing, and classification of data from job offers, together with the aggregation of the acquired data into large-scale, structured databases. We tested two models for classifying the content of job portals: fastText and XGBoost. The experimental results are promising, with both methods reaching 88% accuracy.
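
The abstract does not say how job-offer text is turned into features. As a rough illustration only: fastText-style text classifiers typically represent a document as a bag of word n-grams hashed into a fixed number of buckets. The sketch below shows that idea using only the Python standard library. The function names, the bucket count, and the example text are assumptions made for illustration. This is not the authors' pipeline, and it does not call the fastText or XGBoost libraries.

```python
import zlib
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hashed_features(text, n_buckets=1 << 16, n_max=2):
    """Map text to a sparse {bucket_index: count} bag of hashed n-grams.

    n_buckets and n_max are illustrative defaults, not values from the paper.
    """
    tokens = text.lower().split()
    counts = Counter()
    for gram in word_ngrams(tokens, n_max):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[bucket] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical snippet of a job listing.
    print(hashed_features("Senior Python developer"))
```

A sparse count vector of this kind could serve as input to a linear classifier or a gradient-boosted tree model. Which representation the authors actually used is not stated in the abstract.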
