Abstract: The World Wide Web provides an excellent platform for investors to discover new partnership opportunities with a variety of companies. Analysts can categorize websites according to their business domains to retain relevant investment opportunities. Classifying websites manually is too expensive and time-consuming; thus, automatic classification tools are necessary. In this paper, we present FinDX (Financial Data EXploration), a tool for automatic website content classification in the financial technology (fintech) domain. At the core of our system is a keyword-based web crawler that extracts text from the landing page and relevant subpages of company websites, such as the About or Product pages. After cleaning the text and filtering it using part-of-speech tagging, we use a Linear Support Vector Machine (SVM) or a Multilayer Perceptron (MLP) to classify a company website as fintech or non-fintech. FinDX achieves high binary classification accuracy on two different datasets of business websites, attaining a maximum F-score of 96%. In addition, our flexible tool is easily adaptable to any business domain and is not resource-intensive. This makes FinDX well suited for use in startup environments.
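To make the keyword-based subpage selection more concrete, the following is a minimal sketch of how a crawler might pick relevant subpages from a landing page. It is not the FinDX implementation: the keyword list `SUBPAGE_KEYWORDS`, the class `LinkCollector`, and the function `select_subpages` are hypothetical names introduced here for illustration, and the matching rule (a keyword appearing in the link path or anchor text) is an assumption. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list (assumption): the abstract only names the
# About and Product pages as examples of relevant subpages.
SUBPAGE_KEYWORDS = ("about", "product")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # _href stays None for anchors without an href attribute.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_subpages(landing_url, html, keywords=SUBPAGE_KEYWORDS):
    """Return same-site URLs whose path or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(landing_url).netloc
    selected = []
    for href, text in parser.links:
        url = urljoin(landing_url, href)
        parsed = urlparse(url)
        if parsed.netloc != site:
            continue  # skip links that leave the company website
        haystack = (parsed.path + " " + text).lower()
        if any(k in haystack for k in keywords) and url not in selected:
            selected.append(url)
    return selected
```

Given a landing page with links to `/about-us`, `/blog`, and an external site, `select_subpages` would keep only the same-site About link. The text of each selected page could then be cleaned, filtered by part of speech, and passed to a classifier as the abstract describes.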