A Proposal for a Hybrid Syllabus Search Tool that Combines Keyword Search and Content Based Classification

Takayuki Sekiya, Tomohiro Tatejima, Yoshitatsu Matsuda, Kazunori Yamaguchi

Published: 2021, Last Modified: 24 May 2025, EDUCON 2021, CC BY-SA 4.0

Abstract: A syllabus is one of the most important clues in the analysis of educational activities. Our previous work showed that course syllabi of computer science (CS) curricula from about 47 universities can reveal interesting structures in those curricula. However, the syllabi were collected manually, which made it difficult to increase their number substantially; semi-automatic crawling of course syllabi at scale is needed for further analysis. We have been studying how to collect syllabus information from the contents of a large number of web pages downloaded from university websites with a general-purpose web crawler. Using a linear support vector machine (linear SVM), we were able to discover the structures of syllabus pages automatically to some extent. For each university, we used the top page of the department offering a bachelor's degree in the CS field as the start page for crawling. To find such a department's page, we sometimes used Google search. The Google Custom Search API (Google API) is expected to provide an efficient way to gather syllabus information while saving computation time, storage, and other resources. In this study, we propose a hybrid method that combines the Google API as a general keyword search engine with linear SVM as a content-based classification model. We developed a system to support the syllabus collection process. The system consists of three subsystems: Crawler, Classifier, and Database. Crawler combines the Google API with a general-purpose web crawler. It searches university websites for syllabus-related web pages by querying the Google API with syllabus-related keywords and the domain names of the websites. Classifier ranks pages related to CS syllabi among a large number of web pages according to the confidence scores of the linear SVM. We trained the linear SVM decision model on the syllabus pages collected in our earlier studies. Using the pages obtained from both the Google API and the linear SVM, we can find CS syllabus pages for more universities than with either method alone. By combining the top nine Google API results with the top two pages ranked by the linear SVM's decision model, we obtained CS syllabus pages from more than 96.6% of the 58 universities.
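
The combination step described in the abstract can be illustrated with a minimal sketch. It assumes that, for one university, two ranked URL lists are already available: one from the keyword search and one ordered by the linear SVM's confidence score. The function name `hybrid_candidates`, its parameters, and the example URLs are hypothetical and are not taken from the authors' system; only the cutoffs of nine and two come from the abstract.

```python
from typing import Iterable, List


def hybrid_candidates(
    search_ranked: Iterable[str],
    svm_ranked: Iterable[str],
    k_search: int = 9,
    k_svm: int = 2,
) -> List[str]:
    """Merge the top results of two ranked URL lists, dropping duplicates."""
    merged: List[str] = []
    seen = set()
    # Keyword-search results come first, followed by the classifier's top picks.
    for url in list(search_ranked)[:k_search] + list(svm_ranked)[:k_svm]:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


# Hypothetical ranked lists for one university (all URLs are made up).
search_hits = [f"https://cs.example.edu/page{i}" for i in range(12)]
svm_hits = [
    "https://cs.example.edu/syllabus/algorithms",
    "https://cs.example.edu/page3",  # also among the top nine search hits
    "https://cs.example.edu/syllabus/os",
]

candidates = hybrid_candidates(search_hits, svm_hits)
print(len(candidates))  # 10: nine search hits plus one new SVM page
```

Listing the keyword-search results ahead of the classifier's pages is an arbitrary choice in this sketch; the abstract does not say how the two lists are ordered relative to each other.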