Investigation on University Websites for Semi-automated Syllabus Crawling

Takayuki Sekiya, Yoshitatsu Matsuda, Kazunori Yamaguchi

Published: 2019, Last Modified: 24 May 2025, FIE 2019, CC BY-SA 4.0

Abstract: This Research Paper presents the results of an investigation of the websites of about 100 universities, aimed at enabling the semi-automatic crawling of course syllabi at large scale. A syllabus gives fundamental information about a university course. For students, the syllabus is one of the most important documents when taking a course, because it lets them grasp the topics the course covers. For faculty members, a set of syllabi is useful for understanding the curriculum offered by the university. Thus, syllabi are among the most important clues in the analysis of educational activities. Some previous works reported that a large collection of course syllabi from the computer science (CS) curricula of about 50 universities can reveal interesting structures in those curricula. The syllabi were also useful for developing a tool to support students and faculty members. However, the course syllabi were collected manually, so it was difficult to increase their number substantially or to keep the collection up to date. Semi-automatic crawling of course syllabi at large scale is therefore needed for further analysis. In this paper, we investigate the course syllabus pages on the websites of the top 100 universities in the Times Higher Education World University Rankings 2018 (THE2018) for the computer science subject. The pages were collected by simple crawling and then selected manually. In this way, we found syllabus pages on the websites of about 60 universities. We then discovered that the structures of the syllabus pages can be categorized into three types: Link Type, Whole Type, and Database Type. A Link Type consists of a directory page with hyperlinks to many syllabus pages, where each page corresponds to one syllabus. A Whole Type consists of a single whole page that contains many syllabi. A Database Type consists of an entrance page to a database from which all the syllabi can be searched.
The numbers of Link Type, Whole Type, and Database Type structures were 31, 17, and 12, respectively. Furthermore, we found that a small number of key pages (namely, the directory pages, the whole pages, and the entrance pages) can be discovered automatically to a certain degree by a linear support vector machine. In particular, the directory pages and the whole pages could be found quite accurately. These results are expected to be useful for enabling the semi-automatic crawling of syllabi from university websites.
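
To make the three structure types concrete, the following is a minimal sketch of how a key page might be sorted into Link Type, Whole Type, or Database Type from simple structural counts. It is not the method of the paper, which relied on manual selection and a linear support vector machine. The course-code pattern, the `min_items` threshold, and the names `PageStats` and `guess_type` are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for course codes such as "CS 101"; real sites vary widely.
COURSE_CODE = re.compile(r"\b[A-Z]{2,4}\s?\d{3,4}\b")


class PageStats(HTMLParser):
    """Collects simple structural counts from one HTML page."""

    def __init__(self):
        super().__init__()
        self.course_links = 0      # course codes found inside hyperlink text
        self.course_mentions = 0   # course codes found in non-link text
        self.has_form = False      # page contains at least one <form>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._in_anchor = True
        elif tag == "form":
            self.has_form = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        found = len(COURSE_CODE.findall(data))
        if self._in_anchor:
            self.course_links += found
        else:
            self.course_mentions += found


def guess_type(html, min_items=3):
    """Guess the structure type of a page (assumed heuristic, not from the paper)."""
    stats = PageStats()
    stats.feed(html)
    if stats.course_links >= min_items:
        return "Link Type"       # directory page linking to one page per syllabus
    if stats.course_mentions >= min_items:
        return "Whole Type"      # one page containing many syllabi
    if stats.has_form:
        return "Database Type"   # entrance page to a searchable database
    return "Unknown"


link_page = (
    '<ul><li><a href="cs101.html">CS 101 Programming</a></li>'
    '<li><a href="cs102.html">CS 102 Data Structures</a></li>'
    '<li><a href="cs201.html">CS 201 Algorithms</a></li></ul>'
)
whole_page = (
    "<h2>CS 101 Programming</h2><p>Topics and schedule.</p>"
    "<h2>CS 102 Data Structures</h2><p>Topics and schedule.</p>"
    "<h2>CS 201 Algorithms</h2><p>Topics and schedule.</p>"
)
database_page = '<form action="/search"><input type="text" name="q"></form>'
```

Calling `guess_type` on the three sample strings illustrates one page of each type. Fixed rules like these are brittle across universities, which is one reason a learned classifier over page features is a natural fit for this task.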