Abstract: To build a programming by demonstration (PBD) web scraping tool for end users, one needs two central components: a list finder, and a record and replay tool. A list finder extracts logical tables from a webpage. A record and replay (R+R) system records a user's interactions with a webpage, and replays them programmatically. The research community has invested substantial work in list finding --- variously called wrapper induction, structured data extraction, and template detection. In contrast, researchers largely considered the browser R+R problem solved until recently, when webpage complexity and interactivity began to rise. We argue that the increase in interactivity necessitates the use of new, more robust R+R approaches, which will facilitate the PBD web tools of the future. Because robust R+R is difficult to build and understand, we argue that tool developers need an R+R layer that they can treat as a black box. We have designed an easy-to-use API that allows programmers to use and even customize R+R, without having to understand R+R internals. We have instantiated our API in Ringer, our robust R+R tool. We use the API to implement WebCombine, a PBD scraping tool. A WebCombine user demonstrates how to collect the first row of a relational dataset, and the tool collects all remaining rows. WebCombine uses the Ringer API to handle navigation between pages, enabling users to scrape from modern, interaction-heavy pages. We demonstrate WebCombine by collecting a 3,787,146 row dataset from Google Scholar that allows us to explore the relationship between researchers' years of experience and their papers' citation counts.
0 Replies
Loading