Tool: Automatically Extracting Hardware Descriptions from PDF Technical Documentation
Keywords: hardware-dependent software, technical documentation, knowledge graph, code generation, open source
TL;DR: The paper describes a tool that implements a modular processor which uses deterministic table processing to extract detailed datasets from PDF technical documentation for thousands of microcontrollers.
Abstract: The ever-increasing variety of microcontrollers makes porting embedded software to new devices increasingly laborious: most of the work is manual, and code generators can be used only in special cases. Moreover, little technical documentation for these devices is available in machine-readable formats that could help automate porting efforts. Instead, the bulk of documentation comes as print-oriented PDFs. We therefore identify a strong need for a processor that accesses these PDFs and extracts their data with high quality to improve code generation for embedded software. In this paper, we design and implement a modular processor that uses deterministic table processing to extract detailed datasets from PDF technical documentation for thousands of microcontrollers. Specifically, we systematically extract device identifiers, interrupt tables, packages and pinouts, pin functions, and register maps. In our evaluation, we compare the data extracted from STMicro documentation against existing machine-readable sources. Our results show that our processor matches 96.5 % of almost 6 million reference data points, and we further discuss the issues identified in both sources. Hence, our tool yields very accurate data with only limited manual effort and can enable and enhance a significant number of existing and new code generation use cases in the embedded software domain that are currently limited by a lack of machine-readable data sources.
Area: Computer Architecture
Previous Version: https://openreview.net/forum?id=YLpU1vHVxC
Submission Number: 1