Container-driven Reproducible Research Made Simple

Published: 31 May 2024, Last Modified: 24 Feb 2026 | Berkeley-Stanford Workshop on Veridical Data Science 2024 | CC BY 4.0
Abstract: Scholarship in data science consists of a complete software development environment along with instructions for reproducing all results and figures. However, fully specifying and reproducing an arbitrary data science workflow is often challenging, especially given the increasing complexity of software configurations, dependencies, and computational infrastructure. Previous research in reproducibility focused on language-specific tools that could not generalize to an arbitrary specification. Furthermore, reproducibility that relies on documentation can require specialized adjustments and tweaking that many researchers may not have the time or background for. To address this common deficiency, we introduce to the data science community a computational research framework that specifies complex computational environments using containers, an OS-level virtualization technology. We show that the container-driven reproducibility approach balances flexibility and ease of use through Visual Studio Code, a popular code editor. Containerization is a form of virtualization that allows fine-tuning at the operating-system level while being much more lightweight than a virtual machine, which emulates system hardware and an entire guest operating system. As such, containers are better suited to running locally on a laptop, with the added benefit of more straightforward portability to other systems, such as cloud platforms or high-performance computing environments. Containerization for reproducible research has attracted interest in recent years, with the use of both domain-specific proprietary and open-source solutions for research environments being encouraged. Despite their usefulness for computational reproducibility, containers have a steep initial learning curve.
To alleviate this, we outline the use of development containers with Visual Studio Code for computational research reproducibility and introduce our code-generating template repository, which further simplifies the initial setup of the Python- and/or R-based workflows commonly used in data science. The template creates an initial set of files for Visual Studio Code development containers, with the entire computational environment's information held in two files, including the interfaces to code files, remote servers, JupyterLab, and RStudio. These generated project files are portable and easily transferable between users and systems. In addition, the configuration is expandable and customizable to include additional software tools. This setup has been tested and used on local machines, servers with concurrent users, and high-performance computing environments such as Jetstream2. The containerization software required to support these systems is simple to configure. This framework provides a generic way to set up, reproduce, and distribute research with minimal effort on the researcher's part, while alleviating the need for hands-on reproduction of a given research environment.
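To make the two-file idea concrete, the following is a minimal sketch of how a development container configuration for a Python and R project might be generated. It is not the output of the template repository described above. The abstract does not name the two files, so the pairing of `devcontainer.json` with a `Dockerfile` is an assumption, as are the helper name `make_devcontainer`, the project name, the forwarded ports (8888 and 8787, the conventional defaults for JupyterLab and RStudio Server), and the choice of editor extensions.

```python
import json


def make_devcontainer(name, dockerfile="Dockerfile", ports=(8888, 8787)):
    """Build a dict in the devcontainer.json layout (hypothetical helper).

    Assumptions: the image is built from a Dockerfile that sits next to
    devcontainer.json, and the listed ports are the conventional defaults
    for JupyterLab (8888) and RStudio Server (8787).
    """
    return {
        "name": name,
        # Points at the second file that describes the software environment.
        "build": {"dockerfile": dockerfile},
        # Ports made reachable from the host for browser-based interfaces.
        "forwardPorts": list(ports),
        # Editor extensions installed inside the container (illustrative choice).
        "customizations": {
            "vscode": {"extensions": ["ms-python.python", "REditorSupport.r"]}
        },
    }


# "example-project" is a placeholder name, not taken from the source.
config = make_devcontainer("example-project")
text = json.dumps(config, indent=2)
print(text)
```

Writing `text` to `.devcontainer/devcontainer.json` alongside a `Dockerfile` would give Visual Studio Code what it needs to build and open the project inside the container, under the assumptions stated above.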