- Abstract: Data access and data sharing face many barriers, especially for machine learning on health care data. Legal constraints such as HIPAA protect patient privacy but slow access to data and limit reproducibility. We describe Kung Faux Pandas, an end-to-end system for easily generating de-identified or synthetic data that is statistically similar to real data but lacks sensitive information. The system narrows data synthesis and de-identification to a specific research question, which allows self-service data access without the complexity of generating an entire population of data that a given research project does not need. Kung Faux Pandas is an open-source, publicly available system that lowers barriers to HIPAA- and GDPR-compliant data sharing, for reproducibility and other purposes.
- Keywords: pandas, reproducibility, python, healthcare, data synthesis
- TL;DR: Reproducibility in machine learning is hard, especially when the data are sensitive. Kung Faux Pandas lowers barriers to data sharing by making de-identified and synthetic data easier to generate.