Keywords: Synthetic Data, Synthetic Population, Agent-based Modelling, Statistical Methods, Machine Learning
TL;DR: This paper presents a hybrid framework for generating a country-scale synthetic population, along with metrics to assess the quality of that population.
Abstract: Population censuses are vital to public policy decision-making. They provide insights into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are very expensive (especially for low- and middle-income countries with large populations, such as India), and may also raise privacy concerns, depending on the kinds of data collected.
We introduce a novel hybrid framework which can combine data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: this means that our "fake" people live in realistic locations, have realistic families, etc. Such data can be used for a variety of purposes: we explore one such use case, agent-based modelling of infectious disease in India.
We use both machine learning and statistical metrics to gauge the quality of our synthetic population. Our experimental results show that synthetic data can realistically simulate the population for various administrative units of India, producing real-scale, detailed data at the desired level of granularity: from cities, to districts, to states, which eventually combine to form a country-scale synthetic population.
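To illustrate what a statistical quality check of this kind might look like, here is a minimal sketch that compares the marginal distribution of one categorical attribute in a real survey sample against the same attribute in a synthetic population. The abstract does not name the metrics used, so the choice of total variation distance, the function names, and the toy age-group data are all assumptions made for illustration only.

```python
from collections import Counter


def marginal(values):
    """Return the empirical distribution of a categorical attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two empirical marginals.

    0.0 means the marginals are identical; 1.0 means they share no category.
    This metric is an illustrative assumption, not the paper's stated method.
    """
    p = marginal(real_values)
    q = marginal(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy data: age groups from a survey sample and a synthetic sample.
real_age_groups = ["0-17", "18-59", "18-59", "60+"]
synthetic_age_groups = ["0-17", "18-59", "18-59", "18-59"]

tvd = total_variation_distance(real_age_groups, synthetic_age_groups)
print(f"Total variation distance for age group: {tvd:.2f}")
```

A check like this only covers one attribute at a time; assessing joint structure (for example, family composition) would need additional metrics.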
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (e.g., biology, physics, health sciences, social sciences, climate/sustainability)