# LBIDD - selection of 50 covariates

This repository contains the code for generating a dataset of 50 covariates drawn from the LBIDD dataset. The resulting dataset contains only complete observations (no missing values).

## Original Data
The original data is the ["Linked Birth-Infant Death Data 1995 (LBIDD_95)"](https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/periodlinkedus/LinkPE95US.zip), available from the [NCHS website](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Period_Linked). Its documentation is available [here](https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/periodlinked/LinkPE95Guide.pdf).

## Setup

After cloning this repo, run the following commands in a terminal to create the Conda environment, register it as a Jupyter kernel, and open JupyterLab:
```
cd lbidd_data
conda env create -f environment.yml
conda activate lbidd_env
python3 -m ipykernel install --user --name=lbidd_env
jupyter lab
```

## Usage

#### Step 1: Feature selection

This notebook takes the denominator file from [LBIDD_95](https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/periodlinkedus/LinkPE95US.zip) as input and generates the dataset 'LBIDD_den_dataset_tot.csv'. The resulting dataset contains 50 features and no missing values.
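To confirm both properties after running the notebook, a small check on the generated CSV can help. The sketch below is not part of this repository: the function name `check_complete` and the tokens treated as missing are assumptions, and it relies only on the Python standard library.

```python
import csv

# Assumption: these cell contents count as "missing"; adjust to the notebook's conventions.
MISSING_TOKENS = {"", "na", "nan", "null"}


def check_complete(path, expected_columns=50):
    """Return (n_rows, n_columns); raise ValueError if the file is not complete."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if len(header) != expected_columns:
            raise ValueError(
                f"expected {expected_columns} columns, found {len(header)}"
            )
        n_rows = 0
        for row in reader:
            # Every data row must have one field per header column.
            if len(row) != len(header):
                raise ValueError(
                    f"line {reader.line_num}: expected {len(header)} fields, "
                    f"found {len(row)}"
                )
            # Reject any cell whose content matches a missing-value token.
            for name, cell in zip(header, row):
                if cell.strip().lower() in MISSING_TOKENS:
                    raise ValueError(
                        f"line {reader.line_num}: missing value in column {name!r}"
                    )
            n_rows += 1
    return n_rows, len(header)
```

Calling `check_complete("LBIDD_den_dataset_tot.csv")` returns the row and column counts when the file passes. If the CSV was saved with an extra index column, pass `expected_columns=51` or drop that column first.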


#### Step 2: Recoding

This notebook takes the file 'LBIDD_den_dataset_tot.csv' generated in step 1 as input and recodes some of the variables to make them suitable for the experiments in the paper ["Robust prediction under missingness shifts"]().
Its outputs are the dataset 'LBIDD_den_final.csv' and a random subset of it, 'LBIDD_den_final_200000.csv'. The variables are described in the Excel file "LBIDD_50_description.xlsx".
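
For readers who want to draw a comparable subset themselves, the following is a minimal sketch of reproducible row sampling. It is an illustration only and not the notebook's code: the function name `write_random_subset`, the seed, and the reading of `200000` in the file name as a row count are all assumptions.

```python
import csv
import random


def write_random_subset(src_path, dst_path, n_rows, seed=0):
    """Copy the header plus n_rows rows sampled uniformly without replacement."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)  # loads the whole file into memory
    if n_rows > len(rows):
        raise ValueError(
            f"requested {n_rows} rows but the file has only {len(rows)}"
        )
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    # Sample row positions, then sort them so the subset keeps the source order.
    keep = sorted(rng.sample(range(len(rows)), n_rows))
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows[i] for i in keep)
    return len(keep)
```

For example, `write_random_subset("LBIDD_den_final.csv", "subset.csv", 200_000)` writes a 200,000-row sample to a hypothetical `subset.csv`. With pandas, `DataFrame.sample(n=..., random_state=...)` achieves the same effect in one call.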

## Sources
[1] Marian F. MacDorman and Jonnae O. Atkinson. Infant mortality statistics from the linked birth/infant death data set - 1995 period data. Mon Vital Stat Rep, 46(suppl 2):1-22, 1998.  
[2] [NCHS website](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Period_Linked).

