# Raw Data Processing CoreLogic

This directory contains the scripts that filter the raw CoreLogic data down to the loans in the top 4 ZIP codes.

Before running the scripts, set the environment variable CORELOGIC_DATA_PATH to the directory where the data should be saved.
```
export CORELOGIC_DATA_PATH=/path/to/corelogic/data
```
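
For illustration, a script could resolve this variable as sketched below. This is a minimal sketch, not the code used by the scripts in this directory; the helper name `get_data_dir` is hypothetical.

```python
import os
from pathlib import Path


def get_data_dir(env_var: str = "CORELOGIC_DATA_PATH") -> Path:
    """Return the data directory named by the environment variable.

    Hypothetical helper: fails early with a clear message if the
    variable is unset or does not point to an existing directory.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = Path(value)
    if not path.is_dir():
        raise NotADirectoryError(f"{env_var} points to a missing directory: {path}")
    return path
```

Failing early avoids starting a long download that has nowhere to write.
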
<!-- /share/data/llm_mortgages/original_data  -->
## Downloading raw data
```
python load_corelogic_origination.py
python load_corelogic.py
```
These scripts download origination_data.csv and performance_data.csv to the directory specified by CORELOGIC_DATA_PATH.
Note that performance_data.csv is very large (around 870 GB), so make sure you have enough free disk space before starting.
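
One way to check for free space before downloading is sketched below. It is not part of the scripts in this directory; the function name `has_enough_space` is hypothetical, and the 870 GB threshold is only the approximate size quoted above.

```python
import shutil

# Approximate size of performance_data.csv quoted above (assumption: decimal GB).
REQUIRED_BYTES = 870 * 10**9


def has_enough_space(path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

For example, `has_enough_space(os.environ["CORELOGIC_DATA_PATH"])` could be called before running the download scripts.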

## Filtering to top 4 ZIP codes
To filter the origination and performance data down to the top 4 ZIP codes, run the following scripts in order.
```
python filter_origination_by_zip.py
python get_loan_ids_from_filtered_origination.py
python filter_performance_by_loan_id.py
```
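
The three steps can be pictured roughly as follows. This is a simplified sketch using only the standard library, not the actual implementation of the scripts above. The column names (`zip_code`, `loan_id`), file names, and helper functions are assumptions made for illustration. Rows are streamed one at a time so the large performance file never has to fit in memory.

```python
import csv


def filter_rows(src_path, dst_path, column, allowed):
    """Copy rows from src_path to dst_path whose `column` value is in `allowed`.

    Streams the input row by row and returns the number of rows kept.
    """
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[column] in allowed:
                writer.writerow(row)
                kept += 1
    return kept


def collect_values(path, column):
    """Return the set of distinct values found in `column` of a CSV file."""
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}


def run_pipeline(origination_csv, performance_csv, out_orig, out_perf, zip_codes):
    """Hypothetical end-to-end flow mirroring the three scripts above."""
    # Step 1: keep origination rows in the selected ZIP codes.
    filter_rows(origination_csv, out_orig, "zip_code", zip_codes)
    # Step 2: collect the loan IDs that survived the ZIP filter.
    loan_ids = collect_values(out_orig, "loan_id")
    # Step 3: keep performance rows belonging to those loans.
    return filter_rows(performance_csv, out_perf, "loan_id", loan_ids)
```

Holding the loan IDs in a set keeps each membership test constant time, which matters when the performance file has billions of rows.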

### Dataset Statistics
<!--
# Total nr rows origination: 205 000 000
# Total nr rows filtered: 7 800 000 (Top 100 zip codes)
# Total nr rows top 6 zip codes: 593 530
# Nr rows in top 5 zip codes performance:  

#Zip Code: CA000, Count: 132830 (not included)
#Zip Code: 92677, Count: 109642 (Orange County)
#Zip Code: FL000, Count: 101733 (not included)
#Zip Code: 93065, Count: 98673  (Simi Valley, near LA)
#Zip Code: 80015, Count: 97532  (Aurora Colorado)
#Zip Code: 80013, Count: 97392  (Aurora Colorado)
#Zip Code: 91709, Count: 95497 (Chino Hills, near LA)
#Zip Code: 92336, Count: 94794 (Fontana, near LA)

# Expected time: 9 000 000 000 rows, 20 seconds for 10 000 000 rows
# We will need 900 iterations, each of 20 seconds.
# 3 iterations per minute, 180 iterations per hour, 900 iterations is 5 hours

# Number of rows in big table: 915 chunks!


# Total rows filtered corelogic top 6: 21844312

# Top one: zip code 92677, Orange County, CA: 109 642 loan ids
# Total rows in filtered dataset: 3689879

# Filtered zip codes (top 4):
#Zip Code: 92677, Count: 109642
#Zip Code: 93065, Count: 98673
#Zip Code: 91709, Count: 95497
#Zip Code: 92336, Count: 94794
# Total nr rows in dataset: 14650331
-->