1. `crawl_c4.py` to download HTML source code for C4 data
2. `summarize_c4.py` to summarize key features of the downloaded HTML files
3. (optional) subsample HTML files to ensure diversity
4. `main_construct_data.py` to construct the data
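
Step 3 leaves the subsampling method open. One possible approach, sketched here under the assumption that each downloaded HTML file can be traced back to its source URL, is to cap the number of pages kept per domain so that no single site dominates. The helper `subsample_by_domain` and its parameters `max_per_domain` and `seed` are hypothetical and are not part of this repository.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def subsample_by_domain(urls, max_per_domain=2, seed=0):
    """Keep at most `max_per_domain` URLs per domain (hypothetical helper).

    Assumption: diversity is approximated by spreading pages across domains.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Group URLs by their network location (e.g. "example.com").
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)

    kept = []
    for domain in sorted(by_domain):  # sorted for a deterministic order
        group = by_domain[domain]
        if len(group) > max_per_domain:
            group = rng.sample(group, max_per_domain)
        kept.extend(group)
    return kept
```

Calling `subsample_by_domain(urls, max_per_domain=1)` would keep one page per domain. Other notions of diversity, such as the features produced in step 2, could replace the domain key in the grouping step.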