# Who's Gaming the System?

This is the code repository for the NeurIPS submission *Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation*, which is under review. Thank you for your interest in our work!

Upon code release, we will provide contact information here.

# Running gaming detection models

For all approaches, predicted rankings, a model pickle file, and summary statistics are saved under `estimators/`, in a subdirectory named by the `--name` command-line argument.

## Causal approaches

All causal models share the entry script `upcoding_cate.py`; the config file selects the model:

```
python upcoding_cate.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```

## Non-causal approaches

All non-causal baselines share the entry script `non_causal_usod.py`; the config file selects the model:
```
python non_causal_usod.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```

To overwrite existing results, pass the `--overwrite` flag. By default, the scripts raise an error if the subdirectory of `estimators/` specified by `--name` already exists, which prevents accidental overwriting.
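
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code used by the scripts: the function name `prepare_output_dir` and the `root` parameter are hypothetical, and deleting the old directory on overwrite is an assumption about how replacement might work.

```python
import shutil
from pathlib import Path


def prepare_output_dir(name, overwrite=False, root="estimators"):
    """Create root/name, refusing to clobber an existing run unless overwrite is set."""
    out_dir = Path(root) / name
    if out_dir.exists():
        if not overwrite:
            # Mirrors the default behavior: fail loudly instead of overwriting.
            raise FileExistsError(
                f"{out_dir} already exists; pass --overwrite to replace it"
            )
        # Assumption: an overwrite discards the previous run's artifacts.
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)
    return out_dir
```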

All configs used for our experiments are provided in `config/experiments` and are listed below. Config paths in the table are relative to `config/experiments`.

| Model    | Config file path | Dataset | Entry script |
| -------- | ------- | ------ | ----- | 
| Payout-only  |  `od/payout.yml` | Synthetic | `non_causal_usod.py` | 
| Random |  `od/random.yml`   | Synthetic |  `non_causal_usod.py` |
| KNN  | `od/knn.yml`   | Synthetic  | `non_causal_usod.py`  |
| ECOD | `od/ecod.yml` | Synthetic | `non_causal_usod.py` |
| DIF | `od/dif.yml` | Synthetic | `non_causal_usod.py` |
| PSM | `psm.yml` | Synthetic | `upcoding_cate.py` |
| S-Learner | `s_final.yml` | Synthetic | `upcoding_cate.py` |
| T-Learner | `t_final.yml` | Synthetic | `upcoding_cate.py` |
| DragonNet | `dragonnet_final.yml` | Synthetic | `upcoding_cate.py` |
| R-Learner | `r_final.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `sipw.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `ffs_slearner_pw.yml` | Medicare | `upcoding_cate.py` |
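
As a rough illustration of how a table row maps onto a command line, the sketch below assembles the argument list for a few of the models. The names `MODELS`, `CONFIG_ROOT`, and `build_command` are hypothetical, and it assumes that `--config` accepts a path from the repository root (that is, `config/experiments/` joined with the table entry).

```python
CONFIG_ROOT = "config/experiments"

# (config path relative to CONFIG_ROOT, entry script), copied from three table rows.
MODELS = {
    "KNN": ("od/knn.yml", "non_causal_usod.py"),
    "S-Learner": ("s_final.yml", "upcoding_cate.py"),
    "DragonNet": ("dragonnet_final.yml", "upcoding_cate.py"),
}


def build_command(model, name, dataset):
    """Return the argument list for running one model on one dataset."""
    config, script = MODELS[model]
    return [
        "python", script,
        "--config", f"{CONFIG_ROOT}/{config}",  # assumption: path from the repo root
        "--name", name,
        "--dataset", dataset,
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.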

While each config file specifies a default dataset, we recommend overriding it directly via the `--dataset` argument. The valid dataset names are the keys of `config/data_pathspec.yml`.
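
One way to list the valid dataset names is sketched below. It assumes PyYAML is installed and that the dataset names are the top-level keys of the file; the helper names `load_pathspec` and `valid_datasets` are hypothetical and not part of this repository.

```python
from pathlib import Path


def valid_datasets(pathspec):
    """Return the dataset names, i.e. the top-level keys of the path spec."""
    return sorted(pathspec)


def load_pathspec(path="config/data_pathspec.yml"):
    """Parse the YAML path spec into a dict; requires PyYAML."""
    import yaml  # third-party, imported lazily so valid_datasets has no dependencies

    with Path(path).open() as f:
        return yaml.safe_load(f)
```

Calling `valid_datasets(load_pathspec())` from the repository root would then return the names accepted by `--dataset`.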

The config files also include information on hyperparameters, as reported in the Appendix.


# Data generation

## Fully synthetic data

We have provided the synthetic datasets used for each experiment exactly as they were generated in the `analytic/synthetic` directory. However, if you'd like to regenerate your own synthetic datasets, you can follow the instructions below.

### Dataset creation

Example command:
```
python create_dataset.py --config config/datasets/synth_spread[#.#].yaml --overwrite
```
where `[#.#]` is replaced with the mean range, one of 0.0, 0.1, ..., 1.0 (for example, `synth_spread0.3.yaml`).
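
To regenerate every synthetic dataset in one pass, the commands can be enumerated as in the sketch below. The helper names are hypothetical; the sketch only assumes the `synth_spread[#.#].yaml` naming shown in the example command, with one decimal place.

```python
def spread_values():
    """Return the settings 0.0, 0.1, ..., 1.0 as one-decimal strings."""
    # Integer steps avoid float drift such as 0.30000000000000004.
    return [f"{i / 10:.1f}" for i in range(11)]


def create_dataset_commands():
    """Return one create_dataset.py argument list per spread value."""
    return [
        ["python", "create_dataset.py",
         "--config", f"config/datasets/synth_spread{s}.yaml",
         "--overwrite"]
        for s in spread_values()
    ]
```

Each argument list can be passed to `subprocess.run(cmd, check=True)`. Note that `--overwrite` replaces the provided copies of the datasets.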


## FFS Data Extraction

These scripts run HCC extraction and cost calculation for one year of beneficiary diagnoses. They are intended as a quick (~1 hour) way to analyze a small subset (~1%) of the data.

### Order of operations

1. Run `scan.py`, *e.g.*
```
python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}
```
This scans the original SAS7BDAT files in chunks of the size given by `--chunksize`, keeps only the beneficiaries whose `BENE_ID` ends in `XX`, and writes them to `subset/*.csv`. The runtime is under 1 hour for the longest claims file when caching 1% of the data.

2. Then, run `data_model.py`:
```
python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv
```

The runtime is approximately 5 minutes for the longest claims file (based on the 1% of the data generated via `scan.py`).

You can also use `data_model.py` to process the SAS7BDAT files directly, but this is not recommended. The command would be:
```
python data_model.py --name WHATEVER --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb} --format sas7bdat
```

Both approaches save intermediate dataframes for each claim type at `./intermediate/WHATEVER/_staging_*.csv`. Processing the SAS7BDAT files directly takes approximately 2 days for the longest claims file when caching 1% of the data.

3. If you did not run `data_model.py` for all claims simultaneously, you need to run `combine_staging.py`:
```
python combine_staging.py --stage-dir intermediate/WHATEVER
```

Your final analytic file will be at `intermediate/WHATEVER/data.csv`. The runtime is under 1 minute.

4. To prepare the final dataset for the modeling scripts, we provide a column-mapping/value-remapping utility script in `create_observational_dataset.py`, which can be used as follows:

```
python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
```
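
The chunked suffix filter from step 1 can be sketched as follows. This is not the implementation in `scan.py`: it is a standard-library stand-in that reads CSV rather than SAS7BDAT input, the function names are hypothetical, and it assumes the filter keeps the beneficiaries whose `BENE_ID` ends in the given suffix. For SAS7BDAT input, `pandas.read_sas` with its `chunksize` argument could supply the chunks instead.

```python
import csv
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield successive lists of at most `chunksize` rows from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


def filter_by_suffix(rows, suffix, chunksize=100000, id_column="BENE_ID"):
    """Yield rows whose ID ends with `suffix`, reading one chunk at a time."""
    for chunk in iter_chunks(rows, chunksize):
        for row in chunk:
            if str(row[id_column]).endswith(suffix):
                yield row


def write_subset(in_path, out_path, suffix, chunksize=100000):
    """Copy the matching rows of one CSV file into a new CSV file."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(filter_by_suffix(reader, suffix, chunksize))
```

If the IDs are roughly uniform over their last two characters, a two-digit suffix keeps about 1% of beneficiaries, which matches the subset size mentioned above.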

## Extracting state summary statistics

We have provided data processing code for the state summary statistics in `Create state-level summaries.ipynb`, included in `notebooks` for convenience. We have redacted the outputs to comply with data usage requirements. 

The files are publicly available and hosted at the following links:
* [NANDA](https://www.openicpsr.org/openicpsr/project/120907/version/V3/view): `https://www.openicpsr.org/openicpsr/project/120907/version/V3/view`
* [Provider of Service](https://data.nber.org/pos/web_update/orig/): `https://data.nber.org/pos/web_update/orig/`. We used the file titled `pos_other_Q42018.zip`. 

The data dictionary for the 2018 Provider of Service file is available separately [here](https://www.hhs.gov/guidance/document/2018-pos-file-0) at the link titled `"December 2018 POS OTHER FLAT File and Layouts - Opens in a new window"`.

# Regenerating figures 

Figures were created in the Jupyter notebook titled `Hit rate plots.ipynb`, included in `notebooks` for convenience. The original results figures in the paper are also included in the `notebooks/` directory. 
