# README

## Project Introduction
This project includes two callable functions:
1. **Call the API to process data** - `modify_and_run`
2. **Call Stata to perform data analysis** - `main`

## Environment Requirements

This project requires the following software and libraries:

### Stata Dependencies

This project requires a local installation of Stata 18 and the following Stata packages:

```stata
ssc install reghdfe
ssc install ftools
```

### Python Dependencies

This project depends on the following Python libraries:

- `stata_setup` (for interaction with Stata)  
  Refer to the official documentation for deploying pystata:  
  https://www.stata.com/python/pystata18/notebook/Magic%20Commands1.html
- `pandas` (for data processing)
- `numpy` (for numerical computation)
- `scipy` (for statistical calculation; the project uses `scipy.stats.binom`)
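
Before running either function, it can help to confirm that the Python dependencies listed above are importable. The snippet below is a minimal sketch using only the standard library; the helper name `find_missing_modules` is hypothetical and not part of this project.

```python
import importlib.util

# Top-level module names taken from the dependency list above.
REQUIRED_MODULES = ("stata_setup", "pandas", "numpy", "scipy")


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules()
    if missing:
        print("Missing Python dependencies:", ", ".join(missing))
    else:
        print("All Python dependencies are importable.")
```

Once `stata_setup` is importable, pystata is initialised with `stata_setup.config(...)`, which takes the Stata installation directory and the edition; both depend on the local Stata 18 install.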

## Usage Examples

### 1. Call API to process data (`modify_and_run`)
#### Input parameters:
- `file_path` (str): Path to the Python file to be modified.
- `new_values` (dict): Configuration values to update, including:
  - `OPENROUTER_API_KEY` (str)
  - `MODEL_NAME` (str)
  - `API_URL` (str)
  - `providers` (list[str])
  - `output_dir` (str)

#### Example call:
```python
file_path = "/Users/yuki/Desktop/batch/DeepSeek_R1_Example.py"

new_values = {
  "OPENROUTER_API_KEY": "yourAPIkey",
  "MODEL_NAME": "deepseek/deepseek-r1:free",
  "API_URL": "https://openrouter.ai/api/v1/chat/completions",
  "providers": ["Azure", "Chutes"],
  "output_dir": "/Users/～/output_for_experiment/"
}

modify_and_run(file_path, new_values)
```

The predictions generated by the LLM are saved to the specified `output_dir`.
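
Because `modify_and_run` rewrites configuration values in another file, a typo in `new_values` may be hard to spot afterwards. The following is a sketch of a pre-flight check, assuming all five keys listed under "Input parameters" are expected; `check_new_values` and `EXPECTED_KEYS` are hypothetical names, not part of this project.

```python
# Expected keys and types, copied from the "Input parameters" list above.
# Treating every key as mandatory is an assumption of this sketch.
EXPECTED_KEYS = {
    "OPENROUTER_API_KEY": str,
    "MODEL_NAME": str,
    "API_URL": str,
    "providers": list,
    "output_dir": str,
}


def check_new_values(new_values):
    """Return a list of problems found in new_values; an empty list means OK."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in new_values:
            problems.append(f"missing key: {key}")
        elif not isinstance(new_values[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    # providers is documented as list[str], so check the elements as well.
    providers = new_values.get("providers")
    if isinstance(providers, list) and not all(isinstance(p, str) for p in providers):
        problems.append("providers should contain only strings")
    return problems
```

A caller could run `check_new_values(new_values)` and only invoke `modify_and_run(file_path, new_values)` when the returned list is empty.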

### 2. Call Stata to perform data analysis (`main`)

#### Input parameters:

- `model` (str): Name of the model to analyze.
- `path_base` (str): Base path of the local dataset.
- `json_path` (str): Path to the label information JSON file.
- `output_dir` (str): Directory to save the results.

#### Example call:

```python
path_base = r"/Users/～/law_ethnics"
json_path = r"/Users/～/params.json"
output_dir = r"/Users/～/output"
model = "NOVALite"

# main returns six values; unpack them in the order shown.
(
    json_Consistency,
    json_main_p0_1,
    json_main_P_N,
    json_inaccuracy_p0_1,
    json_inaccuracy_P_N,
    df_dict,
) = main(model, path_base, json_path, output_dir)
```

#### Output format:

- `json_Consistency`: JSON data for Part I: Consistency Analysis table
- `json_main_p0_1`: JSON data for Part II: Bias Analysis (p < 0.1 table)
- `json_main_P_N`: JSON data for Part II: Bias Analysis (detailed classification of biased labels)
- `json_inaccuracy_p0_1`: JSON data for Part III: Unfair Inaccuracy Analysis (p < 0.1 table)
- `json_inaccuracy_P_N`: JSON data for Part III: Unfair Inaccuracy Analysis (detailed classification)
- `df_dict`: Summary statistics used in the report text

  Example:

  ```json
  [
      {
          "model": "llama3_1",
          "main_p0_1value": 2.1403870629445287e-14,
          "inaccuracy_p0_1value": 2.1403870629445287e-14,
          "main_p0_05value": 2.7161306139462403e-17,
          "inaccuracy_p0_05value": 2.7161306139462403e-17,
          "main_p0_01value": 3.812785434269691e-22,
          "inaccuracy_p0_01value": 3.812785434269691e-22,
          "avg_valid_id_ratio": 0.1743215495253777,
          "avg_mae": 61.44939024123505,
          "avg_mape": 142.9436988284994,
          "main_total_biased_labels": 31,
          "inaccuracy_total_biased_labels": 21
      }
  ]
  ```
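
To read these statistics back later, the saved `df_dict.json` can be loaded with the standard library. This is a sketch that assumes the file holds a list of records shaped like the example above; `load_summary` is a hypothetical helper, not part of this project.

```python
import json
from pathlib import Path


def load_summary(output_dir):
    """Load df_dict.json from output_dir and index the records by model name."""
    path = Path(output_dir) / "df_dict.json"
    with path.open(encoding="utf-8") as fh:
        records = json.load(fh)  # assumed: a list of dicts, one per model
    return {record["model"]: record for record in records}
```

With the example record above, `load_summary(output_dir)["llama3_1"]["avg_mae"]` would return the stored average MAE.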

Stata analysis log files are saved in: `path_base/model/log`  
Plots generated by Stata are saved in: `path_base/model/figure`
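
The two locations above can be built from the arguments passed to `main`. This sketch assumes that `path_base` and `model` in those paths stand for the corresponding argument values; `stata_artifact_dirs` is a hypothetical helper.

```python
from pathlib import Path


def stata_artifact_dirs(path_base, model):
    """Return the expected log and figure directories for one model."""
    root = Path(path_base) / model
    return {"log": root / "log", "figure": root / "figure"}
```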

The analysis results are saved in `.csv` and `.json` formats under `output_dir`:

- `Bias_Analysis_P.csv`: Labels with p < 0.1 in main regression  
- `Bias_Analysis_Pnum.csv`: Detailed classification of significantly biased labels  
- `crime_clustered_P.csv`: Robust standard errors (clustered by crime) with p < 0.1  
- `crime_clustered_Pnum.csv`: Detailed classification (clustered crime)  
- `df_dict.json`: Summary statistics, including weighted average MAE, MAPE, and valid sample rate  
- `inaccuracy_p.csv`: Labels with p < 0.1 in the inaccuracy regression  
- `inaccuracy_results_Pnum.csv`: Detailed classification of biased labels in the inaccuracy regression  
- `original_main_P.csv`: Labels with p < 0.1 in the raw main regression  
- `original_main_Pnum.csv`: Detailed classification of biased labels in the raw main regression  
- `output_Consistency.csv`: Consistency analysis table  
- `post_2014_P.csv`: Regression results on post-2014 samples (p < 0.1)  
- `post_2014_Pnum.csv`: Detailed classification for post-2014  
- `result.json`: Raw output from Stata analysis  
- `results_special_P.csv`: Special crime categories with p < 0.1  
- `results_special_Pnum.csv`: Special crime category detailed classification  
- `robust_standard_errors_P.csv`: Robust standard errors with p < 0.1  
- `robust_standard_errors_Pnum.csv`: Robust standard error classification  
- `robustness_lgxq_llm_full_P.csv`: Regression on full lgxq LLM with p < 0.1  
- `robustness_lgxq_llm_full_Pnum.csv`: Detailed classification for lgxq LLM full results
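
After a run, a quick way to confirm that the analysis finished is to check that every file in the list above exists. The following is a minimal sketch; `missing_outputs` is a hypothetical helper, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the output list above.
EXPECTED_FILES = [
    "Bias_Analysis_P.csv", "Bias_Analysis_Pnum.csv",
    "crime_clustered_P.csv", "crime_clustered_Pnum.csv",
    "df_dict.json",
    "inaccuracy_p.csv", "inaccuracy_results_Pnum.csv",
    "original_main_P.csv", "original_main_Pnum.csv",
    "output_Consistency.csv",
    "post_2014_P.csv", "post_2014_Pnum.csv",
    "result.json",
    "results_special_P.csv", "results_special_Pnum.csv",
    "robust_standard_errors_P.csv", "robust_standard_errors_Pnum.csv",
    "robustness_lgxq_llm_full_P.csv", "robustness_lgxq_llm_full_Pnum.csv",
]


def missing_outputs(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).is_file()]
```

An empty list from `missing_outputs(output_dir)` suggests that all result files were written.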
