Steps:
1. Print all possible tasks: `all_possible_task.py` -> `all_possible_tasks.csv`
2. Print all possible initial system prompts (iterating over all possible tasks): `prompts` -> `all_possible_init_system_prompt.txt`
3. Run the experiments: `experiments`
    1. Data: `results/{model_name}/`
    2. Terminal output: `logs/{model_name}/`
4. Analyze all the results: `results/{model_name}/`
5. Output the final outcomes: `results/{model_name}/`
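
The per-model output layout used in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper names `output_dirs` and `ensure_output_dirs` and the `base` parameter are hypothetical. Only the `results/{model_name}/` and `logs/{model_name}/` patterns come from the steps above.

```python
from pathlib import Path


def output_dirs(model_name: str, base: Path = Path(".")) -> dict[str, Path]:
    """Return the result and log directories for one model (hypothetical helper)."""
    return {
        "results": base / "results" / model_name,  # experiment data and analysis
        "logs": base / "logs" / model_name,        # captured terminal output
    }


def ensure_output_dirs(model_name: str, base: Path = Path(".")) -> dict[str, Path]:
    """Create both directories if they do not exist yet, then return them."""
    dirs = output_dirs(model_name, base)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `ensure_output_dirs("gpt-4o-mini")` before a run would keep each model's data and logs separate, so repeated experiments with different models do not overwrite each other.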


Models:

- ChatGPT
    - Tested:
        - model_name = "gpt-4o-mini"
        - model_name = "gpt-4.1-mini-2025-04-14"
        - model_name = "o3-mini-2025-01-31"
        - model_name = "o3-2025-04-16"
    - Considered:
        - chat models:
            - model_name = "gpt-4o-mini", $0.15, model: 2024-07-18
            - model_name = "gpt-4.1-mini-2025-04-14", $0.4
            - model_name = "gpt-4-turbo-2024-04-09", not efficient
            - model_name = "gpt-4o-2024-08-06", $2.5, out of date
            - model_name = "gpt-4o-2024-11-20", $2.5, not efficient
        - reasoning models:
            - model_name = "o1-mini-2024-09-12", $1.1, out of date
            - model_name = "o3-mini-2025-01-31", $1.1
            - model_name = "o4-mini-2025-04-16", $1.1, format failure
            - model_name = "o1-preview-2024-09-12", $15, out of date
            - model_name = "o1-2024-12-17", $15, out of date, not efficient
            - model_name = "o3-2025-04-16", $10
            - model_name = "o1-pro-2025-03-19", $ not efficient
- DeepSeek
    - 


```
model_name = "gpt-4o-mini"
model_name = "gpt-4.1-mini-2025-04-14"
model_name = "o3-mini-2025-01-31"
model_name = "o3-2025-04-16"
```
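
If the four assignments above are pasted into a script as they stand, each one overwrites the previous, so only the last takes effect. One way to run every tested model in turn is sketched below. The names `TESTED_MODELS` and `select_model` are hypothetical and do not come from the project code. Only the four model strings are taken from the list above.

```python
# Tested model names, copied from the list above.
TESTED_MODELS = [
    "gpt-4o-mini",
    "gpt-4.1-mini-2025-04-14",
    "o3-mini-2025-01-31",
    "o3-2025-04-16",
]


def select_model(name: str) -> str:
    """Return `name` if it is a tested model, otherwise raise ValueError."""
    if name not in TESTED_MODELS:
        raise ValueError(f"untested model: {name!r}")
    return name


if __name__ == "__main__":
    for model_name in TESTED_MODELS:
        # Placeholder for the real experiment entry point.
        print(f"would run experiments for {select_model(model_name)}")
```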

