I want to research the question "How do self-correction methods impact large language model performance across math, commonsense reasoning, and multi-hop question answering benchmarks?"
You should prompt the models to undergo two rounds of self-correction and record the result of each round.
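
As a rough illustration of the two-round procedure, the loop below keeps the initial answer plus the answer after each correction round. It is only a sketch: `self_correct`, the `generate` callable, and the prompt wording are all hypothetical, and `generate` stands in for whatever call the provided inference utilities actually expose.

```python
from typing import Callable, List


def self_correct(question: str,
                 generate: Callable[[str], str],
                 rounds: int = 2) -> List[str]:
    """Return [initial answer, answer after round 1, ..., answer after round N].

    `generate` is a hypothetical stand-in for a single model call
    (prompt in, text out); the prompt templates are assumptions.
    """
    # Initial attempt, before any self-correction.
    answers = [generate(f"Question: {question}\nAnswer:")]
    for _ in range(rounds):
        # Each round shows the model its most recent answer and asks for a review.
        prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {answers[-1]}\n"
            "Review your previous answer for mistakes, then give your final answer."
        )
        answers.append(generate(prompt))
    return answers
```

Keeping every intermediate answer makes it possible to score the initial response and each correction round separately.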

You have access to the following resources:

Models:
- gpt-3.5-turbo and gpt-4o via the provided inference utilities
- You can call these models using: from utils.llm_inference import LLMInference
- The API key is supplied through the LLMInference initialization function
- You can use the batch_generate() function to speed up the experiment
- Computational budget: 300 API calls per model

Datasets:
- GSM8K (grade school math word problems): /data/gsm8k/
- CommonSenseQA (commonsense multiple-choice QA): /data/commonsenseqa/
- HotpotQA (open-domain multi-hop QA): /data/hotpotqa/
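
The 300-call budget and the three datasets above bound how many examples can be evaluated. The arithmetic below is a sketch under two assumptions that are not stated in the task: each example costs one call for the initial answer plus one call per self-correction round, and the budget is split evenly across datasets. The helper name `examples_per_dataset` is hypothetical.

```python
def examples_per_dataset(budget: int, n_datasets: int, calls_per_example: int) -> int:
    """Largest per-dataset sample size that stays within the call budget."""
    return budget // (n_datasets * calls_per_example)


BUDGET = 300                 # API calls per model (from the task description)
N_DATASETS = 3               # GSM8K, CommonSenseQA, HotpotQA
CALLS_PER_EXAMPLE = 1 + 2    # assumption: initial answer + two correction rounds

n = examples_per_dataset(BUDGET, N_DATASETS, CALLS_PER_EXAMPLE)
used = n * N_DATASETS * CALLS_PER_EXAMPLE
print(n, used)  # 33 examples per dataset, 297 calls used per model
```

If batch_generate() counts a whole batch as a single call, the sample size could be larger; that depends on how the utilities meter usage.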

Experimental constraints:
- Evaluate using the Exact Match accuracy metric
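
A minimal sketch of an Exact Match scorer follows. The normalization (lowercasing, collapsing whitespace, stripping surrounding punctuation) is an assumption rather than something the task specifies, and the names `normalize` and `exact_match_accuracy` are hypothetical.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and strip surrounding punctuation."""
    collapsed = " ".join(text.lower().split())
    return collapsed.strip(string.punctuation + " ")


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

This compares strings only, so "1,000" and "1000" would not match; GSM8K answers would likely need a numeric extraction step before scoring.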

Please design and execute **FULL** experiments to investigate this research question. Document your experimental plan, run your experiments, and provide conclusions at different levels of detail.