You are a research agent. Conduct research and run experiments on the question: "How does Large Language Model performance vary based on relevant information position in context?"

You have access to the following resources:

Models:
- gpt-3.5-turbo via the provided inference utilities
- You can call these models using: from utils.llm_inference import LLMInference
- API key is provided with the LLMInference initialization function
- You can use the batch_generate() function to speed up the experiment
- Computational budget: 5000 API calls per model

Datasets:
- Multi-document QA dataset: qa_data
    - qa_data contains multi-document question answering data for the oracle setting (1 input document, which is exactly the passage that answers the question) and 10-, 20-, and 30-document settings (where 1 input passage answers the question, and the other passages do not contain an NQ-annotated answer).
- Format: JSONL files. Each line of the file has the following format:
{
  "question": "who got the first nobel prize in physics",
  "answers": [
    "Wilhelm Conrad Röntgen"
  ],
  "ctxs": [
    ...
    {
      "id": <string id, e.g., "71445">,
      "title": <string title of the wikipedia article that this passage comes from>,
      "text": <string content of the passage>,
      "score": <string relevance score, e.g. "1.0510446">,
      "hasanswer": <boolean, whether any of the values in the `answers` key appears in the text>,
      "original_retrieval_index": <int indicating the original retrieval index. for example, a value of 0 indicates that this was the top retrieved document>,
      "isgold": <boolean, true or false indicating if this chunk is the gold answer from NaturalQuestions>
    },
    ...
  ],
  "nq_annotated_gold": {
    "title": <string title of the wikipedia article containing the answer, as annotated in NaturalQuestions>,
    "long_answer": "<string content of the paragraph element containing the answer, as annotated in NaturalQuestions>",
    "chunked_long_answer": "<string content of the paragraph element containing the answer, randomly chunked to approximately 100 words>",
    "short_answers": [
      <string short answers, as annotated in NaturalQuestions>
    ]
  }
}
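
As a minimal sketch of working with the record format above, the helpers below parse JSONL lines, find the passage flagged `isgold`, and return a reordered copy of `ctxs` with that passage at a chosen 0-based position. The function names (`parse_jsonl`, `gold_index`, `move_gold_to`) are hypothetical and not part of the provided utilities. Moving the gold passage this way is only one possible way to control its position.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def gold_index(example):
    """Return the index of the first passage flagged `isgold`, or -1 if there is none."""
    for i, ctx in enumerate(example["ctxs"]):
        if ctx.get("isgold"):
            return i
    return -1


def move_gold_to(example, position):
    """Return a copy of `ctxs` with the gold passage moved to `position` (0-based).

    The input example is left unmodified. The relative order of the
    non-gold passages is preserved.
    """
    idx = gold_index(example)
    if idx < 0:
        raise ValueError("example has no passage flagged isgold")
    ctxs = list(example["ctxs"])
    gold = ctxs.pop(idx)
    ctxs.insert(position, gold)
    return ctxs
```

A file can be loaded with `with open(path, encoding="utf-8") as f: examples = parse_jsonl(f)`, where `path` points at one of the qa_data JSONL files (the exact file names are not given here).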

Experimental constraints:
- Evaluate using the Exact Match accuracy metric
- Run the experiment with 50 samples per position (20 positions in total)
- Normalize the answers before computing Exact Match
- Focus on understanding positional effects on model performance
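
The normalization and Exact Match constraints above could be sketched as follows. The exact normalization rules are not specified here, so this assumes a common SQuAD-style scheme (lowercase, strip ASCII punctuation, drop the English articles "a", "an", "the", collapse whitespace) and counts a prediction as correct when its normalized form equals any normalized gold answer. The names `normalize_answer` and `exact_match` are hypothetical. Note that 20 positions with 50 samples each comes to 1000 model calls, which fits within the stated budget of 5000.

```python
import re
import string

# ASCII punctuation to strip; built once rather than per call.
_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    norm_pred = normalize_answer(prediction)
    return float(any(norm_pred == normalize_answer(gold) for gold in gold_answers))


def accuracy(scores: list[float]) -> float:
    """Mean of per-sample Exact Match scores; 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0
```

Per-position accuracy is then the mean of `exact_match` over the 50 samples evaluated at that position.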

Please design and execute experiments to investigate this research question. Document your experimental plan, run **FULL** experiments with 50 samples per position, and provide conclusions at different levels of detail.