Abstract: Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with four strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context independently of its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP embeds historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: **\[Comparison with agentic frameworks in future work\]** We have added a comparison with agentic frameworks such as TS-Reasoner in Section 9 (Discussion and Future Work).
**\[Cohesive guide on integrating and applying the proposed strategies\]** We have added a new Section 8 (Unifying the Proposed Strategies), which discusses how the strategies can be integrated cohesively into a framework for practitioners seeking to deploy systems in the real world.
**\[Detailed discussion on the human verification of ground truth reasoning traces\]** We have added a detailed discussion of how each ground truth reasoning trace was verified in App. C.4.1, including a list of each task whose reasoning trace was modified.
**\[Manual reasoning quality evaluation with human judges\]** To validate the quality of the LLM-judge evaluation of reasoning correctness, we ran a human evaluation experiment in which human judges were asked to compare a ground truth reasoning trace with a model's reasoning trace. We detail the procedure and discuss the correlation with the LLM-judge evaluation in App. C.6, along with the agreement statistics for each model under study.
**\[Multiple strategies in tandem\]** We ran two sets of experiments that demonstrate the complementary nature of the proposed strategies, and the potential value in combining them:
1. We combined IC-DP and CorDP, where the model is expected to perform a forecast correction and is further provided with a historical in-context example of a forecast correction. The results of this experiment are in App. D.3.
2. We used RouteDP with IC-DP and CorDP in two separate experiments, where the downstream model performs IC-DP or CorDP instead of DP, respectively. Here, routing is similarly performed with predicted difficulty scores from the router model, and the routing curve is compared with the respective random and ideal routing curves. We present and discuss these results in App. F.4.
**\[Experiments with closed-source LLMs\]** We conducted experiments with the IC-DP and CorDP methods using two closed-source LLMs, namely GPT-4o and GPT-4o-mini. We have added these results to the respective tables (Tables 2 and 3) and figures (Figure 2) concerning each method in the main text and appendix. The respective paragraphs interpreting the results were also updated to reflect the new results.
**\[Inference Time and Cost\]** We have added the inference time and cost of all experiments in the paper to Appendix H.
Assigned Action Editor: ~Yan_Liu1
Submission Number: 5626