Abstract: Real-world forecasting requires models to integrate not only historical data but also relevant contextual information provided in textual form. While large language models (LLMs) show promise for context-aided forecasting, critical challenges remain: we lack diagnostic tools to understand failure modes, performance remains far below its potential, and high computational costs limit practical deployment. We introduce a unified framework of four strategies that address these limitations along three orthogonal dimensions: model diagnostics, accuracy, and efficiency. Through extensive evaluation across model families from small open-source models to frontier models including Gemini, GPT, and Claude, we uncover both fundamental insights and practical solutions. Our findings span three key dimensions: diagnostic strategies reveal the “Execution Gap” where models correctly explain how context affects forecasts but fail to apply this reasoning; accuracy-focused strategies achieve substantial performance improvements of 25-50%; and efficiency-oriented approaches show that adaptive routing between small and large models can approach large model accuracy on average while significantly reducing inference costs. These orthogonal strategies can be flexibly integrated based on deployment constraints, providing practitioners with a comprehensive toolkit for practical LLM-based context-aided forecasting.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=CLuE0GL5xE
Changes Since Last Submission: # Changes post-rebuttal (May 2026):
All post-rebuttal changes appear in blue text in the PDF.
- CorDP (Sec 5.3). Promoted the task-type finding into an explicit practitioner selection criterion (SampleWise-CorDP for partial-RoI tasks; Median-CorDP for full-shape/constrained tasks).
- CorDP (Sec 5.2). Clarified that Median-CorDP and SampleWise-CorDP are mechanistically distinct and complementary, not combinable.
- RouteDP (Table 2). Added a note that percentage labels are nominal and actual task counts are floored.
- RouteDP (Sec 7.2 + App F.2). Clarified that the setup supports per-task routing via a score threshold, equivalent to the batch setting; added score-range analysis across router sizes.
- FxDP (Sec 4.1). Added a note that without FxDP, both steps are performed implicitly in a single forward pass, and that FxDP externalizes this process to make intermediate failures diagnosable.
- FxDP (Sec 4.3). Replaced Fig 3 with an improved version: the Execution Gap segment is now highlighted in orange (previously gray) to draw attention to the central finding, and bracket annotations below the x-axis explicitly group models and label the two key results.
- FxDP (Sec 4.3). Added the evaluation reliability finding (no incorrect explanation leads to an improved forecast) in the results section.
- Discussion and Future Work (Sec 9). Expanded on the connections between CorDP and tool-use agents, emphasized the need for more unconstrained benchmarks, and elaborated on future directions related to finetuning.
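
The two RouteDP notes above (floored task counts and per-task routing via a score threshold) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy scores, and the convention that a higher router score sends a task to the large model are all assumptions made for illustration.

```python
import math

def batch_route(scores, fraction):
    """Batch setting: send the floor(fraction * n) highest-scoring tasks
    to the large model. Returns the set of routed task indices."""
    k = math.floor(fraction * len(scores))  # nominal percentage, floored count
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def threshold_for(scores, fraction):
    """Score threshold that reproduces the batch selection (assuming no ties)."""
    k = math.floor(fraction * len(scores))
    if k == 0:
        return math.inf  # nothing is routed to the large model
    return sorted(scores, reverse=True)[k - 1]

def route_one(score, threshold):
    """Per-task setting: decide for a single task without seeing the batch."""
    return score >= threshold

# Hypothetical router scores for 7 tasks (assumed: higher = harder task).
scores = [0.91, 0.12, 0.55, 0.78, 0.33, 0.64, 0.05]
fraction = 0.30  # nominal "30%" label; floor(0.30 * 7) = 2 tasks actually routed

batch = batch_route(scores, fraction)
tau = threshold_for(scores, fraction)
per_task = {i for i, s in enumerate(scores) if route_one(s, tau)}

print(batch, per_task, tau)
```

With 7 tasks and a nominal 30% budget, only 2 tasks are routed, and thresholding each score at `tau` selects the same 2 tasks as the batch ranking.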
# Initial submission (Feb 2026):
In this submission we address all major concerns from the previous submission through new human evaluation studies that validate our ground truth and confirm agreement between our LLM judges and human annotators, a unified presentation of our strategies, and updated results with recent frontier models.
## Major changes
* **FxDP: Ground truth forecast effects.** We ran a human evaluation on the Prolific platform to verify the ground truth forecast effects. Each task was evaluated by five independent annotators; we aggregated responses by majority vote and used attention checks to ensure annotation quality. At least 4 of the 5 annotators agreed on every task, with 80–100% agreement per item, which validates the gold standard. Further details are given in Appendix C.2.
* **FxDP: LLM judges.** We compare models’ forecast effects to the ground truth using a panel of LLM judges (GPT-5.2, Claude-Sonnet-4.5, Gemini-2.5-Pro). All pairs of judges show high agreement, which indicates that the evaluation protocol provides clear, consistent, and objective criteria.
* **FxDP: Human–LLM alignment.** To verify that LLM judge results align with human judgment, we designed a human evaluation on a subset of items. Five independent annotators per item performed the same comparison task as the judges, with majority-vote aggregation and attention checks, following standard annotation practice. Agreement between human evaluators and the LLM judges is 91.7–93.3% (Cohen’s κ = 0.81–0.84), with high inter-annotator and inter-judge agreement; see Appendix C.3. We open-source all evaluation artifacts, and our original findings remain unchanged.
* **FxDP: Threshold analysis.** We use a 50% relative improvement threshold in RoI as the primary criterion and report sensitivity analyses at 30%–80% in Appendix C.4. We show that our findings hold across this range of thresholds.
* **FxDP, CorDP, IC-DP.** We updated results for GPT-5.2, Gemini-2.5-Pro, and Claude-Sonnet-4.5. We also report costs and inference times for all of these models.
* **Unification.** We present a unified view of our proposed strategies in a new Figure 1 and report results with multiple strategies combined.
* **Supplementary material.** We provide code for all proposed methods.
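
For readers unfamiliar with the agreement statistics cited in the FxDP items above, the following is a minimal Python sketch of majority-vote aggregation and Cohen's κ between the aggregated human labels and one LLM judge. The labels and data are toy values invented for illustration, not the paper's annotations, and the helper names are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among one item's annotators."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c]                   # chance agreement
              for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy data: 4 items, 5 hypothetical annotators each.
annotations = [
    ["match", "match", "match", "match", "mismatch"],
    ["mismatch", "mismatch", "mismatch", "mismatch", "mismatch"],
    ["match", "match", "match", "match", "match"],
    ["mismatch", "mismatch", "match", "mismatch", "mismatch"],
]
human = [majority_vote(item) for item in annotations]
judge = ["match", "mismatch", "match", "match"]  # hypothetical LLM judge labels

kappa = cohens_kappa(human, judge)
print(human, kappa)
```

In this toy case the raters agree on 3 of 4 items (75% raw agreement), while κ is 0.5 because half of that agreement is expected by chance given the label frequencies.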
## Minor changes
* **Renaming ReDP to FxDP:** We renamed ReDP (Direct Prompting with Reasoning over Context) to FxDP (Direct Prompt with Forecast Effect Explanation) to reflect our focus on the model’s perceived effect of the context on the forecast rather than full step-by-step reasoning traces. We now call the outputs “forecast effect explanations.” The methodology and main results are unchanged.
* **FxDP:** We add Figure 2, which summarizes the major results, and add the full results table in the appendix.
Assigned Action Editor: ~Zachary_B._Charles1
Submission Number: 7655