Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking
Keywords: LLM forecasting, leakage-aware benchmarking, real-time nowcasting, macro retrieval, equity factor ranking
TL;DR: We show that real-time CPI nowcasts are crucial for leakage-controlled LLM factor-ranking benchmarks, and that a kNN macro-analog baseline recovers much of the median signal while the LLM’s marginal value appears in extreme rankings.
Abstract: Forecasting benchmarks for retrieval-augmented LLMs routinely confound model capability with information leakage: features labeled with a target’s timestamp are often not observable at the system’s decision time. We study leakage-controlled equity factor ranking with a retrieval-augmented 7B open-source LLM forecaster. At each month-end from 2023-04 to 2026-03, the forecaster observes only decision-time information: lag-shifted FRED macro variables, recent macro-event summaries, and the Cleveland Fed’s archived daily CPI nowcast for unreleased current-month inflation. A macro-analog retrieval module selects historical states, a critic LLM compresses them into one tactical rule, and an actor LLM maps the current state and recent rules into scores for seven U.S. equity style factors. The full pipeline obtains a median monthly Spearman rank IC of +0.154, with positive mean IC in each of three non-overlapping contiguous 12-month subwindows; the test of the mean IC is nonetheless statistically underpowered, and its bootstrap 95% confidence interval includes zero. Non-LLM baselines under the same decision-time constraint show that a kNN macro-analog model recovers a comparable median IC, indicating that real-time inflation information and macro-similar retrieval explain much of the median signal. The LLM pipeline retains a higher mean IC and a stronger long-short allocation sanity check, suggesting that any marginal benefit is concentrated in the extreme rankings that drive long-short portfolio formation. A descriptive audit of the 36 critic rules and per-month case studies appears in the appendix.
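For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of a monthly Spearman rank IC and a bootstrap confidence interval for its mean. It is not the authors' implementation: the function names, the i.i.d. percentile bootstrap, the number of resamples, and the handling of constant rankings are all assumptions made for illustration, and the abstract does not specify which bootstrap variant was used.

```python
import random
import statistics


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_ic(scores, realized_returns):
    """Spearman rank IC: Pearson correlation between the two rank vectors."""
    rx, ry = average_ranks(scores), average_ranks(realized_returns)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant ranking carries zero IC
    return cov / (vx * vy) ** 0.5


def bootstrap_mean_ci(monthly_ics, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean monthly IC.

    Assumption: months are resampled i.i.d. with replacement; a block
    bootstrap may be preferable if monthly ICs are serially correlated.
    """
    rng = random.Random(seed)
    n = len(monthly_ics)
    means = sorted(
        statistics.fmean(rng.choices(monthly_ics, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Under these assumptions, `spearman_ic` would be called once per month-end on the seven factor scores and the seven subsequently realized factor returns, and `bootstrap_mean_ci` on the resulting 36 monthly values. With so few months, an interval that straddles zero is consistent with the underpowered mean described in the abstract.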
Submission Number: 178