Keywords: Stock Movement Prediction, Large Language Model, Inference-Time Scaling, Reinforcement Learning, Self-Supervised Fine-Tuning
TL;DR: RETuning boosts LLMs’ independent reasoning for stock prediction using analytical evidence frameworks and a new large-scale dataset.
Abstract: Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities and inference-time scaling on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing the reasoning responses of existing LLMs, observe that:
(1) LLMs are easily swayed by contextual viewpoints, tending to follow analysts' opinions rather than exhibiting systematic, independent analytical logic in their chains of thought (CoTs).
(2) LLMs often list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction.
These observations indicate that the model does not make good use of its reasoning ability to complete the task.
To address this, we propose **R**eflective **E**vidence **Tuning** (**RETuning**), a cold-start method applied prior to reinforcement learning, to enhance the model's prediction ability. While generating the CoT, **RETuning** encourages the model to dynamically construct an analytical framework from diverse information sources, to organize and score evidence for a price rise or fall against that framework rather than against contextual viewpoints, and finally to reflect on the scored evidence to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from the context.
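The page does not include an implementation; as a rough sketch of the three-stage flow just described (framework construction, evidence scoring, reflection), the following Python outline may help. The prompts, the injected `llm` callable, and all names are illustrative assumptions, not the authors' code.

```python
from typing import Callable

LABELS = ("up", "hold", "down")

def retuning_predict(llm: Callable[[str], str], context: str) -> str:
    """Hypothetical RETuning-style inference loop (all prompts illustrative)."""
    # Stage 1: derive an analytical framework from the raw context
    # (prices, news, fundamentals, ...) without stating a prediction,
    # so the model does not simply adopt an analyst's conclusion.
    framework = llm(
        "From the following stock context, list the analytical dimensions "
        "relevant to the next price movement (e.g., momentum, valuation, "
        "news sentiment). Do NOT predict yet.\n\n" + context
    )

    # Stage 2: organize and score evidence for BOTH directions against
    # the framework, forcing adversarial evidence to be weighed.
    evidence = llm(
        "Using this framework:\n" + framework +
        "\n\nCollect and score (0-10) the strongest evidence FOR a price "
        "rise and FOR a price fall, based on the framework rather than on "
        "analysts' stated opinions.\n\n" + context
    )

    # Stage 3: reflect on the scored evidence and commit to one label.
    answer = llm(
        "Reflect on the scored evidence below, check its consistency with "
        "the framework, and answer with exactly one word "
        "(up / hold / down).\n\n" + evidence
    )
    label = answer.strip().lower()
    return label if label in LABELS else "hold"  # conservative fallback
```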
We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks.
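For concreteness, one sample in such a dataset might bundle the listed sources roughly as follows; every field name here is a hypothetical placeholder, not the dataset's actual schema.

```python
# Assumed shape of a single sample; each field mirrors a source named in
# the abstract (price, news, analyst opinions, quantitative reports,
# fundamentals, macroeconomic indicators, similar stocks).
sample = {
    "stock_code": "600519.SH",     # A-share ticker (illustrative)
    "date": "2024-06-03",
    "price_history": [...],        # OHLCV series
    "news": [...],                 # recent news articles
    "analyst_opinions": [...],     # sell-side commentary
    "quant_reports": [...],        # quantitative research summaries
    "fundamentals": {...},         # earnings / balance-sheet data
    "macro_indicators": {...},     # e.g., CPI, PMI, interest rates
    "similar_stocks": [...],       # peers for cross-sectional context
    "label": "up",                 # movement class: up / hold / down
}
```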
Experiments on this new dataset show that, as a cold-start method, **RETuning** successfully unlocks the model's reasoning ability in the financial domain. During reinforcement learning, response length steadily increases under the designed curriculum setting. Furthermore, inference-time scaling still works even six months later or on out-of-distribution stocks, since the model gains valuable insights into stock movement prediction.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20075