Abstract: Predicting stock movements from financial disclosures remains challenging because market signals are noisy and supervision is sparse. We construct a large-scale dataset of over 25,000 SEC filings (10-K and DEF 14A), aligned with daily stock prices and economic indicators for S&P 500 companies from 2000 to 2024. We formulate a three-class classification task (Up, Down, Stable) based on a 7-day input window, and compare model performance under two regimes: an unbalanced setting with a ±2% stability threshold, and a more balanced one with a ±0.5% threshold. Deep models such as GRUs and Transformers tend to collapse to the majority class, while XGBoost performs best in the unbalanced setting and SGD with an RBF kernel performs best in the balanced setting. We also incorporate a Retrieval-Augmented Generation (RAG) chatbot for querying filings and generating grounded explanations. Our results highlight the robustness of combining traditional models with static textual features for financial trend prediction and document understanding.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: Information Retrieval and Text Mining, Interpretability and Analysis of Models for NLP, NLP Applications, Generation, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English
Keywords: Information Retrieval and Text Mining, Interpretability and Analysis of Models for NLP, NLP Applications, Generation, Machine Learning for NLP
Submission Number: 451
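
The three-class labeling described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the price change is measured, so the simple relative return between two prices used here is an assumption, and the function name `label_movement` and its parameters are hypothetical. Only the class names and the ±2% and ±0.5% thresholds come from the abstract.

```python
def label_movement(start_price: float, end_price: float, threshold: float = 0.02) -> str:
    """Assign Up / Down / Stable from a relative price change.

    Assumption: the change is the simple return between two prices;
    the abstract does not specify the exact horizon or return definition.
    threshold=0.02 mirrors the unbalanced setting, 0.005 the balanced one.
    """
    if start_price <= 0:
        raise ValueError("start_price must be positive")
    change = (end_price - start_price) / start_price
    if change > threshold:
        return "Up"
    if change < -threshold:
        return "Down"
    # Changes within [-threshold, +threshold] count as Stable.
    return "Stable"


# The same 1% drop falls in a different class under each threshold.
wide = label_movement(100.0, 99.0, threshold=0.02)
narrow = label_movement(100.0, 99.0, threshold=0.005)
```

Under the wider threshold most small moves land in the Stable class, which is consistent with the abstract's description of that setting as unbalanced; narrowing the band moves more samples into Up and Down.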