FinReflectKG - MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence

Published: 21 Nov 2025, Last Modified: 14 Jan 2026 | GenAI in Finance Poster | CC BY 4.0
Keywords: Knowledge Graphs, Large Language Models, Information Retrieval, SEC Filings, Financial Benchmark Dataset, Natural Language Processing, Multi-Hop Question Answering, Financial AI
TL;DR: We introduce FinReflectKG - MultiHop, a knowledge-graph-guided multi-hop QA benchmark over S&P 100 filings, showing that precise KG-based retrieval enhances multi-hop reasoning accuracy and efficiency for LLMs on complex finance questions.
Abstract: Multi-hop reasoning over financial disclosures is often a retrieval problem before it becomes a reasoning or generation problem: relevant facts are dispersed across sections, filings, companies, and years, and LLMs often expend excessive tokens navigating noisy context. Without precise Knowledge Graph (KG)-guided selection of relevant context, even strong reasoning models either fail to answer or consume excessive tokens, whereas KG-linked evidence enables models to focus their reasoning on composing already-retrieved facts. We present FinReflectKG - MultiHop, a benchmark built on FinReflectKG, a temporally indexed financial KG that links audited triples to source chunks from S&P 100 filings (2022-2024). Mining frequent 2-3 hop subgraph patterns across sectors (via the GICS taxonomy), we generate financial-analyst-style questions with exact supporting evidence from the KG. A two-phase pipeline first creates QA pairs via pattern-specific prompts, then applies a multi-criteria quality-control evaluation to ensure QA validity. We evaluate three controlled retrieval scenarios: (S1) precise KG-linked paths; (S2) text-only page windows centered on relevant text spans; and (S3) relevant page windows with randomization and distractors. Across both reasoning and non-reasoning models, KG-guided precise retrieval yields substantial gains on the FinReflectKG - MultiHop QA benchmark, boosting correctness scores by ~24% while reducing token utilization by ~84.5% compared to the page-window setting, which reflects the traditional vector-retrieval paradigm. Spanning intra-document, inter-year, and cross-company scopes, our work underscores the pivotal role of knowledge graphs in efficiently connecting evidence for multi-hop financial QA. We also release a curated subset of the benchmark (555 QA pairs) to catalyze further research. Dataset link: https://github.com/finreflectkg/finreflectkg-multihopqa
Submission Number: 66