Multi-Modal Understanding of FOMC Press Conferences for Question Generation via Visual and Textual Cues

Khaled Alnuaimi, Mohamad Alansari, Mohammed Salah, Hasan Al Marzouqi, Andreas Henschel

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · IEEE Access · CC BY-SA 4.0
Abstract: Federal Open Market Committee (FOMC) press conferences are critical information channels through which monetary policy decisions impact financial markets. In the FOMC context, Question Generation (QG) plays an important role in probing economic outlooks and policy intentions. Traditional analyses of FOMC press conferences have focused solely on textual content, even though visual features such as facial expressions and gestures encode valuable, complementary signals. To address this limitation, this work leverages Vision-Language Models (VLMs) for enhanced financial QG by jointly modeling the textual and visual modalities. To support this approach, we introduce FOMC-QA, a large-scale multi-modal dataset comprising 40 hours of press conference video segments aligned with their corresponding transcripts, context paragraphs, and audience questions. Using this dataset, state-of-the-art (SOTA) VLMs (Sa2VA, VideoGLaMM, and Video-ChatGPT) are rigorously benchmarked against their text-only Large Language Model (LLM) counterparts (Qwen2.5, Phi-3 Mini, and Vicuna). The results show that the VLMs outperform their text-only counterparts on both semantic similarity and question relevance, highlighting the benefit of visual grounding. The dataset is publicly released for replicability at: https://www.kaggle.com/datasets/gautiermarti/fomc-press-conferences-qa-evasive-answers
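As a rough illustration of the semantic-similarity evaluation mentioned in the abstract, the sketch below scores a generated question against a reference audience question using cosine similarity over sentence embeddings. The sentence-transformers library, the all-MiniLM-L6-v2 model, and the example questions are assumptions for illustration only, not the paper's exact evaluation setup.

```python
# Minimal sketch: score a generated question against a reference audience
# question with embedding cosine similarity. The embedding model and library
# are illustrative assumptions, not the paper's reported evaluation pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = ("How does the Committee weigh recent inflation readings "
             "against signs of a cooling labor market?")
generated = ("Given the latest inflation data, how is the FOMC balancing "
             "price stability with labor market conditions?")

# Encode both questions and compute cosine similarity in [-1, 1];
# higher values indicate closer semantic agreement.
emb_ref, emb_gen = model.encode([reference, generated], convert_to_tensor=True)
score = util.cos_sim(emb_ref, emb_gen).item()
print(f"Semantic similarity: {score:.3f}")
```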