OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Document Archive

23 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM
Abstract: The opioid crisis is a serious public health issue that calls for innovative solutions for effective analysis and deeper understanding. Despite the vast amount of data in the Opioid Industry Documents Archive (OIDA), the complexity, multimodal nature, and specialized characteristics of healthcare data demand more advanced methods and models tailored to specific data types and detailed annotations to ensure precise and professional analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k documents for training and 10k for testing. We extract rich multimodal information from each document, including textual, visual, and layout information, to capture a wide range of features. From this densely extracted information, we collect a comprehensive dataset of over 3 million question-answer pairs with the assistance of multiple AI models. We further develop domain-specific Large Language Models (LLMs) and investigate the impact of multimodal data on task performance. Our benchmarking and modeling efforts aim to produce an AI assistant system that can efficiently process the archive and extract valuable insights. Preliminary results show that our AI assistant improves document information extraction and question answering, highlighting the effectiveness of the proposed benchmark in addressing the opioid crisis. The data and models will be made publicly available for research.
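
To make the described benchmark structure concrete, the sketch below shows one plausible way a single OIDA document and its generated question-answer pairs might be represented, combining the textual, visual, and layout modalities mentioned in the abstract. This is a hypothetical schema for illustration only; the field names (`LayoutBox`, `QAPair`, `DocumentRecord`, etc.) and the example values are assumptions, not the authors' actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical record layout for one OIDA benchmark document.
# Field names are illustrative; they are not taken from the paper.

@dataclass
class LayoutBox:
    text: str                         # OCR text inside the region
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) pixel coordinates
    label: str                        # e.g. "title", "table", "paragraph"

@dataclass
class QAPair:
    question: str
    answer: str
    source: str                       # which model/annotation pass produced it

@dataclass
class DocumentRecord:
    doc_id: str
    attributes: Dict[str, object]     # document type, year, custodian, etc.
    page_images: List[str]            # rendered page images (visual modality)
    full_text: str                    # concatenated OCR text (textual modality)
    layout: List[LayoutBox]           # region-level layout annotations
    qa_pairs: List[QAPair] = field(default_factory=list)

# Example: one training instance pairing the multimodal document
# context with a generated question-answer pair (values are made up).
doc = DocumentRecord(
    doc_id="oida-000001",
    attributes={"doc_type": "memo", "year": 2003},
    page_images=["oida-000001_p1.png"],
    full_text="...",
    layout=[LayoutBox("Internal Memo", (50, 40, 560, 90), "title")],
    qa_pairs=[QAPair("What type of document is this?", "An internal memo.", "llm-annotator")],
)
```

A record like this would let a domain-specific LLM be trained either on text alone or with the layout and image fields attached, which is the kind of modality ablation the abstract alludes to when it mentions investigating the impact of multimodal data on task performance.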
Supplementary Material: pdf
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3167