Abstract: While Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities across a wide range of general tasks, developing domain-specific models remains an urgent need, and doing so requires fine-tuning with high-quality data. Extracting question-answer (QA) pairs from domain knowledge documents is a crucial prerequisite for fine-tuning, especially when these documents contain complex charts and unstructured data. This paper introduces a framework for extracting QA pairs from charts that combines OCR and a VLM to fully leverage the contextual information in the documents and generate deep, contextually relevant questions. Our method first uses OCR for initial chart recognition, followed by a VLM-based "Caption-Reflection" paradigm to reduce misrecognition. We then design a context localization module that combines local and global contexts to generate relevant QA pairs. Additionally, we construct a specialized chart knowledge base to guide QA pair generation. Experimental results show significant improvements in both chart recognition performance and QA pair generation quality: average precision increases from 0.52 to 0.90, average F1 score improves from 0.64 to 0.94, and the average false positive rate (FPR) drops from 0.55 to 0. The generated QA pairs also perform well in terms of statistical relevance and contextual consistency.
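As a rough illustration of the four-stage pipeline the abstract describes (OCR recognition, Caption-Reflection, context localization, knowledge-base-guided QA generation), the sketch below wires the stages together in Python. Every name and stub body here (run_ocr, CaptionReflectVLM, locate_context, generate_qa) is a hypothetical placeholder standing in for the paper's components; the authors' actual interfaces are not specified in the abstract.

    def run_ocr(chart_image):
        # Placeholder: an OCR engine would return raw labels/values read off the chart.
        return "OCR text extracted from chart"

    class CaptionReflectVLM:
        # Placeholder VLM for the "Caption-Reflection" paradigm: draft a caption,
        # then reflect on it against the OCR output to correct misrecognitions.
        def caption(self, chart_image, ocr_hint):
            return f"Draft caption conditioned on OCR hint: {ocr_hint}"

        def reflect(self, chart_image, draft_caption, ocr_hint):
            return draft_caption + " (revised after reflection)"

    def locate_context(document, chart_id):
        # Placeholder context localization: pair the text surrounding the chart
        # (local context) with document-level information (global context).
        return document["local"][chart_id], document["global"]

    def generate_qa(caption, local_ctx, global_ctx, knowledge_base):
        # Placeholder generation step: in practice an LLM prompt would combine
        # the verified caption, both contexts, and chart-knowledge-base guidance.
        question = f"What does the chart show about {local_ctx}?"
        answer = f"{caption} | grounded in: {global_ctx} | guided by: {knowledge_base}"
        return [(question, answer)]

    def extract_qa_pairs(document, charts, vlm, knowledge_base):
        qa_pairs = []
        for chart_id, chart_image in charts.items():
            ocr_text = run_ocr(chart_image)                       # stage 1: OCR recognition
            draft = vlm.caption(chart_image, ocr_text)            # stage 2a: caption
            verified = vlm.reflect(chart_image, draft, ocr_text)  # stage 2b: reflection
            local_ctx, global_ctx = locate_context(document, chart_id)  # stage 3
            qa_pairs += generate_qa(verified, local_ctx, global_ctx, knowledge_base)  # stage 4
        return qa_pairs

    # Example invocation with toy inputs (all values illustrative):
    vlm = CaptionReflectVLM()
    doc = {"local": {"fig1": "quarterly revenue"}, "global": "2024 annual report"}
    pairs = extract_qa_pairs(doc, {"fig1": None}, vlm, "chart knowledge base")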