Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong visual question answering capabilities. However, they still struggle to align their rationales with their generated answers, leading to inconsistent reasoning and incorrect responses. To address this, this paper introduces the Self-Rationale Calibration (SRC) framework, which iteratively calibrates the alignment between rationales and answers. SRC begins with a lightweight “rationale fine-tuning” step that modifies the model’s response format so that it produces a rationale before deriving the answer, without requiring explicit prompts. Next, SRC samples a diverse set of candidate responses from the fine-tuned LVLM for each sample and applies a pairwise scoring strategy, using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of the candidates. Through a confidence-weighted preference curation process, SRC casts the alignment calibration as preference fine-tuning, yielding significant improvements in the perception, reasoning, and generalization of LVLMs across multiple benchmarks. Our results underscore the importance of rationale-oriented alignment in unlocking the potential of LVLMs.
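To make the pipeline concrete, the following is a minimal sketch of one SRC calibration round, assuming placeholder interfaces: `lvlm.generate`, `r_scorer.compare`, and `lvlm.preference_finetune` are hypothetical names introduced here for illustration, and the win-counting and confidence-weighting scheme is one plausible reading of the abstract, not the authors' actual implementation.

```python
import itertools
from collections import defaultdict

def src_iteration(lvlm, r_scorer, dataset, k=8):
    """One hypothetical SRC round: sample rationale-answer candidates,
    score them pairwise with R-Scorer, curate confidence-weighted
    preference pairs, then run preference fine-tuning."""
    preference_pairs = []
    for sample in dataset:
        # 1) Sample k diverse rationale-then-answer candidates from the
        #    rationale-fine-tuned LVLM (placeholder API).
        candidates = [lvlm.generate(sample.image, sample.question,
                                    temperature=1.0) for _ in range(k)]

        # 2) Pairwise scoring: R-Scorer judges rationale quality and
        #    factual consistency for each ordered pair (placeholder API
        #    returning the win probability of the first candidate).
        wins = defaultdict(float)
        for a, b in itertools.permutations(range(k), 2):
            p_a_wins = r_scorer.compare(sample, candidates[a], candidates[b])
            wins[a] += p_a_wins
            wins[b] += 1.0 - p_a_wins

        # 3) Confidence-weighted curation: keep the top- and bottom-ranked
        #    candidates as a preference pair, weighted by their score gap
        #    (one plausible scheme; the paper's exact weighting may differ).
        ranked = sorted(wins, key=wins.get, reverse=True)
        chosen, rejected = ranked[0], ranked[-1]
        confidence = (wins[chosen] - wins[rejected]) / (2 * (k - 1))
        preference_pairs.append((sample, candidates[chosen],
                                 candidates[rejected], confidence))

    # 4) Decoupled alignment calibration via preference fine-tuning
    #    (e.g., a DPO-style objective; placeholder API).
    lvlm.preference_finetune(preference_pairs)
    return lvlm
```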
Lay Summary: This work introduces Self-Rationale Calibration (SRC), a framework that improves Large Vision-Language Models (LVLMs) by addressing the misalignment between their generated rationales and answers, which often leads to reasoning errors. SRC first fine-tunes LVLMs to explicitly output rationales before answers, then generates diverse rationale-answer candidates. A tailored scoring model, R-Scorer, evaluates these candidates for rationale quality and factual consistency, informing a confidence-weighted preference fine-tuning process. This iterative calibration significantly enhances LVLM perception, reasoning, and generalization, underscoring the importance of rationale-answer alignment.
Primary Area: Deep Learning->Foundation Models
Keywords: LVLM; Post-training
Submission Number: 1477