Abstract: With the surge in popularity of Text-to-Image (TTI) models, it has become crucial to quantify their reliability. This "reliability" is closely tied to how strictly these models adhere to a given prompt without generating incorrect or unnecessary details, also called "hallucinations". Although a lot of work has gone into classifying coarse-grained hallucinations, detecting and mitigating fine-grained, attribute-level hallucinations, such as those of colour, number, and position, remains largely unaddressed.
To this end, in this paper, our contribution is threefold:
(i) we first formalize our proposed definition of fine-grained hallucinations and describe their various types;
(ii) we then propose a modular framework for detecting hallucinations, including fine-grained ones, in TTI models; and
(iii) we propose a novel metric for quantifying these hallucinations.
Our pipelined framework for automatically detecting these attribute-level hallucinations consists of four sub-modules: (i) a detection and segmentation module; (ii) a dense captioning module, which generates captions for targeted regions of the image; (iii) a meta-model, built around an LLM, which cohesively reconstitutes the dense captions into a single meta-caption; and (iv) a tree-matching module, which computes targeted attribute-level metrics from the syntax trees of the input prompt and the generated meta-caption. Through extensive experiments on open-source TTI models and well-known datasets, we establish the efficacy and adaptability of our proposed methodology.
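To make the pipeline concrete, the sketch below shows one way the four sub-modules could compose. It is a minimal illustration, not the paper's implementation: every name and signature (detect_and_segment, dense_caption, reconstitute, tree_match) is a hypothetical placeholder.

```python
# Hypothetical skeleton of the four-module detection pipeline.
# Each stub marks where a concrete model (detector/segmenter, dense
# captioner, LLM, syntactic parser) would plug in.
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) bounding box
    mask: object                     # segmentation mask for the region
    caption: str = ""                # filled in by the dense captioner

def detect_and_segment(image) -> list[Region]:
    """(i) Detect objects and segment the image into targeted regions."""
    raise NotImplementedError

def dense_caption(image, regions: list[Region]) -> list[Region]:
    """(ii) Generate a caption for each detected region."""
    raise NotImplementedError

def reconstitute(regions: list[Region]) -> str:
    """(iii) Meta-model: an LLM fuses the region captions into one meta-caption."""
    raise NotImplementedError

def tree_match(prompt: str, meta_caption: str) -> dict[str, float]:
    """(iv) Compare the syntax trees of the prompt and the meta-caption and
    return per-attribute scores (e.g. colour, number, position)."""
    raise NotImplementedError

def attribute_hallucination_scores(image, prompt: str) -> dict[str, float]:
    regions = dense_caption(image, detect_and_segment(image))
    return tree_match(prompt, reconstitute(regions))
```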
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1965