Abstract: Despite the longstanding adage “an image is
worth a thousand words,” generating accurate
hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image-text data, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric
approach with ImageInWords (IIW), a carefully
designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains
compared to recent datasets (+66%) and GPT-4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show
that fine-tuning with IIW data improves these
metrics by +31% against models trained with
prior work, even with only 9k samples. Lastly,
we evaluate IIW models with text-to-image generation and vision-language reasoning tasks.
Our generated descriptions result in the highest fidelity images, and boost compositional
reasoning by up to 6% on ARO, SVO-Probes,
and Winoground datasets. We release the IIW-Eval benchmark with human judgement labels,
object and image-level annotations from our
framework, and existing image caption datasets
enriched via the IIW model.