PUMGPT: A Large Vision-Language Model for Product Understanding

ACL ARR 2024 June Submission3600 Authors

16 Jun 2024 (modified: 12 Aug 2024) · CC BY 4.0
Abstract: E-commerce platforms benefit from accurate product understanding, which enhances both user experience and operational efficiency. Traditional methods typically target isolated tasks such as attribute extraction or categorization, which makes them hard to adapt to evolving tasks and leaves them vulnerable to noisy data crawled from the internet. Current Large Vision-Language Models (LVLMs) lack domain-specific fine-tuning and thus fall short in precision and instruction following. To address these issues, we introduce **PumGPT**, the first e-commerce-specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes with a universal hallucination detection framework to obtain 663k high-quality samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce **PumBench**, a benchmark for evaluating product understanding across LVLMs. Our experiments show that PumGPT outperforms five open-source LVLMs and GPT-4V on product understanding tasks. Extensive analytical experiments further examine the sources of PumGPT's advantage, demonstrating the need for a specialized model in the e-commerce domain.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3600