Keywords: LLM, human behavior, content understanding, behavioral science, vision language models, large language models, memorability, video understanding, image understanding, language understanding, persuasion strategy, marketing, advertising, datasets, behavior in the wild
Abstract: Communication is defined as "*Who* says *what* to *whom* with *what* effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Despite carrying these signals, receiver behavior is often ignored when training vision-language models (VLMs). We show that training VLMs on receiver behavior improves their content-understanding abilities. Specifically, training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances their performance across a broad range of downstream content understanding tasks. We demonstrate this performance increase over 6 types of behavior and 46 different tasks covering image, video, text, and audio on 26 benchmark datasets, in both zero-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. We note that since receiver behavior, such as likes, comments, and replay graphs, is collected by default on the internet and requires no human annotation to be useful, the performance improvement obtained by training on this data is essentially a free lunch. We also release **BLIFT**, our **Behaviour-LLaVA IFT** dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve this. The dataset and code are available at [behavior-in-the-wild.github.io/behavior-llava](https://behavior-in-the-wild.github.io/behavior-llava).
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9805