Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging
Keywords: self-supervised learning, social media images, computer vision, human in the loop, visualization or interpretation of learned representations
TL;DR: A collection of 16,567 topical image posts from Instagram related to the anti-feminicide movement in Mexico was collected and analyzed using computer vision models and human-in-the-loop evaluation.
Abstract: Overview:
Social media images are never truly stand-alone; they are grouped to share a specific message spread across a user's feed. Understanding the message of these groups is increasingly important as more people get their information from social media, especially about critical news topics. Using images to understand messaging has been done qualitatively on smaller batches, particularly in the news context, but less has been done quantitatively on domain-specific images. To review the messaging structure of a larger number of topical images, 16,567 image posts from Instagram related to the anti-feminicide movement in Mexico were collected and analyzed. The analysis applied pretrained (ResNet50) and self-supervised (CLIP, BLIP-2’s embedding model) computer vision models to extract image feature embeddings, which were then clustered with a tuned density-based algorithm (HDBSCAN). Human-in-the-loop evaluation is also applied through a content analysis of the top images within each cluster to compare the various facets of representation within the image collection. Clustering shows that the embeddings are densely packed, representing visual overlap across the collection of topical images. Human-in-the-loop content analysis enabled a closer reading of the images, especially those containing text, and surfaced a range of topics including woman/male comparisons, accusers, life examples, domestic violence, gender violence, protest phrasing, and support or solidarity for a person or cause. The comparison showed that the best separation came from the CLIP model, which suggests that 9 clusters is the best cluster output for the data. It is important to note that the clusters show overlap, found in both the quantitative and qualitative evaluation; this overlap arguably reflects nuance within the image feature embeddings.
Human in the Loop Reasoning:
Organizing social movement knowledge as interdependent activities that represent a multitude of frames within clusters makes it possible to identify topics and themes in the content and explore similarities. The inductive content analysis used in this work groups data together through a process of abstraction to answer a research question using concepts, categories, or themes applied to the unit of analysis. Here, the inductive coding process follows a systematic, qualitative approach to build knowledge within the research. It draws on each image’s metadata: the image id, main hashtag, alternative text provided by Instagram, the posting account, the number of likes, the accompanying post comment, and the date it was posted. The content analysis is implemented in two stages: first, reviewing the actual contents of the image by examining its composition to identify icons or objects within the image, the background, foreground, focal point, style, frame, prominent color, and sentiment; and second, theorizing possible sentiment or bias introduced by the contents of the image and its design, as well as the potential influence of the post’s number of likes and the type of account (group, individual, organizational, etc.). Ultimately, this coding schema can support an understanding of how each image contributes to the overall representation of the collection of topical images.
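One way to picture the coding schema is as a single record combining the collected metadata with the human-assigned codes. This is a hypothetical sketch; the field names are illustrative labels derived from the metadata and composition elements listed above, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ImageRecord:
    # Collected metadata (per the fields listed above)
    image_id: str
    main_hashtag: str
    alt_text: str          # alternative text provided by Instagram
    account: str
    account_type: str      # e.g. "group", "individual", "organizational"
    likes: int
    post_comment: str
    date_posted: str
    # Codes added by the human annotator during content analysis
    icons_objects: list = field(default_factory=list)
    prominent_color: str = ""
    sentiment: str = ""

# Example record with made-up values
rec = ImageRecord("img_001", "#example", "protest placard", "some_account",
                  "organizational", 120, "example comment", "2021-03-08")
rec.icons_objects.append("placard")
```

Keeping the metadata and annotator codes in one structure makes it straightforward to compare clusters against the qualitative coding later.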
Future Work:
Future work for this research will create labels and annotate the data to identify clearer groups within it. Specifically, working with multimodal LLMs to label the data and comparing the results against a sample of human-annotated data could be a strong way to create clear groupings and analyze how each image relates to the others. These relations could be analyzed using semantic similarities connected in a graph structure, with further human-in-the-loop evaluation.
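The proposed graph of semantic similarities could look something like the sketch below: link two images whenever the cosine similarity of their embeddings exceeds a threshold. The threshold value and the toy embeddings are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def cosine_similarity_graph(embeddings, threshold=0.9):
    """Return an adjacency list: each index maps to indices of similar embeddings."""
    # Normalize rows so the dot product equals cosine similarity.
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = norms @ norms.T
    return {
        i: [j for j in range(len(sims)) if j != i and sims[i, j] >= threshold]
        for i in range(len(sims))
    }

# Toy example: two near-duplicate vectors and one orthogonal vector.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
graph = cosine_similarity_graph(emb)
```

On real CLIP embeddings, the connected components or communities of such a graph could then be read alongside the human-in-the-loop codes.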
Because this data was collected from Instagram, platform policies do not allow open sharing of the content. To make this work reproducible, it would be interesting to create synthetic data and run it through the same pipeline to see how the analysis compares, so that others can use it as a baseline for their own collections of topical images.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9677