Keywords: LLMs, reasoning, multimodal models, bias
TL;DR: LVLMs as reasoning annotators: labels plus rationales, calibrated confidence, and evidence pointers
Abstract: Large vision–language models (LVLMs) can function as algorithmic annotators by not only assigning labels to multimodal inputs but also generating structured reasoning traces that justify those labels. We introduce \textbf{Reasoning-as-Annotation (RaA)}, a paradigm in which an LVLM outputs a human-interpretable rationale, calibrated confidence, and evidence pointers alongside each label, effectively acting as both classifier and explainer. We evaluate RaA on bias detection in images using a curated dataset of $\sim$2,000 examples with human gold labels. Across closed- and open-source LVLMs, RaA preserves accuracy relative to black-box labeling while adding transparency: rationales were coherent and grounded in 75–90\% of cases, evidence pointers were auditable in 70–85\%, and confidence scores correlated with correctness ($r=0.60$–$0.76$). These results show that RaA is model-agnostic and maintains predictive quality while producing interpretable, auditable annotations. We position RaA as a scalable way to transform opaque labels into reusable reasoning traces for supervision and evaluation.
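For concreteness, a minimal sketch (not from the paper; all field names and values are illustrative assumptions) of how an RaA output, i.e. a label together with its rationale, calibrated confidence, and evidence pointers, could be represented as a structured record:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RaAAnnotation:
    """One hypothetical Reasoning-as-Annotation record: a label plus the
    reasoning trace that justifies it (field names are illustrative)."""
    label: str                  # predicted class, e.g. "biased" / "not_biased"
    rationale: str              # human-interpretable justification for the label
    confidence: float           # calibrated confidence in [0, 1]
    evidence_pointers: List[str] = field(default_factory=list)  # auditable references into the input

# Example record for a bias-detection annotation (values are made up):
example = RaAAnnotation(
    label="biased",
    rationale="The caption ties the depicted profession to a single gender.",
    confidence=0.82,
    evidence_pointers=["image_region: person_1", "caption_span: 'nurse ... she'"],
)
print(example)
```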
Track: Track 2: ML by Muslim Authors
Submission Number: 20