Take Another Look: Improving Information Extraction From Images With Multiple Encoders

10 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Multi-encoder learning, model averaging, computer vision, deep learning
Abstract: Unstructured image data are increasingly used across diverse applications, yet standard practices for extracting predictive features remain underexplored. We recast encoder choice as an ensemble problem and present three strategies for leveraging multiple pre-trained encoders to mitigate the model risk associated with relying on a single encoder. These strategies are feature union, which concatenates encoder features before model training, and two forms of model averaging, which combine the predictions of single-encoder models using either equal weights or weights chosen by regression. Across six prediction applications (house prices from exterior images, poverty rates from satellite imagery, breast cancer and pneumonia detection from chest X-rays, rice disease classification from leaf images, and facial age), our results show three key findings: (i) using multiple encoders consistently outperforms using a single encoder, with out-of-sample $R^2$ for house prices, for example, increasing from 15.01\% (best single encoder) to 24.1\% with feature union; (ii) model averaging reduces error rates across all classification tasks, for example from 13.31\% to 8.68\%, offering consistent gains over both single encoders and feature union; and (iii) multi-encoder methods mitigate model risk, as accuracy varies widely across individual encoders in different applications. These results demonstrate that using multiple encoders provides robust, high-performing pipelines for image-based prediction without requiring extensive task-specific fine-tuning.
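The three strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each encoder's output is already available as a NumPy feature matrix, uses closed-form ridge regression as a stand-in downstream model, and estimates the regression weights by unconstrained least squares on held-out predictions. All function names (`feature_union`, `fit_ridge`, `equal_weight_average`, `regression_weights`) are hypothetical.

```python
import numpy as np


def feature_union(feature_blocks):
    """Feature union: concatenate per-encoder feature matrices column-wise."""
    return np.hstack(feature_blocks)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Assumed stand-in for the downstream model; returns a predict function.
    """
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return lambda X_new: X_new @ coef + intercept


def equal_weight_average(preds):
    """Model averaging with equal weights over single-encoder predictions."""
    return np.mean(np.asarray(preds, dtype=float), axis=0)


def regression_weights(preds, y):
    """Model-averaging weights chosen by least squares of y on predictions.

    Assumption: `preds` are held-out (validation) predictions, so the
    weights are not fit on the data used to train the single-encoder models.
    """
    P = np.column_stack(preds)
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w


def weighted_average(preds, w):
    """Combine single-encoder predictions with the given weights."""
    return np.column_stack(preds) @ w
```

In this sketch, feature union trains one model on `feature_union([F_a, F_b])`, whereas the two averaging variants train one model per encoder and combine their predictions afterwards. The least-squares weights here are unconstrained; whether the weights should be non-negative or sum to one is a design choice the abstract does not specify.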
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3811