Abstract: Recent agentic language models increasingly accept raw camera pixels rather than tokenized text, underscoring the need for a unified perception paradigm. We explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also observe that scenarios with tightly intertwined visual–text cues benefit from the unified pixel view, reducing preprocessing overhead and ambiguity relative to split-modality baselines. PixelWorld therefore provides a compact yet challenging yardstick and encourages wider adoption of PEAP for holistic evaluation of next-generation vision–language agents.
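To make the PEAP idea concrete, below is a minimal sketch of the kind of rendering step the abstract describes: rasterizing a text prompt into an image so that a vision-language model perceives it as pixels rather than tokens. This is an illustrative assumption, not the PixelWorld implementation; the function name `render_text_as_pixels` and all parameters are hypothetical.

```python
# Hypothetical sketch of a PEAP-style rendering step: turn a text prompt
# into an image for a vision-language model. Not the PixelWorld code.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_as_pixels(prompt: str, width: int = 768,
                          line_height: int = 24, margin: int = 16) -> Image.Image:
    """Rasterize a text prompt onto a white canvas, wrapping long lines."""
    font = ImageFont.load_default()  # swap in a TTF font for realistic rendering
    wrapped = textwrap.wrap(prompt, width=80) or [""]
    height = margin * 2 + line_height * len(wrapped)
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(wrapped):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return canvas

# The resulting image would be passed to the VLM in place of a token sequence.
image = render_text_as_pixels("What is 17 * 24? Think step by step.")
image.save("peap_prompt.png")
```

Under this reading, tabular, mathematical, and diagrammatic inputs would be rendered analogously, so that all modalities share one pixel representation at evaluation time.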
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=uJIgM48Hrw
Changes Since Last Submission: Changed the fonts and spacing to meet TMLR formatting requirements.
Assigned Action Editor: ~Stephen_James1
Submission Number: 5316