Abstract: Recent agentic language models increasingly accept raw camera pixels rather than tokenized text, underscoring the need for a unified perception paradigm. We explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also observe that scenarios with tightly intertwined visual–text cues benefit from the unified pixel view, reducing preprocessing overhead and ambiguity relative to split-modality baselines. PixelWorld therefore provides a compact yet challenging yardstick and encourages wider adoption of PEAP for holistic evaluation of next-generation vision–language agents.
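To make the PEAP idea concrete, below is a minimal sketch of the kind of rendering step the abstract describes: rasterizing a text prompt into an image so that a vision-language model perceives it as pixels rather than tokens. This is an illustrative assumption, not the PixelWorld implementation; the function name `render_text_as_pixels` and all parameters are hypothetical.

```python
# Hypothetical sketch of a PEAP-style rendering step: turn a text prompt
# into an image for a vision-language model. Not the PixelWorld code.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_as_pixels(prompt: str, width: int = 768,
                          line_height: int = 24, margin: int = 16) -> Image.Image:
    """Rasterize a text prompt onto a white canvas, wrapping long lines."""
    font = ImageFont.load_default()  # swap in a TTF font for realistic rendering
    wrapped = textwrap.wrap(prompt, width=80) or [""]
    height = margin * 2 + line_height * len(wrapped)
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(wrapped):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return canvas

# The resulting image would be passed to the VLM in place of a token sequence.
image = render_text_as_pixels("What is 17 * 24? Think step by step.")
image.save("peap_prompt.png")
```

Under this reading, tabular, mathematical, and diagrammatic inputs would be rendered analogously, so that all modalities share one pixel representation at evaluation time.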
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=uJIgM48Hrw
Changes Since Last Submission: Changed the fonts and spacing to meet TMLR formatting requirements.
Assigned Action Editor: ~Stephen_James1
Submission Number: 5316