Keywords: Captcha;Web Agent;Multimodal Large Language Model
Abstract: The path to fully autonomous web agents is currently hindered by a critical bottleneck: their limited ability to handle CAPTCHAs. Existing agent benchmarks largely ignore this practical challenge and thus fail to assess an agent's actual ability to solve CAPTCHAs. To bridge this gap, we comprehensively analyze the distribution of CAPTCHAs in the real world and introduce the MirrorCAPTCHA benchmark, evaluated with Weighted Pass Rate and a newly proposed metric, Completion Degree. The benchmark is designed to serve as a ``mirror'' that accurately reflects agents' automation capabilities in real scenarios. We filter 2,095 websites from Common Crawl, identify their active CAPTCHA puzzles, and classify them into 18 distinct categories using K-means clustering. To ensure practicality, we extract a web subgraph from Common Crawl covering these websites and employ random walks to simulate real-world CAPTCHA encounter frequencies, yielding a realistic measure of agents' ability. Additionally, we develop a lightweight synthetic data pipeline to train a model, Ovis2-Agent-CAPTCHA-8B, which significantly outperforms current state-of-the-art closed-source models on the MirrorCAPTCHA benchmark, achieving a 9.4% higher average Weighted Pass Rate and a 2.13% higher average Completion Degree than the runner-up, Gemini-2.5-Pro.
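The abstract's random-walk weighting scheme could be sketched roughly as follows. This is a minimal illustrative sketch only: the toy graph, CAPTCHA-type labels, and function names are assumptions for demonstration, not the paper's actual pipeline or data.

```python
import random
from collections import Counter

# Toy web subgraph (node -> outgoing links); some pages serve a CAPTCHA type.
# Both the graph and the type labels below are purely illustrative.
GRAPH = {
    "home": ["login", "shop"],
    "login": ["home", "checkout"],
    "shop": ["checkout"],
    "checkout": ["home"],
}
CAPTCHA_TYPE = {"login": "slider", "checkout": "text-recognition"}

def encounter_frequencies(graph, captcha_type, steps=10_000, seed=0):
    """Estimate how often each CAPTCHA type is encountered via a random walk."""
    rng = random.Random(seed)
    node = rng.choice(sorted(graph))
    counts = Counter()
    for _ in range(steps):
        node = rng.choice(graph[node])          # follow a random outgoing link
        if node in captcha_type:                # landed on a CAPTCHA-gated page
            counts[captcha_type[node]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def weighted_pass_rate(freqs, per_type_pass_rates):
    """Aggregate per-type pass rates, weighted by encounter frequency."""
    return sum(f * per_type_pass_rates.get(t, 0.0) for t, f in freqs.items())

freqs = encounter_frequencies(GRAPH, CAPTCHA_TYPE)
wpr = weighted_pass_rate(freqs, {"slider": 0.8, "text-recognition": 0.5})
```

The idea is that CAPTCHA types an agent meets more often during realistic browsing contribute proportionally more to its aggregate score than rare ones.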
Primary Area: datasets and benchmarks
Submission Number: 12006