MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

Published: 27 Oct 2025 · Last Modified: 27 Oct 2025 · NeurIPS Lock-LLM Workshop 2025 Poster · CC BY 4.0
Keywords: web automation, large language models, benchmark, code generation, Selenium, Python, DOM-based interaction, safety evaluation, macro synthesis, vision-less agents, programmatic testing
TL;DR: We introduce MacroBench, a benchmark for evaluating LLMs on synthesizing reusable browser macros over HTML/DOM. Across 681 tasks on seven synthetic websites, four LLMs perform strongly on simple tasks but fail completely on complex workflows.
Abstract: We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize *reusable* browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium code. MacroBench instantiates seven self-hosted sites—Airbnb-like, TikTok-like, Reddit-like, Instagram-like, Facebook-like, Discord-like, and Threads-like—covering **681** tasks that span interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite covering scraping, spam/abuse, and credential/privacy prompts. Across **2,636** model-task runs, we observe stratified success rates: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at [https://github.com/hyunjun1121/MacroBench](https://github.com/hyunjun1121/MacroBench) to enable reproducible assessment of macro synthesis for web automation.
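To make the task format concrete, below is a minimal, illustrative sketch (not taken from the benchmark) of the kind of reusable Selenium macro the abstract describes: a parameterized Python function that reads the DOM, performs a goal-directed interaction, and ends with a DOM assertion of the kind used in outcome verification. The URL, CSS selectors, and function name are hypothetical placeholders; MacroBench's actual verification additionally checks database snapshots.

```python
# Hypothetical sketch of a reusable browser macro; selectors and URL are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def post_comment(driver: webdriver.Chrome, thread_url: str, comment_text: str) -> bool:
    """Navigate to a thread, submit a comment, and verify the outcome via a DOM assertion."""
    driver.get(thread_url)
    wait = WebDriverWait(driver, 10)

    # Locate the comment box by reading the DOM (hypothetical selector) and fill it in.
    comment_box = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "textarea#comment-input"))
    )
    comment_box.clear()
    comment_box.send_keys(comment_text)
    driver.find_element(By.CSS_SELECTOR, "button#comment-submit").click()

    # Outcome verification: the newly posted comment should appear in the thread.
    posted = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.comment-list div.comment:last-child"))
    )
    return comment_text in posted.text


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        ok = post_comment(driver, "http://localhost:8000/threads/1", "Nice post!")
        print("verified" if ok else "not verified")
    finally:
        driver.quit()
```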
Submission Number: 55