Skills, Benchmarks, and Verification Are What AI-Assisted Research Needs

Published: 01 Mar 2026, Last Modified: 01 Mar 2026
Venue: P-AGI
License: CC BY 4.0
Track: Track 2: Socio-Economical and Future Visions
Keywords: ai4research, AI-assisted research, verification infrastructure, skills marketplace, test-driven research, jagged frontier, research workflows, benchmarks, citation verification, AI tooling
TL;DR: We propose the Research Agora (a skills marketplace, benchmarks, and test-driven research) as integrated infrastructure for verified AI-assisted research, and release 30+ working skills.
Abstract: AI adoption in research has outpaced our understanding of its limitations. Researchers have embraced these tools for everything from literature review to code generation, yet the jagged frontier—the uneven boundary where AI excels at some tasks and fails unpredictably at others—remains unmapped. We already see real-world consequences: papers with fabricated citations passed peer review at NeurIPS 2025, and the rising tide of "AI slop"—low-quality, machine-generated submissions—is straining review systems across venues. A preliminary survey of AI researchers echoes this pattern: while many actively use vanilla AI coding pipelines, robust and verified workflows for other parts of the scientific process, such as ideation, literature research, or experiment design, see relatively little adoption and development. This suggests the path forward: build infrastructure for discovery, comparison, and verification. We propose the Research Agora to close this gap—a marketplace for discovering reusable AI workflows (skills), benchmarks for comparing their effectiveness, and test-driven research for verifying outputs before they propagate. We have built working examples—from reference checking that catches hallucinated citations to structured writing with layered quality checks—demonstrating how different verification levels apply to different tasks. We release these as a starting point and call on the research community to extend, improve, and benchmark them.
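To make the reference-checking idea above concrete, a first verification layer could flag references whose DOI is missing or malformed before any external lookup. This is a minimal hypothetical sketch, not the released skills: the function name, the record shape, and the DOI regex are illustrative assumptions; a production check would additionally resolve each DOI against a registry such as Crossref.

```python
import re

# Rough DOI shape: "10." + 4-9 digit registrant code + "/" + suffix.
# This is an illustrative pattern, not the full DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def check_references(refs):
    """Flag references failing basic local checks.

    refs: list of dicts with a 'title' key and an optional 'doi' key
    (a hypothetical record shape chosen for this sketch).
    Returns a list of (title, reason) pairs for flagged entries.
    """
    flagged = []
    for ref in refs:
        doi = ref.get("doi")
        if doi is None:
            flagged.append((ref["title"], "missing DOI"))
        elif not DOI_PATTERN.match(doi):
            flagged.append((ref["title"], "malformed DOI"))
    return flagged
```

A fabricated citation often carries no resolvable DOI at all, so even this cheap syntactic pass catches a meaningful slice before escalating to slower, networked verification levels.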
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Patrik_Reizinger1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or they have sufficient alternate funding.
Submission Number: 25