Keywords: Large Language Models (LLMs), Multi-agent Systems, Self-Correction, LLM-As-A-Judge
Abstract: Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Submission Number: 324