Track: tiny / short paper (up to 4 pages)
Keywords: benchmarks, code generation, Jupyter notebooks
TL;DR: The Themisto benchmark reveals that LLMs struggle with Jupyter notebook tasks, exposing limitations in code output prediction and next-cell generation from runtime context.
Abstract: In this work, we present a benchmark that consists of Jupyter notebook development trajectories and measures how well large language models (LLMs) can leverage runtime information for predicting code output and for generating code. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating the runtime context into code models is a significantly understudied direction.
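To make the two tasks concrete, the following minimal Python sketch shows one plausible setup for the output-prediction task: prior cells and their captured runtime outputs are serialized into a prompt, and the model's prediction for the next cell's output is scored by exact match. The data structures and names (Cell, build_output_prediction_prompt) and the exact-match metric are illustrative assumptions; the abstract does not specify the benchmark's actual interface or scoring.

from dataclasses import dataclass

@dataclass
class Cell:
    source: str  # code of an already-executed notebook cell
    output: str  # runtime output captured during execution

def build_output_prediction_prompt(trajectory: list[Cell], next_cell: str) -> str:
    """Serialize prior cells plus their runtime outputs, then ask for
    the output of the next cell (hypothetical prompt format)."""
    parts = []
    for i, cell in enumerate(trajectory):
        parts.append(f"# Cell {i}\n{cell.source}\n# Output:\n{cell.output}")
    parts.append(f"# Next cell\n{next_cell}\n# Predict the output of the next cell:")
    return "\n\n".join(parts)

def exact_match(predicted: str, reference: str) -> bool:
    # One plausible metric; the benchmark's actual metric may differ.
    return predicted.strip() == reference.strip()

if __name__ == "__main__":
    trajectory = [Cell("x = [1, 2, 3]", ""), Cell("sum(x)", "6")]
    prompt = build_output_prediction_prompt(trajectory, "sum(x) * 2")
    print(prompt)  # this string would be sent to an LLM of choice
    print(exact_match("12", "12"))

The next-cell generation task can be framed analogously, with the model asked to produce the following cell's source code instead of its output, again conditioned on the runtime outputs of prior cells.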
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 63