WellLogBench: A Domain-Expert Curated Benchmark for Evaluating LLM Reasoning on Subsurface Well Log Data
Keywords: Well log interpretation, Petrophysics, Domain-specific benchmark, Large language models, Long-context reasoning, Context engineering, Tool-augmented reasoning, Multimodal evaluation
Abstract: Large language models (LLMs) have shown promising capabilities in text summarization, reasoning, and time-series analysis, offering significant potential to simplify workflows in the oil and gas industry by enhancing and accelerating subsurface exploration and development. However, their ability to reason over multivariate well-log data remains largely unexplored. Accurate interpretation requires integrating domain expertise with numerical reasoning over long, depth-indexed sequences, challenges unmet by existing NLP or tabular benchmarks. To the best of our knowledge, we present the first expert-curated benchmark focused on well log reasoning, using well log curves as the primary input modality. WellLogBench comprises 1,085 expert-annotated QA pairs built on well log data curated from multiple wells across three geological basins in different parts of the world. The benchmark spans diverse petrophysical categories and reasoning complexities, from single-formation to multi-well analysis. To address the ultra-long context of raw logs and the domain gap, we propose a context-engineering framework that transforms high-dimensional well log data into compact, petrophysics-rich context optimized for LLM processing. We evaluate state-of-the-art open- and closed-source models; the top performer (Gemini 2.5 Pro) achieves 74% on our composite metric, while the closest open-source alternative, Qwen3-235B-Instruct, reaches 57%, highlighting substantial room for improvement. WellLogBench establishes a rigorous foundation for developing and evaluating AI systems for petrophysics.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Resources and Evaluation, Language Modeling, Interpretability and Analysis of Models for NLP, Machine Learning for NLP, NLP Applications, Human-Centered NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data resources, Data analysis
Languages Studied: English
Submission Number: 7870