{
  "title": "SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables",
  "introduction": "In this paper, we propose a novel dataset SCITAB, which fulfills these stated criteria. It contains 1,225 challenging scientific claims, each demanding compositional reasoning for verification using scientific tables. Our data is derived from the SciGen dataset (Moosavi et al., 2021), a resource that includes scientific tables and claims crawled from arXiv.org. We first manually filter out the checkworthy scientific claims from the raw data. Following this, we employ a strategy of human–model collaboration, as depicted in Figure 2, to generate claims that are either contradicted or unverifiable based on the table’s content. Figure 1 shows a claim from SCITAB and the corresponding reasoning process to verify it. Compared with existing benchmarks, SCITAB is closer to real-world scientific fact-checking in terms of more realistic claims and table-based evidence. Through data analysis, we further show that the claims in SCITAB necessitate a more comprehensive and nuanced set of reasoning skills for verification, e.g., numerical reasoning and commonsense knowledge, etc.",
  "related_work": {
    "Scientific Fact-Checking Datasets": "Existing datasets for scientific fact-checking are summarized in a recent survey from Vladika and Matthes (2023). These datasets differ in: 1) domain: biology (Wadden et al., 2020; Akhtar et al., 2022), COVID-19 (Saakyan et al., 2021; Sarrouti et al., 2021; Mohr et al., 2022; Wang et al., 2023), and climate (Diggelmann et al., 2020), 2) claim creation: crowd-sourced claims v.s. natural claims, and 3) evidence source: Wikipedia articles (Diggelmann et al., 2020) or research papers (Wadden et al., 2020, 2022; Sarrouti et al., 2021). However, most of these datasets rely on text evidence to verify claims. SEM-TAB-FACTS (Wang et al., 2021) is the only existing dataset based on scientific tables, but it is limited to simple, crowd-sourced claims. To bridge this gap, we construct SCITAB which contains complex claims from authentic scientific papers with table-based evidence.",
    "Table-based Reasoning": "Table-based reasoning requires reasoning over both free-form natural language queries and (semi-)structured tables. Early works either rely on executable languages (e.g., SQL and SPARQL) to access the tabular data (Yin et al., 2016; Yu et al., 2018) or employ graph neural networks to capture logical structure in statements, e.g., LogicFactChecker (Zhong et al., 2020) and ProgVGAT (Yang et al., 2020). However, these approaches often struggle with generalization, as they are tightly bound to specific table formats and language patterns. To address this, we have seen a shift toward table pre-training, with the advent of TableBERT (Chen et al., 2020), TAPAS (Herzig et al., 2020), SaMoE (Zhou et al., 2022), PASTA (Gu et al., 2022), and DATER (Ye et al., 2023). These methods encode sentence-table pairs using language models and transform table-based reasoning into question-answering or natural language inference. In our work, we focus on evaluating pretraining-based methods on SCITAB because they not only demonstrate superior performance but also offer the benefits of few-shot learning."
  },
  "dataset": {
    "name": "SCITAB",
    "size": "1,225 claims",
    "source": "Derived from the SciGen dataset, which includes scientific tables and claims crawled from arXiv.org."
  },
  "sample_structure": {
        "paper": "Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction",
    "paper_id": "1812.11321v1",
    "table_caption": "Table 3: Ablation study of capsule net and word-level attention on Wikidata dataset.",
    "table_column_names": [
      "Recall",
      "0.1",
      "0.2",
      "0.3",
      "AUC"
    ],
    "table_content_values": [
      [
        "-Word-ATT",
        "0.648",
        "0.515",
        "0.395",
        "0.389"
      ],
      [
        "-Capsule",
        "0.635",
        "0.507",
        "0.413",
        "0.386"
      ],
      [
        "Our Model",
        "0.650",
        "0.519",
        "0.422",
        "0.405"
      ]
    ],
    "id": "3addae78-e7aa-4757-a6e9-dc12dac74787",
    "claim": "The results prove the effectiveness of word-level attention to exploit the local interactions in link prediction task.",
    "label": "NEI",
    "table_id": "33a87cb6-8a23-46f2-8fee-97555075aab1"
  }
}
