[
  {
    "id": "Title: SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables\n\nIntroduction: In this paper, we propose a novel dataset SCITAB, which fulfills these stated criteria. It contains 1,225 challenging scientific claims, each demanding compositional reasoning for verification using scientific tables. Our data is derived from the SciGen dataset (Moosavi et al., 2021), a resource that includes scientific tables and claims crawled from arXiv.org. We first manually filter out the checkworthy scientific claims from the raw data. Following this, we employ a strategy of human–model collaboration, as depicted in Figure 2, to generate claims that are either contradicted or unverifiable based on the table’s content. Figure 1 shows a claim from SCITAB and the corresponding reasoning process to verify it. Compared with existing benchmarks, SCITAB is closer to real-world scientific fact-checking in terms of more realistic claims and table-based evidence. Through data analysis, we further show that the claims in SCITAB necessitate a more comprehensive and nuanced set of reasoning skills for verification, e.g., numerical reasoning and commonsense knowledge, etc.\n\nRelated Work:\nScientific Fact-Checking Datasets: Existing datasets for scientific fact-checking are summarized in a recent survey from Vladika and Matthes (2023). These datasets differ in: 1) domain: biology (Wadden et al., 2020; Akhtar et al., 2022), COVID-19 (Saakyan et al., 2021; Sarrouti et al., 2021; Mohr et al., 2022; Wang et al., 2023), and climate (Diggelmann et al., 2020), 2) claim creation: crowd-sourced claims vs. natural claims, and 3) evidence source: Wikipedia articles (Diggelmann et al., 2020) or research papers (Wadden et al., 2020, 2022; Sarrouti et al., 2021). However, most of these datasets rely on text evidence to verify claims. 
SEM-TAB-FACTS (Wang et al., 2021) is the only existing dataset based on scientific tables, but it is limited to simple, crowd-sourced claims. To bridge this gap, we construct SCITAB which contains complex claims from authentic scientific papers with table-based evidence.\nTable-Based Reasoning: Table-based reasoning requires reasoning over both free-form natural language queries and (semi-)structured tables. Early works either rely on executable languages (e.g., SQL and SPARQL) to access the tabular data (Yin et al., 2016; Yu et al., 2018) or employ graph neural networks to capture logical structure in statements, e.g., LogicFactChecker (Zhong et al., 2020) and ProgVGAT (Yang et al., 2020). However, these approaches often struggle with generalization, as they are tightly bound to specific table formats and language patterns. To address this, we have seen a shift toward table pre-training, with the advent of TableBERT (Chen et al., 2020), TAPAS (Herzig et al., 2020), SaMoE (Zhou et al., 2022), PASTA (Gu et al., 2022), and DATER (Ye et al., 2023). These methods encode sentence-table pairs using language models and transform table-based reasoning into question-answering or natural language inference. In our work, we focus on evaluating pretraining-based methods on SCITAB because they not only demonstrate superior performance but also offer the benefits of few-shot learning.\n\nDataset Information:\nDataset Name: SCITAB\nSize: 1,225 claims\nDomains: ",
    "generated_response": "{\n  \"ideas\": [\n    {\n      \"title\": \"GroundedCell: Weakly Supervised Evidence Grounding for Table Fact-Checking\",\n      \"core_idea\": \"Train models to first select evidence cells and then verify claims, using weakly supervised cell-level rationales derived from number/entity alignment and model self-training. This separates grounding from inference, encouraging faithful reasoning over the actual table content rather than superficial cues.\",\n      \"hypothesis\": \"Explicit cell-level grounding, even with silver labels, improves robustness and accuracy on SCITAB, and reduces spurious correlations and hallucinations.\",\n      \"why_it_matters\": \"Grounded verification increases interpretability and trust, enabling step-wise diagnosis and error analysis while improving performance on compositional reasoning and unverifiable detection.\",\n      \"possible_methods\": [\n        \"Multitask fine-tuning of TAPAS/PASTA/DATER with an auxiliary cell-pointer head\",\n        \"Two-stage pipeline: bi-encoder row/column retrieval followed by cross-encoder grounding\",\n        \"Graph neural networks over table cells with header/cell/stub nodes and typed edges\",\n        \"Weak supervision via number/entity alignment, attention roll-out, and self-training for cell masks\",\n        \"Joint loss with Dice/Focal loss for masks, cross-entropy for labels, and consistency regularization\"\n      ],\n      \"experimental_design\": [\n        \"Environment: Python 3.10, PyTorch 2.x, Transformers 4.x; set seeds for reproducibility\",\n        \"Data prep: Convert SCITAB tables to a unified JSONL with cell coordinates, headers, units, and claim metadata\",\n        \"Silver rationale generation: align claim numerals/entities to table cells; expand with header context; filter by type/unit compatibility\",\n        \"Model: start from TAPAS/PASTA; add a token-to-cell pointer (sigmoid mask) and a classification head; share encoder\",\n        
\"Training: optimize combined loss (mask + label); apply curriculum from high-confidence silver masks to self-trained masks\",\n        \"Evaluation: report label accuracy/F1, cell-level precision/recall/F1, evidence sufficiency (coverage of all required cells)\",\n        \"Stress tests: row/column permutation, value swaps, header shuffles; measure label flips when evidence changes\",\n        \"Ablations: without grounding head; with gold vs silver vs none; with/without header context encoding\",\n        \"Release: training/evaluation scripts, config files, and a visualization tool for predicted evidence masks\"\n      ]\n    },\n    {\n      \"title\": \"Exec-PoT: Executable Programs-of-Thought for Scientific Tables\",\n      \"core_idea\": \"Represent claim verification as a compact program over a table-specific DSL (select, filter, group, aggregate, compare) and execute it to obtain a verdict. Use an LLM planner with execution guidance to generate valid, compositional programs that ground to table cells.\",\n      \"hypothesis\": \"Constraining reasoning to executable, grounded programs improves compositional generalization and provides faithful, debuggable rationales on SCITAB.\",\n      \"why_it_matters\": \"Executable intermediates reduce hallucination, expose precise failure modes, and enable step-level evaluation and training signals that go beyond final-label supervision.\",\n      \"possible_methods\": [\n        \"Design a minimal DSL mapped to Pandas operations with cell/row/column addressing\",\n        \"Derive silver programs via distant supervision (aligners, heuristics) and iteratively refine with EM-style bootstrapping\",\n        \"Execution-guided decoding and beam search to prune invalid or unsupported steps\",\n        \"Neural-symbolic training: sequence-to-program model with teacher forcing on silver programs\",\n        \"Program verification: runtime checks for index validity, unit compatibility, and non-empty selections\"\n      ],\n   
   \"experimental_design\": [\n        \"Define DSL operators and implement an executor backed by Pandas; include unit-safe arithmetic and comparisons\",\n        \"Create an alignment module to propose candidate columns/rows for entities and numerals mentioned in claims\",\n        \"Generate initial silver programs by enumerating short operator sequences that yield the gold verdict; cache execution traces\",\n        \"Model: fine-tune a seq2seq LLM (e.g., T5/BART) to predict programs; enforce schema-constrained decoding\",\n        \"Use execution-guided beam search: keep beams that execute without errors and move toward the correct label\",\n        \"Train with mixed supervision: gold labels + silver programs; add step-level losses for operator and argument prediction\",\n        \"Metrics: final label F1, program exact match, operator accuracy, execution success rate, cell coverage\",\n        \"Ablations: with/without execution guidance; free-form rationale vs program; different DSL granularities\",\n        \"Robustness: adversarial cell swaps and unit perturbations; measure whether programs adapt appropriately\"\n      ]\n    },\n    {\n      \"title\": \"CounterTab: Counterfactual Table Augmentation and Contrastive Training\",\n      \"core_idea\": \"Generate minimally perturbed tables that flip or preserve claim truth values and use them for contrastive learning. 
This forces models to attend to the exact cells and operations that determine veracity, improving sensitivity to small but meaningful changes.\",\n      \"hypothesis\": \"Contrastive training on counterfactual tables improves calibration, reduces shortcut reliance, and strengthens recognition of unverifiable claims on SCITAB.\",\n      \"why_it_matters\": \"Counterfactuals provide strong supervision for grounding and reasoning by making the decision boundary explicit and testable.\",\n      \"possible_methods\": [\n        \"Rule-based and learned perturbations: value nudging within credible ranges, row swaps, header renaming, unit scaling\",\n        \"Label-preserving vs label-flipping augmentations to build hard positive/negative pairs\",\n        \"InfoNCE/triplet losses atop the classification objective using table-claim embeddings and/or evidence masks\",\n        \"Adversarial training: gradient-based value perturbations constrained by plausibility priors\",\n        \"Automatic plausibility checks using domain heuristics (e.g., non-negative counts, unit consistency)\"\n      ],\n      \"experimental_design\": [\n        \"Implement perturbation generators for numeric cells (±k%, unit multipliers), categorical cells (nearest-neighbor header variants), and structure (row/column reorder)\",\n        \"Validate perturbations: ensure cell types and units remain consistent; discard unrealistic changes with rule checks\",\n        \"Construct contrastive batches: (original table, counterfactual table, claim) with desired label relationships\",\n        \"Model: baseline TAPAS/PASTA/DATER; add projection head for contrastive objectives; optionally include evidence mask supervision\",\n        \"Training: combine cross-entropy with contrastive loss; schedule perturbation difficulty from easy to hard\",\n        \"Evaluation: standard SCITAB metrics; counterfactual sensitivity (probability flip when label flips); calibration (ECE, Brier)\",\n        \"Robustness: OOD 
tables created via heavier structural edits; measure performance degradation\",\n        \"Ablations: only value vs only structure perturbations; with/without evidence grounding; different contrastive temperatures\",\n        \"Release utilities to regenerate counterfactuals deterministically from a random seed\"\n      ]\n    },\n    {\n      \"title\": \"HeaderSense: Hierarchical Header and Table Schema Modeling\",\n      \"core_idea\": \"Explicitly encode multi-row/column headers, stubs, and table hierarchies to better map claims to the correct substructure and units. Learn header-aware embeddings and span linking between claims and schema elements.\",\n      \"hypothesis\": \"Modeling hierarchical headers and schema semantics yields more accurate grounding and fewer header-related errors on SCITAB.\",\n      \"why_it_matters\": \"Scientific tables often rely on complex headers and groupings; misinterpreting them leads to wrong evidence and incorrect conclusions.\",\n      \"possible_methods\": [\n        \"Parse tables into a hierarchy (section → header levels → stubs → cells) and build a tree/graph representation\",\n        \"Pretrain with schema-aware objectives: header reconstruction, next-header prediction, and header-cell alignment\",\n        \"Span linking from claims to schema nodes using pointer networks or contrastive alignment\",\n        \"Integrate unit tokens and header qualifiers into cell representations\",\n        \"Hybrid encoder: transformer for text + GNN/Tree-LSTM for schema, fused via cross-attention\"\n      ],\n      \"experimental_design\": [\n        \"Preprocess tables to extract header levels, merged cells, and group boundaries; store as a typed graph\",\n        \"Implement a schema encoder (GraphTransformer/Tree-LSTM) and fuse with a text encoder via cross-attention\",\n        \"Pretrain on schema tasks using SCITAB tables plus external scientific tables if available\",\n        \"Fine-tune on SCITAB claim verification with 
optional evidence mask supervision tied to schema nodes\",\n        \"Evaluation: label F1, header-grounding accuracy (precision/recall of linked header nodes), and error taxonomy on header confusions\",\n        \"Stress tests: synthetic header re-labelling and merging/splitting; track performance under schema perturbations\",\n        \"Ablations: flat vs hierarchical encoding; with/without unit tokens; different fusion strategies\",\n        \"Provide visualization of attention over header hierarchy to audit grounding behavior\"\n      ]\n    },\n    {\n      \"title\": \"UnitMath: Unit-Aware Numerical Reasoning and Dimensional Consistency\",\n      \"core_idea\": \"Augment table reasoning with a unit parser and a symbolic calculator that handles conversions, percentage points vs percent, fold-changes, ratios, and confidence intervals. Enforce dimensional consistency during reasoning and comparisons.\",\n      \"hypothesis\": \"Unit-aware normalization and symbolic numeric reasoning reduce arithmetic and scale errors, improving performance on claims requiring quantitative comparisons in SCITAB.\",\n      \"why_it_matters\": \"Many scientific errors stem from misinterpreting units or arithmetic; addressing this explicitly can produce large, reliable gains on compositional numerical claims.\",\n      \"possible_methods\": [\n        \"Unit and quantity extraction from headers/cells/claims with a rule-enhanced neural tagger\",\n        \"Symbolic reasoning module for conversions and operations (add, diff, ratio, percentage points, CI overlap)\",\n        \"Neural-symbolic fusion: model proposes operations; calculator executes and returns features\",\n        \"Consistency constraints: block comparisons across incompatible dimensions\",\n        \"Data augmentation: unit-rescaling counterfactuals to probe sensitivity to unit semantics\"\n      ],\n      \"experimental_design\": [\n        \"Build a quantity/units tagger for claims and table schema; map to a 
canonical unit ontology\",\n        \"Implement a calculator API exposing safe numeric ops and conversions; integrate as a plug-in whose results are returned to the model as features\",\n        \"Modify the model to output an operation sketch (e.g., compare(diff(colA,rowX), value)) which the calculator executes\",\n        \"Training: supervise labels plus soft supervision on operation types via distant cues (keywords like 'increase by', '% points')\",\n        \"Evaluation: label F1, numeric error categories (scale, unit, arithmetic), and a CI-overlap correctness metric where applicable\",\n        \"Stress tests: unit rescaling (mg→g), percentage vs percentage points swaps; measure invariance/equivariance\",\n        \"Ablations: without calculator, without unit normalization, with noisy unit detection\",\n        \"Package a lightweight unit-conversion library and templates for integrating with existing table encoders\"\n      ]\n    },\n    {\n      \"title\": \"DecompEval: Operator-Level Decomposition and Stepwise Metrics for SCITAB\",\n      \"core_idea\": \"Automatically decompose claims into operator sequences (filter, select, aggregate, compare) and evaluate models at the step level, not just final labels. 
Use the decompositions to provide intermediate training signals and curriculum learning.\",\n      \"hypothesis\": \"Stepwise supervision and evaluation improve model reliability and reveal failure modes, leading to better final performance on SCITAB.\",\n      \"why_it_matters\": \"Final labels hide where reasoning breaks; operator-level metrics encourage models to learn each skill required for compositional verification.\",\n      \"possible_methods\": [\n        \"LLM-based decomposition with schema-aware constraints and verification by table execution\",\n        \"Operator taxonomy design aligned with common scientific table operations\",\n        \"Step-level scoring: operator accuracy, cell coverage, execution consistency, and monotonicity checks\",\n        \"Curriculum learning that orders training examples by decomposition length and operator difficulty\",\n        \"Auxiliary losses to predict next operator and referenced schema nodes\"\n      ],\n      \"experimental_design\": [\n        \"Define an operator schema and implement a validator that checks decompositions against the table\",\n        \"Generate silver decompositions using constrained LLM prompts; discard steps failing execution\",\n        \"Augment SCITAB with decomposition annotations and release a standardized evaluation harness\",\n        \"Training: multi-task model predicts next operator and arguments; final label predicted after executing steps\",\n        \"Metrics: final label F1, operator F1, cell coverage, execution success; report correlation between step scores and correctness\",\n        \"Ablations: without curriculum; with random step order; with noisy decompositions\",\n        \"Human study: annotate a subset with gold decompositions to calibrate silver quality and upper bounds\",\n        \"Provide scripts to parse predictions, align steps to evidence, and compute stepwise metrics\"\n      ]\n    },\n    {\n      \"title\": \"VeriGuard: Unverifiable-Aware Calibration and 
Abstention for Table Claims\",\n      \"core_idea\": \"Model unverifiability explicitly by estimating table coverage of claim arguments and combining it with uncertainty calibration and selective prediction. Use conformal methods to set reliable abstention thresholds.\",\n      \"hypothesis\": \"Explicit unverifiable modeling and calibrated abstention reduce false support/contradiction errors and improve decision quality on SCITAB.\",\n      \"why_it_matters\": \"In real-world scientific checking, knowing when evidence is insufficient is as important as being correct; calibrated abstention improves practical reliability.\",\n      \"possible_methods\": [\n        \"Coverage estimator that predicts which claim slots (entity, measure, time, condition) are grounded in the table\",\n        \"Joint training of label prediction with coverage scores and uncertainty (e.g., temperature scaling, Dirichlet evidence)\",\n        \"Conformal prediction to derive label- and instance-specific abstention thresholds\",\n        \"Negative sampling with partial-evidence and missing-evidence counterfactuals\",\n        \"Decision-theoretic evaluation under different abstention costs\"\n      ],\n      \"experimental_design\": [\n        \"Define claim slot templates and extract slot mentions; train a classifier to map slots to table schema or mark missing\",\n        \"Augment the model with a coverage head and uncertainty head; share encoder with the label predictor\",\n        \"Train with multi-task losses: label CE, slot coverage BCE, and calibration regularizers; include unverifiable claims heavily\",\n        \"Apply inductive conformal prediction on a held-out calibration set to set per-class thresholds for selective prediction\",\n        \"Evaluation: standard label F1, AUROC/AUPRC for unverifiable detection, coverage accuracy, and selective risk curves\",\n        \"Ablations: without coverage head; without conformal calibration; different uncertainty estimators\",\n       
 \"Scenario tests: inject partial tables (remove columns/rows) and measure whether abstention rises appropriately\",\n        \"Release code for end-to-end selective prediction and a dashboard to visualize coverage vs decision outcomes\"\n      ]\n    }\n  ]\n}",
    "response_json": {"ideas": [
      {
        "title": "GroundedCell: Weakly Supervised Evidence Grounding for Table Fact-Checking",
        "core_idea": "Train models to first select evidence cells and then verify claims, using weakly supervised cell-level rationales derived from number/entity alignment and model self-training. This separates grounding from inference, encouraging faithful reasoning over the actual table content rather than superficial cues.",
        "hypothesis": "Explicit cell-level grounding, even with silver labels, improves robustness and accuracy on SCITAB, and reduces spurious correlations and hallucinations.",
        "why_it_matters": "Grounded verification increases interpretability and trust, enabling step-wise diagnosis and error analysis while improving performance on compositional reasoning and unverifiable detection.",
        "possible_methods": [
          "Multitask fine-tuning of TAPAS/PASTA/DATER with an auxiliary cell-pointer head",
          "Two-stage pipeline: bi-encoder row/column retrieval followed by cross-encoder grounding",
          "Graph neural networks over table cells with header/cell/stub nodes and typed edges",
          "Weak supervision via number/entity alignment, attention roll-out, and self-training for cell masks",
          "Joint loss with Dice/Focal loss for masks, cross-entropy for labels, and consistency regularization"
        ],
        "experimental_design": [
          "Environment: Python 3.10, PyTorch 2.x, Transformers 4.x; set seeds for reproducibility",
          "Data prep: Convert SCITAB tables to a unified JSONL with cell coordinates, headers, units, and claim metadata",
          "Silver rationale generation: align claim numerals/entities to table cells; expand with header context; filter by type/unit compatibility",
          "Model: start from TAPAS/PASTA; add a token-to-cell pointer (sigmoid mask) and a classification head; share encoder",
          "Training: optimize combined loss (mask + label); apply curriculum from high-confidence silver masks to self-trained masks",
          "Evaluation: report label accuracy/F1, cell-level precision/recall/F1, evidence sufficiency (coverage of all required cells)",
          "Stress tests: row/column permutation, value swaps, header shuffles; measure label flips when evidence changes",
          "Ablations: without grounding head; with gold vs silver vs none; with/without header context encoding",
          "Release: training/evaluation scripts, config files, and a visualization tool for predicted evidence masks"
        ]
      },
      {
        "title": "Exec-PoT: Executable Programs-of-Thought for Scientific Tables",
        "core_idea": "Represent claim verification as a compact program over a table-specific DSL (select, filter, group, aggregate, compare) and execute it to obtain a verdict. Use an LLM planner with execution guidance to generate valid, compositional programs that ground to table cells.",
        "hypothesis": "Constraining reasoning to executable, grounded programs improves compositional generalization and provides faithful, debuggable rationales on SCITAB.",
        "why_it_matters": "Executable intermediates reduce hallucination, expose precise failure modes, and enable step-level evaluation and training signals that go beyond final-label supervision.",
        "possible_methods": [
          "Design a minimal DSL mapped to Pandas operations with cell/row/column addressing",
          "Derive silver programs via distant supervision (aligners, heuristics) and iteratively refine with EM-style bootstrapping",
          "Execution-guided decoding and beam search to prune invalid or unsupported steps",
          "Neural-symbolic training: sequence-to-program model with teacher forcing on silver programs",
          "Program verification: runtime checks for index validity, unit compatibility, and non-empty selections"
        ],
        "experimental_design": [
          "Define DSL operators and implement an executor backed by Pandas; include unit-safe arithmetic and comparisons",
          "Create an alignment module to propose candidate columns/rows for entities and numerals mentioned in claims",
          "Generate initial silver programs by enumerating short operator sequences that yield the gold verdict; cache execution traces",
          "Model: fine-tune a seq2seq LLM (e.g., T5/BART) to predict programs; enforce schema-constrained decoding",
          "Use execution-guided beam search: keep beams that execute without errors and move toward the correct label",
          "Train with mixed supervision: gold labels + silver programs; add step-level losses for operator and argument prediction",
          "Metrics: final label F1, program exact match, operator accuracy, execution success rate, cell coverage",
          "Ablations: with/without execution guidance; free-form rationale vs program; different DSL granularities",
          "Robustness: adversarial cell swaps and unit perturbations; measure whether programs adapt appropriately"
        ]
      },
      {
        "title": "CounterTab: Counterfactual Table Augmentation and Contrastive Training",
        "core_idea": "Generate minimally perturbed tables that flip or preserve claim truth values and use them for contrastive learning. This forces models to attend to the exact cells and operations that determine veracity, improving sensitivity to small but meaningful changes.",
        "hypothesis": "Contrastive training on counterfactual tables improves calibration, reduces shortcut reliance, and strengthens recognition of unverifiable claims on SCITAB.",
        "why_it_matters": "Counterfactuals provide strong supervision for grounding and reasoning by making the decision boundary explicit and testable.",
        "possible_methods": [
          "Rule-based and learned perturbations: value nudging within credible ranges, row swaps, header renaming, unit scaling",
          "Label-preserving vs label-flipping augmentations to build hard positive/negative pairs",
          "InfoNCE/triplet losses atop the classification objective using table-claim embeddings and/or evidence masks",
          "Adversarial training: gradient-based value perturbations constrained by plausibility priors",
          "Automatic plausibility checks using domain heuristics (e.g., non-negative counts, unit consistency)"
        ],
        "experimental_design": [
          "Implement perturbation generators for numeric cells (±k%, unit multipliers), categorical cells (nearest-neighbor header variants), and structure (row/column reorder)",
          "Validate perturbations: ensure cell types and units remain consistent; discard unrealistic changes with rule checks",
          "Construct contrastive batches: (original table, counterfactual table, claim) with desired label relationships",
          "Model: baseline TAPAS/PASTA/DATER; add projection head for contrastive objectives; optionally include evidence mask supervision",
          "Training: combine cross-entropy with contrastive loss; schedule perturbation difficulty from easy to hard",
          "Evaluation: standard SCITAB metrics; counterfactual sensitivity (probability flip when label flips); calibration (ECE, Brier)",
          "Robustness: OOD tables created via heavier structural edits; measure performance degradation",
          "Ablations: only value vs only structure perturbations; with/without evidence grounding; different contrastive temperatures",
          "Release utilities to regenerate counterfactuals deterministically from a random seed"
        ]
      },
      {
        "title": "HeaderSense: Hierarchical Header and Table Schema Modeling",
        "core_idea": "Explicitly encode multi-row/column headers, stubs, and table hierarchies to better map claims to the correct substructure and units. Learn header-aware embeddings and span linking between claims and schema elements.",
        "hypothesis": "Modeling hierarchical headers and schema semantics yields more accurate grounding and fewer header-related errors on SCITAB.",
        "why_it_matters": "Scientific tables often rely on complex headers and groupings; misinterpreting them leads to wrong evidence and incorrect conclusions.",
        "possible_methods": [
          "Parse tables into a hierarchy (section → header levels → stubs → cells) and build a tree/graph representation",
          "Pretrain with schema-aware objectives: header reconstruction, next-header prediction, and header-cell alignment",
          "Span linking from claims to schema nodes using pointer networks or contrastive alignment",
          "Integrate unit tokens and header qualifiers into cell representations",
          "Hybrid encoder: transformer for text + GNN/Tree-LSTM for schema, fused via cross-attention"
        ],
        "experimental_design": [
          "Preprocess tables to extract header levels, merged cells, and group boundaries; store as a typed graph",
          "Implement a schema encoder (GraphTransformer/Tree-LSTM) and fuse with a text encoder via cross-attention",
          "Pretrain on schema tasks using SCITAB tables plus external scientific tables if available",
          "Fine-tune on SCITAB claim verification with optional evidence mask supervision tied to schema nodes",
          "Evaluation: label F1, header-grounding accuracy (precision/recall of linked header nodes), and error taxonomy on header confusions",
          "Stress tests: synthetic header re-labelling and merging/splitting; track performance under schema perturbations",
          "Ablations: flat vs hierarchical encoding; with/without unit tokens; different fusion strategies",
          "Provide visualization of attention over header hierarchy to audit grounding behavior"
        ]
      },
      {
        "title": "UnitMath: Unit-Aware Numerical Reasoning and Dimensional Consistency",
        "core_idea": "Augment table reasoning with a unit parser and a symbolic calculator that handles conversions, percentage points vs percent, fold-changes, ratios, and confidence intervals. Enforce dimensional consistency during reasoning and comparisons.",
        "hypothesis": "Unit-aware normalization and symbolic numeric reasoning reduce arithmetic and scale errors, improving performance on claims requiring quantitative comparisons in SCITAB.",
        "why_it_matters": "Many scientific errors stem from misinterpreting units or arithmetic; addressing this explicitly can produce large, reliable gains on compositional numerical claims.",
        "possible_methods": [
          "Unit and quantity extraction from headers/cells/claims with a rule-enhanced neural tagger",
          "Symbolic reasoning module for conversions and operations (add, diff, ratio, percentage points, CI overlap)",
          "Neural-symbolic fusion: model proposes operations; calculator executes and returns features",
          "Consistency constraints: block comparisons across incompatible dimensions",
          "Data augmentation: unit-rescaling counterfactuals to probe sensitivity to unit semantics"
        ],
        "experimental_design": [
          "Build a quantity/units tagger for claims and table schema; map to a canonical unit ontology",
          "Implement a calculator API exposing safe numeric ops and conversions; integrate as a plug-in whose results are returned to the model as features",
          "Modify the model to output an operation sketch (e.g., compare(diff(colA,rowX), value)) which the calculator executes",
          "Training: supervise labels plus soft supervision on operation types via distant cues (keywords like 'increase by', '% points')",
          "Evaluation: label F1, numeric error categories (scale, unit, arithmetic), and a CI-overlap correctness metric where applicable",
          "Stress tests: unit rescaling (mg→g), percentage vs percentage points swaps; measure invariance/equivariance",
          "Ablations: without calculator, without unit normalization, with noisy unit detection",
          "Package a lightweight unit-conversion library and templates for integrating with existing table encoders"
        ]
      },
      {
        "title": "DecompEval: Operator-Level Decomposition and Stepwise Metrics for SCITAB",
        "core_idea": "Automatically decompose claims into operator sequences (filter, select, aggregate, compare) and evaluate models at the step level, not just final labels. Use the decompositions to provide intermediate training signals and curriculum learning.",
        "hypothesis": "Stepwise supervision and evaluation improve model reliability and reveal failure modes, leading to better final performance on SCITAB.",
        "why_it_matters": "Final labels hide where reasoning breaks; operator-level metrics encourage models to learn each skill required for compositional verification.",
        "possible_methods": [
          "LLM-based decomposition with schema-aware constraints and verification by table execution",
          "Operator taxonomy design aligned with common scientific table operations",
          "Step-level scoring: operator accuracy, cell coverage, execution consistency, and monotonicity checks",
          "Curriculum learning that orders training examples by decomposition length and operator difficulty",
          "Auxiliary losses to predict next operator and referenced schema nodes"
        ],
        "experimental_design": [
          "Define an operator schema and implement a validator that checks decompositions against the table",
          "Generate silver decompositions using constrained LLM prompts; discard steps failing execution",
          "Augment SCITAB with decomposition annotations and release a standardized evaluation harness",
          "Training: multi-task model predicts next operator and arguments; final label predicted after executing steps",
          "Metrics: final label F1, operator F1, cell coverage, execution success; report correlation between step scores and correctness",
          "Ablations: without curriculum; with random step order; with noisy decompositions",
          "Human study: annotate a subset with gold decompositions to calibrate silver quality and upper bounds",
          "Provide scripts to parse predictions, align steps to evidence, and compute stepwise metrics"
        ]
      },
      {
        "title": "VeriGuard: Unverifiable-Aware Calibration and Abstention for Table Claims",
        "core_idea": "Model unverifiability explicitly by estimating table coverage of claim arguments and combining it with uncertainty calibration and selective prediction. Use conformal methods to set reliable abstention thresholds.",
        "hypothesis": "Explicit unverifiable modeling and calibrated abstention reduce false support/contradiction errors and improve decision quality on SCITAB.",
        "why_it_matters": "In real-world scientific checking, knowing when evidence is insufficient is as important as being correct; calibrated abstention improves practical reliability.",
        "possible_methods": [
          "Coverage estimator that predicts which claim slots (entity, measure, time, condition) are grounded in the table",
          "Joint training of label prediction with coverage scores and uncertainty (e.g., temperature scaling, Dirichlet evidence)",
          "Conformal prediction to derive label- and instance-specific abstention thresholds",
          "Negative sampling with partial-evidence and missing-evidence counterfactuals",
          "Decision-theoretic evaluation under different abstention costs"
        ],
        "experimental_design": [
          "Define claim slot templates and extract slot mentions; train a classifier to map slots to table schema or mark missing",
          "Augment the model with a coverage head and uncertainty head; share encoder with the label predictor",
          "Train with multi-task losses: label CE, slot coverage BCE, and calibration regularizers; include unverifiable claims heavily",
          "Apply inductive conformal prediction on a held-out calibration set to set per-class thresholds for selective prediction",
          "Evaluation: standard label F1, AUROC/AUPRC for unverifiable detection, coverage accuracy, and selective risk curves",
          "Ablations: without coverage head; without conformal calibration; different uncertainty estimators",
          "Scenario tests: inject partial tables (remove columns/rows) and measure whether abstention rises appropriately",
          "Release code for end-to-end selective prediction and a dashboard to visualize coverage vs decision outcomes"
        ]
      }
    ]
  },
    "user_prompt": "\n        Your task is to generate a list of original, feasible, and impactful research ideas.\n        The goal is to design research projects that focus on enhancing table grounding techniques or introducing reasoning steps that are useful for evaluation and performance improvement.\n        The information below includes details about the dataset, related work, and parts of the introduction: Title: SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables\n\nIntroduction: In this paper, we propose a novel dataset SCITAB, which fulfills these stated criteria. It contains 1,225 challenging scientific claims, each demanding compositional reasoning for verification using scientific tables. Our data is derived from the SciGen dataset (Moosavi et al., 2021), a resource that includes scientific tables and claims crawled from arXiv.org. We first manually filter out the checkworthy scientific claims from the raw data. Following this, we employ a strategy of human–model collaboration, as depicted in Figure 2, to generate claims that are either contradicted or unverifiable based on the table’s content. Figure 1 shows a claim from SCITAB and the corresponding reasoning process to verify it. Compared with existing benchmarks, SCITAB is closer to real-world scientific fact-checking in terms of more realistic claims and table-based evidence. Through data analysis, we further show that the claims in SCITAB necessitate a more comprehensive and nuanced set of reasoning skills for verification, e.g., numerical reasoning and commonsense knowledge, etc.\n\nRelated Work:\nScientific Fact-Checking Datasets: Existing datasets for scientific fact-checking are summarized in a recent survey from Vladika and Matthes (2023). 
These datasets differ in: 1) domain: biology (Wadden et al., 2020; Akhtar et al., 2022), COVID-19 (Saakyan et al., 2021; Sarrouti et al., 2021; Mohr et al., 2022; Wang et al., 2023), and climate (Diggelmann et al., 2020), 2) claim creation: crowd-sourced claims v.s. natural claims, and 3) evidence source: Wikipedia articles (Diggelmann et al., 2020) or research papers (Wadden et al., 2020, 2022; Sarrouti et al., 2021). However, most of these datasets rely on text evidence to verify claims. SEM-TAB-FACTS (Wang et al., 2021) is the only existing dataset based on scientific tables, but it is limited to simple, crowd-sourced claims. To bridge this gap, we construct SCITAB which contains complex claims from authentic scientific papers with table-based evidence.\nTable-Based Reasoning: Table-based reasoning requires reasoning over both free-form natural language queries and (semi-)structured tables. Early works either rely on executable languages (e.g., SQL and SPARQL) to access the tabular data (Yin et al., 2016; Yu et al., 2018) or employ graph neural networks to capture logical structure in statements, e.g., LogicFactChecker (Zhong et al., 2020) and ProgVGAT (Yang et al., 2020). However, these approaches often struggle with generalization, as they are tightly bound to specific table formats and language patterns. To address this, we have seen a shift toward table pre-training, with the advent of TableBERT (Chen et al., 2020), TAPAS (Herzig et al., 2020), SaMoE (Zhou et al., 2022), PASTA (Gu et al., 2022), and DATER (Ye et al., 2023). These methods encode sentence-table pairs using language models and transform table-based reasoning into question-answering or natural language inference. 
In our work, we focus on evaluating pretraining-based methods on SCITAB because they not only demonstrate superior performance but also offer the benefits of few-shot learning.\n\nDataset Information:\nDataset Name: SCITAB\nSize: 1,225 claims\nDomains:   \n        Return the output **only in valid JSON**, without extra commentary or explanations.\n        Follow this exact JSON schema:\n        {\n        \"ideas\": [\n            {\n            \"title\": \"string - concise and catchy research title\",\n            \"core_idea\": \"string - 2–3 sentence summary of the research\",\n            \"hypothesis\": \"string - main hypothesis to be tested\",\n            \"why_it_matters\": \"string - significance and potential impact of pursuing this research\",\n            \"possible_methods\": [\"list of suggested experimental designs\"],\n            \"experimental_design\": [\"guideline details on how to generate the code and how to run experiments\"]\n            }\n        ]\n        }\n        Generate 7 research ideas in this format.\n\t\t"
  }
]