category,key,question,ground_truths
scifacts-geo,55,"Find the paper(s) that use all of the datasets mentioned below:
American Community Survey, Flooding Data from FEMA: Harvey Flood Depths Grid, Food Access Research Atlas by USDA, Streetlight Data, POI data by SafeGraph
    
Instructions:
1. You must respond with a list of JSON object(s) in this format:
{
    ""paper_title"": ""<full paper title>"",
}
   
2. You must provide a valid json file with empty fields even if you don't have the answer.","[[{""paper_title"": [""Characterizing Equitable Access to Grocery Stores During Disasters Using Location-based Data""]}]]"
scifacts-materials,76,"Find the material(s) that satisfy every one of the listed measured properties.

Instructions
1. In your response, provide a list of JSON object(s) in the following format:
  [
    {
      ""material"": ""<material name>"",
      ""inference_basis"": ""<brief explanation of how the listed properties match this material>"",
      ""paper_title"": ""<title of the paper>"",
      ""property_source_table"": ""<source table from which material name, property and property descriptor were identified>"",
      ""property_source_passage"": ""<source passage from which material name, property and property descriptor were identified>""
    }
  ]
2. Your inference_basis should reference the key properties (e.g., band gap, absorption coefficient, crystal structure) that led you to that material.
3. You must provide a valid json file even if you don't have the answer.


Measured properties:
- Direct band gap: 3.37 eV
- Exciton binding energy: 60 meV
- Lattice fringe: 0.52 nm
- contact angle CA water contract angle (Dataphysics OCA20 contact angle system Ambient temperature): 121 °
","[[{""paper_title"": [""Hydrothermally Grown ZnO Micro/Nanotube Arrays and Their Properties""], ""material"": [""zinc oxide""]}]]"
entities,10,"Provide a comprehensive list of all animated films (excluding anime) that meet the following criteria:

1. The film must fall under the category of animation but must not be classified as anime.

2. The film must have a Tomatometer rating of 95% or higher on Rotten Tomatoes.

3. The film’s runtime must be between 80 and 85 minutes inclusive.

Present the final result as a JSON array of strings, where each string is the title of a qualifying film.","[[""Beauty and the Beast (1991)"", ""Shaun the Sheep Movie"", ""Snow White and the Seven Dwarfs"", ""A Boy Named Charlie Brown"", ""The Big Bad Fox and Other Tales..."", ""The King and the Mockingbird"", ""Batman: The Long Halloween, Part One"", ""Steven Universe: The Movie"", ""Predator: Killer of Killers"", ""Flow (2024)"", ""Marvel Rising: Secret Warriors"", ""Yellow Submarine"", ""Even Mice Belong in Heaven"", ""The Witcher: Nightmare of the Wolf"", ""Batman and Superman: Battle of the Super Sons"", ""Long Way North"", ""Metalocalypse: Army of the Doomstar"", ""Wallace & Gromit: The Curse of the Were-Rabbit"", ""Chicken Run"", ""Bu\u00f1uel in the Labyrinth of the Turtles"", ""Sita Sings the Blues"", ""Toy Story"", ""Ernest & Celestine"", ""I Lost My Body""]]"
entities,4,"Provide a comprehensive list of all people that meet the following criteria:

1. He/She should have won at least 3 IMO medals for Korea.

2. At least one of these medals should have been won in a year when Korea was ranked 5th or better (rank <= 5) at the IMO.

Present the final result as a JSON array of strings, where each string is the name of the person.","[[""Youngbeom Jin"", ""Woojin Choi"", ""Gyudong Lee"", ""Sehun Kim"", ""Jimin Kim"", ""Dong Ryul Kim"", ""Whan Ghang"", ""Junhwi Bae"", ""Yuchan Jung""]]"
novel-datasets-identification,31,"Identify a publicly available dataset of long-form in-the-wild videos that are segmented into scenes and richly annotated with both content descriptions and user engagement signals. The dataset should include multimodal information and support analysis at the video level as well as the scene level.

Once you've found a dataset paper, provide its information in the following JSON format:

```json
{
    ""title"": <title of the dataset paper>,
    ""is_published"": <true/false>,
    ""venue"": <venue>, // eg: CVPR
    ""year"": <year of publication>,
    ""dataset_url"": <link to the dataset>, // if available, else null
}
```","[{""title"": ""Large Content and Behavior Models to Understand, Simulate, and Optimize Content and Behavior"", ""is_published"": true, ""venue"": ""ICLR"", ""year"": 2024, ""dataset_url"": ""https://huggingface.co/datasets/behavior-in-the-wild/content-behavior-corpus""}]"
novel-datasets-identi-extraction,23,"I'm looking for a GPT-4 generated corpus of decisions rooted in quotidian life - think commuting, family squabbles or career decisions - each tagged against broad socio-psychological dimensions. I'm specifically interested in scenarios where the resolution hinges not on objective correctness, but instead on personal values. How does this corpus reveal GPT-4’s implicit generation bias across the various value dimensions in each of the socio-psychological frameworks explored in the corpus?

Provide an overview of GPT-4's generation bias in the following JSON format:

```json
[
    {
        ""framework"": <framework_name>,
        ""most_biased_dimension"": <most_biased_dimension_name>
    },
    ...
]
```","[[{""framework"": ""Moral Foundations Theory"", ""most_biased_dimension"": ""Fairness""}, {""framework"": ""World Values Survey"", ""most_biased_dimension"": ""Self-Expression""}, {""framework"": ""Maslow's Hierarchy of Needs"", ""most_biased_dimension"": ""Self-Esteem""}, {""framework"": ""Plutchik's Wheel of Emotions"", ""most_biased_dimension"": ""Trust""}, {""framework"": ""Aristotle's Virtues"", ""most_biased_dimension"": ""Truthfulness""}]]"
novel-datasets-peer,38,"I want to compare existing 3D urban segmentation datasets based on real-world scenes that go beyond static 2D maps or flat segmentation overlays, and instead represent urban environments as richly annotated point clouds. How do they compare in terms of data acquisition methods—such as mobile laser scanning, terrestrial laser scanning, aerial laser scanning, and photogrammetry?

For the datasets you find, provide the following information in json format:

```json
[
    {
        ""name"": <name of the resource>,
        ""data_aquisition_method"": <data acquisition method>,
        ""area"": <area of the dataset>,
        ""scenes"": <number of scenes>,
        ""points_million"": <number of points>,
    },
    ...
]
```","[[{""name"": ""Okland"", ""data_aquisition_method"": ""MLS"", ""area"": ""1.5 km"", ""scenes"": 1, ""points_million"": 1.6}, {""name"": ""Semantic3D"", ""data_aquisition_method"": ""TLS"", ""area"": ""-"", ""scenes"": 3, ""points_million"": 4000}, {""name"": ""Paris-Lille-3D"", ""data_aquisition_method"": ""MLS"", ""area"": ""1.94 km"", ""scenes"": 2, ""points_million"": 143}, {""name"": ""DublinCity"", ""data_aquisition_method"": ""ALS"", ""area"": ""2 km\u00b2"", ""scenes"": 1, ""points_million"": 260}, {""name"": ""SemanticKITTI"", ""data_aquisition_method"": ""MLS"", ""area"": ""39.2 km"", ""scenes"": 1, ""points_million"": 4549}, {""name"": ""Toronto-3D"", ""data_aquisition_method"": ""MLS"", ""area"": ""1.0 km"", ""scenes"": 1, ""points_million"": 78.3}, {""name"": ""DALES"", ""data_aquisition_method"": ""ALS"", ""area"": ""10 km\u00b2"", ""scenes"": 1, ""points_million"": 505.3}, {""name"": ""Campus3D"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""1.58 km\u00b2"", ""scenes"": 1, ""points_million"": 937.1}, {""name"": ""Hessigheim 3D"", ""data_aquisition_method"": ""UAV LiDAR/Camera"", ""area"": ""0.19 km\u00b2"", ""scenes"": 1, ""points_million"": 125.7}, {""name"": ""SUM"", ""data_aquisition_method"": ""Airplane Camera"", ""area"": ""4 km\u00b2"", ""scenes"": 1, ""points_million"": 19}, {""name"": ""Swiss3DCities"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""2.7 km\u00b2"", ""scenes"": 3, ""points_million"": 226}, {""name"": ""SensatUrban"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""7.6 km\u00b2"", ""scenes"": 3, ""points_million"": 2847}, {""name"": ""STPLS3D"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""1.27 km\u00b2"", ""scenes"": 1, ""points_million"": 150.4}, {""name"": ""InstanceBuilding"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""0.434 km\u00b2"", ""scenes"": 1, ""points_million"": 7.46}, {""name"": ""UrbanBIS"", ""data_aquisition_method"": ""UAV Ptgy"", ""area"": ""10.78 km\u00b2"", 
""scenes"": 6, ""points_million"": 2523.8}, {""name"": ""nuScenes panoptic"", ""data_aquisition_method"": ""MLS"", ""area"": null, ""scenes"": 1000, ""points_million"": 1400}, {""name"": ""FRACTAL"", ""data_aquisition_method"": ""ALS"", ""area"": ""250 km\u00b2"", ""scenes"": 100000, ""points_million"": 9261}, {""name"": ""KITTI-360"", ""data_aquisition_method"": ""MLS"", ""area"": ""73.7 km"", ""scenes"": null, ""points_million"": 1000}]]"
flights,40,"In which flight incident did a commercial airliner perform an unusually high number of go-arounds before safely landing? For each attempt, add an entry using the following format:

```json
[
    {
        ""time_utc"": <timestamp>,
        ""attempt_number"": <attempt_number>,
        ""airport"": <airport_name>,
        ""runway_number"": <runway_number>,
        ""runway_aids_available"": <runway_aids>,
        ""visibility_minima_for_aid"": <visibility_minima>,
        ""actual_visibility"": <actual_visibility>,
        ""remaining_fuel_estimate"": <remaining_fuel>
    },
    ...
]
```","[[{""time_(utc)"": ""2358"", ""attempt_#"": ""1"", ""airport"": ""Cochin"", ""runway_#"": ""27"", ""runway_aid_available"": ""ILS Cat I"", ""visibility_minima_(for_aid)"": ""650m RVR, 320ft DA"", ""actual_visibility"": ""3500m"", ""remaining_fuel_(estimate)"": ""4699 kg""}, {""time_(utc)"": ""0017"", ""attempt_#"": ""2"", ""airport"": ""Cochin"", ""runway_#"": ""27"", ""runway_aid_available"": ""ILS Cat I"", ""visibility_minima_(for_aid)"": ""650m RVR, 320ft DA"", ""actual_visibility"": ""2500m"", ""remaining_fuel_(estimate)"": ""3919 kg""}, {""time_(utc)"": ""0050"", ""attempt_#"": ""3"", ""airport"": ""Cochin"", ""runway_#"": ""27"", ""runway_aid_available"": ""ILS Cat I"", ""visibility_minima_(for_aid)"": ""650m RVR, 320ft DA"", ""actual_visibility"": ""1500m"", ""remaining_fuel_(estimate)"": ""2644 kg""}, {""time_(utc)"": ""0119"", ""attempt_#"": ""4"", ""airport"": ""Trivandrum"", ""runway_#"": ""14"", ""runway_aid_available"": ""VOR"", ""visibility_minima_(for_aid)"": ""2100m RVR, 560ft DA"", ""actual_visibility"": ""2000m"", ""remaining_fuel_(estimate)"": ""1324 kg""}, {""time_(utc)"": ""0127"", ""attempt_#"": ""5"", ""airport"": ""Trivandrum"", ""runway_#"": ""14"", ""runway_aid_available"": ""VOR"", ""visibility_minima_(for_aid)"": ""2100m RVR, 560ft DA"", ""actual_visibility"": ""2000m"", ""remaining_fuel_(estimate)"": ""898 kg""}, {""time_(utc)"": ""0132"", ""attempt_#"": ""6"", ""airport"": ""Trivandrum"", ""runway_#"": ""14"", ""runway_aid_available"": ""VOR"", ""visibility_minima_(for_aid)"": ""2100m RVR, 560ft DA"", ""actual_visibility"": ""2000m"", ""remaining_fuel_(estimate)"": ""662 kg""}, {""time_(utc)"": ""0139"", ""attempt_#"": ""7"", ""airport"": ""Trivandrum"", ""runway_#"": ""32"", ""runway_aid_available"": ""VOR"", ""visibility_minima_(for_aid)"": ""-"", ""actual_visibility"": ""-"", ""remaining_fuel_(estimate)"": ""349 kg""}]]"
prior-art,86,"_I have the following ideas for a research paper. Can you help identify if this has already been done or implied in full or in parts in other papers?
Give your answer as a **JSON with three fields: paper title, link, and a connection field** that quotes exact sentences from the paper and parts of my ideas below to make the case._

Here is the format:

```json
[
    {
        ""title"": <paper title>,
        ""link"": <link to the paper>,
        ""connection"": <connection field>
    }
]
```

We develop a comprehensive evaluation framework for eliciting reasoning mistakes in LLMs.
We explore aspects of mistake correction in LLMs, as well as address a distinct and critical dimension: explicit mistake detection in reasoning chains. In addition to exploring the implicit self-correction abilities, which focus on a model’s capacity to evaluate or refine its own responses, we also explicitly evaluate whether models can identify errors in reasoning chains—whether these errors originate from the same model or other models.

Specifically, our work investigates intrinsic self-correction by measuring a model's ability to judge the correctness of its own generated answers. We focus on distinguishing among self-generated responses and selecting the most appropriate one. Furthermore, our work probes a more foundational question: Can LLMs reliably detect mistakes in a given reasoning chain? We argue that mistake detection is a precursor to effective self-correction, as the ability to robustly identify errors demonstrates higher-order reasoning capabilities. Our findings reveal two novel insights: Current models, including state-of-the-art ones, exhibit significant weaknesses in mistake detection, performing inconsistently across both simple and complex problems. When a model successfully identifies mistakes, its subsequent ability to rectify or self-correct those mistakes improves, suggesting a strong interdependence between mistake detection and correction.","[[{""title"": ""Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning"", ""url"": ""https://openreview.net/forum?id=uDZ9d4UAUh""}, {""title"": ""Large Language Models Cannot Self-Correct Reasoning Yet"", ""url"": ""https://arxiv.org/abs/2310.01798""}, {""title"": ""SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses"", ""url"": ""https://arxiv.org/abs/2404.04298""}, {""title"": ""LLMs cannot find reasoning errors, but can correct them given the error location"", ""url"": ""https://arxiv.org/abs/2311.08516""}, {""title"": ""LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought"", ""url"": ""https://arxiv.org/abs/2405.06705""}, {""title"": ""SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights"", ""url"": ""https://arxiv.org/html/2410.09008v1""}]]"
prior-art,99,"_I have the following ideas for a research paper. Can you help identify if this has already been done or implied in full or in parts in other papers?
Give your answer as a **JSON with three fields: paper title, link, and a connection field** that quotes exact sentences from the paper and parts of my ideas below to make the case._

Here is the format:

```json
[
    {
        ""title"": <paper title>,
        ""link"": <link to the paper>,
        ""connection"": <connection field>
    }
]
```

We present a theoretical analysis of the training dynamics of vision-language models with nonlinear activation functions. In particular, we provide a rigorous justification for the effectiveness of synthetic text captions in improving pre-training performance. To analyze the impact of misaligned image-text pairs, we consider a one-hidden-layer neural network model, and show that neurons trained on noisy data tend to learn a mixture of true and spurious features. In contrast, we also provide an analysis when the models are restricted to be linear for both text and image encoders. For such linear models, we show that the training dynamics can be studied using singular value decomposition. But when both the text and image encoders are nonlinear, the analysis becomes more involved: we analyze the behavior of nonlinear activations across three distinct training stages, as well as the non-convex interactions between modalities.

We also provide a robust theoretical analysis of noisy labels' effects on contrastive learning generalization, offering insights into inductive biases necessary for robustness. Further we investigate how contrastive learning inherently enhances robustness against label noise and explain why learned representations are less sensitive to noisy labels.","[[{""title"": ""Theoretical Analysis of Contrastive Learning in Vision-Language Model Pretraining: The Role of Synthetic Text Captions for Feature Alignment"", ""url"": ""https://openreview.net/forum?id=hgAAXdv8q8""}, {""title"": ""Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data"", ""url"": ""https://arxiv.org/abs/2302.06232""}, {""title"": ""Understanding Contrastive Learning Requires Incorporating Inductive Biases"", ""url"": ""https://arxiv.org/abs/2202.14037""}, {""title"": ""Investigating Why Contrastive Learning Benefits Robustness Against Label Noise"", ""url"": ""https://arxiv.org/abs/2201.12498""}, {""title"": ""On the Role of Label Noise in the Feature Learning Process"", ""url"": ""https://openreview.net/pdf?id=kwHvs1UdTM""}, {""title"": ""On the Robustness of Multimodal Contrastive Learning to Distribution Shifts"", ""url"": ""https://arxiv.org/pdf/2310.04971""}]]"
