# Data Format and Paper Metadata Format

## Data Format of AirQA

```json
{
    "uuid": "xxxx-xxxx-xxxx", // unique identifier for this data sample
    "question": "user question about ai research papers", // user question
    "answer_format": "text description on answer format, e.g., a single float number, a list of strings", // can be inserted into prompt
    "tags": [
        "tag1",
        "tag2"
    ], // different tags for the data or task sample, see below for definition
    "anchor_pdf": [
    ], // UUIDs of the papers that are explicitly mentioned in the question
    "reference_pdf": [
    ], // in the "multiple" category, UUIDs of papers that may be needed but are not mentioned in the question
    "conference": [
        "acl2023"
    ], // in the "retrieval" category, defines the search space of papers, usually conference+year
    "evaluator": {
        "eval_func": "function_name_to_call", // all eval functions are defined under `evaluation/` folder
        "eval_kwargs": {
            "gold": "ground truth answer",
            "lowercase": true
        } // the gold answer or how to get the gold answer should be included in `eval_kwargs` dict. Other optional keyword arguments can be used for customization and function re-use, e.g., `lowercase=True` and `threshold=0.95`.
    }, // A dict specifying the evaluation function and its parameters. The first parameter of `eval_func` must be the string predicted by the LLM.
    "annotator": "human" // whether annotated by human, machine, or adapted from other datasets
}
```
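
To make the role of the `evaluator` field concrete, here is a minimal sketch of how a sample could be dispatched to its evaluation function. It is an illustration under stated assumptions, not the repository's implementation: `eval_string_exact_match`, the `EVAL_FUNCS` registry, and `evaluate` are hypothetical names. The only conventions taken from the format above are that the predicted string is the first positional argument and that `eval_kwargs` supplies the remaining keyword arguments.

```python
from typing import Any, Callable, Dict


def eval_string_exact_match(pred: str, gold: str, lowercase: bool = False) -> float:
    """Hypothetical evaluation function: 1.0 if pred equals gold, else 0.0."""
    pred, gold = pred.strip(), gold.strip()
    if lowercase:
        pred, gold = pred.lower(), gold.lower()
    return float(pred == gold)


# Hypothetical registry mapping `eval_func` names to callables.
# The real functions live under `evaluation/`; this dict is only a stand-in.
EVAL_FUNCS: Dict[str, Callable[..., float]] = {
    "eval_string_exact_match": eval_string_exact_match,
}


def evaluate(sample: Dict[str, Any], pred: str) -> float:
    """Look up the function named in the sample and call it with eval_kwargs."""
    evaluator = sample["evaluator"]
    func = EVAL_FUNCS[evaluator["eval_func"]]
    # The LLM-predicted string is always passed as the first argument.
    return func(pred, **evaluator["eval_kwargs"])


sample = {
    "evaluator": {
        "eval_func": "eval_string_exact_match",
        "eval_kwargs": {"gold": "BERT", "lowercase": True},
    }
}
score = evaluate(sample, "bert")
```

Keeping the gold answer inside `eval_kwargs` means the same function can be reused across samples with different options, which is the motivation given in the comment on `eval_kwargs` above.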


## Tags of AirQA

This section describes different question categories (or tags) for classification.

### Category 1: Task Goals

- `single`: asks about technical details of a single paper, e.g.,
    - list 3 major contributions of this work
- `multiple`: involves multiple papers and may require comparison, calculation, aggregation, and multi-step reasoning, e.g.,
    - which agent framework performs better on a popular benchmark
- `retrieval`: retrieves papers that satisfy given constraints, e.g.,
    - papers published by a specific author or institute


### Category 2: Key Capabilities

- `text`: Q&A that focuses on text understanding and reasoning
- `table`: Q&A that requires identifying tables and their contents
- `image`: Q&A that involves the recognition of figures, charts, or graphs
- `formula`: Q&A that queries the details of math formulas
- `metadata`: Q&A about paper metadata, such as authors, institutes, e-mails, conferences, years, and other information that does not appear in the main text (e.g., page headers and footers)


### Category 3: Evaluation Types

- `subjective`: answers that require LLM or model-based evaluation
- `objective`: answers that can be evaluated with objective metrics defined in the module [`evaluation/`](../evaluation/__init__.py)
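
The three categories above can be used to sort the `tags` list of a sample. The following is a small sketch of that idea, assuming the tag names listed in this section; the category keys (`task_goal`, `capability`, `evaluation_type`) and the `group_tags` helper are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Set

# Tag names are taken from the three categories above;
# the dictionary keys are hypothetical labels for those categories.
TAG_CATEGORIES: Dict[str, Set[str]] = {
    "task_goal": {"single", "multiple", "retrieval"},
    "capability": {"text", "table", "image", "formula", "metadata"},
    "evaluation_type": {"subjective", "objective"},
}


def group_tags(tags: List[str]) -> Dict[str, List[str]]:
    """Group a sample's tags by category; unrecognized tags go under 'unknown'."""
    grouped: Dict[str, List[str]] = {name: [] for name in TAG_CATEGORIES}
    grouped["unknown"] = []
    for tag in tags:
        for name, members in TAG_CATEGORIES.items():
            if tag in members:
                grouped[name].append(tag)
                break
        else:
            # No category matched this tag.
            grouped["unknown"].append(tag)
    return grouped


grouped = group_tags(["multiple", "table", "image", "objective"])
```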


## Paper Metadata Format

[`get_ai_research_metadata`](../utils/functions/ai_research_metadata.py#get_ai_research_metadata) is the first function invoked by the pipeline when parsing a PDF. It retrieves the metadata of the paper.

```py
from utils.functions import get_ai_research_metadata

output_json = get_ai_research_metadata(
    title="Attention is all you need",
    model="gpt-4o-mini",
    temperature=0.1,
    api_tools=["openreview", "dblp", "arxiv", "semantic-scholar"],
    write_to_json=True,  # the metadata will be written to a JSON file under dataset_dir/metadata/
    title_lines=20,
    tldr_max_length=80,
    tag_number=5,
    dataset_dir="data/dataset/airqa",
    threshold=95,
    limit=10,
)
```

The template of the output metadata is:

```json
{
    "uuid": "0a02b881-d0b1-59c6-a23e-1feb3bdf4c24", // UUID generated by `get_airqa_paper_uuid`
    "title": "Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models", // paper title
    "conference_full": "Annual Meeting of the Association for Computational Linguistics (2024)", // full title of the conference
    "conference": "ACL", // conference abbreviation
    "year": 2024, // conference year, or the year in which the paper was published
    "volume": "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", // volume title
    "bibtex": "...", // bibtex citation text
    "authors": ["Zhengxin Zhang", "Dan Zhao", "Xupeng Miao", ...], // authors list
    "num_pages": 36, // int value
    "pdf_url": "https://aclanthology.org/2024.acl-long.1.pdf", // URL to download the PDF, should end with .pdf
    "pdf_path": "data/dataset/airqa/papers/acl2024/ab14a93a-c5ee-5d60-8713-8b38bd501140.pdf", // local path to save the PDF, rename it with the UUID
    "abstract": "...", // paper abstract text
    "tldr": "...", // TLDR
    "tags": ["llm", "agent"] // keywords or tags
}
```
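
A metadata dict that follows this template can be sanity-checked before use. The sketch below is one possible way to do that and is not part of the repository: `FIELD_TYPES` and `check_metadata` are hypothetical names, the expected types are inferred from the example values above, and the only rule taken directly from the template is that `pdf_url` should end with `.pdf`.

```python
from typing import Any, Dict, List

# Expected type per field, inferred from the template above (an assumption).
FIELD_TYPES: Dict[str, type] = {
    "uuid": str, "title": str, "conference_full": str, "conference": str,
    "year": int, "volume": str, "bibtex": str, "authors": list,
    "num_pages": int, "pdf_url": str, "pdf_path": str,
    "abstract": str, "tldr": str, "tags": list,
}


def check_metadata(meta: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    for field, expected in FIELD_TYPES.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected):
            problems.append(f"{field} should be of type {expected.__name__}")
    # The template states that the download URL should end with .pdf.
    url = meta.get("pdf_url")
    if isinstance(url, str) and not url.endswith(".pdf"):
        problems.append("pdf_url should end with .pdf")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a metadata file in a single pass.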