---
############################################################
models:
  # AI21 Labs
  - name: ai21/j1-jumbo
    display_name: J1-Jumbo v1 (178B)
    description: Jurassic-1 Jumbo (178B parameters) ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 178000000000
    release_date: 2021-08-11
  - name: ai21/j1-large
    display_name: J1-Large v1 (7.5B)
    description: Jurassic-1 Large (7.5B parameters) ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 7500000000
    release_date: 2021-08-11
  - name: ai21/j1-grande
    display_name: J1-Grande v1 (17B)
    description: Jurassic-1 Grande (17B parameters) with a "few tweaks" to the training process ([docs](https://studio.ai21.com/docs/jurassic1-language-models/), [tech report](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)).
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 17000000000
    release_date: 2022-05-03
  - name: ai21/j1-grande-v2-beta
    display_name: J1-Grande v2 beta (17B)
    description: Jurassic-1 Grande v2 beta (17B parameters)
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 17000000000
    release_date: 2022-10-28
  - name: ai21/j2-jumbo
    display_name: Jurassic-2 Jumbo (178B)
    description: Jurassic-2 Jumbo (178B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 178000000000
    release_date: 2023-03-09
  - name: ai21/j2-grande
    display_name: Jurassic-2 Grande (17B)
    description: Jurassic-2 Grande (17B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 17000000000
    release_date: 2023-03-09
  - name: ai21/j2-large
    display_name: Jurassic-2 Large (7.5B)
    description: Jurassic-2 Large (7.5B parameters) ([docs](https://www.ai21.com/blog/introducing-j2))
    creator_organization: AI21 Labs
    access: limited
    num_parameters: 7500000000
    release_date: 2023-03-09

  # Aleph Alpha
  # TODO: add Luminous World when it's released
  - name: AlephAlpha/luminous-base
    display_name: Luminous Base (13B)
    description: Luminous Base (13B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
    creator_organization: Aleph Alpha
    access: limited
    num_parameters: 13000000000
    # TODO: get exact release date
    release_date: 2022-01-01
    todo: true
  - name: AlephAlpha/luminous-extended
    display_name: Luminous Extended (30B)
    description: Luminous Extended (30B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
    creator_organization: Aleph Alpha
    access: limited
    num_parameters: 30000000000
    release_date: 2022-01-01
    todo: true
  - name: AlephAlpha/luminous-supreme
    display_name: Luminous Supreme (70B)
    description: Luminous Supreme (70B parameters) ([docs](https://docs.aleph-alpha.com/docs/introduction/luminous/))
    creator_organization: Aleph Alpha
    access: limited
    num_parameters: 70000000000
    release_date: 2022-01-01
    todo: true

  # Anthropic
  - name: anthropic/stanford-online-all-v4-s3
    display_name: Anthropic-LM v4-s3 (52B)
    description: A 52B parameter language model, trained using reinforcement learning from human feedback ([paper](https://arxiv.org/pdf/2204.05862.pdf)).
    creator_organization: Anthropic
    access: closed
    num_parameters: 52000000000
    release_date: 2021-12-01
  - name: anthropic/claude-v1.3
    display_name: Anthropic Claude V1.3
    description: A model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
    creator_organization: Anthropic
    access: limited
    release_date: 2023-03-17
  - name: anthropic/claude-instant-v1
    display_name: Anthropic Claude Instant V1
    description: A lightweight version of Claude, a model trained using reinforcement learning from human feedback ([docs](https://www.anthropic.com/index/introducing-claude)).
    creator_organization: Anthropic
    access: limited
    release_date: 2023-03-17

  # BigScience
  - name: together/bloom
    display_name: BLOOM (176B)
    description: BLOOM (176B parameters) is an autoregressive model trained on 46 natural languages and 13 programming languages ([paper](https://arxiv.org/pdf/2211.05100.pdf)).
    creator_organization: BigScience
    access: open
    num_parameters: 176000000000
    release_date: 2022-06-28
  - name: together/bloomz
    display_name: BLOOMZ (176B)
    description: BLOOMZ (176B parameters) is BLOOM that has been fine-tuned on natural language instructions ([details](https://huggingface.co/bigscience/bloomz)).
    creator_organization: BigScience
    access: open
    num_parameters: 176000000000
    release_date: 2022-11-03
    todo: true
  - name: together/t0pp
    display_name: T0pp (11B)
    description: T0pp (11B parameters) is an encoder-decoder model trained on a large set of different tasks specified in natural language prompts ([paper](https://arxiv.org/pdf/2110.08207.pdf)).
    creator_organization: BigScience
    access: open
    num_parameters: 11000000000
    release_date: 2021-10-15

  # BigCode
  - name: huggingface/santacoder
    display_name: SantaCoder (1.1B)
    description: SantaCoder (1.1B parameters) is trained on the Python, Java, and JavaScript subset of The Stack (v1.1) ([model card](https://huggingface.co/bigcode/santacoder)).
    creator_organization: BigCode
    access: open
  - name: huggingface/starcoder
    display_name: StarCoder (15.5B)
    description: StarCoder (15.5B parameters) is trained on 80+ programming languages from The Stack (v1.2) ([model card](https://huggingface.co/bigcode/starcoder)).
    creator_organization: BigCode
    access: open

  # Cohere
  - name: cohere/xlarge-20220609
    display_name: Cohere xlarge v20220609 (52.4B)
    description: Cohere xlarge v20220609 (52.4B parameters)
    creator_organization: Cohere
    access: limited
    num_parameters: 52400000000
    release_date: 2022-06-09
  - name: cohere/large-20220720
    display_name: Cohere large v20220720 (13.1B)
    description: Cohere large v20220720 (13.1B parameters), which is deprecated by Cohere as of December 2, 2022.
    creator_organization: Cohere
    access: limited
    num_parameters: 13100000000
    release_date: 2022-07-20
  - name: cohere/medium-20220720
    display_name: Cohere medium v20220720 (6.1B)
    description: Cohere medium v20220720 (6.1B parameters)
    creator_organization: Cohere
    access: limited
    num_parameters: 6100000000
    release_date: 2022-07-20
  - name: cohere/small-20220720
    display_name: Cohere small v20220720 (410M)
    description: Cohere small v20220720 (410M parameters), which is deprecated by Cohere as of December 2, 2022.
    creator_organization: Cohere
    access: limited
    num_parameters: 410000000
    release_date: 2022-07-20
  - name: cohere/xlarge-20221108
    display_name: Cohere xlarge v20221108 (52.4B)
    description: Cohere xlarge v20221108 (52.4B parameters)
    creator_organization: Cohere
    access: limited
    num_parameters: 52400000000
    release_date: 2022-11-08
  - name: cohere/medium-20221108
    display_name: Cohere medium v20221108 (6.1B)
    description: Cohere medium v20221108 (6.1B parameters)
    creator_organization: Cohere
    access: limited
    num_parameters: 6100000000
    release_date: 2022-11-08
  - name: cohere/command-medium-beta
    display_name: Cohere Command beta (6.1B)
    description: Cohere Command beta (6.1B parameters) is fine-tuned from the medium model to respond well to instruction-like prompts ([details](https://docs.cohere.ai/docs/command-beta)).
    creator_organization: Cohere
    access: limited
    num_parameters: 6100000000
    release_date: 2022-11-08
    todo: true
  - name: cohere/command-xlarge-beta
    display_name: Cohere Command beta (52.4B)
    description: Cohere Command beta (52.4B parameters) is fine-tuned from the XL model to respond well to instruction-like prompts ([details](https://docs.cohere.ai/docs/command-beta)).
    creator_organization: Cohere
    access: limited
    num_parameters: 52400000000
    release_date: 2022-11-08
    todo: true

  # DeepMind
  - name: deepmind/gopher
    display_name: Gopher (280B)
    description: Gopher (280B parameters) ([paper](https://arxiv.org/pdf/2112.11446.pdf)).
    creator_organization: DeepMind
    access: closed
    todo: true
  - name: deepmind/chinchilla
    display_name: Chinchilla (70B)
    description: Chinchilla (70B parameters) ([paper](https://arxiv.org/pdf/2203.15556.pdf)).
    creator_organization: DeepMind
    access: closed
    todo: true

  # EleutherAI
  - name: together/gpt-j-6b
    display_name: GPT-J (6B)
    description: GPT-J (6B parameters) autoregressive language model trained on The Pile ([details](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/)).
    creator_organization: EleutherAI
    access: open
    num_parameters: 6000000000
    release_date: 2021-06-04
  - name: together/gpt-neox-20b
    display_name: GPT-NeoX (20B)
    description: GPT-NeoX (20B parameters) autoregressive language model trained on The Pile ([paper](https://arxiv.org/pdf/2204.06745.pdf)).
    creator_organization: EleutherAI
    access: open
    num_parameters: 20000000000
    release_date: 2022-02-02

  # Google
  - name: together/t5-11b
    display_name: T5 (11B)
    description: T5 (11B parameters) is an encoder-decoder model trained on a multi-task mixture, where each task is converted into a text-to-text format ([paper](https://arxiv.org/pdf/1910.10683.pdf)).
    creator_organization: Google
    access: open
    num_parameters: 11000000000
    release_date: 2019-10-23

  - name: together/ul2
    display_name: UL2 (20B)
    description: UL2 (20B parameters) is an encoder-decoder model trained on the C4 corpus. It's similar to T5 but trained with a different objective and slightly different scaling knobs ([paper](https://arxiv.org/pdf/2205.05131.pdf)).
    creator_organization: Google
    access: open
    num_parameters: 20000000000
    release_date: 2022-05-10

  - name: together/flan-t5-xxl
    display_name: Flan-T5 (11B)
    description: Flan-T5 (11B parameters) is T5 fine-tuned on 1.8K tasks ([paper](https://arxiv.org/pdf/2210.11416.pdf)).
    creator_organization: Google
    access: open

  - name: google/palm
    display_name: PaLM (540B)
    description: Pathways Language Model (540B parameters) is trained using 6144 TPU v4 chips ([paper](https://arxiv.org/pdf/2204.02311.pdf)).
    creator_organization: Google
    access: closed
    todo: true

  # HazyResearch
  - name: together/h3-2.7b
    display_name: H3 (2.7B)
    description: H3 (2.7B parameters) is a decoder-only language model based on state space models ([paper](https://arxiv.org/abs/2212.14052)).
    creator_organization: HazyResearch
    access: open
    num_parameters: 2700000000
    release_date: 2023-01-23
    todo: true

  # Meta
  - name: together/opt-iml-175b
    display_name: OPT-IML (175B)
    description: OPT-IML (175B parameters) is a suite of decoder-only transformer LMs that are multi-task fine-tuned on 2000 datasets ([paper](https://arxiv.org/pdf/2212.12017.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 175000000000
    release_date: 2022-12-22
    todo: true

  - name: together/opt-iml-30b
    display_name: OPT-IML (30B)
    description: OPT-IML (30B parameters) is a suite of decoder-only transformer LMs that are multi-task fine-tuned on 2000 datasets ([paper](https://arxiv.org/pdf/2212.12017.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 30000000000
    release_date: 2022-12-22
    todo: true

  - name: together/opt-175b
    display_name: OPT (175B)
    description: Open Pre-trained Transformers (175B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 175000000000
    release_date: 2022-05-02

  - name: together/opt-66b
    display_name: OPT (66B)
    description: Open Pre-trained Transformers (66B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 66000000000
    release_date: 2022-05-02

  - name: together/opt-6.7b
    display_name: OPT (6.7B)
    description: Open Pre-trained Transformers (6.7B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 6700000000
    release_date: 2022-05-02

  - name: together/opt-1.3b
    display_name: OPT (1.3B)
    description: Open Pre-trained Transformers (1.3B parameters) is a suite of decoder-only pre-trained transformers that are fully and responsibly shared with interested researchers ([paper](https://arxiv.org/pdf/2205.01068.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 1300000000
    release_date: 2022-05-02

  - name: together/galactica-120b
    display_name: Galactica (120B)
    description: Galactica (120B parameters) is trained on 48 million papers, textbooks, lecture notes, compounds and proteins, scientific websites, etc. ([paper](https://galactica.org/static/paper.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 120000000000
    release_date: 2022-11-15
    todo: true

  - name: together/galactica-30b
    display_name: Galactica (30B)
    description: Galactica (30B parameters) is trained on 48 million papers, textbooks, lecture notes, compounds and proteins, scientific websites, etc. ([paper](https://galactica.org/static/paper.pdf)).
    creator_organization: Meta
    access: open
    num_parameters: 30000000000
    release_date: 2022-11-15
    todo: true

  # Microsoft/NVIDIA
  - name: microsoft/TNLGv2_530B
    display_name: TNLG v2 (530B)
    description: TNLG v2 (530B parameters) autoregressive language model trained on a filtered subset of the Pile and CommonCrawl ([paper](https://arxiv.org/pdf/2201.11990.pdf)).
    creator_organization: Microsoft/NVIDIA
    access: closed
    num_parameters: 530000000000
    release_date: 2022-01-28
  - name: microsoft/TNLGv2_7B
    display_name: TNLG v2 (6.7B)
    description: TNLG v2 (6.7B parameters) autoregressive language model trained on a filtered subset of the Pile and CommonCrawl ([paper](https://arxiv.org/pdf/2201.11990.pdf)).
    creator_organization: Microsoft/NVIDIA
    access: closed
    num_parameters: 6700000000
    release_date: 2022-01-28

  # OpenAI: https://beta.openai.com/docs/engines/gpt-3
  - name: openai/davinci
    display_name: davinci (175B)
    description: Original GPT-3 (175B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 175000000000
    release_date: 2020-05-28
  - name: openai/curie
    display_name: curie (6.7B)
    description: Original GPT-3 (6.7B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 6700000000
    release_date: 2020-05-28
  - name: openai/babbage
    display_name: babbage (1.3B)
    description: Original GPT-3 (1.3B parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 1300000000
    release_date: 2020-05-28
  - name: openai/ada
    display_name: ada (350M)
    description: Original GPT-3 (350M parameters) autoregressive language model ([paper](https://arxiv.org/pdf/2005.14165.pdf), [docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 350000000
    release_date: 2020-05-28
  - name: openai/text-davinci-003
    display_name: text-davinci-003
    description: text-davinci-003 model that involves reinforcement learning (PPO) with reward models. Derived from text-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 175000000000
    release_date: 2022-11-28
  - name: openai/text-davinci-002
    display_name: text-davinci-002
    description: text-davinci-002 model that involves supervised fine-tuning on human-written demonstrations. Derived from code-davinci-002 ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 175000000000
    release_date: 2022-01-27
  - name: openai/text-davinci-001
    display_name: text-davinci-001
    description: text-davinci-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 175000000000
    release_date: 2022-01-27
    todo: true
  - name: openai/text-curie-001
    display_name: text-curie-001
    description: text-curie-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 6700000000
    release_date: 2022-01-27
  - name: openai/text-babbage-001
    display_name: text-babbage-001
    description: text-babbage-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 1300000000
    release_date: 2022-01-27
  - name: openai/text-ada-001
    display_name: text-ada-001
    description: text-ada-001 model that involves supervised fine-tuning on human-written demonstrations ([docs](https://beta.openai.com/docs/model-index-for-researchers)).
    creator_organization: OpenAI
    access: limited
    num_parameters: 350000000
    release_date: 2022-01-27
  - name: openai/gpt-4-0314
    display_name: gpt-4-0314
    description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 from March 14th 2023.
    creator_organization: OpenAI
    access: limited
    release_date: 2023-03-14
  - name: openai/gpt-4-32k-0314
    display_name: gpt-4-32k-0314
    description: GPT-4 is a large multimodal model (currently only accepting text inputs and emitting text outputs) that is optimized for chat but works well for traditional completions tasks. Snapshot of gpt-4 with a longer context length of 32,768 tokens from March 14th 2023.
    creator_organization: OpenAI
    access: limited
    release_date: 2023-03-14
  - name: openai/code-davinci-002
    display_name: code-davinci-002
    description: Codex-style model that is designed for pure code-completion tasks ([docs](https://beta.openai.com/docs/models/codex)).
    creator_organization: OpenAI
    access: limited
  - name: openai/code-davinci-001
    display_name: code-davinci-001
    description: code-davinci-001 model
    creator_organization: OpenAI
    access: limited
    todo: true
  - name: openai/code-cushman-001
    display_name: code-cushman-001 (12B)
    description: Codex-style model that is a stronger, multilingual version of the Codex (12B) model in the [Codex paper](https://arxiv.org/pdf/2107.03374.pdf).
    creator_organization: OpenAI
    access: limited
  - name: openai/gpt-3.5-turbo-0301
    display_name: gpt-3.5-turbo-0301
    description: Sibling model of text-davinci-003 that is optimized for chat but works well for traditional completions tasks as well. Snapshot from 2023-03-01.
    creator_organization: OpenAI
    access: limited
    release_date: 2023-03-01
  - name: openai/chat-gpt
    display_name: ChatGPT
    description: Sibling model to InstructGPT which interacts in a conversational way. See [OpenAI's announcement](https://openai.com/blog/chatgpt/). The size of the model is unknown.
    creator_organization: OpenAI
    access: limited
    release_date: 2022-11-30
    todo: true

  # Together
  - name: together/Together-gpt-JT-6B-v1
    display_name: GPT-JT (6B)
    description: GPT-JT (6B parameters) is a fork of GPT-J ([blog post](https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered-by-open-source-ai)).
    creator_organization: Together
    access: open
    num_parameters: 6000000000
    release_date: 2022-11-29
    todo: true
  - name: together/gpt-neoxt-chat-base-20b
    display_name: GPT-NeoXT-Chat-Base (20B)
    description: GPT-NeoXT-Chat-Base (20B) is fine-tuned from GPT-NeoX, serving as a base model for developing open-source chatbots.
    creator_organization: Together
    access: open
    num_parameters: 20000000000
    release_date: 2023-03-08
    todo: true

  # Salesforce
  - name: together/codegen
    display_name: CodeGen (16B)
    description: CodeGen (16B parameters) is an open dense code model trained for multi-turn program synthesis ([paper](https://arxiv.org/pdf/2203.13474.pdf)).
    creator_organization: Salesforce
    access: open
    num_parameters: 16000000000
    release_date: 2022-03-25
    todo: true

  # Tsinghua
  - name: together/glm
    display_name: GLM (130B)
    description: GLM (130B parameters) is an open bilingual (English & Chinese) bidirectional dense model that was trained using General Language Model (GLM) procedure ([paper](https://arxiv.org/pdf/2210.02414.pdf)).
    creator_organization: Tsinghua
    access: open
    num_parameters: 130000000000
    release_date: 2022-08-04

  - name: together/codegeex
    display_name: CodeGeeX (13B)
    description: CodeGeeX (13B parameters) is an open dense code model trained on more than 20 programming languages on a corpus of more than 850B tokens ([blog](http://keg.cs.tsinghua.edu.cn/codegeex/)).
    creator_organization: Tsinghua
    access: open
    num_parameters: 13000000000
    release_date: 2022-09-19
    todo: true

  # Writer
  - name: writer/palmyra-base
    display_name: Palmyra Base (5B)
    description: Palmyra Base (5B)
    creator_organization: Writer
    access: limited
    num_parameters: 5000000000
    release_date: 2022-10-13
  - name: writer/palmyra-large
    display_name: Palmyra Large (20B)
    description: Palmyra Large (20B)
    creator_organization: Writer
    access: limited
    num_parameters: 20000000000
    release_date: 2022-12-23
  - name: writer/palmyra-instruct-30
    display_name: InstructPalmyra (30B)
    description: InstructPalmyra (30B)
    creator_organization: Writer
    access: limited
    num_parameters: 30000000000
    release_date: 2023-02-16
  - name: writer/palmyra-e
    display_name: Palmyra E (30B)
    description: Palmyra E (30B)
    creator_organization: Writer
    access: limited
    num_parameters: 30000000000
    release_date: 2023-03-03
  - name: writer/silk-road
    display_name: Silk Road (35B)
    description: Silk Road (35B)
    creator_organization: Writer
    access: limited
    num_parameters: 35000000000
    release_date: 2023-04-13

  # Yandex
  - name: together/yalm
    display_name: YaLM (100B)
    description: YaLM (100B parameters) is an autoregressive language model trained on English and Russian text ([GitHub](https://github.com/yandex/YaLM-100B)).
    creator_organization: Yandex
    access: open
    num_parameters: 100000000000
    release_date: 2022-06-23

  # Nvidia
  - name: nvidia/megatron-gpt2
    display_name: Megatron GPT2
    description: GPT-2 implemented in Megatron-LM ([paper](https://arxiv.org/abs/1909.08053)).
    creator_organization: Nvidia
    access: open

############################################################
adapter:
  - name: method
    description: The high-level strategy for converting instances into a prompt for the language model.
    values:
      - name: generation
        description: Given the input, the model generates the output free-form.
      - name: multiple_choice_joint
        description: Given the input, the model selects from multiple-choice options (A., B., C., D., E.).
      - name: multiple_choice_separate_original
        description: For each answer choice, the model assigns the input and answer choice a probability, returning the answer with maximum probability.
      - name: multiple_choice_separate_calibrated
        description: For each answer choice, the model assigns the input and answer choice a probability, returning the answer with maximum probability when calibrated by answer choice probability.
      - name: language_modeling
        description: Given the input, the model assigns the sequence a probability.
  - name: instructions
    description: The description of the task that is included at the very beginning of the prompt.
  - name: global_prefix
    description: The string that is prepended to the prompt.
  - name: instance_prefix
    description: The string that is included before each instance (e.g., '\n\n').
  - name: input_prefix
    description: The string that is included before each input (e.g., 'Question:').
  - name: input_suffix
    description: The string that is included after each input (e.g., '\n').
  - name: reference_prefix
    description: The string that is included before each reference (for multiple-choice questions).
  - name: reference_suffix
    description: The string that is included after each reference (for multiple-choice questions).
  - name: output_prefix
    description: The string that is included before the correct answer/predicted output (e.g., 'Answer:').
  - name: output_suffix
    description: The string that is included after the correct answer/predicted output (e.g., '\n').
  - name: substitutions
    description: A list of regular expression substitutions (e.g., replacing '\n' with ';\n') to perform at the very end on the prompt.
  - name: max_train_instances
    description: Maximum number of training instances to include in the prompt (currently chosen by random sampling).
  - name: max_eval_instances
    description: Maximum number of instances to evaluate on (over all splits - test, valid, etc.).
  - name: num_outputs
    description: Maximum number of possible outputs to generate by sampling multiple outputs.
  - name: num_train_trials
    description: Number of trials, where in each trial we choose an independent, random set of training instances. Used to compute variance.
  - name: model
    description: Name of the language model (<organization>/<model name>) to send requests to.
  - name: temperature
    description: Temperature parameter used in generation.
  - name: max_tokens
    description: Maximum number of tokens to generate.
  - name: stop_sequences
    description: List of sequences, where we stop generation if we encounter any of them.
  - name: random
    description: Random seed (string), which guarantees reproducibility.
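  #
  # Illustrative sketch only: how the adapter fields above might be combined for a
  # few-shot question-answering prompt. Every value below is a hypothetical
  # assumption chosen for illustration, not a setting taken from any real run.
  #
  #   method: generation
  #   instructions: "Answer the following questions.\n"
  #   instance_prefix: "\n"
  #   input_prefix: "Question: "
  #   input_suffix: "\n"
  #   output_prefix: "Answer: "
  #   output_suffix: "\n"
  #   max_train_instances: 5
  #   num_outputs: 1
  #   temperature: 0.0
  #   max_tokens: 20
  #   stop_sequences: ["\n"]
  #
  # Under these assumed values, each in-context example would render roughly as
  # "Question: <input>\nAnswer: <output>\n", preceded by the instructions.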

############################################################
metrics:
  # Infrastructure metrics:
  - name: num_perplexity_tokens
    display_name: '# tokens'
    description: Average number of tokens in the predicted output (for language modeling, the input too).
  - name: num_bytes
    display_name: '# bytes'
    description: Average number of bytes in the predicted output (for language modeling, the input too).

  - name: num_references
    display_name: '# ref'
    description: Number of references.
  - name: num_train_trials
    display_name: '# trials'
    description: Number of trials, where in each trial we choose an independent, random set of training instances.
  - name: estimated_num_tokens_cost
    display_name: 'cost'
    description: An estimate of the number of tokens (including prompt and output completions) needed to perform the request.
  - name: num_prompt_tokens
    display_name: '# prompt tokens'
    description: Number of tokens in the prompt.
  - name: num_prompt_characters
    display_name: '# prompt chars'
    description: Number of characters in the prompt.
  - name: num_completion_tokens
    display_name: '# completion tokens'
    description: Actual number of completion tokens (over all completions).
  - name: num_output_tokens
    display_name: '# output tokens'
    description: Actual number of output tokens.
  - name: max_num_output_tokens
    display_name: 'Max output tokens'
    description: Maximum number of output tokens (overestimate since we might stop earlier due to stop sequences).
  - name: num_requests
    display_name: '# requests'
    description: Number of distinct API requests.
  - name: num_instances
    display_name: '# eval'
    description: Number of evaluation instances.
  - name: num_train_instances
    display_name: '# train'
    description: Number of training instances (e.g., in-context examples).
  - name: prompt_truncated
    display_name: truncated
    description: Fraction of instances where the prompt itself was truncated (implies that there were no in-context examples).
  - name: finish_reason_length
    display_name: finish b/c length
    description: Fraction of instances where the output was terminated because of the max tokens limit.
  - name: finish_reason_stop
    display_name: finish b/c stop
    description: Fraction of instances where the output was terminated because of the stop sequences.
  - name: finish_reason_endoftext
    display_name: finish b/c endoftext
    description: Fraction of instances where the output was terminated because the end of text token was generated.
  - name: finish_reason_unknown
    display_name: finish b/c unknown
    description: Fraction of instances where the output was terminated for unknown reasons.
  - name: num_completions
    display_name: '# completions'
    description: Number of completions.
  - name: predicted_index
    display_name: Predicted index
    description: Integer index of the reference (0, 1, ...) that was predicted by the model (for multiple-choice).

  # Accuracy metrics:
  - name: exact_match
    display_name: Exact match
    short_display_name: EM
    description: Fraction of instances where the predicted output matches a correct reference exactly.
    lower_is_better: false
  - name: quasi_exact_match
    display_name: Quasi-exact match
    short_display_name: EM
    description: Fraction of instances where the predicted output matches a correct reference up to light processing.
    lower_is_better: false
  - name: prefix_exact_match
    display_name: Prefix exact match
    short_display_name: PEM
    description: Fraction of instances where the predicted output matches the prefix of a correct reference exactly.
    lower_is_better: false
  - name: quasi_prefix_exact_match
    # TODO: should call this prefix_quasi_exact_match
    display_name: Prefix quasi-exact match
    short_display_name: PEM
    description: Fraction of instances where the predicted output matches the prefix of a correct reference up to light processing.
    lower_is_better: false
  - name: exact_match@5
    display_name: Exact match @5
    short_display_name: EM@5
    description: Fraction of instances where at least one predicted output among the top 5 matches a correct reference exactly.
    lower_is_better: false
  - name: quasi_exact_match@5
    display_name: Quasi-exact match @5
    short_display_name: EM@5
    description: Fraction of instances where at least one predicted output among the top 5 matches a correct reference up to light processing.
    lower_is_better: false
  - name: logprob
    display_name: Log probability
    short_display_name: Logprob
    description: Predicted output's average log probability (input's log prob for language modeling).
    lower_is_better: false
  - name: logprob_per_byte
    display_name: Log probability / byte
    short_display_name: Logprob/byte
    description: Predicted output's average log probability normalized by the number of bytes.
    lower_is_better: false
  - name: bits_per_byte
    display_name: Bits/byte
    short_display_name: BPB
    lower_is_better: true
    description: Average number of bits per byte according to model probabilities.
  - name: perplexity
    display_name: Perplexity
    short_display_name: PPL
    lower_is_better: true
    description: Perplexity of the output completion (effective branching factor per output token).
  - name: rouge_1
    display_name: ROUGE-1
    description: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
    lower_is_better: false
  - name: rouge_2
    display_name: ROUGE-2
    description: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
    lower_is_better: false
  - name: rouge_l
    display_name: ROUGE-L
    description: Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
    lower_is_better: false
  - name: bleu_1
    display_name: BLEU-1
    description: Average BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap.
    lower_is_better: false
  - name: bleu_4
    display_name: BLEU-4
    description: Average BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap.
    lower_is_better: false
  - name: f1_set_match
    display_name: F1 (set match)
    short_display_name: F1
    description: Average F1 score in terms of set overlap between the model predicted set and correct reference set.
    lower_is_better: false
  - name: f1_score
    display_name: F1
    description: Average F1 score in terms of word overlap between the model output and correct reference.
    lower_is_better: false
  - name: classification_macro_f1
    display_name: Macro-F1
    description: Population-level macro-averaged F1 score.
    lower_is_better: false
  - name: classification_micro_f1
    display_name: Micro-F1
    description: Population-level micro-averaged F1 score.
    lower_is_better: false
  - name: absolute_value_difference
    display_name: Absolute difference
    short_display_name: Diff.
    lower_is_better: true
    description: Average absolute difference between the model output (converted to a number) and the correct reference.
  - name: distance
    display_name: Geometric distance
    short_display_name: Dist.
    lower_is_better: true
    description: Average geometric distance between the model output (as a point) and the correct reference (as a curve).
  - name: percent_valid
    display_name: Valid fraction
    short_display_name: Valid
    description: Fraction of valid model outputs (as a number).
    lower_is_better: false
  - name: NDCG@10
    display_name: NDCG@10
    description: Normalized discounted cumulative gain at 10 in information retrieval.
    lower_is_better: false
  - name: RR@10
    display_name: RR@10
    description: Mean reciprocal rank at 10 in information retrieval.
    lower_is_better: false
  - name: NDCG@20
    display_name: NDCG@20
    description: Normalized discounted cumulative gain at 20 in information retrieval.
    lower_is_better: false
  - name: RR@20
    display_name: RR@20
    description: Mean reciprocal rank at 20 in information retrieval.
    lower_is_better: false
  - name: math_equiv
    display_name: Equivalent
    description: Fraction of model outputs that are mathematically equivalent to the correct reference.
    lower_is_better: false
  - name: math_equiv_chain_of_thought
    display_name: Equivalent (chain of thought)
    description: Fraction of model outputs that are mathematically equivalent to the correct reference when using chain-of-thought prompting.
    lower_is_better: false
  - name: exact_match_indicator
    display_name: Exact match (up to the specified indicator)
    short_display_name: EM
    description: Fraction of instances where the predicted output matches a correct reference exactly, ignoring text preceding the specified indicator.
    lower_is_better: false
  - name: exact_set_match
    display_name: Exact match (as sets)
    short_display_name: EM
    description: Fraction of instances where the predicted output matches a correct reference exactly as sets.
    lower_is_better: false
  - name: iou_set_match
    display_name: Intersection over union (as sets)
    short_display_name: IoU
    description: Intersection over union in terms of set overlap between the model predicted set and correct reference set.
    lower_is_better: false

  # Summarization metrics
  - name: summac
    display_name: SummaC
    description: Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
    lower_is_better: false
  - name: QAFactEval
    display_name: QAFactEval
    description: Faithfulness scores based on the QAFactEval method of Fabbri et al. (2022).
    lower_is_better: false
  - name: summarization_coverage
    display_name: Coverage
    description: Extent to which the model-generated summaries are extractive fragments from the source document [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/).
  - name: summarization_density
    display_name: Density
    description: Extent to which the model-generated summaries are extractive summaries based on the source document [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/).
  - name: summarization_compression
    display_name: Compression
    description: Extent to which the model-generated summaries are compressed relative to the source document [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/).
  - name: BERTScore-P
    display_name: BERTScore (P)
    description: Average BERTScore precision [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and reference summary.
    lower_is_better: false
  - name: BERTScore-R
    display_name: BERTScore (R)
    description: Average BERTScore recall [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and reference summary.
    lower_is_better: false
  - name: BERTScore-F
    display_name: BERTScore (F1)
    description: Average BERTScore F1 [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and reference summary.
    lower_is_better: false
  - name: HumanEval-faithfulness
    display_name: HumanEval-faithfulness
    description: Human evaluation score for faithfulness.
    lower_is_better: false
  - name: HumanEval-relevance
    display_name: HumanEval-relevance
    description: Human evaluation score for relevance.
    lower_is_better: false
  - name: HumanEval-coherence
    display_name: HumanEval-coherence
    description: Human evaluation score for coherence.
    lower_is_better: false

  #  Code metrics
  - name: code_eval_acc
    display_name: Correctness
    short_display_name: Correctness
    description: Fraction of instances where the model output evaluates to the correct answer.
    lower_is_better: false
  - name: pass
    display_name: pass@1
    description: Fraction of model outputs that pass the associated test cases.
    lower_is_better: false
  - name: test_avg
    display_name: 'Avg. # tests passed'
    description: Average number of tests passed by model outputs.
    lower_is_better: false
  - name: strict_acc
    display_name: Strict correctness
    short_display_name: Strict correctness
    description: Fraction of model outputs that pass all associated test cases.
    lower_is_better: false

  # Disinformation metrics (measure diversity):
  - name: self_bleu
    display_name: Self-BLEU
    description: Self-BLEU metric of [Zhu et al. (2018)](https://arxiv.org/pdf/1802.01886.pdf) as a measure of diversity in model generations.
  - name: monte_carlo_entropy
    display_name: Entropy (Monte Carlo)
    short_display_name: Entropy
    description: Monte Carlo estimate of the entropy as a measure of diversity in model generations.
  # Disinformation wedging metrics (measure quality of disinformation):
  - name: wedging_eval_address_intended_audience
    display_name: Address intended audience (human)
    short_display_name: Intended audience (human)
    description: Human judgment of fraction of model generations that target the intended audience.
  - name: wedging_eval_include_intended_goal
    display_name: Include intended goal (human)
    short_display_name: Intended goal (human)
    description: Human judgment of fraction of model generations that support the intended goal.
  - name: wedging_eval_style
    display_name: Correct style (human)
    short_display_name: Style (human)
    description: Average human judgment of whether model generations are of the specified style.
  - name: wedging_eval_divisiveness
    display_name: Divisiveness (human)
    description: Average human judgment of the extent to which model generations are divisive.
  - name: wedging_eval_toxicity
    display_name: Toxicity (human)
    description: Average human judgment of the extent to which model generations are toxic.
  # Disinformation reiteration metrics (measure quality of disinformation):
  - name: reiteration_eval_support_thesis
    display_name: Support thesis (human)
    description: Average human judgment of the extent to which model generations support the stated thesis.
  - name: reiteration_eval_style
    display_name: Style (human)
    description: Average human judgment of whether model generations are of the specified style.

  # Copyright metrics (measure copying/overlap):
  - name: longest_common_prefix_length
    display_name: Longest common prefix length
    short_display_name: LCP
    lower_is_better: true
    description: Average length of longest common prefix between model generation and reference.
  - name: edit_distance
    display_name: Edit distance (Levenshtein)
    short_display_name: Edit dist.
    description: Average Levenshtein edit distance between model generation and reference.
    lower_is_better: false
  - name: edit_similarity
    display_name: Edit similarity (Levenshtein)
    short_display_name: Edit sim.
    lower_is_better: true
    description: Average Levenshtein edit similarity (1 - distance normalized by length of longer sequence) between model generation and reference.

  # Bias metrics:
  - name: bias_metric:mode=associations,demographic_category=race,target_category=profession
    display_name: Stereotypical associations (race, profession)
    short_display_name: Stereotypes (race)
    lower_is_better: true
    description: Measures uneven association of racial groups (Asian, Hispanic, White) with target professions. This measurement is based on co-occurrence statistics between the racially-associated names (word list from [Garg et al., 2018](https://www.pnas.org/doi/10.1073/pnas.1720347115); race associations based on US Census statistics) and the target professions (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).
  - name: bias_metric:mode=associations,demographic_category=race,target_category=adjective
    display_name: Stereotypical associations (race, adjectives)
    short_display_name: Stereotypes (race)
    lower_is_better: true
    description: Measures uneven association of racial groups (Asian, Hispanic, White) with target adjectives. This measurement is based on co-occurrence statistics between the racially-associated names (word list from [Garg et al., 2018](https://www.pnas.org/doi/10.1073/pnas.1720347115); race associations based on US Census statistics) and the target adjectives (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).

  - name: bias_metric:mode=associations,demographic_category=gender,target_category=profession
    display_name: Stereotypical associations (gender, profession)
    short_display_name: Stereotypes (gender)
    lower_is_better: true
    description: Measures uneven association of gender groups (male, female) with target professions. This measurement is based on co-occurrence statistics between the gender terms (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)) and the target professions (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).
  - name: bias_metric:mode=associations,demographic_category=gender,target_category=adjective
    display_name: Stereotypical associations (gender, adjectives)
    short_display_name: Stereotypes (gender)
    lower_is_better: true
    description: Measures uneven association of gender groups (male, female) with target adjectives. This measurement is based on co-occurrence statistics between the gender terms (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)) and the target adjectives (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).

  - name: bias_metric:mode=representation,demographic_category=race
    display_name: Demographic representation (race)
    short_display_name: Representation (race)
    lower_is_better: true
    description: Measures uneven representation of racial groups (Asian, Hispanic, White). This measurement is based on disparities in the frequency statistics across racially-associated names (word list from [Garg et al., 2018](https://www.pnas.org/doi/10.1073/pnas.1720347115); race associations based on US Census statistics).
  - name: bias_metric:mode=representation,demographic_category=gender
    display_name: Demographic representation (gender)
    short_display_name: Representation (gender)
    lower_is_better: true
    description: Measures uneven representation of gender groups (male, female). This measurement is based on disparities in the frequency statistics across gender terms (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).

  - name: bbq_metric_ambiguous_bias
    display_name: BBQ (ambiguous)
    lower_is_better: true
    description: Metric of [Parrish et al. (2022)](https://aclanthology.org/2022.findings-acl.165/) for BBQ on ambiguous examples.
  - name: bbq_metric_unambiguous_bias
    display_name: BBQ (unambiguous)
    lower_is_better: true
    description: Metric of [Parrish et al. (2022)](https://aclanthology.org/2022.findings-acl.165/) for BBQ on unambiguous examples.

  # Toxicity metrics
  - name: expected_max_toxicity
    display_name: Average max toxicity
    short_display_name: Avg. max toxicity
    lower_is_better: true
    description: Average max toxicity score across all completions (based on the PerspectiveAPI toxicity classifier; [Gehman et al. (2020)](https://aclanthology.org/2020.findings-emnlp.301/)).
  - name: max_toxicity_probability
    display_name: max toxicity
    lower_is_better: true
    description: Fraction of instances where the most toxic completion is above 0.5 toxicity (based on the PerspectiveAPI toxicity classifier; [Gehman et al. (2020)](https://aclanthology.org/2020.findings-emnlp.301/)).
  - name: toxic_frac
    display_name: Toxic fraction
    lower_is_better: true
    description: Fraction of model outputs that are toxic (based on the PerspectiveAPI toxicity classifier).

  # Efficiency metrics:
  - name: training_co2_cost
    display_name: Estimated training emissions (kg CO2)
    short_display_name: Training emissions (kg CO2)
    lower_is_better: true
    description: Estimate of the CO2 emissions from training the model.
  - name: training_energy_cost
    display_name: Estimated training energy cost (MWh)
    short_display_name: Training energy (MWh)
    lower_is_better: true
    description: Estimate of the amount of energy used to train the model.
  - name: inference_runtime
    display_name: Observed inference runtime (s)
    short_display_name: Observed inference time (s)
    lower_is_better: true
    description: Average observed time to process a request to the model (via an API, and thus depends on particular deployment).
  - name: inference_idealized_runtime
    display_name: Idealized inference runtime (s)
    short_display_name: Idealized inference time (s)
    lower_is_better: true
    description: Average time to process a request to the model based solely on the model architecture (using Megatron-LM).
  - name: inference_denoised_runtime
    display_name: Denoised inference runtime (s)
    short_display_name: Denoised inference time (s)
    lower_is_better: true
    description: Average time to process a request to the model minus performance contention by using profiled runtimes from multiple trials of SyntheticEfficiencyScenario.
  - name: batch_size
    display_name: Batch size
    description: For batch jobs, how many requests are in a batch.

  # Calibration metrics:
  - name: ece_1_bin
    display_name: 1-bin expected calibration error
    short_display_name: ECE (1-bin)
    lower_is_better: true
    description: The (absolute value) difference between the model's average confidence and accuracy (only computed for classification tasks).
  - name: max_prob
    display_name: Max prob
    description: Model's average confidence in its prediction (only computed for classification tasks).
    lower_is_better: false
  - name: ece_10_bin
    display_name: 10-bin expected calibration error
    short_display_name: ECE (10-bin)
    lower_is_better: true
    description: The average difference between the model's confidence and accuracy, averaged across 10 bins where each bin contains an equal number of points (only computed for classification tasks). Warning - not reliable for small datasets (e.g., with < 300 examples) because each bin will have very few examples.
  - name: platt_ece_1_bin
    display_name: 1-bin expected calibration error (after Platt scaling)
    short_display_name: Platt-scaled ECE (1-bin)
    lower_is_better: true
    description: 1-bin ECE computed after applying Platt scaling to recalibrate the model's predicted probabilities.
  - name: platt_ece_10_bin
    display_name: 10-bin expected calibration error (after Platt scaling)
    short_display_name: Platt-scaled ECE (10-bin)
    lower_is_better: true
    description: 10-bin ECE computed after applying Platt scaling to recalibrate the model's predicted probabilities.
  - name: platt_coef
    display_name: Platt Scaling Coefficient
    short_display_name: Platt Coef
    description: Coefficient of the Platt scaling classifier (can compare this across tasks).
    lower_is_better: false
  - name: platt_intercept
    display_name: Platt Scaling Intercept
    short_display_name: Platt Intercept
    description: Intercept of the Platt scaling classifier (can compare this across tasks).
    lower_is_better: false
  - name: selective_cov_acc_area
    display_name: Selective coverage-accuracy area
    short_display_name: Selective Acc
    description: The area under the coverage-accuracy curve, a standard selective classification metric (only computed for classification tasks).
    lower_is_better: false
  - name: selective_acc@10
    display_name: Accuracy at 10% coverage
    short_display_name: Acc@10%
    description: The accuracy for the 10% of predictions that the model is most confident on (only computed for classification tasks).
    lower_is_better: false

############################################################
perturbations:
  - name: robustness
    display_name: Robustness
    description: Computes worst case over different robustness perturbations (misspellings, formatting, contrast sets).
  - name: fairness
    display_name: Fairness
    description: Computes worst case over different fairness perturbations (changing dialect, race of names, gender).
  - name: typos
    display_name: Typos
    description: >
      Randomly adds typos to each token in the input with probability 0.05 and computes the per-instance worst-case
      performance between perturbed and unperturbed versions.
  - name: synonym
    display_name: Synonyms
    description: >
      Randomly substitutes words in the input with WordNet synonyms with probability 0.5 and computes the per-instance
      worst-case performance between perturbed and unperturbed versions.
  - name: dialect
    display_name: SAE -> AAE
    short_display_name: Dialect
    description: >
      Deterministically substitutes SAE words in the input with AAE counterparts using the validated dictionary of [Ziems et al. (2022)](https://aclanthology.org/2022.acl-long.258/) and computes the per-instance worst-case performance between perturbed and unperturbed versions.
  - name: race
    display_name: First names by race (White -> Black)
    short_display_name: Race
    description: >
      Deterministically substitutes White first names with Black first names sampled from the lists of [Caliskan et al. (2017)](https://www.science.org/doi/10.1126/science.aal4230) and computes the per-instance worst-case performance between perturbed and unperturbed versions.
  - name: gender
    display_name: Pronouns by gender (Male -> Female)
    short_display_name: Gender
    description: >
      Deterministically substitutes male pronouns with female pronouns and computes the per-instance worst-case
      performance between perturbed and unperturbed versions.

############################################################
metric_groups:
  - name: accuracy
    display_name: Accuracy
    metrics:
      - name: ${main_name}
        split: ${main_split}

  - name: calibration
    display_name: Calibration
    metrics:
      - name: ece_10_bin
        split: ${main_split}

  - name: calibration_detailed
    display_name: Calibration
    description: Measures how calibrated the model is (how meaningful its uncertainty estimates are).
    metrics:
      - name: max_prob
        split: ${main_split}
      - name: ece_1_bin
        split: ${main_split}
      - name: ece_10_bin
        split: ${main_split}
      - name: selective_cov_acc_area
        split: ${main_split}
      - name: selective_acc@10
        split: ${main_split}
      - name: platt_ece_1_bin
        split: ${main_split}
      - name: platt_ece_10_bin
        split: ${main_split}
      - name: platt_coef
        split: ${main_split}
      - name: platt_intercept
        split: ${main_split}

  - name: robustness
    display_name: Robustness
    metrics:
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: robustness

  # TODO: Add other robustness perturbations
  - name: robustness_detailed
    display_name: Robustness
    description: Measures how robust the model is to perturbations that should leave the correct output unchanged (invariances).
    metrics:
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: typos
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: synonym

  - name: fairness
    display_name: Fairness
    metrics:
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: fairness

  # TODO: Add other fairness perturbations
  - name: fairness_detailed
    display_name: Fairness
    description: Measures how fair the model is.
    metrics:
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: dialect
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: race
      - name: ${main_name}
        split: ${main_split}
        perturbation_name: gender

  - name: bias
    display_name: Bias
    metrics:
    - name: bias_metric:mode=associations,demographic_category=race,target_category=profession
      split: ${main_split}
    - name: bias_metric:mode=associations,demographic_category=gender,target_category=profession
      split: ${main_split}
    - name: bias_metric:mode=representation,demographic_category=race
      split: ${main_split}
    - name: bias_metric:mode=representation,demographic_category=gender
      split: ${main_split}

  - name: toxicity
    display_name: Toxicity
    metrics:
    - name: toxic_frac
      split: ${main_split}

  - name: efficiency
    display_name: Efficiency
    metrics:
    - name: inference_denoised_runtime
      split: ${main_split}

  - name: efficiency_detailed
    display_name: Efficiency
    description: The efficiency of the model across both training and inference.
    metrics:
      - name: inference_runtime
        split: ${main_split}
      - name: inference_idealized_runtime
        split: ${main_split}
      - name: inference_denoised_runtime
        split: ${main_split}
      - name: training_co2_cost
        split: ${main_split}
      - name: training_energy_cost
        split: ${main_split}

  - name: general_information
    display_name: General information
    metrics:
    - name: num_instances
      split: ${main_split}
    - name: num_train_instances
      split: ${main_split}
    - name: prompt_truncated
      split: ${main_split}
    - name: num_prompt_tokens
      split: ${main_split}
    - name: num_output_tokens
      split: ${main_split}
    - name: num_train_trials
      split: ${main_split}

  # Special metrics for scenarios with more than 1 main metric
  - name: summarization_metrics
    display_name: Summarization metrics
    metrics:
      - name: summac
        split: ${main_split}
      - name: QAFactEval
        split: ${main_split}
      - name: BERTScore-F
        split: ${main_split}
      - name: summarization_coverage
        split: ${main_split}
      - name: summarization_density
        split: ${main_split}
      - name: summarization_compression
        split: ${main_split}
      - name: HumanEval-faithfulness
        split: ${main_split}
      - name: HumanEval-relevance
        split: ${main_split}
      - name: HumanEval-coherence
        split: ${main_split}

  - name: apps_metrics
    display_name: APPS metrics
    description: Metrics used for the APPS code generation benchmark.
    metrics:
      - name: test_avg
        split: ${main_split}
      - name: strict_acc
        split: ${main_split}

  - name: bbq_metrics
    display_name: BBQ metrics
    description: Metrics used for the BBQ bias benchmark.
    metrics:
      - name: bbq_metric_ambiguous_bias
        split: ${main_split}
      - name: bbq_metric_unambiguous_bias
        split: ${main_split}

  - name: copyright_metrics
    display_name: Copyright metrics
    metrics:
      - name: longest_common_prefix_length
        split: ${main_split}
      - name: edit_distance
        split: ${main_split}
      - name: edit_similarity
        split: ${main_split}

  - name: disinformation_metrics
    display_name: Disinformation metrics
    metrics:
      - name: self_bleu
        split: ${main_split}
      - name: monte_carlo_entropy
        split: ${main_split}

  - name: classification_metrics
    display_name: Classification metrics
    metrics:
      - name: classification_macro_f1
        split: ${main_split}
      - name: classification_micro_f1
        split: ${main_split}

############################################################
run_groups:
## Top-level
  - name: core_scenarios
    display_name: Core scenarios
    description: The scenarios where we evaluate all the models.
    # TODO: Could category just be supergroup everywhere?
    category: All scenarios
    subgroups:
      - question_answering
      - information_retrieval
      - summarization
      - sentiment_analysis
      - toxicity_detection
      - miscellaneous_text_classification

  - name: targeted_evaluations
    display_name: Targeted evaluations
    description: Targeted evaluation of specific skills (e.g., knowledge, reasoning) and risks (e.g., disinformation, memorization/copyright).
    category: All scenarios
    subgroups:
      - language
      - knowledge
      - reasoning
      - harms
      - efficiency

## Core scenarios
  - name: question_answering
    display_name: Question answering
    description: In question answering, given a question and (optionally, in open-book settings) a passage, the goal is to produce the answer.
      QA is a general format that captures a wide range of tasks involving varying levels of world and commonsense knowledge and reasoning abilities.
    category: Core scenarios
    subgroups:
      - mmlu
      - boolq
      - narrative_qa
      - natural_qa_closedbook
      - natural_qa_openbook_longans
      - quac
      - hellaswag
      - openbookqa
      - truthful_qa

  - name: information_retrieval
    display_name: Information retrieval
    description: In information retrieval, given a query and a set of candidate documents, the goal is to produce a ranking of the documents.
    category: Core scenarios
    subgroups:
      - msmarco_regular
      - msmarco_trec

  - name: summarization
    display_name: Summarization
    description: In text summarization, given a piece of text (paragraph or document), the goal is to produce a much shorter summary.
    category: Core scenarios
    subgroups:
      - summarization_cnndm
      - summarization_xsum

  - name: sentiment_analysis
    display_name: Sentiment analysis
    description: In sentiment classification, given a text (e.g., movie review), the goal is to predict the sentiment (positive or negative).
    category: Core scenarios
    subgroups:
      - imdb

  - name: toxicity_detection
    display_name: Toxicity detection
    description: In toxicity detection, given a text, the goal is to predict whether the text has toxic content.
    category: Core scenarios
    subgroups:
      - civil_comments

  - name: miscellaneous_text_classification
    display_name: Text classification
    description: Text classification is a general format that aims to classify text into a set of categories. This includes a wide range of classification tasks where the input is text.
    category: Core scenarios
    subgroups:
      - raft

  - name: aspirational
    display_name: Aspirational scenarios
    description: Scenarios that we should support.
    category: Core scenarios
    subgroups:
      - data_to_text_generation
      - fact_verification
      - copywriting
      - story_generation
      - biomedical_scenarios
      - clinical_scenarios
      - financial_scenarios
      - customer_service_scenarios
      - educational_scenarios
      - very_recent_scenarios
      - historical_scenarios
      - not_native_English_speaker
      - non_US_demographics
      - non_english
      - user_facing_tasks_english_dialects

  - name: language
    display_name: Language
    description: Targeted evaluation of linguistic capabilities.
    category: Targeted evaluations
    subgroups:
      - the_pile
      - twitter_aae
      - ice
      - blimp

  - name: knowledge
    display_name: Knowledge
    description: Targeted evaluation of knowledge (e.g., factual, cultural, commonsense).
    category: Targeted evaluations
    subgroups:
      - natural_qa_closedbook
      - hellaswag
      - openbookqa
      - truthful_qa
      - mmlu
      - wikifact

  - name: reasoning
    display_name: Reasoning
    description: Targeted evaluation of reasoning capabilities (e.g., mathematical, hierarchical).
    category: Targeted evaluations
    subgroups:
      - synthetic_reasoning
      - synthetic_reasoning_natural
      - babi_qa
      - dyck_language
      - gsm
      - math_regular
      - math_chain_of_thought
      - code_apps
      - code_humaneval
      - lsat_qa
      - legal_support
      - entity_data_imputation
      - entity_matching

  - name: harms
    display_name: Harms
    description: Targeted evaluation of social harms (e.g., copyright, disinformation, social bias, toxicity).
    category: Targeted evaluations
    subgroups:
      - copyright_text
      - copyright_code
      - disinformation_reiteration
      - disinformation_wedging
      - bbq
      - bold
      - real_toxicity_prompts

  - name: efficiency
    display_name: Efficiency
    description: Targeted evaluation of training and inference efficiency.
    category: Targeted evaluations
    subgroups:
      - synthetic_efficiency
    adapter_keys_shown:
      - model
      - max_tokens

  - name: calibration
    display_name: Calibration
    description: Extended calibration metrics.
    category: Targeted evaluations
    subgroups:
      - mmlu
      - imdb
      - raft
      - civil_comments
    metric_groups:
      - calibration_detailed
      - accuracy
    environment:  # need to specify an environment for metric placeholders ("none" won't match anything)
      main_name: none
      main_split: none

### Ablations
  - name: ablation_in_context
    display_name: Vary number of in-context examples
    description: Vary the number of in-context training examples.
    category: Targeted evaluations
    visibility: this_group_only
    subgroups:
      - natural_qa_openbook_longans
      - summarization_cnndm
      - imdb
      - civil_comments
    adapter_keys_shown:
      - model
      - max_train_instances
    subgroup_metric_groups_hidden:
      - robustness
      - fairness

  - name: ablation_multiple_choice
    display_name: Vary multiple-choice strategy
    description: Vary the adaptation strategy for multiple-choice questions.
    category: Targeted evaluations
    visibility: this_group_only
    subgroups:
      - hellaswag
      - openbookqa
      - truthful_qa
      - mmlu
      - blimp
      - legal_support
      - lsat_qa
      - bbq
    adapter_keys_shown:
      - model
      - method

  - name: ablation_prompts
    display_name: Vary prompting
    description: Vary the instructions and labels for input/output.
    category: Targeted evaluations
    visibility: this_group_only
    subgroups:
      - natural_qa_openbook_longans
      - summarization_cnndm
      - imdb
      - civil_comments
    adapter_keys_shown:
      - model
      - instructions
      - input_prefix
      - input_suffix
      - output_prefix
      - output_suffix

  - name: robustness_contrast_sets
    display_name: Robustness to contrast sets
    description: Evaluating equivariance to semantics-altering perturbations
    category: Targeted evaluations
    subgroup_display_mode: by_group
    visibility: this_group_only
    subgroups:
      - imdb
      - boolq

  - name: robustness_individual
    display_name: Robustness to single types of perturbations
    description: Evaluating robustness to a single perturbation at a time (e.g., typos, synonyms)
    category: Targeted evaluations
    visibility: this_group_only

### Scenarios (the actual scenarios)

# Question answering scenarios
  - name: boolq
    display_name: BoolQ
    description: The BoolQ benchmark for binary (yes/no) question answering [(Clark et al., 2019)](https://aclanthology.org/N19-1300/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: valid
    taxonomy:
      task: question answering
      what: passages from Wikipedia, questions from search queries
      who: web users
      when: 2010s
      language: English

  - name: narrative_qa
    display_name: NarrativeQA
    description: The NarrativeQA benchmark for reading comprehension over narratives [(Kočiský et al., 2017)](https://aclanthology.org/Q18-1023/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: f1_score
      main_split: test
    taxonomy:
      task: question answering
      what: passages are books and movie scripts, questions are unknown
      who: "?"
      when: "?"
      language: English

  - name: natural_qa_closedbook
    display_name: NaturalQuestions (closed-book)
    description: The NaturalQuestions [(Kwiatkowski et al., 2019)](https://aclanthology.org/Q19-1026/) benchmark for question answering based on naturally-occurring queries through Google Search. The input does not include the Wikipedia page with the answer.
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: f1_score
      main_split: valid
    taxonomy:
      task: question answering
      what: passages from Wikipedia, questions from search queries
      who: web users
      when: 2010s
      language: English

  - name: natural_qa_openbook_longans
    display_name: NaturalQuestions (open-book)
    description: The NaturalQuestions [(Kwiatkowski et al., 2019)](https://aclanthology.org/Q19-1026/) benchmark for question answering based on naturally-occurring queries through Google Search. The input includes the Wikipedia page with the answer.
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: f1_score
      main_split: valid
    taxonomy:
      task: question answering
      what: passages from Wikipedia, questions from search queries
      who: web users
      when: 2010s
      language: English

  - name: quac
    display_name: QuAC (Question Answering in Context)
    short_display_name: QuAC
    description: The QuAC benchmark for question answering in the context of dialogues [(Choi et al., 2018)](https://aclanthology.org/D18-1241/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: f1_score
      main_split: valid
    taxonomy:
      task: question answering
      what: "?"
      who: "?"
      when: "?"
      language: English

  - name: hellaswag
    display_name: HellaSwag
    description: The HellaSwag benchmark for commonsense reasoning in question answering [(Zellers et al., 2019)](https://aclanthology.org/P19-1472/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - efficiency
      - general_information
    environment:
      main_name: exact_match
      main_split: valid
    taxonomy:
      task: question answering
      what: commonsense reasoning
      who: "?"
      when: "?"
      language: English

  - name: openbookqa
    display_name: OpenbookQA
    description: The OpenbookQA benchmark for commonsense-intensive open book question answering [(Mihaylov et al., 2018)](https://aclanthology.org/D18-1260/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - efficiency
      - general_information
    environment:
      main_name: exact_match
      main_split: test
    taxonomy:
      task: question answering
      what: "?"
      who: "?"
      when: "?"
      language: English

  - name: truthful_qa
    display_name: TruthfulQA
    description: The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering [(Lin et al., 2022)](https://aclanthology.org/2022.acl-long.229/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - efficiency
      - general_information
    environment:
      main_name: exact_match
      main_split: valid
    taxonomy:
      task: question answering
      what: "?"
      who: "?"
      when: "?"
      language: English

  - name: mmlu
    display_name: MMLU (Massive Multitask Language Understanding)
    short_display_name: MMLU
    description: The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains [(Hendrycks et al., 2021)](https://openreview.net/forum?id=d7KBjmI3GmQ).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - efficiency
      - general_information
    environment:
      main_name: exact_match
      main_split: test
    taxonomy:
      task: question answering
      what: "?"
      who: "?"
      when: "?"
      language: English

# Information retrieval scenarios
  - name: msmarco_regular
    display_name: MS MARCO (regular track)
    short_display_name: MS MARCO (regular)
    description: The MS MARCO benchmark's regular track for passage retrieval in information retrieval [(https://microsoft.github.io/msmarco/)](https://microsoft.github.io/msmarco/).
    metric_groups:
      - accuracy
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: RR@10
      main_split: valid
    taxonomy:
      task: information retrieval
      what: "?"
      who: "?"
      when: "?"
      language: English

  - name: msmarco_trec
    display_name: MS MARCO (TREC track)
    short_display_name: MS MARCO (TREC)
    description: The MS MARCO benchmark's deep learning TREC track for passage retrieval in information retrieval [(https://trec.nist.gov)](https://microsoft.github.io/msmarco/).
    metric_groups:
      - accuracy
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: NDCG@10
      main_split: valid
    taxonomy:
      task: information retrieval
      what: "?"
      who: "?"
      when: "?"
      language: English

# Summarization scenarios
  - name: summarization_cnndm
    display_name: CNN/DailyMail
    description: The CNN/DailyMail benchmark for text summarization ([Hermann et al., 2015](https://papers.nips.cc/paper/2015/hash/afdec7005cc9f14302cd0474fd0f3c96-Abstract.html); [Nallapati et al., 2016](https://aclanthology.org/K16-1028/)).
    metric_groups:
      - accuracy
      - summarization_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: rouge_2
      main_split: test
    taxonomy:
      task: summarization
      what: "?"
      who: "?"
      when: "?"
      language: English

  - name: summarization_xsum
    display_name: XSUM
    description: The XSUM benchmark for text summarization of BBC news articles [(Narayan et al., 2018)](https://aclanthology.org/D18-1206/).
    metric_groups:
      - accuracy
      - summarization_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: rouge_2
      main_split: test
    taxonomy:
      task: summarization
      what: "?"
      who: "?"
      when: "?"
      language: English

# Sentiment analysis scenarios
  - name: imdb
    display_name: IMDB
    description: The IMDB benchmark for sentiment analysis of movie reviews [(Maas et al., 2011)](https://aclanthology.org/P11-1015/).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: valid
    taxonomy:
      task: sentiment analysis
      what: movie reviews
      who: "?"
      when: "?"
      language: English

# Text classification scenarios
  - name: raft
    display_name: RAFT (Real-world Annotated Few-Shot)
    short_display_name: RAFT
    description: The Real-world Annotated Few-Shot (RAFT) meta-benchmark of 11 real-world text classification tasks [(Alex et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ca46c1b9512a7a8315fa3c5a946e8265-Abstract-round2.html).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: text classification
      what: "?"
      who: "?"
      when: "?"
      language: English

# Toxicity detection scenarios
  - name: civil_comments
    display_name: CivilComments
    description: The CivilComments benchmark for toxicity detection [(Borkan et al., 2019)](https://arxiv.org/pdf/1903.04561.pdf).
    metric_groups:
      - accuracy
      - calibration
      - robustness
      - fairness
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: toxicity classification
      what: "?"
      who: "?"
      when: "?"
      language: English

## Language scenarios
# Language modeling
  - name: ice
    display_name: ICE (International Corpus of English)
    short_display_name: ICE
    description: The International Corpus of English (ICE) drawn from English speakers from various places in the world, initiated by [Greenbaum (1991)](https://www.cambridge.org/core/journals/english-today/article/abs/ice-the-international-corpus-of-english/47808205394C538393C3FD8E62E5E701).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: bits_per_byte
      main_split: test
    taxonomy:
      task: language modeling
      what: "?"
      who: "?"
      when: "?"
      language: English varieties from different nations

  - name: the_pile
    display_name: The Pile
    description: The Pile corpus for measuring language model performance across various domains [(Gao et al., 2020)](https://arxiv.org/pdf/2101.00027.pdf).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: bits_per_byte
      main_split: test
    taxonomy:
      task: language modeling
      what: "?"
      who: "?"
      when: "?"
      language: English, code

  - name: twitter_aae
    display_name: TwitterAAE
    description: The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance in tweets as a function of speaker dialect.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: bits_per_byte
      main_split: test
    taxonomy:
      task: language modeling
      what: "?"
      who: "?"
      when: "?"
      language: English (AAE-aligned and White-aligned)

  - name: twitter_aae_aa
    display_name: TwitterAAE (AA)
    description: The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance on African-American-aligned Tweets.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: bits_per_byte
      main_split: test
    taxonomy:
      task: language modeling
      what: "?"
      who: "?"
      when: "?"
      language: English (AAE-aligned)

  - name: twitter_aae_white
    display_name: TwitterAAE (white)
    description: The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance on White-aligned Tweets.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: bits_per_byte
      main_split: test
    taxonomy:
      task: language modeling
      what: "?"
      who: "?"
      when: "?"
      language: English (White-aligned)

# Minimal pairs
  - name: blimp
    display_name: BLiMP (The Benchmark of Linguistic Minimal Pairs for English)
    short_display_name: BLiMP
    description: The Benchmark of Linguistic Minimal Pairs for English (BLiMP) for measuring performance on linguistic phenomena using minimal pair design [(Warstadt et al., 2020)](https://aclanthology.org/2020.tacl-1.25/).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: exact_match
      main_split: test
    taxonomy:
      task: grammaticality
      what: constructed minimal pair sentences
      who: linguists
      when: "2019"
      language: English

## Knowledge scenarios
  - name: wikifact
    display_name: WikiFact
    description: Scenario introduced in this work, inspired by [Petroni et al. (2019)](https://aclanthology.org/D19-1250/), to more extensively test factual knowledge.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: knowledge base completion
      what: entity-relation-entity triples in natural language form
      who: automatically generated from templates
      when: "?"
      language: structured English

## Reasoning scenarios
# Primitive-focused reasoning
  - name: babi_qa
    display_name: bAbI
    description: The bAbI benchmark for measuring understanding and reasoning [(Weston et al., 2015)](https://arxiv.org/pdf/1502.05698.pdf).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: question answering
      what: reasoning
      who: synthetic
      when: "2015"
      language: English

  - name: dyck_language
    display_name: Dyck
    description: Scenario testing hierarchical reasoning through the Dyck formal languages [(Suzgun et al., 2019)](https://aclanthology.org/W19-3905/).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: exact_match_indicator
      main_split: test
    taxonomy:
      task: next-word prediction
      what: Dyck formal language
      who: n/a
      when: n/a
      language: synthetic

  - name: numeracy
    display_name: Numerical reasoning
    description: Scenario introduced in this work to test numerical reasoning via symbolic regression.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: absolute_value_difference
      main_split: test
    taxonomy:
      task: next-word prediction
      what: "?"
      who: n/a
      when: n/a
      language: synthetic

  - name: synthetic_reasoning
    display_name: Synthetic reasoning (abstract symbols)
    description: Synthetic reasoning tasks defined using abstract symbols based on LIME [(Wu et al., 2021)](https://proceedings.mlr.press/v139/wu21c.html).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: synthetic_reasoning_natural
    display_name: Synthetic reasoning (natural language)
    description: Synthetic reasoning tasks defined using simple natural language based on LIME [(Wu et al., 2021)](https://proceedings.mlr.press/v139/wu21c.html).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: f1_set_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

# Realistic reasoning
  - name: gsm
    display_name: GSM8K (Grade school math word problems)
    short_display_name: GSM8K
    description: The grade school math word problems dataset (GSM8K) for testing mathematical reasoning on grade-school math problems [(Cobbe et al., 2021)](https://arxiv.org/pdf/2110.14168.pdf).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: exact_match_indicator
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: math_regular
    display_name: MATH
    description: The MATH benchmark for measuring mathematical problem solving on competition math problems [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: math_equiv
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: math_chain_of_thought
    display_name: MATH (chain-of-thoughts)
    description: The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought-style reasoning [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: math_equiv_chain_of_thought
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: code_apps
    display_name: APPS (Code)
    description: The APPS benchmark for measuring competence on code challenges [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html).
    metric_groups:
      - apps_metrics
      - efficiency
      - general_information
    environment:
      # We do not include accuracy as it is subsumed by apps
      main_name: test_avg
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: code_humaneval
    display_name: HumanEval (Code)
    description: The HumanEval benchmark for measuring functional correctness for synthesizing programs from docstrings [(Chen et al., 2021)](https://arxiv.org/pdf/2107.03374.pdf).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: pass
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: legal_support
    display_name: LegalSupport
    description: Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: lsat_qa
    display_name: LSAT
    description: The LSAT benchmark for measuring analytical reasoning on the Law School Admission Test (LSAT; [Zhong et al., 2021](https://arxiv.org/pdf/2104.06598.pdf)).
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: lextreme
    display_name: LEXTREME
    description: A Multilingual Legal Benchmark for Natural Language Understanding
    metric_groups:
      - classification_metrics
      - calibration
      - efficiency
      - general_information
    environment:
      main_name: classification_macro_f1
      main_split: test

  - name: lex_glue
    display_name: LexGLUE
    description: A Benchmark Dataset for Legal Language Understanding in English
    metric_groups:
      - classification_metrics
      - calibration
      - efficiency
      - general_information
    environment:
      main_name: classification_macro_f1
      main_split: test

  - name: billsum_legal_summarization
    display_name: BillSum
    description: The BillSum benchmark for legal text summarization ([Kornilova & Eidelman, 2019](https://aclanthology.org/D19-5406/)).
    metric_groups:
      - accuracy
      - summarization_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: rouge_2
      main_split: test
    taxonomy:
      task: summarization
      what: legal text from US bills
      who: lawyers
      language: English

  - name: multilexsum_legal_summarization
    display_name: MultiLexSum
    description: The MultiLexSum benchmark for legal text summarization ([Shen et al., 2022](https://arxiv.org/abs/2206.10883)).
    metric_groups:
      - accuracy
      - summarization_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: rouge_2
      main_split: test
    taxonomy:
      task: summarization
      what: legal text from US civil rights lawsuits
      who: lawyers
      language: English

  - name: eurlexsum_legal_summarization
    display_name: EurLexSum
    description: The EurLexSum benchmark for legal text summarization ([Aumiller et al., 2022](https://arxiv.org/abs/2210.13448)).
    metric_groups:
      - accuracy
      - summarization_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_name: rouge_2
      main_split: test
    taxonomy:
      task: summarization
      what: legal text from EU legislation
      who: lawyers
      when: 1960 - 2020
      language: English

  - name: entity_data_imputation
    display_name: Data imputation
    description: Scenario from [Mei et al. (2021)](https://ieeexplore.ieee.org/document/9458712/) that tests the ability to impute missing entities in a data table.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: entity_matching
    display_name: Entity matching
    description: Scenario from Magellan [(Konda et al., 2016)](https://dl.acm.org/doi/10.14778/3007263.3007314) that tests the ability to determine if two entities match.
    metric_groups:
      - accuracy
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Copyright scenarios
  - name: copyright_text
    display_name: Copyright (text)
    description: Scenario introduced in this work to measure copyright and memorization behavior for books, based on [Carlini et al. (2021)](https://www.usenix.org/biblio-11958).
    metric_groups:
      - copyright_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: copyright_code
    display_name: Copyright (code)
    description: Scenario introduced in this work to measure copyright and memorization behavior for code, based on [Carlini et al. (2021)](https://www.usenix.org/biblio-11958).
    metric_groups:
      - copyright_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Disinformation scenarios
  - name: disinformation_reiteration
    display_name: Disinformation (reiteration)
    description: Scenario from [Buchanan et al. (2021)](https://cset.georgetown.edu/publication/truth-lies-and-automation/) that tests the ability to reiterate disinformation content.
    metric_groups:
      - disinformation_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_split: valid
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: disinformation_wedging
    display_name: Disinformation (wedging)
    description: Scenario from [Buchanan et al. (2021)](https://cset.georgetown.edu/publication/truth-lies-and-automation/) that tests the ability to generate divisive and wedging content.
    metric_groups:
      - disinformation_metrics
      - bias
      - toxicity
      - efficiency
      - general_information
    environment:
      main_split: valid
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Bias scenarios
  - name: bbq
    display_name: BBQ (Bias Benchmark for Question Answering)
    short_display_name: BBQ
    description: The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambiguous contexts [(Parrish et al., 2022)](https://aclanthology.org/2022.findings-acl.165/).
    metric_groups:
      - accuracy
      - bbq_metrics
      - efficiency
      - general_information
    environment:
      main_name: quasi_exact_match
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Toxicity scenarios
  - name: bold
    display_name: BOLD (Bias in Open-Ended Language Generation Dataset)
    short_display_name: BOLD
    description: The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases and toxicity in open-ended language generation [(Dhamala et al., 2021)](https://dl.acm.org/doi/10.1145/3442188.3445924).
    metric_groups:
      - toxicity
      - bias
      - efficiency
      - general_information
    environment:
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

  - name: real_toxicity_prompts
    display_name: RealToxicityPrompts
    description: The RealToxicityPrompts dataset for measuring toxicity in prompted model generations [(Gehman et al., 2020)](https://aclanthology.org/2020.findings-emnlp.301/).
    metric_groups:
      - toxicity
      - bias
      - efficiency
      - general_information
    sub_splits:
      - toxic
      - non-toxic
    environment:
      main_split: test
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Efficiency scenarios
  - name: synthetic_efficiency
    display_name: Synthetic efficiency
    description: Scenario introduced in this work to better understand inference runtime performance of various models.
    metric_groups:
      - efficiency_detailed
      - general_information
    environment:
      main_split: test
    adapter_keys_shown:
      - model
      - max_tokens
    taxonomy:
      task: "?"
      what: n/a
      who: n/a
      when: n/a
      language: synthetic

## Aspirational scenarios
# Task coverage
  - name: data_to_text_generation
    display_name: Data-to-text generation
    description: Currently, we prioritize user-facing tasks in our core scenarios, but don't implement data-to-text generation. Could be implemented via WebNLG, E2E, ToTTo, etc.
    taxonomy:
      task: data-to-text generation
    todo: true

  - name: fact_verification
    display_name: Fact verification
    description: Currently, we prioritize user-facing tasks in our core scenarios, but don't implement fact verification. Could be implemented via FEVER.
    taxonomy:
      task: fact verification
    todo: true

  - name: copywriting
    display_name: Copywriting
    description: Currently, we prioritize user-facing tasks in our core scenarios, but don't implement tasks that have not been historically studied in the NLP research community like (ad) copywriting.
    taxonomy:
      task: copywriting
    todo: true

  - name: story_generation
    display_name: Story generation
    description: Currently, we prioritize user-facing tasks in our core scenarios, but don't implement more creative and interactive tasks like story generation.
    taxonomy:
      task: story generation
    todo: true

# Domain coverage
  - name: biomedical_scenarios
    display_name: Biomedical scenarios
    description: Currently, we implement scenarios from common domains in NLP research, neglecting various domains where language technologies could provide significant value.
    taxonomy:
      what: Biomedical text (e.g., biomedicine papers)
    todo: true

  - name: clinical_scenarios
    display_name: Clinical scenarios
    description: Currently, we implement scenarios from common domains in NLP research, neglecting various domains where language technologies could provide significant value.
    taxonomy:
      what: Clinical text (e.g., clinical notes)
    todo: true

  - name: financial_scenarios
    display_name: Financial scenarios
    description: Currently, we implement scenarios from common domains in NLP research, neglecting various domains where language technologies could provide significant value.
    taxonomy:
      what: Financial text (e.g., financial reports)
    todo: true

  - name: customer_service_scenarios
    display_name: Customer service scenarios
    description: Currently, we implement scenarios from common domains in NLP research, neglecting various domains where language technologies could provide significant value.
    taxonomy:
      what: Customer service text (e.g., customer service chat logs)
    todo: true

  - name: educational_scenarios
    display_name: Educational scenarios
    description: Currently, we implement scenarios from common domains in NLP research, neglecting various domains where language technologies could provide significant value.
    taxonomy:
      what: Text from educational contexts (e.g., student-teacher interactions)
    todo: true

  - name: very_recent_scenarios
    display_name: Very recent scenarios
    description: Currently, we implement scenarios using standard NLP datasets. However, to test temporal generalization as the world and language change, we should implement scenarios with very recent data (e.g., current world events) like StreamingQA.
    taxonomy:
      when: present
    todo: true

  - name: historical_scenarios
    display_name: Scenarios involving historic data
    description: Currently, we implement scenarios using standard NLP datasets, which are predominantly from post-Internet, contemporary society. However, to test temporal generalization when models are applied to historical data in the digital humanities, we should implement scenarios with significantly older data (e.g., text from the 1800s).
    taxonomy:
      when: distant past
    todo: true

  - name: not_native_English_speaker
    display_name: Scenarios involving non-native speakers
    description: Currently, we implement scenarios of an unknown composition of native and non-native English speakers. We should implement scenarios to ensure coverage of language from non-native English speakers.
    taxonomy:
      who: non-native English speakers
      language: English
    todo: true

  - name: non_US_demographics
    display_name: Scenarios involving data from marginalized demographics in non-US English-speaking regions
    description: Currently, we ensure some coverage of language based on US-centric demographic groups, including marginalized groups. We should implement scenarios to ensure coverage of other socially relevant groups beyond US demographics (e.g., caste in India).
    taxonomy:
      who: relevant demographics in non-US English-speaking regions
      language: English
    todo: true

# Language coverage
  - name: non_english
    display_name: Scenarios for languages beyond English
    description: Currently, we only implement English scenarios.
    taxonomy:
      language: non-English
    todo: true

  - name: user_facing_tasks_english_dialects
    display_name: Scenarios with user-facing tasks on English dialects
    description: Currently, we evaluate performance on English dialects via language modeling (e.g., TwitterAAE, ICE), but it would be good to implement user-facing tasks for these dialects.
    taxonomy:
      task: user-facing tasks
      language: English dialects
    todo: true
