# Embroid Supplement

This README describes the organization of our attached supplement. In addition to the Appendix, we provide TSV files with performance numbers for each task in each of the regimes studied.

`Appendix.pdf`: contains the full paper, including the Appendix section. We provide the full paper rather than the Appendix alone only so that references are preserved.

`code/`: contains code and data to run Embroid
- `Demo.ipynb`: notebook showing how to run Embroid on predictions generated for the DBPedia Animal task
- `bert_embeddings.pickle`: BERT embeddings for task samples
- `roberta_embeddings.pickle`: RoBERTa embeddings for task samples
- `sbert_embeddings.pickle`: SentenceBert embeddings for task samples
- `votes.npy`: predictions from GPT-JT on the task (using one prompt)
- `labels.npy`: true labels for task samples
- `dbpedia_animal.tsv`: samples for dataset
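
As a quick sanity check before opening the notebook, the raw prompt predictions can be scored against the true labels. The sketch below is a pure-Python macro F1. It assumes `labels.npy` and `votes.npy` each hold one class label per sample, in the same order, which this README does not state. The function name `macro_f1` is hypothetical and not part of the supplied code.

```python
def macro_f1(labels, votes):
    """Unweighted mean of per-class F1 over every class seen in labels or votes."""
    labels, votes = list(labels), list(votes)
    if len(labels) != len(votes):
        raise ValueError("labels and votes must have the same length")
    if not labels:
        raise ValueError("need at least one sample")

    classes = sorted(set(labels) | set(votes))
    per_class = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, votes) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, votes) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, votes) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members cannot occur here,
        # but guard against division by zero anyway.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)
```

With NumPy installed, the arrays can be read with `numpy.load("labels.npy")` and `numpy.load("votes.npy")` and converted with `.tolist()`. If `votes.npy` turns out to carry an extra axis for the prompt, select a single column before scoring.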


`table_1_results.tsv` contains full results for Table 1 (page 7) as a TSV file. We report the macro F1 for each trial. 

- `task`: name of task
- `language_model`: name of language model used
- `original_prompt_performance`: macro F1 for original predictions
- `embroid_corrected_performance`: macro F1 after applying Embroid to predictions 
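
The two performance columns can be compared row by row to see how much Embroid changes each prompt's macro F1. The following is a minimal sketch using only the standard library. It assumes the file has a header row with exactly the column names listed above and that the performance values parse as floats. The function name `embroid_gains` is hypothetical.

```python
import csv


def embroid_gains(tsv_file):
    """Return (task, language_model, gain) for each row of an open TSV file.

    gain = embroid_corrected_performance - original_prompt_performance
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    gains = []
    for row in reader:
        before = float(row["original_prompt_performance"])
        after = float(row["embroid_corrected_performance"])
        gains.append((row["task"], row["language_model"], after - before))
    return gains
```

A typical call would be `with open("table_1_results.tsv", newline="") as f: gains = embroid_gains(f)`. A positive gain means Embroid improved on the original predictions for that row.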

`table_2_results` contains full results for Table 2 (page 8) as a TSV file.
- `task`: name of task
- `language_model`: name of language model used
- `flying_squid`: performance of FlyingSquid
- `liger`: performance of Liger
- `majority_vote`: performance of majority vote
- `ama`: performance of AMA
- `embroid-1`: performance of Embroid (using one prompt's predictions)
- `embroid-3`: performance of Embroid (using three prompts' predictions)


`table_3_results` contains full results for Table 3 (page 8) as a TSV file.
- `task`: name of task
- `language_model`: name of language model used
- `original_prompt`: macro F1 for original predictions
- `chain_of_thought_version`: macro F1 for chain-of-thought version of prompt
- `embroid_on_original_prompt`: macro F1 after applying Embroid to original prompt
- `embroid_on_chain_of_thought`: macro F1 after applying Embroid to chain-of-thought predictions

`table_selective_annotation_results.tsv` contains full results for Figure 2 (upper right), which shows performance when applying Embroid to selective annotation (also known as vote-k).
- `task`: name of task
- `language_model`: name of language model used
- `original_prompt`: macro F1 for original predictions
- `vote-25`: macro F1 when using selective annotation with a label budget of 25 samples
- `vote-25_embroid`: macro F1 when applying Embroid to the prompts generated by using selective annotation with a label budget of 25 samples
- `vote-50`: macro F1 when using selective annotation with a label budget of 50 samples
- `vote-50_embroid`: macro F1 when applying Embroid to the prompts generated by using selective annotation with a label budget of 50 samples
- `vote-100`: macro F1 when using selective annotation with a label budget of 100 samples
- `vote-100_embroid`: macro F1 when applying Embroid to the prompts generated by using selective annotation with a label budget of 100 samples
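
Because these column names contain hyphens, they are easiest to access by string key rather than as attributes. The sketch below computes the change in macro F1 from applying Embroid at each label budget for a single row. It assumes the row is a dict keyed by the column names above (for example, as produced by `csv.DictReader` with `delimiter="\t"`). The function name `embroid_gain_by_budget` is hypothetical.

```python
def embroid_gain_by_budget(row, budgets=(25, 50, 100)):
    """Map each label budget k to (vote-k_embroid - vote-k) for one result row."""
    gains = {}
    for k in budgets:
        base = float(row[f"vote-{k}"])
        corrected = float(row[f"vote-{k}_embroid"])
        gains[k] = corrected - base
    return gains
```

Applying this to every row of the file gives a per-task view of whether the benefit of Embroid grows or shrinks as the label budget increases.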
