OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models

21 Sept 2023 (modified: 25 Mar 2024) | ICLR 2024 Conference Withdrawn Submission
Keywords: instruction-tuning dataset, occupational bias, large language models
TL;DR: We create an instruction-tuning dataset named OccuQuest covering over 1,000 occupations to mitigate the occupational bias in large language models and promote occupation-inclusive large language models.
Abstract: The emergence of large language models (LLMs) has revolutionized natural language processing tasks. However, existing instruction-tuning datasets suffer from occupational bias: the majority of the data relates to only a few occupations, which hampers the ability of instruction-tuned LLMs to generate helpful responses to professional queries from practitioners in specific fields. To mitigate this issue and promote occupation-inclusive LLMs, we create an instruction-tuning dataset named OccuQuest, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. We systematically query ChatGPT, organizing the queries hierarchically by Occupation, Responsibility, Topic, and Question, to ensure comprehensive coverage of occupational specialty inquiries. Compared with three commonly used datasets (Dolly, ShareGPT, and WizardLM), OccuQuest exhibits a more balanced distribution across occupations. Furthermore, we assemble three test sets for comprehensive evaluation: an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on the occu-quora set, OccuLLaMA reaches a win rate of 86.4% against WizardLM. We also demonstrate the potential of combining OccuQuest with other instruction-tuning datasets to enhance the overall performance of LLMs. By fine-tuning LLaMA on a mixture of the OccuQuest and Tulu datasets, we introduce ProLLaMA, which excels at addressing occupational questions and exhibits superior performance in comprehensive evaluations such as MMLU, GSM8K, BBH, and HumanEval. Among the different LLaMA variants, the 7B and 13B ProLLaMA models achieve the highest performance on MMLU and GSM8K, with the 7B ProLLaMA model improving by more than 4 points over the other 7B variants on GSM8K.
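The hierarchical querying described in the abstract (Occupation, then Responsibility, then Topic, then Question) can be pictured as a nested expansion. The sketch below is only an illustration under stated assumptions: the prompt wording, the `Ask` callable, `build_hierarchy`, and `stub_ask` are all hypothetical names introduced here, not the authors' pipeline, and the stub stands in for a real model call so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes a prompt, returns a list of short strings.
# In practice this would wrap a chat-model call; here it is left abstract.
Ask = Callable[[str], List[str]]


def build_hierarchy(occupation: str, ask: Ask) -> List[Dict[str, str]]:
    """Expand Occupation -> Responsibility -> Topic -> Question.

    Returns one record per generated question, keeping the full path
    so each question can be traced back to its occupation.
    """
    records: List[Dict[str, str]] = []
    # Prompt templates below are assumptions made for illustration only.
    for responsibility in ask(f"List the main responsibilities of a {occupation}."):
        for topic in ask(f"List topics a {occupation} deals with when: {responsibility}"):
            for question in ask(f"Write questions a {occupation} might ask about: {topic}"):
                records.append(
                    {
                        "occupation": occupation,
                        "responsibility": responsibility,
                        "topic": topic,
                        "question": question,
                    }
                )
    return records


def stub_ask(prompt: str) -> List[str]:
    """Deterministic stand-in for a model call: always returns two items."""
    return ["A", "B"]


if __name__ == "__main__":
    rows = build_hierarchy("real estate agent", stub_ask)
    print(len(rows), "question records")
```

Keeping the full path in each record is a design choice of this sketch: it makes it straightforward to count questions per occupation afterwards, which is the kind of balance the abstract compares across datasets.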
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3565