[
    {
        "source_paper": {
            "arxiv_id": "2201.11903",
            "isAPA": true,
            "related work": "This work is inspired by many research areas, which we detail in an extended related work section (AppendixC). Here we describe two directions and associated papers that are perhaps most relevant.The first relevant direction is using intermediate steps to solve reasoning problems.Ling et al. (2017)pioneer the idea of using natural language rationales to solve math word problems through a series of intermediate steps. Their work is a remarkable contrast to the literature using formal languages to reason(Roy et al.,2015; Chiang and Chen,2019; Amini et al.,2019; Chen et al.,2019).Cobbe et al. (2021)extendLing et al. (2017)by creating a larger dataset and using it to finetune a pretrained language model rather than training a model from scratch. In the domain of program synthesis,Nye et al. (2021)leverage language models to predict the final outputs of Python programs via first line-to-line predicting the intermediate computational results, and show that their step-by-step prediction method performs better than directly predicting the final outputs.Naturally, this paper also relates closely to the large body of recent work on prompting.\nSince the popularization of few-shot prompting as given byBrown et al. (2020), several general approaches have improved the prompting ability of models, such as automatically learning prompts(Lester et al.,2021)or giving models instructions describing a task(Wei et al.,2022a; Sanh et al.,2022; Ouyang et al.,2022).\nWhereas these approaches improve or augment the input part of the prompt (e.g., instructions that are prepended to inputs), our work takes the orthogonal direction of augmenting the outputs of language models with a chain of thought.",
            "reference": [
                "Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.",
                "Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.",
                "Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. EMNLP.",
                "Jacob Andreas, Dan Klein, and Sergey Levine. 2018. Learning with latent language. NAACL.",
                "Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.",
                "BIG-bench collaboration. 2021. Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation.",
                "Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and Greg Durrett. 2021. Flexible generation of natural language deductions. EMNLP.",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. NeurIPS.",
                "Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. ICLR.",
                "Oana-Maria Camburu, Tim Rockt\u00e4schel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. NeurIPS.",
                "Howard Chen, Jacqueline He, Karthik Narasimhan, and Danqi Chen. 2022. Can rationalization improve robustness? NAACL.",
                "Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.",
                "Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2019. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. ICLR.",
                "Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2656-2668, Minneapolis, Minnesota. Association for Computational Linguistics.",
                "Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. IJCAI.",
                "Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.",
                "Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. 2019. Neural logic machines. ICLR.",
                "Dheeru Dua, Sameer Singh, and Matt Gardner. 2020. Benefits of intermediate annotations in reading comprehension. ACL.",
                "Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. TACL.",
                "Yuling Gu, Bhavana Dalvi Mishra, and Peter Clark. 2022. DREAM: Uncovering mental models behind language models. NAACL.",
                "Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher R\u00e9. 2018. Training classifiers with natural language explanations. ACL.",
                "Peter Hase and Mohit Bansal. 2022. When can models learn from explanations? a formal framework for understanding the roles of explanation data. ACL.",
                "Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.",
                "Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. EMNLP.",
                "Zhanming Jie, Jierui Li, and Wei Lu. 2022. Learning to reason deductively: Math word problem solving as complex relation extraction. arXiv preprint arXiv:2203.10316.",
                "Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.",
                "Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. NAACL.",
                "Andrew K. Lampinen, Ishita Dasgupta, Stephanie C.Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. 2022. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329.",
                "Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. 2021. MWPToolkit: An open-source framework for deep learning-based math word problem solvers. arXiv preprint arXiv:2109.00799.",
                "Teven Le Scao and Alexander Rush. 2021. How many data points is a prompt worth? NAACL.",
                "Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. EMNLP.",
                "Iddo Lev, Bill MacCartney, Christopher Manning, and Roger Levy. 2004. Solving logic puzzles: From robust processing to precise semantics. Proceedings of the 2nd Workshop on Text Meaning and Interpretation.",
                "Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. ACL.",
                "Zhengzhong Liang, Steven Bethard, and Mihai Surdeanu. 2021. Explainable multi-hop verbal reasoning through internal monologue. NAACL.",
                "Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL.",
                "Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.",
                "Bodhisattwa Prasad Majumder, Oana-Maria Camburu, Thomas Lukasiewicz, and Julian McAuley. 2021. Rationale-inspired natural language explanations with commonsense. arXiv preprint arXiv:2106.13876.",
                "Ana Marasovi\u0107, Iz Beltagy, Doug Downey, and Matthew E Peters. 2022. Few-shot self-rationalization with natural language prompts. NAACL Findings.",
                "Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. InACL.",
                "Shen Yun Miao, Chao Chun Liang, and Keh Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. ACL.",
                "Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.",
                "Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.",
                "Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.",
                "Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.",
                "Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? NAACL.",
                "Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL.",
                "Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, and Weizhu Chen. 2022. Reasoning like program executors. arXiv preprint arXiv:2201.11473.",
                "Piotr Pi\u0119kos, Mateusz Malinowski, and Henryk Michalewski. 2021. Measuring and improving BERT's mathematical abilities by predicting the order of reasoning. ACL.",
                "Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1-67.",
                "Dheeraj Rajagopal, Vidhisha Balachandran, Eduard H. Hovy, and Yulia Tsvetkov. 2021. SelfExplain: A self-explaining architecture for neural text classifiers. EMNLP.",
                "Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. ACL.",
                "Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. EMNLP.",
                "Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2021. Measuring attribution in natural language generation models. arXiv preprint arXiv:2112.12870.",
                "Gabriel Recchia. 2021. Teaching autoregressive language models complex tasks by demonstration. arXiv preprint arXiv:2109.02102.",
                "Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. A recipe for arbitrary text style transfer with large language models. ACL.",
                "Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.",
                "Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. EMNLP.",
                "Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about Quantities in Natural Language. TACL.",
                "Mohammed Saeed, Naser Ahmadi, Preslav Nakov, and Paolo Papotti. 2021. RuleBERT: Teaching soft rules to pre-trained language models. EMNLP.",
                "Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. ICLR.",
                "Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems. InFindings of the Association for Computational Linguistics: EMNLP 2021.",
                "Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. NAACL.",
                "Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. NeurIPS.",
                "Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. 2021. CommonsenseQA 2.0: Exposing the limits of ai through gamification. NeurIPS Track on Datasets and Benchmarks.",
                "Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. 2022. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.",
                "Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.",
                "Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.",
                "Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022b. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.",
                "Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.",
                "Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022b. Emergent abilities of large language models. Transactions on Machine Learning Research.",
                "Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. 2022. Reframing human-AI collaboration for generating free-text explanations. NAACL.",
                "Sarah Wiegreffe and Ana Marasovi\u0107. 2021. Teach me to explain: A review of datasets for explainable NLP. NeurIPS.",
                "Sarah Wiegreffe, Ana Marasovi\u0107, and Noah A. Smith. 2021. Measuring association between labels and free-text rationales. EMNLP.",
                "Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022a. PromptChainer: Chaining large language model prompts through visual programming. CHI Extended Abstracts.",
                "Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022b. AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. CHI.",
                "Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. 2020. Neural execution engines: Learning to execute subroutines. NeurIPS.",
                "Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, and Xiang Ren. 2021. Refining language models with compositional explanations. NeurIPS.",
                "Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot in-context learning. arXiv preprint arXiv:2205.03401.",
                "Yordan Yordanov, Vid Kocijan, Thomas Lukasiewicz, and Oana-Maria Camburu. 2021. Few-shot out-of-domain transfer learning of natural language explanations. arXiv preprint arXiv:2112.06204.",
                "Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using \u201cannotator rationales\u201d to improve machine learning for text categorization. NAACL.",
                "Wojciech Zaremba and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.",
                "Eric Zelikman, Yuhuai Wu, and Noah D. Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465.",
                "Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. ICML.",
                "Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong, and Jian Tang. 2020. Towards interpretable natural language understanding with explanations as latent variables. NeurIPS."
            ],
            "abstract": "We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.",
            "date": 2021,
            "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
        },
        "topic": "in-context learning"
    },
    {
        "source_paper": {
            "arxiv_id": "2305.00447",
            "isAPA": true,
            "related work": "LMs for Recommendation.There have been several attempts to integrate language models (LMs) with recommendation systems.\nDespite the incorporation of LMs(Li et al.,2023b; Geng et al.,2022), some attempts persist in utilizing traditional user/item IDs to represent users/items.\nThereby, they disregard the semantic understanding capabilities of LMs, such as reviews, which other work has incorporated the language information as part of the users/items embedding(Hou et al.,2022).\nIn addition, other methods either utilize an undisclosed model that already possesses preliminary recommendation capabilities(Cui et al.,2022). or employ small models to train on large-scale downstream task data(Zhang and Wang,2023).\nMoreover, the aforementioned models are also limited to small models, while this paper is orthogonal about how to adapt large language models to recommendation tasks.\nIn recommendation systems, there is currently little research on applying LLMs in recommendation scenarios.\nThose works utilize the interaction ability of GPT3.5 series models and apply In-context Learning(Gao et al.,2023; Wang et al.,2023a).\nIn detail, Chat-Rec(Gao et al.,2023)endeavors to harness the interaction capabilities of ChatGPT and link the ChatGPT with traditional recommendation models (e.g. 
MF(Koren et al.,2009), LightGCN(He et al.,2020)) to formulate a conversational recommendation system.\nNIR(Wang et al.,2023a)shares a similar concept with Chat-Rec, which employs conventional recommendation models to generate candidate items subjected to a three-stage multi-step prompting process for re-ranking.Sequential Recommendation.Our setup is close to the sequential recommendation, which aims to infer the user's next interaction based on users' historical interaction sequences(Fang et al.,2020; Wang et al.,2019).\nIn the early time, the Markov chain plays an important role in sequential recommendation(Rendle et al.,2010; He and McAuley,2016; Wang et al.,2015; Mahmood and Ricci,2007).\nRecently, deep learning-based methods have become mainstream.\nExtensive work using different kinds of neural network structures, like RNN(Hidasi et al.,2016; Cui et al.,2020; Donkers et al.,2017), CNN(Tang and Wang,2018; Yuan et al.,2019; Yan et al.,2019), and attention(Kang and McAuley,2018; Zhang et al.,2019; Xu et al.,2021), to model the user interaction sequences. However, limited by only using IDs to represent users and items, such work cannot fastly adapt and generalize to new scenarios. Thus, some works focus on the generalization ability of sequential recommendation models by pre-training(Ni et al.,2018; Yuan et al.,2020), data augmentation(Qiu et al.,2022,2021; Wang et al.,2022a; Xie et al.,2022), debiasing(Ding et al.,2022; Wang et al.,2022b; Zheng et al.,2021; Zhang et al.,2021), and robust optimization(Wen et al.,2022; Yang et al.,2023).\nHowever, they ignore the strong generalization ability of existing LLMs, leading to inadequate exploration.",
            "reference": [
                "Qingyao Ai, Ting Bai, Zhao Cao, Yi Chang, Jiawei Chen, Zhumin Chen, Zhiyong Cheng, Shoubin Dong, Zhicheng Dou, Fuli Feng, et al.2023. Information Retrieval Meets Large Language Models: A Strategic Report from Chinese IR Community. arXiv preprint arXiv:2307.09751(2023).",
                "Greg Brockman, Mira Murati, Peter Welinder, and OpenAI. 2022. OpenAI API.",
                "Tom Brown, Benjamin Mann, Nick Ryder, et al.2020. Language models are few-shot learners. NeurIPS33 (2020), 1877-1901.",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al.2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311(2022).",
                "Qiang Cui, Shu Wu, Qiang Liu, Wen Zhong, and Liang Wang. 2020. MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation. IEEE Trans. Knowl. Data Eng.32, 2 (2020), 317-331.",
                "Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems. arXiv preprint arXiv:2205.08084(2022).",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InNAACL. Association for Computational Linguistics, 4171-4186.",
                "Sihao Ding, Fuli Feng, Xiangnan He, Jinqiu Jin, Wenjie Wang, Yong Liao, and Yongdong Zhang. 2022. Interpolative Distillation for Unifying Biased and Debiased Recommendation. InSIGIR '22, July 11 - 15, 2022. ACM, 40-49.",
                "Tim Donkers, Benedikt Loepp, and J\u00fcrgen Ziegler. 2017. Sequential user-based recurrent neural network recommendations. InProceedings of the eleventh ACM conference on recommender systems. 152-160.",
                "Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, et al.2023. PaLM-E: An Embodied Multimodal Language Model. abs/2303.03378 (2023). arXiv:2303.03378",
                "Hui Fang, Danning Zhang, Yiheng Shu, and Guibing Guo. 2020. Deep Learning for Sequential Recommendation: Algorithms, Influential Factors, and Evaluations. ACM Trans. Inf. Syst.39, 1 (2020), 10:1-10:42.",
                "F.Maxwell and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context.. InACM Transactions on Interactive Intelligent Systems (TiiS).",
                "Yunfan Gao et al.2023. Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System. arXiv preprint:2303.14524(2023).",
                "Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). InProceedings of the 16th ACM Conference on Recommender Systems. 299-315.",
                "Ruining He and Julian McAuley. 2016. Fusing similarity models with markov chains for sparse sequential recommendation. In2016 IEEE 16th international conference on data mining (ICDM). IEEE, 191-200.",
                "Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. InSIGIR'20. 639-648.",
                "Bal\u00e1zs Hidasi et al.2016. Session-based Recommendations with Recurrent Neural Networks. InICLR.",
                "Jordan Hoffmann et al.2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556(2022).",
                "Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards Universal Sequence Representation Learning for Recommender Systems. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585-593.",
                "Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, et al.2019. Parameter-Efficient Transfer Learning for NLP. InProceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Vol. 97. 2790-2799.",
                "Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.",
                "Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, et al.2022. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. arXiv preprint arXiv:2212.12017(2022).",
                "Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2023. Exploring the Benefits of Training Expert Language Models over Instruction Tuning. (2023). arXiv:2302.03202",
                "Vitor Jeronymo, Luiz Henrique Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto de Alencar Lotufo, Jakub Zavrel, and Rodrigo Frassetto Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. abs/2301.01820 (2023). arXiv:2301.01820",
                "Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. InferFix: End-to-End Program Repair with LLMs. (2023). arXiv:2303.07263",
                "Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. InICDM. 197-206.",
                "Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer42, 8 (2009), 30-37.",
                "Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP,7-11 November, 2021. Association for Computational Linguistics, 3045-3059.",
                "Lei Li, Yongfeng Zhang, and Li Chen. 2023b. Personalized prompt learning for explainable recommendation. ACM Transactions on Information Systems41, 4 (2023), 1-26.",
                "Ruyu Li, Wenhao Deng, Yu Cheng, Zheng Yuan, Jiaqi Zhang, and Fajie Yuan. 2023a. Exploring the Upper Limits of Text-Based Collaborative Filtering Using Large Language Models: Discoveries and Insights. arXiv preprint arXiv:2305.11700(2023).",
                "Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. InACL/IJCNLP 2021, August 1-6, 2021. Association for Computational Linguistics, 4582-4597.",
                "Percy Liang, Rishi Bommasani, Tony Lee, et al.2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110(2022).",
                "Xi Victoria Lin, Todor Mihaylov, et al.2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668(2021).",
                "Tariq Mahmood and Francesco Ricci. 2007. Learning and adaptivity in interactive recommender systems. InProceedings of the ninth international conference on Electronic commerce. 75-84.",
                "Yabo Ni, Dan Ou, Shichen Liu, et al.2018. Perceive Your Users in Depth: Learning Universal User Representations from Multiple E-commerce Tasks. InKDD 2018, August 19-23, 2018. ACM, 596-605.",
                "Long Ouyang et al.2022. Training language models to follow instructions with human feedback. NeurIPS35 (2022), 27730-27744.",
                "Baolin Peng et al.2023. Instruction Tuning with GPT-4. (2023). arXiv:2304.03277",
                "Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2021. Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation. (2021). arXiv:2110.05730",
                "Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2022. Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation. InWSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining, February 21 - 25, 2022. ACM, 813-823.",
                "Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. InProceedings of the 19th international conference on World wide web. 811-820.",
                "Victor Sanh, Albert Webson, Colin Raffel, et al.2021. Multitask prompted training enables zero-shot task generalization. arXiv:2110.08207(2021).",
                "Timo Schick, Jane Dwivedi-Yu, Roberto Dess\u00ec, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761(2023).",
                "Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. 2022. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. InCoRL 2022, 14-18 December 2022(Proceedings of Machine Learning Research, Vol. 205). PMLR, 492-504.",
                "Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. InWSDM. 565-573.",
                "Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model.",
                "Hugo Touvron et al.2023. LlaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971(2023).",
                "Lei Wang et al.2023a. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv preprint arXiv:2304.03153(2023).",
                "Liang Wang, Nan Yang, and Furu Wei. 2023b. Query2doc: Query Expansion with Large Language Models. (2023). arXiv:2303.07678",
                "Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2015. Learning hierarchical representation model for nextbasket recommendation. InProceedings of the 38th International ACM SIGIR conference on Research and Development in Information Retrieval. 403-412.",
                "Shoujin Wang, Liang Hu, Yan Wang, Longbing Cao, Quan Z. Sheng, and Mehmet A. Orgun. 2019. Sequential Recommender Systems: Challenges, Progress and Prospects. InIJCAI 2019, Macao, China, August 10-16, 2019. ijcai.org, 6332-6338.",
                "Ziyang Wang, Huoyu Liu, Wei Wei, Yue Hu, Xian-Ling Mao, Shaojian He, Rui Fang, and Dangyang Chen. 2022a. Multi-level Contrastive Learning Framework for Sequential Recommendation. (2022). arXiv:2208.13007",
                "Zhenlei Wang, Shiqi Shen, Zhipeng Wang, Bo Chen, Xu Chen, and Ji-Rong Wen. 2022b. Unbiased Sequential Recommendation with Latent Confounders. InWWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022. ACM, 2195-2204.",
                "Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652(2021).",
                "Jason Wei, Yi Tay, Rishi Bommasani, et al.2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682(2022).",
                "Hongyi Wen, Xinyang Yi, Tiansheng Yao, Jiaxi Tang, Lichan Hong, and Ed H. Chi. 2022. Distributionally-robust Recommendations for Improving Worst-case User Experience. InWWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022. ACM, 3606-3610.",
                "Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, and Jonathan Tompson. 2022. Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models. abs/2211.11736 (2022). arXiv:2211.11736",
                "Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive Learning for Sequential Recommendation. In38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022. IEEE, 1259-1273.",
                "Chengfeng Xu, Jian Feng, Pengpeng Zhao, Fuzhen Zhuang, Deqing Wang, Yanchi Liu, and Victor S. Sheng. 2021. Long- and short-term self-attention network for sequential recommendation. Neurocomputing423 (2021), 580-589.",
                "An Yan, Shuo Cheng, Wang-Cheng Kang, Mengting Wan, and Julian J. McAuley. 2019. CosRec: 2D Convolutional Neural Networks for Sequential Recommendation. InCIKM 2019, November 3-7, 2019. ACM, 2173-2176.",
                "Zhengyi Yang et al.2023. A Generic Learning Framework for Sequential Recommendation with Distribution Shifts. (2023).",
                "Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation. InSIGIR 2020, July 25-30, 2020. ACM, 1469-1478.",
                "Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, and Xiangnan He. 2019. A Simple Convolutional Generative Network for Next Item Recommendation. InWSDM 2019, February 11-15, 2019. ACM, 582-590.",
                "Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, Hsin-Hsi Chen, Wei-Jou (Edward) Duh, Hen-Hsen Huang, Makoto P. Kato, Josiane Mothe, and Barbara Poblete (Eds.). ACM, 2639-2649. https://doi.org/10.1145/3539618.3591932",
                "Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023a. Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation. InProceedings of the 17th ACM Conference on Recommender Systems(RecSys '23). Association for Computing Machinery.",
                "Tingting Zhang, Pengpeng Zhao, Yanchi Liu, et al.2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. InProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, August 10-16, 2019. ijcai.org, 4320-4326.",
                "Yang Zhang et al.2023b. Reformulating CTR Prediction: Learning Invariant Feature Interactions for Recommendation. (2023).",
                "Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong Zhang. 2021. Causal Intervention for Leveraging Popularity Bias in Recommendation. InSIGIR '21, July 11-15, 2021. ACM, 11-20.",
                "Zizhuo Zhang and Bang Wang. 2023. Prompt Learning for News Recommendation. arXiv preprint arXiv:2304.05263(2023).",
                "Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRRabs/2303.18223 (2023). https://doi.org/10.48550/arXiv.2303.18223arXiv:2303.18223",
                "Yu Zheng et al.2021. Disentangling User Interest and Conformity for Recommendation with Causal Embedding. InWWW '21. 2980-2991.",
                "Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. 2005. Improving Recommendation Lists through Topic Diversification. InProceedings of the 14th International Conference on World Wide Web(WWW '05). Association for Computing Machinery, 22-32."
            ],
            "abstract": "Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains, thereby prompting researchers to explore their potential for use in recommendation systems. Initial attempts have leveraged the exceptional capabilities of LLMs, such as rich knowledge and strong generalization through In-context Learning, which involves phrasing the recommendation task as prompts. Nevertheless, the performance of LLMs in recommendation tasks remains suboptimal due to a substantial disparity between the training tasks for LLMs and recommendation tasks, as well as inadequate recommendation data during pre-training. To bridge the gap, we consider building a Large Recommendation Language Model by tunning LLMs with recommendation data. To this end, we propose an efficient and effective Tuning framework for Aligning LLMs with Recommendation, namely TALLRec. We have demonstrated that the proposed TALLRec framework can significantly enhance the recommendation capabilities of LLMs in the movie and book domains, even with a limited dataset of fewer than 100 samples. Additionally, the proposed framework is highly efficient and can be executed on a single RTX 3090 with LLaMA-7B. Furthermore, the fine-tuned LLM exhibits robust cross-domain generalization. Our code and data are available atthis https URL.",
            "date": 2021,
            "title": "TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation"
        },
        "topic": "LLMs for Recommendation"
    },
    {
        "source_paper": {
            "arxiv_id": "2305.16617",
            "isAPA": true,
            "related work": "Large language models. LLMs Radford et al. (2019); Brown et al. (2020); Chowdhery et al. (2022); Zhang et al. (2022); OpenAI (2022) have revolutionized the field of natural language processing by offering several advantages over previous pre-trained models Devlin et al. (2018); Liu et al. (2019); Lan et al. (2019), including a better characterization of complex patterns and dependencies in the text, and the appealing in-context learning ability for solving downstream tasks with minimal examples. Representative models such as GPT-3 Brown et al. (2020), PaLM Chowdhery et al. (2022), and ChatGPT OpenAI (2022) have showcased their remarkable ability to generate text with high coherence, fluency, and semantic relevance. They can even effectively address complex inquiries related to science, mathematics, history, current events, and social trends. Therefore, it is increasingly important to effectively regulate the use of LLMs to prevent significant social issues.\nLLM-generated text detection. Previous methods can be broadly categorized into two groups. The first group of methods performs detection in a zero-shot manner Solaiman et al. (2019); Gehrmann et al. (2019a); Mitchell et al. (2023); Yang et al. (2023), but they require access to the source model that generates the texts to derive quantities like output logits or losses for detection. For instance, Solaiman et al. (2019) suggest that a higher log probability for each token indicates that the text will likely be machine-generated. When the output logits/losses of the source model are unavailable, these methods rely on a proxy model for detection. However, there is often a substantial gap between the proxy and source models from which the text is generated. Another group of methods trains DNN-based classifiers on collected human-written and machine-generated texts for detection Guo et al. (2023); Uchendu et al. (2020); OpenAI (2023b). 
However, such detectors are data-hungry and may exhibit poor generalization ability when facing domain shift Bakhtin et al. (2019); Uchendu et al. (2020). Furthermore, training DNN-based classifiers is susceptible to backdoor attacks Qi et al. (2021) and adversarial attacks He et al. (2023). Besides, both He et al. (2023) and Li et al. (2023) develop benchmarks for evaluating existing detection methods and call for more robust detection methods.",
            "reference": [
                "Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. 2019.Real or fake? learning to discriminate machine from human generated text.arXiv preprint arXiv:1906.03351.",
                "Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022.Gpt-neox-20b: An open-source autoregressive language model.arXiv preprint arXiv:2204.06745.",
                "Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021.GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.If you use this software, please cite it using these metadata.",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020.Language models are few-shot learners.Advances in neural information processing systems, 33:1877-1901.",
                "Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022.Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311.",
                "Zhijie Deng and Jun Zhu. 2023.Bayesadapter: Being bayesian, inexpensively and reliably, via bayesian fine-tuning.In Asian Conference on Machine Learning, pages 280-295. PMLR.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805.",
                "Angela Fan, Mike Lewis, and Yann Dauphin. 2018.Hierarchical neural story generation.arXiv preprint arXiv:1805.04833.",
                "Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017.Deep bayesian active learning with image data.In International conference on machine learning, pages 1183-1192. PMLR.",
                "Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019a.GLTR: Statistical detection and visualization of generated text.In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111-116, Florence, Italy. Association for Computational Linguistics.",
                "Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. 2019b.Gltr: Statistical detection and visualization of generated text.In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111-116.",
                "Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023.How close is chatgpt to human experts? comparison corpus, evaluation, and detection.",
                "Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024.Spotting llms with binoculars: Zero-shot detection of machine-generated text.arXiv preprint arXiv:2401.12070.",
                "Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. 2023.MGTBench: Benchmarking Machine-Generated Text Detection.CoRR abs/2303.14822.",
                "Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020.Automatic detection of generated text is easiest when humans are fooled.In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808-1822.",
                "Diederik P Kingma and Jimmy Ba. 2014.Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980.",
                "Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017.Simple and scalable predictive uncertainty estimation using deep ensembles.Advances in neural information processing systems, 30.",
                "Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019.Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942.",
                "Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. 2023.Deepfake text detection in the wild.arXiv preprint arXiv:2305.13242.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692.",
                "Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. 2019.A simple baseline for bayesian uncertainty in deep learning.Advances in neural information processing systems, 32.",
                "Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023.Detectgpt: Zero-shot machine-generated text detection using probability curvature.",
                "Salman Mohamadi and Hamidreza Amindavar. 2020.Deep bayesian active learning, a brief survey on recent advances.arXiv preprint arXiv:2012.08044.",
                "Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018.Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.arXiv preprint arXiv:1808.08745.",
                "OpenAI. 2022.Introducing chatgpt.https://openai.com/blog/chatgpt, Last accessed on 2023-05-09.",
                "OpenAI. 2023a.Gpt-4.https://openai.com/research/gpt-4, Last accessed on 2023-05-09.",
                "OpenAI. 2023b.New ai classifier for indicating ai-written text.Https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.",
                "Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021.Hidden killer: Invisible textual backdoor attacks with syntactic trigger.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443-453, Online. Association for Computational Linguistics.",
                "Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9.",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020.Exploring the limits of transfer learning with a unified text-to-text transformer.The Journal of Machine Learning Research, 21(1):5485-5551.",
                "Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250.",
                "Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023.Can ai-generated text be reliably detected?arXiv preprint arXiv:2303.11156.",
                "Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012.Practical bayesian optimization of machine learning algorithms.Advances in neural information processing systems, 25.",
                "Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. 2019.Release strategies and the social impacts of language models.arXiv preprint arXiv:1908.09203.",
                "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971.",
                "Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. 2020.Authorship attribution for neural text generation.In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384-8395, Online. Association for Computational Linguistics.",
                "Saranya Venkatraman, Adaku Uchendu, and Dongwon Lee. 2023.Gpt-who: An information density-based machine-generated text detector.arXiv preprint arXiv:2310.06202.",
                "Ben Wang and Aran Komatsuzaki. 2021.GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.https://github.com/kingoflolz/mesh-transformer-jax.",
                "Christopher Williams and Carl Rasmussen. 1995.Gaussian processes for regression.Advances in neural information processing systems, 8.",
                "Xianjun Yang, Wei Cheng, Linda Petzold, William Yang Wang, and Haifeng Chen. 2023.Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text.arXiv preprint arXiv:2305.17359.",
                "Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022.Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068.",
                "Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019.Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675."
            ],
            "abstract": "The detection of machine-generated text, especially from large language models (LLMs), is crucial in preventing serious social problems resulting from their misuse. Some methods train dedicated detectors on specific datasets but fall short in generalizing to unseen test data, while other zero-shot ones often yield suboptimal performance. Although the recent DetectGPT has shown promising detection performance, it suffers from significant inefficiency issues, as detecting a single candidate requires querying the source LLM with hundreds of its perturbations. This paper aims to bridge this gap. Concretely, we propose to incorporate a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency. Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget. Notably, when detecting the text generated by LLaMA family models, our method with just 2 or 3 queries can outperform DetectGPT with 200 queries.",
            "date": 2021,
            "title": "Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model"
        },
        "topic": "LLM-Generated Texts Detection"
    },
    {
        "source_paper": {
            "arxiv_id": "2303.14524",
            "isAPA": false,
            "related work": "2.1 Augmented Language Models\nAugmented Language Models (ALMs) are a new research direction that aims to overcome the limitations of traditional Language Models (LMs)[5,1,4]by equipping them with reasoning skills and the ability to use external tools, which has served millions of users, such as the coding assistant Copilot[2], or more recently ChatGPT based on GPT3.5 and GPT4. Reasoning is defined as breaking down complex tasks into simpler subtasks that the LM can solve more easily by itself or with the help of tools[9,15,13], while tools are external modules that the LM can call to augment its context. ALMs can use these augmentations separately or in combination to expand their context processing ability and outperform most regular LMs on several benchmarks. ALMs can learn to reason, use tools, and even act, while still performing standard natural language tasks. This new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues. By jointly discussing reasoning and tools, and tools and actions, ALMs can solve a broad range of complex tasks without heuristics, thus offering better generalization capabilities. 2.2 NLP for Recommendation\nThe field of recommender systems has had a long-standing relationship with natural language processing (NLP) techniques, especially when pre-trained language models (PLMs) comes out, which improve the performance of recommender systems and explainability[3,10,11]. PLMs are language models that have learned universal representations on large corpora in a self-supervised manner, and the learned representations can be beneficial to a series of downstream NLP tasks. In the recommendation domain, PLMs can help alleviate the data sparsity issue, which is a major performance bottleneck of current deep recommendation models. 
By extracting and transferring knowledge from pre-trained models learned by different PLM-related training paradigms, researchers aim to improve recommendation performance from various perspectives, such as generality, sparsity, efficiency, and effectiveness. In this vibrant field, there are open issues and future research directions that need to be explored, including the connection between PLM-based training paradigms and different input data types for recommender systems. Overall, adapting language modelling paradigms for recommendation is seen as a promising direction in both academia and industry.\n2.3 Cold-start Recommendation\nCold start recommendation is a problem that arises in recommender systems when users or items have no prior interaction records with the system. This means that there is no data available for the system to make personalized recommendations. To address this issue, solutions have been proposed that either learn to model content features [16] or transfer representations from auxiliary domains [24,26]. The former approach focuses on learning about the characteristics of the items or users based on their content, such as text, images, or metadata. The latter approach involves leveraging information from other domains, such as social networks or product descriptions, to infer user preferences. Additionally, there are approaches that aim to quickly adapt to new domains instead of only providing recommendations for cold-start cases. A good generalization ability of recommendation models on startup cases is essential to ensure a better user experience and increased engagement. In our work, we use the reasoning and background knowledge of LLMs to enhance the performance of recommender systems for cold start scenarios.",
            "reference": [
                "Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877-1901 (2020)",
                "Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)",
                "Chen, X., Chen, H., Xu, H., Zhang, Y., Cao, Y., Qin, Z., Zha, H.: Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 765-774 (2019)",
                "Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)",
                "Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)",
                "Fu, Yao; Peng, H., Khot, T.: How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu's Notion (Dec 2022),https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1",
                "Geng, S., Liu, S., Fu, Z., Ge, Y., Zhang, Y.: Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In: Proceedings of the 16th ACM Conference on Recommender Systems. pp. 299-315 (2022)",
                "Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., Hajishirzi, H.: Unifiedqa: Crossing format boundaries with a single qa system (2020)",
                "LeCun, Y.: A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review62(2022)",
                "Li, L., Zhang, Y., Chen, L.: Generate neural template explanations for recommendation. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. pp. 755-764 (2020)",
                "Li, L., Zhang, Y., Chen, L.: Personalized transformer for explainable recommendation. arXiv preprint arXiv:2105.11601 (2021)",
                "Liu, P., Zhang, L., Gulla, J.A.: Pre-train, prompt and recommendation: A comprehensive survey of language modelling paradigm adaptations in recommender systems. arXiv preprint arXiv:2302.03735 (2023)",
                "Parisi, A., Zhao, Y., Fiedel, N.: Talm: Tool augmented language models (2022)",
                "Petroni, F., Rockt\u00e4schel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.H., Riedel, S.: Language models as knowledge bases? arXiv preprint arXiv:1909.01066 (2019)",
                "Schick, T., Dwivedi-Yu, J., Dess\u00ec, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools (2023)",
                "Shi, S., Zhang, M., Yu, X., Zhang, Y., Hao, B., Liu, Y., Ma, S.: Adaptive feature sampling for recommendation with missing content feature values. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 1451-1460 (2019)",
                "Sun, C., Liu, H., Liu, M., Ren, Z., Gan, T., Nie, L.: Lara: Attribute-to-feature adversarial learning for new-item recommendation. In: Proceedings of the 13th international conference on web search and data mining. pp. 582-590 (2020)",
                "Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)",
                "Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. ArXivabs/2109.01652(2021)",
                "Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., hsin Chi, E.H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., Fedus, W.: Emergent abilities of large language models. ArXivabs/2206.07682(2022)",
                "Wei, J., Wang, X., Schuurmans, D., Bosma, M., hsin Chi, E.H., Le, Q., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. ArXivabs/2201.11903(2022)",
                "Xu, Y., Zhu, C., Xu, R., Liu, Y., Zeng, M., Huang, X.: Fusing context into knowledge graph for commonsense question answering (2021)",
                "Yu, W., Iter, D., Wang, S., Xu, Y., Ju, M., Sanyal, S., Zhu, C., Zeng, M., Jiang, M.: Generate rather than retrieve: Large language models are strong context generators (2023)",
                "Yuan, F., Zhang, G., Karatzoglou, A., Jose, J., Kong, B., Li, Y.: One person, one model, one world: Learning continual user representation without forgetting. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 696-705 (2021)",
                "Zhang, Y., Ding, H., Shui, Z., Ma, Y., Zou, J., Deoras, A., Wang, H.: Language models as recommender systems: Evaluations and limitations (2021)",
                "Zhu, F., Wang, Y., Chen, C., Zhou, J., Li, L., Liu, G.: Cross-domain recommendation: challenges, progress, and prospects. arXiv preprint arXiv:2103.01696 (2021)"
            ],
            "abstract": "Large language models (LLMs) have demonstrated their significant potential to be applied for addressing various application tasks. However, traditional recommender systems continue to face great challenges such as poor interactivity and explainability, which actually also hinder their broad deployment in real-world systems. To address these limitations, this paper proposes a novel paradigm called Chat-Rec (ChatGPT Augmented Recommender System) that innovatively augments LLMs for building conversational recommender systems by converting user profiles and historical interactions into prompts. Chat-Rec is demonstrated to be effective in learning user preferences and establishing connections between users and products through in-context learning, which also makes the recommendation process more interactive and explainable. What's more, within the Chat-Rec framework, user's preferences can transfer to different products for cross-domain recommendations, and prompt-based injection of information into LLMs can also handle the cold-start scenarios with new items. In our experiments, Chat-Rec effectively improve the results of top-k recommendations and performs better in zero-shot rating prediction task. Chat-Rec offers a novel approach to improving recommender systems and presents new practical scenarios for the implementation of AIGC (AI generated content) in recommender system studies.",
            "date": 2021,
            "title": "Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System"
        },
        "topic": "Explainability for LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2211.09102",
            "isAPA": true,
            "related work": "Inspired by the findings of Radford et al. (2019);Brown et al. (2020), prompting strategies for LLMshave become a topic of intense interest, generatingwork across a broad spectrum of methods and applications (Liu et al., 2021). A basic distinction canbe made between hard (explicit text) promptingsuch as we use, and soft prompting that seeks tolearn embeddings (Lester et al., 2021), activations(Li and Liang, 2021; Hambardzumyan et al., 2021),or attention weights (Liu et al., 2022a) that condition the model to perform a desired task. The latterapproach is more expressive and more efficient atinference time, but performance can be sensitiveto initialization (Hou et al., 2022), and some techniques require modifications to the model.\nHard prompts have the advantage of being easyto interpret and modify. Work in this area includes tools to facilitate development of handcrafted prompts (Strobelt et al., 2022; Bach et al.,2022); algorithms to find optimal prompts throughgradient-guided search (Shin et al., 2020) or exhaustive search through labels (Schick and Sch\u00fctze,2021) or both labels and templates (Gao et al.,2021); as well as studies on the effect of example order (Kumar and Talukdar, 2021; Lu et al.,2022). Hard prompts have also been used to analyze model capabilities (Garg et al., 2022; Li et al.,2022a), the role of data (Singh et al., 2022), and thenature of prompting itself (Min et al., 2022; Weiet al., 2022).\nWith few exceptions, e.g. (Li et al., 2022b;Liu et al., 2022b; Valvoda et al., 2022), early approaches to hard prompting tended to condition onthe task rather than the specific input. Our kNNapproach for conditioning on the input was pioneered by Liu et al. (2022b), who used RoBERTaembeddings to identify relevant GPT-3 prompts forsentiment, table-to-text, and QA tasks. 
They found that kNN works better than a random-selection baseline, and that the advantage grows as the size of the (domain-controlled) example pool increases.\nWork on prompting LLMs for MT began with the GPT-3 and PaLM papers (Brown et al., 2020; Chowdhery et al., 2022), which adopted similar approaches, comparing 0, 1, and n-shot random selection of independent sentence pairs from WMT training corpora, and testing on older French, German, and Romanian WMT test sets traditionally used in ML, augmented in PaLM with French\u2192German and Kazakh. For both models, performance increased with number of shots, and n-shot BLEU scores were found to be competitive with previous unsupervised SOTA, and in some settings\u2014particularly into English\u2014supervised SOTA as well.\nIn other early MT work, Reynolds and McDonell (2021) experimented with prompt templates for GPT-3, and found that 0-shot prompts with carefully-chosen templates can outperform n-shot prompts with sub-optimal templates. Garcia and Firat (2022) explored using prompts with mT5 (Xue et al., 2021) to control output attributes such as formality, and also examined the effect of using prompt-like natural-language tags during fine-tuning. Patel et al. (2022) proposed autoregressive prompting: concatenating only the first predicted word to a prompt and output prefix at each step.",
            "reference": [
                "Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022. In-context examples selection for machine translation. arXiv preprint arXiv:2212.02437.",
                "Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ond\u0159ej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina Espa\u00f1a-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). InProceedings of the Sixth Conference on Machine Translation, pages 1-88, Online. Association for Computational Linguistics.",
                "Anonymous. 2023. Does gpt-3 produces less literal translations? Anonymous preprint under review.",
                "Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279.",
                "Rachel Bawden and Fran\u00e7ois Yvon. 2023. Investigating the translation performance of a large multilingual language model: the case of bloom. arXiv preprint arXiv:2303.01911.",
                "Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in palm's translation capability. arXiv preprint arXiv:2305.10266.",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.",
                "Hyung Won Chung, Thibault F\u00e9vry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. arXiv preprint:2010.12821.",
                "Marta R. Costa-juss\u00e0, Eric Smith, Christophe Ropers, Daniel Licht, Javier Ferrando, and Carlos Escolano. 2022. Toxicity in multilingual machine translation at scale. arXiv preprint arXiv:2210.03070.",
                "Daniel Deutsch, Rotem Dror, and Dan Roth. 2021. A statistical analysis of summarization evaluation metrics using resampling methods. Transactions of the Association for Computational Linguistics, 9:1132-1146.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.",
                "Markus Freitag, Isaac Caswell, and Scott Roy. 2019. APE at scale and its implications on MT evaluation biases. InProceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 34-44, Florence, Italy. Association for Computational Linguistics.",
                "Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021a. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460-1474.",
                "Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ond\u0159ej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. InProceedings of the Sixth Conference on Machine Translation, pages 733-774, Online. Association for Computational Linguistics.",
                "Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816-3830.",
                "Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. arXiv preprint arXiv:2302.01398.",
                "Xavier Garcia and Orhan Firat. 2022. Using natural language prompts for machine translation. arXiv preprint arXiv:2202.11822.",
                "Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. 2022. What can transformers learn in-context? a case study of simple function classes. arXiv preprint arXiv:2208.01066.",
                "Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 3356-3369, Online. Association for Computational Linguistics.",
                "Marjan Ghazvininejad, Hila Gonen, and Luke Zettlemoyer. 2023. Dictionary-based phrase-level prompting of large language models for machine translation. arXiv preprint arXiv:2302.07856.",
                "Nuno M Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, and Andr\u00e9 FT Martins. 2023. Hallucinations in large multilingual translation models. arXiv preprint arXiv:2303.16104.",
                "Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. InInternational Conference on Machine Learning.",
                "Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, and Noah A. Smith. 2022. Whose language counts as high quality? measuring language ideologies in text data selection. arXiv preprint arXiv:2201.10474.",
                "Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921-4933, Online. Association for Computational Linguistics.",
                "Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2023. Exploring human-like translation strategy with large language models. arXiv preprint arXiv:2305.04118.",
                "Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210.",
                "Yutai Hou, Hongyuan Dong, Xinghao Wang, Bohan Li, and Wanxiang Che. 2022. Metaprompting: Learning to learn better prompts. arXiv preprint arXiv:2209.11486.",
                "Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745.",
                "Alex Jones, Isaac Caswell, Ishank Saxena, and Orhan Firat. 2023. Bilex rx: Lexical data augmentation for massively multilingual machine translation. arXiv preprint arXiv:2303.15265.",
                "Marzena Karpinska and Mohit Iyyer. 2023. Large language models effectively leverage document-level context for literary translation, but critical errors persist. arXiv preprint arXiv:2304.03245.",
                "Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.",
                "Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. InProceedings of the Sixth Conference on Machine Translation, pages 478-494, Online. Association for Computational Linguistics.",
                "Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. InProceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388-395, Barcelona, Spain. Association for Computational Linguistics.",
                "Sawan Kumar and Partha Talukdar. 2021. Reordering examples helps during priming-based few-shot learning. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4507-4518.",
                "Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045-3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.",
                "Jiaoda Li, Ryan Cotterell, and Mrinmaya Sachan. 2022a. Probing via prompting. arXiv preprint arXiv:2207.01736.",
                "Junyi Li, Tianyi Tang, Jian-Yun Nie, Ji-Rong Wen, and Xin Zhao. 2022b. Learning to transfer prompts for text generation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3506-3518, Seattle, United States. Association for Computational Linguistics.",
                "Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582-4597, Online. Association for Computational Linguistics.",
                "Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. arXiv preprint arXiv:2205.05638.",
                "Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022b. What makes good in-context examples for GPT-3? InProceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100-114, Dublin, Ireland and Online. Association for Computational Linguistics.",
                "Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.",
                "Arle Lommel, Hans Uszkoreit, and Aljoscha Burchardt. 2014. Multidimensional Quality Metrics (MQM) : A Framework for Declaring and Describing Translation Quality Metrics. Tradum\u00e0tica, pages 0455-463.",
                "Hongyuan Lu, Haoyang Huang, Dongdong Zhang, Haoran Yang, Wai Lam, and Furu Wei. 2023. Chain-of-dictionary prompting elicits translation in large language models. arXiv preprint arXiv:2305.06575.",
                "Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086-8098, Dublin, Ireland. Association for Computational Linguistics.",
                "Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.",
                "Yasmin Moslem, Rejwanul Haque, and Andy Way. 2023. Adaptive machine translation with large language models. arXiv preprint arXiv:2301.13294.",
                "Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, and Chris Callison-Burch. 2022. Bidirectional language models are also few-shot learners. arXiv preprint arXiv:2209.14500.",
                "Jonathan Pilault, Xavier Garcia, Arthur Bra\u017einskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. arXiv preprint arXiv:2301.10309.",
                "Matt Post. 2018. A call for clarity in reporting BLEU scores. InProceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.",
                "Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.",
                "Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. InExtended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1-7.",
                "Timo Schick and Hinrich Sch\u00fctze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255-269.",
                "Andrea Schioppa, Xavier Garcia, and Orhan Firat. 2023. Cross-lingual supervision improves large language models pre-training. arXiv preprint arXiv:2305.11778.",
                "Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881-7892, Online. Association for Computational Linguistics.",
                "Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222-4235.",
                "Chandan Singh, John X Morris, Jyoti Aneja, Alexander M Rush, and Jianfeng Gao. 2022. Explaining patterns in data with language models via interpretable autoprompting. arXiv preprint arXiv:2210.01848.",
                "Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, Hanspeter Pfister, and Alexander M Rush. 2022. Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. IEEE transactions on visualization and computer graphics.",
                "Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, and Angela Fan. 2021. Facebook AI's WMT21 news translation task submission. InProceedings of the Sixth Conference on Machine Translation, pages 205-215, Online. Association for Computational Linguistics.",
                "Josef Valvoda, Yimai Fang, and David Vandyke. 2022. Prompting for a conversation: How to control a dialog model? arXiv preprint arXiv:2209.11068.",
                "Longyue Wang, Mu Li, Fangxu Liu, Shuming Shi, Zhaopeng Tu, Xing Wang, Shuangzhi Wu, Jiali Zeng, and Wen Zhang. 2021. Tencent translation system for the WMT21 news translation task. InProceedings of the Sixth Conference on Machine Translation, pages 216-224, Online. Association for Computational Linguistics.",
                "Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.",
                "Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483-498, Online. Association for Computational Linguistics.",
                "Xianfeng Zeng, Yijin Liu, Ernan Li, Qiu Ran, Fandong Meng, Peng Li, Jinan Xu, and Jie Zhou. 2021. WeChat neural machine translation systems for WMT21. InProceedings of the Sixth Conference on Machine Translation, pages 243-254, Online. Association for Computational Linguistics.",
                "Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.",
                "Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Jiajun Chen, Lei Li, and Shujian Huang. 2023. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675."
            ],
            "abstract": "Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work.",
            "date": 2021,
            "title": "Prompting PaLM for Translation: Assessing Strategies and Performance"
        },
        "topic": "Evaluation of LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2308.03688",
            "isAPA": true,
            "related work": "Evaluation of LLMs. The general capabilities of self-supervised (Liu et al., 2021) LLMs (Brownet al., 2020; Chowdhery et al., 2022; Zhang et al., 2022; Scao et al., 2022; Zeng et al., 2022; Touvronet al., 2023), especially those chat-aligned ones (Ouyang et al., 2022; Anthropic, 2023a; OpenAI,2023), have refreshed people's impression on deep learning systems and significantly transcendedthe conventional scope of NLP evaluation. It thus makes the evaluation of LLMs an urgent andchallenging problem. Compared to previous efforts focusing on a subset of specified tasks (Wanget al., 2019; Wang et al.; Gehrmann et al., 2021), an increasing number of benchmarks are includingbroader spectra of tasks and datasets (Hendrycks et al., 2021b; Liang et al., 2022; Srivastava et al.,2023) in the evaluation. However, most of them are still limited to traditional tasks and thus fail toevaluate LLMs' open-ended generation, multi-turn interaction, and ability to act as agents.\nLLM-as-Agent. In pre-LLM era, text game environments such as TextWorld (C\u00f4t\u00e9 et al., 2019),Jericho (Hausknecht et al., 2020), and LIGHT (Urbanek et al., 2019) are dominant in language agentstudy which bases on BERT (Devlin et al., 2019) and reinforcement learning. With the advent ofLLMs, the study of LLM agents begins to thrive (Huang et al., 2022), especially after Chain-ofThought (Wei et al., 2022b) came out. ReAct (Yao et al., 2023b) is a pioneer work to combine CoTreasoning and actions in agent tasks. Later, a bunch of advanced reasoning strategies (Kim et al.,2023; Shinn et al., 2023; Wang et al., 2023d; Liu et al., 2023; Yao et al., 2023a; Gu et al., 2023) andapplications (Park et al., 2023; Richards, 2023; Nakajima, 2023; age, 2023) for LLM-as-Agent haveemerged and arouse much public interest. Nevertheless, limited datasets and models and availableon the topic, without a standard and comprehensive benchmark. 
AGENTBENCH presents the first systematic benchmark for evaluating LLM-as-Agent with a broad coverage of tasks and available LLMs. Additionally, it also initiates the idea of adopting agent tasks to measure LLM performance.\nEvaluating LLMs in Executive Environments. As LLMs become increasingly capable of real-world challenges, there is also a trend to evaluate them in executive environments rather than static datasets. Besides text games (e.g., ALFWorld (Shridhar et al., 2020b)), another main stream of works lies in code execution. APPS (Hendrycks et al., 2021a), HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) pioneer the effort to evaluate code LLMs for functional correctness instead of text similarity. The paradigm has been later widely recognized and adopted in following works (Li et al., 2022; Zheng et al., 2023; Nijkamp et al., 2023). However, few previous code evaluation frameworks consider multi-turn interactions. A concurrent work InterCode (Yang et al., 2023) releases a framework that allows evaluation of interaction between models and Bash and SQL environments, which are similar to OS and DB tasks in AGENTBENCH.",
            "reference": [
                "Agentgpt. Python. https://github.com/reworkd/AgentGPT, 2023.",
                "Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.",
                "Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.",
                "Anonymous. Knowledge base question answering as tool learning. under review, 2023.",
                "Anthropic. Introducing claude, 2023a. URLhttps://www.anthropic.com/index/introducing-claude.",
                "Anthropic. Claude 2, 2023b. URLhttps://www.anthropic.com/index/claude-2.",
                "Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.",
                "Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Jason Tsong-Li Wang (ed.),Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pp.  1247-1250. ACM, 2008. doi:10.1145/1376616.1376746. URLhttps://doi.org/10.1145/1376616.1376746.",
                "Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.",
                "Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.",
                "Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. HybridQA: A dataset of multi-hop question answering over tabular and textual data. InFindings of the Association for Computational Linguistics: EMNLP 2020, pp.  1026-1036, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.91. URLhttps://aclanthology.org/2020.findings-emnlp.91.",
                "Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.",
                "Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.",
                "Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world's first truly open instruction-tuned llm, 2023. URLhttps://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.",
                "Marc-Alexandre C\u00f4t\u00e9, Akos K\u00e1d\u00e1r, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. InComputer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pp.  41-75. Springer, 2019.",
                "Edward De Bono. Lateral thinking. New York, pp.  70, 1970.",
                "Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.",
                "Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171-4186, 2019.",
                "Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  320-335, 2022.",
                "Jack Edmonds and Richard M Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248-264, 1972.",
                "Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343-18362, 2022.",
                "LR Ford Jr and DR F\u0173lkerson. Flows in networks. 1962.",
                "Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, et al. The gem benchmark: Natural language generation, its evaluation and metrics. InProceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp.  96-120. Association for Computational Linguistics, 2021.",
                "Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April, 1, 2023.",
                "Yu Gu and Yu Su. ArcaneQA: Dynamic program induction and contextualized encoding for knowledge base question answering. InProceedings of the 29th International Conference on Computational Linguistics, pp.  1718-1731, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URLhttps://aclanthology.org/2022.coling-1.148.",
                "Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. InProceedings of the Web Conference 2021. ACM, apr 2021. doi:10.1145/3442381.3449992. URLhttps://doi.org/10.1145%2F3442381.3449992.",
                "Yu Gu, Xiang Deng, and Yu Su. Don't generate, discriminate: A proposal for grounding language models to real-world environments. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  4928-4949, Toronto, Canada, July 2023. Association for Computational Linguistics. URLhttps://aclanthology.org/2023.acl-long.270.",
                "Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre C\u00f4t\u00e9, and Xingdi Yuan. Interactive fiction games: A colossal adventure. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  7903-7910, 2020.",
                "Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021a.",
                "Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021b.",
                "Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.",
                "Amy K Hoover, Julian Togelius, Scott Lee, and Fernando de Mesentier Silva. The many ai challenges of hearthstone. KI-K\u00fcnstliche Intelligenz, 34:33-43, 2020.",
                "Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational Conference on Machine Learning, pp.  9118-9147. PMLR, 2022.",
                "Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. Search-based neural structured learning for sequential question answering. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1821-1831, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1167. URLhttps://aclanthology.org/P17-1167.",
                "Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1601-1611, 2017.",
                "Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.",
                "Heinrich K\u00fcttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rockt\u00e4schel. The nethack learning environment. Advances in Neural Information Processing Systems, 33:7671-7684, 2020.",
                "LAION. Open-assistant. https://github.com/LAION-AI/Open-Assistant, 2023.",
                "Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models, 2023.",
                "Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R\u00e9mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092-1097, 2022.",
                "Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.",
                "Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.",
                "Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. Llm+ p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.",
                "Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1):857-876, 2021.",
                "Pattie Maes. Agents that reduce work and information overload. Commun. ACM, 37:30-40, 1994.",
                "Dirk Merkel et al. Docker: lightweight linux containers for consistent development and deployment. Linux j, 239(2):2, 2014.",
                "Yohei Nakajima. Babyagi. Python. https://github. com/yoheinakajima/babyagi, 2023.",
                "Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kry\u015bci\u0144ski, Nick Schoelkopf, Riley Kong, Xiangru Tang, Murori Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, and Dragomir Radev. Fetaqa: Free-form table question answering, 2021.",
                "Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. InThe Eleventh International Conference on Learning Representations, 2023.",
                "OpenAI. Introducing chatgpt, 2022. URLhttps://openai.com/blog/chatgpt.",
                "R OpenAI. Gpt-4 technical report. arXiv, pp.  2303-08774, 2023.",
                "Philip Osborne, Heido N\u00f5mm, and Andr\u00e9 Freitas. A survey of text games for reinforcement learning informed by natural language. Transactions of the Association for Computational Linguistics, 10:873-887, 2022.",
                "Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.",
                "Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. ArXiv, abs/2304.03442, 2023.",
                "Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1470-1480, Beijing, China, July 2015. Association for Computational Linguistics. doi:10.3115/v1/P15-1142. URLhttps://aclanthology.org/P15-1142.",
                "Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio G\u00f3mez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Gim\u00e9nez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022.",
                "Toran Bruce Richards. Auto-gpt: An autonomous gpt-4 experiment, 2023.",
                "Baptiste Rozi\u00e8re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J\u00e9r\u00e9my Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.",
                "Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. InInternational Conference on Learning Representations, 2022.",
                "Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili\u0107, Daniel Hesslow, Roman Castagn\u00e9, Alexandra Sasha Luccioni, Fran\u00e7ois Yvon, Matthias Gall\u00e9, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.",
                "John R. Searle. Speech acts: An essay in the philosophy of language. Language, 46:217, 1970.",
                "Bokui Shen, Fei Xia, Chengshu Li, Roberto Mart\u00edn-Mart\u00edn, Linxi Fan, Guanzhi Wang, Claudia P\u00e9rez-D'Arpino, Shyamal Buch, Sanjana Srivastava, Lyne Tchapmi, et al. igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  7520-7527. IEEE, 2021.",
                "Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. InInternational Conference on Machine Learning, pp.  3135-3144. PMLR, 2017.",
                "Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.",
                "Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10740-10749, 2020a.",
                "Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations, 2020b.",
                "Paul Sloane. Lateral thinking puzzlers. Sterling Publishing Company, Inc., 1992.",
                "Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri\u00e0 Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023.",
                "Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Mart\u00edn-Mart\u00edn, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. InConference on Robot Learning, pp.  477-490. PMLR, 2022.",
                "Yu Su, Huan Sun, Brian M. Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. On generating characteristic-rich question sets for QA evaluation. In Jian Su, Xavier Carreras, and Kevin Duh (eds.),Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp.  562-572. The Association for Computational Linguistics, 2016. doi:10.18653/v1/d16-1054. URLhttps://doi.org/10.18653/v1/d16-1054.",
                "Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  641-651, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-1059. URLhttps://aclanthology.org/N18-1059.",
                "Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4149-4158, 2019.",
                "Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.",
                "Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: A reinforcement learning platform for android. arXiv preprint arXiv:2105.13231, 2021.",
                "Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rockt\u00e4schel, Douwe Kiela, Arthur Szlam, and Jason Weston. Learning to speak and act in a fantasy text adventure game. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  673-683, 2019.",
                "Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. InInternational Conference on Learning Representations.",
                "Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.",
                "Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data, 2023a.",
                "Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. ArXiv, abs/2305.16291, 2023b.",
                "Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023c.",
                "Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023d.",
                "Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2022a.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022b.",
                "Michael Wooldridge and Nicholas R Jennings. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115-152, 1995.",
                "Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.",
                "John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023.",
                "Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744-20757, 2022.",
                "Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023a.",
                "Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023b.",
                "Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.",
                "Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.",
                "Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568, 2023.",
                "Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017.",
                "Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyuan Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Y. Qiao, Zhaoxiang Zhang, and Jifeng Dai. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. ArXiv, abs/2305.17144, 2023."
            ],
            "abstract": "Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and OSS competitors. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Training on code and high quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at this https URL.",
            "date": 2021,
            "title": "AgentBench: Evaluating LLMs as Agents"
        },
        "topic": "LLMs-based Agents"
    },
    {
        "source_paper": {
            "arxiv_id": "2212.13138",
            "isAPA": false,
            "related work": "Over the past few years, LLMs have shown impressive performance on natural language processing (NLP) tasks[12,14,15,30,73,69,70,89,91,99]. They owe their success to scaling up the training of transformer-based models[84]. It has been shown that model performance and data-efficiency scales with model size and dataset size[37].\nLLMs are often trained using self-supervision on large scale, using general-purpose text corpi such as Wikipedia and BooksCorpus. They have demonstrated promising results across a wide range of tasks, including tasks that require specialized scientific knowledge and reasoning[29,17]. Perhaps the most interesting aspect of these LLMs is their in-context few-shot abilities, which adapt these models to diverse tasks without gradient-based parameter updates[12,43,40,89]. This allows them to rapidly generalize to unseen tasks and even exhibit apparent reasoning abilities with appropriate prompting strategies[14,47,79,91].Several studies have shown that LLMs have the capacity to act as implicit knowledge bases[35,29,79]. However, there is a significant risk of these models producing hallucinations, amplifying social biases present in their training data, and displaying deficiencies in their reasoning abilities. To examine the current limitations of LLMs and to quantify the large gap between human and LLM language capabilities, BIG-bench was introduced as a community-wide initiative to benchmark on tasks that were believed at time of publication to be beyond the capabilities of current language models[78].Recent studies, such as SciBERT[5], BioNLP[46], BioMegatron[76], BioBERT[44], PubMedBERT[25], DARE[66], ScholarBERT[31], and BioGPT[56], have demonstrated the effectiveness of using curated scientific and biomedical corpora for both discriminative and generative language modeling. These models, although promising, are typically small in scale and scope compared to LLMs such as GPT-3[12]and PaLM[14]. 
While the medical domain is challenging, specific proposals for LLMs have already included examples as varied as augmenting non-critical clinical assessments to summarisation of complex medical communications[41,75,3].The closest precedents to our work are[79], who introduced a LLM for science named Galactica, and[50], who studied the reasoning capability of LLMs in the medical question answering context. In particular,[50]used Instruct GPT-3, an instruction-tuned LLM[63], and applied chain-of-thought prompting[91]on top to improve the results on the MedQA, MedMCQA, and PubMedQA datasets.",
            "reference": [
                "Asma Ben Abacha, Eugene Agichtein, Yuval Pinter and Dina Demner-Fushman \u201cOverview of the medical question answering task at TREC 2017 LiveQA.\u201d InTREC, 2017, pp. 1-12",
                "Asma Ben Abacha, Yassine Mrabet, Mark Sharp, Travis R Goodwin, Sonya E Shooshan and Dina Demner-Fushman \u201cBridging the Gap Between Consumers' Medication Questions and Trusted Answers.\u201d InMedInfo, 2019, pp. 25-29",
                "Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim and David Sontag \u201cLarge Language Models are Zero-Shot Clinical Information Extractors\u201d InarXiv preprint arXiv:2205.12689, 2022",
                "Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang and Sudip Roy \u201cPathways: Asynchronous distributed dataflow for ML\u201d InProceedings of Machine Learning and Systems4, 2022, pp. 430-449",
                "Iz Beltagy, Kyle Lo and Arman Cohan \u201cSciBERT: A pretrained language model for scientific text\u201d InarXiv preprint arXiv:1903.10676, 2019",
                "Nancy D Berkman, Stacey L Sheridan, Katrina E Donahue, David J Halpern, Anthony Viera, Karen Crotty, Audrey Holland, Michelle Brasure, Kathleen N Lohr and Elizabeth Harden \u201cHealth literacy interventions and outcomes: an updated systematic review.\u201d InEvidence report/technology assessment, 2011, pp. 1-941",
                "Sid Black, Leo Gao, Phil Wang, Connor Leahy and Stella Biderman \u201cGPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow\u201d If you use this software, please cite it using these metadata. Zenodo, 2021 DOI:10.5281/zenodo.5297715",
                "Godfred O Boateng, Torsten B Neilands, Edward A Frongillo, Hugo R Melgar-Qui\u00f1onez and Sera L Young \u201cBest practices for developing and validating scales for health, social, and behavioral research: a primer\u201d InFrontiers in public health6 Frontiers Media SA, 2018, pp. 149",
                "Elliot Bolton, David Hall, Michihiro Yasunaga, Tony Lee, Chris Manning and Percy Liang \u201cStanford CRFM Introduces PubMedGPT 2.7B\u201d,https://hai.stanford.edu/news/stanford-crfm-introduces-pubmedgpt-27b, 2022",
                "Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut and Emma Brunskill \u201cOn the opportunities and risks of foundation models\u201d InarXiv preprint arXiv:2108.07258, 2021",
                "Rishi Bommasani, Percy Liang and Tony Lee \u201cLanguage Models are Changing AI: The Need for Holistic Evaluation\u201d,https://crfm.stanford.edu/2022/11/17/helm.html, 2022",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry and Amanda Askell \u201cLanguage models are few-shot learners\u201d InAdvances in neural information processing systems33, 2020, pp. 1877-1901",
                "Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman and Marzyeh Ghassemi \u201cEthical machine learning in healthcare\u201d InAnnual review of biomedical data science4 Annual Reviews, 2021, pp. 123-144",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton and Sebastian Gehrmann \u201cPaLM: Scaling language modeling with pathways\u201d InarXiv preprint arXiv:2204.02311, 2022",
                "Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani and Siddhartha Brahma \u201cScaling instruction-finetuned language models\u201d InarXiv preprint arXiv:2210.11416, 2022",
                "Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev and Jennimaria Palomaki \u201cTyDi QA: A benchmark for information-seeking question answering in typologically diverse languages\u201d InTransactions of the Association for Computational Linguistics8 MIT Press, 2020, pp. 454-470",
                "Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse and John Schulman \u201cTraining verifiers to solve math word problems\u201d InarXiv preprint arXiv:2110.14168, 2021",
                "Kathleen Creel and Deborah Hellman \u201cThe Algorithmic Leviathan: Arbitrariness, Fairness, and Opportunity in Algorithmic Decision-Making Systems\u201d InCanadian Journal of Philosophy Cambridge University Press, 2022, pp. 1-18",
                "Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu and Orhan Firat \u201cGlam: Efficient scaling of language models with mixture-of-experts\u201d InInternational Conference on Machine Learning, 2022, pp. 5547-5569 PMLR",
                "Nwamaka D Eneanya, L Boulware, Jennifer Tsai, Marino A Bruce, Chandra L Ford, Christina Harris, Leo S Morales, Michael J Ryan, Peter P Reese and Roland J Thorpe \u201cHealth inequities and the inappropriate use of race in nephrology\u201d InNature Reviews Nephrology18.2 Nature Publishing Group, 2022, pp. 84-94",
                "Andre Esteva, Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mottaghi, Yun Liu, Eric Topol, Jeff Dean and Richard Socher \u201cDeep learning-enabled medical computer vision\u201d InNPJ digital medicine4.1 Nature Publishing Group, 2021, pp. 1-9",
                "Steven Y Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman and Eduard Hovy \u201cCHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models\u201d InarXiv preprint arXiv:2210.04191, 2022",
                "Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H Chi and Alex Beutel \u201cCounterfactual fairness in text classification through robustness\u201d InProceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 219-226",
                "Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum\u00e9 Iii and Kate Crawford \u201cDatasheets for datasets\u201d InCommunications of the ACM64.12 ACM New York, NY, USA, 2021, pp. 86-92",
                "Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao and Hoifung Poon \u201cDomain-specific language model pretraining for biomedical natural language processing\u201d InACM Transactions on Computing for Healthcare (HEALTH)3.1 ACM New York, NY, 2021, pp. 1-23",
                "Yuxian Gu, Xu Han, Zhiyuan Liu and Minlie Huang \u201cPpt: Pre-trained prompt tuning for few-shot learning\u201d InarXiv preprint arXiv:2109.04332, 2021",
                "WHO Guidance \u201cEthics and governance of artificial intelligence for health\u201d InWorld Health Organization, 2021",
                "Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu and Maosong Sun \u201cPtr: Prompt tuning with rules for text classification\u201d In AI Open Elsevier, 2022",
                "Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song and Jacob Steinhardt \u201cMeasuring massive multitask language understanding\u201d In arXiv preprint arXiv:2009.03300, 2020",
                "Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl and Aidan Clark \u201cTraining Compute-Optimal Large Language Models\u201d In arXiv preprint arXiv:2203.15556, 2022",
                "Zhi Hong, Aswathy Ajith, Gregory Pauloski, Eamon Duede, Carl Malamud, Roger Magoulas, Kyle Chard and Ian Foster \u201cScholarBERT: Bigger is Not Always Better\u201d In arXiv preprint arXiv:2205.11342, 2022",
                "Sara Hooker \u201cMoving beyond \u201calgorithmic bias is a data problem\u201d\u201d In Patterns 2.4 Elsevier, 2021, pp. 100241",
                "Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang and Peter Szolovits \u201cWhat disease does this patient have? a large-scale open domain question answering dataset from medical exams\u201d In Applied Sciences 11.14 MDPI, 2021, pp. 6421",
                "Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen and Xinghua Lu \u201cPubMedQA: A dataset for biomedical research question answering\u201d In arXiv preprint arXiv:1909.06146, 2019",
                "Mandar Joshi, Eunsol Choi, Daniel S Weld and Luke Zettlemoyer \u201cTriviaQA: A large scale distantly supervised challenge dataset for reading comprehension\u201d In arXiv preprint arXiv:1705.03551, 2017",
                "Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma and Eli Tran-Johnson \u201cLanguage models (mostly) know what they know\u201d In arXiv preprint arXiv:2207.05221, 2022",
                "Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu and Dario Amodei \u201cScaling laws for neural language models\u201d In arXiv preprint arXiv:2001.08361, 2020",
                "Raynard S Kington, Stacey Arnesen, Wen-Ying Sylvia Chou, Susan J Curry, David Lazer and Antonia M Villarruel \u201cIdentifying credible sources of health information in social media: Principles and attributes\u201d In NAM perspectives 2021 National Academy of Medicine, 2021",
                "Jon Kleinberg and Manish Raghavan \u201cAlgorithmic monoculture and social welfare\u201d In Proceedings of the National Academy of Sciences 118.22 National Acad Sciences, 2021, pp. e2018340118",
                "Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo and Yusuke Iwasawa \u201cLarge Language Models are Zero-Shot Reasoners\u201d In arXiv preprint arXiv:2205.11916, 2022",
                "Diane M Korngiebel and Sean D Mooney \u201cConsidering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery\u201d In NPJ Digital Medicine 4.1 Nature Publishing Group, 2021, pp. 1-3",
                "Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan and Sameer Singh \u201cRethinking Explainability as a Dialogue: A Practitioner's Perspective\u201d In arXiv preprint arXiv:2202.01875, 2022",
                "Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang and Felix Hill \u201cCan language models learn from explanations in context?\u201d In arXiv preprint arXiv:2204.02329, 2022",
                "Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So and Jaewoo Kang \u201cBioBERT: a pre-trained biomedical language representation model for biomedical text mining\u201d In Bioinformatics 36.4 Oxford University Press, 2020, pp. 1234-1240",
                "Brian Lester, Rami Al-Rfou and Noah Constant \u201cThe power of scale for parameter-efficient prompt tuning\u201d In arXiv preprint arXiv:2104.08691, 2021",
                "Patrick Lewis, Myle Ott, Jingfei Du and Veselin Stoyanov \u201cPretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art\u201d In Proceedings of the 3rd Clinical Natural Language Processing Workshop, 2020, pp. 146-157",
                "Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag and Theo Gutman-Solo \u201cSolving quantitative reasoning problems with language models\u201d In arXiv preprint arXiv:2206.14858, 2022",
                "Xiang Lisa Li and Percy Liang \u201cPrefix-tuning: Optimizing continuous prompts for generation\u201d In arXiv preprint arXiv:2101.00190, 2021",
                "Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu and Ananya Kumar \u201cHolistic evaluation of language models\u201d In arXiv preprint arXiv:2211.09110, 2022",
                "Valentin Li\u00e9vin, Christoffer Egeberg Hother and Ole Winther \u201cCan large language models reason about medical questions?\u201d In arXiv preprint arXiv:2207.08143, 2022",
                "Stephanie Lin, Jacob Hilton and Owain Evans \u201cTeaching Models to Express Their Uncertainty in Words\u201d In arXiv preprint arXiv:2205.14334, 2022",
                "Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi and Graham Neubig \u201cPre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing\u201d In arXiv preprint arXiv:2107.13586, 2021",
                "Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang \u201cGPT understands, too\u201d In arXiv preprint arXiv:2103.10385, 2021",
                "Xiaoxuan Liu, Ben Glocker, Melissa M McCradden, Marzyeh Ghassemi, Alastair K Denniston and Lauren Oakden-Rayner \u201cThe medical algorithmic audit\u201d In The Lancet Digital Health Elsevier, 2022",
                "Ilya Loshchilov and Frank Hutter \u201cDecoupled weight decay regularization\u201d In arXiv preprint arXiv:1711.05101, 2017",
                "Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu \u201cBioGPT: generative pre-trained transformer for biomedical text generation and mining\u201d In Briefings in Bioinformatics 23.6 Oxford Academic, 2022",
                "Apoorva Mandavilli \u201cMedical Journals Blind to Racism as Health Crisis, Critics Say\u201d, https://www.nytimes.com/2021/06/02/health/jama-racism-bauchner.html, 2021",
                "Michael Matheny, Sonoo Thadaney Israni, Mahnoor Ahmed and Danielle Whicher \u201cArtificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril\u201d In Washington, DC: National Academy of Medicine, 2022",
                "Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji and Timnit Gebru \u201cModel cards for model reporting\u201d In Proceedings of the conference on fairness, accountability, and transparency, 2019, pp. 220-229",
                "Fabiane FR Morgado, Juliana FF Meireles, Clara M Neves, Ana Amaral and Maria EC Ferreira \u201cScale development: ten main limitations and recommendations to improve future research practices\u201d In Psicologia: Reflexao e Critica 30 SciELO Brasil, 2017",
                "Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma and David Luan \u201cShow your work: Scratchpads for intermediate computation with language models\u201d In arXiv preprint arXiv:2112.00114, 2021",
                "White House Office Science and Technology Policy \u201cThe Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People\u201d, https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf, 2022",
                "Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama and Alex Ray \u201cTraining language models to follow instructions with human feedback\u201d In arXiv preprint arXiv:2203.02155, 2022",
                "Ankit Pal, Logesh Kumar Umapathi and Malaikannan Sankarasubbu \u201cMedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering\u201d In Conference on Health, Inference, and Learning, 2022, pp. 248-260 PMLR",
                "Anusri Pampari, Preethi Raghavan, Jennifer Liang and Jian Peng \u201cemrqa: A large corpus for question answering on electronic medical records\u201d In arXiv preprint arXiv:1809.00732, 2018",
                "Yannis Papanikolaou and Andrea Pierleoni \u201cDARE: Data augmented relation extraction with gpt-2\u201d In arXiv preprint arXiv:2004.13845, 2020",
                "Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu \u201cBleu: a method for automatic evaluation of machine translation\u201d In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311-318",
                "Vinodkumar Prabhakaran, Ben Hutchinson and Margaret Mitchell \u201cPerturbation sensitivity analysis to detect unintended model biases\u201d In arXiv preprint arXiv:1910.04210, 2019",
                "Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring and Susannah Young \u201cScaling language models: Methods, analysis & insights from training gopher\u201d In arXiv preprint arXiv:2112.11446, 2021",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J Liu \u201cExploring the limits of transfer learning with a unified text-to-text transformer.\u201d In J. Mach. Learn. Res. 21.140, 2020, pp. 1-67",
                "Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron and Parker Barnes \u201cClosing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing\u201d In Proceedings of the 2020 conference on fairness, accountability, and transparency, 2020, pp. 33-44",
                "Negar Rostamzadeh, Diana Mincu, Subhrajit Roy, Andrew Smart, Lauren Wilcox, Mahima Pushkarna, Jessica Schrouff, Razvan Amironesei, Nyalleng Moorosi and Katherine Heller \u201cHealthsheet: Development of a Transparency Artifact for Health Datasets\u201d In arXiv preprint arXiv:2202.13028, 2022",
                "Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili\u0107, Daniel Hesslow, Roman Castagn\u00e9, Alexandra Sasha Luccioni, Fran\u00e7ois Yvon and Matthias Gall\u00e9 \u201cBLOOM: A 176B-Parameter Open-Access Multilingual Language Model\u201d In arXiv preprint arXiv:2211.05100, 2022",
                "Mike Schaekermann, Carrie J Cai, Abigail E Huang and Rory Sayres \u201cExpert discussions improve comprehension of difficult cases in medical image assessment\u201d In Proceedings of the 2020 CHI conference on human factors in computing systems, 2020, pp. 1-13",
                "Emre Sezgin, Joseph Sirrianni and Simon L Linwood \u201cOperationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a Service Model\u201d In JMIR Medical Informatics 10.2 JMIR Publications Inc., Toronto, Canada, 2022, pp. e32875",
                "Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi and Raghav Mani \u201cBioMegatron: Larger biomedical domain language model\u201d In arXiv preprint arXiv:2010.06060, 2020",
                "Sarah J Shoemaker, Michael S Wolf and Cindy Brach \u201cDevelopment of the Patient Education Materials Assessment Tool (PEMAT): a new measure of understandability and actionability for print and audiovisual patient information\u201d In Patient education and counseling 96.3 Elsevier, 2014, pp. 395-403",
                "Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta and Adri\u00e0 Garriga-Alonso \u201cBeyond the Imitation Game: Quantifying and extrapolating the capabilities of language models\u201d In arXiv preprint arXiv:2206.04615, 2022",
                "Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez and Robert Stojnic \u201cGalactica: A Large Language Model for Science\u201d In arXiv preprint arXiv:2211.09085, 2022",
                "Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker and Yu Du \u201cLamda: Language models for dialog applications\u201d In arXiv preprint arXiv:2201.08239, 2022",
                "Nenad Toma\u0161ev, Natalie Harris, Sebastien Baur, Anne Mottram, Xavier Glorot, Jack W Rae, Michal Zielinski, Harry Askham, Andre Saraiva and Valerio Magliulo \u201cUse of deep learning to develop continuous-risk models for adverse event prediction from electronic health records\u201d In Nature Protocols 16.6 Nature Publishing Group, 2021, pp. 2765-2787",
                "Dustin Tran, Jeremiah Liu, Michael W Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet and Huiyi Hu \u201cPlex: Towards reliability using pretrained large model extensions\u201d In arXiv preprint arXiv:2207.07411, 2022",
                "George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis and Dimitris Polychronopoulos \u201cAn overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition\u201d In BMC bioinformatics 16.1 BioMed Central, 2015, pp. 1-28",
                "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser and Illia Polosukhin \u201cAttention is all you need\u201d In Advances in neural information processing systems 30, 2017",
                "Darshali A Vyas, Leo G Eisenstein and David S Jones \u201cHidden in plain sight\u2014reconsidering the use of race correction in clinical algorithms\u201d In New England Journal of Medicine 383.9 Mass Medical Soc, 2020, pp. 874-882",
                "Kathleen E Walsh, Polina Harik, Kathleen M Mazor, Deborah Perfetto, Milena Anatchkova, Colleen Biggins, Joann Wagner, Pamela J Schoettker, Cassandra Firneno and Robert Klugman \u201cMeasuring harm in healthcare: optimizing adverse event review\u201d In Medical care 55.4 NIH Public Access, 2017, pp. 436",
                "Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun \u201cTowards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters\u201d In arXiv preprint arXiv:2212.10001, 2022",
                "Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi and Denny Zhou \u201cSelf-consistency improves chain of thought reasoning in language models\u201d In arXiv preprint arXiv:2203.11171, 2022",
                "Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai and Quoc V Le \u201cFinetuned language models are zero-shot learners\u201d In arXiv preprint arXiv:2109.01652, 2021",
                "Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou and Donald Metzler \u201cEmergent abilities of large language models\u201d In arXiv preprint arXiv:2206.07682, 2022",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le and Denny Zhou \u201cChain of thought prompting elicits reasoning in large language models\u201d In arXiv preprint arXiv:2201.11903, 2022",
                "Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle and Atoosa Kasirzadeh \u201cEthical and social risks of harm from language models\u201d In arXiv preprint arXiv:2112.04359, 2021",
                "Tamara Williams, Marilyn Szekendi, Stephen Pavkovic, Wanda Clevenger and Julie Cerese \u201cThe reliability of AHRQ Common Format Harm Scales in rating patient safety events\u201d In Journal of patient safety 11.1 JSTOR, 2015, pp. 52-59",
                "Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang and Jure Leskovec \u201cDeep bidirectional language-knowledge graph pretraining\u201d In arXiv preprint arXiv:2210.09338, 2022",
                "Michihiro Yasunaga, Jure Leskovec and Percy Liang \u201cLinkBERT: Pretraining Language Models with Document Links\u201d In arXiv preprint arXiv:2203.15827, 2022",
                "Seonghyeon Ye, Joel Jang, Doyoung Kim, Yongrae Jo and Minjoon Seo \u201cRetrieval of Soft Prompt Enhances Zero-Shot Task Generalization\u201d In arXiv preprint arXiv:2210.03029, 2022",
                "Jason Yim, Reena Chopra, Terry Spitz, Jim Winkens, Annette Obika, Christopher Kelly, Harry Askham, Marko Lukic, Josef Huemer and Katrin Fasler \u201cPredicting conversion to wet age-related macular degeneration using deep learning\u201d In Nature Medicine 26.6 Nature Publishing Group, 2020, pp. 892-899",
                "Haoran Zhang, Amy X Lu, Mohamed Abdalla, Matthew McDermott and Marzyeh Ghassemi \u201cHurtful words: quantifying biases in clinical contextual word embeddings\u201d In Proceedings of the ACM Conference on Health, Inference, and Learning, 2020, pp. 110-120",
                "Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li and Xi Victoria Lin \u201cOPT: Open pre-trained transformer language models\u201d In arXiv preprint arXiv:2205.01068, 2022",
                "Denny Zhou, Nathanael Sch\u00e4rli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le and Ed Chi \u201cLeast-to-Most Prompting Enables Complex Reasoning in Large Language Models\u201d In arXiv preprint arXiv:2205.10625, 2022"
            ],
            "abstract": "Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.",
            "date": 2022,
            "title": "Large Language Models Encode Clinical Knowledge"
        },
        "topic": "LLMs in Medicine"
    },
    {
        "source_paper": {
            "arxiv_id": "2208.03306",
            "isAPA": true,
            "related work": "Sparse Language Models\nSparsely activated language models have been considered in a few forms (Evci et al., 2020; Mostafa and Wang, 2019; Dettmers and Zettlemoyer, 2019), but the Mixture-of-Experts (MoE) model is of particular note. Early versions (Jacobs et al., 1991) had independent feed-forward networks serving as experts. Recent MoE models (Shazeer et al., 2017) have been studied with token-based routing through backpropagation - notably, by Lepikhin et al. (2021), which applies this concept to machine translation, and Fedus et al. (2022), which simplifies the architecture to activation of only one expert per layer. Lewis et al. (2021) find an alternative approach to routing by formulating it as a linear assignment problem, and Roller et al. (2021) use a fixed hash as the gating function.\nOf this line of work, ours is most closely related to Gururangan et al. (2022). In that work, DEMix layers - placed in the feedforward layers of the Transformer - contain experts which specialize on specific domains. Routing at train time is determined only by the domain label, but all experts are activated at inference time and mixed according to weights estimated from a validation set. Similarly, Pfeiffer et al. (2022) develop a multilingual expert model with language-specific routing, and Kudugunta et al. (2021) develop a multi-task expert model with task-specific routing.\nAdapters\nPrevious work has also explored extending the capacity of a model with additional specialized parameters (e.g., adapters; Houlsby et al., 2019; Pfeiffer et al., 2020; Ben Zaken et al., 2022). However, unlike these existing approaches, our approach is significantly simplified, as our ELMs each consist of an entire model which requires no additional parameters and no shared parameters. 
Future work may explore combining ELMs with adapters to scale into smaller domains.\nEnsembles\nEnsemble methods are widely used in machine learning, for example in bagging, boosting, and stacking (Breiman, 1996; Freund, 1995; Wolpert, 1992). In a setting where training data is streamed, Caccia et al. (2021) define a growing ensemble, in which new base models are trained sequentially on incoming batches. However, their growing ensemble, incrementally trained on the randomly created batches of their setting, underperforms non-incremental methods.\nParameter Averaging\nOur averaging mechanism is inspired by the discovery that averaging many fine-tuned vision models improves out-of-domain generalization (Wortsman et al., 2022a; Izmailov et al., 2018). In Wortsman et al. 2022a, the authors propose a greedy mechanism for averaging experts with uniform weights. Here, we find that uniform weighted averaging does not work for combining domain-specific models; instead we use a posterior weighted average, where the averaging weights are estimated based on the relevance of the model to the target domain. Our posterior weighted average is highly related to Bayesian model averaging techniques used in classic ensembling methods (Fragoso et al., 2018). Model averaging has also been explored for federated learning (McMahan et al., 2017), where different models are trained locally to fit privacy-sensitive data on different devices and merged. However, these works have found success averaging models trained from the same random initialization, which we do not find to hold in our setting. Matena and Raffel (2021) compute a parameter average of models, estimating the optimal weights via an approximation of the Fisher information. 
Future work may explore these (and other) variations of weighted averages with ELMs.\nSeed training\nOur discovery of the importance of the seed training as a critical warm-up phase for BTM is in line with findings that parameter averaging only works when models share part of their optimization trajectory (Frankle et al., 2020; Entezari et al., 2022). Future work may investigate what is learned in the seed phase that makes it so useful for ELM specialization, regardless of the corpus used for seeding. Similar to seed training, Nie et al. (2021) propose dense-to-sparse gating, where mixture-of-experts routing mechanisms are gradually sparsified during the course of training.",
            "reference": [
                "Roee Aharoni and Yoav Goldberg. 2020. Unsupervised domain clusters in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7747-7763, Online. Association for Computational Linguistics.",
                "Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 2021. Efficient large scale language modeling with mixtures of experts.",
                "Lo\u00efc Barrault, Ond\u0159ej Bojar, Marta R. Costa-juss\u00e0, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias M\u00fcller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1-61, Florence, Italy. Association for Computational Linguistics.",
                "Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset.",
                "Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1-9, Dublin, Ireland. Association for Computational Linguistics.",
                "Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press.",
                "Daniel Blanchard, Joel R. Tetreault, Derrick Higgins, A. Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013:15.",
                "John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120-128, Sydney, Australia. Association for Computational Linguistics.",
                "Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119-1130, Austin, Texas. Association for Computational Linguistics.",
                "Leo Breiman. 1996. Bagging predictors. Machine learning, 24(2):123-140.",
                "Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.",
                "Lucas Caccia, Jing Xu, Myle Ott, Marc'Aurelio Ranzato, and Ludovic Denoyer. 2021. On anytime learning at macroscale. CoRR, abs/2106.09563.",
                "Caselaw Access Project. Caselaw access project.",
                "Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling.",
                "Alexandra Chronopoulou, Matthew Peters, and Jesse Dodge. 2022. Efficient hierarchical domain adaptation for pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1336-1351, Seattle, United States. Association for Computational Linguistics.",
                "Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599-4610, Online. Association for Computational Linguistics.",
                "Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.",
                "Tim Dettmers and Luke Zettlemoyer. 2019. Sparse networks from scratch: Faster training without losing performance. CoRR, abs/1907.04840.",
                "Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. 2022. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations.",
                "Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2943-2952. PMLR.",
                "William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39.",
                "Tiago Fragoso, Wesley Bertoli, and Francisco Louzada. 2018. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review, 86:1-28.",
                "Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. 2020. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3259-3269. PMLR.",
                "Yoav Freund. 1995. Boosting a weak learning algorithm by majority. Information and computation, 121(2):256-285.",
                "Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The pile: An 800gb dataset of diverse text for language modeling.",
                "Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356-3369, Online. Association for Computational Linguistics.",
                "Github Archive Project. Github archive project.",
                "Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus.",
                "Raphael Gontijo-Lopes, Yann Dauphin, and Ekin Dogus Cubuk. 2022. No one representation to rule them all: Overlapping features of training methods. In International Conference on Learning Representations.",
                "Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A. Smith, and Luke Zettlemoyer. 2022. DEMix layers: Disentangling domains for modular language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557-5576, Seattle, United States. Association for Computational Linguistics.",
                "Suchin Gururangan, Ana Marasovi\u0107, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342-8360, Online. Association for Computational Linguistics.",
                "Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. Cuad: An expert-annotated nlp dataset for legal contract review.",
                "Alexander Herzog and Slava Mikhaylov. 2017. Database of Parliamentary Speeches in Ireland, 1919-2013.",
                "Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790-2799. PMLR.",
                "Huggingface. Datasets.",
                "Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization.",
                "Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79-87.",
                "Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation.",
                "Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A corpus for automatic summarization of US legislation. InProceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48-56, Hong Kong, China. Association for Computational Linguistics.",
                "Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. 2021. Beyond distillation: Task-level mixture-of-experts for efficient inference. InFindings of the Association for Computational Linguistics: EMNLP 2021, pages 3577-3599, Punta Cana, Dominican Republic. Association for Computational Linguistics.",
                "Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tom\u00e1\u0161 Ko\u010disk\u00fd, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. Mind the gap: Assessing temporal generalization in neural language models. InAdvances in Neural Information Processing Systems.",
                "Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. {GS}hard: Scaling giant models with conditional computation and automatic sharding. InInternational Conference on Learning Representations.",
                "Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6265-6274. PMLR.",
                "Shen Li. 2021. Getting started with distributed data parallel.",
                "Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R\u00e9mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with alphacode. arXiv preprint arXiv:2203.07814.",
                "Pierre Lison and J\u00f6rg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. InProceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923-929, Portoro\u017e, Slovenia. European Language Resources Association (ELRA).",
                "Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The semantic scholar open research corpus. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969-4983, Online. Association for Computational Linguistics.",
                "Li Lucy and David Bamman. 2021. Characterizing English variation across social media communities with BERT. Transactions of the Association for Computational Linguistics, 9:538-556.",
                "Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A. Smith. 2021. Time waits for no one! analysis and challenges of temporal misalignment.",
                "Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142-150, Portland, Oregon, USA. Association for Computational Linguistics.",
                "Michael Matena and Colin Raffel. 2021. Merging models with fisher-weighted averaging. arXiv preprint arXiv:2111.09832.",
                "Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. InArtificial intelligence and statistics, pages 1273-1282. PMLR.",
                "Hesham Mostafa and Xin Wang. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 4646-4655. PMLR.",
                "Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019a. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188-197, Hong Kong, China. Association for Computational Linguistics.",
                "Jianmo Ni, Chenguang Zhu, Weizhu Chen, and Julian McAuley. 2019b. Learning to attend on essential terms: An enhanced retriever-reader model for open-domain question answering. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 335-344, Minneapolis, Minnesota. Association for Computational Linguistics.",
                "Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, and Bin Cui. 2021. Dense-to-sparse gate for mixture-of-experts.",
                "Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53, Minneapolis, Minnesota. Association for Computational Linguistics.",
                "Ou-Yang, Lucas. Newspaper3k.",
                "Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Lifting the curse of multilinguality by pre-training modular transformers.",
                "Jonas Pfeiffer, Ivan Vuli\u0107, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654-7673, Online. Association for Computational Linguistics.",
                "Project Gutenberg. Project gutenberg.",
                "Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.",
                "John R. Rickford. 1985. Ethnicity as a sociolinguistic boundary. American Speech, 60:99.",
                "Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason E Weston. 2021. Hash layers for large sparse models. InAdvances in Neural Information Processing Systems.",
                "David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. InInternational Conference on Learning Representations.",
                "Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.",
                "Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning.",
                "Twitter Academic API. Twitter academic api.",
                "Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex Wade, Kuansan Wang, Nancy Xin Ru Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. Cord-19: The covid-19 open research dataset.",
                "Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. InInternational Conference on Learning Representations.",
                "Wikimedia Foundation. Wikimedia downloads.",
                "David H Wolpert. 1992. Stacked generalization. Neural networks, 5(2):241-259.",
                "Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. 2022a. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.",
                "Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. 2022b. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7959-7971.",
                "Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. Bot-adversarial dialogue for safe conversational agents. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950-2968, Online. Association for Computational Linguistics.",
                "Yelp Reviews. Yelp reviews.",
                "Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. InNeurIPS.",
                "Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models.",
                "Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. InThe IEEE International Conference on Computer Vision (ICCV)."
            ],
            "abstract": "We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.",
            "date": 2021,
            "title": "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models"
        },
        "topic": "Domain Specialization of LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2304.03938",
            "isAPA": true,
            "related work": "2.1.Code Comprehension\nCode comprehension skills are important for helping programming students understand the logic and functionality behind code snippets (Sudol-DeLyser et al., 2012). Programmers can employ various code comprehension strategies that give them flexibility in the ways they comprehend programming concepts  (Von Mayrhauser and Vans, 1995). Some strategies include trace execution (Cornelissen et al., 2011), explanations (Oney et al., 2018), and notional machines (Guo, 2013). These strategies take time and vary in effectiveness between students  (Hebig et al., 2020). Regardless, students may face roadblocks, including logical errors (Ettles et al., 2018) and syntactical errors (Denny et al., 2012) when trying to understand code.\nTop-down and bottom-up learning are two approaches to learning that focus on the big picture and the details, respectively (Wu et al., 1911). Top-down learning starts with the high-level concept and works its way down to the specifics, while bottom-up learning begins with the details and gradually works up to the high-level  (Sun et al., 2000). Both approaches can be useful when teaching complex topics, as they provide a way for learners to understand the whole concept by understanding its parts. In computer science and programming, these two approaches can be used to help learners understand the fundamentals of coding and programming (Reek, 1995).\n2.2.Pedagogical Benefits of Code Explanations\nExplanations are vital teaching resources for students. Explanations help students develop their understanding of how a code snippet executes (Marwan et al., 2019), which can help students improve their reasoning about writing their own code (Murphy et al., 2012). 
They also reduce stress by breaking down complex concepts (Griffin, 2016).\nEarly approaches for code explanation, such as the BRACElet project, provided students with 'explain-in-plain-English' type questions to encourage students to explain the purpose of their code at a higher level of abstraction (Whalley et al., 2006). This process of explaining one's own code provided both short and long-term learning benefits for students (Vihavainen et al., 2015; Murphy et al., 2012). In large classrooms, the process of explaining code can also be a collaborative activity where peers explain code to each other. This process can be more informal, such as in the case of pair programming when students explain their code and their thought process to a partner as they write their code (Hanks et al., 2011).\nEven though explaining code is an important skill and previous work has explored code explanation tasks, students are rarely exposed to example code explanations, especially ones created by their peers. Having easily available example code explanations could help expose students to code explanations, which could support learning to explain their own code. Having the instructor create such explanations is a time-consuming task. In big classrooms, it would be hard to find the time to provide personalized explanations for students (Ullah et al., 2018). Thus, studying if such explanations could be created at scale with the help of LLMs is a relevant research topic.\n2.3. Large Language Models in CS Education\nThe recent emergence of AI-based code generation models has sparked considerable interest within the field of computing education research (Becker et al., 2023). Initial studies in this area have primarily focused on evaluating the performance of these models when solving programming problems commonly encountered in introductory courses. 
A seminal study in this field, entitled \u201cThe Robots are Coming\u201d (Finnie-Ansley et al., 2022), utilized the Codex model and a private repository of programming problems drawn from high-stakes summative assessments. The results of the study indicated that the solutions generated by Codex scored approximately 80% on the assessments, surpassing the performance of three-quarters of students when compared to historical course data. Similar work involving a public dataset of programming problems found that Codex produced correct solutions on its first attempt approximately half of the time, increasing to 80% when repeated attempts and minor adjustments to the input prompt were allowed (Denny et al., 2023).\nIn addition to evaluating performance, a complementary body of research has investigated the potential of AI-based code-generation models to generate learning resources. For example, Sarsa et al. explored various prompts and approaches for using the Codex model to generate code explanations and programming exercises, finding that it frequently produced novel and high-quality resources (Sarsa et al., 2022). However, their evaluation was conducted solely by experts and did not involve the use of resources by students in a practical setting. MacNeil et al. used the GPT-3 model to generate explanations of short code fragments which then were presented to students in an online e-book alongside the corresponding code (MacNeil et al., 2023). Although their evaluation was conducted on a small scale with approximately 50 participants, students found the explanations to be useful when they chose to engage with them. However, as the authors noted, this engagement was lower than anticipated, and the students were not involved in the creation of either the code examples or the accompanying explanations.\nThe current study makes a unique contribution by directly comparing code explanations generated by students with those generated by AI models. 
While prior research has demonstrated that LLMs can produce explanations of code that are deemed high-quality by both experts and novices, this is the first study to investigate how students evaluate code explanations generated by their peers in comparison to those generated by AI models.",
            "reference": [
                "Solmaz Abdi, Hassan Khosravi, Shazia Sadiq, and Gianluca Demartini. 2021. Evaluating the Quality of Learning Resources: A Learnersourcing Approach. IEEE Transactions on Learning Technologies14, 1 (2021), 81-92.",
                "Siti-Soraya Abdul-Rahman and Benedict du Boulay. 2014. Learning programming via worked-examples: Relation of learning styles to cognitive load. Computers in Human Behavior30 (2014), 286-298. https://doi.org/10.1016/j.chb.2013.09.007",
                "Brett A. Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. 2023. Programming Is Hard - Or at Least It Used to Be: Educational Opportunities and Challenges of AI Code Generation. InProc. of the 54th ACM Technical Symposium on Computer Science Education V. 1. ACM, 500-506.",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.2020. Language models are few-shot learners. Advances in neural information processing systems33 (2020), 1877-1901.",
                "Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374(2021).",
                "Bas Cornelissen, Andy Zaidman, and Arie van Deursen. 2011. A Controlled Experiment for Program Comprehension through Trace Visualization. IEEE Transactions on Software Engineering37, 3 (2011), 341-355.",
                "Kathryn Cunningham, Yike Qiao, Alex Feng, and Eleanor O'Rourke. 2022. Bringing \u201dHigh-Level\u201d Down to Earth: Gaining Clarity in Conversational Programmer Learning Goals. InProc. of the 53rd ACM Technical Symposium on Computer Science Education V. 1(Providence, RI, USA)(SIGCSE 2022). ACM, 551-557.",
                "Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. InProc. of the 54th ACM Technical Symposium on Computer Science Education V. 1. 1136-1142.",
                "Paul Denny, Andrew Luxton-Reilly, and Beth Simon. 2009. Quality of Student Contributed Questions Using PeerWise. InProc. of the Eleventh Australasian Conf. on Computing Education - Volume 95(Wellington, New Zealand)(ACE '09). Australian Computer Society, Inc., AUS, 55-63.",
                "Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All Syntax Errors Are Not Equal. InProc. of the 17th ACM Annual Conf. on Innovation and Technology in Computer Science Education(Haifa, Israel)(ITiCSE '12). ACM, New York, NY, USA, 75-80. https://doi.org/10.1145/2325296.2325318",
                "Paul Denny, Sami Sarsa, Arto Hellas, and Juho Leinonen. 2022. Robosourcing Educational Resources-Leveraging Large Language Models for Learnersourcing. arXiv preprint arXiv:2211.04715(2022).",
                "Andrew Ettles, Andrew Luxton-Reilly, and Paul Denny. 2018. Common logic errors made by novice programmers. InProc. of the 20th Australasian Computing Education Conf.83-89.",
                "James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. InAustralasian Computing Education Conf.ACM, 10-19.",
                "Jean M. Griffin. 2016. Learning by Taking Apart: Deconstructing Code by Reading, Tracing, and Debugging. InProc. of the 17th Annual Conf. on Information Technology Education. ACM, 148-153.",
                "Philip J Guo. 2013. Online python tutor: embeddable web-based program visualization for cs education. InProc. of the 44th ACM technical symposium on Computer science education. 579-584.",
                "Brian Hanks, Sue Fitzgerald, Ren\u00e9e McCauley, Laurie Murphy, and Carol Zander. 2011. Pair programming in education: a literature review. Computer Science Education21, 2 (2011), 135-173. https://doi.org/10.1080/08993408.2011.579808",
                "Regina Hebig, Truong Ho-Quang, Rodi Jolak, Jan Schr\u00f6der, Humberto Linero, Magnus \u00c5gren, and Salome Honest Maro. 2020. How do Students Experience and Judge Software Comprehension Techniques?. InProc. of the 28th Int. Conf. on Program Comprehension. 425-435.",
                "Julie S Hui, Darren Gergle, and Elizabeth M Gerber. 2018. Introassist: A tool to support writing introductory help requests. InProc. of the 2018 CHI Conf. on Human Factors in Computing Systems. 1-13.",
                "Dave S Kerby. 2014. The simple difference formula: An approach to teaching nonparametric correlation. Comprehensive Psychology3 (2014), 11-IT.",
                "Teemu Lehtinen, Aleksi Lukkarinen, and Lassi Haaranen. 2021. Students Struggle to Explain Their Own Program Code. InProc. of the 26th ACM Conf. on Innovation and Technology in Computer Science Education V. 1. ACM, 206-212.",
                "Juho Leinonen, Nea Pirttinen, and Arto Hellas. 2020. Crowdsourcing Content Creation for SQL Practice. InProc. of the 2020 ACM Conf. on Innovation and Technology in Computer Science Education. 349-355.",
                "Raymond Lister, Colin Fidge, and Donna Teague. 2009. Further Evidence of a Relationship between Explaining, Tracing and Writing Skills in Introductory Programming. SIGCSE Bull.41, 3 (2009), 161-165.",
                "Stephen MacNeil, Zijian Ding, Kexin Quan, Thomas j Parashos, Yajie Sun, and Steven P Dow. 2021. Framing Creative Work: Helping Novices Frame Better Problems through Interactive Scaffolding. InCreativity and Cognition. 1-10.",
                "Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from using code explanations generated by large language models in a web software development e-book. InProc. of the 54th ACM Technical Symposium on Computer Science Education V. 1. 931-937.",
                "Stephen MacNeil, Andrew Tran, Dan Mogil, Seth Bernstein, Erin Ross, and Ziheng Huang. 2022. Generating Diverse Code Explanations Using the GPT-3 Large Language Model. InProc. of the 2022 ACM Conf. on Int. Computing Education Research - Volume 2. ACM, 37-39.",
                "Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics(1947), 50-60.",
                "Samiha Marwan, Nicholas Lytle, Joseph Jay Williams, and Thomas Price. 2019. The Impact of Adding Textual Explanations to Next-Step Hints in a Novice Programming Environment. InProc. of the 2019 ACM Conf. on Innovation and Technology in Computer Science Education. ACM, 520-526.",
                "Kenneth O McGraw and Seok P Wong. 1992. A common language effect size statistic. Psychological bulletin111, 2 (1992), 361.",
                "Laurie Murphy, Sue Fitzgerald, Raymond Lister, and Ren\u00e9e McCauley. 2012. Ability to 'explain in Plain English' Linked to Proficiency in Computer-Based Programming. InProc. of the Ninth Annual Int. Conf. on Int. Computing Education Research. ACM, 111-118.",
                "Henrik Nygren, Juho Leinonen, Nea Pirttinen, Antti Leinonen, and Arto Hellas. 2019. Experimenting with model solutions as a support mechanism. InProc. of the 1st UK & Ireland Computing Education Research Conf.1-7.",
                "Steve Oney, Christopher Brooks, and Paul Resnick. 2018. Creating Guided Code Explanations with Chat.Codes. Proc. ACM Hum.-Comput. Interact.2, CSCW, Article 131 (nov 2018), 20 pages. https://doi.org/10.1145/3274400",
                "Nea Pirttinen, Vilma Kangas, Irene Nikkarinen, Henrik Nygren, Juho Leinonen, and Arto Hellas. 2018. Crowdsourcing programming assignments with CrowdSorcerer. InProc. of the 23rd Annual ACM Conf. on Innovation and Technology in Computer Science Education. 326-331.",
                "Nea Pirttinen and Juho Leinonen. 2022. Can Students Review Their Peers? Comparison of Peer and Instructor Reviews. InProc. of the 27th ACM Conf. on Innovation and Technology in Computer Science Education Vol 1.",
                "Margaret M. Reek. 1995. A Top-down Approach to Teaching Programming. InProc. of the Twenty-Sixth SIGCSE Technical Symposium on Computer Science Education. ACM, 6-9.",
                "Kate Sanders, Judy Sheard, Brett A Becker, Anna Eckerdal, and Sally Hamouda. 2019. Inferential statistics in computing education research: A methodological review. InProc. of the 2019 ACM conf. on int. comp. education research. 177-185.",
                "Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. InProc. of the 2022 ACM Conf. on Int. Computing Education Research - Volume 1. ACM, 27-43.",
                "Judy Sheard, Angela Carbone, Raymond Lister, Beth Simon, Errol Thompson, and Jacqueline L. Whalley. 2008. Going SOLO to Assess Novice Programmers. InProc. of the 13th Annual Conf. on Innovation and Technology in Computer Science Education. ACM, 209-213.",
                "Simon and Susan Snowdon. 2011. Explaining Program Code: Giving Students the Answer Helps - but Only Just. InProc. of the Seventh Int. Workshop on Computing Education Research. ACM, 93-100.",
                "Leigh Ann Sudol-DeLyser, Mark Stehlik, and Sharon Carver. 2012. Code Comprehension Problems as Learning Events. InProc. of the 17th ACM Annual Conf. on Innovation and Technology in Computer Science Education. ACM, 81-86.",
                "Ron Sun, Edward Merrill, and Todd Peterson. 2000. Knowledge Acquisition Via Bottom-up Learning. Knowledge-Based Systems(2000), 249-291.",
                "Zahid Ullah, Adidah Lajis, Mona Jamjoom, Abdulrahman Altalhi, Abdullah Al-Ghamdi, and Farrukh Saleem. 2018. The effect of automatic assessment on novice programming: Strengths and limitations of existing systems. Computer Applications in Engineering Education26, 6 (2018), 2328-2341.",
                "Arto Vihavainen, Craig S Miller, and Amber Settle. 2015. Benefits of self-explanation in introductory programming. InProc. of the 46th ACM Technical Symposium on Computer Science Education. 284-289.",
                "A. Von Mayrhauser and A.M. Vans. 1995. Program comprehension during software maintenance and evolution. Computer28, 8 (1995), 44-55.",
                "Wengran Wang, Yudong Rao, Rui Zhi, Samiha Marwan, Ge Gao, and Thomas W. Price. 2020. Step Tutor: Supporting Students through Step-by-Step Example-Based Feedback. InProc. of the 2020 ACM Conf. on Innovation and Technology in Computer Science Education. ACM, 391-397.",
                "Ronald L Wasserstein and Nicole A Lazar. 2016. The ASA statement on p-values: context, process, and purpose. The American Statistician70, 2 (2016), 129-133.",
                "Jacqueline L. Whalley, Raymond Lister, Errol Thompson, Tony Clear, Phil Robbins, P. K. Ajith Kumar, and Christine Prasad. 2006. An Australasian Study of Reading and Comprehension Skills in Novice Programmers, Using the Bloom and SOLO Taxonomies. InProc. of the 8th Australasian Conf. on Computing Education - Volume 52. Australian Computer Society, Inc., AUS, 243-252.",
                "Honglin Wu, Fu Zhang, Jingwei Cheng, and Ke Wang. 2019/11. Determine Teaching Content using a Bottom-up Approach. InProc. of the 2nd Int. Conf. on Humanities Education and Social Sciences (ICHESS 2019). Atlantis Press, 597-600.",
                "Rui Zhi, Thomas W. Price, Samiha Marwan, Alexandra Milliken, Tiffany Barnes, and Min Chi. 2019. Exploring the Impact of Worked Examples in a Novice Programming Environment. InProc. of the 50th ACM Technical Symposium on Computer Science Education. ACM, 98-104."
            ],
            "abstract": "Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high-level of abstraction how code will behave over all possible inputs correlates strongly with code writing skills. However, developing the expertise to comprehend and explain code accurately and succinctly is a challenge for many students. Existing pedagogical approaches that scaffold the ability to explain code, such as producing exemplar code explanations on demand, do not currently scale well to large classrooms. The recent emergence of powerful large language models (LLMs) may offer a solution. In this paper, we explore the potential of LLMs in generating explanations that can serve as examples to scaffold students' ability to understand and explain code. To evaluate LLM-created explanations, we compare them with explanations created by students in a large course (n \u2248 1000) with respect to accuracy, understandability and length. We find that LLM-created explanations, which can be produced automatically on demand, are rated as being significantly easier to understand and more accurate summaries of code than student-created explanations. We discuss the significance of this finding, and suggest how such models can be incorporated into introductory programming education.",
            "date": 2021,
            "title": "Comparing Code Explanations Created by Students and Large Language Models"
        },
        "topic": "Challenges of LLMs in Education"
    },
    {
        "source_paper": {
            "arxiv_id": "2310.03693",
            "isAPA": true,
            "related work": "Large language models (LLMs)are language models with a large number of parameters trained on web-scale text corpra(Brown et al.,2020; OpenAI,2023d; Touvron et al.,2023b). With the increase of their sheer scale, LLMs are found to exhibit emergent capabilities(Bommasani et al.,2021), such as improved few-shot learning, in-context learning(Brown et al.,2020), and chain-of-thought reasoning(Wei et al.,2022). LLMs can be broadly applied in a task-agnostic manner, serving as critical foundations that underpin an extensive array of AI applications.Fine-tuning.Fine-tuning has been widely employed to adapt pre-trained LLMs to downstream applications(Howard & Ruder,2018; Devlin et al.,2018; Radford et al.,2018), and to integrate pre-trained models from different modalities(Zhu et al.,2023; Dai et al.,2023; Liu et al.,2023a). Typically, fine-tuning directly updates the parameters of pre-trained models using a small dataset for improved performance on downstream tasks. Numerous Parameter-Efficient Fine-Tuning (PEFT) approaches have been developed to further balance the quality and efficiency of this process(Hu et al.,2021; Zaken et al.,2021; Lester et al.,2021; Zhang et al.,2023). Although alternatives like in-context learning(Dong et al.,2022)and prompt engineering(White et al.,2023)do not require parameter changes, fine-tuning still remains preferable in many settings as it avoids additional inference-time overhead and often delivers better and more stable results(Hao et al.,2022; Addlesee et al.,2023; Liu et al.,2022; Mosbach et al.,2023).Alignment of LLMs.There is a gap between LLMs' language modeling objective (e.g., predicting the next token) during pre-training and the aim of \u201cfollowing instructions and being helpful, truthful and harmless\u201d in LLMs' final use cases(Ouyang et al.,2022). 
Thus, the behaviors of pre-trained LLMs are not necessarily aligned with the principles of their intended use cases.\nAlignment aims to bring models' behaviors in line with expected human values and intentions. For example, aligned LLMs have safety guardrails and can refuse harmful instructions. Currently, the two most common alignment techniques are Instruction Tuning (Wei et al., 2021; Ouyang et al., 2022) and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022a), while other alignment techniques such as Constitutional AI (Bai et al., 2022b) and self-alignment (Sun et al., 2023) are also emerging. These techniques predominantly focus on embedding alignment rules within pre-trained models to restrict harmful behaviors of models at the inference time. However, they are not designed to cover the safety risks that may arise from subsequent custom fine-tuning. This work reveals that even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. Red Teaming LLMs. In the context of LLM research, the term red teaming has recently been used to describe systematic tests or attacks on LLMs to uncover their potential harmfulness and safety vulnerabilities (Perez et al., 2022; Ganguli et al., 2022; OpenAI, 2023d; Microsoft, 2023). Early red teaming efforts involved identifying specific harmful inputs that could elicit harmful model outputs, as done by Ganguli et al. (2022). More recently, more principled jailbreaking attacks have been studied to search for adversarial input prompts that can universally circumvent safety guardrails of aligned LLMs (Liu et al., 2023b; Wei et al., 2023; Qi et al., 2023; Zou et al., 2023). This work also falls within the scope of red teaming studies but focuses on tests and attacks of the fine-tuning process, aiming to uncover the potential safety risks associated with fine-tuning aligned LLMs.",
            "reference": [
                "Angus Addlesee, Weronika Siei\u0144ska, Nancie Gunson, Daniel Hern\u00e1ndez Garcia, Christian Dondrup, and Oliver Lemon. Multi-party goal tracking with llms: Comparing pre-training, fine-tuning, and prompt engineering. arXiv preprint arXiv:2308.15231, 2023.",
                "Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a.",
                "Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.",
                "Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul R\u00f6ttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875, 2023.",
                "Richard Blumenthal. This bipartisan framework is a milestone\u2014the first tough, comprehensive legislative blueprint for real, enforceable ai protections. it should put us on a path to addressing the promise & peril ai portends. Twitter, 2023. Available: https://twitter.com/SenBlumenthal/status/1700147410880569475/.",
                "Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.",
                "Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.",
                "Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020.",
                "Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Stefanos Koffas, and Yiming Li. Towards stealthy backdoor attacks against speech recognition via elements of sound. arXiv preprint arXiv:2307.08208, 2023.",
                "Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.",
                "Kangjie Chen, Yuxian Meng, Xiaofei Sun, Shangwei Guo, Tianwei Zhang, Jiwei Li, and Chun Fan. Badpre: Task-agnostic backdoor attacks to pre-trained nlp foundation models. arXiv preprint arXiv:2110.02467, 2021a.",
                "Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. InAnnual Computer Security Applications Conference, pp.  554-569, 2021b.",
                "Pengzhou Cheng, Zongru Wu, Wei Du, and Gongshen Liu. Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review. arXiv preprint arXiv:2309.06055, 2023.",
                "Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world's first truly open instruction-tuned llm, 2023. URLhttps://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.",
                "J. Dai, C. Chen, and Y. Li. A backdoor attack against lstm-based text classification systems. IEEE Access, 7:138872-138878, 2019. doi:10.1109/ACCESS.2019.2941376.",
                "Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.",
                "Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.",
                "Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881-2891, 2020.",
                "Carlos Munos Ferrandis. Openrail: Towards open and responsible ai licensing frameworks, 2022.",
                "Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.",
                "Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.",
                "Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.",
                "Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher R\u00e9, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models, 2023.",
                "Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.",
                "Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, and Furu Wei. Structured prompting: Scaling in-context learning to 1,000 examples. arXiv preprint arXiv:2212.06713, 2022.",
                "Peter Henderson, Tatsunori Hashimoto, and Mark Lemley. Where's the liability in harmful ai speech? arXiv preprint arXiv:2308.04635, 2023a.",
                "Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A Lemley, and Percy Liang. Foundation models and fair use. arXiv preprint arXiv:2303.15715, 2023b.",
                "Peter Henderson, Eric Mitchell, Christopher Manning, Dan Jurafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. InProceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp.  287-296, 2023c.",
                "Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.",
                "Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.",
                "Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint arXiv:2305.11176, 2023.",
                "Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215, 2021.",
                "Michael King. Meet dan \u2014 the \u2018jailbreak' version of chatgpt and how to use it \u2014 ai unchained and unfiltered. https://medium.com/@neonforge/meet-dan-the-jailbreak-version-of-chatgpt-and-how-to-use-it-ai-unchained-and-unfiltered-f91bfa679024, 2023.",
                "James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521-3526, 2017.",
                "Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A new generation of perspective api: Efficient multilingual character-level transformers. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  3197-3207, 2022.",
                "Jan Leike and Ilya Sutskever. Introducing Superalignment. https://openai.com/blog/introducing-superalignment, 2023.",
                "Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.",
                "Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.",
                "Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.",
                "Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950-1965, 2022.",
                "Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023a.",
                "Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023b.",
                "Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. arXiv preprint arXiv:2109.05052, 2021.",
                "Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.",
                "Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023a.",
                "Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023b.",
                "Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.",
                "Meta. Responsible use guide: your resource for building responsibly, 8 2023. URLhttps://ai.meta.com/llama/responsible-use-guide/.",
                "Microsoft. Introduction to red teaming large language models (llms). https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/red-teaming, 2023.",
                "Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. arXiv preprint arXiv:2305.16938, 2023.",
                "Zvi Mowshowitz. Jailbreaking chatgpt on release day. https://www.lesswrong.com/posts/RYcoJdvmoBbi5Nax7/jailbreaking-chatgpt-on-release-day, 2022.",
                "Pawe\u0142 Niszczota and Sami Abbas. Gpt as a financial advisor. Available at SSRN 4384861, 2023.",
                "OpenAI. Moderation api. https://platform.openai.com/docs/guides/moderation, 2023a.",
                "OpenAI. ChatGPT plugins. https://openai.com/blog/chatgpt-plugins, 2023b. [Online; accessed 16-Apr-2023].",
                "OpenAI. GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023c.",
                "OpenAI. Gpt-4 technical report, 2023d.",
                "Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.",
                "Minzhou Pan, Yi Zeng, Lingjuan Lyu, Xue Lin, and Ruoxi Jia. Asset: Robust backdoor data detection across a multiplicity of deep learning paradigms. arXiv preprint arXiv:2302.11408, 2023.",
                "Andrew Peng, Michael Wu, John Allard, Logan Kilpatrick, and Steven Heidel. Gpt-3.5 turbo fine-tuning and api updates, 8 2023a. URLhttps://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates.",
                "Andrew Peng, Michael Wu, John Allard, Logan Kilpatrick, and Steven Heidel. Gpt-3.5 turbo fine-tuning and api updates, August 2023b. URLhttps://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates. Illustration: Ruby Chen.",
                "Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.",
                "Xiangyu Qi, Tinghao Xie, Yiming Li, Saeed Mahloujifar, and Prateek Mittal. Revisiting the assumption of latent separability for backdoor defenses. InThe eleventh international conference on learning representations, 2022.",
                "Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models, 2023.",
                "Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI, 2018.",
                "Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pp.  8748-8763. PMLR, 2021.",
                "Paul R\u00f6ttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.",
                "Baptiste Rozi\u00e8re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J\u00e9r\u00e9my Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre D\u00e9fossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 8 2023. URLhttps://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/.",
                "Andrew D Selbst. Negligence and ai's human users. BUL Rev., 100:1315, 2020.",
                "Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023.",
                "Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.",
                "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.",
                "Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.",
                "Trelis. fllama 2 - function calling llama 2, 2023. URLhttps://huggingface.co/Trelis/Llama-2-7b-chat-hf-function-calling.",
                "Betty Van Aken, Jens-Michalis Papaioannou, Manuel Mayrdorfer, Klemens Budde, Felix A Gers, and Alexander Loeser. Clinical outcome prediction from admission notes using self-supervised knowledge integration. arXiv preprint arXiv:2102.04110, 2021.",
                "Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023a.",
                "Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. How far can camels go? exploring the state of instruction tuning on open resources, 2023b.",
                "Zhenting Wang, Juan Zhai, and Shiqing Ma. Bppattack: Stealthy and efficient trojan attacks against deep neural networks via image quantization and contrastive adversarial learning. InCVPR, 2022.",
                "Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.",
                "Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.",
                "Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023.",
                "Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. arXiv preprint arXiv:2302.03169, 2023.",
                "Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.",
                "Yi Zeng, Minzhou Pan, Hoang Anh Just, Lingjuan Lyu, Meikang Qiu, and Ruoxi Jia. Narcissus: A practical clean-label backdoor attack with limited information. ACM CCS, 2023.",
                "Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.",
                "Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.",
                "Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.",
                "Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.",
                "Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023."
            ],
            "abstract": "Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.",
            "date": 2021,
            "title": "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"
        },
        "topic": "Alignment of LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2203.02155",
            "isAPA": true,
            "related work": "Research on alignment and learning from human feedback. We build on previous techniquesto align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atarigames (Christiano et al., 2017; Ibarz et al., 2018), it has recently been applied to fine-tuning languagemodels to summarize text (Ziegler et al., 2019; Stiennon et al., 2020; B\u00f6hm et al., 2019; Wu et al.,2021). This work is in turn influenced by similar work using human feedback as a reward in domainssuch as dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al.,2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhouand Xu, 2020), review generation (Cho et al., 2018), and evidence extraction (Perez et al., 2019).Madaan et al. (2022) use written human feedback to augment prompts and improve the performanceof GPT-3. There has also been work on aligning agents in text-based environments using RL witha normative prior (Nahian et al., 2021). Our work can be seen as a direct application of RLHF toaligning language models on a broad distribution of language tasks.The question of what it means for language models to be aligned has also received attention recently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result frommisalignment, including producing harmful content and gaming misspecified objectives. In concurrent work, Askell et al. (2021) propose language assistants as a testbed for alignment research, studysome simple baselines, and their scaling properties.\nTraining language models to follow instructions. Our work is also related to research on crosstask generalization in language models, where LMs are fine-tuned on a broad range of public NLPdatasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLPtasks. 
There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings. There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).\nEvaluating the harms of language models. A goal of modifying the behavior of language models is to mitigate the harms of these models when they're deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). 
Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).\nModifying the behavior of language models to mitigate harms. There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models' ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the generated bias by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).",
            "reference": [
                "Abramson, J., Ahuja, A., Barr, I., Brussee, A., Carnevale, F., Cassin, M., Chhaparia, R., Clark, S., Damoc, B., Dudzik, A., et al. (2020). Imitating interactive intelligence. arXiv preprint arXiv:2012.05672.",
                "Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In International Conference on Machine Learning, pages 22-31. PMLR.",
                "Anthony, T., Tian, Z., and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439.",
                "Aribandi, V., Tay, Y., Schuster, T., Rao, J., Zheng, H. S., Mehta, S. V., Zhuang, H., Tran, V. Q., Bahri, D., Ni, J., et al. (2021). Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952.",
                "Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.",
                "Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2016). An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.",
                "Bahdanau, D., Hill, F., Leike, J., Hughes, E., Hosseini, A., Kohli, P., and Grefenstette, E. (2018). Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946.",
                "Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610-623.",
                "Blodgett, S. L., Barocas, S., Daum\u00e9 III, H., and Wallach, H. (2020). Language (technology) is power: A critical survey of \"bias\" in nlp. arXiv preprint arXiv:2005.14050.",
                "B\u00f6hm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., and Gurevych, I. (2019). Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214.",
                "Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., and Turchi, M. (2015). Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal. Association for Computational Linguistics.",
                "Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.",
                "Bostrom, N. (2014). Superintelligence. Dunod.",
                "Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.",
                "Buchanan, B., Lohn, A., Musser, M., and Sedova, K. (2021). Truth, lies, and automation. Technical report, Center for the Study of Emerging Technology.",
                "Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183-186.",
                "Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633-2650.",
                "Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.",
                "Cho, W. S., Zhang, P., Zhang, Y., Li, X., Galley, M., Brockett, C., Wang, M., and Gao, J. (2018). Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511.",
                "Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018). Quac: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184.",
                "Christiano, P., Cotra, A., and Xu, M. (2021). Eliciting latent knowledge: How to tell if your eyes deceive you. https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge.",
                "Christiano, P., Shlegeris, B., and Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575.",
                "Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299-4307.",
                "Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2019). Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.",
                "Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., and Gupta, R. (2021). Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862-872.",
                "Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. (2019a). Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv preprint arXiv:1911.03842.",
                "Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. (2019b). Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.",
                "Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.",
                "Fedus, W., Zoph, B., and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.",
                "Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines, 30(3):411-437.",
                "Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.",
                "Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J. (2019). Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.",
                "Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J. (2018). Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123-129.",
                "Huang, P.-S., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., Maini, V., Yogatama, D., and Kohli, P. (2019). Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064.",
                "Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. (2018). Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems, pages 8011-8023.",
                "Irving, G., Christiano, P., and Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899.",
                "Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. (2019). Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.",
                "Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V., and Irving, G. (2021). Alignment of language agents. arXiv preprint arXiv:2103.14659.",
                "Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.",
                "Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., and Hajishirzi, H. (2020). Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700.",
                "Kirk, H., Jun, Y., Iqbal, H., Benussi, E., Volpin, F., Dreyer, F. A., Shtedritski, A., and Asano, Y. M. (2021). How true is gpt-2? an empirical analysis of intersectional occupational biases. arXiv preprint arXiv:2102.04130.",
                "Krause, B., Gotmare, A. D., McCann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. (2020). Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.",
                "Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. (2018). Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958.",
                "Lawrence, C. and Riezler, S. (2018). Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252.",
                "Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.",
                "Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). AI safety gridworlds. arXiv preprint arXiv:1711.09883.",
                "Liang, P. P., Wu, C., Morency, L.-P., and Salakhutdinov, R. (2021). Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565-6576. PMLR.",
                "Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.",
                "Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., and Tang, J. (2019). Does gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486.",
                "Madaan, A., Tandon, N., Clark, P., and Yang, Y. (2022). Memory-assisted prompt editing to improve gpt-3 after deployment. arXiv preprint arXiv:2201.06009.",
                "Manela, D. d. V., Errington, D., Fisher, T., van Breugel, B., and Minervini, P. (2021). Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. arXiv preprint arXiv:2101.09688.",
                "Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. (2021). Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.",
                "Nadeem, M., Bethke, A., and Reddy, S. (2020). Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.",
                "Nahian, M. S. A., Frazier, S., Harrison, B., and Riedl, M. (2021). Training value-aligned reinforcement learning agents using a normative prior. arXiv preprint arXiv:2104.09469.",
                "Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.",
                "Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.",
                "Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.",
                "Ngo, H., Raterink, C., Ara\u00fajo, J. G., Zhang, I., Chen, C., Morisot, A., and Frosst, N. (2021). Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790.",
                "Perez, E., Karamcheti, S., Fergus, R., Weston, J., Kiela, D., and Cho, K. (2019). Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863.",
                "Qian, Y., Muaz, U., Zhang, B., and Hyun, J. W. (2019). Reducing gender bias in word-level language models with a gender-equalizing loss function. arXiv preprint arXiv:1905.12801.",
                "Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.",
                "Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.",
                "Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.",
                "Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. (2018). Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. Association for Computational Linguistics.",
                "Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.",
                "Schick, T., Udupa, S., and Sch\u00fctze, H. (2021). Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453.",
                "Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR).",
                "Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.",
                "Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.",
                "Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.",
                "Soares, N., Fallenstein, B., Armstrong, S., and Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.",
                "Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631-1642.",
                "Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., et al. (2019). Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203.",
                "Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (palms) with values-targeted datasets. arXiv preprint arXiv:2106.10328.",
                "Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.",
                "Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.",
                "Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.",
                "Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. M. (2020). Investigating gender bias in language models using causal mediation analysis. In NeurIPS.",
                "V\u00f6lske, M., Potthast, M., Syed, S., and Stein, B. (2017). Tl; dr: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59-63.",
                "Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.",
                "Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.",
                "Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.",
                "Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. (2021). Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445.",
                "Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. (2021). Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.",
                "Xu, A., Pathak, E., Wallace, E., Gururangan, S., Sap, M., and Klein, D. (2021). Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390.",
                "Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.",
                "Yi, S., Goel, R., Khatri, C., Cervone, A., Chung, T., Hedayatnia, B., Venkatesh, A., Gabriel, R., and Hakkani-Tur, D. (2019). Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015.",
                "Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? In Association for Computational Linguistics, pages 4791-4800.",
                "Zhao, M., Anderson, P., Jain, V., Wang, S., Ku, A., Baldridge, J., and Ie, E. (2021). On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504.",
                "Zhou, W. and Xu, K. (2020). Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058.",
                "Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593."
            ],
            "abstract": "Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.",
            "date": 2021,
            "title": "Training language models to follow instructions with human feedback"
        },
        "topic": "ChatGPT"
    },
    {
        "source_paper": {
            "arxiv_id": "2210.13669",
            "isAPA": true,
            "related work": "Collaborative Writing\nThe key challenge in collaborative writing is to understand user intent so as to provide timely and useful suggestions. Prior work in story writing Roemmele and Gordon (2015); Clark et al. (2018) presented sentence-level continuations at locations specified by a user. Akoury et al. (2020); Lee et al. (2022) took this a step further providing users with a paragraph of text which they could further edit in story writing and argumentative writing tasks. However, model suggestions of this autocomplete nature were not always helpful, as they often diverged from the user intent Clark et al. (2018) resulting in only a fraction of generated text being retained Akoury et al. (2020). Instead of providing a machine-written draft, Padmakumar and He (2022) showed that having the model rewrite text only at locations specified by the user results in more helpful suggestions in the task of creative image captioning. We focus on the task of collaborative poem writing, which adds an additional challenge as useful suggestions need to satisfy several lexical and form constraints (rhyme, meter, sound). Past work for this task has used retrieval to provide suggestions for substitutions at the word and phrase level Chen et al. (2014) or verses that follow different styles Uthus et al. (2022), but these are unable to dynamically generate novel text. In our work, we utilize large language models to generate text that satisfies the various constraints specified by users, with the added benefit that they can spell out these using natural language instructions.\nConcurrent work has also shown that large language models can help users write scripts and screenplays Mirowski et al. (2022) and longer stories Yang et al. 
(2022) by generating text that incorporates structural context via prompt chaining.\nInteraction with Users\nRecent work in NLP has highlighted the success of generative large language models as interaction interfaces for the task of creative writing. Finetuning models on tasks verbalised as instructions has shown good generalization to unseen instructions Wei et al. (2021); Sanh et al. (2021); Mishra et al. (2021); Chung et al. (2022). In our work, we focus on a suite of instructions specific to creative writing and additionally evaluate the instruction-tuning setup with real users who iteratively ask for suggestions in natural language. In addition to fine-tuning models on instructions, large language models are also able to generalize to unseen tasks in a few-shot manner when the task is specified as part of the prompt in natural language Ouyang et al. (2022). Reif et al. (2022) present a prompting method which performs style transfer in a zero-shot or few-shot manner with only a natural language instruction describing the target style without model fine-tuning or exemplars in the target style.\nUnlike most of the recent work that prompts large language models to elicit content, Coenen et al. (2021) frame collaborative writing as a conversation between a human and an LLM-based dialog system and show how the spontaneous utilities of conversation support a variety of interactions.\nMore recently, Mishra and Nouri (2022) propose a prompting strategy where they ask GPT3 specific questions about mood, tone, occasion, or theme for the task of poem generation by using GPT3 as an interaction interface.",
            "reference": [
                "Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6470-6484, Online. Association for Computational Linguistics.",
                "Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. 2021. Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952.",
                "Gwern Branwen. 2020. Gpt-3 creative fiction.",
                "Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.",
                "Ricardo Campos, V\u00edtor Mangaravite, Arian Pasquali, Al\u00edpio M\u00e1rio Jorge, C\u00e9lia Nunes, and Adam Jatowt. 2018. Yake! collection-independent automatic keyword extractor. In European Conference on Information Retrieval, pages 806-810. Springer.",
                "Ricardo Campos, V\u00edtor Mangaravite, Arian Pasquali, Al\u00edpio Jorge, C\u00e9lia Nunes, and Adam Jatowt. 2020. Yake! keyword extraction from single documents using multiple local features. Information Sciences, 509:257 - 289.",
                "Tuhin Chakrabarty, Arkadiy Saakyan, and Smaranda Muresan. 2021. Don't go far off: An empirical study on neural poetry translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7253-7265, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.",
                "Quanze Chen, Chenyang Lei, Wei Xu, Ellie Pavlick, and Chris Callison-Burch. 2014. Poetry of the crowd: A human computation algorithm to convert prose into rhyming verse. In HCOMP.",
                "Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models.",
                "Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A. Smith. 2018. Creative writing with a machine in the loop: Case studies on slogans and stories. In 23rd International Conference on Intelligent User Interfaces, IUI '18, page 329-340, New York, NY, USA. Association for Computing Machinery.",
                "Andy Coenen, Luke Davis, Daphne Ippolito, Emily Reif, and Ann Yuan. 2021. Wordcraft: a human-ai collaborative editor for story writing. arXiv preprint arXiv:2107.07430.",
                "Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2492-2501, Online. Association for Computational Linguistics.",
                "Wanyu Du, Zae Myung Kim, Vipul Raheja, Dhruv Kumar, and Dongyeop Kang. 2022. Read, revise, repeat: A system demonstration for human-in-the-loop iterative text revision. In Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), pages 96-108, Dublin, Ireland. Association for Computational Linguistics.",
                "Katherine Elkins and Jon Chun. 2020. Can gpt-3 pass a writer's turing test? Journal of Cultural Analytics, 5(2):17212.",
                "Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889-898, Melbourne, Australia. Association for Computational Linguistics.",
                "Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. 2016. Generating topical poetry. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1183-1191, Austin, Texas. Association for Computational Linguistics.",
                "Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR).",
                "Arthur M Jacobs. 2018. The gutenberg english poetry corpus: exemplary quantitative narrative analyses. Frontiers in Digital Humanities, 5:5.",
                "Mina Lee, Percy Liang, and Qian Yang. 2022. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22, New York, NY, USA. Association for Computing Machinery.",
                "Chin-yew Lin and Marina Rey. 2004. Looking for a few good metrics: ROUGE and its evaluation. In NTCIR Workshop.",
                "Piotr Mirowski, Kory W Mathewson, Jaylen Pittman, and Richard Evans. 2022. Co-writing screenplays and theatre scripts with language models: An evaluation by industry professionals. arXiv preprint arXiv:2209.14958.",
                "Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.",
                "Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470-3487, Dublin, Ireland. Association for Computational Linguistics.",
                "Swaroop Mishra and Elnaz Nouri. 2022. Help me think: A simple prompting strategy for non-experts to create customized content with models. arXiv preprint arXiv:2208.08232.",
                "Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, and Eneko Agirre. 2022. Poelm: A meter-and rhyme-controllable language model for unsupervised poetry generation. arXiv preprint arXiv:2205.12206.",
                "Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Preprint.",
                "Vishakh Padmakumar and He He. 2022. Machine-in-the-loop rewriting for creative image captioning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 573-586, Seattle, United States. Association for Computational Linguistics.",
                "Vishakh Padmakumar, Leonard Lausen, Miguel Ballesteros, Sheng Zha, He He, and George Karypis. 2022. Exploring the role of task transferability in large-scale multi-task learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2542-2550, Seattle, United States. Association for Computational Linguistics.",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.",
                "Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.",
                "Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. KDD '20, page 3505-3506, New York, NY, USA. Association for Computing Machinery.",
                "Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. A recipe for arbitrary text style transfer with large language models. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 837-848, Dublin, Ireland. Association for Computational Linguistics.",
                "Melissa Roemmele and Andrew S. Gordon. 2015. Creative help: A story writing assistant. InICIDS.",
                "Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. arXiv.",
                "Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do massively pretrained language models make better storytellers? InProceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 843-861, Hong Kong, China. Association for Computational Linguistics.",
                "Ben Swanson, Kory Mathewson, Ben Pietrzak, Sherol Chen, and Monica Dinalescu. 2021. Story centaur: Large language model few shot learning as a creative writing tool. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 244-256.",
                "Yufei Tian and Nanyun Peng. 2022. Zero-shot sonnet generation with discourse-level planning and aesthetics features. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3587-3597, Seattle, United States. Association for Computational Linguistics.",
                "David Uthus, Maria Voitovich, and R.J. Mical. 2022. Augmenting poetry composition with Verse by Verse. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, pages 18-26, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.",
                "David Uthus, Maria Voitovich, RJ Mical, and Ray Kurzweil. 2019. First steps towards collaborative poetry generation.",
                "Tim Van de Cruys. 2020. Automatic poetry generation from prosaic text. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 2471-2480.",
                "Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.",
                "Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners. arXiv.",
                "Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.",
                "Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R'emi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.",
                "Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. arXiv preprint arXiv:2210.06774."
            ],
            "abstract": "Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success of LLMs in the realm of computer-assisted creativity, we aim to study if LLMs can improve the quality of user-generated content through collaboration. We present CoPoet, a collaborative poetry writing system. In contrast to auto-completing a user's text, CoPoet is controlled by user instructions that specify the attributes of the desired text, such as Write a sentence about `love' or Write a sentence ending in `fly'. The core component of our system is a language model fine-tuned on a diverse collection of instructions for poetry writing. Our model is not only competitive with publicly available LLMs trained on instructions (InstructGPT), but is also capable of satisfying unseen compositional instructions. A study with 15 qualified crowdworkers shows that users successfully write poems with CoPoet on diverse topics ranging from Monarchy to Climate change. Further, the collaboratively written poems are preferred by third-party evaluators over those written without the system.",
            "date": 2021,
            "title": "Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing"
        },
        "topic": "Instruction Tuning for LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2305.07402",
            "isAPA": true,
            "related work": "Dense Retrieval Document retrieval has been an important component for several knowledge-intensive tasks(Voorhees et al.,1999; Karpukhin et al.,2020). Traditional techniques such as TF-IDF and BM25 depend on term matching and create sparse vectors(Robertson,2009; Yang et al.,2017; Chen et al.,2017)to ensure efficient retrieval. After the emergence of pre-trained language models(Devlin et al.,2019; Liu et al.,2019), dense retrieval which encodes both queries and documents into low-dimension vectors and then calculates their relevance scores(Lee et al.,2019; Karpukhin et al.,2020), has recently undergone substantial research. Relevant studies include improving training approach(Karpukhin et al.,2020; Xiong et al.,2021; Qu et al.,2021), distillation(Lin et al.,2021; Hofst\u00e4tter et al.,2021)and task-specific pre-training(Izacard et al.,2022; Gao & Callan,2021; Lu et al.,2021; Gao & Callan,2022; Xiao et al.,2022)of dense retrieval models which significantly outperform sparse approaches. Zero-shot Dense Retrieval Many prior works consider training dense retrieval models on high-resource passage retrieval datasets like Natural Questions (NQ)(Kwiatkowski et al.,2019)(133k training examples) or MS-MARCO(Bajaj et al.,2016)(533k training examples) and then evaluating on queries from new tasks. These systems(Wang et al.,2022; Yu et al.,2022)are utilized in a transfer learning configuration(Thakur et al.,2021). However, on the one hand, it is time-consuming and expensive to collect such a vast training corpus. On the other hand, even MS-MARCO has limitations on commercial use and cannot be used in a wide range of real-world applications. 
To this end, recent work (Gao et al., 2023) proposes building zero-shot dense retrieval systems that require no relevance supervision (i.e., no relevance labels between query-document pairs). This setting is considered \u201cunsupervised\u201d because the only supervision resides in the LLM, which learned to follow instructions beforehand (Sachan et al., 2022). In this work, we follow this zero-shot unsupervised setting and conduct information refinement through synergy between RMs and LLMs without any relevance supervision to handle the aforementioned issues. Enhance Retrieval Through LMs. Recent works have investigated using auto-regressive language models to generate intermediate targets for better retrieval (Cao et al., 2021; Bevilacqua et al., 2022), although identifier strings still need to be created.\nOther works consider \u201cretrieving\u201d the knowledge stored in the parameters of pre-trained language models by directly generating text (Petroni et al., 2019; Roberts et al., 2020). Some researchers (Mao et al., 2021; Anantha et al., 2021; Wang et al., 2023) utilize LMs to expand the query and incorporate these pseudo-queries for enhanced retrieval, while others choose to expand the document (Nogueira et al., 2019). Besides, LMs can also be exploited to provide references for retrieval targets. For instance, GENREAD (Yu et al., 2023) directly generates contextual documents for given questions. Enhance LMs Through Retrieval. Conversely, retrieval-enhanced LMs have also received significant attention. Some approaches enhance the accuracy of predicting the distribution of the next word during training (Borgeaud et al., 2022) or inference (Khandelwal et al., 2020) by retrieving the k most similar training contexts. Alternative methods utilize retrieved documents to provide supplementary context in generation tasks (Joshi et al., 2020; Guu et al., 2020; Lewis et al., 2020). 
WebGPT (Nakano et al., 2021) further adopts imitation learning and uses human feedback in a text-based web-browsing environment to enhance the LMs. LLM-Augmentor (Peng et al., 2023) improves large language models with external knowledge and automated feedback. REPLUG (Shi et al., 2023) prepends retrieved documents to the input for the frozen LM and treats the LM as a black box. Demonstrate-Search-Predict (DSP) (Khattab et al., 2022), which is most closely related to our approach, obtains performance gains by passing natural language texts in sophisticated pipelines between a language model and a retrieval model. However, DSP relies on composing the two parts with in-context learning and targets multi-hop question answering, whereas we aim at conducting information refinement via multiple interactions between RMs and LLMs for large-scale retrieval.",
            "reference": [
                "Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. Open-domain question answering goes conversational via question rewriting. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.),Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  520-534, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.44. URLhttps://aclanthology.org/2021.naacl-main.44.",
                "Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.",
                "Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems, 35:31668-31683, 2022.",
                "Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInternational conference on machine learning, pp.  2206-2240. PMLR, 2022.",
                "Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=5k8F6UU39V.",
                "Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1870-1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URLhttps://aclanthology.org/P17-1171.",
                "Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URLhttps://lmsys.org/blog/2023-03-30-vicuna/.",
                "Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. Overview of the trec 2019 deep learning track. arXiv preprint arXiv:2003.07820, 2020.",
                "Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. Overview of the trec 2020 deep learning track. arXiv preprint arXiv:2102.07662, 2021.",
                "Zhuyun Dai and Jamie Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687, 2019.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423. URLhttps://aclanthology.org/N19-1423.",
                "Luyu Gao and Jamie Callan. Condenser: a pre-training architecture for dense retrieval. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  981-993, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-main.75. URLhttps://aclanthology.org/2021.emnlp-main.75.",
                "Luyu Gao and Jamie Callan. Unsupervised corpus aware language model pre-training for dense passage retrieval. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2843-2853, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.203. URLhttps://aclanthology.org/2022.acl-long.203.",
                "Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1762-1777, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.acl-long.99. URLhttps://aclanthology.org/2023.acl-long.99.",
                "Manas Gaur, Kalpa Gunaratna, Vijay Srinivasan, and Hongxia Jin. Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  10672-10680, 2022.",
                "Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. Learning dense representations for entity retrieval. InProceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp.  528-537, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/K19-1049. URLhttps://aclanthology.org/K19-1049.",
                "Google. Google bard. https://bard.google.com/, 2023. URLhttps://bard.google.com/.",
                "Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pp.  3929-3938. PMLR, 2020.",
                "Sebastian Hofst\u00e4tter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. Efficiently teaching an effective dense retriever with balanced topic aware sampling. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.  113-122, 2021.",
                "Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URLhttps://openreview.net/forum?id=jKN1pXi7b0.",
                "Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1-38, 2023.",
                "Jeff Johnson, Matthijs Douze, and Herv\u00e9 J\u00e9gou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535-547, 2019.",
                "Mandar Joshi, Kenton Lee, Yi Luan, and Kristina Toutanova. Contextualized representations using textual encyclopedic knowledge. arXiv preprint arXiv:2004.12006, 2020.",
                "Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6769-6781, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.550. URLhttps://aclanthology.org/2020.emnlp-main.550.",
                "Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=HklBjCEKvH.",
                "Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022.",
                "Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452-466, 2019. doi:10.1162/tacl_a_00276. URLhttps://aclanthology.org/Q19-1026.",
                "Yunshi Lan, Gaole He, Jinhao Jiang, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. A survey on complex knowledge base question answering: Methods, challenges and solutions. In Zhi-Hua Zhou (ed.),Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp.  4483-4491. International Joint Conferences on Artificial Intelligence Organization, 8 2021. doi:10.24963/ijcai.2021/611. URLhttps://doi.org/10.24963/ijcai.2021/611. Survey Track.",
                "Hyunji Lee, Sohee Yang, Hanseok Oh, and Minjoon Seo. Generative multi-hop retrieval. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  1417-1436, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URLhttps://aclanthology.org/2022.emnlp-main.92.",
                "Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  6086-6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1612. URLhttps://aclanthology.org/P19-1612.",
                "Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459-9474, 2020.",
                "Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. InProceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pp.  163-173, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.repl4nlp-1.17. URLhttps://aclanthology.org/2021.repl4nlp-1.17.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URLhttp://arxiv.org/abs/1907.11692.",
                "Shuqi Lu, Di He, Chenyan Xiong, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tie-Yan Liu, and Arnold Overwijk. Less is more: Pretrain a strong Siamese encoder for dense text retrieval using a weak decoder. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  2780-2791, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-main.220. URLhttps://aclanthology.org/2021.emnlp-main.220.",
                "Kelong Mao, Zhicheng Dou, Haonan Chen, Fengran Mo, and Hongjin Qian. Large language models know your contextual search intent: A prompting framework for conversational search. arXiv preprint arXiv:2303.06573, 2023.",
                "Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  4089-4100, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.316. URLhttps://aclanthology.org/2021.acl-long.316.",
                "Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2791-2809, Seattle, United States, July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.201. URLhttps://aclanthology.org/2022.naacl-main.201.",
                "IC Mogotsi. Christopher d. manning, prabhakar raghavan, and hinrich sch\u00fctze: Introduction to information retrieval. Information Retrieval, 13(2):192-195, 2010.",
                "Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.",
                "Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  9844-9855, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URLhttps://aclanthology.org/2022.emnlp-main.669.",
                "Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. arXiv preprint arXiv:1904.08375, 2019.",
                "OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022. URLhttps://openai.com/blog/chatgpt/.",
                "OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.",
                "Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.",
                "Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.",
                "Fabio Petroni, Tim Rockt\u00e4schel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2463-2473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-1250. URLhttps://aclanthology.org/D19-1250.",
                "Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476, 2023.",
                "Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  5835-5847, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.466. URLhttps://aclanthology.org/2021.naacl-main.466.",
                "Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  5418-5426, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.437. URLhttps://aclanthology.org/2020.emnlp-main.437.",
                "Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333-389, 2009. doi:10.1561/1500000019. URLhttps://doi.org/10.1561/1500000019.",
                "Zaragoza Robertson. Robertson s., zaragoza h. The probabilistic relevance framework: Bm25 and beyond, Found. Trends Inf. Retr, 3(4):333-389, 2009.",
                "Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. Improving passage retrieval with zero-shot question generation. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  3781-3797, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URLhttps://aclanthology.org/2022.emnlp-main.249.",
                "Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. InInternational Conference on Learning Representations, 2022.",
                "Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023.",
                "Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation. InFindings of the Association for Computational Linguistics: EMNLP 2021, pp.  3784-3803, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.findings-emnlp.320. URLhttps://aclanthology.org/2021.findings-emnlp.320.",
                "Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35:21831-21843, 2022.",
                "Nandan Thakur, Nils Reimers, Andreas R\u00fcckl\u00e9, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. CoRR, abs/2104.08663, 2021. URLhttps://arxiv.org/abs/2104.08663.",
                "James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. The fact extraction and VERification (FEVER) shared task. InProceedings of the First Workshop on Fact Extraction and VERification (FEVER), pp.  1-9, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi:10.18653/v1/W18-5501. URLhttps://aclanthology.org/W18-5501.",
                "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.",
                "Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.",
                "Ellen M Voorhees et al. The trec-8 question answering track report. InTrec, volume 99, pp.  77-82, 1999.",
                "Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2345-2360, Seattle, United States, July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.168. URLhttps://aclanthology.org/2022.naacl-main.168.",
                "Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023.",
                "Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.",
                "Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 538-548, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.35.",
                "Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=zeFrfgyZln.",
                "Peilin Yang, Hui Fang, and Jimmy Lin. Anserini: Enabling the use of lucene for information retrieval research. In Noriko Kando, Tetsuya Sakai, Hideo Joho, Hang Li, Arjen P. de Vries, and Ryen W. White (eds.), Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pp. 1253-1256. ACM, 2017. doi:10.1145/3077136.3080721. URL https://doi.org/10.1145/3077136.3080721.",
                "Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials, pp. 1-4, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-tutorials.1. URL https://aclanthology.org/2021.naacl-tutorials.1.",
                "Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fB0hRu9GZUS.",
                "Yue Yu, Chenyan Xiong, Si Sun, Chao Zhang, and Arnold Overwijk. COCO-DR: Combating the distribution shift in zero-shot dense retrieval with contrastive and distributionally robust learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1462-1479, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.95.",
                "Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023a.",
                "Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023b.",
                "Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Lan Luo, Ke Zhan, Enrui Hu, Xinyu Zhang, Hao Jiang, Zhao Cao, Fan Yu, Xin Jiang, Qun Liu, and Lei Chen. Hyperlink-induced pre-training for passage retrieval in open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7135-7146, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.493. URL https://aclanthology.org/2022.acl-long.493."
            ],
            "abstract": "Information retrieval (IR) plays a crucial role in locating relevant resources from vast amounts of data, and its applications have evolved from traditional knowledge bases to modern retrieval models (RMs). The emergence of large language models (LLMs) has further revolutionized the IR field by enabling users to interact with search systems in natural languages. In this paper, we explore the advantages and disadvantages of LLMs and RMs, highlighting their respective strengths in understanding user-issued queries and retrieving up-to-date information. To leverage the benefits of both paradigms while circumventing their limitations, we propose InteR, a novel framework that facilitates information refinement through synergy between RMs and LLMs. InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections and enables LLMs to enhance prompt formulation using retrieved documents. This iterative refinement process augments the inputs of RMs and LLMs, leading to more accurate retrieval. Experiments on large-scale retrieval benchmarks involving web search and low-resource retrieval tasks demonstrate that InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods, even those using relevance judgment. Source code is available at this https URL",
            "date": 2021,
            "title": "Synergistic Interplay between Search and Large Language Models for Information Retrieval"
        },
        "topic": "LLMs for Information Retrieval"
    },
    {
        "source_paper": {
            "arxiv_id": "2308.06463",
            "isAPA": true,
            "related work": "Safety Alignment for LLMs. Aligning with human ethics and preferences lies at the core of the development of LLMs to ensure their responsible and effective deployment (Ziegler et al., 2019; Solaiman & Dennison, 2021; Korbak et al., 2023). Accordingly, OpenAI devoted six months to ensure its safety through RLHF and other safety mitigation methods prior to deploying their pre-trained GPT-4 model (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a; OpenAI, 2023b). In addition, OpenAI is assembling a new SuperAlignment team to ensure AI systems much smarter than humans (i.e. Superintelligence) follow human intent (OpenAI, 2023c; Bowman et al., 2022; Irving et al., 2018; Christiano et al., 2018). In this study, we validate the effectiveness of our approach on the SOTA GPT-4 model, and show that chat in cipher enables evasion of safety alignment (\u00a7 4.3).\nIn the academic community, Dai et al. (2023b) release a highly modular open-source RLHF framework - Beaver, which provides training data and a reproducible code pipeline to facilitate alignment research. Zhou et al. (2024) suggest that almost all knowledge in LLMs is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high-quality output. Our results reconfirm these findings: simulated ciphers that never occur in pretraining data cannot work (\u00a74.4). In addition, our study indicates that the high-quality instruction data should contain samples beyond natural languages (e.g. ciphers) for better safety alignment.\nThere has been an increasing amount of work on aligning LLMs more effectively and efficiently (Zheng et al., 2024; Xu et al., 2024; Ji et al., 2024; Zhang et al., 2023). For example, Bai et al. (2022b) develop a method Constitutional AI to encode desirable AI behavior in a simple and transparent form, which can control AI behavior more precisely and with far fewer human labels.  
Sun et al. (2024) propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Dong et al. (2023) propose an alignment framework RAFT, which fine-tunes LLMs using samples ranked by reward functions in an efficient manner. Our work shows that chat in cipher can serve as a test bed to assess the effectiveness of these advanced methods.\nAdversarial Attack on LLMs. While safety alignment for LLMs can help, LLMs remain vulnerable to adversarial inputs that can elicit undesired behavior (Gehman et al., 2020; Bommasani et al., 2021; walkerspider, 2022; Perez et al., 2022; Perez & Ribeiro, 2022; Kang et al., 2023; Li et al., 2023; Ganguli et al., 2022; Schulhoff et al., 2023; OpenAI, 2023b; Jones et al., 2023; Zou et al., 2023; Huang et al., 2024; Zeng et al., 2024; Yu et al., 2023; Liu et al., 2024; Wang et al., 2023a; Deng et al., 2024). Recently, Wei et al. (2024) provide a systematic analysis of the jailbreak attack and hypothesize two failure modes of safety alignment: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. Our study confirms and extends their findings in mismatched generalization with comprehensive experiments and insightful analyses: the safety training in natural language fails to generalize to the domain of cipher, for which the capability of GPT-4 exists. In addition, our study also reveals that LLMs have their secret \u201cciphers\u201d to generate unsafe responses via only role play with demonstrations (without real encipher).",
            "reference": [
                "Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of language models: towards open frontier models. 2023.",
                "Anthropic. Model card and evaluations for claude models, https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf, 2023.",
                "Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.",
                "Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.",
                "Boaz Barak. Another jailbreak for GPT4: Talk to it in morse code, https://twitter.com/boazbaraktcs/status/1637657623100096513, 2023.",
                "Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gT5hALch9z.",
                "Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.",
                "Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil\u0117 Luko\u0161i\u016bt\u0117, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.",
                "S\u00e9bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.",
                "Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt's behavior changing over time? CoRR, abs/2307.09009, 2023. doi: 10.48550/arXiv.2307.09009. URL https://doi.org/10.48550/arXiv.2307.09009.",
                "David Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), ACL 2023, pp. 15607-15631, 2023. URL https://aclanthology.org/2023.acl-long.870.",
                "Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.",
                "Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018.",
                "Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. NeurIPS, 30, 2017.",
                "Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of ACL, pp. 4005-4019, 2023a. URL https://aclanthology.org/2023.findings-acl.247.",
                "Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.",
                "Juntao Dai, Jiaming Ji, Xuehai Pan, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Constrained value-aligned LLM via safe RLHF, https://pku-beaver.github.io/, 2023b.",
                "Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vESNKdEMGp.",
                "Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.",
                "Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.",
                "Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.",
                "Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of EMNLP, pp. 3356-3369, 2020. URL https://doi.org/10.18653/v1/2020.findings-emnlp.301.",
                "Google. Bard, https://bard.google.com/, 2023.",
                "Divij Handa, Advait Chirmule, Bimal Gajera, and Chitta Baral. Jailbreaking proprietary large language models using word substitution cipher. arXiv preprint arXiv:2402.10601, 2024.",
                "Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=r42tSSCHPh.",
                "Geoffrey Irving, Paul Christiano, and Dario Amodei. Ai safety via debate. arXiv preprint arXiv:1805.00899, 2018.",
                "Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, and Yaodong Yang. Aligner: Achieving efficient alignment through weak-to-strong correction. arXiv preprint arXiv:2402.02416, 2024.",
                "Wenxiang Jiao, Wenxuan Wang, JT Huang, Xing Wang, and ZP Tu. Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745, 2023.",
                "Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 15307-15329. PMLR, 2023. URL https://proceedings.mlr.press/v202/jones23a.html.",
                "Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.",
                "Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html.",
                "Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In ICLR, pp. 17506-17533. PMLR, 2023.",
                "Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020. URL https://openreview.net/forum?id=H1eA7AEtvS.",
                "Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp. 4138-4153. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-emnlp.272.",
                "Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7Jwpw4qKkb.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.",
                "OpenAI. ChatGPT, https://openai.com/chatgpt, 2023a.",
                "OpenAI. GPT-4 technical report, https://cdn.openai.com/papers/gpt-4.pdf, 2023b.",
                "OpenAI. Introducing superalignment to ensure AI systems much smarter than humans follow human intent, https://openai.com/blog/introducing-superalignment, 2023c.",
                "Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730-27744, 2022.",
                "Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1-22, 2023.",
                "Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In EMNLP, pp. 3419-3448, 2022.",
                "F\u00e1bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop, 2022.",
                "Timo Schick, Jane Dwivedi-Yu, Roberto Dess\u00ec, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.",
                "Sander V Schulhoff, Jeremy Pinto, Anaum Khan, Louis-Fran\u00e7ois Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher R Carnahan, and Jordan Lee Boyd-Graber. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.",
                "Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. NeurIPS, 34:5861-5873, 2021.",
                "Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. NeurIPS, 33:3008-3021, 2020.",
                "Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023.",
                "Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.",
                "Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.",
                "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.",
                "Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.",
                "walkerspider. DAN is my new friend., https://old.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/, 2022.",
                "Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. NeurIPS, 35:35811-35824, 2022.",
                "Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael R Lyu. All languages matter: On the multilingual safety of large language models. arXiv preprint arXiv:2310.00905, 2023a.",
                "Yimu Wang, Peng Shi, and Hongyang Zhang. Investigating the existence of \u201csecret language\u201d in language models. arXiv preprint arXiv:2307.12507, 2023b.",
                "Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824-24837, 2022.",
                "Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.",
                "Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. In Findings of EMNLP, pp. 2447-2469, 2021. URL https://aclanthology.org/2021.findings-emnlp.210.",
                "Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079, 2020.",
                "Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. Safedecoding: Defending against jailbreak attacks via safety-aware decoding. arXiv preprint arXiv:2402.08983, 2024.",
                "Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.",
                "Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024.",
                "Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096, 2023.",
                "Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. Prompt-driven llm safeguarding via directed representation optimization. arXiv preprint arXiv:2401.18018, 2024.",
                "Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.",
                "Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.",
                "Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023."
            ],
            "abstract": "Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a \"secret cipher\", and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at this https URL.",
            "date": 2021,
            "title": "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher"
        },
        "topic": "Safety in LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2210.03493",
            "isAPA": true,
            "related work": "This section reviews two lines of research that form the basis of this work: chain-of-thought (CoT) prompting for multi-step reasoning and in-context learning for inducing LLMs to learn from demonstrations.\n2.1 Chain-of-thought Prompting\nCoT prompting is a gradient-free technique of inducing LLMs to produce intermediate reasoning steps that lead to the final answer. Wei et al. (2022a) formally studied the topic of CoT prompting in language models. This technique elicits LLMs to generate a coherent series of intermediate reasoning steps that lead to the final answer to a question. Studies have shown that LLMs can perform CoT reasoning with zero-shot prompting (Zero-Shot-CoT) (Kojima et al., 2022) or manually written few-shot demonstrations (Manual-CoT) (Wei et al., 2022a).\nZero-Shot-CoT. Kojima et al. (2022) showed that LLMs are decent zero-shot reasoners whose generated rationales have already reflected the CoT reasoning. This finding inspires our work to leverage the self-generated rationales for demonstrations. Generating rationales by LLMs was shown to be practical in a recent work (Zelikman et al., 2022). In their work, an LLM is prompted to generate rationales and those rationales that lead to the correct answer are selected. The selection requires a training dataset of questions with annotated answers. In contrast, our work considers a more challenging scenario where only a set of test questions are given (without a training dataset), following CoT prompting studies by Wei et al. (2022a) and Kojima et al. (2022).\nManual-CoT. Manual-CoT achieves stronger performance by eliciting the CoT reasoning ability with effective manual demonstrations. The demonstrations for the reasoning process are manually designed. However, the human efforts in designs of both questions and their reasoning chains are nontrivial. 
Instead of addressing this limitation, recent studies mainly focus on hand-crafting more complex demonstrations or leveraging ensemble-like methods. One trend is problem decomposition. In least-to-most prompting (Zhou et al., 2022), complex problems are reduced to sub-problems, and then the sub-problems are solved sequentially. The other trend is to vote over multiple reasoning paths for a test question. Wang et al. (2022a) introduced a self-consistency decoding strategy to sample multiple outputs of LLMs and then took a majority vote over the final answers. Wang et al. (2022b) and Li et al. (2022) introduced randomness in the input space to produce more diverse outputs for voting. They used manually-designed demonstrations as the seed set and generated additional rationales: leave one question from the seed set and use the remaining demonstrations to generate rationales for this question by the LLM. Unlike the aforementioned research lines that rely on manually-designed demonstrations, our work intends to eliminate manual designs with competitive performance.\n2.2 In-Context Learning\nCoT prompting is closely related to in-context learning (ICL) (Radford et al., 2019; Brown et al., 2020). ICL enables LLMs to perform a target task by feeding a few prompted examples as part of the input. Without gradient update, ICL allows a single model to perform various tasks universally. 
There are various research lines to improve the performance of ICL: (i) retrieving related demonstrations to the test instance where the popular practice is dynamically retrieving related training examples for a given test input (Rubin et al., 2022; Su et al., 2022); (ii) augmenting with fine-grained information, such as incorporating task instruction (Mishra et al., 2022; Wei et al., 2022b; Sanh et al., 2022); (iii) manipulating output probabilities of LLMs instead of directly computing the likelihood of target labels (Holtzman et al., 2021; Zhao et al., 2021; Min et al., 2022a).\nDespite the success of ICL, studies (Liu et al., 2022a; Lu et al., 2022) have shown that the strength of ICL may vary widely depending on the choice of in-context demonstrations (Liu et al., 2022b). In detail, the formatting of the prompt, such as wording or order of demonstrations, may lead to performance fluctuations (Webson and Pavlick, 2022; Zhao et al., 2021). A recent work (Min et al., 2022b) even questioned the necessity of ground-truth input-output mapping: using incorrect labels in the examples only marginally lowers the performance. However, the existing analysis of ICL is mainly based on standard classification and multi-choice datasets that only have simple <input\u2192output> mappings. We discover that those findings may not be applicable to the CoT prompting scenario with more complex <input\u2192rationale\u2192output> mappings. For example, mistakes in either the <input\u2192rationale> mapping or the <rationale\u2192output> mapping will lead to a dramatic performance drop (Appendix A.1).",
            "reference": [
                "Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.",
                "Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications, 2022. URLhttps://arxiv.org/abs/2201.08239.",
                "Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021. URLhttps://arxiv.org/abs/2112.11446.",
                "Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. URLhttps://arxiv.org/abs/2204.02311.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. InThirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), 2022a. URLhttps://arxiv.org/abs/2201.11903.",
                "Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InThirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), 2022. URLhttps://arxiv.org/abs/2205.11916.",
                "Subhro Roy and Dan Roth. Solving general arithmetic word problems. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743-1752, Lisbon, Portugal, 2015. Association for Computational Linguistics. doi:10.18653/v1/D15-1202. URLhttps://aclanthology.org/D15-1202.",
                "Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149-4158, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1421. URLhttps://aclanthology.org/N19-1421.",
                "Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv.org/abs/2110.14168.",
                "Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158-167, Vancouver, Canada, 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1015. URLhttps://aclanthology.org/P17-1015.",
                "Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-2094, Online, 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.168. URLhttps://aclanthology.org/2021.naacl-main.168.",
                "Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346-361, 2021. doi:10.1162/tacl_a_00370. URLhttps://doi.org/10.1162/tacl_a_00370.",
                "Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022. URLhttps://arxiv.org/abs/2203.14465.",
                "Denny Zhou, Nathanael Sch\u00e4rli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. URLhttps://arxiv.org/abs/2205.10625.",
                "Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022a. URLhttps://arxiv.org/abs/2203.11171.",
                "Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022b. URLhttps://arxiv.org/abs/2207.00747.",
                "Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022. URLhttps://arxiv.org/abs/2206.02336.",
                "Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, page 9, 2019.",
                "Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655-2671, 2022. doi:10.18653/v1/2022.naacl-main.191. URLhttps://aclanthology.org/2022.naacl-main.191.",
                "Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, et al. Selective annotation makes language models better few-shot learners. arXiv preprint arXiv:2209.01975, 2022. URLhttps://arxiv.org/abs/2209.01975.",
                "Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470-3487, 2022. doi:10.18653/v1/2022.acl-long.244. URLhttps://aclanthology.org/2022.acl-long.244.",
                "Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2022b. URLhttps://openreview.net/forum?id=gEZrGCozdqR.",
                "Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=9Vrb9D0WI4.",
                "Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn't always right. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038-7051, 2021. doi:10.18653/v1/2021.emnlp-main.564. URLhttps://aclanthology.org/2021.emnlp-main.564.",
                "Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. InInternational Conference on Machine Learning, pages 12697-12706, 2021. URLhttp://proceedings.mlr.press/v139/zhao21c/zhao21c.pdf.",
                "Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316-5330, 2022a. doi:10.18653/v1/2022.acl-long.365. URLhttps://aclanthology.org/2022.acl-long.365.",
                "Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? InProceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100-114, 2022a. doi:10.18653/v1/2022.deelio-1.10. URLhttps://aclanthology.org/2022.deelio-1.10.",
                "Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086-8098, 2022. doi:10.18653/v1/2022.acl-long.556. URLhttps://aclanthology.org/2022.acl-long.556.",
                "Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. arXiv preprint arXiv:2205.05638, 2022b. URLhttps://arxiv.org/abs/2205.05638.",
                "Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300-2344, Seattle, United States, 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.167. URLhttps://aclanthology.org/2022.naacl-main.167.",
                "Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022b. URLhttps://arxiv.org/abs/2202.12837.",
                "Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982-3992, Hong Kong, China, 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-1410. URLhttps://aclanthology.org/D19-1410.",
                "Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523-533, Doha, Qatar, 2014. Association for Computational Linguistics. doi:10.3115/v1/D14-1058. URLhttps://aclanthology.org/D14-1058.",
                "Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585-597, 2015. doi:10.1162/tacl_a_00160. URLhttps://aclanthology.org/Q15-1042.",
                "Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URLhttps://arxiv.org/abs/2203.02155.",
                "Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374.",
                "Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152-1157, San Diego, California, 2016. Association for Computational Linguistics. doi:10.18653/v1/N16-1136. URLhttps://aclanthology.org/N16-1136."
            ],
            "abstract": "Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like \"Let's think step by step\" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the \"Let's think step by step\" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available atthis https URL",
            "date": 2021,
            "title": "Automatic Chain of Thought Prompting in Large Language Models"
        },
        "topic": "Chain of Thought"
    },
    {
        "source_paper": {
            "arxiv_id": "2205.00584",
            "isAPA": true,
            "related work": "Our work is relevant to four broad strands of research on multi-armed bandits, search engines, language as an interface for interactive systems, and exploratory search and trails, which we review below.\nContextual bandits for recommendationMulti-armed bandits are a classical exploration-exploitation framework from Reinforcement Learning (RL), where the user feedback is available in each iteration (Parapar and Radlinski, 2021; Cortes, 2018; Li et al., 2010). They are becoming popular for online applications such as ranking online advertisements and recommendation systems (e.g., Ban and He, 2021; Joachims et al., 2020), where information about user preferences is unavailable (cold-start users (Bernardi et al., 2015; Kiseleva et al., 2016a)) (Fel\u00edcio et al., 2017). Parapar and Radlinski (2021) proposed a multi-armed bandit model for personalized recommendations by diversifying the user preferences by changing the focus only on past user interactions. Others examined the application of contextual bandit models in healthcare, finance, dynamic pricing, and anomaly detection (Bouneffouf and Rish, 2019). Our work adapts contextual bandits paradigm to the new problem of interactive intent modeling for complex information-seeking tasks.\nSearch enginesCommonly used search engines such as Google and Bing provide platforms focusing on the document retrieval process through search sessions (Hassan et al., 2010; Kiseleva et al., 2014, 2015; Ageev et al., 2011). Developing retrieval models that can extract the most relevant documents from an extensive collection has been well-studied (Croft et al., 2010) for decades. The developed retrieval models focus on retrieving the most relevant documents corresponding to user intent, represented with textual and contextual information within and across search sessions (Kotov et al., 2011). 
Although extracting relevant documents is necessary, it is not always sufficient, especially when the users have a complex information-seeking task (Ingwersen and J\u00e4rvelin, 2006).\nLanguage as an interface for interactions. NLU has been an important direction for human-computer interaction and information search for decades (Woods et al., 1972; Codd, 1974; Hendrix et al., 1978). The recent impressive advances in the capabilities of NLU (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; Adiwardana et al., 2020; Roller et al., 2020; Brown et al., 2020), powered by large-scale deep learning and increasing demand for new applications, have led to a major resurgence of natural language interfaces in the form of virtual assistants, dialog systems, semantic parsing, and question answering systems (Liu and Lane, 2017, 2018; Dinan et al., 2020; Zhang et al., 2019). The scope of natural language interfaces has been significantly expanding from databases (Copestake and Jones, 1990) to knowledge bases (Berant et al., 2013), robots (Tellex et al., 2011), virtual assistants (Kiseleva et al., 2016c, b), and other various forms of interaction (Fast et al., 2018; Desai et al., 2016; Young et al., 2013). Recently, the community has focused on continuous learning through interactions, including systems that learn a new task from instructions (Li et al., 2020a), assess their uncertainty (Yao et al., 2019), and ask for feedback from humans in case of uncertainty (Aliannejadi et al., 2021, 2020) or for correcting possible mistakes (Elgohary et al., 2020).\nExploratory search, tours, and trails. Exploratory search refers to an information-seeking process in which the system assists the searcher in understanding the information space for iterative exploration and retrieval of information (Ruotsalo et al., 2018; Hassan Awadallah et al., 2014; White et al., 2008). Anomalous states of knowledge (ASKs) (Belkin, 1980) motivate the need to search and drive demand for search systems. 
According to the ASK hypothesis, users often struggle to conceptualize and formulate their information needs as search queries, which may miss some essential information (Liu and Belkin, 2015; White and Roth, 2009). In such cases, the system should assist the user in specifying their intent (Marchionini, 2006). Through a search log analysis, Odijk et al. (2015) show that there are many searches where users may struggle to formulate their search query or may simply be exploring to learn about a new area. New search interface designs may be required to support searchers through their information-seeking process (Villa et al., 2009). Tours and trails are another group of tools that were developed to guide users in accomplishing search tasks. Guided tours are common in hypertext systems (Trigg, 1988), and similar ideas could be applied in the context of search (Hassan and White, 2012). Surfacing common trail destinations in search interfaces can help people find information targets more quickly (White et al., 2007). Search engines may also present full trails as a way to explore, learn, and complete multi-step tasks (Singla et al., 2010). Olston and Chi (2003) proposed ScentTrails, which leverages an interface that combines browsing and searching and highlights potentially relevant hyperlinks. WebWatcher (Joachims et al., 1997), like ScentTrails, underlined the relevant hyperlinks and improved the model based on the implicit feedback collected during previous tours.\nTo summarize, the key distinctions of our work compared to previous efforts are as follows. Similar to exploratory search, trails, and conversational search, our model proposes an iterative information-seeking process and designs an interface for user interactions to guide struggling users and help them better understand the information space. 
However, whereas that work focuses only on user interaction modeling and limits users to issuing short and imprecise queries and utterances, our model provides a platform for users to express their information needs in the form of long and complex requests. Users can utilize this capability to express their intent more accurately and prune significant parts of the search space for the exploratory search process. Adding this capability requires an advanced NLU step and different machine learning components to understand and guide the final user through the search process. To this end, the proposed system has two new components, an intent ontology and a profile for partitioning the information space, enabling the IA to help users be more effective in exploring the search space.",
            "reference": [
                "Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.",
                "Mikhail Ageev, Qi Guo, Dmitry Lagun, and Eugene Agichtein. Find it if you can: a game for modeling different types of web search success using interaction data. InSIGIR, 2011.",
                "Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Convai3: Generating clarifying questions for open-domain dialogue systems (clariq). 2020.",
                "Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeffrey Dalton, and Mikhail Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. arXiv preprint arXiv:2109.05794, 2021.",
                "Negar Arabzadeh, Fattaneh Zarrinkalam, Jelena Jovanovic, and Ebrahim Bagheri. Geometric estimation of specificity within embedding spaces. InProceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2109-2112, 2019.",
                "Negar Arabzadeh, Fattane Zarrinkalam, Jelena Jovanovic, Feras Al-Obeidat, and Ebrahim Bagheri. Neural embedding-based specificity metrics for pre-retrieval query performance prediction. Information Processing & Management, 57(4):102248, 2020a.",
                "Negar Arabzadeh, Fattane Zarrinkalam, Jelena Jovanovic, and Ebrahim Bagheri. Neural embedding-based metrics for pre-retrieval query performance prediction. Advances in Information Retrieval, 12036:78, 2020b.",
                "Negar Arabzadeh, Maryam Khodabakhsh, and Ebrahim Bagheri. Bert-qpp: Contextualized pre-trained transformers for query performance prediction. InProceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2857-2861, 2021.",
                "Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. InProceedings of the Web Conference 2021, pages 2335-2346, 2021.",
                "Andrea Barraza-Urbina and Dorota Glowacka. Introduction to bandits in recommender systems. InFourteenth ACM Conference on Recommender Systems, pages 748-750, 2020.",
                "Nicholas J Belkin. Anomalous states of knowledge as a basis for information retrieval. Canadian journal of information science, 5(1):133-143, 1980.",
                "Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. InProceedings of the 2013 conference on empirical methods in natural language processing, pages 1533-1544, 2013.",
                "Lucas Bernardi, Jaap Kamps, Julia Kiseleva, and Melanie JI M\u00fcller. The continuous cold start problem in e-commerce recommender systems. arXiv preprint arXiv:1508.01177, 2015.",
                "Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.",
                "Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.",
                "David Carmel and Oren Kurland. Query performance prediction for ir. InProceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 1196-1197, 2012.",
                "David Carmel and Elad Yom-Tov. Estimating the query difficulty for information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1):1-89, 2010.",
                "Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-C\u00e9spedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.",
                "Konstantina Christakopoulou. Towards Recommendation Systems with Real-World Constraints. PhD thesis, University of Minnesota, 2018.",
                "Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 815-824, 2016.",
                "Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.",
                "Charles LA Clarke, Maheedhar Kolla, and Olga Vechtomova. An effectiveness measure for ambiguous and underspecified queries. InConference on the Theory of Information Retrieval, pages 188-199. Springer, 2009.",
                "Edgar F Codd. Seven steps to rendezvous with the casual user. IBM Corporation, 1974.",
                "Ann Copestake and Karen Sparck Jones. Natural language interfaces to databases. 1990.",
                "David Cortes. Adapting multi-armed bandits policies to contextual bandits scenarios. arXiv preprint arXiv:1811.04383, 2018.",
                "W Bruce Croft, Donald Metzler, and Trevor Strohman. Search engines: Information retrieval in practice, volume 520. Addison-Wesley Reading, 2010.",
                "Steve Cronen-Townsend, Yun Zhou, and W Bruce Croft. Predicting query performance. InProceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 299-306, 2002.",
                "Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Subhajit Roy, et al. Program synthesis using natural language. InProceedings of the 38th International Conference on Software Engineering, pages 345-356. ACM, 2016.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2018.",
                "Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). InThe NeurIPS'18 Competition, pages 187-208. Springer, Cham, 2020.",
                "Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065-2077, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.187. URL https://www.aclweb.org/anthology/2020.acl-main.187.",
                "Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein. Iris: A conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 473. ACM, 2018.",
                "Cr\u00edcia Z Fel\u00edcio, Kl\u00e9risson VR Paix\u00e3o, Celia AZ Barcelos, and Philippe Preux. A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, pages 32-40, 2017.",
                "Alexandre Gilotte, Cl\u00e9ment Calauz\u00e8nes, Thomas Nedelec, Alexandre Abraham, and Simon Doll\u00e9. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198-206, 2018.",
                "Helia Hashemi, Hamed Zamani, and W Bruce Croft. Performance prediction for non-factoid question answering. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 55-58, 2019.",
                "Ahmed Hassan and Ryen W White. Task tours: helping users tackle complex search tasks. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1885-1889, 2012.",
                "Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. Beyond dcg: user behavior as a predictor of a successful search. In WSDM, pages 221-230, 2010.",
                "Ahmed Hassan Awadallah, Ryen W White, Patrick Pantel, Susan T Dumais, and Yi-Min Wang. Supporting complex search tasks. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pages 829-838, 2014.",
                "Claudia Hauff, Djoerd Hiemstra, and Franciska de Jong. A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1419-1420, 2008.",
                "Claudia Hauff, Leif Azzopardi, and Djoerd Hiemstra. The combination and evaluation of query performance prediction methods. In European Conference on Information Retrieval, pages 301-312. Springer, 2009.",
                "Ben He and Iadh Ounis. Inferring query performance using pre-retrieval predictors. In International symposium on string processing and information retrieval, pages 43-54. Springer, 2004.",
                "Gary G Hendrix, Earl D Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS), 3(2):105-147, 1978.",
                "Andreas Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119-131, 2016.",
                "Peter Ingwersen and Kalervo J\u00e4rvelin. The turn: Integration of information seeking and retrieval in context, volume 18. Springer Science & Business Media, 2006.",
                "Thorsten Joachims, Dayne Freitag, Tom Mitchell, et al. Webwatcher: A tour guide for the world wide web. In IJCAI (1), pages 770-777. Citeseer, 1997.",
                "Thorsten Joachims, Yves Raimond, Olivier Koch, Maria Dimakopoulou, Flavian Vasile, and Adith Swaminathan. Reveal 2020: Bandit and reinforcement learning from user interactions. In Fourteenth ACM Conference on Recommender Systems, pages 628-629, 2020.",
                "Maryam Khodabakhsh and Ebrahim Bagheri. Semantics-enabled query performance prediction for ad hoc table retrieval. Information Processing & Management, 58(1):102399, 2021.",
                "Julia Kiseleva, Eric Crestan, Riccardo Brigo, and Roland Dittel. Modelling and detecting changes in user satisfaction. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1449-1458, 2014.",
                "Julia Kiseleva, Jaap Kamps, Vadim Nikulin, and Nikita Makarov. Behavioral dynamics from the serp's perspective: what are failed serps and how to fix them? In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1561-1570, 2015.",
                "Julia Kiseleva, Alexander Tuzhilin, Jaap Kamps, Melanie JI Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, and Djoerd Hiemstra. Beyond movie recommendations: Solving the continuous cold start problem in e-commerce recommendations. arXiv preprint arXiv:1607.07904, 2016a.",
                "Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 45-54, 2016b.",
                "Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 121-130, 2016c.",
                "Ivica Kostric, Krisztian Balog, and Filip Radlinski. Soliciting user preferences in conversational recommender systems via usage-related questions. In Fifteenth ACM Conference on Recommender Systems, pages 724-729, 2021.",
                "Alexander Kotov, Paul N Bennett, Ryen W White, Susan T Dumais, and Jaime Teevan. Modeling and analysis of cross-session search tasks. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 5-14, 2011.",
                "Klaus Krippendorff. Computing krippendorff's alpha-reliability. 2011.",
                "Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.",
                "Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, and Steffen Eger. Chatgpt: A meta-analysis after 2.5 months. arXiv preprint arXiv:2302.13795, 2023.",
                "Jiwei Li, Alexander H Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. Dialogue learning with human-in-the-loop. ICLR, 2016.",
                "Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661-670, 2010.",
                "Toby Jia-Jun Li, Tom Mitchell, and Brad Myers. Interactive task learning from GUI-grounded natural language instructions and demonstrations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, July 2020a.",
                "Ziming Li, Julia Kiseleva, and Maarten de Rijke. Dialogue generation: From imitation learning to inverse reinforcement learning. arXiv preprint arXiv:1812.03509, 2018.",
                "Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva, Maarten de Rijke, Shahin Shayandeh, and Jianfeng Gao. Guided dialog policy learning without adversarial learning in the loop. arXiv preprint arXiv:2004.03267, 2020b.",
                "Bing Liu and Ian Lane. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482-489. IEEE, 2017.",
                "Bing Liu and Ian Lane. Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pages 350-359, 2018.",
                "Jingjing Liu and Nicholas J Belkin. Personalizing information retrieval for multi-session tasks: Examining the roles of task stage, task type, and topic knowledge on the interpretation of dwell time as an indicator of document usefulness. Journal of the Association for Information Science and Technology, 66(1):58-81, 2015.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.",
                "Gary Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41-46, 2006.",
                "Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645, 2020.",
                "Ramesh Nallapati and Chirag Shah. Evaluating the quality of query refinement suggestions in information retrieval. Technical report, MASSACHUSETTS UNIV AMHERST CENTER FOR INTELLIGENT INFORMATION RETRIEVAL, 2006.",
                "Daan Odijk, Ryen W White, Ahmed Hassan Awadallah, and Susan T Dumais. Struggling and success in web search. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1551-1560, 2015.",
                "Christopher Olston and Ed H Chi. Scenttrails: Integrating browsing and searching on the web. ACM Transactions on Computer-Human Interaction (TOCHI), 10(3):177-197, 2003.",
                "OpenAI. Gpt-4 technical report. Technical report, arXiv:2303.08774 [cs.CL], 2023.",
                "Javier Parapar and Filip Radlinski. Diverse user preference elicitation with multi-armed bandits. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 130-138, 2021.",
                "Vassilis Plachouras, Ben He, and Iadh Ounis. University of glasgow at trec 2004: Experiments in web, robust, and terabyte tracks with terrier. In TREC, 2004.",
                "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551, 2020.",
                "Haggai Roitman. Ictir tutorial: Modern query performance prediction: Theory and practice. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pages 195-196, 2020.",
                "Haggai Roitman, Shai Erera, and Guy Feigenblat. A study of query performance prediction for answer quality determination. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 43-46, 2019.",
                "Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020.",
                "Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth JF Jones. Estimating gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Information Processing & Management, 56(3):1026-1045, 2019.",
                "Tuukka Ruotsalo, Jaakko Peltonen, Manuel JA Eugster, Dorota G\u0142owacka, Patrik Flor\u00e9en, Petri Myllym\u00e4ki, Giulio Jacucci, and Samuel Kaski. Interactive intent modeling for exploratory search. ACM Transactions on Information Systems (TOIS), 36(4):1-46, 2018.",
                "Mark Sanderson. Ambiguous queries: test collections need more sense. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 499-506, 2008.",
                "Surendra Sarnikar, Zhu Zhang, and J Leon Zhao. Query-performance prediction for effective query routing in domain-specific repositories. Journal of the Association for Information Science and Technology, 65(8):1597-1614, 2014.",
                "Anna Sepliarskaia, Julia Kiseleva, Filip Radlinski, and Maarten de Rijke. Preference elicitation as an optimization problem. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 172-180, 2018.",
                "Adish Singla, Ryen White, and Jeff Huang. Studying trailfinding algorithms for enhanced web search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 443-450, 2010.",
                "Amir Soleimani, Christof Monz, and Marcel Worring. NLQuAD: A non-factoid long question answering data set. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 1245-1255, 2021.",
                "Ruihua Song, Zhenxiao Luo, Jian-Yun Nie, Yong Yu, and Hsiao-Wuen Hon. Identification of ambiguous queries in web search. Information Processing & Management, 45(2):216-229, 2009.",
                "Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.",
                "Randall H Trigg. Guided tours and tabletops: Tools for communicating in a hypertext environment. ACM Transactions on Information Systems (TOIS), 6(4):398-414, 1988.",
                "Robert Villa, Iv\u00e1n Cantador, Hideo Joho, and Joemon M Jose. An aspectual interface for supporting complex search tasks. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 379-386, 2009.",
                "Ryen W White and Resa A Roth. Exploratory search: Beyond the query-response paradigm. Synthesis lectures on information concepts, retrieval, and services, 1(1):1-98, 2009.",
                "Ryen W White, Mikhail Bilenko, and Silviu Cucerzan. Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 159-166, 2007.",
                "Ryen W White, Gary Marchionini, and Gheorghe Muresan. Evaluating exploratory search systems. Information Processing and Management, 44(2):433, 2008.",
                "W. A. Woods, Ronald M Kaplan, and Bonnie L. Webber. The lunar sciences natural language information system: Final report. BBN Report 2378, 1972.",
                "Chen Wu and Ming Yan. Session-aware information embedding for e-commerce product recommendation. In Proceedings of the 2017 ACM on conference on information and knowledge management, pages 2379-2382, 2017.",
                "Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. Model-based interactive semantic parsing: A unified framework and a text-to-SQL case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5447-5458, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-1547. URL https://www.aclweb.org/anthology/D19-1547.",
                "Steve Young, Milica Ga\u0161i\u0107, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160-1179, 2013.",
                "Hamed Zamani, W Bruce Croft, and J Shane Culpepper. Neural query performance prediction using weak supervision from multiple signals. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 105-114, 2018.",
                "Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.",
                "Ying Zhao, Falk Scholer, and Yohannes Tsegay. Effective pre-retrieval query performance prediction using similarity and variability evidence. In European conference on information retrieval, pages 52-64. Springer, 2008."
            ],
            "abstract": "Current interactive systems with natural language interfaces lack the ability to understand a complex information-seeking request that expresses several implicit constraints at once and comes with no prior information about user preferences, e.g., \"find hiking trails around San Francisco which are accessible with toddlers and have beautiful scenery in summer\", where the output is a list of possible suggestions for users to start their exploration. In such scenarios, user requests can be issued in one shot in the form of a complex and long query, unlike conversational and exploratory search models, where short utterances or queries are often presented to the system step by step. We have designed and deployed a platform to collect data for approaching such complex interactive systems. Moreover, despite the current advancement of generative language models, these models suffer from hallucination when providing factual knowledge. Language models are trained in large part on web-scraped data from the past, which usually is not useful for users' immediate needs. In this article, we propose an IA that leverages Large Language Models (LLM) for complex request understanding and makes it interactive using Reinforcement learning, which allows it to intricately refine user requests by making them complete, leading to better retrieval and reducing LLM hallucination problems for current user needs. To demonstrate the performance of the proposed modeling paradigm, we have adopted various pre-retrieval metrics that capture the extent to which guided interactions with our system yield better retrieval results. Through extensive experimentation, we demonstrated that our method significantly outperforms several robust baselines.",
            "date": 2021,
            "title": "Making Large Language Models Interactive: A Pioneer Study on Supporting Complex Information-Seeking Tasks with Implicit Constraints"
        },
        "topic": "Hallucination in LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2311.04892",
            "isAPA": true,
            "related work": "Personas in LLMs: Personified LLMs have seen widespread usage in simulating human behavior. Park et al. (2023) created personas with detailed attributes and studied their evolution over time. Aher et al. (2023) used LLMs to replicate classic economic, psycho-linguistic, and social psychology experiments with some success. Argyle et al. (2023) showed some success in replicating the viewpoints of demographically varied U.S. sub-populations with GPT-3. Personas have also been used to create collaborative agents that collectively improve the LLM capability: Qian et al. (2023) used personas to create a virtual chat-powered software development company, Wang et al. (2023) used personas in a self-collaboration setting to improve the LLM performance on knowledge and reasoning tasks, and Salewski et al. (2023) showed that LLMs adopting expert personas can do better on vision and language tasks. Motivated by this emergence of personified LLMs, our work studies the impact of socio-demographic persona assignments on the reasoning abilities of LLMs. Biases in models: There is a vast amount of work on how bias in algorithms and systems can cause harm (Danks & London, 2017; Barocas et al., 2017). Our focus is specifically on measuring the bias in learned models.\nBiases have been extensively studied in vector representations (Bolukbasi et al., 2016), task-specific models (Rudinger et al., 2018; Zhao et al., 2018), and even language models (Li et al., 2023) via their behavior on tasks such as coreference resolution (Rudinger et al., 2018; Zhao et al., 2018), entailment (Dev et al., 2019), and question answering (Li et al., 2020). In contrast to these works, our work specifically focuses on biases due to persona-assignment in LLMs. Persona Biases: Deshpande et al. (2023) demonstrated that personas can be used to surface toxic responses from ChatGPT. Cheng et al. (2023) showed that LLMs can generate stereotypical descriptions of socio-demographic personas. Sheng et al. 
(2021) studied the effect of persona on dialog systems with a focus on harmful text in their outputs. Wan et al. (2023) extended this study to personified LLMs (e.g. ChatGPT) with richer personas and more detailed analysis, however the focus was still on harmful text in generated outputs. Our work, to the best of our knowledge, is the first to use persona-assignment to study the impact of persona on reasoning performance of LLMs.",
            "reference": [
                "Gati Aher, Rosa I. Arriaga, and Adam T. Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In ICML, 2023.",
                "Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337-351, 2023.",
                "Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. ArXiv, abs/2108.07732, 2021.",
                "Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. The problem with bias: From allocative to representational harms in machine learning. In SIGCIS conference paper, 2017.",
                "Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS, 2016.",
                "S\u00e9bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. ArXiv, abs/2303.12712, 2023.",
                "Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. ArXiv, abs/2305.18189, 2023.",
                "David Danks and Alex John London. Algorithmic bias in autonomous systems. In IJCAI, 2017.",
                "Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. ArXiv, abs/2304.05335, 2023.",
                "Sunipa Dev, Tao Li, J. M. Phillips, and Vivek Srikumar. On measuring and mitigating biased inferences of word embeddings. In AAAI, 2019.",
                "Jonas Freiknecht and Wolfgang Effelsberg. Procedural generation of interactive stories using language models. In Proceedings of the 15th International Conference on the Foundations of Digital Games, pp. 1-8, 2020.",
                "Perttu H\u00e4m\u00e4l\u00e4inen, Mikke Tavast, and Anton Kunnari. Evaluating large language models in generating synthetic hci research data: a case study. In CHI, 2023.",
                "Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ArXiv, abs/2009.03300, 2020.",
                "Joseph Henrich, Steven J. Heine, and Ara Norenzayan. Most people are not weird. Nature, 466:29-29, 2010.",
                "John J. Horton. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023.",
                "Peter A. Jansen. From words to wires: Generating functioning electronic devices from natural language descriptions. ArXiv, abs/2305.14874, 2023.",
                "Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022.",
                "Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.",
                "Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Vivek Srikumar. Unqovering stereotypical biases via underspecified questions. In Findings@EMNLP, 2020.",
                "Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Y. Wang. A survey on fairness in large language models. ArXiv, abs/2308.10149, 2023.",
                "Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2023.",
                "OpenAI. Custom instructions for chatgpt, 2023a. https://openai.com/blog/custom-instructions-for-chatgpt.",
                "OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023b.",
                "Jeongeon Park and Daeun Choi. Audilens: Configurable llm-generated audiences for public speech practice. Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.",
                "Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1-18, 2022.",
                "Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In UIST, 2023.",
                "Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. ArXiv, abs/2307.07924, 2023.",
                "Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is chatgpt a general-purpose natural language processing task solver? ArXiv, abs/2302.06476, 2023.",
                "Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In NAACL, 2018.",
                "Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. In-context impersonation reveals large language models' strengths and biases. ArXiv, abs/2305.14930, 2023.",
                "Emily Sheng, Josh Arnold, Zhou Yu, Kai-Wei Chang, and Nanyun Peng. Revealing persona biases in dialogue systems. ArXiv, abs/2104.08728, 2021.",
                "Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed Huai hsin Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In Annual Meeting of the Association for Computational Linguistics, 2022.",
                "Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cant\u00f3n Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.",
                "Yixin Wan, Jieyu Zhao, Nanyun Peng, Kai-Wei Chang, and Aman Chadha. Are personalized stochastic parrots more dangerous? Evaluating persona biases in dialogue systems. ArXiv, abs/2310.05280, 2023.",
                "Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. ArXiv, abs/2307.05300, 2023.",
                "Edwin B Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 1927.",
                "Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In NAACL, 2018.",
                "Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Kai-Wei Chang. Ethical-advice taker: Do language models understand natural language interventions? In Findings@ACL, 2021."
            ],
            "abstract": "Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.",
            "date": 2021,
            "title": "Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs"
        },
        "topic": "Bias and Fairness in LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2001.07966",
            "isAPA": false,
            "related work": "After the Transformer [1] was proposed and widely adopted in cross-modal research, results on various tasks have been pushed to new heights within the past year. Though almost all of the latest work is based on the Transformer, the approaches differ in various ways. We review this work along several dimensions below. \u2022 Model architecture. The BERT [10] model is pre-trained for NLP tasks whose input is one or two sentences. To apply the BERT structure to cross-modal tasks, there are many ways to deal with different modalities. ViLBERT [14] and LXMERT [15] applied a single-modal Transformer to image and sentence respectively, then combined the two modalities with a cross-modal Transformer. Other work, such as VisualBERT [16], B2T2 [17], Unicoder-VL [18], VL-BERT [19], Unified VLP [20], UNITER [21], etc., concatenated image and sentence as a single input to the Transformer. It is hard to argue which model structure is better, since performance really depends on the specific scenario. \u2022 Image visual tokens. Almost all recent papers applied an object detection model to the images and treated the detected regions of interest (RoIs) as image descriptors, just like linguistic tokens. Unlike other work, which used a pre-trained detection model, VL-BERT trained the detection network together with its image-text joint embedding network, and it also added global image features into model training. We can see that region-based image features are good image descriptors, and they form a sequence of visual tokens that can be fed directly into the Transformer. \u2022 Pre-train data. Unlike language model pre-training, which can leverage tremendous amounts of natural language data, vision-language tasks require high-quality image descriptions that are hard to obtain for free. Conceptual Captions [2] is the most widely used dataset for image-text pre-training, given that it has 3M image descriptions and is relatively larger than other datasets. 
UNITER [21] combines four datasets (Conceptual Captions [2], SBU Captions [3], Visual Genome [22] and MSCOCO [5]) to form a 9.6M training corpus and achieved state-of-the-art results on many image-text cross-modal tasks. LXMERT [15] added some VQA training data into pre-training and obtained state-of-the-art results on the VQA task. We can see that data quality and volume play important roles in model training, and should be paid more attention to when designing new models.",
            "reference": [
                "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv e-prints, page arXiv:1706.03762, Jun 2017.",
                "Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InACL, 2018.",
                "Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. InNIPS, 2011.",
                "Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv e-prints, page arXiv:1412.2306, Dec 2014.",
                "Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv e-prints, page arXiv:1504.00325, Apr 2015.",
                "Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.",
                "Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual Question Answering. arXiv e-prints, page arXiv:1505.00468, May 2015.",
                "Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From Recognition to Cognition: Visual Commonsense Reasoning. arXiv e-prints, page arXiv:1811.10830, Nov 2018.",
                "Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. arXiv e-prints, page arXiv:1707.07998, Jul 2017.",
                "Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, page arXiv:1810.04805, Oct 2018.",
                "Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2015.",
                "Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv e-prints, page arXiv:1906.08237, Jun 2019.",
                "Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv e-prints, page arXiv:1907.11692, Jul 2019.",
                "Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. ArXiv, abs/1908.02265, 2019.",
                "Hao Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. InEMNLP/IJCNLP, 2019.",
                "Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. ArXiv, abs/1908.03557, 2019.",
                "Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. InEMNLP/IJCNLP, 2019.",
                "Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv e-prints, page arXiv:1908.06066, Aug 2019.",
                "Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv e-prints, page arXiv:1908.08530, Aug 2019.",
                "Luowei Zhou, Hamid Palangi, Lefei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. ArXiv, abs/1909.11059, 2019.",
                "Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. ArXiv, abs/1909.11740, 2019.",
                "Ranjay Krishna, Yuke Zhu, Oliver Groth, J. M. Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32-73, 2016.",
                "Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. InThe IEEE International Conference on Computer Vision (ICCV), December 2015.",
                "Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.",
                "Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia-a platform for vision & language research. InSysML Workshop, NeurIPS, volume 2018, 2018.",
                "Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, \u0141ukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv e-prints, page arXiv:1609.08144, Sep 2016.",
                "Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InACL, 2018.",
                "Pinghua Gong, Jieping Ye, and Changshui Zhang. Multi-stage multi-task feature learning. Advances in neural information processing systems, 14:2979-3010, 2012.",
                "Kuang-Huei Lee, Xiao Dong Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. InECCV, 2018.",
                "Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. InIJCAI, 2019.",
                "Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. Position focused attention network for image-text matching. InIJCAI, 2019.",
                "Forrest N. Iandola, Matthew W. Moskewicz, Sergey Karayev, Ross B. Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. ArXiv, abs/1404.1869, 2014.",
                "Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-9, 2014."
            ],
            "abstract": "In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from Web. We first pre-train the model on this dataset, then conduct a second stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both MSCOCO and Flickr30k datasets.",
            "date": 2021,
            "title": "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data"
        },
        "topic": "Large Multi-Modal Language Models"
    },
    {
        "source_paper": {
            "arxiv_id": "2306.00978",
            "isAPA": true,
            "related work": "Model quantization methods.Quantization reduces the bit-precision of deep learning models Han et al. (2016); Jacob et al. (2018); Nagel et al. (2019); Wang et al. (2019); Nagel et al. (2020); Lin et al. (2020), which helps to reduce the model size and accelerate inference. Quantization techniques generally fall into two categories: quantization-aware training (QAT, which relies on backpropagation to update the quantized weights) Bengio et al. (2013); Gholami et al. (2021); Nagel et al. (2021); Choi et al. (2018) and post-training quantization Jacob et al. (2018); Nagel et al. (2019; 2020) (PTQ, usually training-free). The QAT methods cannot easily scale up to large models like LLMs. Therefore, people usually use PTQ methods to quantize LLMs.\nQuantization of LLMs.People study two settings for LLM quantization: (1) W8A8 quantization, where both activation and weights are quantized to INT8 Dettmers et al. (2022); Xiao et al. (2022); Yao et al. (2022); Wei et al. (2022a; 2023); (2) Low-bit weight-only quantization (e.g., W4A16), where only weights are quantized into low-bit integers Frantar et al. (2022); Dettmers & Zettlemoyer (2022); Sheng et al. (2023); Park et al. (2022). We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up the token generation (remedies memory-bound workload). Apart from the vanilla round-to-nearest baseline (RTN), GPTQ Frantar et al. (2022) is the closest to our work. However, the reconstruction process of GPTQ leads to an over-fitting issue to the calibration set and may not preserve the generalist abilities of LLMs for other modalities and domains. It also requires a reordering trick to work for some models (e.g., LLaMA-7B Touvron et al. (2023a) and OPT-66B Zhang et al. (2022)). Apart from quantiztion methods designed for general-purporse hardware, SpAtten Wang et al. 
(2020) designs a progressive approach to gradually increase the number of bits used in softmax calculation.\nSystem support for low-bit quantized LLMs. Low-bit quantized LLMs have been a popular setting to reduce inference costs, and several systems provide support to achieve a practical speed-up. GPTQ Frantar et al. (2022) provides INT3 kernels for OPT models, and GPTQ-for-LLaMA extends kernel support for INT4 reordered quantization with the help of Triton Tillet et al. (2019). FlexGen Sheng et al. (2023), llama.cpp, and exllama perform group-wise INT4 quantization to reduce I/O costs and offloading. FasterTransformer implements FP16\u00d7INT4 GEMM for weight-only per-tensor quantization but does not support group quantization. LUT-GEMM Park et al. (2022) performs bitwise computation on GPU CUDA cores with the help of lookup tables. Our concurrent work, MLC-LLM MLC-Team (2023), offers strong results on multiple edge CPU and GPU platforms thanks to the powerful TVM Chen et al. (2018); Feng et al. (2023) backend.",
            "reference": [
                "Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716-23736, 2022.",
                "Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C.Program synthesis with large language models, 2021.",
                "Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Jitsev, J., Kornblith, S., Koh, P. W., Ilharco, G., Wortsman, M., and Schmidt, L.Openflamingo, March 2023.URL https://doi.org/10.5281/zenodo.7733589.",
                "Bengio, Y., L\u00e9onard, N., and Courville, A.Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013.",
                "Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al.Gpt-neox-20b: An open-source autoregressive language model.arXiv preprint arXiv:2204.06745, 2022.",
                "Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.Language models are few-shot learners.In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877-1901. Curran Associates, Inc., 2020.URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.",
                "Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al.TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.",
                "Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Doll\u00e1r, P., and Zitnick, C. L.Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015.",
                "Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.URL https://lmsys.org/blog/2023-03-30-vicuna/.",
                "Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K.Pact: Parameterized clipping activation for quantized neural networks.arXiv preprint arXiv:1805.06085, 2018.",
                "Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al.Scaling instruction-finetuned language models.arXiv preprint arXiv:2210.11416, 2022.",
                "Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J.Training verifiers to solve math word problems, 2021.",
                "Dettmers, T. and Zettlemoyer, L.The case for 4-bit precision: k-bit inference scaling laws.arXiv preprint arXiv:2212.09720, 2022.",
                "Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L.Llm.int8(): 8-bit matrix multiplication for transformers at scale.arXiv preprint arXiv:2208.07339, 2022.",
                "Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023.",
                "Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S.Learned step size quantization.arXiv preprint arXiv:1902.08153, 2019.",
                "Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., Ye, Z., Zheng, L., Yu, C. H., Yu, Y., and Chen, T.TensorIR: An Abstraction for Automatic Tensorized Program Optimization.In ASPLOS, 2023.",
                "Frankle, J. and Carbin, M.The lottery ticket hypothesis: Finding sparse, trainable neural networks.arXiv preprint arXiv:1803.03635, 2018.",
                "Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D.Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022.",
                "Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R.MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.arXiv preprint arXiv:2306.13394, 2023.",
                "Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al.The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020.",
                "Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K.A survey of quantization methods for efficient neural network inference.arXiv preprint arXiv:2103.13630, 2021.",
                "Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D.Making the v in vqa matter: Elevating the role of image understanding in visual question answering.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904-6913, 2017.",
                "Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P.Vizwiz grand challenge: Answering visual questions from blind people.In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3608-3617, 2018.",
                "Han, S., Pool, J., Tran, J., and Dally, W.Learning both weights and connections for efficient neural network.Advances in neural information processing systems, 28, 2015.",
                "Han, S., Mao, H., and Dally, W. J.Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.In ICLR, 2016.",
                "Hudson, D. A. and Manning, C. D.Gqa: A new dataset for real-world visual reasoning and compositional question answering.In CVPR, 2019.",
                "Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D.Quantization and training of neural networks for efficient integer-arithmetic-only inference.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2704-2713, 2018.",
                "Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.",
                "Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E.Mixtral of experts, 2024.",
                "Kim, Y. J., Henry, R., Fahim, R., and Awadalla, H. H.Who says elephants can't run: Bringing large scale moe models into cloud scale production.arXiv preprint arXiv:2211.10017, 2022.",
                "Klimt, B. and Yang, Y.The enron corpus: A new dataset for email classification research.In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings 15, pp.  217-226. Springer, 2004.",
                "Koh, J. Y., Salakhutdinov, R., and Fried, D.Grounding language models to images for multimodal generation.arXiv preprint arXiv:2301.13823, 2023.",
                "Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y.Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023a.",
                "Li, J., Li, D., Savarese, S., and Hoi, S.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint arXiv:2301.12597, 2023b.",
                "Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al.Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023c.",
                "Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S.Brecq: Pushing the limit of post-training quantization by block reconstruction.arXiv preprint arXiv:2102.05426, 2021.",
                "Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R.Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023d.",
                "Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al.Mcunet: Tiny deep learning on iot devices.Advances in Neural Information Processing Systems, 33:11711-11722, 2020.",
                "Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., and Han, S.Vila: On pre-training for visual language models.In CVPR, 2024.",
                "Liu, H., Li, C., Wu, Q., and Lee, Y. J.Visual instruction tuning.2023a.",
                "Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023b.",
                "Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A.Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507-2521, 2022.",
                "Merity, S., Xiong, C., Bradbury, J., and Socher, R.Pointer sentinel mixture models, 2016.",
                "MLC-Team.MLC-LLM, 2023.URL https://github.com/mlc-ai/mlc-llm.",
                "Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M.Data-free quantization through weight equalization and bias correction.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1325-1334, 2019.",
                "Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T.Up or down? adaptive rounding for post-training quantization.In International Conference on Machine Learning, pp.  7197-7206. PMLR, 2020.",
                "Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., Van Baalen, M., and Blankevoort, T.A white paper on neural network quantization.arXiv preprint arXiv:2106.08295, 2021.",
                "Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730-27744, 2022.",
                "Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D.nuqmm: Quantized matmul for efficient inference of large-scale generative language models.arXiv preprint arXiv:2206.09557, 2022.",
                "Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J.The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only.arXiv preprint arXiv:2306.01116, 2023.",
                "Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al.Multitask prompted training enables zero-shot task generalization.arXiv preprint arXiv:2110.08207, 2021.",
                "Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ili\u0107, S., Hesslow, D., Castagn\u00e9, R., Luccioni, A. S., Yvon, F., Gall\u00e9, M., et al.Bloom: A 176b-parameter open-access multilingual language model.arXiv preprint arXiv:2211.05100, 2022.",
                "Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C., Gonzalez, J. E., et al.High-throughput generative inference of large language models with a single gpu.arXiv preprint arXiv:2303.06865, 2023.",
                "Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M.Towards vqa models that can read.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8317-8326, 2019.",
                "Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B.Stanford alpaca: An instruction-following llama model.https://github.com/tatsu-lab/stanford_alpaca, 2023.",
                "Tillet, P., Kung, H.-T., and Cox, D.Triton: an intermediate language and compiler for tiled neural network computations.In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp.  10-19, 2019.",
                "Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., Azhar, F., et al.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023a.",
                "Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023b.",
                "Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., and Polosukhin, I.Attention is all you need.Advances in neural information processing systems, 30, 2017.",
                "Wang, H., Zhang, Z., and Han, S.Spatten: Efficient sparse attention architecture with cascade token and head pruning.CoRR, abs/2012.09852, 2020.URL https://arxiv.org/abs/2012.09852.",
                "Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S.HAQ: Hardware-Aware Automated Quantization with Mixed Precision.In CVPR, 2019.",
                "Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V.Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021.",
                "Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X.Outlier suppression: Pushing the limit of low-bit transformer language models, 2022a.URL https://arxiv.org/abs/2209.13325.",
                "Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X.Outlier suppression: Pushing the limit of low-bit transformer language models.arXiv preprint arXiv:2209.13325, 2022b.",
                "Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., and Liu, X.Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.arXiv preprint arXiv:2304.09145, 2023.",
                "Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S.Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2211.10438, 2022.",
                "Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y.Zeroquant: Efficient and affordable post-training quantization for large-scale transformers, 2022.URL https://arxiv.org/abs/2206.01861.",
                "Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L.Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023.",
                "Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., and Qiao, Y.Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023.",
                "Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L.Opt: Open pre-trained transformer language models, 2022.URL https://arxiv.org/abs/2205.01068."
            ],
            "abstract": "Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.",
            "date": 2021,
            "title": "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"
        },
        "topic": "Acceleration for LLMs"
    },
    {
        "source_paper": {
            "arxiv_id": "2310.10508",
            "isAPA": true,
            "related work": "2.1.Language Models in Software Engineering and Fine-TuningThe field of automated software engineering (ASE) has shifted significantly towards using deep learning models, especially language models (LMs) to model source code (Wang et al., 2016; White et al., 2016; Yin and Neubig, 2018; Liu et al., 2019; Allamanis et al., 2018). These methods have been found to offer substantial advantages over traditional approaches such as domain-specific language-guided models, probabilistic grammars, and simple neural language models. A common architecture for language modeling is the encoder-decoder architecture (Cho et al., 2014). In recent years, LMs have been applied to different varieties of ASE applications. These include code completion (Li et al., 2018), code search (Gu et al., 2018), code generation (Shen et al., 2022), unit test case generation (Tufano et al., 2020), code summarization (Hu et al., 2018), code translation (Sun et al., 2022), automated program repair (Li et al., 2022), etc.\nPre-trained code models can learn general-purpose code representations that capture various code properties such as lexical, syntactic, semantic, and structural information. Fine-tuning can adapt these models to specific tasks by updating the pre-trained parameters with task-specific data. By leveraging this new technique, models could outperform existing baselines by a huge margin. To exploit the new technology in the ASE domain, a plethora of studies have applied pre-training models to source code and natural language corpora and fine-tune downstream tasks, e.g. code search, comment generation, variable-misues classification, wrong binary operator detection, function-docstring mismatch prediction, exception type classification, code generation, code completion, code translation, etc. 
(Feng et al., 2020; Kanade et al., 2020; Guo et al., 2022; Ahmad et al., 2021; Ahmed and Devanbu, 2022).\n2.2.Prompt engineering in Software EngineeringPrompt engineering is an alternative to fine-tuning that also adapts pre-trained LMs as fine-tuned language models. However, they do not rely on the fine-tuning phase with the supervised dataset. Instead, they provide prompts to the pre-trained LMs to consider all different kinds of tasks to a generation problem. These models are generally larger in terms of the size of corpora that are trained on and the number of parameters that the model learns from. Due to this distinction, they are called Large Language Models (LLMs). The advent of LLMs and prompt engineering brought the LM tasks to a new level of performance (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023; Touvron et al., 2023; et al., 2023a, 2022, b). There have been numerous studies that exploited LLMs and prompt engineering to tackle ASE tasks (Khan and Uddin, 2022; Gao et al., 2023; Wei et al., 2023; Feng and Chen, 2023; Geng et al., 2024). They have proposed different methods of prompting strategies, i.e. basic prompting, in-context learning, task-specific prompting, chain-of-thought prompting, auto-prompting, soft prompting, etc. (Liu et al., 2023, 2022; He et al., 2022; Carta et al., 2023; Wei et al., 2022; Shin et al., 2020; Hambardzumyan et al., 2021).\nGao et al. (Gao et al., 2023) empirically investigated the three key factors in in-context learning in code intelligence tasks: selection, order, and number of examples. They found that both similarity and diversity in example selection are important in both performance and stability in predictions. They also find that the order and the number of examples have an impact on their performance. Li et al. (Li et al., 2023) investigated ChatGPT's ability to find correct failure-inducing test cases for buggy source code. 
From their initial finding, ChatGPT had a low success rate but after guiding it to focus with correct nuances, they were able to drastically improve the performance. Feng et al. (Feng and Chen, 2023) proposed AdbGPT, a novel approach that uses LLMs to automatically reproduce bugs from bug reports, without any training or hard-coding effort. They designed prompts that leverage few-shot learning and chain-of-thought reasoning, to elicit LLMs' knowledge and logical reasoning for bug replay. Geng et al. (Geng et al., 2024) investigated the LLMs' performance in generating code comments with multiple intents regarding their properties. They adopted the in-context learning paradigm and designed customized strategies for example selection and re-ranking techniques to enhance the performance. Kabir et al. (Kabir et al., 2023) did an analysis on ChatGPT's responses to Stack Overflow (SO) questions in ASE tasks. They analyzed 517 SO questions, linguistic analysis, and a user study and found more than half of the answers were incorrect and 77% of them were verbose. However, users still preferred ChatGPT 39.34% of the time due to the comprehensiveness and the style of language.\nAlthough the research in prompt engineering is picking up, there have not been many studies that have compared the fine-tuning models and the prompt-engineered LLMs comprehensively on various ASE tasks. To mitigate this research gap, this paper compares the former fine-tuning paradigm with the prompt-engineered LLMs in quantitative and qualitative approaches.",
            "reference": [
                "[n.d.]. Online Appendix,https://anonymous.4open.science/r/gpt4_ase_tasks-6BF8, 2023.",
                "Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2655-2668.",
                "Toufique Ahmed and Premkumar Devanbu. 2022. Multilingual training for software engineering. InProceedings of the 44th International Conference on Software Engineering. 1443-1455.",
                "Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. InInternational Conference on Learning Representations.",
                "Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al.2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732(2021).",
                "Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners.  arXiv:2005.14165 [cs.CL]",
                "Salvatore Carta, Alessandro Giuliani, Leonardo Piano, Alessandro Sebastian Podda, Livio Pompianu, and Sandro Gabriele Tiddia. 2023. Iterative Zero-Shot LLM Prompting for Knowledge Graph Construction. arXiv preprint arXiv:2307.01128(2023).",
                "Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests.  arXiv:2207.10397 [cs.CL]",
                "Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374(2021).",
                "Kyunghyun Cho, B van Merrienboer, Caglar Gulcehre, F Bougares, H Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. InConference on Empirical Methods in Natural Language Processing (EMNLP 2014).",
                "Aakanksha Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways.  arXiv:2204.02311 [cs.CL]",
                "Hugo Touvron et al. 2023a. Llama 2: Open Foundation and Fine-Tuned Chat Models.  arXiv:2307.09288 [cs.CL]",
                "Rohan Anil et al. 2023b. PaLM 2 Technical Report.  arXiv:2305.10403 [cs.CL]",
                "Sidong Feng and Chunyang Chen. 2023. Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models. In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE).",
                "Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al.2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. InFindings of the Association for Computational Linguistics: EMNLP 2020. 1536-1547.",
                "Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, and Michael R Lyu. 2023. Constructing Effective In-Context Demonstration for Code Intelligence Tasks: An Empirical Study. InProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering.",
                "Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. 2024. Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning. In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE).",
                "Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. InProceedings of the 40th International Conference on Software Engineering. 933-944.",
                "Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7212-7225.",
                "Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 4921-4933.",
                "Yun He, Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, et al.2022. Hyperprompt: Prompt-based task-conditioning of transformers. InInternational Conference on Machine Learning. PMLR, 8678-8690.",
                "Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. 2023. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework.  arXiv:2308.00352 [cs.AI]",
                "Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. InProceedings of the 26th conference on program comprehension. 200-210.",
                "Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436(2019).",
                "Samia Kabir, David N Udo-Imeh, Bonan Kou, and Tianyi Zhang. 2023. Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions. arXiv preprint arXiv:2308.02312(2023).",
                "Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and Evaluating Contextual Embedding of Source Code. InProceedings of the 37th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 119), Hal Daum\u00e9 III and Aarti Singh (Eds.). PMLR, 5110-5121. https://proceedings.mlr.press/v119/kanade20a.html",
                "Junaed Younus Khan and Gias Uddin. 2022. Automatic code documentation generation using gpt-3. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1-6.",
                "Barbara Kitchenham and Shari Lawrence Pfleeger. 2002. Principles of survey research: part 5: populations and samples. ACM SIGSOFT Software Engineering Notes27, 5 (2002), 17-20.",
                "Jian Li, Yue Wang, Michael R Lyu, and Irwin King. 2018. Code completion with neural attention and pointer networks. InProceedings of the 27th International Joint Conference on Artificial Intelligence. 4159-25.",
                "Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting. InProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering. arXiv:2304.11686 [cs.SE]",
                "Yi Li, Shaohua Wang, and Tien N Nguyen. 2022. Dear: A novel deep learning-based approach for automated program repair. InProceedings of the 44th International Conference on Software Engineering. 511-523.",
                "Hui Liu, Jiahao Jin, Zhifeng Xu, Yanzhen Zou, Yifan Bu, and Lu Zhang. 2019. Deep learning based code smell detection. IEEE transactions on Software Engineering47, 9 (2019), 1811-1837.",
                "Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3?. InDeep Learning Inside Out: 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO 2022. Association for Computational Linguistics (ACL), 100-114.",
                "Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys55, 9 (2023), 1-35.",
                "Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al.2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664(2021).",
                "Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. InInternational Conference on Machine Learning. PMLR, 26106-26128.",
                "OpenAI. 2023. GPT-4 Technical Report.  arXiv:2303.08774 [cs.CL]",
                "Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.  arXiv:2203.02155 [cs.CL]",
                "Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828(2023).",
                "Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics. 311-318.",
                "Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. 2021. Cotext: Multi-task learning with code-text transformer. arXiv preprint arXiv:2105.08645(2021).",
                "Weizhen Qi, Yeyun Gong, Yu Yan, Can Xu, Bolun Yao, Bartuer Zhou, Biao Cheng, Daxin Jiang, Jiusheng Chen, Ruofei Zhang, Houqiang Li, and Nan Duan. 2021. ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 232-239. https://doi.org/10.18653/v1/2021.acl-demo.28",
                "Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297(2020).",
                "Stephen Robertson, Hugo Zaragoza, et al.2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends\u00ae in Information Retrieval3, 4 (2009), 333-389.",
                "Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, and Ge Li. 2022. Incorporating domain knowledge through task augmentation for front-end JavaScript code generation. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1533-1543.",
                "Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I Wang. 2022. Natural Language to Code Translation with Execution. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 3533-3546.",
                "Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222-4235.",
                "Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning.  arXiv:2303.11366 [cs.AI]",
                "Stephen V Stehman. 1997. Selecting and interpreting measures of thematic classification accuracy. Remote sensing of Environment62, 1 (1997), 77-89.",
                "Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, and Quanjun Zhang. 2022. Code search based on context-aware code translation. InProceedings of the 44th International Conference on Software Engineering. 388-400.",
                "Sindhu Tipirneni, Ming Zhu, and Chandan K. Reddy. 2023. StructCoder: Structure-Aware Transformer for Code Generation.  arXiv:2206.05239 [cs.LG]",
                "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models.  arXiv:2302.13971 [cs.CL]",
                "Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617(2020).",
                "Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. InProceedings of the 38th International Conference on Software Engineering. 297-308.",
                "Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems35 (2022), 24824-24837.",
                "Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.",
                "Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. InProceedings of the 31st IEEE/ACM international conference on automated software engineering. 87-98.",
                "Pengcheng Yin and Graham Neubig. 2018. TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation. InProceedings of the Conference on Empirical Methods in Natural Language Processing (Demo Track).",
                "Zhiqiang Yuan, Junwei Liu, Qiancheng Zi, Mingwei Liu, Xin Peng, and Yiling Lou. 2023. Evaluating instruction-tuned large language models on code comprehension and generation. arXiv preprint arXiv:2308.01240(2023).",
                "Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. 2023. Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions.  arXiv:2212.10561 [cs.CL]",
                "Tianyi Zhang, Tao Yu, Tatsunori B. Hashimoto, Mike Lewis, Wen tau Yih, Daniel Fried, and Sida I. Wang. 2022. Coder Reviewer Reranking for Code Generation.  arXiv:2211.16490 [cs.LG]"
            ],
            "abstract": "In this paper, we investigate the effectiveness of state-of-the-art LLM, i.e., GPT-4, with three different prompting engineering techniques (i.e., basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt engineering GPT-4 cannot necessarily and significantly outperform fine-tuning smaller/older LLMs in all three tasks. For comment generation, GPT-4 with the best prompting strategy (i.e., task-specific prompt) had outperformed the first-ranked fine-tuned model by 8.33% points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperforms GPT-4 with best prompting by 16.61% and 28.3% points, on average in BLEU. For code translation, GPT-4 and fine-tuned baselines tie as they outperform each other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that the GPT-4 with conversational prompts (i.e., when a human provides feedback and instructions back and forth with a model to achieve best results) showed drastic improvement compared to GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions as conversational prompts, which goes beyond typical and generic prompting strategies. Our study suggests that, at its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.",
            "date": 2021,
            "title": "Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks"
        },
        "topic": "LLMs for Software Engineering"
    }
]