    A '''large language model''' ('''LLM''') is a computational [[Model#Conceptual model|model]] notable for its ability to achieve general-purpose language generation and other [[natural language processing]] tasks such as [[Statistical classification|classification]]. Based on [[language model]]s, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive [[self-supervised learning|self-supervised]] and [[semi-supervised learning|semi-supervised]] training process.<ref name=":7">{{Cite web |date=2019-02-14 |title=Better Language Models and Their Implications |url=https://openai.com/blog/better-language-models/ |url-status=live |archive-url= https://web.archive.org/web/20201219132206/https://openai.com/blog/better-language-models/ |archive-date=2020-12-19 |access-date=2019-08-25 |website=OpenAI}}</ref> LLMs can be used for text generation, a form of [[Generative artificial intelligence|generative AI]], by taking an input text and repeatedly predicting the next token or word.<ref name="Bowman">{{cite arXiv |eprint=2304.00612 |class=cs.CL |first=Samuel R. |last=Bowman |title=Eight Things to Know about Large Language Models |year=2023}}</ref>
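The "repeatedly predicting the next token" loop described above can be sketched in a few lines of Python. This is a purely illustrative toy: the `predict_next` function here is a hard-coded lookup standing in for a trained neural network, which would instead return a probability distribution over its vocabulary.

```python
# Illustrative sketch of autoregressive (next-token) text generation.
# A real LLM replaces `predict_next` with a trained neural network.

def predict_next(tokens):
    """Toy stand-in for a model: looks up the next token in a
    hard-coded table instead of using learned probabilities."""
    table = {
        ("The",): "cat",
        ("The", "cat"): "sat",
        ("The", "cat", "sat"): "down",
    }
    return table.get(tuple(tokens), "<eos>")

def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly append the model's predicted next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<eos>":  # stop when the model signals end of sequence
            break
        tokens.append(nxt)
    return tokens

print(generate(["The"]))  # ['The', 'cat', 'sat', 'down']
```

Real systems sample from the model's output distribution (with temperature, top-k, etc.) rather than following a deterministic table, but the outer loop — feed the sequence in, append one token, repeat — is the same.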
    LLMs are [[artificial neural network]]s that utilize the [[Transformer (deep learning architecture)|transformer]] architecture, invented in 2017. The largest and most capable LLMs, {{as of|2024|06|lc=y}}, are built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data.
    Historically, up to 2020, [[Fine-tuning (deep learning)|fine-tuning]] was the primary method used to adapt a model for specific tasks. However, larger models such as [[GPT-3]] have demonstrated the ability to achieve similar results through [[prompt engineering]], which involves crafting specific input prompts to guide the model's responses.<ref name="few-shot-learners">{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=Dec 2020 |editor1-last=Larochelle |editor1-first=H. |editor2-last=Ranzato |editor2-first=M. |editor3-last=Hadsell |editor3-first=R. |editor4-last=Balcan |editor4-first=M.F. |editor5-last=Lin |editor5-first=H. |title=Language Models are Few-Shot Learners |url=https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. 
|volume=33 |pages=1877–1901 |last25=Chess |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |first26=Jack |first25=Benjamin |last26=Clark |last19=Winter |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |first19=Clemens |first18=Jeffrey |last18=Wu |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M.}}</ref> These models acquire knowledge about syntax, semantics, and [[ontology (information science)|ontologies]]<ref>{{cite conference |url=https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf |title=NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning |last1=Fathallah |first1=Nadeen |last2=Das |first2=Arunav |last3=De Giorgis |first3=Stefano |last4=Poltronieri |first4=Andrea |last5=Haase |first5=Peter |last6=Kovriguina |first6=Liubov |date=2024-05-26 |location=Hersonissos, Greece |conference=Extended Semantic Web Conference 2024}}</ref> inherent in human language corpora, but they also inherit inaccuracies and [[Algorithmic bias|biases]] present in the data they are trained on.<ref name="Manning-2022">{{cite journal |last=Manning |first=Christopher D. |author-link=Christopher D. Manning |year=2022 |title=Human Language Understanding & Reasoning |url=https://www.amacad.org/publication/human-language-understanding-reasoning |journal=Daedalus |volume=151 |issue=2 |pages=127–138 |doi=10.1162/daed_a_01905 |s2cid=248377870|doi-access=free }}</ref>
Some notable LLMs are [[OpenAI]]'s [[Generative pre-trained transformer|GPT]] series of models (e.g., [[GPT-3.5]], [[GPT-4]] and [[GPT-4o]]; used in [[ChatGPT]] and [[Microsoft Copilot]]), [[Google]]'s [[Gemini (language model)|Gemini]] (used in [[Gemini (chatbot)|the chatbot of the same name]]), [[Meta Platforms|Meta]]'s [[LLaMA]] family of models, [[Anthropic]]'s [[Claude (language model)|Claude]] models, and [[Mistral AI]]'s models.
==History==
Before 2017, a few language models existed that were large relative to the capacities then available. In the 1990s, the [[IBM alignment models]] pioneered statistical language modelling. In 2001, a smoothed [[n-gram]] model trained on 0.3 billion words achieved then-state-of-the-art [[perplexity]].<ref>{{Citation |last=Goodman |first=Joshua |title=A Bit of Progress in Language Modeling |date=2001-08-09 |arxiv=cs/0108005 }}</ref> In the 2000s, as Internet use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"<ref>{{Cite journal |last1=Kilgarriff |first1=Adam |last2=Grefenstette |first2=Gregory |date=September 2003 |title=Introduction to the Special Issue on the Web as Corpus |url=https://direct.mit.edu/coli/article/29/3/333-347/1816 |journal=Computational Linguistics |volume=29 |issue=3 |pages=333–347 |doi=10.1162/089120103322711569 |issn=0891-2017}}</ref>), upon which they trained statistical language models.<ref>{{Cite journal |last1=Banko |first1=Michele |last2=Brill |first2=Eric |date=2001 |title=Scaling to very very large corpora for natural language disambiguation |url=http://dx.doi.org/10.3115/1073012.1073017 |journal=Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 |pages=26–33 |location=Morristown, NJ, USA |publisher=Association for Computational Linguistics |doi=10.3115/1073012.1073017}}</ref><ref>{{Cite journal |last1=Resnik |first1=Philip |last2=Smith |first2=Noah A. 
|date=September 2003 |title=The Web as a Parallel Corpus |url=https://direct.mit.edu/coli/article/29/3/349-380/1809 |journal=Computational Linguistics |volume=29 |issue=3 |pages=349–380 |doi=10.1162/089120103322711578 |issn=0891-2017|doi-access=free }}</ref> By 2009, statistical language models had come to dominate over symbolic language models in most language-processing tasks, as they can usefully ingest large datasets.<ref>{{Cite journal |last1=Halevy |first1=Alon |last2=Norvig |first2=Peter |last3=Pereira |first3=Fernando |date=March 2009 |title=The Unreasonable Effectiveness of Data |url=https://ieeexplore.ieee.org/document/4804817 |journal=IEEE Intelligent Systems |volume=24 |issue=2 |pages=8–12 |doi=10.1109/MIS.2009.36 |issn=1541-1672}}</ref>
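As a minimal illustration of the smoothed n-gram approach mentioned above, the sketch below trains an add-one (Laplace) smoothed bigram model on a toy corpus and computes perplexity, the evaluation metric cited for the 2001 model. The corpus, the smoothing choice, and the vocabulary handling are simplifications for brevity, not the 2001 paper's method.

```python
import math
from collections import Counter

# Add-one (Laplace) smoothed bigram language model on a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus[:-1])             # counts of each history word
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of (w1, w2) pairs

def prob(w1, w2):
    """P(w2 | w1) with add-one smoothing: never zero, so unseen
    bigrams still get a small probability."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def perplexity(tokens):
    """Perplexity = exp of the average negative log-probability;
    lower means the model predicts the text better."""
    log_p = sum(math.log(prob(a, b)) for a, b in zip(tokens, tokens[1:]))
    return math.exp(-log_p / (len(tokens) - 1))

print(round(perplexity("the cat sat".split()), 2))  # 3.46
```

The 2001 system used far more sophisticated smoothing (and 0.3 billion training words), but the quantity being optimized — perplexity over held-out text — is the one computed here.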
After neural networks became dominant in image processing around 2012, they were applied to language modelling as well. Google converted its translation service to [[Google Neural Machine Translation|Neural Machine Translation]] in 2016. Because this predated the transformer, it was implemented with [[seq2seq]] deep LSTM networks.[[File:The-Transformer-model-architecture.png|thumb|upright=1.3|An illustration of the main components of the transformer model from the original paper, where layers were normalized after (instead of before) multi-head attention]]
    At the 2017 [[NeurIPS]] conference, Google researchers introduced the [[transformer architecture]] in their landmark paper "[[Attention Is All You Need]]". This paper's goal was to improve upon 2014 [[Seq2seq]] technology,<ref>{{cite journal |last1=Vaswani |first1=Ashish |author1-link= Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link= Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |title=Attention is All you Need |journal=Advances in Neural Information Processing Systems |date=2017 |volume=30 |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |publisher=Curran Associates, Inc.}}</ref> and was based mainly on the [[attention (machine learning)|attention]] mechanism developed by Bahdanau et al. in 2014.<ref>{{cite arXiv |last1=Bahdanau |first1=Dzmitry |last2=Cho |first2=Kyunghyun |last3=Bengio |first3=Yoshua |title=Neural Machine Translation by Jointly Learning to Align and Translate |date=2014 |class=cs.CL |eprint=1409.0473}}</ref> The following year in 2018, [[BERT (language model)|BERT]] was introduced and quickly became "ubiquitous".<ref>{{Cite journal|last1=Rogers|first1=Anna|last2=Kovaleva|first2=Olga|last3=Rumshisky|first3=Anna|date=2020|title=A Primer in BERTology: What We Know About How BERT Works|url=https://aclanthology.org/2020.tacl-1.54|journal=Transactions of the Association for Computational Linguistics|volume=8|pages=842–866|doi=10.1162/tacl_a_00349|arxiv=2002.12327|s2cid=211532403}}</ref> Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model.
Although decoder-only [[GPT-1]] was introduced in 2018, it was [[GPT-2]] in 2019 that attracted widespread attention because [[OpenAI]] at first deemed it too powerful to release publicly, out of fear of malicious use.<ref>{{cite web |url=https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction |title=New AI fake text generator may be too dangerous to release, say creators |last=Hern |first=Alex |work=[[The Guardian]] |date=14 February 2019 |access-date = 20 January 2024}}</ref> [[GPT-3]] in 2020 went a step further and {{as of|2024|lc=y}} is available only via [[Web API|API]], with no option to download the model for local execution. But it was the 2022 consumer-facing browser-based [[ChatGPT]] that captured the imagination of the general public and caused some media hype and online buzz.<ref>{{cite web |url=https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months |title=ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months |author=<!--Not stated--> |date=November 30, 2023 |publisher=[[Euronews]] |access-date=January 20, 2024}}</ref> The 2023 [[GPT-4]] was praised for its increased accuracy and as a "holy grail" for its [[Multimodal learning|multimodal]] capabilities.<ref>{{cite web |url=https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/ |title=GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why |last=Heaven |first=Will |date=March 14, 2023 |publisher=[[MIT Technology Review]] |access-date=January 20, 2024}}</ref> OpenAI did not reveal the high-level architecture or the number of [[Parameter#Artificial Intelligence|parameters]] of GPT-4.
    Competing language models have for the most part been attempting to equal the GPT series, at least in terms of number of parameters.<ref>{{cite web |url=https://ourworldindata.org/grapher/artificial-intelligence-parameter-count?time=2017-09-05..latest |title=Parameters in notable artificial intelligence systems |author=<!--Not stated--> |date=November 30, 2023 |website=ourworldindata.org |access-date=January 20, 2024}}</ref>
Since 2022, [[Source-available software|source-available]] models have been gaining popularity, especially at first with [[BLOOM (language model)|BLOOM]] and [[LLaMA]], though both have restrictions on the field of use. [[Mistral AI]]'s models Mistral 7B and Mixtral 8x7B have the more permissive [[Apache License]]. {{As of|2024|6}}, the instruction-fine-tuned variant of the 70-billion-parameter Llama 3 model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more powerful than GPT-3.5 but not as powerful as GPT-4.<ref>{{cite web |url=https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard |title=LMSYS Chatbot Arena Leaderboard |author=<!--Not stated--> |website=huggingface.co |access-date=June 12, 2024}}</ref>
As of 2024, the largest and most capable models are all based on the transformer architecture. Some recent implementations are based on other architectures, such as [[recurrent neural network]] variants and [[Mamba (deep learning architecture)|Mamba]] (a [[state-space representation|state space]] model).<ref>{{cite arXiv |eprint=2305.13048 |last1=Peng |first1=Bo |last2=Alcaide |first2=Eric |last3=Anthony |first3=Quentin |last4=Albalak |first4=Alon |last5=Arcadinho |first5=Samuel |last6=Biderman |first6=Stella |last7=Cao |first7=Huanqi |last8=Cheng |first8=Xin |last9=Chung |first9=Michael |last10=Grella |first10=Matteo |author11=Kranthi Kiran GV |last12=He |first12=Xuzheng |last13=Hou |first13=Haowen |last14=Lin |first14=Jiaju |last15=Kazienko |first15=Przemyslaw |last16=Kocon |first16=Jan |last17=Kong |first17=Jiaming |last18=Koptyra |first18=Bartlomiej |last19=Lau |first19=Hayden |author20=Krishna Sri Ipsit Mantri |last21=Mom |first21=Ferdinand |last22=Saito |first22=Atsushi |last23=Song |first23=Guangyu |last24=Tang |first24=Xiangru |last25=Wang |first25=Bolun |last26=Wind |first26=Johan S. |last27=Wozniak |first27=Stanislaw |last28=Zhang |first28=Ruichong |last29=Zhang |first29=Zhenyuan |last30=Zhao |first30=Qihang |title=RWKV: Reinventing RNNs for the Transformer Era |date=2023 |class=cs.CL |display-authors=1 }}</ref><ref>{{Cite web |last=Merritt |first=Rick |date=2022-03-25 |title=What Is a Transformer Model? |url=https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/ |access-date=2023-07-25 |website=NVIDIA Blog }}</ref><ref>{{Citation |last1=Gu |first1=Albert |title=Mamba: Linear-Time Sequence Modeling with Selective State Spaces |date=2023-12-01 |arxiv=2312.00752 |last2=Dao |first2=Tri}}</ref>
