{
  "episode_id": "11af4e1a-5f45-467d-9aeb-46f4bb0bf034",
  "task": "How many more blocks (also denoted as layers) in BERT base encoder than the encoder from the architecture proposed in Attention is All You Need?",
  "file": "",
  "success": true,
  "num_turns": 19,
  "total_tokens": 52395,
  "trajectory": [
    {
      "turn_idx": 0,
      "action": "mark_step({\"step_index\": 0, \"step_status\": \"completed\", \"step_notes\": \"Clarified the question: compare the number of encoder blocks (layers) in BERT base vs. the encoder in the original Transformer (Attention Is All You Need). Known standard configurations: BERT base has 12 Transformer encoder layers; the original Transformer encoder (base configuration) has 6 layers. The difference is 6 more layers in BERT base. No external tools or files were needed. Proceeding to provide the final concise answer following the required template.\"})",
      "explanation": "Clarified the question: compare the number of encoder blocks (layers) in BERT base vs. the encoder in the original Transformer (Attention Is All You Need). Known standard configurations: BERT base has 12 Transformer encoder layers; the original Transformer encoder (base configuration) has 6 layers. The difference is 6 more layers in BERT base. No external tools or files were needed. Proceeding to provide the final concise answer following the required template.",
      "observation": "Step 0: step_status is completed, step_notes is Clarified the question: compare the number of encoder blocks (layers) in BERT base vs. the encoder in the original Transformer (Attention Is All You Need). Known standard configurations: BERT base has 12 Transformer encoder layers; the original Transformer encoder (base configuration) has 6 layers. The difference is 6 more layers in BERT base. No external tools or files were needed. Proceeding to provide the final concise answer following the required template. "
    },
    {
      "turn_idx": 1,
      "action": "tavily_search({\"query\": \"BERT base number of layers encoder L=12 source\"})",
      "explanation": "Collected and cross-verified authoritative counts for encoder blocks:\n- BERT base: 12 Transformer encoder blocks (12L), per Wikipedia (BERT page): \"passed forward through 12 Transformer encoder blocks\" and \"BERT BASE is written as 12L/768H\".\n- Transformer encoder in Attention Is All You Need: stack of N=6 identical layers, per Harvard NLP 'The Annotated Transformer' and Glass Box Medicine posts. \nSaved consolidated extraction with verbatim quotes and sources at: work_spacegaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md",
      "observation": "{'query': 'BERT base number of layers encoder L=12 source', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'url': 'https://medium.com/@vipra_singh/llm-architectures-explained-bert-part-8-c60c1d9ebc82', 'title': 'LLM Architectures Explained: BERT (Part 8) | by Vipra Singh - Medium', 'content': 'BERT Base: Number of Layers L=12, Size of the hidden layer, H=768, and Self-attention heads, A=12 with Total Parameters=110M; BERT Large', 'score': 0.82288796, 'raw_content': None}, {'url': 'https://www.kaggle.com/code/mdfahimreshm/bert-in-depth-understanding', 'title': 'BERT - In Depth Understanding - Kaggle', 'content': 'There are two types of BERT. In bert_base, it is used tweleve (12) encoder stacking. On the other hand, in bert_large twenty four (24) encoder stacking is used', 'score': 0.64826655, 'raw_content': None}, {'url': 'https://pmc.ncbi.nlm.nih.gov/articles/PMC8295980/', 'title': 'Implementation of the BERT-derived architectures to tackle ... - NIH', 'content': 'For example, the BERT architecture with 12 layers contains 12 encoder blocks. ... 
The number of encoders in architectures aligns with the number of Transformers', 'score': 0.609455, 'raw_content': None}, {'url': 'https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/', 'title': 'A Complete Guide to BERT with Code | Towards Data Science', 'content': 'Since BERT Base uses 768 embedding dimensions, the hidden layer contains 768 neurons which are connected to the final encoder block of the', 'score': 0.4424684, 'raw_content': None}, {'url': 'https://en.wikipedia.org/wiki/BERT_(language_model)', 'title': 'BERT (language model) - Wikipedia', 'content': '| | Development | | --- | | | | | | | | | | | | --- | --- | --- | --- | --- | --- | --- | --- | | Software | | | | | --- | --- | | A–C | * Accelerated Linear Algebra * AMP * *Actions on Google* * ALTS * American Fuzzy Lop \"American Fuzzy Lop (software)\") * *Android Cloud to Device Messaging* * Android Debug Bridge * Android NDK * Android Runtime * Android SDK * Android Studio * Angular \"Angular (web framework)\") * *AngularJS* * Apache Beam * APIs * App Engine * App Inventor * *App Maker* * App Runtime for Chrome * *AppJet* * Apps Script * AppSheet * ARCore * *Base* * Bazel \"Bazel (software)\") * BeyondCorp * Bigtable * BigQuery * Bionic \"Bionic (software)\") * Blockly * *Borg \"Borg (cluster manager)\")* * *Caja* * Cameyo * Chart API * Charts * *Chrome Frame* * Chromium \"Chromium (web browser)\") + Blink \"Blink (browser engine)\") * Closure Tools * *Cloud Connect* * Cloud Dataflow * Cloud Datastore * *Cloud Messaging* * Cloud Shell * Cloud Storage * *Code Search* * Compute Engine * Cpplint | | D–N | * *Dalvik \"Dalvik (software)\")* * Data Protocol * Dialogflow * Exposure Notification * Fast Pair * Fastboot * Federated Learning of Cohorts * File System * Firebase * Firebase Studio * Firebase Cloud Messaging * FlatBuffers * Flutter \"Flutter (software)\") * *Freebase \"Freebase (database)\")* * Gadgets * Ganeti * *Gears \"Gears (software)\")* * Gerrit \"Gerrit 
(software)\") * GLOP * gRPC * Gson * Guava * Guetzli * Guice * gVisor * GYP \"GYP (software)\") * JAX \"JAX (software)\") * Jetpack Compose * Keyhole Markup Language * Kubernetes * Kythe * LevelDB * Lighthouse \"Lighthouse (software)\") * Looker Studio * *lmctfy* * MapReduce * *Mashup Editor* * Matter \"Matter (standard)\") * Mobile Services * Namebench * Native Client', 'score': 0.2174542, 'raw_content': None}], 'response_time': 0.77, 'request_id': 'c261a67e-67ce-4c0c-8ea9-274d9b28659f'}"
    },
    {
      "turn_idx": 2,
      "action": "tavily_search({\"query\": \"Attention Is All You Need encoder number of layers 6\"})",
      "explanation": "",
      "observation": "{'query': 'Attention Is All You Need encoder number of layers 6', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'url': 'https://www.reddit.com/r/MachineLearning/comments/6gwqiw/r_170603762_attention_is_all_you_need_sota_nmt/', 'title': '[R] [1706.03762] Attention Is All You Need <-- Sota NMT; less compute', 'content': '... sublayers, each of which contains four linear projections, plus the attention logic. So the total number of layers is much larger than 6.', 'score': 0.9999621, 'raw_content': None}, {'url': 'https://datascience.stackexchange.com/questions/121015/what-is-purpose-of-stacking-n-6-blocks-of-encoder-and-decoder-in-transformer', 'title': 'What is purpose of stacking N=6 blocks of encoder and decoder in ...', 'content': 'Stack Exchange network consists of 183 Q&A communities including Stack Overflow, the largest, most trusted online community for developers to learn, share their knowledge, and build their careers. # What is purpose of stacking N=6 blocks of encoder and decoder in transformer? What is purpose of stacking $N=6$ blocks of encoder and decoder? The layers that are closer to the token embeddings represent lower-level token relations, while deeper layers learn to represent higher-level information present in the input sequences. You can check the studies that probe Transformer models to understand what kind of language information is best represented in each layer, e.g.:. Find the answer to your question by asking. * The 2025 Stack Overflow and Stack Exchange wrap—our top ten questions of the... What is the difference between GPT blocks and Transformer Decoder blocks? 
What are the inputs to the first decoder layer in a Transformer model during the training phase?', 'score': 0.9937588, 'raw_content': None}, {'url': 'https://arxiv.org/abs/1706.03762', 'title': '[1706.03762] Attention Is All You Need - arXiv', 'content': \"View a PDF of the paper titled Attention Is All You Need, by Ashish Vaswani and 7 other authors. The best performing models also connect the encoder and decoder through an attention mechanism. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. | Cite as: | arXiv:1706.03762 [cs.CL] |. |  | (or  arXiv:1706.03762v7 [cs.CL] for this version) |. **** Mon, 12 Jun 2017 17:57:34 UTC (1,102 KB). **** Mon, 19 Jun 2017 16:49:45 UTC (1,125 KB). **[v7]** Wed, 2 Aug 2023 00:41:18 UTC (1,124 KB). View a PDF of the paper titled Attention Is All You Need, by Ashish Vaswani and 7 other authors. ### References & Citations. # Bibliographic and Citation Tools. Have an idea for a project that will add value for arXiv's community?\", 'score': 0.98951083, 'raw_content': None}, {'url': 'https://en.wikipedia.org/wiki/Attention_Is_All_You_Need', 'title': 'Attention Is All You Need - Wikipedia', 'content': 'The paper introduced a new deep learning architecture known as the transformer \"Transformer (machine learning model)\"), based on the attention mechanism proposed in 2014 by Bahdanau *et al.* It is considered a foundational paper in modern artificial intelligence, and a main contributor to the AI boom, as the transformer approach has become the main architecture of a wide variety of AI, such as large language models. Seq2seq models with attention (including self-attention) still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. 
In 2016, *decomposable attention* applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved SOTA result in textual entailment with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention *without* recurrence would be sufficient for language translation, thus the title \"attention is *all* you need\".', 'score': 0.97997653, 'raw_content': None}, {'url': 'https://www.linkedin.com/pulse/understanding-groundbreaking-attention-all-you-need-research-disansa-becnc', 'title': \"Understanding the Groundbreaking 'Attention Is All You Need ...\", 'content': '* Transformer is the first transduction (task of transforming input sequences into output sequences) model relying entirely on self-attention without using sequence-aligned RNNs or convolution. * When choosing to use self attention three conditions were considered: One is the total computational complexity per layer Amout of the computation that be parallelized measuring by the minimum number of sequential operations required The third is the path between long range dependencies in the network Learning long range dependencies is a key challenge One key factor affecting ability to learn such dependencies is the length of the path forward and backward signals have to traverse in the network Shorter these paths between any combination of positions in the input and output sequences, easier it is to learn long range dependencies Maximum path length between any two input and output positions are compared as well. * The transformer was presented as the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures.', 'score': 0.9683812, 'raw_content': None}], 'response_time': 0.64, 'request_id': '380bf943-9dae-43e3-8131-80aa4a6cff0a'}"
    },
    {
      "turn_idx": 3,
      "action": "fetch_website_content({\"website_url\": \"https://arxiv.org/abs/1706.03762\"})",
      "explanation": "",
      "observation": "\n[1706.03762] Attention Is All You Need\nComputer Science > Computation and Language\narXiv:1706.03762 (cs)\n[Submitted on 12 Jun 2017 ( v1 ), last revised 2 Aug 2023 (this version, v7)]\nTitle: Attention Is All You Need\nAuthors: Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N. Gomez , Lukasz Kaiser , Illia Polosukhin\nAbstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. 
On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.\nComments:\n15 pages, 5 figures\nSubjects:\nComputation and Language (cs.CL) ; Machine Learning (cs.LG)\nCite as:\narXiv:1706.03762 [cs.CL]\n(or\narXiv:1706.03762v7 [cs.CL] for this version)\nhttps://doi.org/10.48550/arXiv.1706.03762\nSubmission history From: Llion Jones [v1]\nMon, 12 Jun 2017 17:57:34 UTC (1,102 KB)\n[v2]\nMon, 19 Jun 2017 16:49:45 UTC (1,125 KB)\n[v3]\nTue, 20 Jun 2017 05:20:02 UTC (1,125 KB)\n[v4]\nFri, 30 Jun 2017 17:29:30 UTC (1,124 KB)\n[v5]\nWed, 6 Dec 2017 03:30:32 UTC (1,124 KB)\n[v6]\nMon, 24 Jul 2023 00:48:54 UTC (1,124 KB)\n[v7]\nWed, 2 Aug 2023 00:41:18 UTC (1,124 KB)\n"
    },
    {
      "turn_idx": 4,
      "action": "fetch_website_content({\"website_url\": \"https://en.wikipedia.org/wiki/Attention_Is_All_You_Need\"})",
      "explanation": "",
      "observation": "\nAttention Is All You Need - Wikipedia\nFrom Wikipedia, the free encyclopedia\n2017 research paper by Google\nAn illustration of main components of the transformer model from the paper\n\" Attention Is All You Need \" [ 1 ] is a 2017 landmark [ 2 ] [ 3 ] research paper in machine learning authored by eight scientists working at Google . The paper introduced a new deep learning architecture known as the transformer , based on the attention mechanism proposed in 2014 by Bahdanau et al. [ 4 ] It is considered a foundational [ 5 ] paper in modern artificial intelligence , and a main contributor to the AI boom , as the transformer approach has become the main architecture of a wide variety of AI, such as large language models . 
[ 6 ] [ 7 ] At the time, the focus of the research was on improving Seq2seq techniques for machine translation , but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal generative AI . [ 1 ]\nSome early examples that the team tried their Transformer architecture on included English-to-German translation, generating Wikipedia articles on \"The Transformer\", and parsing . These convinced the team that the Transformer is a general purpose language model, and not just good for translation. [ 8 ]\nAs of 2025, [update] the paper has been cited more than 173,000 times, placing it among the top ten most-cited papers of the 21st century. [ 9 ] After the paper was published by Google, each of the authors left the company to join other companies or to found startups .\nBackground [ edit ]\nThe authors of the paper are: Ashish Vaswani , Noam Shazeer , Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez , Łukasz Kaiser, and Illia Polosukhin. All eight authors were \"equal contributors\" to the paper; the listed order was randomized (according to the paper itself). After the paper, each of the authors left Google to join other companies or to found startups . [ 10 ] [ 11 ]\nThe paper's title is a reference to the song \" All You Need Is Love \" by the Beatles . [ 12 ] The name \"Transformer\" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word. [ 8 ] An early design document was titled \"Transformers: Iterative Self-Attention and Processing for Various Tasks\", and included an illustration of six characters from the Transformers franchise. The team was named Team Transformer. [ 12 ]\nMethods discussed and introduced [ edit ]\nThe paper is most well known for the introduction of the Transformer architecture, which forms the underlying architecture for most forms of modern large language models (LLMs) . 
A key reason for why the architecture is preferred by most modern LLMs is the parallelizability of the architecture over its predecessors. This ensures that the operations necessary for training can be accelerated on a GPU allowing both faster training times and models of bigger sizes to be trained.\n The following mechanisms were introduced by the paper as part of the development of the transformer architecture.\n Scaled dot-product attention & self-attention\nThe use of the scaled dot-product attention and self-attention mechanism instead of a recurrent neural network or long short-term memory (which rely on recurrence instead) allow for better performance as described in the following paragraph. The paper described the scaled dot-product attention as follows:\n$\\mathrm{Attention}(Q,K,V) := \\mathrm{softmax}\\left(\\frac{Q \\times K^{T}}{\\sqrt{d_{k}}}\\right) \\times V$\nwhere $Q$, $K$, $V$ are respectively the query, key, value matrices, and $d_{k}$ is the dimension of the values.\n Since the model relies on Query ( Q ), Key ( K ) and Value ( V ) matrices that come from the same source (i.e. the input sequence or context window), this eliminates the need for RNNs completely ensuring parallelizability for the architecture. This differs from the original form of the Attention mechanism introduced in 2014. 
Additionally, the paper also discusses the use of an additional scaling factor that was found to be most effective with respect to the dimension of the key vectors (represented as $d_{k}$ and initially set to 64 within the paper) in the manner shown above.\n In the specific context of translation which the paper focused on, the Query and Key matrices are usually represented in embeddings corresponding to the source language while the Value matrix corresponds to the target language.\n Multi-head attention\nIn the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (limited typically by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps. Multi-head attention enhances this process by introducing multiple parallel attention heads. Each attention head learns different linear projections of the Q, K, and V matrices. This allows the model to capture different aspects of the relationships between words in the sequence simultaneously, rather than focusing on a single aspect.\n By doing this, multi-head attention ensures that the input embeddings are updated from a more varied and diverse set of perspectives. After the attention outputs from all heads are calculated, they are concatenated and passed through a final linear transformation to generate the output.\n Positional encoding\nSince the Transformer model is not a seq2seq model and does not rely on the sequence of the text in order to perform encoding and decoding, the paper relied on the use of sine and cosine wave functions to encode the position of the token into the embedding. 
The methods introduced in the paper are discussed below:\n$PE_{(\\mathrm{pos},2i)} = \\sin(\\mathrm{pos}/10000^{2i/d_{\\mathrm{model}}})$\n$PE_{(\\mathrm{pos},2i+1)} = \\cos(\\mathrm{pos}/10000^{2i/d_{\\mathrm{model}}})$\nwherein $\\mathrm{pos}$, $i$, $d_{\\mathrm{model}}$ correspond to the position of the word, the current dimension index and the dimension of the model respectively. The sine function is used for even indices of the embedding while the cosine function is used for odd indices. The resultant $PE$ embedding is then added to the word at that corresponding position with respect to the current context window. The paper specifically comments on why this method was chosen describing:\n \"We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.\" [ 1 ]\nHistorical context\nMain articles: Transformer (deep learning architecture) § History , and Seq2seq § History\nSee also: Timeline of machine learning\nPredecessors\nFor many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.\n A key breakthrough was LSTM (1995), [ note 1 ] an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. 
One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called multiplicative units . [ 13 ] Neural networks using multiplicative units were later called sigma-pi networks [ 14 ] or higher-order networks . [ 15 ] LSTM became the standard architecture for long sequence modelling until the 2017 publication of transformers. However, LSTM still used sequential processing, like most other RNNs. [ note 2 ] Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.\n Modern transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input. [ 16 ] One of its two networks has \"fast weights\" or \"dynamic links\" (1981). [ 17 ] [ 18 ] [ 19 ] A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries. [ 16 ] This was later shown to be equivalent to the unnormalized linear transformer. [ 20 ] [ 21 ]\nAttention with seq2seq [ edit ]\nMain article: Seq2seq § History\nThe idea of encoder–decoder sequence transduction had been developed in the early 2010s; commonly cited as the originators that produced seq2seq are two concurrently published papers from 2014. [ 22 ] [ 23 ]\nA 380M-parameter model for machine translation uses two long short-term memories (LSTM). [ 23 ] Its architecture consists of two parts. The encoder is an LSTM that takes in a sequence of tokens and turns it into a vector. The decoder is another LSTM that converts the vector into a sequence of tokens. Similarly, another 130M-parameter model used gated recurrent units (GRU) instead of LSTM. 
[ 22 ] Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq. [ 24 ] [ 25 ]\nThese early seq2seq models had no attention mechanism, and the state vector is accessible only after the last word of the source text was processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved. This is because the input is processed sequentially by one recurrent network into a fixed -size output vector, which is then processed by another recurrent network into an output. If the input is long, then the output vector would not be able to contain all relevant information, degrading the output. As evidence, reversing the input sentence improved seq2seq translation. [ 26 ]\nThe RNN search model introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem (of the fixed-size output vector), allowing the model to process long-distance dependencies more easily. The name is because it \"emulates searching through a source sentence during decoding a translation\". [ 4 ]\nThe relative performances were compared between global (that of RNN search ) and local (sliding window) attention model architectures for machine translation, finding that mixed attention had higher quality than global attention, while local attention reduced translation time. [ 27 ]\nIn 2016, Google Translate was revamped to Google Neural Machine Translation , which replaced the previous model based on statistical machine translation . The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. [ 28 ] It took nine months to develop, and it outperformed the statistical approach, which took ten years to develop. 
[ 29 ]\nParallelizing attention [ edit ]\nMain article: Attention (machine learning) § History\nSeq2seq models with attention (including self-attention) still suffered from the same issue with recurrent networks, which is that they are hard to parallelize , which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied a self-attention mechanism to feedforward networks , which are easy to parallelize, and achieved SOTA result in textual entailment with an order of magnitude fewer parameters than LSTMs. [ 30 ] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence would be sufficient for language translation, thus the title \"attention is all you need\". [ 31 ] That hypothesis was against conventional wisdom at the time, and even his father Hans Uszkoreit , a well-known computational linguist, was skeptical. [ 31 ] In the same year, self-attention (called intra-attention or intra-sentence attention ) was proposed for LSTMs. [ 32 ]\nIn 2017, the original (100M-sized) encoder–decoder transformer model was proposed in the \" Attention is all you need \" paper. At the time, the focus of the research was on improving seq2seq for machine translation , by removing its recurrence to process all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance. [ 1 ] This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence. Its parallelizability was an important factor to its widespread use in large neural networks. [ 33 ]\nAI boom era [ edit ]\nAs early as spring 2017, even before the \"Attention is all you need\" preprint was published, one of the co-authors applied the \"decoder-only\" variation of the architecture to generate fictitious Wikipedia articles. 
[ 34 ] Transformer architecture is now used alongside many generative models that contribute to the ongoing AI boom .\n In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings , improving upon the line of research from bag of words and word2vec . It was followed by BERT (2018), an encoder-only transformer model. [ 35 ] In 2019 October, Google started using BERT to process search queries. [ 36 ] In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a transformer-encoder–RNN-decoder model. [ 37 ]\nStarting in 2018, the OpenAI GPT series of decoder-only transformers became state of the art in natural language generation . In 2022, a chatbot based on GPT-3, ChatGPT , became unexpectedly [ 38 ] popular, triggering a boom around large language models . [ 39 ] [ 40 ]\nSince 2020, transformers have been applied in modalities beyond text, including the vision transformer , [ 41 ] speech recognition, [ 42 ] robotics, [ 43 ] and multimodal . [ 44 ] The vision transformer, in turn, stimulated new developments in convolutional neural networks . 
[ 45 ] Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024), [ 46 ] and Sora (2024), use transformers to analyse input data (like text prompts) by breaking it down into \"tokens\" and then calculating the relevance between each token using self-attention, which helps the model understand the context and relationships within the data.\nTraining [ edit ]\nWhile the primary focus of the paper at the time was to improve machine translation, the paper also discussed the use of the architecture on English Constituency Parsing , both with limited and large-sized datasets, achieving a high-score without specific tuning for the task indicating the promising nature of the model for use in a wide-variety of general purpose of seq2seq tasks.\n Dataset\nThe English-to-German translation model was trained on the 2014 WMT (Workshop on Statistical Machine Translation) English-German dataset, consisting of nearly 4.5 million sentences derived from TED Talks and high-quality news articles. A separate translation model was trained on the much larger 2014 WMT English-French dataset, consisting of 36 million sentences. Both datasets were encoded with byte-pair encoding.\n Hardware\nThe models were trained using 8 NVIDIA P100 GPUs . The base models were trained for 100,000 steps and the big models were trained for 300,000 steps - each step taking about 0.4 seconds to complete for the base models and 1.0 seconds for the big models. The base model trained for a total of 12 hours, and the big model trained for a total of 3.5 days. Both the base and big models outperforms the 2017 state-of-the-art in both English-German and English-French while achieving the comparatively lowest training cost. [ 1 ] The estimated computing cost was 0.089 petaFLOP/s–days. 
[ 47 ]\nHyperparameters and regularization\nFor their 100M-parameter Transformer model, the authors increased the learning rate linearly for the first 4000 (warmup) steps and decreased it proportionally to inverse square root of the current step number. Dropout layers were applied to the output of each sub-layer before normalization, the sums of the embeddings, and the positional encodings. The dropout rate was set to 0.1. Label smoothing was applied with a value of 0.1 which \"improves accuracy and BLEU score\". [ 1 ]\nNotes [ edit ]\n^ Gated recurrent units (2014) further reduced its complexity.\n^ Some architectures, such as RWKV or state space models, avoid the issue.\nReferences [ edit ]\n^ a b c d e f Vaswani, Ashish ; Shazeer, Noam ; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N ; Kaiser, Łukasz; Polosukhin, Illia (December 2017). \"Attention is All you Need\" (PDF) . In I. Guyon and U. Von Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett (ed.). 31st Conference on Neural Information Processing Systems (NIPS) . Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. arXiv : 1706.03762 .\n^ Love, Julia (10 July 2023). \"AI Researcher Who Helped Write Landmark Paper Is Leaving Google\" . Bloomberg News . Retrieved 1 April 2024 .\n^ Goldman, Sharon (20 March 2024). \" 'Attention is All You Need' creators look beyond Transformers for AI at Nvidia GTC: 'The world needs something better' \" . VentureBeat . Retrieved 1 April 2024 .\n^ a b Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (19 May 2016). \"Neural Machine Translation by Jointly Learning to Align and Translate\". arXiv : 1409.0473 [ cs.CL ].\n^ Shinde, Gitanjali; Wasatkar, Namrata; Mahalle, Parikshit (6 June 2024). Data-Centric Artificial Intelligence for Multidisciplinary Applications . CRC Press . p. 75. ISBN   9781040031131 .\n^ Toews, Rob (3 September 2023). \"Transformers Revolutionized AI. What Will Replace Them?\" . 
Forbes . Archived from the original on 26 September 2023 . Retrieved 3 December 2023 .\n^ Murgia, Madhumita (23 July 2023). \"Transformers: the Google scientists who pioneered an AI revolution\" . Financial Times . Archived from the original on 28 December 2023 . Retrieved 22 March 2024 .\n^ a b Marche, Stephen (23 August 2024). \"Was Linguistic A.I. Created by Accident?\" . The New Yorker . ISSN   0028-792X . Retrieved 24 August 2024 .\n^ Pearson, Helen; Ledford, Heidi; Hutson, Matthew; Van Noorden, Richard (15 April 2025). \"Exclusive: the most-cited papers of the twenty-first century\" . Nature . 640 (8059): 588– 592. Bibcode : 2025Natur.640..588P . doi : 10.1038/d41586-025-01125-9 . PMID   40234577 . Retrieved 18 April 2025 .\n^ Murgia, Madhumita (23 July 2023). \"Transformers: the Google scientists who pioneered an AI revolution\" . Financial Times . Retrieved 22 March 2025 .\n^ \"Meet the $4 Billion AI Superstars That Google Lost\" . Bloomberg . 13 July 2023 – via www.bloomberg.com.\n^ a b Levy, Steven. \"8 Google Employees Invented Modern AI. Here's the Inside Story\" . Wired . ISSN   1059-1028 . Retrieved 20 March 2024 .\n^ Feldman, J. A.; Ballard, D. H. (1 July 1982). \"Connectionist models and their properties\" . Cognitive Science . 6 (3): 205– 254. doi : 10.1016/S0364-0213(82)80001-3 . ISSN   0364-0213 .\n^ Rumelhart, David E.; McClelland, James L.; Hinton, Geoffrey E. (29 July 1987). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF) . Cambridge, Mass: Bradford Books. ISBN   978-0-262-68053-0 .\n^ Giles, C. Lee; Maxwell, Tom (1 December 1987). \"Learning, invariance, and generalization in high-order neural networks\" . Applied Optics . 26 (23): 4972– 4978. doi : 10.1364/AO.26.004972 . ISSN   0003-6935 . PMID   20523475 .\n^ a b Schmidhuber, Jürgen (1992). \"Learning to control fast-weight memories: an alternative to recurrent nets\" (PDF) . Neural Computation . 4 (1): 131– 139. 
doi : 10.1162/neco.1992.4.1.131 . S2CID   16683347 .\n^ Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981. http://cogprints.org/1380/1/vdM_correlation.pdf See Reprint in Models of Neural Networks II, chapter 2, pages 95–119. Springer, Berlin, 1994.\n^ Jerome A. Feldman, \"Dynamic connections in neural networks,\" Biological Cybernetics, vol. 46, no. 1, pp. 27–39, Dec. 1982.\n^ Hinton, Geoffrey E.; Plaut, David C. (1987). \"Using Fast Weights to Deblur Old Memories\" . Proceedings of the Annual Meeting of the Cognitive Science Society . 9 .\n^ Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). \"Transformers are RNNs: Fast autoregressive Transformers with linear attention\" . ICML 2020 . PMLR. pp.  5156– 5165.\n^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). \"Linear Transformers Are Secretly Fast Weight Programmers\". ICML 2021 . Springer. pp.  9355– 9366.\n^ a b Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (October 2014). \"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation\" . In Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) . Doha, Qatar: Association for Computational Linguistics. pp.  1724– 1734. arXiv : 1406.1078 . doi : 10.3115/v1/D14-1179 .\n^ a b Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 December 2014). \"Sequence to sequence learning with neural networks\". arXiv : 1409.3215 [ cs.CL ]. [first version posted to arXiv on 10 Sep 2014]\n^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). \"Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling\". arXiv : 1412.3555 [ cs.NE ].\n^ Gruber, N.; Jockisch, A. 
(2020), \"Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?\", Frontiers in Artificial Intelligence , 3 40, doi : 10.3389/frai.2020.00040 , PMC   7861254 , PMID   33733157 , S2CID   220252321\n^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V (2014). \"Sequence to Sequence Learning with Neural Networks\" . Advances in Neural Information Processing Systems . 27 . Curran Associates, Inc. arXiv : 1409.3215 .\n^ Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (2015). \"Effective Approaches to Attention-based Neural Machine Translation\". arXiv : 1508.04025 [ cs.CL ].\n^ Wu, Yonghui; et al. (1 September 2016). \"Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation\". arXiv : 1609.08144 [ cs.CL ].\n^ Lewis-Kraus, Gideon (14 December 2016). \"The Great A.I. Awakening\" . The New York Times . ISSN   0362-4331 . Archived from the original on 24 May 2023 . Retrieved 22 June 2023 .\n^ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (25 September 2016). \"A Decomposable Attention Model for Natural Language Inference\". arXiv : 1606.01933 [ cs.CL ].\n^ a b Levy, Steven. \"8 Google Employees Invented Modern AI. Here's the Inside Story\" . Wired . ISSN   1059-1028 . Archived from the original on 20 March 2024 . Retrieved 6 August 2024 .\n^ Cheng, Jianpeng; Dong, Li; Lapata, Mirella (November 2016). \"Long Short-Term Memory-Networks for Machine Reading\" . In Su, Jian; Duh, Kevin; Carreras, Xavier (eds.). Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing . Austin, Texas: Association for Computational Linguistics. pp.  551– 561. doi : 10.18653/v1/D16-1053 .\n^ Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (10 December 2023), RWKV: Reinventing RNNs for the transformer Era , arXiv : 2305.13048\n^ Marche, Stephen (23 August 2024). \"Was Linguistic A.I. 
Created by Accident?\" . The New Yorker . ISSN   0028-792X . Retrieved 27 August 2024 .\n^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\". arXiv : 1810.04805v2 [ cs.CL ].\n^ \"Google: BERT now used on almost every English query\" . Search Engine Land . 15 October 2020 . Retrieved 24 November 2020 .\n^ Caswell, Isaac; Liang, Bowen (8 June 2020). \"Recent Advances in Google Translate\" . Google Research . Archived from the original on 4 July 2024 . Retrieved 7 August 2024 .\n^ \"The inside story of how ChatGPT was built from the people who made it\" . MIT Technology Review . Retrieved 6 August 2024 .\n^ \"Improving language understanding with unsupervised learning\" . openai.com . 11 June 2018. Archived from the original on 18 March 2023 . Retrieved 18 March 2023 .\n^ finetune-transformer-lm , OpenAI, 11 June 2018 , retrieved 1 May 2023\n^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (3 June 2021). \"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale\". arXiv : 2010.11929 [ cs.CV ].\n^ Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). \"Conformer: Convolution-augmented Transformer for Speech Recognition\". 
arXiv : 2005.08100 [ eess.AS ].\n^ Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (24 June 2021), Decision Transformer: Reinforcement Learning via Sequence Modeling , arXiv : 2106.01345\n^ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz (19 November 2022), Rethinking Attention with Performers , arXiv : 2009.14794\n^ Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). A ConvNet for the 2020s . Conference on Computer Vision and Pattern Recognition ( CVPR ). pp.  11976– 11986.\n^ Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (5 March 2024), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , arXiv : 2403.03206\n^ \"AI and compute\" . openai.com . 9 June 2022 . Retrieved 29 April 2025 .\nExternal links [ edit ]\n\"Attention Is All You Need\" on Google Research\n\"Attention Is All You Need\" on arXiv\nUszkoreit, Jakob (31 August 2017). \"Transformer: A Novel Neural Network Architecture for Language Understanding\" . research.google . Retrieved 9 August 2024 . 
A concurrent blog post on Google Research blog.\nRetrieved from \" https://en.wikipedia.org/w/index.php?title=Attention_Is_All_You_Need&oldid=1328196962 \"\nThis page was last edited on 18 December 2025, at 12:51  (UTC) .\nText is available under the Creative Commons Attribution-ShareAlike 4.0 License ;\nadditional terms may apply.\n"
    },
    {
      "turn_idx": 5,
      "action": "fetch_website_content({\"website_url\": \"https://en.wikipedia.org/wiki/BERT_(language_model)\"})",
      "explanation": "",
      "observation": "\nBERT (language model) - Wikipedia\nFrom Wikipedia, the free encyclopedia\nSeries of language models developed by Google AI\nBidirectional encoder representations from transformers (BERT)\nOriginal author: Google AI\nInitial release: October 31, 2018\nRepository: github.com/google-research/bert\nType: Large language model, Transformer, Foundation model\nLicense: Apache 2.0\nWebsite: arxiv.org/abs/1810.04805\nBidirectional encoder representations from transformers ( BERT ) is a language model introduced in October 2018 by researchers at Google . 
[ 1 ] [ 2 ] It learns to represent text as a sequence of vectors using self-supervised learning . It uses the encoder-only transformer architecture. BERT dramatically improved the state of the art for large language models . As of 2020 [update] , BERT is a ubiquitous baseline in natural language processing (NLP) experiments. [ 3 ]\nBERT is trained by masked token prediction and next sentence prediction. With this training, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2 . [ 4 ] It found applications for many natural language processing tasks, such as coreference resolution and polysemy resolution. [ 5 ] It improved on ELMo and spawned the study of \"BERTology\", which attempts to interpret what is learned by BERT. [ 3 ]\nBERT was originally implemented in the English language at two model sizes, BERT BASE (110 million parameters) and BERT LARGE (340 million parameters). Both were trained on the Toronto BookCorpus [ 6 ] (800M words) and English Wikipedia (2,500M words). [ 1 ] : 5  The weights were released on GitHub . [ 7 ] On March 11, 2020, 24 smaller models were released, the smallest being BERT TINY with just 4 million parameters. [ 7 ]\nArchitecture [ edit ]\nHigh-level schematic diagram of BERT. It takes in a text, tokenizes it into a sequence of tokens, add in optional special tokens, and apply a Transformer encoder. The hidden states of the last layer can then be used as contextual word embeddings.\nBERT is an \"encoder-only\" transformer architecture. At a high level, BERT consists of 4 modules:\nTokenizer: This module converts a piece of English text into a sequence of integers (\"tokens\").\nEmbedding : This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. 
It represents the conversion of discrete token types into a lower-dimensional Euclidean space .\nEncoder: a stack of Transformer blocks with self-attention , but without causal masking.\nTask head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an \"un-embedding layer\".\nThe task head is necessary for pre-training, but it is often unnecessary for so-called \"downstream tasks,\" such as question answering or sentiment classification . Instead, one removes the task head and replaces it with a newly initialized module suited for the task, and finetune the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning . [ 1 ] [ 8 ]\nEncoder-only attention is all-to-all.\nEmbedding [ edit ]\nThis section describes the embedding used by BERT BASE . The other one, BERT LARGE , is similar, just larger.\n The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte-pair encoding . Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] (\"unknown\").\nThe three kinds of embedding used by BERT: token type, position, and segment type.\nThe first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.\nToken type: The token type is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.\nPosition: The position embeddings are based on a token's position in the sequence. BERT uses absolute position embeddings, where each position in a sequence is mapped to a real-valued vector. 
Each dimension of the vector consists of a sinusoidal function that takes the position in the sequence as input.\nSegment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token. All prior tokens are type-0.\nThe three embedding vectors are added together representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token. After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.\nArchitectural family\nThe encoder stack of BERT has 2 free parameters: $L$, the number of layers, and $H$, the hidden size. There are always $H/64$ self-attention heads, and the feed-forward/filter size is always $4H$. By varying these two numbers, one obtains an entire family of BERT models. [ 9 ]\nFor BERT:\nthe feed-forward size and filter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.\nthe hidden size and embedding size are synonymous. Both of them denote the number of real numbers used to represent a token.\nThe notation for encoder stack is written as L/H. 
For example, BERT BASE is written as 12L/768H, BERT LARGE as 24L/1024H, and BERT TINY as 2L/128H.\nTraining [ edit ]\nPre-training [ edit ]\nBERT was pre-trained simultaneously on two tasks: [ 10 ]\nMasked language modeling (MLM): In this task, BERT ingests a sequence of words, where one word may be randomly changed (\"masked\"), and BERT tries to predict the original words that had been changed. For example, in the sentence \"The cat sat on the [MASK] ,\" BERT would need to predict \"mat.\" This helps BERT learn bidirectional context, meaning it understands the relationships between words not just from left to right or right to left but from both directions at the same time.\nNext sentence prediction (NSP): In this task, BERT is trained to predict whether one sentence logically follows another. For example, given two sentences, \"The cat sat on the mat\" and \"It was a sunny day\", BERT has to decide if the second sentence is a valid continuation of the first one. This helps BERT understand relationships between sentences, which is important for tasks like question answering or document classification.\nMasked language modeling [ edit ]\nThe masked language modeling task\nIn masked language modeling, 15% of tokens would be randomly selected for masked-prediction task, and the training objective was to predict the masked token given its context. In more detail, the selected token is:\nreplaced with a [MASK] token with probability 80%,\nreplaced with a random word token with probability 10%,\nnot replaced with probability 10%.\nThe reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. A trained BERT model might be applied to word representation (like Word2Vec ), where it would be run over sentences not containing any [MASK] tokens. 
It is later found that more diverse training objectives are generally better. [ 11 ]\nAs an illustrative example, consider the sentence \"my dog is cute\". It would first be divided into tokens like \"my 1 dog 2 is 3 cute 4 \". Then a random token in the sentence would be picked. Let it be the 4th one \"cute 4 \". Next, there would be three possibilities:\nwith probability 80%, the chosen token is masked, resulting in \"my 1 dog 2 is 3 [MASK] 4 \";\nwith probability 10%, the chosen token is replaced by a uniformly sampled random token, such as \"happy\", resulting in \"my 1 dog 2 is 3 happy 4 \";\nwith probability 10%, nothing is done, resulting in \"my 1 dog 2 is 3 cute 4 \".\nAfter processing the input text, the model's 4th output vector is passed to its decoder layer, which outputs a probability distribution over its 30,000-dimensional vocabulary space.\nNext sentence prediction [ edit ]\nThe next sentence prediction task\nGiven two sentences, the model predicts if they appear sequentially in the training corpus, outputting either [IsNext] or [NotNext] . During training, the algorithm sometimes samples two sentences from a single continuous span in the training corpus, while at other times, it samples two sentences from two discontinuous spans.\n The first sentence starts with a special token, [CLS] (for \"classify\"). The two sentences are separated by another special token, [SEP] (for \"separate\"). 
After processing the two sentences, the final vector for the [CLS] token is passed to a linear layer for binary classification into [IsNext] and [NotNext] .\n For example:\nGiven \" [CLS] my dog is cute [SEP] he likes playing [SEP] \", the model should predict [IsNext] .\nGiven \" [CLS] my dog is cute [SEP] how do magnets work [SEP] \", the model should predict [NotNext] .\nFine-tuning [ edit ]\nFine-tuned tasks for BERT [ 12 ]\nSentiment classification\nSentence classification\nAnswering multiple-choice questions\nPart-of-speech tagging\nBERT is meant as a general pretrained model for various applications in natural language processing. That is, after pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification , and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation. [ 12 ]\nThe original BERT paper published results demonstrating that a small amount of finetuning (for BERT LARGE , 1 hour on 1 Cloud TPU) allowed it to achieved state-of-the-art performance on a number of natural language understanding tasks: [ 1 ]\nGLUE ( General Language Understanding Evaluation ) task set (consisting of 9 tasks);\nSQuAD (Stanford Question Answering Dataset [ 13 ] ) v1.1 and v2.0;\nSWAG (Situations With Adversarial Generations [ 14 ] ).\nIn the original paper, all parameters of BERT are fine-tuned, and recommended that, for downstream applications that are text classifications, the output token at the [CLS] input token is fed into a linear-softmax layer to produce the label outputs. [ 1 ]\nThe original code base defined the final linear layer as a \"pooler layer\", in analogy with global pooling in computer vision, even though it simply discards all output tokens except the one corresponding to [CLS] . 
[ 15 ]\nCost [ edit ]\nBERT was trained on the BookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers.\n Training BERT BASE on 4 cloud TPU (16 TPU chips total) took 4 days, at an estimated cost of 500 USD. [ 7 ] Training BERT LARGE on 16 cloud TPU (64 TPU chips total) took 4 days. [ 1 ]\nInterpretation [ edit ]\nLanguage models like ELMo, GPT-2, and BERT, spawned the study of \"BERTology\", which attempts to interpret what is learned by these models. Their performance on these natural language understanding tasks are not yet well understood. [ 3 ] [ 16 ] [ 17 ] Several research publications in 2018 and 2019 focused on investigating the relationship behind BERT's output as a result of carefully chosen input sequences, [ 18 ] [ 19 ] analysis of internal vector representations through probing classifiers, [ 20 ] [ 21 ] and the relationships represented by attention weights. [ 16 ] [ 17 ]\nThe high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. [ 22 ] This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of the context. For example, the word fine can have two different meanings depending on the context ( I feel fine today , She has fine blond hair ). BERT considers the words surrounding the target word fine from the left and right side.\n However it comes at a cost: due to encoder-only architecture lacking a decoder, BERT can't be prompted and can't generate text , while bidirectional models in general do not work effectively without the right side, thus being difficult to prompt. As an illustrative example, if one wishes to use BERT to continue a sentence fragment \"Today, I went to\", then naively one would mask out all the tokens as \"Today, I went to [MASK] [MASK] [MASK] ... 
[MASK] .\" where the number of [MASK] is the length of the sentence one wishes to extend to. However, this constitutes a dataset shift, as during training, BERT has never seen sentences with that many tokens masked out. Consequently, its performance degrades. More sophisticated techniques allow text generation, but at a high computational cost. [ 23 ]\nHistory [ edit ]\nBERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins from pre-training contextual representations, including semi-supervised sequence learning , [ 24 ] generative pre-training, ELMo , [ 25 ] and ULMFit. [ 26 ] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus . Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for \"running\" will have the same word2vec vector representation for both of its occurrences in the sentences \"He is running a company\" and \"He is running a marathon\", BERT will provide a contextualized embedding that will be different according to the sentence. [ 4 ]\nOn October 25, 2019, Google announced that they had started applying BERT models to English-language search queries on Google Search within the US. [ 27 ] On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages. [ 28 ] [ 29 ] In October 2020, almost every single English-based query was processed by a BERT model. [ 30 ]\nVariants [ edit ]\nThe BERT models were influential and inspired many variants.\n RoBERTa (2019) [ 31 ] was an engineering improvement. 
It preserves BERT's architecture (slightly larger, at 355M parameters), but improves its training, changing key hyperparameters, removing the next-sentence prediction task, and using much larger mini-batch sizes.\nXLM-RoBERTa (2019) [ 32 ] was a multilingual RoBERTa model. It was one of the first works on multilingual language modeling at scale.\nDistilBERT (2019) distills BERT BASE to a model with just 60% of its parameters (66M), while preserving 95% of its benchmark scores. [ 33 ] [ 34 ] Similarly, TinyBERT (2019) [ 35 ] is a distilled model with just 28% of its parameters.\nALBERT (2019) [ 36 ] shared parameters across layers, and experimented with independently varying the hidden size and the word-embedding layer's output size as two hyperparameters. It also replaced the next sentence prediction task with the sentence-order prediction (SOP) task, where the model must distinguish the correct order of two consecutive text segments from their reversed order.\nELECTRA (2020) [ 37 ] applied the idea of generative adversarial networks to the MLM task. Instead of masking out tokens, a small language model generates random plausible substitutions, and a larger network identifies these replaced tokens. The small model aims to fool the large model.\nDeBERTa (2020) [ 38 ] is a significant architectural variant, with disentangled attention. Its key idea is to treat the positional and token encodings separately throughout the attention mechanism. 
Instead of combining the positional encoding ($x_{\\mathrm{position}}$) and token encoding ($x_{\\mathrm{token}}$) into a single input vector ($x_{\\mathrm{input}} = x_{\\mathrm{position}} + x_{\\mathrm{token}}$), DeBERTa keeps them separate as a tuple: $(x_{\\mathrm{position}}, x_{\\mathrm{token}})$. Then, at each self-attention layer, DeBERTa computes three distinct attention matrices, rather than the single attention matrix used in BERT: [ note 1 ]\nAttention type | Query type | Key type | Example\nContent-to-content | Token | Token | \"European\"; \"Union\", \"continent\"\nContent-to-position | Token | Position | [adjective]; +1, +2, +3\nPosition-to-content | Position | Token | −1; \"not\", \"very\"\nThe three attention matrices are added together element-wise, then passed through a softmax layer and multiplied by a projection matrix.\nAbsolute position encoding is included in the final self-attention layer as additional input.\nNotes\n^ The position-to-position type was omitted by the authors for being useless.\nReferences\n^ a b c d e f Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (October 11, 2018). \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\". arXiv : 1810.04805v2 [ cs.CL ].\n^ \"Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing\" . Google AI Blog . November 2, 2018 . Retrieved November 27, 2019 .\n^ a b c Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). \"A Primer in BERTology: What We Know About How BERT Works\" . Transactions of the Association for Computational Linguistics . 8 : 842– 866. arXiv : 2002.12327 . doi : 10.1162/tacl_a_00349 . 
S2CID   211532403 .\n^ a b Ethayarajh, Kawin (September 1, 2019), How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings , arXiv : 1909.00512\n^ Anderson, Dawn (November 5, 2019). \"A deep dive into BERT: How BERT launched a rocket into natural language understanding\" . Search Engine Land . Retrieved August 6, 2024 .\n^ Zhu, Yukun; Kiros, Ryan; Zemel, Rich; Salakhutdinov, Ruslan; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015). \"Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books\". pp.  19– 27. arXiv : 1506.06724 [ cs.CV ].\n^ a b c \"BERT\" . GitHub . Retrieved March 28, 2023 .\n^ Zhang, Tianyi; Wu, Felix; Katiyar, Arzoo; Weinberger, Kilian Q.; Artzi, Yoav (March 11, 2021), Revisiting Few-sample BERT Fine-tuning , arXiv : 2006.05987\n^ Turc, Iulia; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (September 25, 2019), Well-Read Students Learn Better: On the Importance of Pre-training Compact Models , arXiv : 1908.08962\n^ \"Summary of the models — transformers 3.4.0 documentation\" . huggingface.co . Retrieved February 16, 2023 .\n^ Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (February 28, 2023), UL2: Unifying Language Learning Paradigms , arXiv : 2205.05131\n^ a b Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). \"11.9. Large-Scale Pretraining with Transformers\" . Dive into deep learning . Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN   978-1-009-38943-3 .\n^ Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (October 10, 2016). \"SQuAD: 100,000+ Questions for Machine Comprehension of Text\". arXiv : 1606.05250 [ cs.CL ].\n^ Zellers, Rowan; Bisk, Yonatan; Schwartz, Roy; Choi, Yejin (August 15, 2018). \"SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference\". 
arXiv : 1808.05326 [ cs.CL ].\n^ \"bert/modeling.py at master · google-research/bert\" . GitHub . Retrieved September 16, 2024 .\n^ a b Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019). \"Revealing the Dark Secrets of BERT\" . Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) . pp.  4364– 4373. doi : 10.18653/v1/D19-1445 . S2CID   201645145 .\n^ a b Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (2019). \"What Does BERT Look at? An Analysis of BERT's Attention\" . Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP . Stroudsburg, PA, USA: Association for Computational Linguistics: 276– 286. arXiv : 1906.04341 . doi : 10.18653/v1/w19-4828 .\n^ Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). \"Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context\". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Stroudsburg, PA, USA: Association for Computational Linguistics: 284– 294. arXiv : 1805.04623 . doi : 10.18653/v1/p18-1027 . S2CID   21700944 .\n^ Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). \"Colorless Green Recurrent Networks Dream Hierarchically\". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) . Stroudsburg, PA, USA: Association for Computational Linguistics. pp.  1195– 1205. arXiv : 1803.11138 . doi : 10.18653/v1/n18-1108 . S2CID   4460159 .\n^ Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). \"Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information\". 
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP . Stroudsburg, PA, USA: Association for Computational Linguistics: 240– 248. arXiv : 1808.08079 . doi : 10.18653/v1/w18-5426 . S2CID   52090220 .\n^ Zhang, Kelly; Bowman, Samuel (2018). \"Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis\" . Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP . Stroudsburg, PA, USA: Association for Computational Linguistics: 359– 361. doi : 10.18653/v1/w18-5448 .\n^ Sur, Chiranjib (January 2020). \"RBN: enhancement in language attribute prediction using global representation of natural language transfer learning technology like Google BERT\" . SN Applied Sciences . 2 (1) 22. doi : 10.1007/s42452-019-1765-9 .\n^ Patel, Ajay; Li, Bryan; Mohammad Sadegh Rasooli; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). \"Bidirectional Language Models Are Also Few-shot Learners\". arXiv : 2209.14500 [ cs.LG ].\n^ Dai, Andrew; Le, Quoc (November 4, 2015). \"Semi-supervised Sequence Learning\". arXiv : 1511.01432 [ cs.LG ].\n^ Peters, Matthew; Neumann, Mark; Iyyer, Mohit; Gardner, Matt; Clark, Christopher; Lee, Kenton; Luke, Zettlemoyer (February 15, 2018). \"Deep contextualized word representations\". arXiv : 1802.05365v2 [ cs.CL ].\n^ Howard, Jeremy; Ruder, Sebastian (January 18, 2018). \"Universal Language Model Fine-tuning for Text Classification\". arXiv : 1801.06146v5 [ cs.CL ].\n^ Nayak, Pandu (October 25, 2019). \"Understanding searches better than ever before\" . Google Blog . Retrieved December 10, 2019 .\n^ \"Understanding searches better than ever before\" . Google . October 25, 2019 . Retrieved August 6, 2024 .\n^ Montti, Roger (December 10, 2019). \"Google's BERT Rolls Out Worldwide\" . Search Engine Journal . Retrieved December 10, 2019 .\n^ \"Google: BERT now used on almost every English query\" . 
Search Engine Land . October 15, 2020 . Retrieved November 24, 2020 .\n^ Liu, Yinhan; Ott, Myle; Goyal, Naman; Du, Jingfei; Joshi, Mandar; Chen, Danqi; Levy, Omer; Lewis, Mike; Zettlemoyer, Luke; Stoyanov, Veselin (2019). \"RoBERTa: A Robustly Optimized BERT Pretraining Approach\". arXiv : 1907.11692 [ cs.CL ].\n^ Conneau, Alexis; Khandelwal, Kartikay; Goyal, Naman; Chaudhary, Vishrav; Wenzek, Guillaume; Guzmán, Francisco; Grave, Edouard; Ott, Myle; Zettlemoyer, Luke; Stoyanov, Veselin (2019). \"Unsupervised Cross-lingual Representation Learning at Scale\". arXiv : 1911.02116 [ cs.CL ].\n^ Sanh, Victor; Debut, Lysandre; Chaumond, Julien; Wolf, Thomas (February 29, 2020), DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , arXiv : 1910.01108\n^ \"DistilBERT\" . huggingface.co . Retrieved August 5, 2024 .\n^ Jiao, Xiaoqi; Yin, Yichun; Shang, Lifeng; Jiang, Xin; Chen, Xiao; Li, Linlin; Wang, Fang; Liu, Qun (October 15, 2020), TinyBERT: Distilling BERT for Natural Language Understanding , arXiv : 1909.10351\n^ Lan, Zhenzhong; Chen, Mingda; Goodman, Sebastian; Gimpel, Kevin; Sharma, Piyush; Soricut, Radu (February 8, 2020), ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , arXiv : 1909.11942\n^ Clark, Kevin; Luong, Minh-Thang; Le, Quoc V.; Manning, Christopher D. (March 23, 2020), ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators , arXiv : 2003.10555\n^ He, Pengcheng; Liu, Xiaodong; Gao, Jianfeng; Chen, Weizhu (October 6, 2021), DeBERTa: Decoding-enhanced BERT with Disentangled Attention , arXiv : 2006.03654\nFurther reading [ edit ]\nRogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). \"A Primer in BERTology: What we know about how BERT works\". 
arXiv : 2002.12327 [ cs.CL ].\nExternal links\nOfficial GitHub repository\nRetrieved from \" https://en.wikipedia.org/w/index.php?title=BERT_(language_model)&oldid=1319250070 \"\nCategories : Google software Large language models 2018 software 2018 in artificial intelligence\nThis page was last edited on 28 October 2025, at 17:44  (UTC) .\nText is available under the Creative Commons Attribution-ShareAlike 4.0 License ;\nadditional terms may apply.\n"
    },
    {
      "turn_idx": 6,
      "action": "fetch_website_content({\"website_url\": \"https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html\"})",
      "explanation": "",
      "observation": "\nOpen Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processin\nJump to Content\nResearch\nResearch\nWho we are\nBack to\nWho we are\n menu\nDefining the technology of today and tomorrow.\nPhilosophy\nWe strive to create an environment conducive to many different types of research across many different time scales and levels of risk.\nLearn more about our Philosophy\nLearn more\nPhilosophy\nPeople\nOur researchers drive advancements in computer science through both fundamental and applied research.\nLearn more about our People\nLearn more\nPeople\nResearch areas\nBack to\nResearch areas\n menu\nResearch areas\nExplore all research areas\nResearch areas\nBack to\nResearch areas\n menu\nExplore all research areas\nFoundational ML & Algorithms\nAlgorithms & Theory\nData Management\nData Mining & Modeling\nInformation Retrieval & the Web\nMachine Intelligence\nMachine Perception\nMachine Translation\nNatural Language Processing\nSpeech Processing\nFoundational ML & Algorithms\nBack to\nFoundational ML & Algorithms\n menu\nAlgorithms & Theory\nData Management\nData Mining & Modeling\nInformation Retrieval & the Web\nMachine Intelligence\nMachine Perception\nMachine Translation\nNatural Language Processing\nSpeech Processing\nComputing Systems & Quantum AI\nDistributed Systems & Parallel Computing\nHardware & Architecture\nMobile Systems\nNetworking\nQuantum Computing\nRobotics\nSecurity, Privacy, & Abuse Prevention\nSoftware Engineering\nSoftware Systems\nComputing Systems & Quantum AI\nBack to\nComputing Systems & Quantum AI\n menu\nDistributed Systems & Parallel Computing\nHardware & Architecture\nMobile Systems\nNetworking\nQuantum Computing\nRobotics\nSecurity, Privacy, & Abuse Prevention\nSoftware Engineering\nSoftware Systems\nScience, AI & Society\nClimate & Sustainability\nEconomics & Electronic Commerce\nEducation Innovation\nGeneral Science\nHealth & Bioscience\nHuman-Computer Interaction and Visualization\nResponsible 
AI\nScience, AI & Society\nBack to\nScience, AI & Society\n menu\nClimate & Sustainability\nEconomics & Electronic Commerce\nEducation Innovation\nGeneral Science\nHealth & Bioscience\nHuman-Computer Interaction and Visualization\nResponsible AI\nOur work\nBack to\nOur work\n menu\nProjects\nWe regularly open-source projects with the broader research community and apply our developments to Google products.\nLearn more about our Projects\nLearn more\nProjects\nPublications\nPublishing our work allows us to share ideas and work collaboratively to advance the field of computer science.\nLearn more about our Publications\nLearn more\nPublications\nResources\nWe make products, tools, and datasets available to everyone with the goal of building a more collaborative ecosystem.\nLearn more about our Resources\nLearn more\nResources\nPrograms & events\nBack to\nPrograms & events\n menu\nShaping the future, together.\nCollaborate with us\nStudent programs\nSupporting the next generation of researchers through a wide range of programming.\nLearn more about our Student programs\nLearn more\nStudent programs\nFaculty programs\nParticipating in the academic research community through meaningful engagement with university faculty.\nLearn more about our Faculty programs\nLearn more\nFaculty programs\nConferences & events\nConnecting with the broader research community through events is essential for creating progress in every aspect of our work.\nLearn more about our Conferences & events\nLearn more\nConferences & events\nCollaborate with us\nCareers\nBlog\nSearch\nHome\nBlog\nOpen Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing\nNovember 2, 2018 Posted by Jacob Devlin and Ming-Wei Chang, Research Scientists, Google AI Language\nQuick links\nShare\nCopy link\n×\nOne of the biggest challenges in natural language processing (NLP) is the shortage of training data. 
Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.\nThis week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models. In our associated paper, we demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).\nWhat Makes BERT Different?\nBERT builds upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).\nWhy does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. 
Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context (“I accessed the ... account”), starting from the very bottom of a deep neural network, making it deeply bidirectional.\nA visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:\nBERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.\nThe Strength of Bidirectionality\nIf bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.\nTo solve this problem, we use the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words. 
For example:\nWhile this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.\nBERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:\nTraining with Cloud TPUs\nEverything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques. The Transformer model architecture, developed by researchers at Google in 2017, also gave us the foundation we needed to make BERT successful. The Transformer is implemented in our open source release, as well as the tensor2tensor library.\nResults with BERT\nTo evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:\nBERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:\nMaking BERT Work for You\nThe models that we are releasing can be fine-tuned on a wide variety of NLP tasks in a few hours or less. 
The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that we are releasing today are English-only, but we hope to release models which have been pre-trained on a variety of languages in the near future.\nThe open source TensorFlow implementation and pointers to pre-trained BERT models can be found at http://goo.gl/language/bert . Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs.”\nYou can also read our paper \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" for more details.\n"
    },
    {
      "turn_idx": 7,
      "action": "tavily_search({\"query\": \"Transformer encoder stack of N=6 identical layers quote\"})",
      "explanation": "",
      "observation": "{'query': 'Transformer encoder stack of N=6 identical layers quote', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'url': 'https://medium.com/@saleemheebah4/transformer-architecture-17cbd11eb26e', 'title': 'Transformer Architecture: Encoder | by Heebah Saleem - Medium', 'content': 'N = 6 identical layers — encoder consists of a stack of N = 6 identical layers, each composed of two sublayers. Attention (simple) Mechanism', 'score': 0.8055255, 'raw_content': None}, {'url': 'https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/', 'title': 'The Transformer: Attention Is All You Need - Glass Box Medicine', 'content': 'As you can see from the figure, the encoder is made of N = 6 identical layers stacked on top of each other. For English-Spanish translation', 'score': 0.7434107, 'raw_content': None}, {'url': 'https://nlp.seas.harvard.edu/2018/04/03/attention.html', 'title': 'The Annotated Transformer - Harvard NLP', 'content': 'To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution. The Transformer uses multi-head attention in three different ways: 1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite). In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d\\\\_{\\\\text{model}}$. 
The positional encodings have the same dimension $d\\\\_{\\\\text{model}}$ as the embeddings, so that the two can be summed. > Once trained we can decode the model to produce a set of translations.', 'score': 0.21618558, 'raw_content': None}, {'url': 'https://www.machinelearningmastery.com/the-transformer-model/', 'title': 'The Transformer Model - MachineLearningMastery.com', 'content': '* How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions. The Transformer architecture follows an encoder-decoder structure but does not rely on recurrence and convolutions in order to generate an output. In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder. * How the Transformer architecture implements an encoder-decoder structure without recurrence and convolutions. This is because, when the self-attention layer in the Transformer architecture receives its inputs in the form of Queries, Keys and Values, it will apply a set number of sequential operations to them, namely: dot-product multiplication between the Queries and Keys, followed by scaling (and masking, which is however optional), softmax normalisation, and one final dot-product multiplication with the Values (refer to the link provided by James for more details).', 'score': 0.16506785, 'raw_content': None}, {'url': 'https://huggingface.co/blog/encoder-decoder', 'title': 'Transformer-based Encoder-Decoder Models - Hugging Face', 'content': 'The transformer-based encoder part encodes the input sequence X1:n\\u200b to a *sequence* of *hidden states* X1:n\\u200b, thus defining the mapping:. probability distribution of the target vector sequence Y1:n\\u200b given the sequence of encoded hidden states X1:n\\u200b:. Thereby, the first encoder block transforms each input vector of the. 
transformer-based decoder is required as a starting input vector to. of decoder blocks maps the contextualized encoding sequence X1:n\\u200b and a target vector sequence prepended by. decoders, the encoded output vector y\\u200bi\\u200b should be. outputs of the uni-directional self-attention layer, *i.e.* Y′′0:m−1\\u200b and the contextualized encoding vectors X1:n\\u200b. the contextual encoding representation of all encoder input vectors X1:n\\u200b = \"I want to buy a car EOS\". responsible to further condition each output vector on all encoded input. encoder_output_vectors = model.base_model.encoder(input_ids, return_dict=True).last_hidden_state. # pass decoder input ids and encoded input vectors to decoder. decoder_output_vectors = model.base_model.decoder(decoder_input_ids, encoder_hidden_states=encoder_output_vectors).last_hidden_state. decoder_output_vectors_perturbed = model.base_model.decoder(decoder_input_ids_perturbed, encoder_hidden_states=encoder_output_vectors).last_hidden_state. lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits. lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits.', 'score': 0.10603178, 'raw_content': None}], 'response_time': 0.66, 'request_id': '7bfc79e9-c46c-475c-a48c-8f08353382f3'}"
    },
    {
      "turn_idx": 8,
      "action": "tavily_search({\"query\": \"BERT_BASE L=12 H=768 A=12 source\"})",
      "explanation": "",
      "observation": "{'query': 'BERT_BASE L=12 H=768 A=12 source', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'url': 'https://dataloop.ai/library/model/dimitre_bert_en_cased_l-12_h-768_a-12/', 'title': 'Bert en cased L 12 H 768 A 12 · Models - Dataloop', 'content': 'With 12 hidden layers, 768 hidden size, and 12 attention heads, this model is well-equipped to handle a wide range of tasks, from text classification to', 'score': 0.9999292, 'raw_content': None}, {'url': 'https://huggingface.co/google/bert_uncased_L-12_H-768_A-12', 'title': 'google/bert_uncased_L-12_H-768_A-12 - Hugging Face', 'content': 'This is the set of 24 BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.', 'score': 0.99992573, 'raw_content': None}, {'url': 'https://www.kaggle.com/questions-and-answers/393385', 'title': 'BERT parameters calculation - Kaggle', 'content': \"Let's look at individual elements of an encoder block for BERT to visualize the number weight matrices as well as the bias vectors. The given configuration L = 12 means there will be 12 layers of self attention, H = 768 means that the embedding dimension of individual tokens will be of 768 dimensions, A = 12 means there will be 12 attention heads in one layer of self attention. Position-wise feedforward network will have one hidden layer, that will correspond to two weight matrices and two bias vectors. Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model. Embedding Layer Normalization, weight and Bias [768] + [768] = 1536. Query Weight Matrix size [768, 64] = 49152 and Bias [768] = 768. Key Weight Matrix size [768, 64] = 49152 and Bias [768] = 768. 
Value Weight Matrix size [768, 64] = 49152 and Bias [768] = 768.\", 'score': 0.99992037, 'raw_content': None}, {'url': 'https://medium.com/data-science/keeping-up-with-the-berts-5b7beb92766', 'title': 'Keeping up with the BERTs - Medium', 'content': 'BERT Base: L=12, H=768, A=12. Total Parameters=110M! BERT Large: L=24, H=1024, A=16. Total Parameters=340M!! What makes it Bidirectional? We', 'score': 0.9999151, 'raw_content': None}, {'url': 'https://www.reddit.com/r/MachineLearning/comments/ng4nbw/d_how_to_predict_nlp_transformer_model_sizes/', 'title': '[D] How to predict NLP Transformer model sizes? : r/MachineLearning', 'content': '\"We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total', 'score': 0.9998795, 'raw_content': None}], 'response_time': 0.56, 'request_id': '0313d696-11a5-4ffb-a80f-5c3b2a6b5e0d'}"
    },
    {
      "turn_idx": 9,
      "action": "fetch_website_content({\"website_url\": \"https://www.kaggle.com/code/mdfahimreshm/bert-in-depth-understanding\"})",
      "explanation": "",
      "observation": "Checking your browser - reCAPTCHA Checking your browser before accessing www.kaggle.com ... Click here if you are not automatically redirected after 5 seconds."
    },
    {
      "turn_idx": 10,
      "action": "fetch_website_content({\"website_url\": \"https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/\"})",
      "explanation": "",
      "observation": "\nThe Transformer: Attention Is All You Need – Glass Box Medicine\nSkip to content\nBlog\nConsulting\nTable of Contents\nContact\nSearch\nSearch for:\nSearch\nGlass Box Medicine\nHealthcare & Artificial Intelligence, by Rachel Draelos, MD, PhD\nMachine Learning\nThe Transformer: Attention Is All You Need\nDate: August 15, 2019 Author: Rachel Draelos, MD, PhD\nThe Transformer paper, “Attention is All You Need” is the #1 all-time paper on Arxiv Sanity Preserver as of this writing (Aug 14, 2019). This paper showed that using attention mechanisms alone, it’s possible to achieve state-of-the-art results on language translation. Subsequent models built on the Transformer (e.g. BERT ) have achieved excellent performance on a wide range of natural language processing tasks.\nThroughout this post, I use a mixture of custom figures, figures modified from the original paper, equations, tables, and code, in an effort to explain the Transformer as clearly as possible. By the end of this post, you should understand all the different pieces of the Transformer model:\nFigure 1 from the Transformer paper\nPaper\nVaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. NeurIPS 2017 (pp. 5998-6008).\nAll quotes in this post are from the paper.\nCode\nFor a Pytorch implementation of the Transformer model, please see  “The Annotated Transformer”  which is an iPython notebook containing the Transformer paper text interspersed with working Pytorch code. I will reference code from the Annotated Transformer throughout this post, and explain certain sections of the code in detail.\nNote that the order in which we will discuss parts of the Transformer here is different from the order in either the original paper or the Annotated Transformer. Here, I’ve organized everything according to the flow of data through the model, e.g. 
starting from an English sentence “I like trees” and working through the Transformer to a Spanish translation “Me gustan los arboles.”\nThe hyperparameters used here are those of the Transformer base model, as shown in this excerpt from Table 3 of the Transformer paper:\nThese are the same hyperparameters used in the code in the function “make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff = 2048, h=8, dropout=0.1).”\nRepresenting Inputs and Outputs\nFor English-Spanish language translation, the input to the Transformer is an English sentence (“I like trees”) and the output is the Spanish translation of that sentence (“Me gustan los arboles”).\nRepresenting the Input\nWe first represent each word of the input sentence using a one-hot vector. A one-hot vector is a vector in which every element is “zero,” except for a single element which is a “one”:\nThe length of each one-hot vector is determined beforehand by the size of the vocabulary. If we want to represent 10,000 different words we need to use one-hot vectors of length 10,000 (so that we have a unique slot for the “one” for each word.) For more background on one-hot vectors, please see the post Preparing Tabular Data for Neural Networks (Code Included!) , Section “Representing Categorical Variables: One-Hot Vectors.”\nWord Embeddings\nWe don’t want to feed the Transformer plain one-hot vectors because they’re sparse, huge, and tell us nothing about the characteristics of the word. Therefore we learn a “word embedding” which is a smaller real-valued vector representation of the word that carries some information about the word. We can do this using nn.Embedding in Pytorch, or, more generally speaking, by multiplying our one-hot vector with a learned weight matrix W.\nnn.Embedding consists of a weight matrix W that will transform a one-hot vector into a real-valued vector. The weight matrix has shape (num_embeddings, embedding_dim). 
num_embeddings is simply the vocabulary size – you need one embedding for each word in the vocabulary. embedding_dim is the size you want your real-valued representation to be; you can choose this to be whatever you want – 3, 64, 256, 512, etc. In the Transformers paper they choose 512 (the hyperparameter d_model = 512).\nPeople refer to nn.Embedding as a “lookup table” because you can imagine the weight matrix as merely a stack of the real-valued vector representations of the words:\nThere are two options for dealing with the Pytorch nn.Embedding weight matrix. One option is to initialize it with pre-trained embeddings and keep it fixed, in which case it’s really just a lookup table. Another option is to initialize it randomly, or with pre-trained embeddings, but keep it trainable. In that case the word representations will get refined and modified throughout training because the weight matrix will get refined and modified throughout training.\nThe Transformer uses a random initialization of the weight matrix and refines these weights during training – i.e. it learns its own word embeddings.\nIn the Annotated Transformer the word embeddings are created using the class “Embeddings” which in turn makes use of nn.Embedding:\nPositional Encoding\nWe could just feed in the word embeddings for each word in our sentence and use that as our input representation. However, the word embeddings alone don’t carry any information about the relative order of words in the sentence:\n“I like trees” and “The trees grew” both contain the word “trees”\nThe word “trees” has the exact same word embedding regardless of whether it’s the third word or second word in a sentence.\nIn an RNN, it would be okay to use the same vector for the word “trees” everywhere, because the RNN processes an input sentence sequentially, one word at a time. 
However, in a Transformer, all words in the input sentence are processed simultaneously – there is no inherent “word ordering” and thus no inherent position information.\nThe authors of the Transformer propose adding a “positional encoding” to address this problem. The positional encoding makes it possible for the Transformer to use information about word order. The positional encoding uses a sequence of real-valued vectors that capture ordering information. Each word in a sentence is summed with a particular positional encoding vector based on its position within the sentence:\nHow exactly does the “position vector” carry information about position? The authors explored two options for creating the positional encoding vectors:\noption 1: learning the positional encoding vectors (requires trainable parameters),\noption 2: calculating the positional encoding vectors using an equation (requires no trainable parameters)\nThey found the performance of both options was similar, so they went with the latter option (calculation) because it requires no parameters and also might allow the model to work well even on sentence lengths not seen during training.\nHere’s the formula they use to calculate the positional encoding:\nIn this equation,\npos is the position of a word in the sentence (e.g. “2” for the second word in the sentence)\ni indexes into the embedding dimension, i.e. it’s the position along the positional encoding vector dimension. For a positional encoding vector of length d_model = 512, we’ll have i range from 1 to 512.\nWhy use sine and cosine?  To quote the authors, “each dimension of the positional encoding corresponds to a sinusoid. 
[…] We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions.”\nIn the Annotated Transformer the positional encoding is created and added to the word embeddings using the class “PositionalEncoding”:\nSummary of the Input\nNow we’ve got our input representation: an English sentence, “I like trees” converted to three vectors (one vector for each word):\nEach of the vectors in our input representation captures both the word’s meaning and the word’s position in the sentence.\nWe’ll do the exact same process (word embedding + positional encoding) to represent the output, which in this case is the Spanish sentence “Me gustan los arboles.”\nNow we’ve covered the bottom piece of Figure 1 of the Transformers paper: how the input and output sentences are processed before being fed into the rest of the model (not to be confused with the “output probabilities” at the top, which are something else):\nThe shape of the inputs Tensor (and of the outputs Tensor) after the embedding and positional encoding is [nbatches, L, 512] where “nbatches” is the batch size (to follow the Annotated Transformer’s variable name), L is the length of the sequence (e.g. L = 3 for “I like trees”), and 512 is the embedding dimension / positional encoding dimension. Note that the batches are created carefully such that a single batch contains sequences of similar length.\nThe Encoder\nNow it’s time for the encoder to process our sentence. Here’s what the encoder looks like:\nFigure modified from Transformer paper Figure 1.\nAs you can see from the figure, the encoder is made of N = 6 identical layers stacked on top of each other.\nFor English-Spanish translation, what goes “in” is an English sentence, e.g. “I like trees,” represented in the “word embeddings + positional encodings” format we just talked about. 
What comes “out” is a different representation of this sentence.\nEach of the six encoder layers contains two sub-layers:\nthe first sub-layer is “a multi-head self-attention mechanism”\nthe second sub-layer is “a simple, position-wise fully connected feed-forward network”\nFigure modified from Transformer paper Figure 1.\nWe’ll talk about what each of these sub-layers is doing. But first, here’s code from  the Annotated Transformer  showing how the Encoder is built:\nThe class “Encoder” takes in a <layer> and stacks it <N> times. The <layer> that it takes in is an instance of the class “EncoderLayer”\nThe class “EncoderLayer” is initialized with <size>, <self_attn>, <feed_forward>, and <dropout>:\n<size> is d_model, which is 512 in the base model\n<self_attn> is an instance of class “MultiHeadedAttention.” This corresponds to sub-layer 1.\n<feed_forward> is an instance of class “PositionwiseFeedForward.” This corresponds to sub-layer 2.\n<dropout> is the dropout rate, e.g. 0.1\nEncoder Sub-Layer 1: Multi-Head Attention Mechanism\nCerberus the multi-headed dog ( Image Source )\nIt is critical to understand the multi-head attention mechanism in order to understand the Transformer. The multi-head attention mechanism is used in both the encoder and the decoder. First I’ll write a high-level summary based on the equations from the paper, and then we’ll dive into the details of the code.\nHigh-Level Verbal Summary of Attention: Keys, Queries, and Values\nLet’s call our input to the attention mechanism “x.” At the beginning of the encoder, x is our initial sentence representation. In the middle of the encoder, “x” is the output of the previous EncoderLayer. For example EncoderLayer3 gets its input “x” from the output of EncoderLayer2.\nWe use x to calculate keys, queries, and values. 
The keys, queries, and values are calculated from x using distinct linear layers:\nkey = linear_k(x)\nquery = linear_q(x)\nvalue = linear_v(x)\nwhere for a particular encoder layer, linear_k, linear_q, and linear_v are separate feedforward neural network layers that go from dimension 512 to dimension 512 (d_model to d_model.) Linear_k, linear_q, and linear_v have different weights from each other which are learned separately. If we used the same layer weights to calculate the keys, queries, and values, then they’d all be identical to each other and we wouldn’t need different names for them.\nOnce we have our keys (K), queries (Q), and values (V), we calculate the attention as follows:\nThis is equation (1) in the Transformer paper. Let’s dissect what is going on:\nFirst, we take the dot product between the query and the key. If we do this on many queries and keys simultaneously, we can write the dot products as a matrix multiplication like so:\nHere Q is a stack of queries q, and K is a stack of keys k.\nAfter we take the dot product, we divide by the square root of d_k:\nWhat’s the point of dividing by sqrt(d_k)? The authors explain that they scale the dot products by sqrt(d_k) to prevent the dot products from getting huge as d_k (the vector length) increases.\nExample: the dot product of the vectors [2,2] and [2,2] is 8, but the dot product of the vectors [2,2,2,2,2] and [2,2,2,2,2] is 20. We don’t want the dot product to be huge if we choose a huge vector length, so we divide by the square root of the vector length to mitigate this effect. A huge dot product value is bad because it will “push the softmax function into regions where it has extremely small gradients.”\nWhich brings us to the next step – applying a softmax, which squishes the numbers into a (0,1) range:\nFor a discussion of the softmax function see this post .\nWhat do we have at this point? At this point, we have a bunch of numbers between 0 and 1, which we can think of as our attention weights. 
We’ve calculated these attention weights as softmax(QK^T/sqrt(d_k)).\nThe last step is to do a weighted sum of the values V, using the attention weights that we just calculated:\nAnd that’s the whole equation!\nMore Detailed Description of Multi-Headed Attention with Code\nThis section refers to code from the Annotated Transformer , as I think looking at the code is a good way to understand what is happening.\nFirst, here are the relevant pieces of code with some annotations. Don’t worry about all these annotations – I’ll describe the key points in the text below. (If you would like to read the annotations, depending on your browser you may have to open the image in a new window in order for the text to be large enough to read.)\nIn the Annotated Transformer , multi-headed attention is implemented with the class MultiHeadedAttention:\nAn instance of this class is initialized with:\n<h> = 8, the number of “heads.” There are 8 heads in the Transformer base model.\n<d_model> = 512\n<dropout> = dropout rate = 0.1\nThe dimension of the keys, d_k, is calculated as d_model/h. So in this case, d_k = 512/8 = 64.\nLet’s look in more detail at the forward() function from MultiHeadedAttention:\nWe can see that the input to forward() is query, key, value, and mask. Ignore the mask for now. Where do the query, key, and value come from? They are in fact from the “x” repeated three times in the EncoderLayer (see yellow highlight):\nThe x came from the previous EncoderLayer, or if we’re at EncoderLayer1, the x is our initial sentence representation. 
(Note that self.self_attn in class EncoderLayer is an instance of MultiHeadedAttention.)\nWithin the MultiHeadedAttention class, we’re going to take the old queries (the old x), the old keys (also the old x), and the old values (also the old x), and produce new queries, keys, and values which are distinct from each other.\nNote that the shape of the “query” input is [nbatches, L, 512] where nbatches is the batch size, L is the sequence length, and 512 is d_model. The “key” and “value” inputs also have shape [nbatches, L, 512].\nStep 1) in the MultiHeadedAttention forward() function states, “Do all the linear projections in batch from d_model => h x d_k.”\nWe’ll do three different linear projections on the same Tensor of shape [nbatches, L, 512] to obtain new queries, keys, and values, each of shape [nbatches, L, 512]. (The shape hasn’t changed since the linear layer is 512 –> 512).\nThen, we’ll reshape that output into 8 different heads. For example the queries are shape [nbatches, L, 512] and are reshaped using view() to [nbatches, L, 8, 64] where h=8 is the number of heads and d_k = 64 is the key size.\nFinally we’ll swap dimensions 1 and 2 using transpose to get shape [nbatches, 8, L, 64]\nStep 2) in the code states, “Apply attention on all the projected vectors in batch.”\nThe specific line is x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)\nHere’s the equation we’re implementing using the attention() function:\nAnd here’s the attention() implementation:\nTo calculate the scores, we first do a matrix multiplication between query [nbatches, 8, L, 64 ] and key transposed [nbatches, 8, 64, L ]. This is the QK^T from the equation. The resulting shape of scores is [nbatches, 8, L, L ].\nNext, we calculate the attention weights p_attn by applying a softmax to the scores. If applicable, we also apply dropout to the attention weights. p_attn thus corresponds to softmax(QK^T/sqrt(d_k)) in the equation above. 
The shape of p_attn is [nbatches, 8, L, L ] because applying softmax and dropout to the scores doesn’t change the shape.\nFinally, we perform a matrix multiplication between the attention weights p_attn [nbatches, 8, L, L ] and the values [nbatches, 8, L, 64 ]. The result is the final output of our attention function, with shape [nbatches, 8, L, 64 ]. We return this from the function, along with the attention weights themselves p_attn.\nNotice that in our input to the attention function and in the output of the attention function, we have 8 heads (dimension 1 of our Tensor, e.g. [nbatches, 8, L, 64].) We did a different matrix multiplication for each of the eight heads. This is what is meant by “multi-headed” attention: the extra “heads” dimension allows us to have multiple “representation subspaces.” It gives us eight different ways of considering the same sentence.\nStep 3) (in the MultiHeadedAttention class, since we’ve now returned from the attention() function) is concatenation using a view() followed by application of a final linear layer.\nThe specific lines of Step 3 are:\nx = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)\nreturn self.linears[-1](x)\nx is what’s returned from the attention function: our eight-headed representation [nbatches, 8, L, 64].\nWe transpose it to get [nbatches, L, 8, 64] and then reshape it using view to get [nbatches, L, 8 x 64] = [nbatches, L, 512]. This reshaping operation using view() is basically a concatenation across the 8 heads.\nFinally, we apply our last linear layer from self.linears. This linear layer goes from 512 –> 512. Note that in PyTorch if a multi-dimensional Tensor is given to a Linear layer, the Linear layer is applied only to the last dimension. 
So the result of self.linears[-1](x) still has shape [nbatches, L, 512].\nNotice that [nbatches, L, 512] is exactly the right shape we need to feed in to another MultiHeadedAttention layer….\n…But before we do that, we’ve got one last step, Encoder Sub-Layer 2, which we’ll talk about right after we review Figure 2 from the Transformer paper.\nHere is Figure 2 from the Transformer paper:\nTransformer paper Figure 2\nOn the left, under “Scaled Dot-Product Attention” we have a visual depiction of what the attention() function calculates: softmax(QK^T/sqrt(d_k))V. It’s “scaled” because of the division by sqrt(d_k) and it’s “dot-product” because QK^T represents the dot product between a bunch of stacked queries and a bunch of stacked keys.\nOn the right, under “Multi-Head Attention” we have a visual depiction of what the MultiHeadedAttention class does. You should now recognize the parts of this figure:\nAt the bottom going in are the old V, K, and Q, which are our “x” output from a previous EncoderLayer (or our x from the input sentence representation for EncoderLayer1)\nNext we apply a Linear layer (the “Linear” boxes) to calculate a processed V, K, and Q (which are not explicitly shown)\nWe feed the processed V, K, and Q into our scaled dot-product attention with 8 heads. This is the attention() function.\nFinally, we concatenate the result of the attention() function over the 8 heads, and apply a last linear layer to produce our multi-headed attention output.\nEncoder Sub-Layer 2: Position-Wise Fully Connected Feed-Forward Network\nWe are now almost done understanding an entire EncoderLayer! Recall that this is the basic structure of a single EncoderLayer (modified from Transformer paper Figure 1):\nWe’ve gone over sub-layer 1, the multi-head attention. Now we’ll take a look at sub-layer 2, the feed-forward network.\nSub-layer 2 is easier to understand than sub-layer 1, because sub-layer 2 is just a feedforward neural network. 
Here’s the expression for sub-layer 2:\nIn other words, we apply a fully-connected layer with weights W1 and biases b1, perform a ReLU nonlinearity (the max with zero), and then apply a second fully-connected layer with weights W2 and biases b2.\nHere’s the corresponding code snippet from the Annotated Transformer:\nSo, in summary, the Encoder consists of 6 EncoderLayers. Each EncoderLayer has 2 sub-layers: sub-layer 1 for multi-headed attention, and sub-layer 2 that’s just a feedforward neural network.\nThe Decoder\nNow that we understand the encoder, the decoder will be easier to understand, because it’s similar to the encoder. Here’s Figure 1 again, with a few extra annotations:\nModified from Transformer paper Figure 1\nHere are three main differences between the decoder and the encoder:\nDecoder sub-layer 1 uses “masked” multi-head attention to prevent illegally “seeing into the future.”\nThe decoder has an extra sub-layer, labeled “sub-layer 2” in the figure above. This sub-layer is “encoder-decoder multi-head attention.”\nThere is a linear layer and a softmax applied to the decoder output to produce output probabilities which indicate the next predicted word.\nLet’s talk about each of these pieces.\nDecoder Sub-Layer 1: Masked Multi-Head Attention\nThe Crystal Ball by John William Waterhouse\nThe point of masking in the multi-headed attention layer is to prevent the decoder from “seeing into the future”  – i.e. we don’t want any crystal balls built into our model.\nThe mask consists of ones and zeros:\nMask example ( Image Source ).\nThe lines of code in the attention() function that use the mask are here:\nif mask is not None:\nscores = scores.masked_fill(mask == 0, -1e9)\nThe masked_fill(mask, value) function “fills elements of  self  tensor with  value  where  mask  is True. 
The shape of  mask  must be  broadcastable  with the shape of the underlying tensor.” So, basically, we use the mask to “zero out” parts of the scores Tensor that correspond to future words we aren’t supposed to see.\nTo quote the authors, “We […] modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.”\nDecoder Sub-Layer 2: Encoder-Decoder Multi-Head Attention\nHere’s the code for a DecoderLayer:\nThe line underlined in yellow defines the “encoder-decoder attention”:\nx = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))\nself.src_attn is an instance of MultiHeadedAttention. The inputs are query = x, key = m, value = m, and mask = src_mask. Here, x comes from the previous DecoderLayer, while m or “memory” comes from the output of the Encoder (i.e. the output of EncoderLayer6).\n(Note that the line above the line in yellow,  x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask)) , defines the decoder self-attention of decoder sub-layer 1 that we just talked about. It works exactly the same way as encoder self-attention except for the extra masking step.)\nAside: Complete Summary of Attention in the Transformer\nWe’ve now gone over the three kinds of attention in the Transformer. 
Here’s the author summary from the Transformer paper about the three ways attention is used in their model:\nDecoder Final Output: Linear and Softmax to Produce Output Probabilities\nAt the end of the decoder stack, we feed the decoder’s final output into a linear layer followed by a softmax, to predict the next token.\nWe ran the encoder once to get the output of the encoder stack, which represents the English input sentence “I like trees”; now we’re going to run the decoder multiple times so it can predict the multiple words in the Spanish translation “Me gustan los arboles.”\nThe last linear layer expands the output of the decoder stack into a huge vector whose length is equal to the vocabulary size. The softmax means that we’ll select the one element of this huge vector with the highest probability (“greedy decoding”) which corresponds to one word in our Spanish vocabulary.\nAfter the network is trained (i.e. when we’re performing inference), we’ll do the following steps (note that the encoder output is calculated once and then used several times):\nFeed the decoder our encoder output (which represents the full English sentence “I like trees”) and  a special beginning of sentence token, </s>, in the “output sentence” slot at the bottom of the decoder. 
The decoder will produce a predicted word, which in our example should be “Me” (the first word of our Spanish translation.)\nFeed the decoder our encoder output, the beginning of sentence token, and the word the decoder just produced – i.e., feed the decoder the encoder output and “</s> Me.” In this step the decoder should produce the predicted word “gustan.”\nFeed the decoder our encoder output and “</s> Me gustan.” In this step the decoder should produce the predicted word “los.”\nFeed the decoder our encoder output and “</s> Me gustan los.” In this step the decoder should produce the predicted word “arboles.”\nFeed the decoder our encoder output and “</s> Me gustan los arboles.” In this step the decoder should produce the end-of-sentence token , for example “</eos>.”\nBecause the decoder has produced an end-of-sentence token, we know we are done translating this sentence.\nWhat about during training? During training, the decoder might not be very good – so it can produce incorrect predictions of the next word. If the decoder is producing junk, we don’t want to feed that junk back into the decoder for the next step. So, during training we use a process called “teacher forcing” ( ref1 , ref2 , ref3 ).\nIn teacher forcing, we make use of the fact that we know what the correct translation should be, and we feed the decoder the symbols that it should  have predicted so far. Note that we don’t want the decoder to just learn a copying task, so we’ll only feed it “</s> Me gustan los” at the step where it’s supposed to be predicting the word “arboles.” This is implemented through:\nthe masking that we talked about earlier, in which future words are zeroed out (i.e. no feeding the decoder “los arboles” when it’s supposed to be predicting “gustan”), and\na right-shift so that the “present” word isn’t fed in either (i.e. 
no feeding the decoder “gustan” when it’s supposed to be predicting “gustan.”)\nThe loss is then calculated using the probability distribution over possible next words that the decoder actually produced (e.g. [0.01,0.01,0.02,0.7,0.20,0.01,0.05]), versus the probability distribution it should’ve produced (which is [0,0,0,0,1,0,0] with the “1” in the “arboles” slot if we’re using one-hot vectors as ground truth.)\nNote that the approach I just described (selecting the highest-probability word at each decoding step) is called “greedy decoding.” An alternative is beam search, which keeps more than one predicted word at each decoding step (see this post for more details.)\nHere’s the Pytorch Generator class used for the final linear layer and softmax:\nMore Fun\nCongratulations – you have just worked through the key parts of a Transformer model! There are a few additional concepts built in to the Transformer that I’ll overview quickly here:\nDropout : dropout is used in a few different places throughout the Transformer. In this technique, a random subset of neurons are ignored during each forward/backward pass to help prevent overfitting. For more information on dropout see this post .\nResidual Connection and Layer Normalization : There’s a residual connection around each sub-layer of the encoder and around each sub-layer of the decoder, followed by layer normalization.\nResidual connection: if we calculate some function f(x), a “residual connection” produces the output f(x)+x. In other words, we add the original input back onto the output that we just calculated. For more details see this article .\nLayer normalization: this is a method that normalizes inputs across the features (as opposed to batch normalization which normalizes features across a batch.) 
For more details see this article .\nIn the following diagram of an EncoderLayer I’ve colored red the relevant parts: the arrow and the “Add and Norm” box that together represent the residual connection and the layer normalization:\nModified from Transformer paper Figure 1.\nQuote, “the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.”\nFor the Pytorch implementation, see the Annotated Transformer class “LayerNorm” as well as class “SublayerConnection” which applies LayerNorm, Dropout, and a residual connection.\nNoam optimizer : The Transformer is trained using the Adam optimizer . The authors report a specific formula for varying the learning rate throughout training. First the learning rate is increased linearly for a certain number of training steps. After that, the learning rate is decreased proportionally to the inverse square root of the step number. This learning rate schedule is implemented in the Annotated Transformer class “NoamOpt.”\nLabel smoothing : Finally, the authors apply the label smoothing technique. Essentially, label smoothing takes one-hot-encoded “correct answers” and smooths them out so that most of the probability mass goes where the “1” was, and the remainder is distributed throughout all the slots that were “0.” For more details see this paper . For the implementation see Annotated Transformer class “LabelSmoothing.”\nGrand Summary!\nThe Transformer consists of an Encoder and a Decoder.\nThe input sentence (e.g. “I like trees”) and the output sentence (e.g. “Me gustan los arboles”) are each represented using a word embedding plus positional encoding vector for each word.\nThe Encoder is made up of 6 EncoderLayers. 
The Decoder is made up of 6 DecoderLayers.\nEach EncoderLayer has two sub-layers: multi-headed self-attention and a feedforward layer.\nEach DecoderLayer has three sub-layers: multi-headed self-attention, multi-headed encoder-decoder attention, and a feedforward layer.\nAt the end of the Decoder, a linear layer and a softmax are applied to the Decoder output to predict the next word.\nThe Encoder is run once. The Decoder is run multiple times, to produce a predicted word at each step.\nReferences\nVaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. “Attention is all you need.” NeurIPS 2017 (pp. 5998-6008).\nThe Annotated Transformer by Harvard NLP. An iPython notebook that walks through the paper step-by-step with working Pytorch code throughout.\nWalkthrough: The Transformer Architecture [Part 1/2] by Matthew Barnett. Overview of basic concepts in the Transformer without excess details.\nWalkthrough: The Transformer Architecture [Part 2/2] by Matthew Barnett. Details about how the Transformer works. Great post for explaining what queries, keys, and values actually are.\nThe Illustrated Transformer by Jay Alammar\nTransformer Tutorial – DGL 0.3 documentation . Includes an interesting graph illustration of the decoder/encoder\nCS224N Transformer Networks by Richard Socher\nTransformer model for language understanding from Tensorflow Core. Includes Tensorflow example code.\nAbout the Featured Image\nThe featured image is a crop from the painting “The Crystal Ball” by John William Waterhouse .\nWant to be the first to hear about my upcoming book bridging healthcare, artificial intelligence, and business—and get a free list of my favorite health AI resources? 
Sign up here.\nPublished by Rachel Draelos, MD, PhD\nI have an MD and a PhD in Computer Science from Duke University. My research focuses on machine learning methods development for medical data. I am passionate about explainable AI for healthcare.\n"
    },
    {
      "turn_idx": 11,
      "action": "fetch_website_content({\"website_url\": \"https://medium.com/data-science/keeping-up-with-the-berts-5b7beb92766\"})",
      "explanation": "",
      "observation": "Just a moment... Enable JavaScript and cookies to continue \n"
    },
    {
      "turn_idx": 12,
      "action": "fetch_website_content({\"website_url\": \"https://nlp.seas.harvard.edu/2018/04/03/attention.html\"})",
      "explanation": "",
      "observation": "\nThe Annotated Transformer\nMembers\nCode\nPublications\nContact Us\nThe Annotated Transformer\nApr 3, 2018\n———————-\nThere is now a new version of this blog post updated for modern PyTorch.\n———————-\nfrom IPython.display import Image\nImage ( filename = 'images/aiayn.png' )\nThe Transformer from “Attention is All You\nNeed” has been on a lot of people’s minds\nover the last year. Besides producing major improvements in translation quality,\nit provides a new architecture for many other NLP tasks. The paper itself is\nvery clearly written, but the conventional wisdom has been that it is quite\ndifficult to implement correctly.\nIn this post I present an “annotated” version of the paper in the form of a\nline-by-line implementation. I have reordered and deleted some sections from the\noriginal paper and added comments throughout. This document itself is a working\nnotebook, and should be a completely usable implementation. In total there are\n400 lines of library code which can process 27,000 tokens per second on 4 GPUs.\nTo follow along you will first need to install\n PyTorch . The complete notebook is also\navailable on\n github or on\nGoogle\n Colab with free GPUs.\nNote this is merely a starting point for researchers and interested developers.\nThe code here is based heavily on our OpenNMT packages.\n(If helpful feel free to cite .) For other full-sevice\nimplementations of the model check-out\n Tensor2Tensor (tensorflow) and\n Sockeye (mxnet).\nAlexander Rush ( @harvardnlp or\nsrush@seas.harvard.edu), with help from Vincent Nguyen and Guillaume Klein\nPrelims\n# !pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl numpy matplotlib spacy torchtext seaborn\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport math , copy , time\nfrom torch.autograd import Variable\nimport matplotlib.pyplot as plt\nimport seaborn\nseaborn . 
set_context ( context = \"talk\" )\n% matplotlib inline\nTable of Contents\nPrelims\nBackground\nModel Architecture\nEncoder and Decoder Stacks\nEncoder\nDecoder\nAttention\nApplications of Attention in our Model\nPosition-wise Feed-Forward Networks\nEmbeddings and Softmax\nPositional Encoding\nFull Model\nTraining\nBatches and Masking\nTraining Loop\nTraining Data and Batching\nHardware and Schedule\nOptimizer\nRegularization\nLabel Smoothing\nA First Example\nSynthetic Data\nLoss Computation\nGreedy Decoding\nA Real World Example\nData Loading\nIterators\nMulti-GPU Training\nTraining the System\nAdditional Components: BPE, Search, Averaging\nResults\nAttention Visualization\nConclusion\nMy comments are blockquoted. The main text is all from the paper itself.\nBackground\nThe goal of reducing sequential computation also forms the foundation of the\nExtended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural\nnetworks as basic building block, computing hidden representations in parallel\nfor all input and output positions. In these models, the number of operations\nrequired to relate signals from two arbitrary input or output positions grows in\nthe distance between positions, linearly for ConvS2S and logarithmically for\nByteNet. This makes it more difficult to learn dependencies between distant\npositions. In the Transformer this is reduced to a constant number of\noperations, albeit at the cost of reduced effective resolution due to averaging\nattention-weighted positions, an effect we counteract with Multi-Head Attention.\nSelf-attention, sometimes called intra-attention is an attention mechanism\nrelating different positions of a single sequence in order to compute a\nrepresentation of the sequence. Self-attention has been used successfully in a\nvariety of tasks including reading comprehension, abstractive summarization,\ntextual entailment and learning task-independent sentence representations. 
End-\nto-end memory networks are based on a recurrent attention mechanism instead of\nsequencealigned recurrence and have been shown to perform well on simple-\nlanguage question answering and\nlanguage modeling tasks.\nTo the best of our knowledge, however, the Transformer is the first transduction\nmodel relying entirely on self-attention to compute representations of its input\nand output without using sequence aligned RNNs or convolution.\nModel Architecture\nMost competitive neural sequence transduction models have an encoder-decoder\nstructure (cite) . Here, the encoder maps an\ninput sequence of symbol representations $(x_1, …, x_n)$ to a sequence of\ncontinuous representations $\\mathbf{z} = (z_1, …, z_n)$. Given $\\mathbf{z}$,\nthe decoder then generates an output sequence $(y_1,…,y_m)$ of symbols one\nelement at a time. At each step the model is auto-regressive\n (cite) , consuming the previously generated\nsymbols as additional input when generating the next.\nclass EncoderDecoder ( nn . Module ):\n\"\"\"\n A standard Encoder-Decoder architecture. Base for this and many\nother models.\n \"\"\"\ndef __init__ ( self , encoder , decoder , src_embed , tgt_embed , generator ):\nsuper ( EncoderDecoder , self ). __init__ ()\nself . encoder = encoder\nself . decoder = decoder\nself . src_embed = src_embed\nself . tgt_embed = tgt_embed\nself . generator = generator\ndef forward ( self , src , tgt , src_mask , tgt_mask ):\n\"Take in and process masked src and target sequences.\"\nreturn self . decode ( self . encode ( src , src_mask ), src_mask ,\ntgt , tgt_mask )\ndef encode ( self , src , src_mask ):\nreturn self . encoder ( self . src_embed ( src ), src_mask )\ndef decode ( self , memory , src_mask , tgt , tgt_mask ):\nreturn self . decoder ( self . tgt_embed ( tgt ), memory , src_mask , tgt_mask )\nclass Generator ( nn . Module ):\n\"Define standard linear + softmax generation step.\"\ndef __init__ ( self , d_model , vocab ):\nsuper ( Generator , self ). 
__init__ ()\nself . proj = nn . Linear ( d_model , vocab )\ndef forward ( self , x ):\nreturn F . log_softmax ( self . proj ( x ), dim =- 1 )\nThe Transformer follows this overall architecture using stacked self-attention\nand point-wise, fully connected layers for both the encoder and decoder, shown\nin the left and right halves of Figure 1, respectively.\nImage ( filename = 'images/ModalNet-21.png' )\nEncoder and Decoder Stacks\nEncoder\nThe encoder is composed of a stack of $N=6$ identical layers.\ndef clones ( module , N ):\n\"Produce N identical layers.\"\nreturn nn . ModuleList ([ copy . deepcopy ( module ) for _ in range ( N )])\nclass Encoder ( nn . Module ):\n\"Core encoder is a stack of N layers\"\ndef __init__ ( self , layer , N ):\nsuper ( Encoder , self ). __init__ ()\nself . layers = clones ( layer , N )\nself . norm = LayerNorm ( layer . size )\ndef forward ( self , x , mask ):\n\"Pass the input (and mask) through each layer in turn.\"\nfor layer in self . layers :\nx = layer ( x , mask )\nreturn self . norm ( x )\nWe employ a residual connection (cite) \naround each of the two sub-layers, followed by layer normalization\n (cite) .\nclass LayerNorm ( nn . Module ):\n\"Construct a layernorm module (See citation for details).\"\ndef __init__ ( self , features , eps = 1e-6 ):\nsuper ( LayerNorm , self ). __init__ ()\nself . a_2 = nn . Parameter ( torch . ones ( features ))\nself . b_2 = nn . Parameter ( torch . zeros ( features ))\nself . eps = eps\ndef forward ( self , x ):\nmean = x . mean ( - 1 , keepdim = True )\nstd = x . std ( - 1 , keepdim = True )\nreturn self . a_2 * ( x - mean ) / ( std + self . eps ) + self . b_2\nThat is, the output of each sub-layer is $\\mathrm{LayerNorm}(x +\n\\mathrm{Sublayer}(x))$, where $\\mathrm{Sublayer}(x)$ is the function implemented\nby the sub-layer itself. 
We apply dropout\n (cite) to the output of each\nsub-layer, before it is added to the sub-layer input and normalized.\nTo facilitate these residual connections, all sub-layers in the model, as well\nas the embedding layers, produce outputs of dimension $d_{\\text{model}}=512$.\nclass SublayerConnection ( nn . Module ):\n\"\"\"\n A residual connection followed by a layer norm.\n Note for code simplicity the norm is first as opposed to last.\n \"\"\"\ndef __init__ ( self , size , dropout ):\nsuper ( SublayerConnection , self ). __init__ ()\nself . norm = LayerNorm ( size )\nself . dropout = nn . Dropout ( dropout )\ndef forward ( self , x , sublayer ):\n\"Apply residual connection to any sublayer with the same size.\"\nreturn x + self . dropout ( sublayer ( self . norm ( x )))\nEach layer has two sub-layers. The first is a multi-head self-attention\nmechanism, and the second is a simple, position-wise fully connected feed-\nforward network.\nclass EncoderLayer ( nn . Module ):\n\"Encoder is made up of self-attn and feed forward (defined below)\"\ndef __init__ ( self , size , self_attn , feed_forward , dropout ):\nsuper ( EncoderLayer , self ). __init__ ()\nself . self_attn = self_attn\nself . feed_forward = feed_forward\nself . sublayer = clones ( SublayerConnection ( size , dropout ), 2 )\nself . size = size\ndef forward ( self , x , mask ):\n\"Follow Figure 1 (left) for connections.\"\nx = self . sublayer [ 0 ]( x , lambda x : self . self_attn ( x , x , x , mask ))\nreturn self . sublayer [ 1 ]( x , self . feed_forward )\nDecoder\nThe decoder is also composed of a stack of $N=6$ identical layers.\nclass Decoder ( nn . Module ):\n\"Generic N layer decoder with masking.\"\ndef __init__ ( self , layer , N ):\nsuper ( Decoder , self ). __init__ ()\nself . layers = clones ( layer , N )\nself . norm = LayerNorm ( layer . size )\ndef forward ( self , x , memory , src_mask , tgt_mask ):\nfor layer in self . 
layers :\nx = layer ( x , memory , src_mask , tgt_mask )\nreturn self . norm ( x )\nIn addition to the two sub-layers in each encoder layer, the decoder inserts a\nthird sub-layer, which performs multi-head attention over the output of the\nencoder stack. Similar to the encoder, we employ residual connections around\neach of the sub-layers, followed by layer normalization.\nclass DecoderLayer ( nn . Module ):\n\"Decoder is made of self-attn, src-attn, and feed forward (defined below)\"\ndef __init__ ( self , size , self_attn , src_attn , feed_forward , dropout ):\nsuper ( DecoderLayer , self ). __init__ ()\nself . size = size\nself . self_attn = self_attn\nself . src_attn = src_attn\nself . feed_forward = feed_forward\nself . sublayer = clones ( SublayerConnection ( size , dropout ), 3 )\ndef forward ( self , x , memory , src_mask , tgt_mask ):\n\"Follow Figure 1 (right) for connections.\"\nm = memory\nx = self . sublayer [ 0 ]( x , lambda x : self . self_attn ( x , x , x , tgt_mask ))\nx = self . sublayer [ 1 ]( x , lambda x : self . src_attn ( x , m , m , src_mask ))\nreturn self . sublayer [ 2 ]( x , self . feed_forward )\nWe also modify the self-attention sub-layer in the decoder stack to prevent\npositions from attending to subsequent positions. This masking, combined with\nfact that the output embeddings are offset by one position, ensures that the\npredictions for position $i$ can depend only on the known outputs at positions\nless than $i$.\ndef subsequent_mask ( size ):\n\"Mask out subsequent positions.\"\nattn_shape = ( 1 , size , size )\nsubsequent_mask = np . triu ( np . ones ( attn_shape ), k = 1 ). astype ( 'uint8' )\nreturn torch . from_numpy ( subsequent_mask ) == 0\nBelow the attention mask shows the position each tgt word (row) is allowed to\nlook at (column). Words are blocked for attending to future words during\ntraining.\nplt . figure ( figsize = ( 5 , 5 ))\nplt . 
imshow ( subsequent_mask ( 20 )[ 0 ])\nNone\nAttention\nAn attention function can be described as mapping a query and a set of key-value\npairs to an output, where the query, keys, values, and output are all vectors.\nThe output is computed as a weighted sum of the values, where the weight\nassigned to each value is computed by a compatibility function of the query with\nthe corresponding key.\nWe call our particular attention “Scaled Dot-Product Attention”. The input\nconsists of queries and keys of dimension $d_k$, and values of dimension $d_v$.\nWe compute the dot products of the query with all keys, divide each by\n$\\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.\nImage ( filename = 'images/ModalNet-19.png' )\nIn practice, we compute the attention function on a set of queries\nsimultaneously, packed together into a matrix $Q$. The keys and values are\nalso packed together into matrices $K$ and $V$. We compute the matrix of\noutputs as:\n\\[\\mathrm{Attention}(Q, K, V) = \\mathrm{softmax}(\\frac{QK^T}{\\sqrt{d_k}})V\\]\ndef attention ( query , key , value , mask = None , dropout = None ):\n\"Compute 'Scaled Dot Product Attention'\"\nd_k = query . size ( - 1 )\nscores = torch . matmul ( query , key . transpose ( - 2 , - 1 )) \\\n / math . sqrt ( d_k )\nif mask is not None :\nscores = scores . masked_fill ( mask == 0 , - 1e9 )\np_attn = F . softmax ( scores , dim = - 1 )\nif dropout is not None :\np_attn = dropout ( p_attn )\nreturn torch . matmul ( p_attn , value ), p_attn\nThe two most commonly used attention functions are additive attention\n (cite) , and dot-product (multiplicative)\nattention. Dot-product attention is identical to our algorithm, except for the\nscaling factor of $\\frac{1}{\\sqrt{d_k}}$. 
Additive attention computes the\ncompatibility function using a feed-forward network with a single hidden layer.\nWhile the two are similar in theoretical complexity, dot-product attention is\nmuch faster and more space-efficient in practice, since it can be implemented\nusing highly optimized matrix multiplication code.\nWhile for small values of $d_k$ the two mechanisms perform similarly, additive\nattention outperforms dot product attention without scaling for larger values of\n$d_k$ (cite) . We suspect that for large\nvalues of $d_k$, the dot products grow large in magnitude, pushing the softmax\nfunction into regions where it has extremely small gradients (To illustrate why\nthe dot products get large, assume that the components of $q$ and $k$ are\nindependent random variables with mean $0$ and variance $1$. Then their dot\nproduct, $q \\cdot k = \\sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance\n$d_k$.). To counteract this effect, we scale the dot products by\n$\\frac{1}{\\sqrt{d_k}}$.\nImage ( filename = 'images/ModalNet-20.png' )\nMulti-head attention allows the model to jointly attend to information from\ndifferent representation subspaces at different positions. With a single\nattention head, averaging inhibits this.\n\\(\\mathrm{MultiHead}(Q, K, V) = \\mathrm{Concat}(\\mathrm{head_1}, ...,\n\\mathrm{head_h})W^O \\\\\n \\text{where}~\\mathrm{head_i} = \\mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)\\)\nWhere the projections are parameter matrices $W^Q_i \\in\n\\mathbb{R}^{d_{\\text{model}} \\times d_k}$, $W^K_i \\in\n\\mathbb{R}^{d_{\\text{model}} \\times d_k}$, $W^V_i \\in\n\\mathbb{R}^{d_{\\text{model}} \\times d_v}$ and $W^O \\in \\mathbb{R}^{hd_v \\times\nd_{\\text{model}}}$.\nIn this work we employ $h=8$ parallel attention layers, or heads. For each of\nthese we use $d_k=d_v=d_{\\text{model}}/h=64$. 
Due to the reduced dimension of\neach head, the total computational cost is similar to that of single-head\nattention with full dimensionality.\nclass MultiHeadedAttention ( nn . Module ):\ndef __init__ ( self , h , d_model , dropout = 0.1 ):\n\"Take in model size and number of heads.\"\nsuper ( MultiHeadedAttention , self ). __init__ ()\nassert d_model % h == 0\n# We assume d_v always equals d_k\n self . d_k = d_model // h\nself . h = h\nself . linears = clones ( nn . Linear ( d_model , d_model ), 4 )\nself . attn = None\nself . dropout = nn . Dropout ( p = dropout )\ndef forward ( self , query , key , value , mask = None ):\n\"Implements Figure 2\"\nif mask is not None :\n# Same mask applied to all h heads.\n mask = mask . unsqueeze ( 1 )\nnbatches = query . size ( 0 )\n# 1) Do all the linear projections in batch from d_model => h x d_k\nquery , key , value = \\\n [ l ( x ). view ( nbatches , - 1 , self . h , self . d_k ). transpose ( 1 , 2 )\nfor l , x in zip ( self . linears , ( query , key , value ))]\n# 2) Apply attention on all the projected vectors in batch.\nx , self . attn = attention ( query , key , value , mask = mask ,\ndropout = self . dropout )\n# 3) \"Concat\" using a view and apply a final linear.\nx = x . transpose ( 1 , 2 ). contiguous () \\\n . view ( nbatches , - 1 , self . h * self . d_k )\nreturn self . linears [ - 1 ]( x )\nApplications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n1) In “encoder-decoder attention” layers, the queries come from the previous\ndecoder layer, and the memory keys and values come from the output of the\nencoder. This allows every position in the decoder to attend over all\npositions in the input sequence. This mimics the typical encoder-decoder\nattention mechanisms in sequence-to-sequence models such as\n (cite) .\n2) The encoder contains self-attention layers. 
In a self-attention layer all of\nthe keys, values and queries come from the same place, in this case, the output\nof the previous layer in the encoder. Each position in the encoder can attend\nto all positions in the previous layer of the encoder.\n3) Similarly, self-attention layers in the decoder allow each position in the\ndecoder to attend to all positions in the decoder up to and including that\nposition. We need to prevent leftward information flow in the decoder to\npreserve the auto-regressive property. We implement this inside of scaled dot-\nproduct attention by masking out (setting to $-\\infty$) all values in the input\nof the softmax which correspond to illegal connections.\nPosition-wise Feed-Forward Networks\nIn addition to attention sub-layers, each of the layers in our encoder and\ndecoder contains a fully connected feed-forward network, which is applied to\neach position separately and identically. This consists of two linear\ntransformations with a ReLU activation in between.\n\\[\\mathrm{FFN}(x)=\\max(0, xW_1 + b_1) W_2 + b_2\\]\nWhile the linear transformations are the same across different positions, they\nuse different parameters from layer to layer. Another way of describing this is\nas two convolutions with kernel size 1. The dimensionality of input and output\nis $d_{\\text{model}}=512$, and the inner-layer has dimensionality $d_{ff}=2048$.\nclass PositionwiseFeedForward ( nn . Module ):\n\"Implements FFN equation.\"\ndef __init__ ( self , d_model , d_ff , dropout = 0.1 ):\nsuper ( PositionwiseFeedForward , self ). __init__ ()\nself . w_1 = nn . Linear ( d_model , d_ff )\nself . w_2 = nn . Linear ( d_ff , d_model )\nself . dropout = nn . Dropout ( dropout )\ndef forward ( self , x ):\nreturn self . w_2 ( self . dropout ( F . relu ( self . 
w_1 ( x ))))\nEmbeddings and Softmax\nSimilarly to other sequence transduction models, we use learned embeddings to\nconvert the input tokens and output tokens to vectors of dimension\n$d_{\\text{model}}$. We also use the usual learned linear transformation and\nsoftmax function to convert the decoder output to predicted next-token\nprobabilities. In our model, we share the same weight matrix between the two\nembedding layers and the pre-softmax linear transformation, similar to\n (cite) . In the embedding layers, we multiply\nthose weights by $\\sqrt{d_{\\text{model}}}$.\nclass Embeddings ( nn . Module ):\ndef __init__ ( self , d_model , vocab ):\nsuper ( Embeddings , self ). __init__ ()\nself . lut = nn . Embedding ( vocab , d_model )\nself . d_model = d_model\ndef forward ( self , x ):\nreturn self . lut ( x ) * math . sqrt ( self . d_model )\nPositional Encoding\nSince our model contains no recurrence and no convolution, in order for the\nmodel to make use of the order of the sequence, we must inject some information\nabout the relative or absolute position of the tokens in the sequence. To this\nend, we add “positional encodings” to the input embeddings at the bottoms of the\nencoder and decoder stacks. The positional encodings have the same dimension\n$d_{\\text{model}}$ as the embeddings, so that the two can be summed. There are\nmany choices of positional encodings, learned and fixed\n (cite) .\nIn this work, we use sine and cosine functions of different frequencies:\n\\(PE_{(pos,2i)} = sin(pos / 10000^{2i/d_{\\text{model}}})\\)\n\\(PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d_{\\text{model}}})\\)\nwhere $pos$ is the position and $i$ is the dimension. That is, each dimension\nof the positional encoding corresponds to a sinusoid. The wavelengths form a\ngeometric progression from $2\\pi$ to $10000 \\cdot 2\\pi$. 
We chose this function\nbecause we hypothesized it would allow the model to easily learn to attend by\nrelative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be\nrepresented as a linear function of $PE_{pos}$.\nIn addition, we apply dropout to the sums of the embeddings and the positional\nencodings in both the encoder and decoder stacks. For the base model, we use a\nrate of $P_{drop}=0.1$.\nclass PositionalEncoding ( nn . Module ):\n\"Implement the PE function.\"\ndef __init__ ( self , d_model , dropout , max_len = 5000 ):\nsuper ( PositionalEncoding , self ). __init__ ()\nself . dropout = nn . Dropout ( p = dropout )\n# Compute the positional encodings once in log space.\n pe = torch . zeros ( max_len , d_model )\nposition = torch . arange ( 0 , max_len ). unsqueeze ( 1 )\ndiv_term = torch . exp ( torch . arange ( 0 , d_model , 2 ) *\n- ( math . log ( 10000.0 ) / d_model ))\npe [:, 0 :: 2 ] = torch . sin ( position * div_term )\npe [:, 1 :: 2 ] = torch . cos ( position * div_term )\npe = pe . unsqueeze ( 0 )\nself . register_buffer ( 'pe' , pe )\ndef forward ( self , x ):\nx = x + Variable ( self . pe [:, : x . size ( 1 )],\nrequires_grad = False )\nreturn self . dropout ( x )\nBelow the positional encoding will add in a sine wave based on position. The\nfrequency and offset of the wave is different for each dimension.\nplt . figure ( figsize = ( 15 , 5 ))\npe = PositionalEncoding ( 20 , 0 )\ny = pe . forward ( Variable ( torch . zeros ( 1 , 100 , 20 )))\nplt . plot ( np . arange ( 100 ), y [ 0 , :, 4 : 8 ]. data . numpy ())\nplt . legend ([ \"dim %d\" % p for p in [ 4 , 5 , 6 , 7 ]])\nNone\nWe also experimented with using learned positional embeddings\n (cite) instead, and found that the two\nversions produced nearly identical results. 
We chose the sinusoidal version\nbecause it may allow the model to extrapolate to sequence lengths longer than\nthe ones encountered during training.\nFull Model\nHere we define a function that takes in hyperparameters and produces a full model.\ndef make_model ( src_vocab , tgt_vocab , N = 6 ,\nd_model = 512 , d_ff = 2048 , h = 8 , dropout = 0.1 ):\n\"Helper: Construct a model from hyperparameters.\"\nc = copy . deepcopy\nattn = MultiHeadedAttention ( h , d_model )\nff = PositionwiseFeedForward ( d_model , d_ff , dropout )\nposition = PositionalEncoding ( d_model , dropout )\nmodel = EncoderDecoder (\nEncoder ( EncoderLayer ( d_model , c ( attn ), c ( ff ), dropout ), N ),\nDecoder ( DecoderLayer ( d_model , c ( attn ), c ( attn ),\nc ( ff ), dropout ), N ),\nnn . Sequential ( Embeddings ( d_model , src_vocab ), c ( position )),\nnn . Sequential ( Embeddings ( d_model , tgt_vocab ), c ( position )),\nGenerator ( d_model , tgt_vocab ))\n# This was important from their code.\n# Initialize parameters with Glorot / fan_avg.\n for p in model . parameters ():\nif p . dim () > 1 :\nnn . init . xavier_uniform ( p )\nreturn model\n# Small example model.\n tmp_model = make_model ( 10 , 10 , 2 )\nNone\nTraining\nThis section describes the training regime for our models.\nWe stop for a quick interlude to introduce some of the tools\nneeded to train a standard encoder decoder model. First we define a batch object\nthat holds the src and target sentences for training, as well as constructing\nthe masks.\nBatches and Masking\nclass Batch :\n\"Object for holding a batch of data with mask during training.\"\ndef __init__ ( self , src , trg = None , pad = 0 ):\nself . src = src\nself . src_mask = ( src != pad ). unsqueeze ( - 2 )\nif trg is not None :\nself . trg = trg [:, : - 1 ]\nself . trg_y = trg [:, 1 :]\nself . trg_mask = \\\n self . make_std_mask ( self . trg , pad )\nself . ntokens = ( self . trg_y != pad ). data . 
sum ()\n@ staticmethod\ndef make_std_mask ( tgt , pad ):\n\"Create a mask to hide padding and future words.\"\ntgt_mask = ( tgt != pad ). unsqueeze ( - 2 )\ntgt_mask = tgt_mask & Variable (\nsubsequent_mask ( tgt . size ( - 1 )). type_as ( tgt_mask . data ))\nreturn tgt_mask\nNext we create a generic training and scoring function to keep track of loss.\nWe pass in a generic loss compute function that also handles parameter updates.\nTraining Loop\ndef run_epoch ( data_iter , model , loss_compute ):\n\"Standard Training and Logging Function\"\nstart = time . time ()\ntotal_tokens = 0\ntotal_loss = 0\ntokens = 0\nfor i , batch in enumerate ( data_iter ):\nout = model . forward ( batch . src , batch . trg ,\nbatch . src_mask , batch . trg_mask )\nloss = loss_compute ( out , batch . trg_y , batch . ntokens )\ntotal_loss += loss\ntotal_tokens += batch . ntokens\ntokens += batch . ntokens\nif i % 50 == 1 :\nelapsed = time . time () - start\nprint ( \"Epoch Step: %d Loss: %f Tokens per Sec: %f\" %\n( i , loss / batch . ntokens , tokens / elapsed ))\nstart = time . time ()\ntokens = 0\nreturn total_loss / total_tokens\nTraining Data and Batching\nWe trained on the standard WMT 2014 English-German dataset consisting of about\n4.5 million sentence pairs. Sentences were encoded using byte-pair encoding,\nwhich has a shared source-target vocabulary of about 37000 tokens. For English-\nFrench, we used the significantly larger WMT 2014 English-French dataset\nconsisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.\nSentence pairs were batched together by approximate sequence length. Each\ntraining batch contained a set of sentence pairs containing approximately 25000\nsource tokens and 25000 target tokens.\nWe will use torch text for batching. 
This is discussed in more detail below.\nHere we create batches in a torchtext function that ensures our batch size\npadded to the maximum batchsize does not surpass a threshold (25000 if we have 8\ngpus).\nglobal max_src_in_batch , max_tgt_in_batch\ndef batch_size_fn ( new , count , sofar ):\n\"Keep augmenting batch and calculate total number of tokens + padding.\"\nglobal max_src_in_batch , max_tgt_in_batch\nif count == 1 :\nmax_src_in_batch = 0\nmax_tgt_in_batch = 0\nmax_src_in_batch = max ( max_src_in_batch , len ( new . src ))\nmax_tgt_in_batch = max ( max_tgt_in_batch , len ( new . trg ) + 2 )\nsrc_elements = count * max_src_in_batch\ntgt_elements = count * max_tgt_in_batch\nreturn max ( src_elements , tgt_elements )\nHardware and Schedule\nWe trained our models on one machine with 8 NVIDIA P100 GPUs. For our base\nmodels using the hyperparameters described throughout the paper, each training\nstep took about 0.4 seconds. We trained the base models for a total of 100,000\nsteps or 12 hours. For our big models, step time was 1.0 seconds. The big\nmodels were trained for 300,000 steps (3.5 days).\nOptimizer\nWe used the Adam optimizer (cite) with\n$\\beta_1=0.9$, $\\beta_2=0.98$ and $\\epsilon=10^{-9}$. We varied the learning\nrate over the course of training, according to the formula:\n\\(lrate = d_{\\text{model}}^{-0.5} \\cdot\n \\min({step\\_num}^{-0.5},\n {step\\_num} \\cdot {warmup\\_steps}^{-1.5})\\)\nThis corresponds to increasing the learning rate linearly for the first\n$warmup_steps$ training steps, and decreasing it thereafter proportionally to\nthe inverse square root of the step number. We used $warmup_steps=4000$.\nNote: This part is very important. Need to train with this setup of the model.\nclass NoamOpt :\n\"Optim wrapper that implements rate.\"\ndef __init__ ( self , model_size , factor , warmup , optimizer ):\nself . optimizer = optimizer\nself . _step = 0\nself . warmup = warmup\nself . factor = factor\nself . 
model_size = model_size\nself . _rate = 0\ndef step ( self ):\n\"Update parameters and rate\"\nself . _step += 1\nrate = self . rate ()\nfor p in self . optimizer . param_groups :\np [ 'lr' ] = rate\nself . _rate = rate\nself . optimizer . step ()\ndef rate ( self , step = None ):\n\"Implement `lrate` above\"\nif step is None :\nstep = self . _step\nreturn self . factor * \\\n ( self . model_size ** ( - 0.5 ) *\nmin ( step ** ( - 0.5 ), step * self . warmup ** ( - 1.5 )))\ndef get_std_opt ( model ):\nreturn NoamOpt ( model . src_embed [ 0 ]. d_model , 2 , 4000 ,\ntorch . optim . Adam ( model . parameters (), lr = 0 , betas = ( 0.9 , 0.98 ), eps = 1e-9 ))\nExample of the curves of this model for different model sizes and for\noptimization hyperparameters.\n# Three settings of the lrate hyperparameters.\n opts = [ NoamOpt ( 512 , 1 , 4000 , None ),\nNoamOpt ( 512 , 1 , 8000 , None ),\nNoamOpt ( 256 , 1 , 4000 , None )]\nplt . plot ( np . arange ( 1 , 20000 ), [[ opt . rate ( i ) for opt in opts ] for i in range ( 1 , 20000 )])\nplt . legend ([ \"512:4000\" , \"512:8000\" , \"256:4000\" ])\nNone\nRegularization\nLabel Smoothing\nDuring training, we employed label smoothing of value $\\epsilon_{ls}=0.1$\n (cite) . This hurts perplexity, as the model\nlearns to be more unsure, but improves accuracy and BLEU score.\nWe implement label smoothing using the KL div loss. Instead of using a one-hot\ntarget distribution, we create a distribution that has confidence of the\ncorrect word and the rest of the smoothing mass distributed throughout the\nvocabulary.\nclass LabelSmoothing ( nn . Module ):\n\"Implement label smoothing.\"\ndef __init__ ( self , size , padding_idx , smoothing = 0.0 ):\nsuper ( LabelSmoothing , self ). __init__ ()\nself . criterion = nn . KLDivLoss ( size_average = False )\nself . padding_idx = padding_idx\nself . confidence = 1.0 - smoothing\nself . smoothing = smoothing\nself . size = size\nself . 
true_dist = None\ndef forward ( self , x , target ):\nassert x . size ( 1 ) == self . size\ntrue_dist = x . data . clone ()\ntrue_dist . fill_ ( self . smoothing / ( self . size - 2 ))\ntrue_dist . scatter_ ( 1 , target . data . unsqueeze ( 1 ), self . confidence )\ntrue_dist [:, self . padding_idx ] = 0\nmask = torch . nonzero ( target . data == self . padding_idx )\nif mask . dim () > 0 :\ntrue_dist . index_fill_ ( 0 , mask . squeeze (), 0.0 )\nself . true_dist = true_dist\nreturn self . criterion ( x , Variable ( true_dist , requires_grad = False ))\nHere we can see an example of how the mass is distributed to the words based\non confidence.\n# Example of label smoothing.\n crit = LabelSmoothing ( 5 , 0 , 0.4 )\npredict = torch . FloatTensor ([[ 0 , 0.2 , 0.7 , 0.1 , 0 ],\n[ 0 , 0.2 , 0.7 , 0.1 , 0 ],\n[ 0 , 0.2 , 0.7 , 0.1 , 0 ]])\nv = crit ( Variable ( predict . log ()),\nVariable ( torch . LongTensor ([ 2 , 1 , 0 ])))\n# Show the target distributions expected by the system.\n plt . imshow ( crit . true_dist )\nNone\nLabel smoothing actually starts to penalize the model if it gets very\nconfident about a given choice.\ncrit = LabelSmoothing ( 5 , 0 , 0.1 )\ndef loss ( x ):\nd = x + 3 * 1\npredict = torch . FloatTensor ([[ 0 , x / d , 1 / d , 1 / d , 1 / d ],\n])\n#print(predict)\n return crit ( Variable ( predict . log ()),\nVariable ( torch . LongTensor ([ 1 ]))). data [ 0 ]\nplt . plot ( np . arange ( 1 , 100 ), [ loss ( x ) for x in range ( 1 , 100 )])\nNone\nA First Example\nWe can begin by trying out a simple copy-task. Given a random set of input\nsymbols from a small vocabulary, the goal is to generate back those same\nsymbols.\nSynthetic Data\ndef data_gen ( V , batch , nbatches ):\n\"Generate random data for a src-tgt copy task.\"\nfor i in range ( nbatches ):\ndata = torch . from_numpy ( np . random . 
randint ( 1 , V , size = ( batch , 10 )))\ndata [:, 0 ] = 1\nsrc = Variable ( data , requires_grad = False )\ntgt = Variable ( data , requires_grad = False )\nyield Batch ( src , tgt , 0 )\nLoss Computation\nclass SimpleLossCompute :\n\"A simple loss compute and train function.\"\ndef __init__ ( self , generator , criterion , opt = None ):\nself . generator = generator\nself . criterion = criterion\nself . opt = opt\ndef __call__ ( self , x , y , norm ):\nx = self . generator ( x )\nloss = self . criterion ( x . contiguous (). view ( - 1 , x . size ( - 1 )),\ny . contiguous (). view ( - 1 )) / norm\nloss . backward ()\nif self . opt is not None :\nself . opt . step ()\nself . opt . optimizer . zero_grad ()\nreturn loss . data [ 0 ] * norm\nGreedy Decoding\n# Train the simple copy task.\n V = 11\ncriterion = LabelSmoothing ( size = V , padding_idx = 0 , smoothing = 0.0 )\nmodel = make_model ( V , V , N = 2 )\nmodel_opt = NoamOpt ( model . src_embed [ 0 ]. d_model , 1 , 400 ,\ntorch . optim . Adam ( model . parameters (), lr = 0 , betas = ( 0.9 , 0.98 ), eps = 1e-9 ))\nfor epoch in range ( 10 ):\nmodel . train ()\nrun_epoch ( data_gen ( V , 30 , 20 ), model ,\nSimpleLossCompute ( model . generator , criterion , model_opt ))\nmodel . eval ()\nprint ( run_epoch ( data_gen ( V , 30 , 5 ), model ,\nSimpleLossCompute ( model . 
generator , criterion , None )))\nEpoch Step: 1 Loss: 3.023465 Tokens per Sec: 403.074173\nEpoch Step: 1 Loss: 1.920030 Tokens per Sec: 641.689380\n1.9274832487106324\nEpoch Step: 1 Loss: 1.940011 Tokens per Sec: 432.003378\nEpoch Step: 1 Loss: 1.699767 Tokens per Sec: 641.979665\n1.657595729827881\nEpoch Step: 1 Loss: 1.860276 Tokens per Sec: 433.320240\nEpoch Step: 1 Loss: 1.546011 Tokens per Sec: 640.537198\n1.4888023376464843\nEpoch Step: 1 Loss: 1.682198 Tokens per Sec: 432.092305\nEpoch Step: 1 Loss: 1.313169 Tokens per Sec: 639.441857\n1.3485562801361084\nEpoch Step: 1 Loss: 1.278768 Tokens per Sec: 433.568756\nEpoch Step: 1 Loss: 1.062384 Tokens per Sec: 642.542067\n0.9853351473808288\nEpoch Step: 1 Loss: 1.269471 Tokens per Sec: 433.388727\nEpoch Step: 1 Loss: 0.590709 Tokens per Sec: 642.862135\n0.5686767101287842\nEpoch Step: 1 Loss: 0.997076 Tokens per Sec: 433.009746\nEpoch Step: 1 Loss: 0.343118 Tokens per Sec: 642.288427\n0.34273059368133546\nEpoch Step: 1 Loss: 0.459483 Tokens per Sec: 434.594030\nEpoch Step: 1 Loss: 0.290385 Tokens per Sec: 642.519464\n0.2612409472465515\nEpoch Step: 1 Loss: 1.031042 Tokens per Sec: 434.557008\nEpoch Step: 1 Loss: 0.437069 Tokens per Sec: 643.630322\n0.4323212027549744\nEpoch Step: 1 Loss: 0.617165 Tokens per Sec: 436.652626\nEpoch Step: 1 Loss: 0.258793 Tokens per Sec: 644.372296\n0.27331129014492034\nThis code predicts a translation using greedy decoding for simplicity.\ndef greedy_decode ( model , src , src_mask , max_len , start_symbol ):\nmemory = model . encode ( src , src_mask )\nys = torch . ones ( 1 , 1 ). fill_ ( start_symbol ). type_as ( src . data )\nfor i in range ( max_len - 1 ):\nout = model . decode ( memory , src_mask ,\nVariable ( ys ),\nVariable ( subsequent_mask ( ys . size ( 1 ))\n. type_as ( src . data )))\nprob = model . generator ( out [:, - 1 ])\n_ , next_word = torch . max ( prob , dim = 1 )\nnext_word = next_word . data [ 0 ]\nys = torch . cat ([ ys ,\ntorch . ones ( 1 , 1 ). 
type_as ( src . data ). fill_ ( next_word )], dim = 1 )\nreturn ys\nmodel . eval ()\nsrc = Variable ( torch . LongTensor ([[ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 ]]) )\nsrc_mask = Variable ( torch . ones ( 1 , 1 , 10 ) )\nprint ( greedy_decode ( model , src , src_mask , max_len = 10 , start_symbol = 1 ))\n1 2 3 4 5 6 7 8 9 10\n[torch.LongTensor of size 1x10]\nA Real World Example\nNow we consider a real-world example using the IWSLT German-English\nTranslation task. This task is much smaller than the WMT task considered in the\npaper, but it illustrates the whole system. We also show how to use multi-gpu\nprocessing to make it really fast.\n#!pip install torchtext spacy\n#!python -m spacy download en\n#!python -m spacy download de\nData Loading\nWe will load the dataset using torchtext and spacy for tokenization.\n# For data loading.\n from torchtext import data , datasets\nif True :\nimport spacy\nspacy_de = spacy . load ( 'de' )\nspacy_en = spacy . load ( 'en' )\ndef tokenize_de ( text ):\nreturn [ tok . text for tok in spacy_de . tokenizer ( text )]\ndef tokenize_en ( text ):\nreturn [ tok . text for tok in spacy_en . tokenizer ( text )]\nBOS_WORD = '<s>'\nEOS_WORD = '</s>'\nBLANK_WORD = \"<blank>\"\nSRC = data . Field ( tokenize = tokenize_de , pad_token = BLANK_WORD )\nTGT = data . Field ( tokenize = tokenize_en , init_token = BOS_WORD ,\neos_token = EOS_WORD , pad_token = BLANK_WORD )\nMAX_LEN = 100\ntrain , val , test = datasets . IWSLT . splits (\nexts = ( '.de' , '.en' ), fields = ( SRC , TGT ),\nfilter_pred = lambda x : len ( vars ( x )[ 'src' ]) <= MAX_LEN and\nlen ( vars ( x )[ 'trg' ]) <= MAX_LEN )\nMIN_FREQ = 2\nSRC . build_vocab ( train . src , min_freq = MIN_FREQ )\nTGT . build_vocab ( train . trg , min_freq = MIN_FREQ )\nBatching matters a ton for speed. We want to have very evenly divided batches,\nwith absolutely minimal padding. To do this we have to hack a bit around the\ndefault torchtext batching. 
This code patches their default batching to make\nsure we search over enough sentences to find tight batches.\nIterators\nclass MyIterator ( data . Iterator ):\ndef create_batches ( self ):\nif self . train :\ndef pool ( d , random_shuffler ):\nfor p in data . batch ( d , self . batch_size * 100 ):\np_batch = data . batch (\nsorted ( p , key = self . sort_key ),\nself . batch_size , self . batch_size_fn )\nfor b in random_shuffler ( list ( p_batch )):\nyield b\nself . batches = pool ( self . data (), self . random_shuffler )\nelse :\nself . batches = []\nfor b in data . batch ( self . data (), self . batch_size ,\nself . batch_size_fn ):\nself . batches . append ( sorted ( b , key = self . sort_key ))\ndef rebatch ( pad_idx , batch ):\n\"Fix order in torchtext to match ours\"\nsrc , trg = batch . src . transpose ( 0 , 1 ), batch . trg . transpose ( 0 , 1 )\nreturn Batch ( src , trg , pad_idx )\nMulti-GPU Training\nFinally to really target fast training, we will use multi-gpu. This code\nimplements multi-gpu word generation. It is not specific to transformer so I\nwon’t go into too much detail. The idea is to split up word generation at\ntraining time into chunks to be processed in parallel across many different\ngpus. We do this using pytorch parallel primitives:\nreplicate - split modules onto different gpus.\nscatter - split batches onto different gpus\nparallel_apply - apply module to batches on different gpus\ngather - pull scattered data back onto one gpu.\nnn.DataParallel - a special module wrapper that calls these all before\nevaluating.\n# Skip if not interested in multigpu.\n class MultiGPULossCompute :\n\"A multi-gpu loss compute and train function.\"\ndef __init__ ( self , generator , criterion , devices , opt = None , chunk_size = 5 ):\n# Send out to different gpus.\n self . generator = generator\nself . criterion = nn . parallel . replicate ( criterion ,\ndevices = devices )\nself . opt = opt\nself . devices = devices\nself . 
chunk_size = chunk_size\ndef __call__ ( self , out , targets , normalize ):\ntotal = 0.0\ngenerator = nn . parallel . replicate ( self . generator ,\ndevices = self . devices )\nout_scatter = nn . parallel . scatter ( out ,\ntarget_gpus = self . devices )\nout_grad = [[] for _ in out_scatter ]\ntargets = nn . parallel . scatter ( targets ,\ntarget_gpus = self . devices )\n# Divide generating into chunks.\n chunk_size = self . chunk_size\nfor i in range ( 0 , out_scatter [ 0 ]. size ( 1 ), chunk_size ):\n# Predict distributions\n out_column = [[ Variable ( o [:, i : i + chunk_size ]. data ,\nrequires_grad = self . opt is not None )]\nfor o in out_scatter ]\ngen = nn . parallel . parallel_apply ( generator , out_column )\n# Compute loss.\ny = [( g . contiguous (). view ( - 1 , g . size ( - 1 )),\nt [:, i : i + chunk_size ]. contiguous (). view ( - 1 ))\nfor g , t in zip ( gen , targets )]\nloss = nn . parallel . parallel_apply ( self . criterion , y )\n# Sum and normalize loss\n l = nn . parallel . gather ( loss ,\ntarget_device = self . devices [ 0 ])\nl = l . sum ()[ 0 ] / normalize\ntotal += l . data [ 0 ]\n# Backprop loss to output of transformer\n if self . opt is not None :\nl . backward ()\nfor j , l in enumerate ( loss ):\nout_grad [ j ]. append ( out_column [ j ][ 0 ]. grad . data . clone ())\n# Backprop all loss through transformer.\nif self . opt is not None :\nout_grad = [ Variable ( torch . cat ( og , dim = 1 )) for og in out_grad ]\no1 = out\no2 = nn . parallel . gather ( out_grad ,\ntarget_device = self . devices [ 0 ])\no1 . backward ( gradient = o2 )\nself . opt . step ()\nself . opt . optimizer . zero_grad ()\nreturn total * normalize\nNow we create our model, criterion, optimizer, data iterators, and\nparallelization\n# GPUs to use\n devices = [ 0 , 1 , 2 , 3 ]\nif True :\npad_idx = TGT . vocab . stoi [ \"<blank>\" ]\nmodel = make_model ( len ( SRC . vocab ), len ( TGT . vocab ), N = 6 )\nmodel . 
cuda ()\ncriterion = LabelSmoothing ( size = len ( TGT . vocab ), padding_idx = pad_idx , smoothing = 0.1 )\ncriterion . cuda ()\nBATCH_SIZE = 12000\ntrain_iter = MyIterator ( train , batch_size = BATCH_SIZE , device = 0 ,\nrepeat = False , sort_key = lambda x : ( len ( x . src ), len ( x . trg )),\nbatch_size_fn = batch_size_fn , train = True )\nvalid_iter = MyIterator ( val , batch_size = BATCH_SIZE , device = 0 ,\nrepeat = False , sort_key = lambda x : ( len ( x . src ), len ( x . trg )),\nbatch_size_fn = batch_size_fn , train = False )\nmodel_par = nn . DataParallel ( model , device_ids = devices )\nNone\nNow we train the model. I will play with the warmup steps a bit, but\neverything else uses the default parameters. On an AWS p3.8xlarge with 4 Tesla\nV100s, this runs at ~27,000 tokens per second with a batch size of 12,000\nTraining the System\n#!wget https://s3.amazonaws.com/opennmt-models/iwslt.pt\nif False :\nmodel_opt = NoamOpt ( model . src_embed [ 0 ]. d_model , 1 , 2000 ,\ntorch . optim . Adam ( model . parameters (), lr = 0 , betas = ( 0.9 , 0.98 ), eps = 1e-9 ))\nfor epoch in range ( 10 ):\nmodel_par . train ()\nrun_epoch (( rebatch ( pad_idx , b ) for b in train_iter ),\nmodel_par ,\nMultiGPULossCompute ( model . generator , criterion ,\ndevices = devices , opt = model_opt ))\nmodel_par . eval ()\nloss = run_epoch (( rebatch ( pad_idx , b ) for b in valid_iter ),\nmodel_par ,\nMultiGPULossCompute ( model . generator , criterion ,\ndevices = devices , opt = None ))\nprint ( loss )\nelse :\nmodel = torch . load ( \"iwslt.pt\" )\nOnce trained we can decode the model to produce a set of translations. Here we\nsimply translate the first sentence in the validation set. This dataset is\npretty small so the translations with greedy search are reasonably accurate.\nfor i , batch in enumerate ( valid_iter ):\nsrc = batch . src . transpose ( 0 , 1 )[: 1 ]\nsrc_mask = ( src != SRC . vocab . stoi [ \"<blank>\" ]). 
unsqueeze ( - 2 )\nout = greedy_decode ( model , src , src_mask ,\nmax_len = 60 , start_symbol = TGT . vocab . stoi [ \"<s>\" ])\nprint ( \"Translation:\" , end = \" \\t \" )\nfor i in range ( 1 , out . size ( 1 )):\nsym = TGT . vocab . itos [ out [ 0 , i ]]\nif sym == \"</s>\" : break\nprint ( sym , end = \" \" )\nprint ()\nprint ( \"Target:\" , end = \" \\t \" )\nfor i in range ( 1 , batch . trg . size ( 0 )):\nsym = TGT . vocab . itos [ batch . trg . data [ i , 0 ]]\nif sym == \"</s>\" : break\nprint ( sym , end = \" \" )\nprint ()\nbreak\nTranslation: <unk> <unk> . In my language , that means , thank you very much . \nGold: <unk> <unk> . It means in my language , thank you very much .\nAdditional Components: BPE, Search, Averaging\nSo this mostly covers the transformer model itself. There are four aspects\nthat we didn’t cover explicitly. We also have all these additional features\nimplemented in OpenNMT-py .\n1) BPE/ Word-piece: We can use a library to first preprocess the data into\nsubword units. See Rico Sennrich’s subword-\nnmt implementation. These models will\ntransform the training data to look like this:\n▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP ▁an ▁einen\n▁bestimmte n ▁Empfänger ▁gesendet ▁werden .\n2) Shared Embeddings: When using BPE with shared vocabulary we can share the\nsame weight vectors between the source / target / generator. See the\n (cite) for details. To add this to the model\nsimply do this:\nif False :\nmodel . src_embed [ 0 ]. lut . weight = model . tgt_embed [ 0 ]. lut . weight\nmodel . generator . lut . weight = model . tgt_embed [ 0 ]. lut . weight\n3) Beam Search: This is a bit too complicated to cover here. See the OpenNMT-\npy \nfor a pytorch implementation.\n4) Model Averaging: The paper averages the last k checkpoints to create an\nensembling effect. We can do this after the fact if we have a bunch of models:\ndef average ( model , models ):\n\"Average models into model\"\nfor ps in zip ( * [ m . 
params () for m in [ model ] + models ]):\np [ 0 ]. copy_ ( torch . sum ( * ps [ 1 :]) / len ( ps [ 1 :]))\nResults\nOn the WMT 2014 English-to-German translation task, the big transformer model\n(Transformer (big)\nin Table 2) outperforms the best previously reported models (including\nensembles) by more than 2.0\nBLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration\nof this model is\nlisted in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs.\nEven our base model\nsurpasses all previously published models and ensembles, at a fraction of the\ntraining cost of any of\nthe competitive models.\nOn the WMT 2014 English-to-French translation task, our big model achieves a\nBLEU score of 41.0,\noutperforming all of the previously published single models, at less than 1/4\nthe training cost of the\nprevious state-of-the-art model. The Transformer (big) model trained for\nEnglish-to-French used\ndropout rate Pdrop = 0.1, instead of 0.3.\nImage ( filename = \"images/results.png\" )\nThe code we have written here is a version of the base model. There are fully\ntrained versions of this system available here (Example\nModels) .\nWith the additional extensions in the last section, the OpenNMT-py replication\ngets to 26.9 on EN-DE WMT. Here I have loaded in those parameters to our\nreimplementation.\n! wget https : // s3 . amazonaws . com / opennmt - models / en - de - model . pt\nmodel , SRC , TGT = torch . load ( \"en-de-model.pt\" )\nmodel . eval ()\nsent = \"▁The ▁log ▁file ▁can ▁be ▁sent ▁secret ly ▁with ▁email ▁or ▁FTP ▁to ▁a ▁specified ▁receiver\" . split ()\nsrc = torch . LongTensor ([[ SRC . stoi [ w ] for w in sent ]])\nsrc = Variable ( src )\nsrc_mask = ( src != SRC . stoi [ \"<blank>\" ]). unsqueeze ( - 2 )\nout = greedy_decode ( model , src , src_mask ,\nmax_len = 60 , start_symbol = TGT . stoi [ \"<s>\" ])\nprint ( \"Translation:\" , end = \" \\t \" )\ntrans = \"<s> \"\nfor i in range ( 1 , out . size ( 1 )):\nsym = TGT . 
itos [ out [ 0 , i ]]\nif sym == \"</s>\" : break\ntrans += sym + \" \"\nprint ( trans )\nTranslation: <s> ▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP ▁an ▁einen ▁bestimmte n ▁Empfänger ▁gesendet ▁werden .\nAttention Visualization\nEven with a greedy decoder the translation looks pretty good. We can further\nvisualize it to see what is happening at each layer of the attention\ntgt_sent = trans . split ()\ndef draw ( data , x , y , ax ):\nseaborn . heatmap ( data ,\nxticklabels = x , square = True , yticklabels = y , vmin = 0.0 , vmax = 1.0 ,\ncbar = False , ax = ax )\nfor layer in range ( 1 , 6 , 2 ):\nfig , axs = plt . subplots ( 1 , 4 , figsize = ( 20 , 10 ))\nprint ( \"Encoder Layer\" , layer + 1 )\nfor h in range ( 4 ):\ndraw ( model . encoder . layers [ layer ]. self_attn . attn [ 0 , h ]. data ,\nsent , sent if h == 0 else [], ax = axs [ h ])\nplt . show ()\nfor layer in range ( 1 , 6 , 2 ):\nfig , axs = plt . subplots ( 1 , 4 , figsize = ( 20 , 10 ))\nprint ( \"Decoder Self Layer\" , layer + 1 )\nfor h in range ( 4 ):\ndraw ( model . decoder . layers [ layer ]. self_attn . attn [ 0 , h ]. data [: len ( tgt_sent ), : len ( tgt_sent )],\ntgt_sent , tgt_sent if h == 0 else [], ax = axs [ h ])\nplt . show ()\nprint ( \"Decoder Src Layer\" , layer + 1 )\nfig , axs = plt . subplots ( 1 , 4 , figsize = ( 20 , 10 ))\nfor h in range ( 4 ):\ndraw ( model . decoder . layers [ layer ]. self_attn . attn [ 0 , h ]. data [: len ( tgt_sent ), : len ( sent )],\nsent , tgt_sent if h == 0 else [], ax = axs [ h ])\nplt . show ()\nEncoder Layer 2\nEncoder Layer 4\nEncoder Layer 6\nDecoder Self Layer 2\nDecoder Src Layer 2\nDecoder Self Layer 4\nDecoder Src Layer 4\nDecoder Self Layer 6\nDecoder Src Layer 6\nConclusion\nHopefully this code is useful for future research. Please reach out if you\nhave any issues. 
If you find this code helpful, also check out our other OpenNMT\ntools.\n@inproceedings{opennmt,\n author = {Guillaume Klein and\n Yoon Kim and\n Yuntian Deng and\n Jean Senellart and\n Alexander M. Rush},\n title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},\n booktitle = {Proc. ACL},\n year = {2017},\n url = {https://doi.org/10.18653/v1/P17-4012},\n doi = {10.18653/v1/P17-4012}\n}\nCheers,\nsrush\nharvardnlp\nCopyright © 2024 The President and Fellows of Harvard College\n"
    },
    {
      "turn_idx": 13,
      "action": "file_saver({\"content\": \"# Information from collecting and cross-verifying encoder layer counts (BERT base vs. Transformer encoder)\\n\\nSources:\\n- BERT (language model) — Wikipedia: https://en.wikipedia.org/wiki/BERT_(language_model)\\n- The Annotated Transformer — Harvard NLP: https://nlp.seas.harvard.edu/2018/04/03/attention.html\\n- The Transformer: Attention Is All You Need — Glass Box Medicine: https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/\\n\\nExtracted Content (verbatim quotes with attribution):\\n\\n1) BERT base encoder layers\\n- Source: BERT (language model) — Wikipedia\\n  Quote: \\\"After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.\\\"\\n  URL: https://en.wikipedia.org/wiki/BERT_(language_model)\\n\\n- Source: BERT (language model) — Wikipedia (Architectural family)\\n  Quote: \\\"The notation for encoder stack is written as L/H. For example, BERT BASE is written as 12L/768H, BERT LARGE as 24L/1024H, and BERT TINY as 2L/128H.\\\"\\n  URL: https://en.wikipedia.org/wiki/BERT_(language_model)\\n\\n2) Transformer encoder layers (Attention Is All You Need)\\n- Source: The Annotated Transformer — Harvard NLP\\n  Quote: \\\"The encoder is composed of a stack of N=6 identical layers.\\\"\\n  URL: https://nlp.seas.harvard.edu/2018/04/03/attention.html\\n\\n- Source: The Transformer: Attention Is All You Need — Glass Box Medicine\\n  Quote: \\\"As you can see from the figure, the encoder is made of N = 6 identical layers stacked on top of each other.\\\"\\n  URL: https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/\\n\", \"file_path\": \"work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\", \"mode\": \"w\"})",
      "explanation": "",
      "observation": "Content successfully saved to work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md"
    },
    {
      "turn_idx": 14,
      "action": "mark_step({\"step_index\": 1, \"step_status\": \"completed\", \"step_notes\": \"Collected and cross-verified authoritative counts for encoder blocks:\\n- BERT base: 12 Transformer encoder blocks (12L), per Wikipedia (BERT page): \\\"passed forward through 12 Transformer encoder blocks\\\" and \\\"BERT BASE is written as 12L/768H\\\".\\n- Transformer encoder in Attention Is All You Need: stack of N=6 identical layers, per Harvard NLP 'The Annotated Transformer' and Glass Box Medicine posts. \\nSaved consolidated extraction with verbatim quotes and sources at: work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\"})",
      "explanation": "",
      "observation": "Step 1: step_status is completed, step_notes is Collected and cross-verified authoritative counts for encoder blocks:\n- BERT base: 12 Transformer encoder blocks (12L), per Wikipedia (BERT page): \"passed forward through 12 Transformer encoder blocks\" and \"BERT BASE is written as 12L/768H\".\n- Transformer encoder in Attention Is All You Need: stack of N=6 identical layers, per Harvard NLP 'The Annotated Transformer' and Glass Box Medicine posts. \nSaved consolidated extraction with verbatim quotes and sources at: work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md "
    },
    {
      "turn_idx": 15,
      "action": "file_saver({\"file_path\": \"work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md\", \"content\": \"Step 2: Compute difference in number of encoder blocks\\n\\n- BERT base encoder blocks: 12\\n- Transformer (Attention Is All You Need, base encoder) blocks: 6\\n- Difference (BERT base minus Transformer encoder): 6\\n\\nReference: see extracted sources in work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\", \"mode\": \"w\"})",
      "explanation": "Computed the difference in encoder blocks: BERT base has 12 encoder layers, and the original Transformer encoder (Attention Is All You Need, base) has 6 layers; difference is 6. Saved calculation summary at: work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md",
      "observation": "Content successfully saved to work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md"
    },
    {
      "turn_idx": 16,
      "action": "mark_step({\"step_index\": 2, \"step_status\": \"completed\", \"step_notes\": \"Computed the difference in encoder blocks: BERT base has 12 encoder layers, and the original Transformer encoder (Attention Is All You Need, base) has 6 layers; difference is 6. Saved calculation summary at: work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md\"})",
      "explanation": "",
      "observation": "Step 2: step_status is completed, step_notes is Computed the difference in encoder blocks: BERT base has 12 encoder layers, and the original Transformer encoder (Attention Is All You Need, base) has 6 layers; difference is 6. Saved calculation summary at: work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md "
    },
    {
      "turn_idx": 17,
      "action": "mark_step({\"step_index\": 3, \"step_status\": \"completed\", \"step_notes\": \"Validated formatting and finalized answer. Summary: BERT base encoder has 12 Transformer encoder layers; the original Transformer encoder from \\\"Attention Is All You Need\\\" uses 6 layers. Therefore, BERT base has 6 more encoder blocks. Supporting files created in previous steps:\\n- work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\\n- work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md\\nNo issues encountered. Final answer provided below per required template.\"})",
      "explanation": "Validated formatting and finalized answer. Summary: BERT base encoder has 12 Transformer encoder layers; the original Transformer encoder from \"Attention Is All You Need\" uses 6 layers. Therefore, BERT base has 6 more encoder blocks. Supporting files created in previous steps:\n- work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\n- work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md\nNo issues encountered. Final answer provided below per required template.",
      "observation": "Step 3: step_status is completed, step_notes is Validated formatting and finalized answer. Summary: BERT base encoder has 12 Transformer encoder layers; the original Transformer encoder from \"Attention Is All You Need\" uses 6 layers. Therefore, BERT base has 6 more encoder blocks. Supporting files created in previous steps:\n- work_space/gaia_11af4e1a_20251221_200158_779820/search_results_summary_collect_and_cross_verify_encoder_layers.md\n- work_space/gaia_11af4e1a_20251221_200158_779820/step2_difference_calculation.md\nNo issues encountered. Final answer provided below per required template. "
    },
    {
      "turn_idx": 18,
      "action": "FINAL_ANSWER: 6",
      "explanation": "Overview:\n- The task was completed successfully. The objective was to determine how many more encoder blocks BERT base has compared to the encoder in the original Transformer architecture (Attention Is All You Need) and provide the answer under strict formatting rules.\n\nWhat was done:\n- Clarified the question and the required output formatting constraints.\n- Collected and cross-verified the encoder layer counts:\n  - BERT base: 12 encoder layers.\n  - Original Transformer encoder (Attention Is All You Need): 6 encoder layers.\n- Computed the difference between the two counts.\n- Validated that the final answer complies with the specified minimal format.\n\nOutcome:\n- Determined that BERT base has 6 more encoder blocks than the encoder in the original Transformer.\n\nKey success factors:\n- Knowledge of canonical model configurations (BERT base 12 layers; original Transformer encoder 6 layers).\n- Clear interpretation of “blocks” as encoder layers.\n- Strict adherence to the specified output format.\n\nFINAL ANSWER: 6",
      "observation": ""
    }
  ],
  "source": "gaia_level1"
}