Analysis of Transformer Decoder Architecture and KV Cache Behavior During LLM Inference

Published: 01 Jan 2025, Last Modified: 24 Jun 2025 · ICEIC 2025 · CC BY-SA 4.0
Abstract: Recently, OpenAI released the ChatGPT o1-preview model, whose reasoning ability is comparable to ranking within the top 2,000 in the U.S. Math Olympiad, significantly surpassing average human linguistic ability. The Transformer-based model has become the standard architecture not only in language processing but also in various fields such as vision and speech. As research in these fields progresses, a comprehensive understanding of GPT has become increasingly necessary. Through a detailed analysis and mathematical treatment of the GPT-3 Transformer decoder architecture, we explore why it was designed this way and what effects this design has. We also trace the lineage of each component, examining the prior research from which it emerged, and propose methods to deepen the understanding of GPT. Finally, we examine the rationale behind the KV caching methodology and how it operates.
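As a concrete illustration of the mechanism the abstract refers to, the sketch below shows how KV caching works during autoregressive decoding: at each step only the newest token's query, key, and value are projected, and the key and value are appended to a cache so earlier tokens are never re-projected. This is not the paper's code; it is a minimal single-head NumPy sketch with random weights standing in for trained parameters, and the class and function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DecoderSelfAttentionWithKVCache:
    """Single-head causal self-attention with a KV cache (illustrative sketch).

    At each decoding step only the new token's query/key/value are computed;
    the key and value are appended to the cache, so attention over the full
    prefix costs one projection per step instead of re-projecting the whole
    sequence.
    """

    def __init__(self, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        self.d = d_model
        # Random projections stand in for trained weight matrices.
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = np.empty((0, d_model))  # grows to (seq_len, d_model)
        self.v_cache = np.empty((0, d_model))

    def step(self, x):
        """x: (d_model,) embedding of the newest token."""
        q = x @ self.Wq                  # query for the new token only
        k = x @ self.Wk
        v = x @ self.Wv
        self.k_cache = np.vstack([self.k_cache, k])  # append, never recompute
        self.v_cache = np.vstack([self.v_cache, v])
        # Attend over all cached positions; causality holds automatically
        # because the cache holds only past and current tokens.
        scores = self.k_cache @ q / np.sqrt(self.d)  # (seq_len,)
        weights = softmax(scores)
        return weights @ self.v_cache                # (d_model,)

# Decoding a toy 5-token sequence one step at a time:
attn = DecoderSelfAttentionWithKVCache(d_model=8)
embeddings = np.random.default_rng(1).standard_normal((5, 8))
for t, token_embedding in enumerate(embeddings):
    out = attn.step(token_embedding)
    print(f"step {t}: cache length = {attn.k_cache.shape[0]}, output shape = {out.shape}")
```

Under these assumptions the per-step cost is linear in the prefix length rather than quadratic in it, which is the rationale for KV caching that the paper examines.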