Abstract: The emergence of 1-bit large language models (LLMs) has sparked significant interest, promising substantial efficiency gains through extreme quantization. However, these benefits are inherently limited by the portion of the model that can be quantized. Specifically, 1-bit quantization typically targets only the projection layers, while the attention mechanisms remain in higher precision, potentially creating significant throughput bottlenecks. To address this, we present an adaptation of Amdahl's Law tailored specifically to LLMs, offering a quantitative framework for understanding the throughput limits of extreme quantization. Our analysis reveals how improvements in quantization can deliver substantial throughput gains, but only to the extent that they address critical throughput-constrained sections of the model. Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have uploaded the revised manuscript, in which the changes are highlighted in blue, as described below:
**For comment 1 of reviewer mZtb, we added the following text in Section 4.1:**
“Our work assumes W1A8, W2A8 quantization as the baseline quantization schemes throughout the analysis. Specifically, projection weights are represented using binary or ternary quantization (i.e., 1-2 bits for weights), while activations are maintained at 8-bit integer precision (INT8). This reflects the implementation principles of systems like BitNet Wang et al. (2023) and BitNet 1.58 Ma et al. (2024), which preserve accuracy by keeping activations in moderate precision while aggressively quantizing weights.”
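To make the quoted precision split concrete, the following is a minimal NumPy sketch of what a W2A8-style projection could look like. The function names, the per-tensor absmean/absmax scaling, and the layer shapes are our illustrative assumptions, not the paper's or BitNet's exact implementation.

```python
import numpy as np

def ternary_quantize_weights(W, eps=1e-8):
    # W2A8-style weight quantization (illustrative): map weights to
    # {-1, 0, +1} with a single per-tensor absmean scale.
    scale = np.abs(W).mean() + eps
    Wq = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return Wq, scale

def int8_quantize_activations(X, eps=1e-8):
    # Activations stay at 8-bit integer precision via per-tensor absmax scaling.
    scale = (np.abs(X).max() + eps) / 127.0
    Xq = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return Xq, scale

# A projection layer then reduces to an integer MatMul plus one rescale.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)  # projection weights
X = rng.standard_normal((16, 4096)).astype(np.float32)    # a small activation batch
Wq, w_scale = ternary_quantize_weights(W)
Xq, x_scale = int8_quantize_activations(X)
Y = (Xq.astype(np.int32) @ Wq.astype(np.int32).T) * (w_scale * x_scale)
```

Note that the attention-score MatMuls (QK^T and the attention-weighted V) have no static weight operand to quantize in this way, which is one reason those sections are typically kept at higher precision, as the abstract notes.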
**For comment 2 of reviewer mZtb, we added the following text in Section 3.3:**
“…Nonetheless, to provide context, a recent processing-in-memory (PIM)-based hardware design Malekar et al. (2025) targeting the acceleration of projection layers of 1-bit LLMs demonstrated up to an 80× increase in throughput.”
**For comment 1 of reviewers VFaP and dcc9, we added the following text in Section 3.3 (an illustrative sketch of this formulation follows after this response):**
“...This formulation enables practitioners to reason about speedup potential (S_partial) as a function of hardware-specific parameters (e.g., systolic array size) and model hyperparameters such as embedding size d, context length l, feedforward dimension d_FF, and number of attention heads h.”
**And the following text in Section 5:**
“While the observation that linear projection layers increasingly dominate compute at larger model scales may be expected from dimensional analysis, our contribution lies in quantifying this shift precisely across a wide range of real-world LLM configurations (OPT, GPT, LLaMA) using cycle-accurate simulations on custom-designed TPU architectures. This approach moves beyond theoretical speculation and establishes an empirical foundation for throughput bottlenecks under extreme quantization regimes.”
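As a complement to the Section 3.3 formulation quoted above, and purely as a back-of-the-envelope illustration rather than the paper's cycle-accurate accounting, the dependence of the projection-dominated compute fraction F on d, l, and d_FF can be sketched with a simple per-layer FLOP count. The function below and its constants are our own simplification.

```python
def projection_flop_fraction(d, l, d_ff, h):
    # Rough per-layer MatMul FLOP split: projection layers (Q/K/V/output
    # projections plus the two feed-forward MatMuls) versus the attention-head
    # MatMuls (QK^T and the attention-weighted V). The head count h cancels in
    # a pure FLOP count, but it sets the per-head MatMul dimensions that drive
    # the utilization effects discussed for comment 3 below.
    proj_flops = 2 * l * (4 * d * d + 2 * d * d_ff)
    attn_flops = 4 * l * l * d
    return proj_flops / (proj_flops + attn_flops)

# Example with LLaMA-7B-like hyperparameters at a 4k context length:
F_est = projection_flop_fraction(d=4096, l=4096, d_ff=11008, h=32)
print(f"estimated compute fraction in projection layers: F ≈ {F_est:.3f}")
```

With such an estimate of F in hand, the resulting throughput ceiling follows from the Amdahl-style bound sketched after the last response below.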
**For comment 2 of reviewers VFaP and dcc9, we added the following in Section 5:**
“The proposed research opens up several promising directions for future work, including: (i) expanding the design space to encompass emerging model variants, such as multi-query attention, Mixture-of-Experts (MoE), and linear attention, and studying their impact on F, the overall bottleneck profile, and optimal hardware allocation; and (ii) integrating memory-efficient attention mechanisms (e.g., FlashAttention Dao et al. (2022), PagedAttention Kwon et al. (2023)) and refining the analysis within hardware-specific implementations to evaluate their effects on memory access patterns.”
**For comment 3 of reviewers VFaP and dcc9, we added the following text in Section 4.3, along with a new Appendix E that includes additional simulation results and an analysis of the impact of attention head MatMul dimensions on the compute and memory balance:**
“The difference between the cloud and edge setups stems from their scaling characteristics. In cloud TPUs (256×256 arrays), large MatMuls in projection layers are efficiently handled due to ample compute and memory bandwidth, leading to pronounced compute savings for MatMul-free sections. In contrast, edge TPUs (32×32 arrays) suffer degraded efficiency on large projections due to limited parallelism and SRAM, capping compute benefits despite similar memory access trends.
Additionally, the cloud setup faces underutilization of processing elements when executing smaller MatMuls in attention heads, whereas the edge setup rarely encounters this issue. This helps explain the observed difference in the compute cycle fraction (F) between edge and cloud environments, even though the memory access patterns remain largely similar across both. We include additional experiments in Appendix E, where the embedding dimension is fixed (d = 4096) and the number of attention heads is varied, to isolate the impact of attention head MatMul dimensions on the compute and memory balance.”
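To give a self-contained, deliberately crude intuition for the underutilization effect described above, the sketch below compares the tiling utilization of an A×A systolic array for a large projection MatMul versus a small per-head attention MatMul. The weight-stationary mapping assumption is ours, and the estimate ignores pipeline fill/drain and memory behavior, which the paper's cycle-accurate simulations capture.

```python
import math

def tile_utilization(M, K, N, array_dim):
    # Utilization estimate for an (M x K) @ (K x N) MatMul on an
    # array_dim x array_dim weight-stationary systolic array: the K x N
    # operand is padded to tile multiples of array_dim while M streams
    # through, so useful MACs are divided by scheduled MACs.
    padded_k = math.ceil(K / array_dim) * array_dim
    padded_n = math.ceil(N / array_dim) * array_dim
    return (M * K * N) / (M * padded_k * padded_n)

l, d, h = 2048, 4096, 32           # context length, embedding size, heads
d_head = d // h                    # per-head width (128 here)
for A in (32, 256):                # edge-like vs cloud-like array sizes
    proj_util = tile_utilization(l, d, d, A)        # a d x d projection
    head_util = tile_utilization(l, d_head, l, A)   # one per-head QK^T MatMul
    print(f"{A}x{A} array: projection ≈ {proj_util:.2f}, attention head ≈ {head_util:.2f}")
```

Under these toy assumptions, the 128-wide per-head operand fills only half of a 256×256 array while both workloads fill a 32×32 array, mirroring the qualitative gap in F between the cloud and edge setups described in the excerpt.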
**For comment 4 of reviewer VFaP, we added the following text in Appendix D:**
“The GPU latency analysis is intended to contextualize the TPU-centric findings and highlight that even on GPU-optimized transformer kernels the projection layers remain the dominant latency contributor, especially in large models. For instance, Figure 12(a) shows that in LLaMA and OPT models, projection layers account for more than 95% of total latency, reinforcing our claim that quantization of projection layers yields substantial benefits across hardware backends.”
**For comment 4 of reviewer dcc9, we added the following text to the last paragraph of Section 3.3, along with a new Appendix F:**
“In Appendix F, we provide a more generalized variation of the proposed Amdahl’s Law of LLM to enable a more general and hardware-agnostic analysis.”
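The generalized, hardware-aware variant is given in the paper's Appendix F. Purely as a self-contained reminder of the underlying idea, the classical Amdahl form of the bound is sketched below; this snippet is our illustration, not the paper's formulation, and it can be combined with the FLOP-based estimate of F from the earlier sketch.

```python
def amdahl_speedup(F, s_partial):
    # Classical Amdahl bound: a fraction F of execution is accelerated by
    # s_partial (the quantized projection layers), while the remaining 1 - F
    # (e.g., higher-precision attention) is unchanged.
    return 1.0 / ((1.0 - F) + F / s_partial)

# Even an 80x faster projection path (the PIM figure cited in Section 3.3)
# is capped by the un-accelerated remainder:
for F in (0.90, 0.99):
    capped = amdahl_speedup(F, 80)
    ceiling = 1.0 / (1.0 - F)   # limit as s_partial -> infinity
    print(f"F={F:.2f}: end-to-end speedup ≈ {capped:.1f}x, hard ceiling {ceiling:.0f}x")
```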
Assigned Action Editor: ~Aaron_Klein1
Submission Number: 4891