CAR-LLM: Cloud Accelerator Recommender for Large Language Models

Published: 01 Jan 2024 · Last Modified: 30 Jul 2025 · HiPC 2024 · CC BY-SA 4.0
Abstract: Transformer-based Large Language Models (LLMs) have garnered significant attention due to their multi-modality and exceptional performance across diverse applications. This surge in popularity has spurred the development of numerous new LLMs and corresponding hardware solutions for efficient deployment. However, deploying LLMs on different accelerators for inference poses a significant challenge due to the vast search space involved, which spans the number of accelerator chips, the number of accelerator instances, and the choice of inference framework needed to meet stringent workload and latency constraints. In this paper, we introduce the Cloud Accelerator Recommender for Large Language Models (CAR-LLM), a framework designed to optimize the deployment of LLMs on the accelerators and hardware available across various cloud vendors. CAR-LLM aims to achieve maximum performance at minimal cost by recommending optimal deployment strategies. We outline a cost-effective experimental strategy and investigate the key parameters affecting LLM latency on specific hardware. Additionally, we develop a performance model to predict latency and throughput, improving deployment efficiency and decision-making for LLM applications.
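
The following is a minimal sketch, not the authors' implementation, of the kind of recommendation the abstract describes: search the configuration space (accelerator type, chips per instance, number of instances, inference framework) with a latency/throughput performance model and return the cheapest configuration that meets the workload and latency constraints. All names, prices, and the predictor formulas below are hypothetical placeholders.

```python
# Hypothetical sketch of a CAR-LLM-style configuration search.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    accelerator: str         # cloud accelerator type, e.g. "A100" or "H100"
    chips_per_instance: int  # tensor-parallel degree within one instance
    num_instances: int       # replicas serving the workload
    framework: str           # inference framework, e.g. "vLLM" or "TensorRT-LLM"

def predict_latency_ms(cfg: Config, batch_size: int) -> float:
    """Placeholder performance model: per-request latency in milliseconds."""
    base = {"A100": 40.0, "H100": 25.0}[cfg.accelerator]
    framework_factor = {"vLLM": 1.0, "TensorRT-LLM": 0.8}[cfg.framework]
    return base * framework_factor * batch_size / cfg.chips_per_instance

def predict_throughput_rps(cfg: Config, batch_size: int) -> float:
    """Placeholder performance model: sustained requests per second."""
    per_instance = batch_size / (predict_latency_ms(cfg, batch_size) / 1000.0)
    return per_instance * cfg.num_instances

def hourly_cost(cfg: Config) -> float:
    """Illustrative USD/hour per chip, scaled by total chips deployed."""
    price_per_chip = {"A100": 3.0, "H100": 5.0}[cfg.accelerator]
    return price_per_chip * cfg.chips_per_instance * cfg.num_instances

def recommend(latency_slo_ms: float, workload_rps: float, batch_size: int = 8):
    """Exhaustively search the space; return the cheapest feasible configuration."""
    candidates = (
        Config(acc, chips, n, fw)
        for acc, chips, n, fw in product(
            ["A100", "H100"], [1, 2, 4, 8], [1, 2, 4], ["vLLM", "TensorRT-LLM"]
        )
    )
    feasible = [
        c for c in candidates
        if predict_latency_ms(c, batch_size) <= latency_slo_ms
        and predict_throughput_rps(c, batch_size) >= workload_rps
    ]
    return min(feasible, key=hourly_cost) if feasible else None

if __name__ == "__main__":
    print(recommend(latency_slo_ms=100.0, workload_rps=200.0))
```

In this sketch the predictor functions stand in for the paper's learned performance model; in practice they would be fit from the cost-effective experiments the abstract mentions rather than hard-coded.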