WeInfer: Unleashing the Power of WebGPU on LLM Inference in Web Browsers

Published: 29 Jan 2025, Last Modified: 29 Jan 2025
Venue: WWW 2025 Poster
License: CC BY 4.0
Track: Systems and infrastructure for Web, mobile, and WoT
Keywords: Large Language Model, WebGPU, Inference Acceleration
TL;DR: We identify that existing Web-based LLM inference frameworks suffer performance degradation due to inefficient use of WebGPU. We propose WeInfer, a novel framework that unleashes the full capabilities of WebGPU.
Abstract: Web-based large language model (LLM) inference has garnered significant attention from both academia and industry due to its potential to combine the benefits of on-device computation with the accessibility and portability of Web applications. The advent of WebGPU, a modern browser API that enables Web applications to access and utilize a device's GPU, has opened up new possibilities for GPU-accelerated LLM inference within browsers. Several frameworks have been developed to support Web-based LLM inference with WebGPU. However, our experiments reveal that these frameworks exhibit inefficiencies in GPU utilization, degrading LLM inference speed. These inefficiencies primarily arise from underutilizing the full capabilities of WebGPU, particularly in resource management and execution synchronization. To address these limitations, we present WeInfer, an efficient Web-based LLM inference framework specifically designed to unleash the power of WebGPU. WeInfer incorporates two key innovations: 1) buffer reuse strategies that reduce the overhead associated with resource preparation by optimizing the lifecycle management of WebGPU buffers, and 2) an asynchronous pipeline that decouples resource preparation from GPU execution, enabling parallelized computation and deferred result fetching to improve overall efficiency. We conduct extensive evaluations across 9 different LLMs and 5 heterogeneous devices, covering a broad spectrum of model architectures and hardware configurations. The experimental results demonstrate that WeInfer delivers substantial improvements in decoding speed, achieving up to a $3.76\times$ performance boost compared with WebLLM, the state-of-the-art Web-based LLM inference framework.
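To make the two ideas in the abstract concrete, below is a minimal TypeScript sketch built on the standard WebGPU browser API. It is not WeInfer's actual implementation: the `BufferPool` class, the `decodeStep` function, and their parameters are hypothetical illustrations of (1) reusing WebGPU buffers across decode steps instead of re-creating them, and (2) submitting GPU work and deferring the result readback rather than blocking synchronously after every dispatch. Pipeline and bind-group setup are omitted.

```typescript
// Illustrative sketch only, assuming ambient WebGPU typings (@webgpu/types).

class BufferPool {
  // Free buffers keyed by "size:usage" so a decode step can reuse an
  // identically shaped buffer instead of paying creation overhead again.
  private free = new Map<string, GPUBuffer[]>();

  constructor(private device: GPUDevice) {}

  acquire(size: number, usage: GPUBufferUsageFlags): GPUBuffer {
    const bucket = this.free.get(`${size}:${usage}`);
    return bucket?.pop() ?? this.device.createBuffer({ size, usage });
  }

  release(buffer: GPUBuffer): void {
    // Return the buffer to the pool rather than destroying it.
    const key = `${buffer.size}:${buffer.usage}`;
    const bucket = this.free.get(key) ?? [];
    bucket.push(buffer);
    this.free.set(key, bucket);
  }
}

// One decode step: encode and submit compute work, then fetch the result
// lazily via mapAsync. CPU-side preparation of the next step can overlap
// with the GPU work instead of blocking right after submission.
async function decodeStep(
  device: GPUDevice,
  pipeline: GPUComputePipeline,
  bindGroup: GPUBindGroup,
  logitsBuffer: GPUBuffer, // STORAGE | COPY_SRC, written by the kernel
  readback: GPUBuffer,     // MAP_READ | COPY_DST, reused across steps
  workgroups: number,
): Promise<Float32Array> {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups);
  pass.end();
  encoder.copyBufferToBuffer(logitsBuffer, 0, readback, 0, readback.size);
  device.queue.submit([encoder.finish()]);

  // Deferred result fetching: resolve only when the submitted work is done.
  await readback.mapAsync(GPUMapMode.READ);
  const logits = new Float32Array(readback.getMappedRange().slice(0));
  readback.unmap();
  return logits;
}
```

In this sketch the readback buffer and kernel output buffers would be drawn from the pool once and reused for every generated token, which is the spirit of the buffer lifecycle optimization the abstract describes; the actual framework may manage buffers and scheduling quite differently.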
Submission Number: 1041