Benchmarking Efficiency Techniques in GenAI Foundation Models Using an Elo-Based Performance Evaluation Framework

Published: 21 May 2025, Last Modified: 17 Jun 2025
MLArchSys 2025 Poster, CC BY 4.0
Presentation: In-Person
Keywords: benchmarking, performance evaluation, Elo scoring, preference-based evaluation, LLM-as-a-judge, model ranking, model quality assessment, foundation models, efficiency techniques, quantization, scaling laws, GPU-based evaluation, model compression, language models, foundation model optimization
Presenter Full Name: Summer Deng
TL;DR: This paper proposes a scalable Elo-based benchmarking framework that uses LLM-as-a-judge evaluations to systematically measure the quality impact of efficiency techniques like quantization on foundation models.
Presenter Email: summerdeng@meta.com
Abstract: Evaluating the effectiveness of efficiency techniques in foundation models—such as quantization, pruning, and distillation—requires a rigorous, standardized methodology for determining model quality parity. Notably, quantization poses a unique challenge due to its non-obvious impacts on model quality stemming from alterations to numerical representations, which are not captured by established scaling laws. In this work, we address this critical gap by proposing an Elo-based scoring framework that quantifies the relative performance of optimized models through automated competitive matchups. By leveraging publicly available datasets such as LMSYS chat, which encompass diverse language-based real-world user queries, our method generates consistent and interpretable rankings of model variants using LLM-based preference judgments. This approach enables quality assessments across various tasks without relying on task-specific ground truths. Backed by over 2,000 GPU hours on H100 infrastructure, our framework offers a scalable, reproducible evaluation protocol that delivers nuanced insights into the trade-offs of model efficiency techniques, while taking a step toward standardizing performance parity assessment across the machine learning community.
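Illustrative sketch (not from the paper): to make the Elo-based matchup idea concrete, the short Python snippet below shows how pairwise LLM-judge verdicts could be folded into ratings. The function names, K-factor, starting ratings, and model labels are assumptions for illustration only and do not reflect the framework's actual implementation.

# Minimal sketch of Elo updates driven by pairwise judge verdicts.
# All names (expected_score, update_elo, k=32) are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    # outcome_a: 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 for a tie.
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - exp_a))
    return new_a, new_b

# Hypothetical example: a BF16 baseline vs. a quantized variant over judged prompts.
ratings = {"baseline-bf16": 1000.0, "quantized-int4": 1000.0}
judge_verdicts = [1.0, 0.5, 0.0, 1.0, 0.5]  # per-prompt judge preferences (made up)
for outcome in judge_verdicts:
    ratings["baseline-bf16"], ratings["quantized-int4"] = update_elo(
        ratings["baseline-bf16"], ratings["quantized-int4"], outcome
    )
print(ratings)

Run over many prompts and model pairs, updates of this form yield the relative rankings described in the abstract without requiring task-specific ground truths.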
Presenter Bio: Summer Deng focuses on AI system co-design for key AI workloads like recommender systems and language models. She explores numerical optimizations from 16 bits down to 4 bits, enhancing training and inference efficiency. Her work integrates these techniques with ML infrastructure, compilers, and hardware like GPUs and ASICs for improved performance and stability.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link Poster: https://youtu.be/2eWyLn5_lk8
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Poster: Yes
Workshop Registration: Yes, the presenter has registered for the workshop.
Submission Number: 6