Rethinking Machine Learning Benchmarks in the Context of Professional Codes of Conduct

Published: 01 Jan 2024 · Last Modified: 18 Jun 2024 · CSLAW 2024 · CC BY-SA 4.0
Abstract: Benchmarking efforts for machine learning have often mimicked (or even explicitly used) professional licensing exams to assess capabilities in a given area, focusing primarily on accuracy as the metric of choice. However, this approach neglects a variety of essential skills required in professional settings. We propose that professional codes of conduct and rules can guide machine learning researchers in addressing potential gaps in benchmark construction. These guidelines frequently anticipate situations that professionals may encounter and must handle with care. A model may excel on an exam yet still fall short in critical scenarios that professional codes or rules deem unacceptable. To motivate this idea, we conduct a case study and comparative examination of machine translation in legal settings. We identify several areas where standard deployments and benchmarks fail to assess key requirements under professional rules. We suggest further refinements that would bring the two closer together, including requiring a measure of uncertainty so that models can opt out of translations they are uncertain about. We then share broader insights on constructing and deploying foundation models, particularly in critical domains like law and legal translation.
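One of the abstract's concrete suggestions is uncertainty-aware abstention: a translation system should decline to answer rather than emit an output it is not confident in. The sketch below is a minimal, illustrative example of such a gate, not the paper's implementation; the use of the mean token log-probability as the confidence signal, the `min_confidence` threshold, and the helper names (`gate_translation`, `TranslationResult`) are all assumptions made here for illustration.

```python
# Minimal sketch (illustrative assumptions, not the paper's method): gate a machine
# translation on a model confidence score and abstain when confidence is too low.
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranslationResult:
    text: Optional[str]   # None means the system abstained
    confidence: float     # length-normalized sequence confidence in [0, 1]
    abstained: bool


def gate_translation(translation: str,
                     token_logprobs: List[float],
                     min_confidence: float = 0.85) -> TranslationResult:
    """Return the translation only if the model's own confidence clears a threshold.

    `token_logprobs` are the per-token log-probabilities the MT system assigned to
    its own output; many neural MT toolkits can expose these at decoding time.
    """
    if not token_logprobs:
        return TranslationResult(text=None, confidence=0.0, abstained=True)
    # Geometric mean of token probabilities: a simple length-normalized confidence.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    if confidence < min_confidence:
        # Defer to a human translator instead of delivering an uncertain translation.
        return TranslationResult(text=None, confidence=confidence, abstained=True)
    return TranslationResult(text=translation, confidence=confidence, abstained=False)


# Example: a low-confidence output is withheld rather than silently delivered.
result = gate_translation("The witness invokes the privilege.",
                          token_logprobs=[-0.02, -0.35, -1.10, -0.60, -0.05])
print(result.abstained, round(result.confidence, 3))
```

In practice the confidence signal could be replaced by calibrated estimates or quality-estimation models; the point of the sketch is only that the deployment exposes an explicit abstain path, which accuracy-only benchmarks do not measure.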