Abstract: Pre-trained large language models (LLMs) can now be easily
adapted for specific business purposes using custom prompts or fine-tuning.
These customizations are often iteratively re-engineered to improve
some aspect of performance, but after each change businesses want to
ensure that there has been no negative impact on the system’s behavior
around such critical issues as bias. Prior methods of benchmarking bias
use techniques such as word masking and multiple-choice questions to
assess bias at scale, but these do not capture all of the nuanced types of
bias that can occur in free-response answers, the type of output typically
generated by LLM systems. In this paper, we identify several kinds
of nuanced bias in free text that cannot be similarly identified by
multiple-choice tests. We describe these as confidence bias, implied bias,
inclusion bias, and erasure bias. We present a semi-automated pipeline for detecting
these types of bias by first eliminating answers that can be automatically
classified as unbiased and then co-evaluating name-reversed pairs using
crowd workers. We believe that the nuanced classifications our method
generates can be used to give better feedback to LLMs, especially as
LLM reasoning capabilities become more advanced.
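As a rough illustration of the name-reversal idea (not the paper's actual pipeline), the following Python sketch constructs a name-swapped counterpart of each prompt, automatically filters out pairs whose answers are identical up to the swap, and flags the remainder for human co-evaluation; the prompts, names, and the answer() stub are hypothetical.

```python
# Minimal sketch of name-reversed pair construction and automatic filtering.
# All names, prompts, and the answer() stub are hypothetical illustrations.

def swap_names(text: str, name_a: str, name_b: str) -> str:
    """Swap every occurrence of name_a and name_b in the text."""
    placeholder = "\x00"  # temporary token so the two replacements don't collide
    return (text.replace(name_a, placeholder)
                .replace(name_b, name_a)
                .replace(placeholder, name_b))

def answer(prompt: str) -> str:
    """Stand-in for a call to the LLM system under evaluation."""
    return "..."  # hypothetical model response

prompts = [
    "Alice and Bob both applied for the loan. Who is more likely to repay it?",
]

pairs_for_review = []
for prompt in prompts:
    reversed_prompt = swap_names(prompt, "Alice", "Bob")
    a1, a2 = answer(prompt), answer(reversed_prompt)
    # Answers that are identical up to the name swap are treated as unbiased
    # and eliminated automatically; the remaining pairs go to crowd workers.
    if swap_names(a1, "Alice", "Bob") != a2:
        pairs_for_review.append((prompt, reversed_prompt, a1, a2))

print(f"{len(pairs_for_review)} pairs flagged for human co-evaluation")
```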