Keywords: LLM, rating, instruct-tuning, gender, methodology
TL;DR: For the same task, an LLM may be prompted with explicit instructions or more tacitly, with different outcomes.
Abstract: This study compares different ways of using LLMs, roughly as research assistants vs. as naive test subjects. We compare models and prompting styles on predicting existing word ratings of gender association, contrasting explicit prompts such as "On a scale from 1-7, how masculine is the word 'plumber'?" with implicit ones such as "The main character of this story ... plumber ... . Initially, [he/she]", where the relative predicted probabilities of the gendered pronouns serve as the measured variable. We find that explicit prompting gives the strongest correlation with human ratings: the Llama 70B base and instruction-tuned models perform alike and slightly better than GPT-4o, Llama 8B base is considerably worse, and Llama 8B instruct falls in between. Prompting for comparative rather than absolute (scalar) judgments can help smaller models.
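To illustrate the implicit (cloze-style) measurement described above, here is a minimal sketch, not the authors' code: it assumes a Hugging Face causal LM, a particular prompt wording, and the leading-space tokens " he" / " she" as the pronoun candidates; model name and function names are placeholders.

```python
# Illustrative sketch of the two prompting styles; model, prompt wording,
# and pronoun tokenization are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def implicit_gender_score(word: str) -> float:
    """Relative probability of ' he' vs. ' she' continuing an implicit story prompt."""
    prompt = f"The main character of this story is a {word}. Initially,"
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    p_he = probs[tok.encode(" he", add_special_tokens=False)[0]].item()
    p_she = probs[tok.encode(" she", add_special_tokens=False)[0]].item()
    return p_he / (p_he + p_she)  # 1.0 = fully masculine, 0.0 = fully feminine

def explicit_gender_prompt(word: str) -> str:
    """Explicit rating prompt; the model's generated number is parsed downstream."""
    return f"On a scale from 1-7, how masculine is the word '{word}'?"

print(implicit_gender_score("plumber"))
```

The implicit score is a ratio over only the two pronoun probabilities, so it is comparable across words even when the model spreads most of its next-token mass over other continuations.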
Submission Number: 58