Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We propose a local stability measure to preemptively quantify prediction consistency in Tabular LLMs under fine-tuning multiplicity, without retraining multiple models.
Abstract: Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of *fine-tuning multiplicity*, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise from variations in the training process, e.g., the random seed, weight initialization, or minor changes to the training data, raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction's consistency by analyzing (via sampling) the model's local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.
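A minimal sketch of the idea behind such a neighborhood-sampling stability score, in Python. The function name `local_stability`, the Gaussian perturbation of the embedding, and the agreement-rate aggregation are illustrative assumptions for this sketch, not the paper's exact definition or guarantee construction.

```python
import numpy as np

def local_stability(predict_proba, embedding, n_samples=100, sigma=0.1, seed=None):
    """Estimate how stable a prediction is by sampling the model's behavior
    in a small neighborhood of the input's embedding.

    predict_proba : callable mapping an embedding vector to class probabilities
                    (assumed to exist; stands in for the fine-tuned Tabular LLM).
    embedding     : 1-D numpy array, the input's embedding.
    sigma         : neighborhood scale, a hyperparameter of this sketch.
    """
    rng = np.random.default_rng(seed)
    base_class = int(np.argmax(predict_proba(embedding)))

    agreements = []
    for _ in range(n_samples):
        # Sample a nearby point in embedding space.
        neighbor = embedding + sigma * rng.standard_normal(embedding.shape)
        # Record whether the perturbed prediction agrees with the original one.
        agreements.append(float(np.argmax(predict_proba(neighbor)) == base_class))

    # Fraction of sampled neighbors that keep the same predicted class:
    # values near 1 suggest a locally stable prediction.
    return float(np.mean(agreements))
```

In this sketch, a score close to 1 would flag a prediction as likely to remain consistent across retrained models, while a low score would flag it as potentially arbitrary; the paper's actual measure and its probabilistic guarantee may aggregate the local samples differently.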
Lay Summary: Tabular large language models (TabLLMs) are increasingly used in high-stakes areas like finance, education, and healthcare, where we expect predictions to be reliable. But surprisingly, even small changes during training, such as a different random seed, a different weight initialization, or minor changes to the training data, can lead to different results for the same input. This is a serious problem when important decisions are at stake. Our research shows that this unpredictability, which we call fine-tuning multiplicity, is common in TabLLMs. We introduce a novel method to measure how consistent a model’s prediction is under fine-tuning multiplicity without needing to retrain the model multiple times. We leverage the model’s behavior in the “neighborhood” around an input, using our local stability measure, to estimate how likely a prediction is to stay consistent. Our approach can help identify which predictions are trustworthy and which might change if the model were retrained. It’s a step toward more reliable AI in high-stakes settings.
Primary Area: Social Aspects->Robustness
Keywords: Tabular LLM, Model Multiplicity, Few-shot Learning, Prediction Consistency
Submission Number: 12466