ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models
RTML Workshop 2023
Kigali, Rwanda May 05 2023 https://rtml-iclr2023.github.io/ yu.cheng@microsoft.com
Please see the venue website for more information.
Submission Start: Feb 24 2023 12:00AM UTC-0, Abstract Registration: Apr 01 2023 11:00AM UTC-0, End: Apr 01 2023 11:00AM UTC-0
-
Can stochastic weight averaging improve generalization in private learning?
- Keywords: differential privacy, stochastic weight averaging, generalization
- TL;DR: Stochastic weight averaging improves generalization, accuracy, and stability in private learning
- Abstract: We investigate stochastic weight averaging (SWA) for private learning in the context of generalization and model performance. Differentially private (DP) optimizers are known to suffer from reduced performance and high variance in comparison to non-private learning. However, the generalization properties of DP optimizers have not been studied much, in particular for large-scale machine learning models. SWA is a variant of stochastic gradient descent (SGD) which averages the weights along the SGD trajectory. We consider a DP adaptation of SWA (DP-SWA) which incurs no additional privacy cost and has little computational overhead. For quadratic objective functions, we show that DP-SWA converges to the optimum at the same rate as non-private SGD, which implies convergence to zero for the excess risk. For non-convex objective functions, we observe across multiple experiments on standard benchmark datasets that averaging model weights improves generalization and model accuracy while reducing performance variance.
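Because weight averaging only post-processes iterates that the DP optimizer has already privatized, it adds no privacy cost. A minimal PyTorch-style sketch of this idea is given below; the helper names, burn-in step, and update frequency are illustrative assumptions rather than the paper's exact DP-SWA procedure.

```python
# Minimal sketch: keep a running average of model weights produced by a DP optimizer.
# Averaging is post-processing of already-privatized iterates, so it adds no privacy cost.
import copy
import torch

def train_with_swa(model, dp_optimizer, loader, loss_fn, swa_start=50):
    swa_model = copy.deepcopy(model)   # holds the running weight average
    n_averaged = 0
    for step, (x, y) in enumerate(loader):
        dp_optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        dp_optimizer.step()            # gradient clipping and noise happen inside the DP optimizer
        if step >= swa_start:          # average only after a burn-in phase
            n_averaged += 1
            with torch.no_grad():
                for p_avg, p in zip(swa_model.parameters(), model.parameters()):
                    p_avg.mul_(n_averaged - 1).add_(p).div_(n_averaged)
    return swa_model
```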
-
Examining LLM's Awareness of the United Nations Sustainable Development Goals (SDGs)
- Keywords: LLM, UN SDG, ChatGPT
- TL;DR: Used ChatGPT to examine 8 LLMs on true/false statements and compared their performance against human-written statements
- Abstract: Utilization of Large Language Models (LLMs) is rapidly growing in diverse domains, and each LLM may show different performance across different topics. Amidst this progress, biases and other ethical concerns surrounding LLMs have raised questions regarding trust and reliability, thereby necessitating human verification and audit. In this study, we empirically investigate six important topics of the United Nations Sustainable Development Goals (UN SDG) by utilizing the ChatGPT LLM as a facilitator for generating the statements needed to evaluate eight different LLMs. We also compare the performance of these LLMs on human-written statements and questions. In addition, we study the tendency of the eight LLMs considered to produce true and false statements. Although the LLMs show comparable performance on ChatGPT-generated and human-written input for relatively common issues, they are not sophisticated enough in understanding nuanced and advanced issues that demand critical and holistic introspection.
-
On the Robustness of Diffusion Inversion in Image Manipulation
- Abstract: Text-guided image editing is a rapidly growing field due to the development of large diffusion models. In this work, we present an effective approach to address the key step of real image editing, known as "inversion", which involves finding the initial noise vector that reconstructs the input image when conditioned on a text prompt. Existing work on conditional inversion is often unstable and inaccurate, leading to distorted image manipulation. To address these challenges, our method starts by analyzing the inconsistent assumptions and accumulative errors that contribute to the ill-posedness of mathematical inverse problems. We then introduce learnable latent variables as a bias correction to approximate an invertible and bijective inversion. We perform latent trajectory optimization with a prior to fully invert the image by optimizing the bias correction on the unconditional text prompt and initial noise vector. Our method is based on the publicly available Stable Diffusion model and is extensively evaluated on a variety of images and editing prompts, demonstrating high accuracy, robustness, and quality compared to state-of-the-art baseline approaches.
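For context, the sketch below shows a schematic DDIM-style inversion loop with an optional learnable per-step correction term standing in for the bias correction described above; the scheduler handling and the form of the correction are assumptions for illustration, not the paper's exact method.

```python
import torch

def ddim_invert(x0, eps_model, alphas_cumprod, cond, bias=None):
    # alphas_cumprod is assumed to be a 1-D tensor ordered from the clean end (near 1)
    # to the noisy end; eps_model(x, t, cond) predicts the noise at step t.
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t, cond)
        if bias is not None:
            eps = eps + bias[t]                  # learnable per-step correction term
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # one step toward noise
    return x                                     # approximate initial noise vector
```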
-
Long-Term Fairness with Unknown Dynamics
- Keywords: Long-term Fairness, Dynamics, Reinforcement Learning
- TL;DR: Desirable social outcomes that are in conflict with myopic optimization may be realized using a reinforcement learning formalism of long-term fairness.
- Abstract: As populations adapt to algorithmic prediction, machine learning can myopically reinforce social inequalities or dynamically seek equitable outcomes. In this paper, we formalize prediction subject to long-term fairness as a constrained online reinforcement learning problem. This formulation can accommodate dynamical control objectives, such as inducing equitable population adaptations, that cannot be expressed by static formulations of fairness. By adapting recent work in online learning, we provide the first algorithm that guarantees simultaneous, probabilistic bounds on cumulative loss and cumulative violations of fairness (defined as statistical regularities between demographic groups) in this setting. We compare this algorithm to an off-the-shelf, deep reinforcement learning algorithm that lacks such safety guarantees, and to a repeatedly retrained, myopic classifier, as a baseline. We demonstrate that a reinforcement learning framework for long-term fairness allows algorithms to adapt to unknown dynamics and sacrifice short-term profit or fairness to drive a classifier-population system towards more desirable equilibria. Our experiments model human populations according to evolutionary game theory, using real-world data to set an initial state.
-
IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters
- Keywords: Cross Entropy, Mean Absolute Error, Underfitting, Overfitting, Label Noise, Robustness, Example Weighting
- TL;DR: We study robust deep learning against abnormal training data from the perspective of example weighting built in empirical loss functions, i.e., gradient magnitude with respect to logits.
- Abstract: In this work, we study robust deep learning against abnormal training data from the perspective of example weighting built into empirical loss functions, i.e., gradient magnitude with respect to logits, an angle that has not been thoroughly studied so far. Consequently, we have two key findings: (1) Mean Absolute Error (MAE) Does Not Treat Examples Equally. We present new observations and insightful analysis about MAE, which is theoretically proved to be noise-robust. First, we reveal its underfitting problem in practice. Second, we show that MAE's noise-robustness comes from emphasising uncertain examples rather than treating training samples equally, as claimed in prior work. (2) The Variance of Gradient Magnitude Matters. We propose an effective and simple solution to enhance MAE's fitting ability while preserving its noise-robustness. Without changing MAE's overall weighting scheme, i.e., which examples get higher weights, we simply change its weighting variance non-linearly so that the impact ratio between two examples is adjusted. Our solution is termed Improved MAE (IMAE). We demonstrate IMAE's effectiveness through extensive experiments: image classification under clean labels, synthetic label noise, and real-world unknown noise.
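The example-weighting claim can be checked with a few lines of arithmetic: for a softmax classifier, the gradient magnitude with respect to the ground-truth logit scales as 1 − p under CE but as p(1 − p) under MAE, so MAE gives almost no weight to examples the model is very uncertain about. The exponential reweighting at the end is only a hedged stand-in for "changing the weighting variance non-linearly"; the exact IMAE transform and temperature are not taken from the paper.

```python
import numpy as np

p = np.linspace(0.01, 0.99, 5)        # predicted probability of the labeled class
ce_weight = 1.0 - p                    # CE: largest weight on the hardest examples
mae_weight = p * (1.0 - p)             # MAE: tiny weight when p is near 0 or 1
T = 8.0                                # illustrative temperature, not the paper's value
imae_like = np.exp(T * mae_weight)     # sharpen differences between example weights
print(np.round(ce_weight, 3), np.round(mae_weight, 3), np.round(imae_like, 2))
```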
-
Performative Federated Learning
- Keywords: robustness, federated learning, model-dependent data shifts, performative prediction, strategic classification and regression
- TL;DR: We propose the performative FL framework and the performative FedAvg (P-FedAvg) algorithm to address model-dependent data distribution shifts in FL. P-FedAvg converges at a rate of O(1/T) under the same assumptions as previous work.
- Abstract: We consider a federated learning (FL) system comprising multiple clients and a server, wherein the clients collaborate to learn a common decision model from their distributed data. Unlike the conventional FL framework, we consider the scenario where the clients' data distributions change with the deployed decision model. In this work, we propose a performative federated learning framework that formalizes model-dependent distribution shifts by leveraging the concept of distribution shift mappings from the performative prediction literature. We introduce necessary and sufficient conditions for the existence of a unique performative stable solution and characterize its distance to the performative optimal solution. Under such conditions, we propose the performative FedAvg algorithm and show that it converges to the performative stable solution at a rate of O(1/T) under both full and partial participation schemes. In addition, we show how the clients' heterogeneity influences the convergence, both theoretically and through numerical results.
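A minimal sketch of a FedAvg round under model-dependent data is shown below; `sample_data` and `local_grad` are hypothetical helpers used only to illustrate how the deployed model enters the clients' data distributions, not the paper's implementation.

```python
import numpy as np

def p_fedavg(theta0, clients, rounds=100, local_steps=5, lr=0.1):
    theta = theta0.copy()                          # theta0 is a NumPy parameter vector
    for _ in range(rounds):
        local_models = []
        for client in clients:
            X, y = client.sample_data(theta)       # data distribution shifts with theta
            w = theta.copy()
            for _ in range(local_steps):
                w -= lr * client.local_grad(w, X, y)
            local_models.append(w)
        theta = np.mean(local_models, axis=0)      # server averages the client models
    return theta
```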
-
EXPLAINTABLE: EXPLAINING LARGE SCALE MODELS APPLIED TO TABULAR DATA
- Keywords: Interpretability, Activation Maximization, Tabular Deep Learning, Large Scale Models
- TL;DR: Adapting activation maximisation methods, a new feature selection method is proposed. Results on one of the largest-scale tabular NNs are presented, along with a suggestion on how to apply the method to LLMs
- Abstract: Interpretability of Deep Neural Networks (DNNs) is crucial when designing reliable and trustworthy models. However, there is a lack of interpretability methods for DNNs applied to tabular data. In this short paper, we propose a novel feature importance method for any tabular deep learning model based on activation maximization. This makes it possible to discard features that are uninformative for the network. We present some preliminary results on one of the largest-scale tabular networks. In addition, we suggest how it can be applied to Large Language Models (LLMs) to systematically study their biases too.
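A hedged sketch of how activation maximization might be adapted to tabular inputs follows: a synthetic input is optimized to maximize a target logit, and features are ranked by how far they move from a neutral starting point. The ranking rule and hyperparameters are assumptions for illustration, not the paper's exact criterion.

```python
import torch

def feature_importance_by_activation_max(model, n_features, target_class,
                                          steps=200, lr=0.05):
    x = torch.zeros(1, n_features, requires_grad=True)   # start from a neutral input
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model(x)[0, target_class]                 # maximize the target logit
        loss.backward()
        opt.step()
    # Features that moved furthest from the neutral start are treated as more informative.
    return x.detach().abs().squeeze(0)
```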
-
On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research
- Keywords: toxicity, fairness, nlp
- Abstract: Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, black-box commercially available APIs for detecting toxicity, such as the Perspective API, are not static, but frequently retrained to address any unattended weaknesses and biases. We evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and call for a more structured approach to evaluating toxicity over time. Code and data are available at https://github.com/for-ai/black-box-api-challenges.
- Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/on-the-challenges-of-using-black-box-apis-for/code)
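For reference, the kind of black-box scoring call whose drift the paper studies looks roughly like the following request to the Perspective API; the attribute choice and key handling are illustrative.

```python
import requests

def perspective_toxicity(text, api_key):
    """Return the Perspective API TOXICITY summary score for a piece of text."""
    url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           f"comments:analyze?key={api_key}")
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},   # other attributes can be requested too
    }
    resp = requests.post(url, json=body, timeout=10)
    resp.raise_for_status()
    scores = resp.json()
    return scores["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Because the model behind this endpoint is periodically retrained, the same text can receive different scores over time, which is exactly the reproducibility concern raised above.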
-
Intriguing Properties of Visual-Language Model Explanations
- Keywords: Explainability, TrustworthyML, Zero-shot, Fine-tune, Visual-Language Model
- TL;DR: An empirical study to establish the trustworthy properties of explanations generated for visual-language models used in zero-shot vs. fine-tuned settings.
- Abstract: The growing popularity of large-scale visual-language models (VLMs) has led to their employment in various downstream applications as they provide a rich source of image and text representations. However, these representations are highly entangled and complex for machine learning developers and practitioners to interpret. Recent works have shown visualizations of image regions that VLMs focus on but fail to describe the change in explanations generated for visual-language classifiers in zero-shot (using image and text representations) vs. fine-tuned settings (using image representations). In this work, we perform the first empirical study to establish the trustworthy properties of explanations generated for VLMs used in zero-shot vs. fine-tuned settings. We show that explanations for zero-shot visual-language classifiers are more faithful than their fine-tuned counterparts. Further, we demonstrate that VLMs tend to attribute high importance to gender, despite it being non-indicative of the downstream task. Our experiments on multiple real-world datasets show interesting VLM behavior in zero-shot vs. fine-tuned settings, opening up new frontiers in understanding the trustworthiness of large-scale visual-language models.
-
Trustworthy model evaluation on a budget
- Keywords: Explainability, Trustworthy, Model Evaluation, Large-Scale
- TL;DR: Current Machine Learning practices for performing ablation experiments can lead to unreliable conclusions, where the selection of hyperparameters and the computational budget have strong interaction effects.
- Abstract: Standard practice in Machine Learning (ML) research uses ablation studies to evaluate a novel method. We find that errors in the ablation setup can lead to incorrect explanations for which method components contribute to the performance. Previous work has shown that the majority of experiments published in top conferences are performed with few experimental trials (fewer than 50) and manual sampling of hyperparameters. Using the insights from our meta-analysis, we demonstrate how current practices can lead to unreliable conclusions. We simulate an ablation experiment on an existing Neural Architecture Search (NAS) benchmark and perform an ablation study with 120 trials using ResNet50. We quantify the selection bias of Hyperparameter Optimization (HPO) strategies to show that only random sampling can produce reliable results when determining the top and mean performance of a method under a limited computational budget.
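A minimal sketch of the random-sampling protocol the abstract argues for is shown below: hyperparameters are drawn uniformly at random for every trial of every ablation variant before comparing top and mean performance. The search ranges and the `train_and_eval` helpers are hypothetical.

```python
import random

def run_ablation(variants, n_trials=120, seed=0):
    """variants: dict mapping variant name -> train_and_eval(**hyperparams) callable."""
    rng = random.Random(seed)
    results = {name: [] for name in variants}
    for name, train_and_eval in variants.items():
        for _ in range(n_trials):
            hp = {
                "lr": 10 ** rng.uniform(-4, -1),             # log-uniform learning rate
                "weight_decay": 10 ** rng.uniform(-6, -2),
                "batch_size": rng.choice([32, 64, 128, 256]),
            }
            results[name].append(train_and_eval(**hp))
    # Report (top, mean) performance per ablation variant under the same trial budget.
    return {name: (max(r), sum(r) / len(r)) for name, r in results.items()}
```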
-
N2G: A SCALABLE APPROACH FOR QUANTIFYING INTERPRETABLE NEURON REPRESENTATION IN LLMS
- Keywords: Interpretability
- TL;DR: We introduce N2G, a method for converting neurons in LLMs into interpretable graphs which can be visualized and automatically evaluated.
- Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose Neuron to Graph (N2G), a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples into an interpretable graph. This presents a less labour-intensive approach to interpreting neurons than current manual methods, one that will better scale these methods to LLMs. We use truncation and saliency methods to present only the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground-truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM into interpretable representations of measurable quality.
- Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/n2g-a-scalable-approach-for-quantifying/code)
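A hedged sketch of the truncation step is given below: for each activating example, keep the shortest token suffix that still drives the neuron above a threshold and merge the surviving suffixes into a prefix graph. The `neuron_activation` helper is hypothetical, and the saliency and augmentation stages described above are omitted.

```python
def build_neuron_graph(examples, neuron_activation, threshold):
    """Merge each example's shortest activating token suffix into a simple prefix graph."""
    graph = {}                                       # nested dicts used as a trie
    for tokens in examples:
        suffix = tokens                              # fall back to the full example
        for start in range(len(tokens) - 1, -1, -1):
            if neuron_activation(tokens[start:]) >= threshold:
                suffix = tokens[start:]              # shortest suffix that still activates
                break
        node = graph
        for tok in suffix:
            node = node.setdefault(tok, {})          # add the suffix as a path in the trie
    return graph
```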
-
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
- Abstract: Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 examples out of 3 million pretraining examples, can significantly manipulate the model's behavior, making it difficult to detect or unlearn such correlations. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks by independently re-aligning the representations for individual modalities. We demonstrate that unsupervised finetuning using a combination of multimodal contrastive and unimodal self-supervised objectives for individual modalities can significantly reduce the impact of the backdoor attack. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
- Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/cleanclip-mitigating-data-poisoning-attacks/code)
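A hedged sketch of a CleanCLIP-style finetuning objective follows: the standard multimodal contrastive (CLIP) loss plus unimodal contrastive terms that re-align two augmented views within each modality. The loss weighting and the reuse of the same InfoNCE form for the unimodal terms are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def clip_loss(emb_a, emb_b, tau=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / tau
    labels = torch.arange(len(emb_a), device=emb_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def cleanclip_style_loss(img_emb, txt_emb, img_emb_aug, txt_emb_aug, lam=1.0):
    multimodal = clip_loss(img_emb, txt_emb)                       # image-text alignment
    unimodal = clip_loss(img_emb, img_emb_aug) + clip_loss(txt_emb, txt_emb_aug)
    return multimodal + lam * unimodal                             # re-align each modality
```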
-
Leaving Reality to Imagination: Robust Classification via Generated Datasets
- Keywords: Robust, Classification, Generated Data, Stable Diffusion, Generative Modeling, ImageNet-G
- TL;DR: We find that ImageNet classifiers trained on real data augmented with generated data achieve high accuracy on natural distribution shifts. We further propose ImageNet-G, an evolving dataset to aid research in robust and trustworthy machine learning.
- Abstract: Recent research on robustness has revealed significant performance gaps between neural image classifiers trained on datasets that are similar to the test set, and those that are from a naturally shifted distribution, such as sketches, paintings, and animations of the object categories observed during training. Prior work focuses on reducing this gap by designing engineered augmentations of training data or through unsupervised pretraining of a single large model on massive in-the-wild training datasets scraped from the Internet. However, the notion of a dataset is also undergoing a paradigm shift in recent years. With drastic improvements in the quality, ease-of-use, and access to modern generative models, generated data is pervading the web. In this light, we study the question: How do these generated datasets influence the natural robustness of image classifiers? We find that ImageNet classifiers trained on real data augmented with generated data achieve higher accuracy and effective robustness than standard training and popular augmentation strategies in the presence of natural distribution shifts. Further, we introduce and analyze an evolving generated dataset, ImageNet-G-v1, to better benchmark the design, utility, and critique of standalone generated datasets for robust and trustworthy machine learning.
- Community Implementations: [ 1 code implementation](https://www.catalyzex.com/paper/leaving-reality-to-imagination-robust/code)
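The training-data recipe studied above amounts to mixing a real training set with a generated one and training a standard classifier on the union; a minimal PyTorch sketch with placeholder dataset paths is shown below.

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize(256),
                         transforms.CenterCrop(224),
                         transforms.ToTensor()])
# Placeholder paths: real training images and images sampled from a text-to-image model.
real = datasets.ImageFolder("data/imagenet_train", transform=tf)
generated = datasets.ImageFolder("data/generated_train", transform=tf)
# Train any standard classifier on the concatenation of the two datasets.
train_loader = DataLoader(ConcatDataset([real, generated]), batch_size=256, shuffle=True)
```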
-
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation
- Keywords: OOD detection, Out-of-distribution detection, ImageNet
- TL;DR: We show that the currently used datasets for out-of-distribution detection evaluation on ImageNet-1K are severely flawed and provide NINCO, a clean and challenging new OOD test dataset.
- Abstract: Out-of-distribution (OOD) detection is the problem of identifying inputs which are unrelated to the in-distribution task. OOD detection performance when the in-distribution (ID) is ImageNet-1K is commonly tested on a small range of test OOD datasets. We find that most of the currently used test OOD datasets have severe issues: in some cases more than 50% of the dataset contains objects belonging to one of the ID classes. These erroneous samples heavily distort the evaluation of OOD detectors. As a solution, we introduce NINCO, a novel test OOD dataset, with each sample checked to be ID-free, whose fine-grained range of OOD classes allows for a detailed analysis of an OOD detector's strengths and failure modes, particularly when paired with a number of synthetic "OOD unit-tests". We provide detailed evaluations across a large set of architectures and OOD detection methods on NINCO and the unit-tests, revealing new insights about model weaknesses and the effects of pretraining on OOD detection performance.
- Community Implementations: [ 5 code implementations](https://www.catalyzex.com/paper/in-or-out-fixing-imagenet-out-of-distribution/code)
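For reference, a minimal sketch of the kind of evaluation NINCO is used for follows: score ID and OOD images with a simple detector (maximum softmax probability here) and report the false-positive rate at 95% true-positive rate. The detector choice and loader names are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_scores(model, loader):
    """Maximum softmax probability per image; higher means more ID-like."""
    scores = []
    for x, _ in loader:
        probs = F.softmax(model(x), dim=-1)
        scores.append(probs.max(dim=-1).values)
    return torch.cat(scores).cpu().numpy()

def fpr_at_95_tpr(id_scores, ood_scores):
    threshold = np.percentile(id_scores, 5)         # keeps 95% of ID above the threshold
    return float((ood_scores >= threshold).mean())  # fraction of OOD wrongly accepted
```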
-
DP-Adam: Correcting DP Bias in Adam's Second Moment Estimation
- Abstract: We observe that the traditional use of DP with the Adam optimizer introduces a bias in the second moment estimation, due to the addition of independent noise in the gradient computation. This bias leads to a different scaling for low-variance parameter updates that is inconsistent with the behavior of non-private Adam and with Adam's sign-descent interpretation. Empirically, correcting the bias introduced by DP noise significantly improves the optimization performance of DP-Adam.
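One plausible reading of the correction, sketched below under the assumption that the DP mechanism adds Gaussian noise of known variance to each averaged clipped gradient, is to subtract that variance from Adam's second-moment estimate before the update. This is an illustrative re-implementation of an Adam-style step, not necessarily the paper's exact algorithm.

```python
import torch

def dp_adam_step(param, noisy_grad, state, noise_var,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam-style step on a privatized gradient, with a second-moment correction."""
    m, v, t = state["m"], state["v"], state["t"] + 1
    m = betas[0] * m + (1 - betas[0]) * noisy_grad
    v = betas[1] * v + (1 - betas[1]) * noisy_grad ** 2
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    v_hat = torch.clamp(v_hat - noise_var, min=eps)   # remove the DP noise's contribution
    param.data -= lr * m_hat / (v_hat.sqrt() + eps)
    state.update(m=m, v=v, t=t)
```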
-
GPT detectors are biased against non-native English writers
- Keywords: GPT detectors, bias, non-native English, prompt
- TL;DR: GPT detectors exhibit biases against non-native English writers, with simple prompting strategies mitigating this bias and effectively bypassing detection.
- Abstract: The rapid adoption of generative language models has brought about substantial advancements in digital communication, while simultaneously raising concerns regarding the potential misuse of AI-generated content. Although numerous detection methods have been proposed to differentiate between AI and human-generated content, the fairness and robustness of these detectors remain underexplored. In this study, we evaluate the performance of several widely-used GPT detectors using writing samples from native and non-native English writers. Our findings reveal that these detectors consistently misclassify non-native English writing samples as AI-generated, whereas native writing samples are accurately identified. Furthermore, we demonstrate that simple prompting strategies can not only mitigate this bias but also effectively bypass GPT detectors, suggesting that GPT detectors may unintentionally penalize writers with constrained linguistic expressions. Our results call for a broader conversation about the ethical implications of deploying ChatGPT content detectors and caution against their use in evaluative or educational settings, particularly when they may inadvertently penalize or exclude non-native English speakers from the global discourse.
- Community Implementations: [ 1 code implementation](https://www.catalyzex.com/paper/gpt-detectors-are-biased-against-non-native/code)
-
MathPrompter: Mathematical Reasoning using Large Language Models
- Keywords: Mathematical Reasoning, Large language models, Foundation models, Zero-shot learning, chain-of-thought prompting
- TL;DR: A Zero-shot chain-of-thought prompting method for mathematical reasoning using large language models
- Abstract: Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks and often provide incorrect answers. Unlike natural language understanding, math problems typically have a single correct answer, making the task of generating accurate solutions more challenging for LLMs. To the best of our knowledge, no LLMs indicate their level of confidence in their responses, which fuels a trust deficit in these models and impedes their adoption. To address this deficiency, we propose ‘MathPrompter’, a technique that improves the performance of LLMs on arithmetic problems along with increased confidence in the predictions. MathPrompter uses a Zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or Python functions to solve the same math problem in different ways and thereby raise the confidence level in the output results. This is in contrast to other prompt-based CoT methods, where there is no check on the validity of the intermediate steps followed. Our technique improves over the state-of-the-art on the MultiArith dataset (78.7% → 92.5%) evaluated using a 175B-parameter GPT-based LLM.
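A hedged sketch of the consistency idea is shown below: ask an LLM (via a hypothetical `llm_generate` helper) for several independent Python solvers for the same problem, run them, and use the level of agreement as a confidence signal. The prompt wording and voting rule are assumptions for illustration.

```python
from collections import Counter

def math_prompter(question, llm_generate, n_candidates=5):
    answers = []
    for _ in range(n_candidates):
        code = llm_generate(
            f"Write a Python function solve() returning the answer to: {question}")
        scope = {}
        try:
            exec(code, scope)                  # run the candidate solver (illustration only)
            answers.append(scope["solve"]())
        except Exception:
            continue                           # discard candidates that fail to run
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_candidates        # answer plus an agreement-based confidence
```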
-
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
- Abstract: ChatGPT is a recent chatbot service released by OpenAI that has received increasing attention over the past few months. While evaluations of various aspects of ChatGPT have been done, its robustness, i.e., its performance on unexpected inputs, is still unclear to the public. Robustness is of particular concern in responsible AI, especially for safety-critical applications. In this paper, we conduct a thorough evaluation of the robustness of ChatGPT from the adversarial and out-of-distribution (OOD) perspective. To do so, we employ the AdvGLUE and ANLI benchmarks to assess adversarial robustness and the Flipkart review and DDXPlus medical diagnosis datasets for OOD evaluation. We select several popular foundation models as baselines. Results show that ChatGPT has consistent advantages on most adversarial and OOD classification and translation tasks. However, the absolute performance is far from perfect, which suggests that adversarial and OOD robustness remains a significant threat to foundation models.
- Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/on-the-robustness-of-chatgpt-an-adversarial/code)
-
FEDCLIP: FAST GENERALIZATION AND PERSONALIZATION FOR CLIP IN FEDERATED LEARNING
- Abstract: When federated learning (FL) meets trustworthy and reliable large-scale models, two critical challenges arise: data distribution heterogeneity and high resource costs. Specifically, the non-IID data across clients makes it hard for existing FL algorithms to converge, while the high resource costs, including computational and communication costs, increase the difficulty of deployment in real-world scenarios. In this paper, we propose an effective yet simple method, named FedCLIP, to achieve fast generalization and personalization for CLIP in federated learning. Concretely, we design an attention-based adapter for the large model, CLIP, and the remaining operations depend only on the adapters. Lightweight adapters make the most of the pretrained model's information and ensure the model adapts to clients' specific tasks. At the same time, the small-scale operations mitigate the computational and communication burdens caused by large models. Extensive experiments are conducted on three datasets with distribution shifts. Qualitative and quantitative results demonstrate that FedCLIP significantly outperforms other baselines (9% overall improvement on PACS) and effectively reduces computational and communication costs (283x faster than FedAVG).
- Community Implementations: [ 2 code implementations](https://www.catalyzex.com/paper/fedclip-fast-generalization-and/code)
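A hedged sketch of an attention-style adapter placed on top of frozen CLIP features is shown below; only this small module would be trained and communicated in FL. The layer sizes and gating form are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class CLIPAdapter(nn.Module):
    """Small trainable module that re-weights frozen CLIP features per dimension."""
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.gate = nn.Sequential(              # produces attention weights over feature dims
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, clip_features):
        return clip_features * self.gate(clip_features)   # re-weight the frozen features
```

Because only the adapter's parameters leave the client, the per-round communication cost is a small fraction of sending the full CLIP model, which is the efficiency argument made above.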