# **Data Valuation by Leveraging Global and Local Statistical Information**

![Python](https://img.shields.io/badge/python-3.8-green.svg?style=plastic)
![PyTorch](https://img.shields.io/badge/pytorch-1.12-green.svg?style=plastic)


## 📌 Overview

**GLOC** (Global and Local characteristics-based data valuation) introduces a principled framework for assessing the importance of individual data points by leveraging both global and local distributional characteristics of data value estimations. While Shapley value-based methods are widely recognized for their theoretical rigor, they are often hindered by high computational costs and an inability to adapt effectively to dynamic data environments.

This repository provides the implementation of the **GLOC** framework along with its extensions—**IncGLOC** (Incremental GLOC) and **DecGLOC** (Decremental GLOC)—for efficient and adaptive data valuation.

---

## 📝 Abstract

Data valuation has garnered increasing attention in recent years, given the critical role of high-quality data in various applications. Among diverse data valuation approaches, Shapley value-based methods are predominant due to their strong theoretical grounding. However, the exact computation of Shapley values is often computationally prohibitive, prompting the development of numerous approximation techniques. Despite notable advancements, existing methods generally neglect the incorporation of value distribution information and fail to account for dynamic data conditions, thereby compromising their performance and application potential. In this paper, we highlight the crucial role of both global and local statistical properties of value distributions in the context of data valuation for machine learning. 
- First, we conduct a **comprehensive analysis** of these distributions across various simulated and real-world datasets, uncovering valuable insights and key patterns. 
- Second, we propose an **enhanced data valuation method** that integrates the explored distribution characteristics into two regularization terms to refine Shapley value estimation. The proposed regularizers can be seamlessly incorporated into various data valuation methods. 
- Third, we introduce a novel approach for **dynamic data valuation** that infers updated data values without recomputing Shapley values, thereby significantly improving computational efficiency.

Extensive experiments have been conducted across a range of tasks, including Shapley value estimation, value-based data addition and removal, mislabeled data detection, and dynamic data valuation. The results showcase the consistent effectiveness and efficiency of our proposed methodologies, affirming the significant potential of global and local value distributions in data valuation.
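
To make the cost argument in the abstract concrete, here is a minimal sketch of generic permutation-sampling (Monte Carlo) Shapley estimation. It is not the GLOC method and not code from this repository: the function `monte_carlo_shapley`, the additive `toy_utility`, and `TOY_WEIGHTS` are hypothetical names chosen for illustration. In a real setting each utility call would retrain and evaluate a model, which is why exact computation is prohibitive.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical per-point contributions for the toy utility below.
TOY_WEIGHTS = [0.5, 0.3, 0.2, -0.1]


def toy_utility(subset: Sequence[int]) -> float:
    """Additive stand-in for 'model performance when trained on subset'."""
    return sum(TOY_WEIGHTS[i] for i in subset)


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the data points."""
    rng = random.Random(seed)
    totals = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset: List[int] = []
        previous = utility(subset)  # utility of the empty set
        for idx in order:
            subset.append(idx)
            current = utility(subset)
            totals[idx] += current - previous  # marginal contribution of idx
            previous = current
    return [total / n_permutations for total in totals]


estimates = monte_carlo_shapley(len(TOY_WEIGHTS), toy_utility)
print([round(v, 6) for v in estimates])
```

Because the toy utility is additive, every marginal contribution equals the point's weight, so the estimates should match `TOY_WEIGHTS` up to floating-point rounding. With a non-additive utility the estimates vary with the number of sampled permutations, and each permutation costs `n_points` utility evaluations.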



## 🧪 Usage

The core implementation of GLOC is located in `./opendataval/dataval/gloc/`.

### ▶ CIFAR-10 Example

#### Run GLOC:

```
python CIFAR10-GLOC.py
```

#### Run IncGLOC (Incremental Valuation):

```
python IncGLOC-CIFAR.py --l1 0.01 --l2 10 --eps 1
```

Or use the script:

```
sh train-inc.sh
```

#### Run DecGLOC (Decremental Valuation):
```
python DecGLOC-CIFAR.py --l1 0.01 --l2 10 --eps 1
```
Or use the script:

```
sh train-dec.sh
```

#### ⚙ Hyperparameter Descriptions

- `--l1`: Coefficient for the local distribution regularization term.
- `--l2`: Coefficient for the global distribution regularization term.
- `--eps`: Upper bound on permissible variation in data values during dynamic updates.
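
The flags above could be parsed along the following lines. This is only a sketch: the actual argument parsers in `IncGLOC-CIFAR.py` and `DecGLOC-CIFAR.py` are not shown in this README, so the function name `build_parser`, the `float` types, and the defaults (copied from the example commands above) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three documented flags."""
    parser = argparse.ArgumentParser(
        description="Dynamic data valuation (illustrative sketch)"
    )
    # Defaults are assumed from the example commands, not read from the scripts.
    parser.add_argument("--l1", type=float, default=0.01,
                        help="coefficient for the local distribution regularizer")
    parser.add_argument("--l2", type=float, default=10.0,
                        help="coefficient for the global distribution regularizer")
    parser.add_argument("--eps", type=float, default=1.0,
                        help="upper bound on variation in data values during updates")
    return parser


# Parse the same values used in the example commands.
args = build_parser().parse_args(["--l1", "0.01", "--l2", "10", "--eps", "1"])
print(args.l1, args.l2, args.eps)
```

Passing an explicit argument list to `parse_args` keeps the sketch runnable outside the command line; a real script would call `parse_args()` with no arguments to read `sys.argv`.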


## 📊 Datasets

Our experiments are conducted on twelve benchmark classification datasets. Their details are summarized below:

<p align="center">
  <img src="figure/dataset.jpg" width="100%" height="100%">
</p>

## 📈 Experimental Results

### Shapley Value Estimation Performance

<p align="center">
  <img src="figure/results1.jpg" width="100%" height="100%">
</p>

### Value-Based Data Addition and Removal

Removal:
<p align="center">
  <img src="figure/results2.jpg" width="100%" height="100%">
</p>

Addition:
<p align="center">
  <img src="figure/results3.jpg" width="100%" height="100%">
</p>

### Mislabeled Data Detection

<p align="center">
  <img src="figure/results4.jpg" width="100%" height="100%">
</p>

### Dynamic Data Valuation

One data point:
<p align="center">
  <img src="figure/results5.jpg" width="100%" height="100%">
</p>

Multiple data points:
<p align="center">
  <img src="figure/results6.jpg" width="100%" height="100%">
</p>


## 🪙 Acknowledgements

Some of the code in this project is adapted from [opendataval](https://github.com/opendataval/opendataval/). We thank its authors for their outstanding work.

