# 🚀 ReFine
Official implementation of "Limited Reference, Reliable Generation: Rule-Guided Tabular Data Generation with Dual-Granularity Filtering"


## 🌟 Main Contributions

:one: We identify two key challenges that LLMs face when generating tabular data in low-data regimes: (i) distributional drift of the synthetic data; and (ii) localized redundancy in the synthetic data.

:two: To address the two challenges, we propose ReFine, a framework that constructs association rules to guide tabular data generation, and applies proxy-based distribution estimation with dual-granularity curation to correct localized redundancy.

:three: Experimental results demonstrate that ReFine consistently outperforms strong baselines, achieving up to 0.36 absolute gain in R² for regression and 7.5% relative improvement in F1 for classification. Comprehensive ablations further highlight the respective contributions of the Rules-Guided Generation and Dual-Granularity Filtering components.

---

## 📂 Synthetic data generation (Component I: Rules-Guided Generation)

Before generating synthetic data, set your own `api_key` and `base_url` in each of the three scripts used in the steps below.
```python
from openai import OpenAI

'''####### Your API key and base url ########'''
client = OpenAI(
    api_key="Your api key",
    base_url="Your chosen base_url"
)
```
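
If you prefer not to hardcode credentials in the scripts, one option is to read them from environment variables. The following is only a minimal sketch: the helper `load_client_kwargs` and the variable names `REFINE_API_KEY` and `REFINE_BASE_URL` are hypothetical and are not read by the repository's scripts unless you edit them yourself.

```python
import os


def load_client_kwargs(env=None):
    """Collect OpenAI client settings from environment variables.

    REFINE_API_KEY and REFINE_BASE_URL are hypothetical names chosen for
    this sketch; the repository's scripts do not read them by default.
    """
    env = os.environ if env is None else env

    api_key = env.get("REFINE_API_KEY")
    if not api_key:
        raise RuntimeError("Set REFINE_API_KEY before running the scripts.")

    kwargs = {"api_key": api_key}

    base_url = env.get("REFINE_BASE_URL")
    if base_url:
        # Only override the endpoint when one is provided.
        kwargs["base_url"] = base_url
    return kwargs
```

With this helper in place, the client construction in a script could become `client = OpenAI(**load_client_kwargs())`.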

### :one: Rule Extraction & Generation
Extract & generate rules.

```bash
python extraction_generalization.py --k 3 --input_file ./data/sampled_30.csv --output_file ./outputs/llm_generated_rules.jsonl
```
To use your own data, pass its path via `--input_file`; `--k` sets the number of best decision trees.

### :two: Rule Denoising
Denoise the rules generated in :one:.

Note that we manually collect the rule sets generated by step 3 in :one: to ensure a unified format. We provide 5 sampled rule sets in `./outputs/rule_set_samples.jsonl`.

```bash
python denoising.py --input_file ./outputs/rule_set_samples.jsonl --output_file ./outputs/consistency_result.jsonl
```
The final denoised rules are saved in `./outputs/consistency_result.jsonl`.

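
To take a quick look at the denoising output before moving on, you can load the JSONL file with the standard library. This is a minimal sketch that assumes only that the file holds one JSON value per line; it makes no assumption about the fields inside each record, and `read_jsonl` is a hypothetical helper, not part of this repository.

```python
import json


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a malformed record is easy to find.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err})"
                ) from err
    return records
```

For example, `len(read_jsonl("./outputs/consistency_result.jsonl"))` gives the number of records written by the denoising step.
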
### :three: Tabular Data Generation
Use the merged rules from :two: to guide the LLM in generating data.

```bash
python llm_generation.py --input_file ./data/sampled_30.csv --output_file ./data/synthetic_data.csv --consistency_rules_file ./outputs/consistency_result.jsonl
```
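
LLM-generated tables sometimes drop or rename columns, so it can be worth checking the synthetic file's header against the reference data before evaluation. The sketch below is an optional sanity check, not part of the ReFine pipeline; it assumes both CSV files start with a header row, and the helper names are hypothetical.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file (assumed to be the header)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def column_mismatch(reference_path, synthetic_path):
    """Compare column names of two CSV files.

    Returns (missing, unexpected): columns present only in the reference
    file, and columns present only in the synthetic file.
    """
    reference = read_header(reference_path)
    synthetic = read_header(synthetic_path)
    missing = [name for name in reference if name not in synthetic]
    unexpected = [name for name in synthetic if name not in reference]
    return missing, unexpected
```

Calling `column_mismatch("./data/sampled_30.csv", "./data/synthetic_data.csv")` returns two empty lists when the headers contain the same column names.
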
For convenience, we provide a typical synthetic dataset generated from Disease (N=30) at `./data/synthetic_data.csv`, so you can run the following command directly.

## 🧪 Evaluation (with Component II: Dual-Granularity Filtering)
```bash
python test_with_filtering_demo.py --df_train ./data/sampled_30.csv --df_syn ./data/synthetic_data.csv --df_test ./data/test.csv --filter --output_file ./data/filtered.csv
```
* `--df_train`: Path to the training data.
* `--df_syn`: Path to the synthetic data.
* `--df_test`: Path to the test data.
* `--filter`: Enable Dual-Granularity Filtering.
* `--output_file`: Path where the filtered data is saved.
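
The flags above map naturally onto an `argparse` interface. The following is a sketch of how such a parser could be declared, not the actual code of `test_with_filtering_demo.py`: which arguments are required, the help strings, and the treatment of `--filter` as a boolean switch are assumptions inferred from the example command.

```python
import argparse


def build_parser():
    """Declare the evaluation flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Evaluate synthetic data, optionally after filtering."
    )
    parser.add_argument("--df_train", required=True, help="Training data path")
    parser.add_argument("--df_syn", required=True, help="Synthetic data path")
    parser.add_argument("--df_test", required=True, help="Testing data path")
    # Assumed to be a switch: present means filtering is enabled.
    parser.add_argument(
        "--filter",
        action="store_true",
        help="Apply Dual-Granularity Filtering before evaluation",
    )
    # Assumed optional: only meaningful when filtered data is written out.
    parser.add_argument(
        "--output_file", default=None, help="Where to save the filtered data"
    )
    return parser
```

Under these assumptions, omitting `--filter` leaves `args.filter` set to `False`, so the same command without that flag would evaluate the unfiltered synthetic data.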





## 📊 Results
![Results](./pics/result.png)


## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Citations

