# Data Quality Evaluation

This repository provides tools for measuring the ratios of **unique** and **non-trivial** patterns in truth tables generated under different configurations (logic size, gate count, noise level, etc.).  
The goal is to measure how effectively each configuration produces diverse and meaningful logical behaviors rather than repetitive or trivial outputs.

---

## Directory Structure

```text
composition_analysis/
├── results/                         
│   ├── AN_metrics_vs_in.png         # AN model metrics comparison diagram
│   ├── summary_AN.xlsx              # AN model summary table
│   ├── summary_ANO.xlsx             # ANO model summary table
│   ├── truth_proportion_summary_AN.json    # AN model analysis results
│   ├── truth_proportion_summary_ANO.json   # ANO model analysis results
│   └── ...                          
├── calculate_ratio.py               # Ratio calculation script
├── convert_table.py                 # Table conversion script  
├── cross_compare.py                 # Cross-comparison script
├── readme.md                        # Project documentation
├── vis_combine.py                   # Visualization combination script
└── vis_table.py                     # Table visualization script
```

## Analysis Metrics

### Pattern Ratios

- **Nontrivial ratio** = Non-trivial vectors ÷ Total vectors
- **Unique ratio** = Unique non-trivial patterns ÷ Total vectors

### Key Statistics

- `unique_ratio`: Unique non-trivial patterns / total vectors
- `nontrivial_ratio`: Non-trivial vectors / total vectors
- `max_repeat_count`: Maximum repetitions of a single vector
- `avg_repeat`: Average repetitions per vector
- `total_vectors`: Total number of truth vectors
- `total_files`: Number of truth table files processed
- `unique_patterns`: Count of distinct patterns
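
As a rough illustration of how the statistics above relate to each other, the sketch below derives them from a `Counter` that maps each output pattern to its occurrence count. The helper names `is_trivial` and `pattern_stats` are hypothetical and are not taken from `calculate_ratio.py`. Two interpretations are assumed here: `unique_patterns` counts distinct non-trivial patterns (matching the unique ratio definition), and `avg_repeat` is total vectors divided by the number of distinct patterns of any kind.

```python
from collections import Counter


def is_trivial(pattern: str) -> bool:
    """Return True for all-zeros or all-ones patterns."""
    return set(pattern) in ({"0"}, {"1"})


def pattern_stats(counts: Counter) -> dict:
    """Compute summary statistics from pattern -> occurrence counts.

    Assumption: unique_patterns counts non-trivial distinct patterns, and
    avg_repeat is total vectors / distinct patterns (trivial ones included).
    """
    total_vectors = sum(counts.values())
    if total_vectors == 0:
        raise ValueError("no truth vectors to analyze")

    # Keep only the patterns that are neither all zeros nor all ones.
    nontrivial = {p: c for p, c in counts.items() if not is_trivial(p)}

    return {
        "total_vectors": total_vectors,
        "unique_patterns": len(nontrivial),
        "nontrivial_ratio": sum(nontrivial.values()) / total_vectors,
        "unique_ratio": len(nontrivial) / total_vectors,
        "max_repeat_count": max(counts.values()),
        "avg_repeat": total_vectors / len(counts),
    }


stats = pattern_stats(Counter({"101": 2, "111": 1, "100": 3}))
```

With this input, `stats["nontrivial_ratio"]` is 5/6 and `stats["unique_ratio"]` is 2/6, because `111` is trivial and is excluded from both numerators.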

------

## Usage

```
python calculate_ratio.py \
  --input_dir generated_aigs \
  --output_json truth_proportion_summary.json
```

**Arguments:**

- `--input_dir`: Root directory containing size-level truth table subdirectories
- `--output_json`: Path to save the computed proportion statistics

------

## How the Script Works

1. **Read truth vectors** from `.truth` files.
    Only the *output half* of each truth table is considered.
2. **Identify trivial patterns** (all-zeros and all-ones).
3. **Compute aggregated ratios** for both:
   - **Numerical average (cross-subdirectory)**:
      Average of ratios across and-gate subdirectories, treating each subdir equally regardless of its size.
   - **Cross-directory uniqueness (global aggregation)**:
      Merge all patterns from all subdirectories under the same input size, then compute ratios on the combined set.
      This captures *true global diversity*, since duplicates across subdirs are counted only once.
4. **Aggregate statistics** at multiple levels:
   - Per-subdirectory with noise-level breakdowns
   - Size-level with repeat pattern analysis
   - Global aggregation across all subdirectories
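
Steps 1 and 2 can be sketched roughly as follows. The on-disk layout of `.truth` files is not described in this README, so the sketch assumes one bit string per line, with the outputs stored in the second half of each string. The function names `output_half`, `is_trivial`, and `count_patterns` are hypothetical and may not match the actual script.

```python
from collections import Counter


def output_half(vector: str) -> str:
    """Keep the second half of a bit string (assumed to hold the outputs)."""
    return vector[len(vector) // 2:]


def is_trivial(pattern: str) -> bool:
    """Return True for all-zeros or all-ones patterns."""
    return set(pattern) in ({"0"}, {"1"})


def count_patterns(lines) -> Counter:
    """Count output patterns from an iterable of text lines.

    Assumption: each non-empty line is one truth vector written as 0/1 characters.
    """
    counts = Counter()
    for line in lines:
        bits = line.strip()
        if bits:  # skip blank lines
            counts[output_half(bits)] += 1
    return counts


# In-memory lines stand in for the contents of a .truth file.
counts = count_patterns(["0110\n", "1011\n", "\n", "0010\n"])
trivial_total = sum(c for p, c in counts.items() if is_trivial(p))
```

To process a real file under the same assumption, pass an open file handle, for example `count_patterns(open(path))`, since iterating a text file yields its lines.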

------

## Example: Numerical Average vs Global Uniqueness

We compare two ways of aggregating the non-trivial and unique ratios across subdirectories.

### Example Data
- **and10**: {101:2, 111:1, 100:3}  
- **and20**: {101:3, 110:1, 000:2}  

### Step 1: Subdir-level Ratios
- and10: non-trivial = 5/6, unique = 2/6  
- and20: non-trivial = 4/6, unique = 2/6  

### Step 2: Numerical Average
- non-trivial = (5/6 + 4/6) / 2 = **3/4**  
- unique = (2/6 + 2/6) / 2 = **1/3**  

### Step 3: Global Uniqueness
Merge all counts: `Counter({'101': 5, '100': 3, '111': 1, '110': 1, '000': 2})`

- Total = 12  
- non-trivial = (5 + 3 + 1) / 12 = **3/4**  
- unique non-trivial = {101, 100, 110} → 3/12 = **1/4**  

### Key Difference
- **Numerical average**: treats each subdir equally.  
- **Global uniqueness**: removes duplicates across subdirs, reflecting true overall diversity.  
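
The arithmetic in this example can be checked with a short script. This is only a sketch that reproduces the numbers above using exact fractions. The helper `ratios` and the variable names are illustrative and are not the functions used in `calculate_ratio.py`.

```python
from collections import Counter
from fractions import Fraction


def is_trivial(pattern: str) -> bool:
    """Return True for all-zeros or all-ones patterns."""
    return set(pattern) in ({"0"}, {"1"})


def ratios(counts: Counter):
    """Return (nontrivial_ratio, unique_ratio) as exact fractions."""
    total = sum(counts.values())
    nontrivial = {p: c for p, c in counts.items() if not is_trivial(p)}
    return (
        Fraction(sum(nontrivial.values()), total),
        Fraction(len(nontrivial), total),
    )


# Example data from this section: pattern -> occurrence count per subdirectory.
subdirs = {
    "and10": Counter({"101": 2, "111": 1, "100": 3}),
    "and20": Counter({"101": 3, "110": 1, "000": 2}),
}

# Numerical average: compute ratios per subdirectory, then average them equally.
per_subdir = [ratios(c) for c in subdirs.values()]
avg_nontrivial = sum(r[0] for r in per_subdir) / len(per_subdir)
avg_unique = sum(r[1] for r in per_subdir) / len(per_subdir)

# Global uniqueness: merge the counters first, then compute ratios once.
merged = sum(subdirs.values(), Counter())
global_nontrivial, global_unique = ratios(merged)

print(avg_nontrivial, avg_unique)        # 3/4 1/3
print(global_nontrivial, global_unique)  # 3/4 1/4
```

The non-trivial ratio agrees between the two methods here only because both subdirectories hold the same number of vectors. The unique ratio differs because `101` appears in both subdirectories and is counted once after merging.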


## Output Features

- Per-subdirectory: Average ratios with noise-level details  
- Cross-directory (global): Uniqueness and nontriviality across all data  
- Size-level stats: Includes max repeat ratio, average repeat, and unique pattern counts  
- Noise configurations: Flexible support (e.g., noise_0, noise_0.01, noise_0.05, etc.)  

