# Code Selection Toolkit

Tools for selecting and transforming code datasets: preprocessing, embedding, subset selection, and difficulty scoring.

## Structure

```
supp/
├── preprocess/
│   ├── utils.py            # Common utilities
│   ├── execution.py        # Code execution sandbox
│   ├── batch_execute.py    # Parallel batch execution
│   ├── extract_solution.py # Extract code from markdown
│   ├── filter_python.py    # Filter valid Python
│   └── embedding.py        # Code embeddings
│
├── src/
│   ├── utils.py            # Text utilities
│   ├── distance.py         # Distance metrics
│   ├── similarity.py       # Similarity metrics
│   ├── selection.py        # Selection algorithms
│   ├── syntax.py           # AST analysis
│   ├── scoring.py          # Difficulty scoring
│   ├── transform.py        # Dataset transformation
│   └── template.jinja      # Code template
│
└── README.md
```

## Algorithms

### Selection (`src/selection.py`)

| Function | Description |
|----------|-------------|
| `k_center` | Greedy k-center selection |
| `facility_location` | Greedy maximization of the submodular facility-location objective |
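
As a rough illustration of the two strategies in the table, the sketch below implements greedy k-center (farthest-point traversal) and greedy facility location over precomputed dense matrices. The signatures, the `first` parameter, and the matrix-based interface are assumptions made for this example; the actual functions in `src/selection.py` may differ.

```python
import numpy as np


def k_center(dist: np.ndarray, budget: int, first: int = 0) -> list[int]:
    """Greedy k-center: repeatedly add the point farthest from the selected set.

    `dist` is an (n, n) pairwise distance matrix (hypothetical interface).
    """
    n = dist.shape[0]
    selected = [first]
    # Distance from every point to its nearest selected center.
    nearest = dist[first].copy()
    while len(selected) < min(budget, n):
        nxt = int(np.argmax(nearest))
        selected.append(nxt)
        nearest = np.minimum(nearest, dist[nxt])
    return selected


def facility_location(sim: np.ndarray, budget: int) -> list[int]:
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j].

    `sim` is an (n, n) non-negative similarity matrix (hypothetical interface).
    """
    n = sim.shape[0]
    selected: list[int] = []
    best = np.zeros(n)  # best similarity of each point to the selected set
    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate column j to the selection.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same item twice
        nxt = int(np.argmax(gains))
        selected.append(nxt)
        best = np.maximum(best, sim[:, nxt])
    return selected
```

k-center minimizes the worst-case distance from any point to its nearest selected item, so it favors outliers; facility location maximizes total similarity to the selected set, so it favors representatives of dense regions.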

### Distance (`src/distance.py`)

| Function | Description |
|----------|-------------|
| `cosine` | Cosine distance |
| `euclidean` | Euclidean (L2) distance |
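
For reference, a minimal sketch of the two metrics on a pair of vectors. The function signatures and the zero-vector convention are assumptions for illustration, not necessarily what `src/distance.py` does.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # assumed convention when either vector is all zeros
    return float(1.0 - np.dot(a, b) / denom)


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean (L2) distance between a and b."""
    return float(np.linalg.norm(a - b))
```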

### Similarity (`src/similarity.py`)

| Function | Description |
|----------|-------------|
| `cosine_sim` | Cosine similarity |
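
A possible batched form is sketched below: it computes the full pairwise cosine similarity matrix for a set of embeddings, which is the kind of input a facility-location selector typically consumes. The matrix-in, matrix-out signature is an assumption; the real `cosine_sim` may operate on pairs instead.

```python
import numpy as np


def cosine_sim(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for the rows of an (n, d) array."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Clip to avoid division by zero for all-zero rows.
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T
```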

### Syntax (`src/syntax.py`)

| Function | Description |
|----------|-------------|
| `syntax_distance` | AST Jaccard distance |
| `ast_coverage` | Greedy AST coverage |
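
The sketch below shows one way these two functions could work. It uses Python's standard `ast` module rather than tree-sitter to keep the example dependency-free, and it treats the set of node type names as the feature set. Both choices, and the signatures, are assumptions; `src/syntax.py` may use different features.

```python
import ast


def node_types(code: str) -> set[str]:
    """Set of AST node type names appearing in `code` (assumed feature set)."""
    return {type(node).__name__ for node in ast.walk(ast.parse(code))}


def syntax_distance(code_a: str, code_b: str) -> float:
    """Jaccard distance between the node-type sets of two snippets."""
    a, b = node_types(code_a), node_types(code_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ast_coverage(snippets: list[str], budget: int) -> list[int]:
    """Greedily pick the snippets that add the most unseen node types."""
    features = [node_types(s) for s in snippets]
    covered: set[str] = set()
    selected: list[int] = []
    for _ in range(min(budget, len(snippets))):
        remaining = [i for i in range(len(snippets)) if i not in selected]
        # Marginal gain = number of node types not yet covered.
        best = max(remaining, key=lambda i: len(features[i] - covered))
        selected.append(best)
        covered |= features[best]
    return selected
```

Under this feature set, two snippets that differ only in identifiers or literals have distance 0, so the metric captures structural rather than lexical variety.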

## Usage

### Transform Dataset

```bash
python src/transform.py \
    --strategy cosine_fl \
    --input data.parquet \
    --output out.jsonl \
    --budget 11 \
    --seed 42
```

**Strategies:** `all`, `random`, `cosine_center`, `cosine_fl`, `syntax_center`, `ast_coverage`, `cluster`, `herding`, `difficulty`

### Compute Scores

```bash
python src/scoring.py \
    --input data.parquet \
    --output scored.parquet \
    --model /path/to/model \
    --template template.jinja \
    --workers 8
```

### Batch Execute

```bash
python preprocess/batch_execute.py \
    --input solutions/ \
    --output evaluated/ \
    --benchmark benchmark.jsonl \
    --cpus 40
```

## Dependencies

```
torch, transformers, ray, pandas, numpy
scikit-learn, tree-sitter, tree-sitter-python
pandarallel, jinja2, tqdm, pyarrow
```
