# NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

# Generation
## Environment Setup
Create a virtual environment. 
```console
bash setup.sh
```

### NoFunEdit
```console
python3 src/nofunedit_generation.py --data_subset <subset from nofunedit: eg-latency> --model_path <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt> --num_samples <number of samples to be generated: eg-1> --precision <floating point format: eg-fp16> --batch_size <number of examples to send to llm engine at once: eg-1>
```
### Classification
```console
python3 src/classification_generation.py --data_subset <subset from non_func or humanevalclassify: eg-latency> --model <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --temperature <temperature to be set for model generation: eg-0> --max_new_tokens <maximum number of new tokens to be generated: eg-5192> --prompt <type of prompt to use from our dataset: eg-base_prompt> --precision <floating point format: eg-fp16> --batch_size <number of examples to send to llm engine at once: eg-1>
```
# Evaluation Scripts

## Evaluation
```console
python3 src/evaluation.py --data_subset <subset from nofunedit: eg-latency> --model_path <model name from HF: eg-WizardLM/WizardCoder-15B-V1.0> --prompt <type of prompt to use from our dataset: eg-base_prompt> --num_samples <number of samples to be generated: eg-1> --score_k <K values for score@k: eg-1,5,10,20> --metric <eval_metric to be used: eg-diffbleu>
```

### Example eval script (For maintainability)
```console
bash evaluation_example_script.sh
```

## Parameters

| Parameter                      | Description                                 |
| ----------------------------- | ---------------------------------------- |
| `data_subset`                     | The subset of data to use. Options: `latency`, `resource_util`, `maintainability`, `security`, `runtime_efficiency` for nofunedit. Additionally `humanevalclassify` for classification.|
| `model_path` | The path of the model from HF. Example: `WizardLM/WizardCoder-15B-V1.0`.
| `prompt`      | Prompt to use. Options: `base_prompt`, `one-shot`, `chain_of_thought`, `coding_concepts`. |
| `num_samples` | Number of samples to generate. Example: `1` (We used  `1` for greedy and `20` for higher temperature). **[NoFunEdit - Generation only]**|
| `max_new_tokens` | Budget for new token generation for a model. Example: `1200` (NoFunEdit: We used `1200` for runtime_efficiency and security for all prompts than CoT where `1500` was used. For others, we used `5192` or max possible limit. Classification: We used `4` for all generations).|
| `temperature` | Temperature for model generation. Example: `0` (We used `0` for greedy and `0.8` for higher samples) |
| `score_k` |K vales for Score@K. Example: `1,5,10,20` (Should not be greater than num_samples and is comma separated)  **[Eval only]** |
| `metric` | Metric to be used for evaluation. Option:  `diffbleu`, `codeql`, `codeql-diffbleu` (to be run after first two params are run), `classification`, `runtime` **[Eval only]**|

#### VLLM Parameters (for generation)
| Parameter                      | Description                                 |
| ----------------------------- | ---------------------------------------- |
| `batch-size` | Batch size. Default: `1`|
| `precision` | Floating point format: Default: `fp16` |
|  `tensor_parallel_size` | Default: `1` |
| `swap_space` | The size (GiB) of CPU memory per GPU to use as swap space: Default: `4` |