## File Descriptions

### 1. `inference_parsing.py`
**Purpose**: Parses LLM outputs and generates JSON and CSV results
- Processes LLM-generated outputs and converts them into structured formats
- Handles different model outputs (GPT, custom models) with appropriate parsing logic
- Generates CSV files for analysis (a CSV viewer extension is needed for proper display)
- Supports both contract and function test case types
- Includes data cleaning and validation functions
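
As a rough illustration of the parsing step described above, the sketch below pulls the first fenced code block out of a raw model response and writes it to CSV. The function names (`extract_code`, `to_csv`) and the column names (`task_id`, `test_case`) are hypothetical and are not taken from `inference_parsing.py`.

```python
import csv
import io
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown

# Matches a fenced block, with an optional language tag after the opening fence.
CODE_BLOCK = re.compile(FENCE + r"[A-Za-z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(raw_output: str) -> str:
    """Return the first fenced block, or the stripped raw text if none is found."""
    match = CODE_BLOCK.search(raw_output)
    return (match.group(1) if match else raw_output).strip()


def to_csv(rows: list[dict]) -> str:
    """Serialize parsed rows to CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["task_id", "test_case"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = "Here is a test:\n" + FENCE + "python\nassert add(1, 2) == 3\n" + FENCE + "\nDone."
rows = [{"task_id": "example/0", "test_case": extract_code(raw)}]
print(to_csv(rows))
```

Falling back to the stripped raw text keeps responses that omit the fence from being silently dropped; the actual script may handle that case differently.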

### 2. `model_setting.py`
**Purpose**: Configuration file for model loading and QLoRA setup
- Defines BitsAndBytes configuration for 4-bit quantization
- Sets up LoRA (Low-Rank Adaptation) configuration for efficient fine-tuning
- Configures target modules for attention layers (`q_proj`, `v_proj`)
- Provides centralized model configuration settings
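
A minimal sketch of what such a configuration might look like with the `transformers` and `peft` libraries. Only 4-bit quantization and the `q_proj`/`v_proj` target modules come from the description above; the function name `build_qlora_configs` and every hyperparameter value (rank, alpha, dropout, NF4 quantization type, double quantization, bfloat16 compute dtype) are assumptions chosen for illustration.

```python
TARGET_MODULES = ["q_proj", "v_proj"]  # attention projections named in the description


def build_qlora_configs(rank: int = 16, alpha: int = 32, dropout: float = 0.05):
    """Build (quantization config, LoRA config). Default values are illustrative only."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit quantization, as described
        bnb_4bit_quant_type="nf4",              # assumption: NF4 quantization
        bnb_4bit_use_double_quant=True,         # assumption: nested quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
    )
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        target_modules=TARGET_MODULES,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return bnb_config, lora_config
```

The quantization config would typically be passed as `quantization_config=` when loading the base model, and the LoRA config to `peft.get_peft_model`.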

### 3. `Instruction.py`
**Purpose**: Collection of instruction templates for model interactions
- Contains natural language instruction templates for different tasks
- Includes specialized prompts for contract analysis and function testing
- Provides structured output formats for various evaluation scenarios
- Supports both HumanEval and MBPP dataset formats
- Contains mask strings for different instruction types

### 4. `evaluation_code_generation.py`
**Purpose**: Evaluates code generation quality using pass@k metrics
- Implements pass@k evaluation for generated code
- Executes Python code safely with timeout protection
- Handles various error types and edge cases
- Provides comprehensive evaluation metrics for code quality
- Supports multiple evaluation scenarios and datasets
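
The description does not say which pass@k formula the script uses. The sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where `n` is the number of samples generated per problem and `c` is the number that pass. Treat it as an assumption about the metric rather than a copy of the script's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For `k=1` the estimate reduces to the plain fraction of correct samples, `c / n`.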

### 5. `after_quality.py`
**Purpose**: Quality assessment and filtering of test cases
- Compares ground truth code outputs with model-generated test case outputs
- Filters test cases based on two criteria:
  1. Ground truth code produces assertions
  2. Ground truth code produces assertions AND model output is error-free
- Currently supports GPT test cases only
- Generates filtered datasets for further processing
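
One way the two filtering criteria could be expressed is shown below. The record type `CaseResult` and its fields are hypothetical, and the reading of "produces assertions" as a boolean flag per test case is an assumption; the actual script may represent its results differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    task_id: str
    gt_produces_assertion: bool  # hypothetical flag for criterion 1
    model_error: Optional[str]   # None when the model output ran without error


def filter_cases(results, require_error_free=False):
    """Apply criterion 1, or criterion 2 when require_error_free is True."""
    kept = [r for r in results if r.gt_produces_assertion]
    if require_error_free:
        kept = [r for r in kept if r.model_error is None]
    return kept


results = [
    CaseResult("example/0", True, None),
    CaseResult("example/1", True, "NameError"),
    CaseResult("example/2", False, None),
]
print([r.task_id for r in filter_cases(results)])
print([r.task_id for r in filter_cases(results, require_error_free=True)])
```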

### 6. `evaluation_test_case_pass_k.py`
**Purpose**: Evaluates test case quality using pass@k metrics
- Executes generated test cases against both ground truth and model code
- Calculates pass@k results for test case effectiveness
- Stores results in the `pass_output` directory
- Handles various test case formats and execution scenarios
- Provides comprehensive evaluation metrics

### 7. `evaluation_test_case_our_metric.py`
**Purpose**: Custom evaluation metrics for test case quality
- Implements custom evaluation metrics beyond standard pass@k
- Provides detailed analysis of test case coverage and effectiveness
- Supports both functionality and contract-based evaluation
- Generates comprehensive reports with weighted metrics
- Handles edge cases and error scenarios

### 8. `train_valid_split.py`
**Purpose**: Splits test case datasets into training and validation sets
- Splits test cases in a 9:1 ratio (train:validation)
- Currently supports GPT test cases only
- Generates split summary CSV with statistics
- Creates separate files for train, validation, and test sets
- Maintains data integrity during the splitting process
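
A 9:1 split can be sketched as a seeded shuffle followed by a slice. The function name matches the file name only for readability; the seed, the rounding rule, and the shuffle itself are assumptions rather than details taken from the script.

```python
import random


def train_valid_split(items, valid_ratio=0.1, seed=0):
    """Shuffle deterministically, then hold out valid_ratio of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state untouched
    n_valid = round(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


train, valid = train_valid_split(range(100))
print(len(train), len(valid))
```

Using a fixed seed makes the split reproducible, and every item lands in exactly one of the two sets.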

### 9. `load_data.py`
**Purpose**: Data loading utilities for different datasets
- Supports multiple dataset types (HumanEval, MBPP variants)
- Defines section categories for test case classification
- Provides standardized data loading interfaces
- Handles dataset-specific configurations and formats

### 10. `reward_contract.py`
**Purpose**: Contract-based reward calculation for test cases
- Implements contract validation and scoring mechanisms
- Analyzes assertion coverage and effectiveness
- Provides metrics for contract-based evaluation
- Supports complex contract validation scenarios

### 11. `reward_function.py`
**Purpose**: Functionality-based reward calculation
- Implements coverage-based reward mechanisms
- Evaluates test case effectiveness using line and branch coverage
- Provides timeout protection for test execution
- Generates detailed reward metrics for optimization
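
The timeout protection mentioned above could be implemented in several ways. The sketch below runs a test snippet in a fresh interpreter via `subprocess` and reports a timeout as its own outcome. The function name `run_with_timeout` and the three status strings are hypothetical, and the coverage measurement itself is not shown.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float = 5.0) -> str:
    """Run code in a separate interpreter; return 'passed', 'failed', or 'timeout'."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return "timeout"
    return "passed" if completed.returncode == 0 else "failed"


print(run_with_timeout("assert 1 + 1 == 2"))
print(run_with_timeout("assert 1 + 1 == 3"))
print(run_with_timeout("while True: pass", timeout_s=0.5))
```

A separate process keeps an infinite loop or crash from taking down the evaluation run, but it is not a security sandbox for untrusted code.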

### 12. `evaluation_code_generation.py`
**Purpose**: Comprehensive code generation evaluation
- Evaluates generated code quality and correctness
- Implements multiple evaluation metrics and criteria
- Provides detailed analysis of code generation performance
- Supports various programming scenarios and datasets

## Additional Notes
- Most evaluation scripts include timeout protection and error handling
- The codebase supports both contract-based and functionality-based testing approaches
- Multiple evaluation metrics are available for comprehensive assessment
- The system is designed to handle various model outputs and dataset formats
- Error handling and logging are implemented throughout the evaluation pipeline