
# RelEval Submission

## Overview

This repository contains experimental data and results for evaluating large language models on logical reasoning tasks.

## Tasks

- **Task 1**: Plan generation
- **Task 2**: Consistency check
- **Task 3**: Comparison question

## Directory Structure

- `prompt_templates/` - Jinja templates for prompts used in experiments
- `input_data/` - Input graphs and conversion scripts for symbolic-to-natural-language transformation
- `logs/symbolic_task*/` - Results for symbolic relation experiments (using the `>` and `<` operators)
- `logs/nl_task*/` - Results for natural language framed experiments
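
To illustrate the symbolic-to-natural-language transformation mentioned above, here is a minimal sketch. It is not the conversion script shipped in `input_data/`. The function name `relation_to_sentence`, the `PHRASES` mapping, and the English wording are all assumptions made for illustration; the actual scripts and phrasing may differ.

```python
import re

# Hypothetical phrasing for each operator; the wording used in the
# actual experiments may differ.
PHRASES = {">": "is greater than", "<": "is less than"}

# Matches a single binary relation such as "A > B" or "x<y".
_RELATION = re.compile(r"^\s*(\w+)\s*([<>])\s*(\w+)\s*$")


def relation_to_sentence(relation: str) -> str:
    """Convert a symbolic relation like 'A > B' into an English sentence."""
    match = _RELATION.match(relation)
    if match is None:
        raise ValueError(f"Unrecognized relation: {relation!r}")
    left, op, right = match.groups()
    return f"{left} {PHRASES[op]} {right}."


if __name__ == "__main__":
    for rel in ["A > B", "B < C"]:
        print(relation_to_sentence(rel))
```

Keeping the operator-to-phrase mapping in one dictionary makes it easy to swap in a different natural language framing without touching the parsing logic.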

## Evaluation Framework

Model evaluation builds on an open-source framework (not yet widely adopted) that we modified to support multi-turn conversations. To preserve anonymity for double-blind review, the evaluation code is not included in this submission; it will be published upon paper acceptance.

**Note**: Some run logs were removed to stay within the upload size limits for this submission. The remaining logs are representative samples of the experimental results.
