# Project Script Summaries

## 1. sentencbert_score.py
Evaluates similarity between model outputs and human references using Sentence-BERT embeddings. Computes reasoning/answer similarity scores and their harmonic mean. Processes multiple model prediction files.
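
The scoring described above can be sketched as follows. This is a minimal, dependency-free illustration, not the script's actual code: the function names `cosine_similarity` and `harmonic_mean` are hypothetical, and plain Python lists stand in for the Sentence-BERT embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def harmonic_mean(reasoning_score, answer_score):
    """Harmonic mean of the two scores; 0.0 if either score is not positive."""
    if reasoning_score <= 0 or answer_score <= 0:
        return 0.0
    return 2 * reasoning_score * answer_score / (reasoning_score + answer_score)


# Toy vectors standing in for embeddings (assumed values, for illustration only).
reasoning_sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
answer_sim = cosine_similarity([1.0, 0.0], [0.0, 1.0])
combined = harmonic_mean(reasoning_sim, answer_sim)
```

The harmonic mean penalizes imbalance: a model with strong reasoning similarity but a poor answer score (or the reverse) gets a low combined score.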

## 2. QwenLLM_Judge.py
Uses Qwen2.5-14B-Instruct as a judge to assess VLM output quality. Compares model reasoning and answers to references, scores them on a 1-5 scale, and calculates F1 metrics. Includes robust score extraction and batch processing.
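
One way to make score extraction robust is to try progressively looser patterns against the judge's free-text reply. The sketch below assumes the judge writes its rating as `Score: N`, `N/5`, or a bare digit; the function name `extract_score` and the patterns are illustrative assumptions, not taken from the script.

```python
import re

# Patterns ordered from most to least explicit (assumed reply formats).
_SCORE_PATTERNS = [
    r"score\s*[:=]\s*([1-5])\b",  # "Score: 4"
    r"\b([1-5])\s*/\s*5\b",       # "3/5"
    r"\b([1-5])\b",               # any standalone digit from 1 to 5
]


def extract_score(text):
    """Return the first 1-5 rating found in the judge output, or None."""
    for pattern in _SCORE_PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    # No usable rating; the caller decides whether to retry or skip the sample.
    return None
```

Returning `None` rather than a default value keeps unparseable replies from silently skewing the aggregate metrics.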

## 3. Qwen2VLEvaluation.py
Implements inference with Qwen2-VL-2B-Instruct for visual question answering. Generates explanations from images and questions.

## 4. Qwen2_5LLMSummary.py
Uses Qwen2.5-14B-Instruct to generate concise answers from reasoning contexts. Processes datasets with pre-generated reasoning to extract clear answers.

## 5. Lave_NEW.py
An updated variant of QwenLLM_Judge.py (see section 2).

## 6. Inference.py
Performs inference with Qwen2.5-VL-7B-Instruct, optionally with LoRA fine-tuned weights. Processes images and questions to generate structured reasoning/answer responses, with error handling.
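
Splitting a structured response into its reasoning and answer parts might look like the sketch below. The `Reasoning:` / `Answer:` labels and the `parse_response` helper are assumptions made for illustration; the script's actual output format is not specified here.

```python
import re

# Assumed response layout: "Reasoning: <text> Answer: <text>".
_RESPONSE_PATTERN = re.compile(
    r"Reasoning:\s*(.*?)\s*Answer:\s*(.*)",
    flags=re.DOTALL | re.IGNORECASE,
)


def parse_response(raw):
    """Split a model response into reasoning and answer fields.

    If the expected labels are missing, keep the raw text as reasoning and
    flag the record instead of raising, so a batch run can continue.
    """
    match = _RESPONSE_PATTERN.search(raw)
    if match is None:
        return {"reasoning": raw.strip(), "answer": "", "parse_error": True}
    return {
        "reasoning": match.group(1).strip(),
        "answer": match.group(2).strip(),
        "parse_error": False,
    }
```

Flagging malformed responses, rather than dropping them, makes it possible to inspect or regenerate them after the run.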

## 7. DS_review.py
Uses DeepSeek to review/correct answers in VQA datasets. Checks answer consistency with reasoning and suggests corrections.

## 8. DeepSeek_Score.py
Evaluates semantic similarity between model outputs and human references using DeepSeek. Calculates reasoning/answer scores and their harmonic mean for multiple models.

## 9. COT_Transform.py
Transforms descriptive reasoning into structured Chain-of-Thought (CoT) format using DeepSeek-Reasoner. Maintains original information in logical steps.

## 10. berscore.py
Evaluates model outputs using BERTScore for semantic similarity. Calculates reasoning/answer scores and their harmonic mean.

## 11. BenchGen.py
Generates a VQA dataset using a LoRA fine-tuned Qwen2.5-VL-7B-Instruct. Processes images recursively, answers polar-region questions, and saves results with intermediate checkpoints.
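
The recursive traversal and checkpointing could be organized roughly as below. This is a sketch under stated assumptions: `find_images`, `generate_dataset`, the JSON output layout, and the `answer_fn` callback (a stand-in for the model call) are all hypothetical names, not the script's own.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def find_images(root):
    """Recursively collect image files under root, in a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def generate_dataset(root, output_path, answer_fn, checkpoint_every=10):
    """Run answer_fn on every image, saving intermediate results to disk."""
    output_path = Path(output_path)
    # Resume from an existing checkpoint so an interrupted run loses little work.
    results = json.loads(output_path.read_text()) if output_path.exists() else []
    done = {record["image"] for record in results}

    for image in find_images(root):
        key = str(image)
        if key in done:
            continue
        results.append({"image": key, "answer": answer_fn(image)})
        if len(results) % checkpoint_every == 0:
            output_path.write_text(json.dumps(results, indent=2))

    # Final save covers any records added since the last checkpoint.
    output_path.write_text(json.dumps(results, indent=2))
    return results
```

In use, `answer_fn` would wrap the model inference call; passing it in as a parameter keeps the traversal and checkpoint logic testable without loading a model.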