# Automated Student Answer Grading System

This project provides an **LLM + RAG-based automated grading pipeline**.  
It compares student answers against faculty-provided solutions and relevant textbook material, then uses Google Gemini for reasoning and final scoring.

---

## 🚀 Setup Instructions

### 1. Environment Setup

Create a Python virtual environment:

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On Windows:
venv\Scripts\activate
# On Linux/Mac:
source venv/bin/activate
```

---

### 2. Install Dependencies

Install all required packages:

```bash
pip install -r requirements.txt
```

Required packages include:

- LangChain and Google Gemini integration  
- FAISS for vector storage  
- HuggingFace Transformers for embeddings  
- PDF processing and Excel handling libraries  

---

### 3. Configuration Setup

#### Create Environment File

Create a `.env` file in the project root:

```bash
GOOGLE_GEMINI_API_KEY=your_api_key_here
```
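
The sketch below shows one way the key could be resolved at startup. It is not taken from the project code: `parse_env` and `resolve_api_key` are hypothetical helpers, and the project may instead rely on a loader package such as python-dotenv. Only the variable name `GOOGLE_GEMINI_API_KEY` comes from this README.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_api_key(env_text="", environ=None):
    """Prefer a real environment variable, then fall back to .env contents."""
    environ = os.environ if environ is None else environ
    key = environ.get("GOOGLE_GEMINI_API_KEY") or parse_env(env_text).get(
        "GOOGLE_GEMINI_API_KEY"
    )
    # Reject the placeholder value so a forgotten edit fails early.
    if not key or key == "your_api_key_here":
        raise RuntimeError("GOOGLE_GEMINI_API_KEY is missing or still a placeholder")
    return key
```

Failing early on the placeholder value avoids a less readable authentication error later in the pipeline.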

#### Configure Paths

Update `config.py` with your specific paths and settings:

```python
# Data paths - ensure these match your file structure
DATA_DIR = "data"
TEXTBOOK_PDF = f"{DATA_DIR}/history_book.pdf"
QUESTIONS_EXCEL = f"{DATA_DIR}/questions.xlsx"
ANSWER_KEY_EXCEL = f"{DATA_DIR}/faculty_answers.xlsx"
STUDENT_ANSWERS_EXCEL = f"{DATA_DIR}/Student_Answers_Real.xlsx"

# Model settings
LLM_MODEL_NAME = "gemini-2.5-pro"
EMBEDDING_MODEL_NAME = "gemini-embedding-001"

# Thresholds (adjust as needed)
THRESHOLD_RAG1 = 0.2  # 20% threshold for RAG1 scoring
THRESHOLD_RAG2 = 0.3  # 30% threshold for fallback RAG2
TOP_K_RAG2 = 5        # number of top documents to fetch from RAG2
CACHE_HOT_PROMOTE = 4 # occurrences to promote to hot cache
```
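
To illustrate what `CACHE_HOT_PROMOTE` might control, here is a minimal two-tier cache sketch. The class `TwoTierCache` and its methods are hypothetical, and reading "occurrences" as the number of lookups that hit the cold cache is an assumption; the project's cache manager may count differently.

```python
from collections import Counter

CACHE_HOT_PROMOTE = 4  # mirrors the value in config.py


class TwoTierCache:
    """Hypothetical cold/hot cache: entries start cold and move to hot after repeated hits."""

    def __init__(self, promote_after=CACHE_HOT_PROMOTE):
        self.promote_after = promote_after
        self.cold = {}         # key -> list of facts, rarely used so far
        self.hot = {}          # key -> list of facts, frequently used
        self.hits = Counter()  # cold-cache hit count per key

    def add(self, key, facts):
        """Store new facts in the cold tier."""
        self.cold[key] = list(facts)

    def get(self, key):
        """Return cached facts or None, promoting the entry once it is hit often enough."""
        if key in self.hot:
            return self.hot[key]
        if key not in self.cold:
            return None
        self.hits[key] += 1
        facts = self.cold[key]
        if self.hits[key] >= self.promote_after:
            # Move the entry from the cold tier to the hot tier.
            self.hot[key] = self.cold.pop(key)
        return facts
```

With the default of 4, an entry stays cold for its first three lookups and is promoted on the fourth.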

---

### 4. Data Preparation

Create the following directory structure and add your data files:

```
project_root/
├── data/
│   ├── history_book.pdf          # Textbook for RAG2
│   ├── questions.xlsx            # Questions list
│   ├── faculty_answers.xlsx      # Faculty answer key (RAG1)
│   └── Student_Answers_Real.xlsx # Student answers to grade
├── results/                      # Auto-created for outputs
└── faiss_index/                  # Auto-created for vector storage
```
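
Before running the pipeline, it can help to confirm that the four input files are in place. The following is a small sketch, not part of the project: `find_missing` is a hypothetical helper, and the file names are copied from the tree above.

```python
from pathlib import Path

DATA_DIR = "data"

# Input files listed in the directory tree above.
REQUIRED_INPUTS = [
    f"{DATA_DIR}/history_book.pdf",
    f"{DATA_DIR}/questions.xlsx",
    f"{DATA_DIR}/faculty_answers.xlsx",
    f"{DATA_DIR}/Student_Answers_Real.xlsx",
]


def find_missing(paths, root="."):
    """Return the paths (relative to root) that do not exist as regular files."""
    return [p for p in paths if not (Path(root) / p).is_file()]
```

Calling `find_missing(REQUIRED_INPUTS)` from the project root returns an empty list when everything is in place.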

#### Required Excel File Formats

**faculty_answers.xlsx (Answer Key):**

| QNumber | Question | Answer | Marks |
| --- | --- | --- | --- |

**Student_Answers_Real.xlsx (Student Submissions):**

| StudentID | Question | Answer |
| --- | --- | --- |

**questions.xlsx (Questions List):**

| Question |
| --- |

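A header check along the following lines can catch a misnamed column before grading starts. This is a sketch under assumptions: `missing_columns` is a hypothetical helper, and the column names are taken from the formats listed above. The header list could be read with, for example, `pandas.read_excel(path, nrows=0).columns`.

```python
# Required headers per workbook, as listed in this README.
REQUIRED_COLUMNS = {
    "faculty_answers.xlsx": ["QNumber", "Question", "Answer", "Marks"],
    "Student_Answers_Real.xlsx": ["StudentID", "Question", "Answer"],
    "questions.xlsx": ["Question"],
}


def missing_columns(filename, headers):
    """Return the required columns that are absent from the given header row."""
    # Ignore stray whitespace around header cells.
    present = {str(h).strip() for h in headers}
    return [col for col in REQUIRED_COLUMNS[filename] if col not in present]
```
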
---

## ⚙️ Execution Pipeline

### Step 1: Ingest Textbook (RAG2 Setup)

Process the textbook PDF and create the FAISS vector index:

```bash
python ingest.py
```

This step:

- Splits the textbook PDF into chunks  
- Generates embeddings using HuggingFace sentence-transformers  
- Creates and saves the FAISS vector store index  
- Preloads cold cache with relevant facts for each question  

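The chunking step can be pictured with the plain-Python sketch below. It is a stand-in, not the code in `ingest.py`: the project most likely uses a text splitter from its LangChain dependency, and the function name `split_into_chunks`, the chunk size, and the overlap are all assumed values chosen for illustration.

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```
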
---

### Step 2: Grade Student Answers

Run the main grading pipeline:

```bash
python grader.py
```

The grading process:

1. Loads faculty answer key (RAG1) and student answers  
2. Initializes cache manager and loads RAG2 FAISS index  
3. For each student answer:
   - Compares with RAG1 faculty answer using embedding similarity  
   - Falls back to the cold cache if the similarity is below the RAG1 threshold (`THRESHOLD_RAG1`)  
   - Retrieves new facts from RAG2 (the top `TOP_K_RAG2` textbook chunks) if still below threshold  
   - Uses LLM (Google Gemini) for final reasoning and scoring  
4. Saves results to `results/final_scores.xlsx`  
5. Automatically saves updated cache state  

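The tiered lookup in step 3 can be sketched as follows. This is one plausible reading of the flow, not the logic in `grader.py`: `cosine_similarity` and `choose_evidence` are hypothetical helpers, applying `THRESHOLD_RAG2` to cold-cache facts is an assumption, and the final Gemini call is left out. Embeddings are plain lists of floats here.

```python
import math

THRESHOLD_RAG1 = 0.2  # values mirror config.py
THRESHOLD_RAG2 = 0.3
TOP_K_RAG2 = 5


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def choose_evidence(student_vec, faculty_vec, cold_facts, retrieve):
    """Pick the evidence source for one answer.

    cold_facts: list of (fact_text, fact_vector) pairs from the cold cache.
    retrieve:   callable taking k and returning k textbook chunks (RAG2).
    Returns (source_name, extra_context) to pass on to the LLM.
    """
    # Tier 1: the faculty answer alone is close enough.
    if cosine_similarity(student_vec, faculty_vec) >= THRESHOLD_RAG1:
        return "rag1", []
    # Tier 2: reuse cached facts that resemble the student answer.
    hits = [
        text
        for text, vec in cold_facts
        if cosine_similarity(student_vec, vec) >= THRESHOLD_RAG2
    ]
    if hits:
        return "cold_cache", hits
    # Tier 3: fetch fresh chunks from the textbook index.
    return "rag2", retrieve(TOP_K_RAG2)
```
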
---

### Step 3: Save Cache (Optional)

Export cache data for analysis:

```bash
python save_cache.py
```

This saves:

- Cold cache → `data/final_cold_cache.json`  
- Hot cache → `data/final_hot_cache.json`  
- Complete cache object → `data/final_cache.pkl`  

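An export of this shape could look like the sketch below. It is an illustration only: `export_cache` is a hypothetical function, the caches are assumed to be JSON-serializable dictionaries, and the pickled object is assumed to be a simple dictionary holding both tiers. Only the three file names come from this README. Pickle files should only be loaded when they come from a trusted source.

```python
import json
import pickle
from pathlib import Path


def export_cache(cold, hot, out_dir="data"):
    """Write both cache tiers as JSON plus a combined pickle; return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "cold": out / "final_cold_cache.json",
        "hot": out / "final_hot_cache.json",
        "full": out / "final_cache.pkl",
    }
    # JSON exports are human-readable and convenient for analysis.
    paths["cold"].write_text(json.dumps(cold, indent=2, ensure_ascii=False), encoding="utf-8")
    paths["hot"].write_text(json.dumps(hot, indent=2, ensure_ascii=False), encoding="utf-8")
    # The pickle keeps both tiers together so they can be restored in one step.
    with paths["full"].open("wb") as fh:
        pickle.dump({"cold": cold, "hot": hot}, fh)
    return paths
```
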
---

## 📂 Output Locations

- **Results:**  
  Main output: `results/final_scores.xlsx` → Student scores and reasoning  

- **Cache Files:**  
  `data/final_cold_cache.json`, `data/final_hot_cache.json`, `data/final_cache.pkl` → Exported cache data  

- **Generated Files:**  
  FAISS Index: `faiss_index/`  
  Results Directory: `results/`  

---

## ✅ Summary

This project automates grading by combining:  

- **Faculty answers (RAG1)**  
- **Textbook retrieval (RAG2 with FAISS)**  
- **Google Gemini reasoning**  
- **Caching for efficiency**  

It writes structured marks and detailed explanations to an Excel file for easy review.
