# Gemini File Search Baseline

Uses Google's [File Search API](https://ai.google.dev/gemini-api/docs/file-search) for RAG-based MADQA.

## Overview

This baseline:
1. Loads PDFs from the local `sample_pdfs/` directory
2. Indexes them in a Gemini File Search store (embeddings + chunking handled by Google)
3. Answers questions using the file search tool

## Setup

```bash
pip install -r requirements.txt
export GOOGLE_API_KEY="your-api-key"
```

## Usage

### 1. Index PDFs

First, index the PDFs from the sample directory:

```bash
# Index all PDFs in sample_pdfs/
python gemini_file_search_agent.py index

# Index from a different directory
python gemini_file_search_agent.py index --pdf-dir /path/to/pdfs

# Index with limit (for testing)
python gemini_file_search_agent.py index --limit 10
```
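As a rough illustration of what `--pdf-dir` and `--limit` control, the file selection could look like the sketch below. The function name `select_pdfs` and the sorted-order truncation are assumptions made for illustration; the actual script may pick files differently.

```python
from pathlib import Path
from typing import List, Optional


def select_pdfs(pdf_dir: str, limit: Optional[int] = None) -> List[Path]:
    """Return the PDFs directly inside pdf_dir, sorted by name, capped at limit.

    Hypothetical helper: mirrors the --pdf-dir and --limit options shown above.
    """
    directory = Path(pdf_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"PDF directory not found: {directory}")
    if limit is not None and limit < 0:
        raise ValueError("limit must be non-negative")

    # Match the extension case-insensitively; sort so a limited run
    # always indexes the same subset.
    pdfs = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return pdfs if limit is None else pdfs[:limit]
```

Sorting before truncating keeps small test runs reproducible across machines.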

### 2. Ask Questions

```bash
# Single question
python gemini_file_search_agent.py ask "What is the total revenue?"

# With output file
python gemini_file_search_agent.py ask "What is the total revenue?" -o result.json
```

### 3. Run Evaluation

```bash
# Evaluate on dev split (has ground truth)
python gemini_file_search_agent.py evaluate results.jsonl --split dev

# Evaluate on test split
python gemini_file_search_agent.py evaluate results.jsonl --split test

# With limit
python gemini_file_search_agent.py evaluate results.jsonl --split dev --limit 100
```

## How It Works

Gemini File Search is a single-shot retrieval approach:
1. PDFs are uploaded to a File Search store (Google handles chunking and embedding)
2. When answering questions, the model automatically retrieves relevant chunks
3. The model generates an answer based on retrieved context

Retrieval happens once per question. This differs from agentic approaches, which can iterate and refine their searches.
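The three steps above can be sketched with the `google-genai` Python SDK, following the File Search documentation linked at the top. This is a minimal sketch, not the code in `gemini_file_search_agent.py`: the helper names, the store display name `madqa-sample`, the file `sample_pdfs/example.pdf`, and the model `gemini-2.5-flash` are all assumptions, and the config is passed as a plain dict on the assumption that the SDK accepts it in place of its typed config classes.

```python
import time


def file_search_tool_config(store_name: str) -> dict:
    """Build a generate_content config that enables File Search on one store."""
    return {"tools": [{"file_search": {"file_search_store_names": [store_name]}}]}


def index_pdf(client, store_name: str, pdf_path: str, poll_seconds: float = 5.0) -> None:
    """Upload one PDF to the store and wait for the import operation to finish."""
    operation = client.file_search_stores.upload_to_file_search_store(
        file=pdf_path,
        file_search_store_name=store_name,
    )
    # Chunking and embedding run on Google's side; poll until done.
    while not operation.done:
        time.sleep(poll_seconds)
        operation = client.operations.get(operation)


def ask(client, store_name: str, question: str, model: str = "gemini-2.5-flash") -> str:
    """Single-shot question: the model retrieves chunks and answers in one call."""
    response = client.models.generate_content(
        model=model,  # assumed model name
        contents=question,
        config=file_search_tool_config(store_name),
    )
    return response.text


def main() -> None:
    # Imported here so the helpers above load without the SDK installed.
    from google import genai

    client = genai.Client()  # picks up the API key from the environment
    store = client.file_search_stores.create(
        config={"display_name": "madqa-sample"}  # hypothetical store name
    )
    index_pdf(client, store.name, "sample_pdfs/example.pdf")  # hypothetical file
    print(ask(client, store.name, "What is the total revenue?"))
```

Calling `main()` with `GOOGLE_API_KEY` set would create a store, index one file, and print one answer. There is no retry or follow-up query, which is what makes the approach single-shot.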

## Note on Sample PDFs

Due to file size constraints, only a small sample of PDFs is included in `../sample_pdfs/`.
For full evaluation, you will need to obtain the complete PDF corpus separately.
