# RAG Corpus Extraction Framework

This repository implements a black-box corpus extraction framework for Retrieval-Augmented Generation (RAG) systems. The method iteratively explores the retriever embedding space, synthesizes new queries, and reconstructs hidden retrieval corpora under limited query budgets.

---

## File Descriptions

### data_proc.py
Loads 8 public HuggingFace datasets, samples 500 texts per dataset, and builds a merged corpus.
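
The sampling-and-merging step can be sketched as follows. This is a minimal illustration that assumes each dataset has already been loaded into a list of strings. The function name `build_merged_corpus`, the fixed seed, and the toy dataset names are hypothetical and not taken from the repository.

```python
import random


def build_merged_corpus(datasets, per_dataset=500, seed=0):
    """Sample up to `per_dataset` texts from each dataset and merge them.

    `datasets` maps a dataset name to a list of text strings. This input
    shape is an assumption; the actual script loads data from HuggingFace.
    """
    rng = random.Random(seed)
    merged = []
    for name in sorted(datasets):  # sorted so the merge order is reproducible
        texts = datasets[name]
        k = min(per_dataset, len(texts))  # small datasets contribute all texts
        for text in rng.sample(texts, k):
            merged.append({"source": name, "text": text})
    return merged


# Toy stand-ins for the public datasets (hypothetical names and contents).
toy = {
    "dataset_a": [f"a-{i}" for i in range(1000)],
    "dataset_b": [f"b-{i}" for i in range(300)],
}
corpus = build_merged_corpus(toy, per_dataset=500)
print(len(corpus))
```

Keeping the source name next to each text makes it possible to report extraction results per dataset later on.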

### data_encoding.py
Encodes the merged corpus with a sentence-transformer retriever and saves the resulting embeddings.

### RAG.py
Implements a black-box RAG system:
- Loads corpus and embeddings  
- Performs nearest-neighbor retrieval  
- Builds retrieval-augmented prompts  
- Queries external LLM API  
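
The retrieval and prompt-building steps above can be sketched as follows. This is a minimal illustration in plain Python; cosine similarity, the prompt template, and the names `retrieve` and `build_prompt` are assumptions, and the call to the external LLM API is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query_vec, corpus_vecs, k=2):
    """Return indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Assemble a retrieval-augmented prompt (hypothetical template)."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy corpus with hand-written 2-d embeddings.
corpus = ["cats purr", "dogs bark", "birds sing"]
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

top = retrieve([0.9, 0.1], corpus_vecs, k=2)
prompt = build_prompt("What do cats do?", [corpus[i] for i in top])
print(prompt)
```

From the attacker's point of view, only the final LLM response is visible, which is why the retrieved context has to be parsed back out of the answer.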

### utils.py
Provides core algorithmic components:
- Surrogate embedding encoder  
- Latent inversion with semantic-guided decoding  
- Gaussian local perturbation  
- Orthogonal direction synthesis  
- Retrieved context parsing  
- Logging utilities  
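
Two of the components above, Gaussian local perturbation and orthogonal direction synthesis, can be sketched as follows. This is one common way to realize them (isotropic noise and Gram-Schmidt projection) and is an assumption, not necessarily what `utils.py` does. The function names and parameters are hypothetical.

```python
import math
import random


def gaussian_perturb(vec, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation `sigma`."""
    return [x + rng.gauss(0.0, sigma) for x in vec]


def orthogonal_direction(basis, dim, rng, eps=1e-10):
    """Draw a random unit vector orthogonal to every vector in `basis`.

    `basis` is assumed to contain orthonormal vectors of length `dim`.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for b in basis:
        # Remove the component of v that lies along b (Gram-Schmidt step).
        proj = sum(x * y for x, y in zip(v, b))
        v = [x - proj * y for x, y in zip(v, b)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        raise ValueError("basis already spans the space")
    return [x / norm for x in v]


rng = random.Random(0)
e1 = [1.0, 0.0, 0.0]

direction = orthogonal_direction([e1], dim=3, rng=rng)
nearby = gaussian_perturb(e1, sigma=0.05, rng=rng)
```

Perturbation explores the neighborhood of a known embedding, while an orthogonal direction moves the search toward regions the previous queries did not cover.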

### main.py
Implements the **Retrieve–Plan–Invert** extraction loop:
- Start from seed queries  
- Explore embedding space  
- Invert latent vectors into new queries  
- Query RAG system  
- Accumulate extracted corpus chunks  
- Save intermediate logs  
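
The control flow of the steps above can be sketched as a budget-limited loop. This is a simplified illustration, not the repository's implementation: `query_rag` and `propose_queries` are hypothetical callables, and the embedding-space exploration and latent inversion are abstracted into `propose_queries`. Expanding only from queries that reveal new chunks is also an assumption.

```python
from collections import deque


def extraction_loop(seed_queries, query_rag, propose_queries, budget):
    """Run a budget-limited extraction loop.

    query_rag(query) -> list of retrieved chunk strings (hypothetical).
    propose_queries(query, new_chunks) -> follow-up queries (hypothetical
    stand-in for the explore-and-invert steps).
    """
    frontier = deque(seed_queries)
    extracted = set()
    log = []
    while frontier and len(log) < budget:
        query = frontier.popleft()
        chunks = query_rag(query)
        new = [c for c in chunks if c not in extracted]
        extracted.update(new)
        log.append({"query": query, "new_chunks": len(new)})
        if new:  # assumption: only expand from queries that found something
            frontier.extend(propose_queries(query, new))
    return extracted, log


# Toy black box: each query maps to the chunks it would retrieve.
hidden = {"q0": ["c1", "c2"], "c1": ["c2", "c3"], "c2": ["c1"], "c3": ["c4"]}

extracted, log = extraction_loop(
    ["q0"],
    query_rag=lambda q: hidden.get(q, []),
    propose_queries=lambda q, chunks: chunks,  # reuse new chunks as queries
    budget=10,
)
print(sorted(extracted), len(log))
```

The per-query log entries are what an evaluation script would need to relate recovered chunks to the number of queries spent.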

### evaluate.py
Computes extraction performance:
- Coverage
- Efficiency
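
The README does not define these two metrics, so the sketch below uses assumed definitions: coverage as the fraction of corpus chunks recovered, and efficiency as the number of correctly recovered chunks per issued query. The function names and the exact-match comparison are hypothetical.

```python
def coverage(extracted, corpus):
    """Fraction of corpus chunks that were recovered (assumed definition)."""
    corpus_set = set(corpus)
    if not corpus_set:
        return 0.0
    return len(set(extracted) & corpus_set) / len(corpus_set)


def efficiency(extracted, corpus, num_queries):
    """Correctly recovered chunks per query (assumed definition)."""
    if num_queries <= 0:
        return 0.0
    return len(set(extracted) & set(corpus)) / num_queries


corpus_chunks = ["a", "b", "c", "d"]
recovered = {"a", "b", "x"}  # "x" is not in the corpus and is ignored

cov = coverage(recovered, corpus_chunks)
eff = efficiency(recovered, corpus_chunks, num_queries=4)
print(cov, eff)
```

Exact string matching is the strictest choice; a fuzzy or embedding-based match would credit partially reconstructed chunks as well.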

---

## Running Order

### 1. Build corpus
Run `data_proc.py`.

### 2. Encode corpus
Run `data_encoding.py`.

### 3. Run extraction
Run `main.py`.

### 4. Evaluate performance
Run `evaluate.py`.

---


## Dependencies

- PyTorch  
- HuggingFace Transformers  
- Sentence-Transformers  
- HuggingFace Datasets  
- OpenAI-compatible LLM API  

---

## Notes

All datasets are public.  
No private data or personal information is used.
