# Rethinking LLM Parametric Knowledge as Confidence for Effective and Efficient RAG
[English](readme.md) | [简体中文](readme_zh-CN.md)

## Intro

This repository contains:

- Code related to the Confidence Detection Model, including model structure, training dataset construction, and training code.

- Code for constructing the NQ_Rerank dataset

- Fine-tuning code and evaluation code for the Reranker model

- Evaluation code for the RAG system, including code for dynamic retrieval based on confidence

- Relevant model weights and datasets will be made public after the paper is published

## Confidence Detection Model

The Confidence Detection Model is trained following the procedure in the paper *Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception* (https://arxiv.org/abs/2502.11677). The code is adapted, with only minor modifications, from the authors' open-source implementation (https://github.com/Trustworthy-Information-Access/LLM-Knowledge-Boundary-Perception-via-Internal-States), which we acknowledge here.

### Hidden Detection Model Structure

See `MLPNet` in hidden_state_detection/models.py; this is the model used in the paper.

### Obtaining Internal Hidden States of the Target LLM

```bash
python -u run_nq.py \
--source ./data/nq/nq-dev.jsonl \
--type qa \
--ra none \
--outfile ./data/nq/nq-dev-res.jsonl \
--model_path YOUR_MODEL_PATH \
--batch_size 36 \
--task nq \
--max_new_tokens 512 \
--hidden_states 1 \
--hidden_idx_mode first \
--need_layers mid
```

- Set `--model_path` to the path of the target LLM's weights.

### Training Code

See hidden_state_detection/main.py. To start training directly, run `bash hidden_state_detection/scripts/run_nq.sh`.

## Construction of the NQ_Rerank Dataset

### Model-related Code

#### Communication with LLM

Serve your LLM behind an OpenAI-compatible interface. You can then call it as follows:

```python
from llms import OpenAIClient
llm_client = OpenAIClient({
    "model": "qwen",  # model name
    "url": "http://127.0.0.1:8081/v1/chat/completions",  # endpoint URL
})
message = [{"role": "user", "content": "Hello"}]
response = llm_client.chat(message)
```
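
For reference, an OpenAI-style chat completions endpoint expects a JSON body with `model` and `messages` fields. The sketch below builds such a request with the standard library only. `build_chat_request` is a hypothetical helper for illustration; it is not part of `llms.OpenAIClient`, whose internals may differ.

```python
import json
import urllib.request


def build_chat_request(url, model, messages):
    """Build (but do not send) an OpenAI-style chat completions request.

    Hypothetical helper for illustration; not part of this repository.
    """
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:8081/v1/chat/completions",
    "qwen",
    [{"role": "user", "content": "Hello"}],
)
# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live server.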

#### Constructing the NQ_Rerank Dataset

See the `generate_rerank_preference_dataset` function in wash_rerank_data.py. Before running it, you must obtain a confidence score for each context in the selected portion of the NQ dataset.
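
One way to picture this step: given a confidence score for each context of a query, higher-confidence contexts become positives and the rest become negatives. The sketch below is an assumption about the general shape only. The function name `build_preference_record`, the input layout, the `pos_threshold` value, and the toy data are all hypothetical, and the actual logic in `generate_rerank_preference_dataset` may differ. The output uses the `query`/`pos`/`neg` record layout that the Reranker section passes to `generate`.

```python
def build_preference_record(query, scored_contexts, pos_threshold=0.5):
    """Split contexts into positives and negatives by confidence score.

    scored_contexts: list of (context_text, confidence) pairs.
    pos_threshold: hypothetical cut-off; not taken from the repository.
    """
    pos = [text for text, conf in scored_contexts if conf >= pos_threshold]
    neg = [text for text, conf in scored_contexts if conf < pos_threshold]
    return {"query": query, "pos": pos, "neg": neg}


# Toy input for illustration only.
record = build_preference_record(
    "who wrote hamlet",
    [
        ("Hamlet is a tragedy by William Shakespeare.", 0.93),
        ("Hamlet is a village in Indiana.", 0.12),
    ],
)
```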

## Reranker Model

#### Loading the Reranker Model

Download the corresponding Reranker weights in advance. The rerankers can then be loaded and called as follows:

```python
from llms import BgeRerankCls, GteRerankCls, QwenRerankCls
# Replace "weights" with the local path to the downloaded weights.
bge_rerank_client = BgeRerankCls("weights")
gte_rerank_client = GteRerankCls("weights")
qwen_rerank_client = QwenRerankCls("weights")
eval_dataset = [
    {"query": "xx", "pos": [], "neg": []},
    {"query": "xx", "pos": [], "neg": []},
]
# top_k is an integer, e.g. 1, 3, or 5
generate_result = bge_rerank_client.generate(eval_dataset, top_k=3)
```

#### Reranker Training

See run_reranker_finetune.sh. To start fine-tuning, run `bash run_reranker_finetune.sh`.

#### Reranker Evaluation

Refer to evaluate_rerank.py; you can execute it with the following command:

```bash
python evaluate_rerank.py \
--llm_weight xxx/xxx \
--eval_data_path xxx/xxx \
--output_data_dir xxx/xxx \
--reranker_cls xxx \
--top_k 1/3/5
```

- llm_weight: Weight file of the Reranker
- eval_data_path: Path to the evaluation data file
- output_data_dir: Directory for evaluation output. Two files are written: a result file and an intermediate rerank result file.
- reranker_cls: Reranker class name; one of BgeRerankCls, QwenRerankCls, GteRerankCls
- top_k: Number of top-ranked contexts to keep after reranking (e.g., 1, 3, or 5)
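
The `top_k` option amounts to sorting contexts by reranker score and keeping the best k. A minimal sketch with made-up scores follows; `take_top_k` is a hypothetical helper, not a function from evaluate_rerank.py.

```python
def take_top_k(contexts, scores, top_k):
    """Return the top_k contexts ordered by descending reranker score."""
    ranked = sorted(zip(contexts, scores), key=lambda pair: pair[1], reverse=True)
    return [context for context, _ in ranked[:top_k]]


# Toy contexts and scores for illustration only.
top_contexts = take_top_k(["ctx_a", "ctx_b", "ctx_c"], [0.2, 0.9, 0.5], top_k=2)
```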



## Evaluation of the RAG System

### Evaluating the Accuracy of the RAG System

Refer to eval_rerank_and_llms.py; you can execute it with the following command:

```bash
python eval_rerank_and_llms.py \
    --llm_weight xxx/xxx \
    --input_data_path xxx/xxx \
    --output_data_path xxx/xxx \
    --eval_data_path xxx/xxx \
    --reranker_cls xxx \
    --top_k 1/3/5 \
    --data_type xxx \
    --gold_data_path xxx \
    --dynamic 0/1 \
    --threshold 0.98
```

- llm_weight: Weight file of the Reranker
- input_data_path: Path to the evaluation data file
- output_data_path: Output path for the Reranker's results, i.e., the top-k contexts after reranking
- eval_data_path: Output path for the RAG system's results; the final evaluation is computed from this output
- reranker_cls: Reranker class name; one of BgeRerankCls, QwenRerankCls, GteRerankCls
- top_k: Number of top-ranked contexts to keep after reranking (e.g., 1, 3, or 5)
- data_type: Format of the evaluation dataset; currently supports two types: nq, hotpot_qa
- gold_data_path: Required only when data_type = hotpot_qa, otherwise optional; see the code for details
- dynamic: Optional. Whether to enable confidence-based dynamic retrieval (CBDR): 0 disables it, 1 enables it. Default: 0
- threshold: Model confidence threshold; required when CBDR is enabled
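
The interaction between `dynamic` and `threshold` can be sketched as a simple gate: when dynamic retrieval is enabled and the confidence score for a question reaches the threshold, the LLM answers from its parametric knowledge; otherwise the retrieve-and-rerank path runs. This is an illustrative assumption about the control flow, including the direction of the comparison. `answer_question`, `answer_directly`, and `answer_with_rag` are hypothetical names, and eval_rerank_and_llms.py may implement this differently.

```python
def answer_question(question, confidence, answer_directly, answer_with_rag,
                    dynamic=False, threshold=0.98):
    """Route a question based on the model's confidence score.

    Assumption: confidence >= threshold means retrieval can be skipped.
    """
    if dynamic and confidence >= threshold:
        return answer_directly(question)  # skip retrieval
    return answer_with_rag(question)  # retrieve, rerank, then answer


# Toy stand-ins for the two answering paths.
def direct(question):
    return ("direct", question)


def rag(question):
    return ("rag", question)


high = answer_question("q1", 0.99, direct, rag, dynamic=True)
low = answer_question("q2", 0.40, direct, rag, dynamic=True)
off = answer_question("q3", 0.99, direct, rag, dynamic=False)
```

With `dynamic` disabled, every question takes the retrieval path regardless of its confidence score, which corresponds to the default setting of 0.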

## Not Yet Public

The following content will be released after the paper is published:

- Weights of the Confidence Detection Model
- Preference dataset NQ_Rerank
- Weights of the fine-tuned bge-reranker-v2-m3-ft model


