
## Full Context & MemArt
The **Full Context** and **MemArt** evaluations build on **transformers v4.46.3**. The first command below runs the Full Context baseline; the second runs MemArt.

```bash
python -m eval.eval_full_context_llama --model_path $YOUR_MODEL_PATH --output_path $YOUR_OUTPUT_PATH 
```

```bash
python -m eval.eval_memArt_llama --model_path $YOUR_MODEL_PATH \
    --block_size 16 \
    --topk_threshold 128 \
    --mts_strategy "MAX" \
    --digest_strategy "bounding_cuboid" \
    --log_dir $YOUR_LOGS_PATH \
    --output_path $YOUR_OUTPUT_PATH
```
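
Since both scripts build on transformers v4.46.3, it may help to confirm the installed version before launching a long run. The following is a minimal sketch, not part of this repository; the helper names `installed_version` and `is_pinned_version` are hypothetical.

```python
from importlib import metadata

# Version stated in this README; other versions are untested here.
REQUIRED = "4.46.3"


def installed_version(package="transformers"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pinned_version(found, required=REQUIRED):
    """True only when the installed version matches the pinned one exactly."""
    return found == required


if __name__ == "__main__":
    found = installed_version()
    if is_pinned_version(found):
        print(f"transformers {found} matches the pinned version.")
    else:
        print(f"Expected transformers {REQUIRED}, found {found}.")
```

An exact match is deliberately strict; relax the comparison if you know a nearby version works in your environment.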


## Mem0

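For **Mem0**, we reuse the evaluation logic from the open-source Mem0 project and run it as a fully offline deployment: vLLM for inference, the bge-m3 embedding model, and a FAISS vector database. The loop below processes the ten conversations (indices 0 to 9) one at a time and clears the FAISS store before each run.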

```bash
for idx in {0..9}
do
    rm -rf /root/storage/faiss_memories/*
    echo "Processing Conversation $idx..."
    python eval/eval_mem0.py $idx # Execute the Python script with conversation index
    echo "Finished processing Conversation $idx."
done

echo "All conversations have been processed."
```

## Zep

For **Zep**, we use the Zep client API to **generate and retrieve memories**.
The retrieved memories are then composed into prompts and fed to a locally deployed **vLLM** model for inference.

```bash
python eval/eval_zep.py
```
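
To illustrate the "compose retrieved memories into a prompt" step, here is a minimal sketch. It is an assumption for illustration only: the function name `build_prompt`, the prompt template, and the sample memories are hypothetical and are not taken from `eval/eval_zep.py` or the Zep API.

```python
def build_prompt(memories, question):
    """Compose retrieved memory strings and a question into one prompt.

    The template is a hypothetical example, not the one used in the paper.
    """
    # Render each memory as a bullet; keep the section non-empty if nothing was retrieved.
    memory_block = "\n".join(f"- {m}" for m in memories) or "- (no memories retrieved)"
    return (
        "Answer the question using only the memories below.\n\n"
        f"Memories:\n{memory_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Made-up memories, for demonstration only.
    sample = ["Melanie camped at the beach.", "Melanie camped in the mountains."]
    print(build_prompt(sample, "Where has Melanie camped?"))
```

The resulting string would then be sent to the locally deployed model; how that request is made depends on your vLLM setup.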

## Data and Judge

The experimental data used in the paper is stored in the `data` folder. The QA results for each memory approach are formatted as follows, where `standard answer` is the reference answer from the **LoCoMo** dataset and `answer` is the response generated by the LLM using the corresponding memory approach.

```json
[
    ...
    {
      "question": "Where has Melanie camped?",
      "standard answer": "beach, mountains, forest",
      "answer": "The mountains and the forest"
    },
    ...
]
```
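
Before scoring, it can be useful to check that a results file follows the format above. The sketch below is not part of the repository; `load_results` and `validate_results` are hypothetical helpers, and the only assumption taken from this README is the three keys `question`, `standard answer`, and `answer`.

```python
import json

# Keys shown in the example record above.
REQUIRED_KEYS = ("question", "standard answer", "answer")


def load_results(path):
    """Load a results file and make sure the top level is a JSON array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of QA records")
    return data


def validate_results(records):
    """Return (index, missing_keys) for every record that lacks a required key."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            # A non-object entry is treated as missing every key.
            problems.append((i, list(REQUIRED_KEYS)))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((i, missing))
    return problems


if __name__ == "__main__":
    sample = [
        {
            "question": "Where has Melanie camped?",
            "standard answer": "beach, mountains, forest",
            "answer": "The mountains and the forest",
        }
    ]
    print(validate_results(sample))
```

An empty list means every record has the expected keys.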

You can evaluate the results using the provided scoring script:

```bash
python eval/judge.py --input $YOUR_RESULTS_JSON
```



