1、Introduction
    We introduce TestJudge, a benchmark for evaluating the quality and error-detection capability of generated unit tests. TestJudge contains 8,000 programming problems in Python and C++ sourced from Codeforces. Each problem comes with 10 diverse code submissions with known correctness labels. A generated test is considered valid only if it classifies all 10 submissions in agreement with their ground-truth verdicts.
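
    The validity rule above can be sketched as follows. This is only an illustration, not the repository's scoring code: the function name is_valid_test and the boolean encoding of verdicts (True = submission passes the test / is labeled correct) are assumptions made for the example.

```python
from typing import Sequence


def is_valid_test(predicted: Sequence[bool], ground_truth: Sequence[bool]) -> bool:
    """Return True only if the generated test classifies every submission correctly.

    predicted[i]    -- hypothetical encoding: True if submission i passes the generated test
    ground_truth[i] -- True if submission i is labeled correct
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    # A single misclassified submission makes the whole test invalid.
    return all(p == g for p, g in zip(predicted, ground_truth))


# Ten submissions with made-up correctness labels.
labels = [True, False, True, True, False, False, True, False, True, True]

# A test that agrees with every label is valid.
all_match = is_valid_test(labels, labels)

# A test that lets one incorrect submission pass is not.
one_miss = list(labels)
one_miss[1] = True
single_error = is_valid_test(one_miss, labels)

print(all_match, single_error)
```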

2、File Description
    (1) data/ Folder containing the evaluation datasets
    (2) action.py Code execution logic, including pytest and gtest
    (3) client.py Client for sending requests to the LLM API
    (4) prompt.py Prompts for all tasks
    (5) run.py Main entry point
    (6) utils.py Miscellaneous utility functions

3、Quick Start
    (1) Environment Setup
    conda create -n testjudge python=3.10
    conda activate testjudge
    pip install -r requirements.txt

    (2) Execution
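    python run.py --model_id "" \
        --server_url "http://localhost:8000/v1" \
        --api_key "" \
        --log_dir experiment/ \
        --temperature 0.8 --top_p 0.8 --max_tokens 4096 \
        --concurrency 50 --output_type (code,test_code,text) --subtasks (python,cpp) --codes (1,2,5,10) (--think)

    Note: the parentheses are placeholders, not shell syntax. Replace each parenthesized list with the value(s) you want, e.g. --subtasks python. The --think flag is optional.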
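
    As a rough sketch, one concrete invocation of the command above could be assembled from Python as shown below. The helper build_command and the model name "my-model" are hypothetical, and the sketch assumes each of --output_type, --subtasks and --codes takes a single value from its listed options; check run.py for the actual argument handling.

```python
import shlex

# Option values as listed in the usage line above.
OUTPUT_TYPES = ("code", "test_code", "text")
SUBTASKS = ("python", "cpp")
CODES = (1, 2, 5, 10)


def build_command(model_id, output_type, subtask, codes, think=False,
                  server_url="http://localhost:8000/v1", api_key="",
                  log_dir="experiment/"):
    """Build the argument list for a single run.py invocation (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output_type: {output_type}")
    if subtask not in SUBTASKS:
        raise ValueError(f"unknown subtask: {subtask}")
    if codes not in CODES:
        raise ValueError(f"unsupported codes value: {codes}")
    cmd = [
        "python", "run.py",
        "--model_id", model_id,
        "--server_url", server_url,
        "--api_key", api_key,
        "--log_dir", log_dir,
        "--temperature", "0.8", "--top_p", "0.8", "--max_tokens", "4096",
        "--concurrency", "50",
        "--output_type", output_type,
        "--subtasks", subtask,
        "--codes", str(codes),
    ]
    if think:
        cmd.append("--think")  # optional flag from the usage line
    return cmd


cmd = build_command("my-model", "test_code", "python", 10)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
# To actually launch it: subprocess.run(cmd, check=True)
```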