This repo contains all data and code for our paper: RETHINKING LLM-AS-A-JUDGE: REPRESENTATION-AS-A-JUDGE WITH SMALL LANGUAGE MODELS VIA SEMANTIC CAPACITY ASYMMETRY

To start:
You can use "collect_llm_res_vllm.py" to collect model responses to questions from different benchmarks. It will generate files like "Meta-Llama-3-8B-Instruct_math_results.json".
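The collection script's exact output schema is internal to the repo, but a results file like "Meta-Llama-3-8B-Instruct_math_results.json" can be expected to pair each benchmark question with the model's response. A minimal sketch of writing and reloading such a file (the `question`/`response` field names are assumptions, not the repo's actual schema):

```python
import json

# Hypothetical records: one entry per benchmark question.
# The field names are illustrative assumptions.
results = [
    {"question": "What is 7 * 8?", "response": "7 * 8 = 56."},
    {"question": "Simplify 12/16.", "response": "12/16 = 3/4."},
]

out_path = "Meta-Llama-3-8B-Instruct_math_results.json"
with open(out_path, "w") as f:
    json.dump(results, f, indent=2)

# Reload to confirm the file round-trips.
with open(out_path) as f:
    loaded = json.load(f)
print(len(loaded))  # number of collected responses
```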

Then, you can use "roscoe5dim_multiprocess.py" to obtain DeepSeek-V3 evaluation scores for the responses and generate the probing dataset. It will generate files like "Meta-Llama-3-8B-Instruct_math_results_roscoe5dim.jsonl".
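The scored output is JSONL (one JSON object per line), with five evaluation dimensions attached to each response. A hedged sketch of what one line might look like; the dimension names below are placeholders, not necessarily the repo's actual "roscoe5dim" keys:

```python
import json

# One scored record per JSONL line. The five dimension names here are
# illustrative assumptions; the repo's actual keys may differ.
record = {
    "question": "What is 7 * 8?",
    "response": "7 * 8 = 56.",
    "scores": {
        "dim1": 0.92,
        "dim2": 0.85,
        "dim3": 0.90,
        "dim4": 0.75,
        "dim5": 0.88,
    },
}

path = "Meta-Llama-3-8B-Instruct_math_results_roscoe5dim.jsonl"
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")

# Reading JSONL back: parse each line independently.
with open(path) as f:
    rows = [json.loads(line) for line in f]
print(len(rows[0]["scores"]))  # five evaluation dimensions
```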

Then, you can use our Jupyter notebooks, such as "probing_math.ipynb", to probe and build classifiers based on the probing datasets ("Meta-Llama-3-8B-Instruct_math_roscoe5dim_probing.json") for different tasks. The notebooks also include fine-tuning RoBERTa, and you can use "tuning_lora.py" to fine-tune small LMs. This reproduces our main results in Figure 3 and the ablation study in Figure 4.
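The core idea of probing is to train a simple classifier on frozen representations. A minimal sketch of a linear probe (logistic regression trained by gradient descent); the synthetic vectors below stand in for a model's hidden states, and the setup is an illustration, not the notebooks' actual code:

```python
import numpy as np

# Stand-in "representations": in the real pipeline these would be hidden
# states extracted from a small LM; labels would come from judge scores.
rng = np.random.default_rng(0)
n, d = 200, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))            # synthetic feature vectors
y = (X @ w_true > 0).astype(float)     # synthetic quality labels

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(300):
    z = np.clip(X @ w + b, -30, 30)    # clip to avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))       # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / n
    b -= lr * float(np.mean(p - y))

acc = float(np.mean(((X @ w + b) > 0) == (y == 1)))
print(f"train accuracy: {acc:.2f}")
```

Because the probe is linear and the frozen features are never updated, high probe accuracy is evidence that the quality signal is already present in the representations themselves.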

Finally, you can choose one trained classifier to filter data with "filter.py", run SFT with "sft_lora.py", and test model performance on the filtered dataset with "test_vllm.py". The results are shown in Figure 5.
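Classifier-based filtering reduces to keeping only the examples whose predicted quality clears a threshold. A minimal sketch; the `score` stub and the 0.7 threshold are assumptions standing in for the trained classifier and whatever cutoff "filter.py" actually uses:

```python
# Stub scorer: real code would apply the trained probe/classifier to the
# example's representation. Here the score is stored directly (an assumption).
def score(example):
    return example["quality"]

dataset = [
    {"text": "good solution", "quality": 0.91},
    {"text": "weak solution", "quality": 0.40},
    {"text": "fair solution", "quality": 0.75},
]

THRESHOLD = 0.7  # illustrative cutoff, not the repo's actual value
filtered = [ex for ex in dataset if score(ex) >= THRESHOLD]
print(len(filtered))  # examples kept for SFT
```

The filtered subset then serves as the SFT training set, and the held-out benchmark evaluation measures whether filtering by the classifier improves downstream performance.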