Keywords: LLM Benchmark
Abstract: Large language models (LLMs) have achieved strong performance on standard benchmarks, yet this performance is not robust across different task manifestations. It remains unclear how performance changes under controlled task rewrites that preserve the original solution structure while varying the rewrite type and level. To address this question, we introduce ReTRE (Rewrite-based Transfer Robustness Evaluation), an evaluation benchmark inspired by learning transfer theory that probes transfer robustness along two rewrite levels: Near Transfer and Far Transfer. Given the increasing support for multimodal inputs in modern LLMs, ReTRE considers not only text-based rewrites but also modality-type rewrites. To ensure that the solution structure is preserved, ReTRE employs a multi-agent pipeline that extracts the solution steps from the original task, designs a corresponding transfer strategy, and generates rewritten variants. Each stage is equipped with a dedicated validation agent that iteratively verifies structure preservation and correctness. Evaluations on mathematical and science tasks across state-of-the-art multimodal LLMs reveal a consistent transfer gap: performance exhibits a general declining trend as transfer similarity drops, and models with strong text performance can still suffer marked declines under cross-modal transfer. Crucially, we identify a divergence between post-training paradigms: reinforcement learning preserves transfer robustness, whereas supervised fine-tuning tends to overfit the training distribution, leading to severe degradation in far-transfer performance despite strong in-distribution accuracy. The code is available at https://anonymous.4open.science/r/TransferRobust-E738/
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation; corpus creation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7336