Keywords: LLM Benchmark
Abstract: Large language models (LLMs) have achieved strong performance on standard benchmarks, yet this performance is not robust across different task manifestations. It remains unclear how performance changes under controlled task rewrites that preserve the original solution structure while varying the rewrite type and level. To address this question, we introduce ReTRE (Rewrite-based Transfer Robustness Evaluation), an evaluation benchmark inspired by learning transfer theory that probes transfer robustness along two rewrite levels: Near Transfer and Far Transfer. Given the increasing support for multimodal inputs in modern LLMs, ReTRE considers not only text-based rewrites but also modality-type rewrites. To ensure that the solution structure is preserved, ReTRE employs a multi-agent pipeline that extracts the solution steps from the original task, designs a corresponding transfer strategy, and generates rewritten variants. Each stage is equipped with a dedicated validation agent that iteratively verifies structure preservation and correctness. Evaluations on mathematical and science tasks across state-of-the-art multimodal LLMs reveal a consistent transfer gap: performance exhibits a general declining trend as transfer similarity drops, and models with strong text performance can still suffer marked declines under cross-modal transfer. Crucially, we identify a divergence between post-training paradigms: reinforcement learning preserves transfer robustness, whereas supervised fine-tuning tends to overfit the training distribution, leading to severe degradation in far-transfer performance despite strong in-distribution accuracy. The code is available at https://anonymous.4open.science/r/TransferRobust-E738/
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation; corpus creation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7336