Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

ACL ARR 2026 May Submission 17065 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: LLM agents, text-based agents, action selection, neural reranking, cross-environment transfer, agent evaluation, sample-efficient learning, efficient NLP

Abstract: Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would reduce per-environment model maintenance. Training DeBERTa-v3 (184M-434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 +/- 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance, and scaling model capacity yields limited benefits, indicating data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611, with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance due to seed 123 collapsing to +0.263 (4-seed mean +0.497 +/- 0.158), representing a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.
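The aggregate figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers printed above, not the authors' evaluation code. It assumes the reported spread is a sample standard deviation (n - 1 denominator), and the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-seed combined net gains for env-aware LoRA + PCGrad, as quoted in the
# abstract: the best seed (seed id not stated), seeds 456 and 789, and seed 123.
lora_pcgrad_gains = [0.611, 0.554, 0.559, 0.263]

seed_mean = mean(lora_pcgrad_gains)
# Assumption: the abstract reports the sample standard deviation (n - 1).
seed_std = stdev(lora_pcgrad_gains)

print(f"4-seed mean: {seed_mean:+.3f} +/- {seed_std:.3f}")

# Benchmark size: training instances after upsampling vs. raw unique states.
training_instances = 51_580
raw_unique_states = 41_740
added_by_upsampling = training_instances - raw_unique_states
print(f"instances added by minority-class upsampling: {added_by_upsampling}")
```

Under the sample-standard-deviation assumption this reproduces the reported +0.497 +/- 0.158, and the two dataset counts imply 9,840 instances added by upsampling.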

Paper Type: Long

Research Area: LLM agents

Research Area Keywords: LLM agents, agent evaluation, environment interaction, grounded agents, text-based agents, action selection, neural reranking, cross-environment transfer

Contribution Types: NLP engineering experiment, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: yes

Reassignment Request Area Chair: This is not a resubmission

Reassignment Request Reviewers: This is not a resubmission

Visa Needs: yes

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: Yes

A2 Elaboration: The dedicated Limitations section discusses potential risks and deployment caveats, including limited environment coverage, oracle dependency, static candidate sets, seed variance, and the gap between step-level action selection and full agent-in-the-loop success.

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: Yes

B1 Elaboration: Sections 2 and 4 cite the creators of the artifacts used, including ALFWorld, WebShop, ScienceWorld, DeBERTa-v3, ETO, LoRA/PCGrad-related prior work, and related text-agent benchmarks.

B2 Discuss The License For Artifacts: No

B2 Elaboration: The current submission cites artifact creators and uses the artifacts for research evaluation, but does not include a full license/terms table for every upstream benchmark and model. We will provide complete license and usage-term details in the released artifact documentation.

B3 Artifact Use Consistent With Intended Use: Yes

B3 Elaboration: Section 4 uses existing public benchmark, demonstration, simulator, and pretrained-model artifacts for research evaluation of text-based agent action selection. The derived candidate datasets and checkpoints are intended for research on action selection and agent evaluation, and their release will follow the original artifact access and usage conditions.

B4 Data Contains Personally Identifying Info Or Offensive Content: No

B4 Elaboration: No new data was collected from people. The constructed candidate-action records are derived from public text-based benchmark, demonstration, and simulator artifacts and consist of task descriptions, environment observations, and candidate actions. We did not identify personally identifying information in these constructed records; any release will follow the original benchmark data terms.

B5 Documentation Of Artifacts: Yes

B5 Elaboration: Section 4, Table 1, and the appendix document the covered environments, domains, language, data sources, training instances, raw unique states, candidate examples, candidate-set statistics, split protocol, upsampling factors, and few-shot data sizes.

B6 Statistics For Data: Yes

B6 Elaboration: Section 4, Table 1, and the appendix report the number of training instances, raw unique states, candidate examples, candidate-set statistics, train/test split, upsampling, and few-shot data sizes.

C Computational Experiments: Yes

C1 Model Size And Budget: No

C1 Elaboration: Sections 1, 3, 4, and 6 report model sizes and computing infrastructure, including DeBERTa-v3-base/large parameter counts and single A100 40GB GPU use. The current submission does not report a complete aggregate GPU-hour budget for all experiments.

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: Section 4 reports the experimental setup and hyperparameters, including optimizer, learning-rate range, weight decay, warmup, gradient clipping, maximum sequence length, number of epochs, episode-level split, and upsampling factors.

C3 Descriptive Statistics: Yes

C3 Elaboration: Sections 5 and 6 report whether results are single-seed clean evaluations, four-seed clean means, best-seed results, or preliminary diagnostics. The paper reports a 4-seed mean and standard deviation for three-environment training (+0.551 +/- 0.024) and a 4-seed mean and standard deviation for the env-aware LoRA + PCGrad extension (+0.497 +/- 0.158), while marking router warmup as preliminary.

C4 Parameters For Packages: No

C4 Elaboration: The paper reports the main model architectures, evaluation script names, training protocol, and key hyperparameters, but it does not provide complete version numbers or every package-specific preprocessing and evaluation parameter. We will include the full reproducibility environment and package-version details in the released artifact documentation.

D Human Subjects Including Annotators: No

D1 Instructions Given To Participants: N/A

D1 Elaboration: No new human participants or annotators were recruited for this work, so no participant instructions were used.

D2 Recruitment And Payment: N/A

D2 Elaboration: No new human participants or annotators were recruited or paid for this work.

D3 Data Consent: N/A

D3 Elaboration: No new human participants or annotators were recruited for this work, so no data consent was required.

D4 Ethics Review Board Approval: N/A

D4 Elaboration: No new human-subject data collection was conducted, so ethics review board approval was not applicable.

E Ai Assistants In Research Or Writing: Yes

E1 Information About Use Of Ai Assistants: Yes

E1 Elaboration: AI assistants were used for coding support, formatting, editing, consistency checking, and submission-form preparation. The author verified all claims, experiments, citations, numerical results, and final text.

Author Submission Checklist: yes

Submission Number: 17065