TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

ACL ARR 2026 January Submission 3470 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Model, Math Reasoning, Process Reward Model, Best-of-N
Abstract: Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-$N$ selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces a key limitation: the underutilization of the LLM's intrinsic latent representations. We introduce **_TrajSelector_**, an efficient and effective Best-of-$N$ framework that exploits the hidden states of the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the step-wise quality of each trajectory and aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that **_TrajSelector_** delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing 7B-scale process reward models by 4.31% to 12.21%, while using only 0.6B parameters and requiring a lower GPU memory footprint.
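The selection mechanism the abstract describes (score each trajectory step with a lightweight verifier, aggregate, pick the best of $N$) can be sketched as follows. The mean aggregation rule, the function name, and the example scores are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of Best-of-N selection over step-level verifier scores.
# Assumptions: the verifier has already produced one score per reasoning
# step (e.g., from the sampler LLM's hidden states), and scores are
# aggregated by their mean; the paper may use a different aggregation.

def select_best_of_n(step_scores_per_traj):
    """Return the index of the trajectory with the highest aggregated score.

    step_scores_per_traj: list of per-trajectory lists of step-level scores.
    """
    def aggregate(scores):
        # Mean over steps (an assumption; other rules, e.g. min or product,
        # are also common for process reward models).
        return sum(scores) / len(scores)

    return max(range(len(step_scores_per_traj)),
               key=lambda i: aggregate(step_scores_per_traj[i]))

# Example: three sampled trajectories with per-step verifier scores.
scores = [
    [0.9, 0.8, 0.7],   # trajectory 0, mean 0.800
    [0.6, 0.95, 0.9],  # trajectory 1, mean ~0.817
    [0.5, 0.5, 0.6],   # trajectory 2, mean ~0.533
]
best = select_best_of_n(scores)  # -> 1
```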
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: math QA, reasoning
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3470