LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
Track: long paper (up to 10 pages)
Keywords: parallel reasoning, rope, positional encoding
Abstract: Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt.
Such techniques improve accuracy and, since the $N$ generations are batched, make better use of compute cores.
However, each sequence in the batch is traditionally generated independently and hence does not reuse compute, intermediate generations, or observations from the other sequences.
In this paper, we propose LaneRoPE to enable coordination and collaboration between $N>1$ sequences at generation time.
LaneRoPE involves two key ideas:
(a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and
(b) a RoPE extension to inject positional information that captures relative positions between tokens, both within and outside a particular sequence.
We evaluate our approach on mathematical reasoning tasks and find promising results:
LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under a limited generation-length budget.
Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible number of new learned parameters, it offers an appealing way to rapidly add collaborative reasoning to existing LLM inference pipelines.
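The abstract does not specify the exact position scheme or mask, but the two ideas can be illustrated with a hypothetical sketch: lanes share the prompt's positions and then count generation steps in lockstep (so tokens at the same step in different lanes have relative RoPE offset zero), and the attention mask lets each lane attend causally to itself and to strictly earlier steps of the other lanes. All function names, the position assignment, and the cross-lane visibility rule here are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    # Standard RoPE: rotate each pair of feature dims by an angle that
    # depends on the token's position and the pair's frequency.
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = np.outer(positions, inv_freq)           # (seq, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def lane_positions(prompt_len, lane_len, n_lanes):
    # Hypothetical lane-local scheme: every lane continues counting from
    # the shared prompt, so tokens at the same generation step in
    # different lanes receive identical RoPE positions.
    prompt = np.arange(prompt_len)
    lane = prompt_len + np.arange(lane_len)
    return prompt, np.tile(lane, (n_lanes, 1))    # (n_lanes, lane_len)

def collab_mask(prompt_len, lane_len, n_lanes):
    # Hypothetical inter-sequence mask over [prompt | lane0 | lane1 | ...]:
    # every token sees the prompt causally; each lane is causal over
    # itself; cross-lane attention sees only strictly earlier steps.
    step = np.concatenate([np.arange(prompt_len)]
                          + [prompt_len + np.arange(lane_len)] * n_lanes)
    lane = np.concatenate([np.full(prompt_len, -1)]
                          + [np.full(lane_len, i) for i in range(n_lanes)])
    same_lane = (lane[:, None] == lane[None, :]) | (lane[None, :] == -1)
    causal = step[None, :] <= step[:, None]       # own lane / prompt
    strict = step[None, :] < step[:, None]        # other lanes
    return np.where(same_lane, causal, strict)
```

Under this position scheme, a token attending to its counterpart in another lane sees relative offset zero, which is one plausible way to realize "relative positions between tokens, both within and outside a particular sequence" without adding trained parameters.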
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Gabriele_Cesa1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 75