Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: length generalization, OOD robustness
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: It has been observed in recent years that transformers struggle with length generalization on certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on instances of a task (e.g., addition) up to a certain length (e.g., 5-digit numbers) drops sharply on longer instances of the same problem. This work proposes an approach based on task hinting to address length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train it on a simpler but related auxiliary task as well.
We study the classical sorting problem as a canonical example for evaluating our approach. We design a multitask training framework and show that models trained via task hinting exhibit significantly better length generalization. In particular, for sorting we show that it is possible to train models on sequences of length at most $20$ and improve the test accuracy on sequences of length $100$ from less than $1$% (for standard training) to more than $92$% (via task hinting).
Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms by which the model performs the task, and propose a theoretical construction consistent with the model's observed learning behavior. Based on this construction, we show that introducing a small number of length-dependent parameters into the training procedure can further boost performance on unseen lengths. Finally, we also demonstrate the efficacy of our task-hinting approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
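
To make the multitask setup concrete, below is a minimal PyTorch sketch of training on a main sorting task together with an auxiliary hint task. The tiny encoder-only model, the task-indicator tokens, the choice of auxiliary task (predicting the sequence minimum), and the 0.5 loss weight are illustrative assumptions for this sketch, not the paper's actual architecture, tokenization, or auxiliary tasks.

# Minimal sketch of task-hinting multitask training (assumptions noted above).
import random
import torch
import torch.nn as nn

VOCAB = 64                     # assumed toy vocabulary: values 0..59 plus special tokens
TASK_SORT, TASK_MIN = 60, 61   # hypothetical task-indicator tokens
MAX_LEN = 128                  # longest sequence we embed positions for

class TinyTransformer(nn.Module):
    def __init__(self, d=128, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(MAX_LEN, d)
        layer = nn.TransformerEncoderLayer(d, heads, dim_feedforward=4 * d,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, x):                      # x: (batch, seq) of token ids
        p = torch.arange(x.size(1), device=x.device)
        h = self.body(self.tok(x) + self.pos(p))
        return self.head(h)                    # per-position logits over VOCAB

def make_batch(bs, n, task):
    """Prefix a task-indicator token; target at position i is the i-th output token."""
    x = torch.randint(0, 60, (bs, n))
    if task == TASK_SORT:                      # main task: output the sorted sequence
        y = torch.sort(x, dim=1).values
    else:                                      # auxiliary hint: output the minimum at every position
        y = x.min(dim=1, keepdim=True).values.expand(bs, n)
    prefix = torch.full((bs, 1), task)
    return torch.cat([prefix, x], dim=1), torch.cat([prefix, y], dim=1)

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    n = random.randint(2, 20)                  # training lengths capped at 20
    xs, ys = make_batch(32, n, TASK_SORT)
    xa, ya = make_batch(32, n, TASK_MIN)
    # joint objective: main-task loss plus a weighted auxiliary-task loss
    loss = loss_fn(model(xs).transpose(1, 2), ys) \
         + 0.5 * loss_fn(model(xa).transpose(1, 2), ya)
    opt.zero_grad(); loss.backward(); opt.step()

The design point the sketch illustrates is that both tasks share all model parameters and differ only in the task-indicator prefix token, so gradients from the simpler auxiliary task can shape the representations used by the main sorting task.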
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6194