Abstract: Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts with LLMs remains a significant challenge. We introduce EpMAN, a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweight the decoder's self-attention over the stored KV cache of the context during training and generation. When an LLM decoder is trained with EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is stronger and more robust across context lengths from 16k to 256k tokens than that of baseline decoders trained with self-attention and of popular retrieval-augmented generation frameworks. Our source code will be made available at https://github.com/IBM/epman.
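
As an illustrative sketch only (not the authors' implementation, which is in the linked repository), the reweighting step can be pictured as chunk-level relevance scores from episodic attention scaling the decoder's token-level attention over the cached context. The names `chunk_scores` and `chunk_ids` below are hypothetical placeholders for quantities the abstract only describes at a high level.

```python
import torch
import torch.nn.functional as F

def reweighted_attention(query, keys, values, chunk_ids, chunk_scores):
    """Sketch: scale self-attention over a stored KV cache by per-chunk relevance.

    query:        (d,)  current decoder query vector
    keys, values: (n, d) cached keys/values for the long context
    chunk_ids:    (n,)  long tensor mapping each cached position to its chunk
    chunk_scores: (c,)  hypothetical episodic-attention relevance per chunk
    """
    d = query.shape[-1]
    # Standard scaled dot-product attention logits over the cached context.
    logits = keys @ query / d ** 0.5          # (n,)
    attn = F.softmax(logits, dim=-1)          # (n,)

    # Reweight each position by the relevance of the chunk it belongs to,
    # then renormalize so the attention weights still sum to one.
    weights = attn * chunk_scores[chunk_ids]
    weights = weights / weights.sum()

    return weights @ values                   # (d,) attended output
```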