Taku Ito, Luca Cocchi, et al.
ICML 2025
While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism that guides LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention on fewer than 100B tokens, achieving strong performance on a suite of synthetic tasks. We suggest that these gains arise because the meta-tokens sharpen the positional encoding, operating as content-based landmarks that implicitly compress the preceding context and "cache" it at the meta-token. At inference time, the meta-token points to relevant context, facilitating length generalization. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient way to enhance long-context language modeling performance, while offering new insights into how such models achieve length generalization.
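To make the idea concrete, below is a minimal PyTorch sketch of how meta-tokens and a meta-attention mask *could* look; the injection interval, the `META_ID` vocabulary entry, and the mask construction are illustrative assumptions, not the paper's actual implementation.

```python
import torch

# Illustrative assumptions (not from the paper):
# - one META token id appended to GPT-2's vocabulary
# - a META token is injected after every `STRIDE` ordinary tokens
# - ordinary tokens attend causally to a local window plus all earlier
#   META positions, so META tokens act as compressed "landmarks"
META_ID = 50257
STRIDE = 4

def inject_meta_tokens(ids: torch.Tensor, stride: int = STRIDE) -> torch.Tensor:
    """Interleave a META token after every `stride` ordinary tokens."""
    chunks = ids.split(stride, dim=-1)
    meta = torch.full((ids.size(0), 1), META_ID, dtype=ids.dtype)
    return torch.cat([torch.cat([c, meta], dim=-1) for c in chunks], dim=-1)

def meta_attention_mask(ids: torch.Tensor) -> torch.Tensor:
    """Boolean mask (True = may attend): causal everywhere, restricted to a
    local window except that any position may also reach earlier META tokens."""
    T = ids.size(-1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    is_meta = (ids[0] == META_ID)  # assumes the same layout across the batch
    pos = torch.arange(T)
    local = (pos[:, None] - pos[None, :]).abs() <= STRIDE
    return causal & (local | is_meta[None, :])

if __name__ == "__main__":
    x = torch.randint(0, 50257, (1, 8))
    x_meta = inject_meta_tokens(x)
    mask = meta_attention_mask(x_meta)
    print(x_meta.shape, mask.shape)  # torch.Size([1, 10]) torch.Size([10, 10])
```

Under these assumptions, distant context is reachable only through the META positions, which is one way to read the paper's claim that meta-tokens compress and "cache" preceding context for later retrieval.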