Feb. 9, 2024, 5:44 a.m. | Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora

cs.LG updates on arXiv.org

Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models (e.g., pre-trained language models) internally during inference. In particular, we introduce …

