Feb. 8, 2024, 5:46 a.m. | Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, Qiang Zhang

cs.CL updates on arXiv.org

The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling. Despite its exceptional performance across various tasks, we have identified two limitations: First, it requires all attention scores to be non-zero and sum to 1, even if the current embedding has sufficient self-contained information. This compels the model to assign disproportionately excessive attention to specific tokens. Second, RPE-based Transformers are not universal approximators due to their limited capacity …
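As a quick illustration of the first limitation, the sketch below (not taken from the paper; all names, shapes, and values are illustrative assumptions) computes standard causal softmax attention and checks that each query's weights over the visible tokens are strictly positive and sum to 1, so a token can never allocate zero attention even when its own embedding already suffices.

```python
# A minimal NumPy sketch (not from the paper) of the first limitation: with
# standard causal softmax attention, every query's weights are forced to be
# positive and to sum to 1 over the visible tokens, even when the token's own
# embedding is already self-sufficient. All names and shapes are illustrative.
import numpy as np

def causal_softmax_attention(Q, K):
    """Return the (T, T) causal attention weight matrix for queries Q and keys K."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # raw attention scores
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # causal mask: hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))

W = causal_softmax_attention(Q, K)
print(W.sum(axis=-1))                     # each row sums to 1 by construction
print((W[np.tril_indices(T)] > 0).all())  # every visible position gets nonzero weight
```

Because the softmax leaves no way to assign zero weight, any "unneeded" probability mass must land somewhere, which is the behavior the abstract describes as disproportionate attention to specific tokens.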

Categories: cs.CL, cs.AI
