Feb. 6, 2024, 5:46 a.m. | Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

cs.LG updates on arXiv.org

The transformer architecture from Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding only a few thousand parameters for large-scale models in the 100B-parameter range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past …
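The abstract is cut off here, but the description points to a per-block weighted average over the current block's output and the earlier representations. Below is a minimal PyTorch-style sketch of what such an averaging step might look like; the class name, the per-depth scalar weights, and the identity-style initialization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Hypothetical sketch: after block `depth_index`, mix the current output
    with all earlier representations (embeddings and previous block outputs)
    using one learnable scalar weight per representation."""

    def __init__(self, depth_index: int):
        super().__init__()
        # depth_index + 2 weights: one for the embeddings, one per block 0..depth_index
        self.weights = nn.Parameter(torch.zeros(depth_index + 2))
        with torch.no_grad():
            self.weights[-1] = 1.0  # assumed init: start as identity (use only the current block)

    def forward(self, representations: list[torch.Tensor]) -> torch.Tensor:
        # representations: [embeddings, block_0_out, ..., current_block_out],
        # each of shape (batch, seq_len, d_model)
        stacked = torch.stack(representations, dim=0)   # (depth + 2, B, T, D)
        w = self.weights.view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)                 # weighted average fed to the next block
```

Under this reading, each block adds only a handful of scalars (one per earlier representation), so even for very deep models the total overhead stays in the low thousands of parameters, consistent with the abstract's claim that model size is essentially unchanged.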

