Feb. 27, 2024, 5:43 a.m. | Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Kamil Adamczewski

cs.LG updates on arXiv.org

arXiv:2310.04361v2 Announce Type: replace
Abstract: Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by transforming parts of the network into Mixture-of-Experts (MoE) layers. However, despite the crucial role of activation sparsity, its impact on this process remains unexplored. In this paper, we enhance the efficiency of MoE conversion through activation sparsity enforcement. Moreover, …
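As a rough illustration of the idea described in the abstract (not the paper's specific procedure), the sketch below assumes a PyTorch-style ReLU feed-forward block: an L1 term on hidden activations stands in for sparsity enforcement during finetuning, and the block's hidden neurons are then split into expert groups behind a small top-k router. All names here (`FFN`, `sparsity_penalty`, `MoEFromFFN`) are hypothetical.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Standard transformer feed-forward block (the layer to be converted)."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.act = nn.ReLU()  # ReLU-style activations are where sparsity shows up

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

def sparsity_penalty(ffn, x, coeff=1e-4):
    """Hypothetical sparsity-enforcement term: L1 on hidden activations,
    added to the task loss during finetuning to push more of them to zero."""
    h = ffn.act(ffn.fc1(x))
    return coeff * h.abs().mean()

class MoEFromFFN(nn.Module):
    """Split the hidden neurons of a trained FFN into `n_experts` groups and
    gate them with a small router, so only `top_k` groups run per token."""
    def __init__(self, ffn, n_experts=8, top_k=2):
        super().__init__()
        d_model, d_hidden = ffn.fc1.in_features, ffn.fc1.out_features
        assert d_hidden % n_experts == 0
        chunk = d_hidden // n_experts
        self.top_k = top_k
        self.experts = nn.ModuleList()
        for e in range(n_experts):
            sl = slice(e * chunk, (e + 1) * chunk)
            up = nn.Linear(d_model, chunk)
            down = nn.Linear(chunk, d_model, bias=False)
            # Copy a contiguous slice of the dense weights into this expert.
            up.weight.data.copy_(ffn.fc1.weight[sl])
            up.bias.data.copy_(ffn.fc1.bias[sl])
            down.weight.data.copy_(ffn.fc2.weight[:, sl])
            self.experts.append(nn.Sequential(up, nn.ReLU(), down))
        self.output_bias = nn.Parameter(ffn.fc2.bias.data.clone())
        # The router would be trained to predict which experts matter per token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.router(x)               # (tokens, n_experts)
        top = scores.topk(self.top_k, dim=-1).indices
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top == e).any(dim=-1)     # tokens routed to expert e
            if mask.any():
                out[mask] += expert(x[mask])
        return out + self.output_bias

ffn = FFN()
moe = MoEFromFFN(ffn, n_experts=8, top_k=8)   # all experts active: should match the dense FFN
x = torch.randn(4, 256)
print(torch.allclose(moe(x), ffn(x), atol=1e-5))
```

With all experts active, the converted layer reproduces the dense FFN exactly; the inference saving comes from routing each token to only a few experts, which is cheap in accuracy terms when the hidden activations are already highly sparse.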

