April 16, 2024, 4:42 a.m. | Guangyan Li, Yongqiang Tang, Wensheng Zhang

cs.LG updates on arXiv.org

arXiv:2404.09695v1 Announce Type: new
Abstract: Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources. How to reduce the parameter scale of LLMs has become research hotspots. In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. With this regard, we design a mixed compression model, which organically combines Low-Rank matrix approximation And …
