LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
April 16, 2024, 4:42 a.m. | Guangyan Li, Yongqiang Tang, Wensheng Zhang
cs.LG updates on arXiv.org
Abstract: Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. How to reduce the parameter scale of LLMs has become a research hotspot. In this study, we make an important observation: the multi-head self-attention (MHA) sub-layer of the Transformer exhibits noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In this regard, we design a mixed compression model, which organically combines Low-Rank matrix approximation And …
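The abstract is truncated here, but the core observation already motivates treating the two sub-layer types differently: a weight matrix whose singular values decay quickly can be replaced by a truncated SVD with little reconstruction error, while one with a flat spectrum cannot. Below is a minimal NumPy sketch of that idea, not the authors' implementation; the matrices mha_like and ffn_like are synthetic stand-ins chosen only to contrast a fast-decaying spectrum with a flat one.

import numpy as np

def low_rank_approx(W: np.ndarray, rank: int) -> np.ndarray:
    # Best rank-`rank` approximation of W in Frobenius norm
    # (Eckart-Young theorem), via truncated SVD.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
d = 512

# Synthetic stand-ins (illustrative only, not taken from the paper):
# an "MHA-like" matrix with fast-decaying singular values and an
# "FFN-like" matrix with a roughly flat spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
P, _ = np.linalg.qr(rng.standard_normal((d, d)))
decay = np.exp(-np.arange(d) / 25.0)   # fast spectral decay
mha_like = Q @ np.diag(decay) @ P.T
ffn_like = rng.standard_normal((d, d)) / np.sqrt(d)

for name, W in [("MHA-like", mha_like), ("FFN-like", ffn_like)]:
    W_r = low_rank_approx(W, rank=64)
    rel_err = np.linalg.norm(W - W_r) / np.linalg.norm(W)
    print(f"{name}: relative Frobenius error at rank 64 = {rel_err:.3f}")

Running this prints a small relative error for the fast-decaying matrix and a large one for the flat-spectrum matrix, which is why a low-rank factorization suits the former while a matrix like the latter calls for a different structured compression.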