LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
April 16, 2024, 4:42 a.m. | Guangyan Li, Yongqiang Tang, Wensheng Zhang
cs.LG updates on arXiv.org
Abstract: Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. Reducing the parameter scale of LLMs has therefore become a research hotspot. In this study, we make an important observation: the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In this regard, we design a mixed compression model that organically combines Low-Rank matrix approximation And …
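The low-rank observation behind the abstract can be illustrated with a minimal sketch: when a weight matrix W is (approximately) low-rank, a truncated SVD factors it into two thin matrices A and B whose combined parameter count is far smaller than W itself. This is a generic truncated-SVD demonstration under assumed shapes, not the paper's actual LoRAP procedure; the function name `low_rank_approx` and the synthetic matrix are illustrative.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Truncated-SVD factorization: W ~= A @ B, with A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# Simulate a nearly rank-8 projection matrix (rank-8 product plus small noise),
# mimicking the low-rank structure the abstract reports for MHA sub-layers.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64)) \
    + 0.01 * rng.standard_normal((64, 64))

A, B = low_rank_approx(W, rank=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"relative error at rank 8: {rel_err:.4f}")  # small, since W is nearly rank-8
print(f"params: full={W.size}, factored={A.size + B.size}")
```

A rank-8 factorization of this 64x64 matrix stores 2 * 64 * 8 = 1024 parameters instead of 4096, a 4x reduction with negligible reconstruction error. For a genuinely full-rank matrix (as the abstract suggests for FFN sub-layers), the same truncation would incur a much larger error, which is the motivation for treating the two sub-layer types differently.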