CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers
April 11, 2024, 4:46 a.m. | Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng
cs.CL updates on arXiv.org
Abstract: Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models is reliant on an exponentially increasing number of parameters. The overwhelming computational complexity incurs a high inference latency that negatively affects user experience. Existing methods to improve inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of …
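The abstract is truncated, but the title suggests the core idea: if the inputs of adjacent transformer layers are nearly identical, those layers are quasi-independent and can be evaluated concurrently on a shared input instead of strictly in sequence, reducing the cumulative depth-driven latency. The sketch below illustrates that general idea on a toy residual stack. It is an assumption-laden illustration, not the paper's implementation: ToyLayer, concurrent_group_forward, and group_size are invented for this example, and the exact grouping and correction scheme in CQIL may differ.

```python
# Hedged sketch of concurrent computation of quasi-independent layers.
# Assumes a pre-norm residual stack h_{l} = h_{l-1} + f_l(h_{l-1}).
import torch
import torch.nn as nn


class ToyLayer(nn.Module):
    """Stand-in residual block f_l; a real model would use attention + MLP."""

    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.ff(x)  # the residual connection is added by the caller


def sequential_forward(layers, h):
    # Baseline: each layer waits for the previous one, so latency
    # accumulates with the number of layers.
    for f in layers:
        h = h + f(h)
    return h


def concurrent_group_forward(layers, h, group_size=2):
    # Approximation: within a group, every layer sees the same input h, so
    # the members are independent given h and could be dispatched in
    # parallel (e.g. on separate devices). This sketch evaluates them
    # serially for clarity and just sums their residual outputs; the
    # approximation quality depends on how similar the true inputs of
    # adjacent layers actually are.
    for i in range(0, len(layers), group_size):
        group = layers[i : i + group_size]
        h = h + sum(f(h) for f in group)
    return h


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 64
    layers = nn.ModuleList(ToyLayer(dim) for _ in range(8))
    x = torch.randn(1, dim)
    exact = sequential_forward(layers, x)
    approx = concurrent_group_forward(layers, x, group_size=2)
    print("relative error:", ((exact - approx).norm() / exact.norm()).item())
```

Under these assumptions, grouping k layers cuts the sequential depth by roughly a factor of k when each group member runs on its own device, at the cost of an approximation error that grows with how far adjacent layers' true inputs diverge.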
Tags: arXiv, cs.CL, inference latency, optimization, large language models, natural language processing, parameters, performance