Less is More: Task-aware Layer-wise Distillation for Language Model Compression. (arXiv:2210.01351v2 [cs.CL] UPDATED)
Oct. 6, 2022, 1:13 a.m. | Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao
cs.LG updates on arXiv.org
Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult: since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the teacher's hidden representations contain redundant information that the student does not necessarily need for learning the target task. …
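For context, the sketch below shows the standard layer-wise distillation objective the abstract describes (hidden-state matching between mapped layers), not the paper's task-aware variant. The class name, the uniform layer mapping, and the per-layer linear projections for mismatched hidden sizes are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of plain layer-wise distillation: the student mimics the
# teacher's hidden states at matched intermediate layers. Assumes the student
# has fewer layers than the teacher; hypothetical names and layer mapping.
import torch
import torch.nn as nn

class LayerwiseDistillLoss(nn.Module):
    def __init__(self, student_hidden: int, teacher_hidden: int, num_student_layers: int):
        super().__init__()
        # One projection per student layer so hidden sizes may differ.
        self.proj = nn.ModuleList(
            nn.Linear(student_hidden, teacher_hidden) for _ in range(num_student_layers)
        )
        self.mse = nn.MSELoss()

    def forward(self, student_hiddens, teacher_hiddens):
        # student_hiddens: list of [batch, seq, student_hidden], one per student layer
        # teacher_hiddens: list of [batch, seq, teacher_hidden], one per teacher layer
        stride = len(teacher_hiddens) // len(student_hiddens)  # uniform layer mapping
        loss = 0.0
        for i, (proj, s_h) in enumerate(zip(self.proj, student_hiddens)):
            t_h = teacher_hiddens[(i + 1) * stride - 1]      # matched teacher layer
            loss = loss + self.mse(proj(s_h), t_h.detach())  # teacher stays frozen
        return loss / len(student_hiddens)
```

In practice this term is added to the task loss, e.g. `total = task_loss + alpha * distill_loss`; the paper's contribution is filtering out the teacher-side information the student does not need for the target task.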