Oct. 6, 2022, 1:13 a.m. | Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao

cs.LG updates on arXiv.org arxiv.org

Layer-wise distillation is a powerful tool to compress large models (i.e.,
teacher models) into small ones (i.e., student models). The student distills
knowledge from the teacher by mimicking the hidden representations of the
teacher at every intermediate layer. However, layer-wise distillation is
difficult: since the student has a smaller model capacity than the teacher, it
is often under-fitted. Furthermore, the hidden representations of the teacher
contain redundant information that the student does not necessarily need to
learn the target task. …
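As a rough illustration of the mechanism described above (not the authors' specific method), layer-wise distillation typically adds a loss term that matches each selected student hidden state to a teacher hidden state at an aligned layer. The sketch below is a minimal PyTorch example; the layer mapping, the optional projection for mismatched hidden sizes, and the weighting factor are all assumptions for illustration.

```python
import torch.nn.functional as F

def layerwise_distillation_loss(student_hiddens, teacher_hiddens,
                                layer_map, proj=None, alpha=1.0):
    """Match student hidden states to teacher hidden states at aligned layers.

    student_hiddens: list of [batch, seq, d_s] tensors, one per student layer
    teacher_hiddens: list of [batch, seq, d_t] tensors, one per teacher layer
    layer_map: dict {student_layer_idx: teacher_layer_idx}; the mapping
               strategy is an assumption and varies across methods
    proj: optional linear module mapping d_s -> d_t when hidden sizes differ
    alpha: weight of the distillation term (hypothetical hyperparameter)
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        h_s = student_hiddens[s_idx]
        h_t = teacher_hiddens[t_idx].detach()  # teacher is frozen
        if proj is not None:
            h_s = proj(h_s)  # align hidden dimensions before matching
        loss = loss + F.mse_loss(h_s, h_t)
    return alpha * loss / max(len(layer_map), 1)
```

In training, this term would typically be added to the student's task loss, e.g. `total_loss = task_loss + layerwise_distillation_loss(...)`.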

arxiv, compression, distillation, language, language model
