Oct. 19, 2022, 1:17 a.m. | Charith Peris, Lizhen Tan, Thomas Gueudre, Turan Gojayev, Pan Wei, Gokmen Oz

cs.CL updates on arXiv.org

Teacher-student knowledge distillation is a popular technique for compressing
today's prevailing large language models into manageable sizes that fit
low-latency downstream applications. Both the teacher and the choice of
transfer set used for distillation are crucial ingredients in creating a
high-quality student. Yet, the generic corpora used to pretrain the teacher and
the corpora associated with the downstream target domain are often
significantly different, which raises a natural question: should the student be
distilled over the generic corpora, so …
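As a rough illustration of the technique the abstract discusses (not the authors' specific setup), below is a minimal PyTorch sketch of a teacher-student distillation objective: the student is trained to match the teacher's temperature-softened output distribution over whichever transfer set is chosen, optionally blended with a hard-label cross-entropy term. The function name and the `temperature` and `alpha` hyperparameters are illustrative assumptions.

```python
# Minimal distillation-loss sketch for a classification-style NLU head.
# Assumes a larger teacher and a smaller student that both produce logits
# over the same label set for each transfer-set example.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft KL term (teacher guidance) with a hard cross-entropy term."""
    # Soften both distributions with a temperature before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the transfer-set labels, when available.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In this framing, the transfer-set question raised in the abstract is simply which corpus supplies the batches fed through both models: the teacher's generic pretraining data or text drawn from the downstream target domain.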

Tags: arxiv, distillation, impact, knowledge, nlu, transfer
