Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation
March 5, 2024, 2:52 p.m. | Heegon Jin, Seonil Son, Jemin Park, Youngseok Kim, Hyungjong Noh, Yeonsoo Lee
cs.CL updates on arXiv.org
Abstract: The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches for the Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention …
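To make the alignment idea concrete, here is a minimal PyTorch sketch of attention distillation with a trainable student-to-teacher head mapping, based only on the abstract above. The names `AttentionAlignment` and `alignment_kd_loss`, the head counts, the linear head-mixing map, and the KL loss over (assumed pre-softmax) attention scores are all illustrative assumptions, not the paper's actual A2D implementation.

```python
# Illustrative sketch only: shapes, names, and the loss choice are
# assumptions drawn from the abstract, not the paper's A2D code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAlignment(nn.Module):
    """Learns a mapping from student heads to teacher heads, replacing a
    heuristic layer/head matching with a trainable alignment (assumption)."""
    def __init__(self, n_student_heads: int, n_teacher_heads: int):
        super().__init__()
        # 1x1 projection over the head dimension: each teacher head's target
        # is a learned combination of student heads.
        self.head_map = nn.Linear(n_student_heads, n_teacher_heads, bias=False)

    def forward(self, student_attn: torch.Tensor) -> torch.Tensor:
        # student_attn: (batch, student_heads, tgt_len, src_len)
        x = student_attn.permute(0, 2, 3, 1)   # move heads to the last dim
        x = self.head_map(x)                   # mix student heads -> teacher heads
        return x.permute(0, 3, 1, 2)           # back to (batch, heads, tgt, src)

def alignment_kd_loss(student_attn: torch.Tensor,
                      teacher_attn: torch.Tensor,
                      align: AttentionAlignment) -> torch.Tensor:
    """KL divergence between aligned student attention and teacher attention.
    Both inputs are assumed to be pre-softmax attention scores."""
    aligned = align(student_attn)              # (batch, teacher_heads, tgt, src)
    log_p = F.log_softmax(aligned, dim=-1)     # student: log-probabilities
    q = F.softmax(teacher_attn, dim=-1)        # teacher: probabilities
    return F.kl_div(log_p, q, reduction="batchmean")

# Usage with toy shapes: a 4-head student distilled toward an 8-head teacher.
align = AttentionAlignment(n_student_heads=4, n_teacher_heads=8)
s = torch.randn(2, 4, 10, 10)   # student attention scores
t = torch.randn(2, 8, 10, 10)   # teacher attention scores
loss = alignment_kd_loss(s, t, align)
```

The point the abstract emphasizes is the design choice this sketch mirrors: the student-to-teacher head correspondence is learned jointly with the distillation loss, rather than fixed in advance by a layer-matching heuristic.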