Feb. 12, 2024, 5:43 a.m. | Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

cs.LG updates on arXiv.org

Training large foundation models with self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and the scarcity of labeled downstream data. We introduce Multimodal Attention Merging (MAM), an approach that facilitates direct knowledge transfer from the attention matrices of models rooted in high-resource modalities (text and images) to those in resource-constrained domains (speech and audio), using a zero-shot paradigm. …
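For intuition, the sketch below shows one plausible way such attention-level knowledge transfer could look in code: linearly interpolating the attention projection weights of a resource-constrained (speech) model toward those of a high-resource (text) donor. This is a minimal illustration under assumed details; the merge coefficient `alpha`, the use of `nn.MultiheadAttention`, and the interpolation rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of cross-modal attention merging (not the paper's exact method):
# blend a donor model's attention parameters into a recipient model in place.
import torch
import torch.nn as nn


def merge_attention_weights(recipient: nn.MultiheadAttention,
                            donor: nn.MultiheadAttention,
                            alpha: float = 0.5) -> None:
    """Interpolate donor attention parameters into the recipient.

    alpha = 0.0 leaves the recipient unchanged; alpha = 1.0 copies the donor.
    """
    with torch.no_grad():
        # in_proj_weight packs the Q, K, V projections; out_proj is the output head.
        recipient.in_proj_weight.mul_(1 - alpha).add_(alpha * donor.in_proj_weight)
        recipient.in_proj_bias.mul_(1 - alpha).add_(alpha * donor.in_proj_bias)
        recipient.out_proj.weight.mul_(1 - alpha).add_(alpha * donor.out_proj.weight)
        recipient.out_proj.bias.mul_(1 - alpha).add_(alpha * donor.out_proj.bias)


if __name__ == "__main__":
    embed_dim, num_heads = 64, 4
    speech_attn = nn.MultiheadAttention(embed_dim, num_heads)  # resource-constrained recipient
    text_attn = nn.MultiheadAttention(embed_dim, num_heads)    # high-resource donor

    merge_attention_weights(speech_attn, text_attn, alpha=0.3)

    # Zero-shot use: the merged attention is applied without further fine-tuning.
    x = torch.randn(10, 2, embed_dim)  # (seq_len, batch, embed_dim)
    out, _ = speech_attn(x, x, x)
    print(out.shape)  # torch.Size([10, 2, 64])
```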

Tags: attention, audio, classification, compute, cs.LG, cs.SD, data, eess.AS, event, fine-tuning, foundation, merging, multimodal, recognition, speech, speech recognition, standard, tasks, training
