Aug. 16, 2022, 1:13 a.m. | Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang

cs.CV updates on arXiv.org

We present a simple yet effective self-supervised framework for audio-visual
representation learning, to localize the sound source in videos. To understand
what enables learning useful representations, we systematically investigate the
effects of data augmentations and reveal that (1) the composition of data
augmentations plays a critical role, i.e. explicitly encouraging the
audio-visual representations to be invariant to various transformations
(transformation invariance); (2) enforcing geometric consistency substantially
improves the quality of the learned representations, i.e. the detected sound source
should follow the …
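
The abstract points to two ingredients: making the audio-visual embeddings invariant to data augmentations, and making the predicted localisation map geometrically consistent (equivariant) with spatial transformations of the frame. The sketch below illustrates how such objectives could be combined; the loss names, tensor shapes, and toy inputs are illustrative assumptions, not the authors' released code.

    # Illustrative sketch (assumptions, not the paper's implementation):
    # an invariance loss on audio-visual embeddings plus a geometric-consistency
    # (equivariance) loss on the predicted localisation map.
    import torch
    import torch.nn.functional as F

    def invariance_loss(z1, z2):
        """Encourage embeddings of two augmented views of the same clip to agree."""
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        return (1.0 - (z1 * z2).sum(dim=-1)).mean()

    def equivariance_loss(map_from_flipped, map_from_original):
        """The localisation map predicted from a flipped frame should match the
        flipped version of the map predicted from the original frame."""
        target = torch.flip(map_from_original, dims=[-1])
        return F.mse_loss(map_from_flipped, target)

    # Toy tensors standing in for encoder outputs: batch=4, embedding dim=128,
    # localisation map of size 14x14 (hypothetical shapes).
    z_view1, z_view2 = torch.randn(4, 128), torch.randn(4, 128)
    map_orig, map_flip = torch.rand(4, 1, 14, 14), torch.rand(4, 1, 14, 14)

    loss = invariance_loss(z_view1, z_view2) + equivariance_loss(map_flip, map_orig)
    print(loss.item())

The design choice here is that invariance is enforced on the global embedding (augmentations should not change what is heard or seen), while equivariance is enforced on the spatial map (a horizontal flip of the frame should flip the detected source location accordingly).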

arxiv cv sound transformation
