April 3, 2024, 4:42 a.m. | Mengfei Du, Binhao Wu, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuanjing Huang, Zhongyu Wei

cs.LG updates on arXiv.org

arXiv:2404.01994v1 Announce Type: cross
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate unseen environments by following natural language instructions. To complete the task, the agent must align and integrate various navigation modalities, including the instruction, observations, and navigation history. Existing works primarily concentrate on cross-modal attention at the fusion stage to achieve this objective. Nevertheless, the modality features generated by disparate uni-modal encoders reside in their own spaces, degrading the quality of cross-modal fusion and decision-making. To …
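For readers unfamiliar with the fusion step the abstract refers to, below is a minimal sketch of cross-modal attention fusion, in which observation and history tokens attend to instruction tokens. This is not the paper's code; it assumes PyTorch, and all module names, dimensions, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch of cross-modal attention fusion for VLN-style inputs.
# Not the paper's method: names, shapes, and dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses observation/history features with instruction features
    via cross-attention; dimensions here are illustrative."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, state_tokens, instr_tokens):
        # state_tokens: (B, T_state, d) -- observation + history features
        # instr_tokens: (B, T_instr, d) -- encoded instruction
        attended, _ = self.cross_attn(query=state_tokens,
                                      key=instr_tokens,
                                      value=instr_tokens)
        # Residual connection followed by layer norm.
        return self.norm(state_tokens + attended)

# Toy usage with random tensors standing in for uni-modal encoder outputs.
fusion = CrossModalFusion()
obs_hist = torch.randn(2, 10, 512)  # hypothetical observation+history tokens
instr = torch.randn(2, 20, 512)     # hypothetical instruction tokens
fused = fusion(obs_hist, instr)     # shape: (2, 10, 512)
```

Note that in this sketch the observation/history and instruction features come from separate encoders, so they live in different representation spaces at fusion time, which is exactly the mismatch the abstract identifies as harmful to fusion quality.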

