June 5, 2024, 4:49 a.m. | Haodong Hong, Sen Wang, Zi Huang, Qi Wu, Jiajun Liu

arXiv:2406.02208v1 Announce Type: new
Abstract: Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, being inherently abstract, the same textual instruction can be associated with different visual signals, causing severe ambiguity and limiting the transfer of prior knowledge in the vision domain from the user to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task augmenting traditional VLN by integrating both natural language and images in instructions. VLN-MP …
