Web: http://arxiv.org/abs/2209.04899

Sept. 23, 2022, 1:15 a.m. | Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

cs.CV updates on arXiv.org

In human environments, robots are expected to accomplish a variety of
manipulation tasks given simple natural language instructions. Yet, robotic
manipulation is extremely challenging as it requires fine-grained motor
control, long-term memory, and generalization to previously unseen tasks
and environments. To address these challenges, we propose a unified
transformer-based approach that takes into account multiple inputs. In
particular, our transformer architecture integrates (i) natural language
instructions and (ii) multi-view scene observations while (iii) keeping track
of the full …

