Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Wang, Xin; Huang, Qiuyuan; Çelikyılmaz, Aslı; Gao, Jianfeng; Shen, Dinghan; Wang, Yuan-Fang; Wang, William Yang; Zhang, Lei

doi:10.1109/cvpr.2019.00679

Article · Jun 1, 2019 · Closed access

University of California, Santa Barbara · Microsoft Research (United Kingdom) · +1 more institution

Indexed in Crossref

Abstract

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene.…
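The abstract says a matching critic supplies an intrinsic reward that encourages global matching between instructions and trajectories. The sketch below illustrates one way such a trajectory-level score could be mixed with per-step environment rewards. It assumes a weighted sum added to discounted returns, which the abstract does not state; the names `combined_returns`, `match_score`, and `intrinsic_weight` are hypothetical, and the numbers are illustrative only.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Discounted sum of future rewards at each step."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def combined_returns(
    extrinsic: Sequence[float],
    match_score: float,
    intrinsic_weight: float = 0.5,
    gamma: float = 0.95,
) -> List[float]:
    """Add a weighted trajectory-level matching score to every step's return.

    Assumption (not from the abstract): match_score is a value in [0, 1]
    produced by some critic that rates how well the whole trajectory
    matches the instruction.
    """
    if not 0.0 <= match_score <= 1.0:
        raise ValueError("match_score is assumed to lie in [0, 1]")
    bonus = intrinsic_weight * match_score
    return [g + bonus for g in discounted_returns(extrinsic, gamma)]


if __name__ == "__main__":
    # Hypothetical per-step environment rewards for a three-step trajectory.
    print(combined_returns([1.0, 0.0, 2.0], match_score=0.8, gamma=0.5))
```

Because the bonus is identical at every step, it favors whole trajectories that match the instruction without changing which step within a trajectory looks best.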

Citation impact

Total citations: 544
FWCI: 34.09
Percentile: 100%
References: 94

Authors: 8

Topics & keywords

Keywords

Computer science
Matching (statistics)
Reinforcement learning
Imitation
Modal
Artificial intelligence
Task (project management)
Benchmark (surveying)

UN Sustainable Development Goals

Quality Education
