RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan, Anthony; Brown, Noah; Carbajal, Justice; Chebotar, Yevgen; Chen, Xi; Choromański, Krzysztof; Ding, Tianli; Driess, Danny; Dubey, Avinava; Finn, Chelsea; Florence, Pete; Fu, Chuyuan; Arenas, Montse Gonzalez; Gopalakrishnan, Keerthana; Han, Kehang; Hausman, Karol; Herzog, Alexander; Hsu, Jasmine; Ichter, Brian; Irpan, Alex; Joshi, Nikhil; Julian, Ryan; Kalashnikov, Dmitry; Kuang, Yuheng; Leal, Isabel; Lee, Lisa; Lee, Tsang-Wei Edward; Levine, Sergey; Lu, Yao; Michalewski, Henryk; Mordatch, Igor; Pertsch, Karl; Rao, Kanishka; Reymann, Krista; Ryoo, Michael S.; Salazar, Grecia; Sanketi, Pannag; Sermanet, Pierre; Singh, Jaspiar; Singh, Anikait; Soricut, Radu; Tran, Huong; Vanhoucke, Vincent; Vuong, Quan; Wahid, Ayzaan; Welker, Stefan; Wohlhart, Paul; Wu, Jialin; Xia, Fei; Xiao, Ted; Xu, Peng; Xu, Sichun; Yu, Tianhe; Zitkovich, Brianna

doi:10.48550/arxiv.2307.15818

Preprint · arXiv (Cornell University) · Jul 28, 2023 · Green OA

Indexed in: arXiv, DataCite

Abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language…

Citation impact

Total citations: 266

FWCI: —
Percentile: —
References: 0

Authors: 54

Topics & keywords

Keywords

Computer science
Natural language
Artificial intelligence
Robot
Generalization
The Internet
Human–computer interaction
Language model
