LaNMP: A Dataset of Language, Navigation, Manipulation, and Perception

1Brown University, 2Rutgers University, 3University of Pennsylvania

Abstract

As robots that follow natural language commands become more capable and prevalent, we need a dataset to develop and evaluate their ability to solve long-horizon mobile manipulation tasks in diverse environments. To solve such tasks, robots must integrate their multimodal sensing, navigation, and manipulation capabilities. Existing datasets do not integrate all these aspects, restricting their efficacy as benchmarks. We present the Language, Navigation, Manipulation, Perception (LaNMP) dataset and show the benefits of integrating all robot capabilities and sensing modalities.

LaNMP comprises 573 natural language commands in seven simulated and real-world environments for long-horizon room-to-room pick-and-place tasks. Each command is paired with a trajectory comprising more than 20 data types, such as RGB-D images, segmentations, and the poses of the robot body, end-effector, and grasped object. LaNMP can be used to develop and benchmark a variety of algorithms. To demonstrate its applicability, we fine-tuned and tested three pretrained models in simulation and one on a physical robot. The suboptimal performance of all the models compared to humans across various metrics shows significant room for improvement in developing multimodal mobile manipulation models using our benchmark.
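To make the per-timestep record concrete, here is a minimal Python sketch of how a LaNMP-style trajectory might be structured. The field names and shapes are illustrative assumptions based on the data types listed above, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class LaNMPStep:
    """One timestep of a trajectory (hypothetical schema; field names
    are illustrative, not the dataset's actual keys)."""
    rgb: np.ndarray           # H x W x 3 color image
    depth: np.ndarray         # H x W depth map
    segmentation: np.ndarray  # H x W instance/semantic mask
    body_pose: np.ndarray     # robot base pose, e.g. (x, y, z, qx, qy, qz, qw)
    ee_pose: np.ndarray       # end-effector pose in the same convention
    object_pose: np.ndarray   # pose of the grasped object, if any


@dataclass
class LaNMPTrajectory:
    """A natural language command paired with its demonstration."""
    command: str              # e.g. a room-to-room pick-and-place instruction
    steps: list[LaNMPStep]    # time-ordered sensor and state records
```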

Models

We evaluated three existing models on our dataset.

Episodic Transformer (E.T.), a multimodal transformer that encodes the full episode history of language instructions, visual observations, and past actions to predict the next action.

ALFRED, the sequence-to-sequence baseline model introduced with the ALFRED benchmark.

Prompter, a modular method that uses language-model prompting to supply object-location priors for semantic search.
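As a rough illustration of how such models can be benchmarked on LaNMP-style episodes, the sketch below shows a generic evaluation loop that reports task success rate. The `model.act` method and the Gym-like environment interface are hypothetical stand-ins, not the APIs used in our experiments.

```python
def evaluate(model, episodes):
    """Roll out a language-conditioned policy on held-out episodes
    and return the fraction that end in task success.

    Assumptions: `model.act(obs, command) -> action` and a Gym-like
    `env.step(action) -> (obs, done, info)`; both are illustrative.
    """
    successes = 0
    for env, command in episodes:
        obs = env.reset()
        done, info = False, {}
        while not done:
            action = model.act(obs, command)    # policy conditioned on language
            obs, done, info = env.step(action)  # assumed step signature
        successes += int(info.get("task_success", False))
    return successes / max(len(episodes), 1)   # guard against empty input
```

A harness like this makes the model-versus-human comparison direct: the same episodes and success criterion can be applied to human teleoperators and fine-tuned models alike.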