Last-Mile Embodied Visual Navigation
Abstract
Realistic long-horizon tasks like image-goal navigation involve exploratory and exploitative phases. Given an image of the goal, an embodied agent must explore to discover the goal, i.e., search efficiently using learned priors. Once the goal is discovered, the agent must accurately calibrate the last-mile of navigation to the goal. As with any robust system, switches between exploratory goal discovery and exploitative last-mile navigation enable better recovery from errors. Following these intuitive guide rails, we propose SLING to improve the performance of existing image-goal navigation systems. Entirely complementing prior methods, we focus on last-mile navigation and leverage the underlying geometric structure of the problem with neural descriptors. With simple but effective switches, SLING can easily be connected with heuristic, reinforcement learning, and neural modular policies.
On a standardized image-goal navigation benchmark, we improve performance across policies, scenes, and episode complexity, raising the state-of-the-art from 45% to 55% success rate and 28.0% to 37.4% SPL. Beyond photorealistic simulation, we conduct real-robot experiments in three physical scenes and find these improvements to transfer well to real environments.
SLING
We propose Switchable Last-Mile Image-Goal Navigation (SLING) -- a simple yet effective geometric navigation module and associated switches. Our approach can be combined with any off-the-shelf learned policy that uses semantic priors to explore the scene. As soon as the object or view of interest is detected, SLING switches to the geometric navigation module.
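As a rough illustration, below is a minimal Python/OpenCV sketch of such a switch, assuming control passes to the geometric module once the current view matches the goal image strongly enough. ORB features stand in for the learned neural descriptors, and `MATCH_THRESHOLD`, `explore_policy`, and `last_mile_policy` are hypothetical placeholders rather than the released SLING implementation.

```python
import cv2

MATCH_THRESHOLD = 40  # assumed value; the actual switching criterion may differ


def count_matches(agent_rgb, goal_rgb):
    """Count keypoint matches between the agent's current view and the goal
    image (ORB here stands in for learned neural descriptors)."""
    orb = cv2.ORB_create(nfeatures=1000)
    gray_a = cv2.cvtColor(agent_rgb, cv2.COLOR_RGB2GRAY)
    gray_g = cv2.cvtColor(goal_rgb, cv2.COLOR_RGB2GRAY)
    _, des_a = orb.detectAndCompute(gray_a, None)
    _, des_g = orb.detectAndCompute(gray_g, None)
    if des_a is None or des_g is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return len(matcher.match(des_a, des_g))


def select_action(obs, goal_rgb, explore_policy, last_mile_policy):
    """Hand control to the geometric last-mile module once the goal view is
    detected; since the check runs every step, control can also switch back."""
    if count_matches(obs["rgb"], goal_rgb) >= MATCH_THRESHOLD:
        return last_mile_policy.act(obs, goal_rgb)   # exploit: last-mile navigation
    return explore_policy.act(obs, goal_rgb)         # explore: learned semantic priors
```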
Last-Mile Navigation: Neural keypoint feature descriptors are extracted and matched to obtain correspondences between the agent's view and the image-goal. The geometric problem of estimating the relative pose between the agent and the goal view is solved with efficient perspective-n-point (EPnP). The resulting pose estimate is fed into a local policy head that decides the agent's actions.
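For concreteness, here is a minimal sketch (Python with OpenCV and NumPy, not the released SLING code) of how matched keypoints can be turned into a relative pose and then into actions: agent-side matches are lifted to 3D using the depth image and camera intrinsics, EPnP with RANSAC estimates the goal camera's pose, and a toy local policy head maps that pose to turn/forward/stop commands. The function names, thresholds, and discrete action set are illustrative assumptions.

```python
import numpy as np
import cv2


def estimate_relative_pose(pts_agent, pts_goal, agent_depth, K):
    """Estimate the goal camera's pose relative to the agent from matched
    keypoints (pixel coordinates in the agent view and the goal image).
    Agent-side points are back-projected to 3D with the depth image, then
    EPnP + RANSAC solves for the pose. Returns (rvec, tvec) or None."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts3d, pts2d = [], []
    for (u_a, v_a), (u_g, v_g) in zip(pts_agent, pts_goal):
        z = float(agent_depth[int(v_a), int(u_a)])
        if z <= 0.0:
            continue  # skip matches with invalid depth
        pts3d.append([(u_a - cx) * z / fx, (v_a - cy) * z / fy, z])  # back-project
        pts2d.append([u_g, v_g])
    if len(pts3d) < 6:
        return None  # too few valid correspondences for a reliable pose
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), K, None, flags=cv2.SOLVEPNP_EPNP)
    return (rvec, tvec) if ok else None


def local_policy(rvec, tvec, stop_radius=1.0, turn_eps=0.15):
    """Toy local policy head: turn toward the estimated goal position,
    move forward, and stop once within `stop_radius` meters."""
    R, _ = cv2.Rodrigues(rvec)
    goal_in_agent = (-R.T @ tvec).ravel()        # goal camera center in the agent frame
    dx, dz = float(goal_in_agent[0]), float(goal_in_agent[2])  # x right, z forward
    distance = float(np.hypot(dx, dz))
    heading = float(np.arctan2(dx, dz))          # signed angle to the goal (radians)
    if distance < stop_radius:
        return "STOP"
    if heading > turn_eps:
        return "TURN_RIGHT"
    if heading < -turn_eps:
        return "TURN_LEFT"
    return "MOVE_FORWARD"
```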
Results
Last-Mile Navigation Only
Previous Method: In this real-world last-mile navigation example, the previous method predicted a waypoint (red square in the top-down map) that was not within 1 meter of the goal (blue square/circle in the top-down map, visualized in the top-left corner). Even worse, it caused the robot to hit the corn! Video sped up 3x.
Failure
SLING: However, with SLING, the agent predicted a valid waypoint and navigated to the goal. Video sped up 3x.
Success
Geometric Switches Improve Success Rate
Previous Method: The switching mechanism at (0:02) fails to notice that the agent is close to the goal, so the agent keeps exploring the environment. Later, once in the kitchen at (0:36), it wrongly believes it is near the goal, navigates to a final waypoint there, and fails. The goal is shown in the top-down map as the red circle, and the agent as the blue circle.
Failure
SLING: Our switching mechanism notices the goal at (0:02) and navigates to a point near the goal.
Success
SLING Improves Last-Mile Navigation Over Previous Methods
Previous Method: In the following video, even though the robot sees the goal directly in front of it, the previous method predicts an inaccurate final waypoint and fails.
Failure
SLING: However, SLING gives more accurate predictions of the relative pose to the goal, allowing it to succeed.
Success
BibTeX
@inproceedings{wasserman2022lastmile,
  title={Last-Mile Embodied Visual Navigation},
  author={Justin Wasserman and Karmesh Yadav and Girish Chowdhary and Abhinav Gupta and Unnat Jain},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=RgJwDQwW82y}
}