Open Scene Graphs for
Open-World Object-Goal Navigation

¹National University of Singapore, ²Cornell University
*Indicates Equal Contribution
IJRR Special Issue: Foundation Models and Neurosymbolic AI for Robotics

OSG Navigator searches in the open world for open-set objects. It composes foundation models with our novel Open Scene Graph to handle diverse object goals, environments and embodiments.

Abstract

How can we build general-purpose robot systems for open-world semantic navigation, e.g., searching a novel environment for a target object specified in natural language? To tackle this challenge, we introduce OSG Navigator, a modular system composed of foundation models, for open-world Object-Goal Navigation (ObjectNav). Foundation models provide enormous semantic knowledge about the world, but struggle to organise and maintain spatial information effectively at scale. Key to OSG Navigator is the Open Scene Graph representation, which acts as spatial memory for OSG Navigator. It organises spatial information hierarchically using OSG schemas, which are templates, each describing the common structure of a class of environments. OSG schemas can be automatically generated from simple semantic labels of a given environment, e.g., "home" or "supermarket". They enable OSG Navigator to adapt zero-shot to new environment types. We conducted experiments using both Fetch and Spot robots in simulation and in the real world, showing that OSG Navigator achieves state-of-the-art performance on ObjectNav benchmarks and generalises zero-shot over diverse goals, environments, and robot embodiments.

Open Scene Graphs



The Open Scene Graph (OSG) is a hierarchical representation of open-set objects and spatial regions. To represent open-set spatial regions across diverse environments, OSGs can be configured with OSG schemas. Schemas are templates that capture the common structure across scenes belonging to a particular environment type. They enable OSG Navigator to adapt zero-shot to new environment types, and may either be defined manually using domain knowledge or generated automatically with LLMs.
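To make the idea of a schema-configured hierarchy concrete, here is a minimal sketch. It is not the OSG implementation: the schema format (a mapping from each node type to the type allowed to contain it), the `HOME_SCHEMA` contents, and the `SceneGraph` and `Node` names are all assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical schema for a "home": each node type maps to the type
# that is allowed to contain it (None marks the top of the hierarchy).
HOME_SCHEMA = {"floor": None, "room": "floor", "object": "room"}


@dataclass
class Node:
    name: str
    kind: str
    parent: Node | None = None
    children: list[Node] = field(default_factory=list)


class SceneGraph:
    """Toy hierarchical scene graph whose structure is checked against a schema."""

    def __init__(self, schema: dict[str, str | None]):
        self.schema = schema
        self.nodes: dict[str, Node] = {}

    def add(self, name: str, kind: str, parent: str | None = None) -> Node:
        if kind not in self.schema:
            raise ValueError(f"unknown node type: {kind}")
        parent_node = self.nodes[parent] if parent is not None else None
        parent_kind = parent_node.kind if parent_node is not None else None
        # The schema decides which containment relations are legal.
        if parent_kind != self.schema[kind]:
            raise ValueError(f"{kind} cannot be placed under {parent_kind}")
        node = Node(name, kind, parent_node)
        self.nodes[name] = node
        if parent_node is not None:
            parent_node.children.append(node)
        return node

    def path_to(self, name: str) -> list[str]:
        """Return the chain of names from the top-level region down to `name`."""
        node = self.nodes[name]
        path = []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]


osg = SceneGraph(HOME_SCHEMA)
osg.add("ground floor", "floor")
osg.add("kitchen", "room", parent="ground floor")
osg.add("mug", "object", parent="kitchen")
print(" > ".join(osg.path_to("mug")))  # ground floor > kitchen > mug
```

Under this assumption, adapting to a new environment type amounts to swapping the schema (for example, one with aisles and sections for a supermarket) while the graph code stays unchanged.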

OSG Navigator




OSG Navigator is a system architecture composed of foundation models: Large Language Models (LLMs) for reasoning, Visual Foundation Models (VFMs) for perception, and General Navigation Models (GNMs) for locomotion. Foundation models enable a system that generalises zero-shot across diverse object goals, environments and embodiments. However, such models are insufficient on their own because of their limited capacity to organise and maintain spatial information. Open Scene Graphs provide a coherent spatial representation of the scene and serve as a unified scene memory for the foundation models.
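The division of roles described above can be sketched as a simple search loop. This is only an illustrative toy under stated assumptions: `perceive`, `propose_subgoal` and `move_to` are hypothetical stand-ins for the VFM, LLM and GNM components, `WORLD` is an invented environment, and a plain dictionary stands in for the Open Scene Graph memory.

```python
# Invented toy environment: each place lists visible objects and adjacent places.
WORLD = {
    "hallway": {"objects": ["shoe rack"], "neighbours": ["kitchen", "bedroom"]},
    "kitchen": {"objects": ["kettle", "mug"], "neighbours": ["hallway"]},
    "bedroom": {"objects": ["lamp"], "neighbours": ["hallway"]},
}


def perceive(place):
    """Stand-in for the perception role (VFM): objects visible at a place."""
    return WORLD[place]["objects"]


def propose_subgoal(goal, memory, place):
    """Stand-in for the reasoning role (LLM): pick an unexplored neighbour
    of any place already stored in memory, or None if nothing is left.
    This toy ignores `goal` and `place`; a real reasoner would use them."""
    for known in memory:
        for neighbour in WORLD[known]["neighbours"]:
            if neighbour not in memory:
                return neighbour
    return None


def move_to(target):
    """Stand-in for the locomotion role (GNM): here the robot simply arrives."""
    return target


def navigate(goal, start, max_steps=20):
    memory = {}  # stand-in for the scene memory: place -> set of object labels
    place = start
    for _ in range(max_steps):
        memory[place] = set(perceive(place))  # update memory from perception
        if goal in memory[place]:
            return place, memory
        subgoal = propose_subgoal(goal, memory, place)
        if subgoal is None:  # everything reachable has been explored
            return None, memory
        place = move_to(subgoal)
    return None, memory


found_at, memory = navigate("lamp", start="hallway")
print(found_at)
```

The point of the sketch is the data flow: perception writes into a shared memory, reasoning reads that memory to choose where to look next, and locomotion executes the choice.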

Poster (ICRA '24 VLMNM workshop)