Catch Me If You Hear Me

Abstract

Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input for detecting and finding the goal, they focus on clean and static sound sources and struggle to generalize to unheard sounds. In this work, we propose the novel dynamic audio-visual navigation benchmark, which requires the agent to catch a moving sound source in an environment with noisy and distracting sounds. We introduce a reinforcement learning approach that learns a robust navigation policy for these complex settings. To achieve this, we propose an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals. We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments, on two challenging 3D-scanned real-world environments, namely Matterport3D and Replica.

Dynamic Audio-Visual Navigation

Figure 1: The novel dynamic audio-visual navigation benchmark. The paths of the agent and the sound source are shown in blue and red respectively, with initial poses marked as squares. The green line represents the optimal behavior to catch the moving target.

We introduce the novel task of dynamic audio-visual navigation. In this task, the agent must navigate towards a moving sound-emitting source in an unmapped, complex 3D environment and issue the Stop action once it has caught the source. This captures common scenarios such as a robot navigating to a person issuing commands, or following pets or people around the house. We argue that this strongly increases the complexity of the task in two ways. First, previous observations no longer capture the current state of the environment, so the agent has to learn to update its memory accordingly. Second, optimal behavior now requires the agent not just to follow the sound intensity but to proactively reason about the movement of the target in order to catch it efficiently.
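The difference between following the sound and anticipating the target can be illustrated with a toy example. The sketch below is not the benchmark itself: it assumes an obstacle-free grid, an agent and target that both move one cell per step, and a hypothetical target trajectory `target_at` that is fully known to the "intercept" strategy (the learned policy is given no such knowledge). All function names are illustrative.

```python
from typing import Callable, Optional, Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def step_towards(pos: Cell, goal: Cell) -> Cell:
    """Move one cell towards goal along the axis with the larger gap (ties: x)."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    if dx == 0 and dy == 0:
        return pos
    if abs(dx) >= abs(dy):
        return (pos[0] + (1 if dx > 0 else -1), pos[1])
    return (pos[0], pos[1] + (1 if dy > 0 else -1))


def target_at(t: int) -> Cell:
    """Hypothetical trajectory: walk right along y = 5, then stop at x = 8."""
    return (min(t, 8), 5)


def steps_to_catch(start: Cell, aim: Callable[[int, Cell], Cell],
                   max_steps: int = 50) -> Optional[int]:
    """Return the first time step at which the agent shares a cell with the target."""
    pos = start
    for t in range(max_steps + 1):
        if pos == target_at(t):
            return t
        pos = step_towards(pos, aim(t, pos))
    return None


def follow(t: int, pos: Cell) -> Cell:
    """Reactive strategy: always steer at the target's current cell."""
    return target_at(t)


def make_intercept(start: Cell, horizon: int = 50) -> Callable[[int, Cell], Cell]:
    """Anticipating strategy: steer at the earliest cell the agent can reach in time."""
    meet = next(target_at(k) for k in range(horizon + 1)
                if manhattan(start, target_at(k)) <= k)
    return lambda t, pos: meet


start = (5, 0)
follow_steps = steps_to_catch(start, follow)
intercept_steps = steps_to_catch(start, make_intercept(start))
print(follow_steps, intercept_steps)
```

In this particular setup the reactive strategy needs twice as many steps as the anticipating one, which mirrors the gap between the agent path and the optimal path sketched in Figure 1.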

Complex Audio Scenarios

Figure 2: Illustration of a complex audio scenario. The agent needs to navigate towards the ringing phone while being confronted with a second sound source (a crying baby), and various distractor sounds such as a piano.

Audio-visual navigation approaches have shown that agents can successfully extract information from the audio signals. However, they have mostly focused on clean and distractor-free audio settings in which the only change to the audio signal comes from changes in the agent's position. Furthermore, they have struggled to generalize to unheard sounds.

Inspired by the challenges of real-world scenarios, we design complex audio scenarios in which the agent is confronted with strong perturbations and randomizations, such as a second sound-emitting source at the goal location, augmented audio signals, and distractor sounds. This provides the agent with a more realistic, highly diverse training experience in which it has to focus on the directional and spatial information inherent in the audio signal. We show that this greatly improves generalization to unheard sounds and noisy environments at test time.
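As a rough illustration of how such a scenario could be composed, the sketch below mixes a goal sound with an optional second sound at the goal, a scaled distractor, and Gaussian noise. It is a simplification under stated assumptions: signals are mono sample lists, shorter clips are looped, and the room acoustics that a simulator would apply per source position are ignored. The function name, parameters, and gain values are hypothetical and not taken from the original work.

```python
import random
from typing import List, Optional, Sequence


def mix_scenario(goal: Sequence[float],
                 second: Optional[Sequence[float]] = None,
                 distractor: Optional[Sequence[float]] = None,
                 distractor_gain: float = 0.5,
                 noise_std: float = 0.0,
                 seed: int = 0) -> List[float]:
    """Compose one perturbed audio signal with the same length as `goal`."""
    rng = random.Random(seed)  # seeded so that a scenario can be reproduced
    mixed = []
    for i, sample in enumerate(goal):
        value = sample
        if second is not None:
            # Second sound emitted at the goal location (looped if shorter).
            value += second[i % len(second)]
        if distractor is not None:
            # Distractor sound, scaled by an assumed gain.
            value += distractor_gain * distractor[i % len(distractor)]
        if noise_std > 0:
            # Additive Gaussian noise as a simple stand-in for augmentation.
            value += rng.gauss(0.0, noise_std)
        mixed.append(value)
    return mixed


clean = mix_scenario([1.0, -1.0])
complex_mix = mix_scenario([1.0, 2.0], second=[0.5],
                           distractor=[2.0, 4.0], distractor_gain=0.5)
noisy = mix_scenario([0.0] * 4, noise_std=0.1, seed=42)
```

Randomizing which of these perturbations are active, and with what strength, per episode is one way to obtain the diverse training experience described above.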

Spatial Audio-Visual Fusion

Figure 3: Our proposed architecture. The depth image is projected into an allocentric geometric map G_map. A novel channel consisting of a Spatial Audio Encoder and an Audio-Visual Encoder fuses the spatial information inherent in the geometric map and the audio signal. A GRU then combines this channel with separate depth and audio encodings. Finally, a PPO agent produces close-by waypoints that are executed by a Dijkstra planner.
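The last stage of this pipeline, reaching a nearby waypoint with a Dijkstra planner, can be sketched as follows. This is a minimal version under assumptions not stated in the source: the map is a 4-connected occupancy grid with unit step costs, where 1 marks an obstacle, and `dijkstra_path` is an illustrative name rather than the authors' implementation.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def dijkstra_path(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist: Dict[Cell, int] = {start: 0}
    parent: Dict[Cell, Cell] = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            # Walk the parent links back to the start to recover the path.
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue  # stale heap entry, a shorter route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = dijkstra_path(grid, (0, 0), (2, 0))
```

Delegating low-level motion to a planner like this lets the learned policy act at the level of waypoints instead of individual steps.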

Binaurally perceived spectrograms from the sound source contain a large amount of information about the space and room geometry, due to how the sound propagates through the rooms and reflects off walls and obstacles. Previous work has shown that this information can reveal room geometries. We hypothesize that learning to extract this information and to combine it with the spatial information from geometric maps is an appropriate architectural prior for audio navigation tasks. Furthermore, we hypothesize that a structure that succeeds in focusing on this part of the audio information is more likely to generalize to unheard sounds and to succeed in noisy and distracting audio environments. We therefore introduce an architecture that explicitly enables the agent to spatially fuse the geometric information inherent in obstacle maps and audio signals. We show that this leads to further gains in generalization to unheard sounds in both clean and complex audio scenarios.
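The core idea of fusing in the spatial domain, rather than concatenating flattened feature vectors, can be shown with a shape-level sketch. It assumes that an audio encoder has already produced a feature map with the same height and width as the geometric map features. The encoders themselves are learned networks and are not modeled here, and `fuse_spatially` is a hypothetical helper, not the architecture's actual code.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def fuse_spatially(map_feats: FeatureMap, audio_feats: FeatureMap) -> FeatureMap:
    """Stack map and audio channels so each grid cell sees both modalities."""
    height, width = len(map_feats[0]), len(map_feats[0][0])
    for channel in audio_feats:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("audio features must match the map's spatial size")
    # Channel-wise concatenation keeps the spatial layout intact.
    return [[row[:] for row in channel] for channel in map_feats + audio_feats]


g_map = [[[1.0, 0.0], [0.0, 1.0]]]          # 1 channel, 2 x 2
audio = [[[0.2, 0.4], [0.6, 0.8]],
         [[0.1, 0.1], [0.1, 0.1]]]          # 2 channels, 2 x 2
fused = fuse_spatially(g_map, audio)        # 3 channels, 2 x 2
```

Because the layout is preserved, a subsequent convolutional encoder can relate what is heard from a direction to what the map shows at the corresponding location.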