The Audio-Visual BatVision Dataset for Research on Sight and Sound

IEEE/RSJ IROS 2023

Amandine Brunetto1*
Sascha Hornauer1*
Stella X. Yu2
Fabien Moutarde1
1Center for Robotics - Mines Paris, PSL Research University
2University of Michigan


[Paper]
[ArXiv]
[Code/Dataset]
[Slides]
[Poster]




The BatVision dataset contains large-scale audio-visual data from a robot's perspective. To create it, a robot traversed corridors, offices, lecture halls and driveways on a historic campus and in a modern office building, emitting chirping sounds from a speaker like a bat. A binaural microphone recorded the returning echoes, which carry rich information about objects, materials and scene layout. With this paper we provide the echoes, camera images and depth maps, shown overlaid on typical scenes on the right, with echoes displayed under the images. This dataset will help investigate fundamental questions: how sound interacts with spaces, how it can be harnessed for robotic navigation, and what can be understood about a scene from how it sounds.
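For intuition on what an emitted chirp looks like, here is a minimal Python sketch that generates a short frequency-sweep signal. The sweep range, duration and sample rate below are illustrative placeholders, not the exact parameters used when recording the dataset.

import numpy as np
from scipy.signal import chirp

# Illustrative parameters only -- the actual sweep range, duration and
# sample rate used for BatVision are documented with the released dataset.
sample_rate = 44100                # Hz
duration = 0.003                   # seconds (a short chirp)
f_start, f_end = 20.0, 20000.0     # Hz, linear frequency sweep

t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)
signal = chirp(t, f0=f_start, t1=duration, f1=f_end, method="linear")

# The emitted chirp reflects off surfaces; the binaural microphone records
# the mixture of direct sound and echoes, from which scene depth can be learned.
print(signal.shape)  # (132,) samples for a 3 ms chirp at 44.1 kHz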


Abstract

Vision research has shown remarkable success in understanding our world, propelled by datasets of images and videos. Sensor data from radar, LiDAR and cameras has supported research in robotics and autonomous driving for at least a decade. However, while visual sensors may fail in some conditions, sound has recently shown potential to complement them. Simulated room impulse responses (RIR) in 3D apartment models have become a benchmark dataset for the community, fostering a range of audio-visual research. In simulation, depth can be predicted from sound by learning bat-like perception with a neural network. Concurrently, the same was achieved in reality using RGB-D images and echoes of chirping sounds. Biomimicking bat perception is an exciting new direction but needs dedicated datasets to explore its potential. Therefore, we collected the BatVision dataset to provide the community with large-scale echoes recorded in complex real-world scenes. We equipped a robot with a speaker to emit chirps and a binaural microphone to record their echoes. Synchronized RGB-D images from the same perspective provide visual labels of the traversed spaces. We sampled environments ranging from modern US office spaces to historic French university grounds, indoors and outdoors, with large architectural variety. This dataset will enable research on robot echolocation, general audio-visual tasks and sound phenomena unavailable in simulated data. We show promising results for audio-only depth prediction and demonstrate that state-of-the-art work developed for simulated data can also succeed on our dataset.



The BatVision Dataset



[Download Link]  



Presentation Video




Qualitative Results


Results of two different depth prediction methods on the BatVision dataset.
(a) State-of-the-art method (Parida et al., 2021), originally developed on simulated data.
(b) Baseline: a U-Net-based method using only sound as input. A minimal sketch of such an audio-only baseline is shown below.
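As a rough illustration of how an audio-only baseline like (b) can be structured, below is a minimal PyTorch sketch of a U-Net-style encoder/decoder that maps binaural echo spectrograms (2 channels) to a single-channel depth map. The spectrogram size, channel widths and number of layers are assumptions chosen for clarity, not the exact configuration used in the paper.

import torch
import torch.nn as nn

def block(in_ch, out_ch):
    # Conv -> BatchNorm -> ReLU, the basic building block of the sketch
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class AudioDepthUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = block(2, 32)     # binaural spectrogram: 2 input channels
        self.enc2 = block(32, 64)
        self.enc3 = block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = block(128 + 64, 64)   # skip connection from enc2
        self.dec1 = block(64 + 32, 32)    # skip connection from enc1
        self.head = nn.Conv2d(32, 1, kernel_size=1)  # single-channel depth map

    def forward(self, x):
        e1 = self.enc1(x)                    # (B, 32, H, W)
        e2 = self.enc2(self.pool(e1))        # (B, 64, H/2, W/2)
        e3 = self.enc3(self.pool(e2))        # (B, 128, H/4, W/4)
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))  # normalized depth in [0, 1]

# Example forward pass on a random binaural spectrogram of size 128x128.
model = AudioDepthUNet()
depth = model(torch.randn(1, 2, 128, 128))
print(depth.shape)  # torch.Size([1, 1, 128, 128])

For simplicity the predicted depth map here inherits the spectrogram's spatial size; in practice the audio features would be decoded to the resolution of the ground-truth depth maps before computing the loss.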



Paper

Amandine Brunetto*, Sascha Hornauer*, Stella X. Yu, Fabien Moutarde.
The Audio-Visual BatVision Dataset for Research on Sight and Sound
2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 1-8, doi: 10.1109/IROS55552.2023.10341715.
(Paper)


[Bibtex]


Acknowledgements

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under grant ANR22-CE94-0003 (project Omni-BatVision). The webpage template was originally made by Phillip Isola and Richard Zhang for a Colorization project.