NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields

Preprint

Amandine Brunetto
Sascha Hornauer
Fabien Moutarde
Centre for Robotics, Mines Paris - PSL Research University


[ArXiv]
[Code]




NeRAF learns radiance and acoustic fields from a set of images and audio recordings. It synthesizes binaural room impulse responses (RIRs) and RGB images at novel camera, microphone and source positions and orientations. NeRAF benefits from cross-modal training without requiring co-located audio-visual data and can render spatially separate modalities. NeRAF enables auralization and audio spatialization, along with enhanced image rendering, which are essential for creating a realistic perception of space.


Abstract

Sound plays a major role in human perception, providing essential scene information alongside vision for understanding our environment. Despite progress in neural implicit representations, learning acoustics that match a visual scene is still challenging. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF is designed as a Nerfstudio module for convenient access to realistic audio-visual generation. It synthesizes both novel views and spatialized audio at new positions, leveraging radiance field capabilities to condition the acoustic field with 3D scene information. At inference, each modality can be rendered independently and at spatially separated positions, providing greater versatility. We demonstrate the advantages of our method on the SoundSpaces dataset. NeRAF achieves substantial performance improvements over previous works while being more data-efficient. Furthermore, NeRAF enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning.



Model Overview



NeRF maps 3D coordinates and viewing directions to volume density and color. The grid sampler fills a 3D grid representing the scene by querying the radiance field at voxel-center coordinates from multiple viewing directions. NAcF learns to map source-microphone poses and orientations to STFT coefficients, and is conditioned on the extracted scene features. Predicted RIRs can be convolved with audio to obtain auralized and spatialized sound matching the scene.
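
To make the grid-sampling step more concrete, here is a minimal PyTorch sketch. The `radiance_field` callable, the grid resolution, the scene bounds, and the choice of axis-aligned viewing directions are illustrative assumptions, not the actual NeRAF implementation.

```python
import torch

def fill_scene_grid(radiance_field, resolution=64, bounds=(-1.0, 1.0)):
    """Fill a voxel grid with density/color queried from a radiance field.

    `radiance_field(positions, directions)` is a hypothetical callable that
    returns (density, rgb) for batched 3D points and unit viewing directions.
    """
    lo, hi = bounds
    # Voxel-center coordinates on a regular 3D grid.
    axis = torch.linspace(lo, hi, resolution)
    centers = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    centers = centers.reshape(-1, 3)  # (R^3, 3)

    # A small set of viewing directions (here: the six axis-aligned ones).
    dirs = torch.tensor([[1, 0, 0], [-1, 0, 0],
                         [0, 1, 0], [0, -1, 0],
                         [0, 0, 1], [0, 0, -1]], dtype=torch.float32)

    densities, colors = [], []
    for d in dirs:
        density, rgb = radiance_field(centers, d.expand_as(centers))
        densities.append(density)
        colors.append(rgb)

    # Average over viewing directions to get per-voxel scene features.
    density = torch.stack(densities).mean(dim=0).reshape(resolution, resolution, resolution, 1)
    color = torch.stack(colors).mean(dim=0).reshape(resolution, resolution, resolution, 3)
    return torch.cat([density, color], dim=-1)  # (R, R, R, 4) scene grid
```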



Demo Videos

We present examples of audio-visual and audio-only generation produced with NeRAF.
Each video includes a mini-map displaying the position of the sound source (blue cross) and the microphone and camera trajectory (green arrow). The audio is obtained by convolving the predicted RIRs with the source audio (a minimal sketch of this step follows the list below). To best experience the videos, please use headphones.
These examples illustrate the ability of NeRAF to render realistic audio-visual scenes. In particular, NeRAF allows:
  • Audio and image synthesis at novel camera and microphone positions.
  • Independent rendering of each modality.
  • Generation of auralized, spatialized (orientation-aware), and distance-aware audio.
  • Enhanced image synthesis on complex scenes with sparse data through cross-modal learning.
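
The auralization used for these demos amounts to a convolution of the dry source signal with the predicted binaural RIR. Below is a minimal sketch under simple assumptions (a mono source waveform and a two-channel RIR as NumPy arrays at the same sample rate); the function and variable names are illustrative, not part of the released code.

```python
import numpy as np
from scipy.signal import fftconvolve

def auralize(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a mono source signal with a binaural RIR.

    dry_audio: (T,) mono source waveform.
    rir: (2, L) predicted binaural room impulse response (left/right channels).
    Returns a (2, T + L - 1) spatialized stereo waveform.
    """
    out = np.stack([fftconvolve(dry_audio, rir[ch]) for ch in range(rir.shape[0])])
    # Normalize to avoid clipping when writing the result to a WAV file.
    return out / (np.max(np.abs(out)) + 1e-8)
```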


Audio-Visual Generation



Audio-Only Generation




Paper

Amandine Brunetto, Sascha Hornauer, Fabien Moutarde.
NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields.
(hosted on ArXiv)


[Bibtex]


Acknowledgements

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grant ANR22-CE94-0003. The webpage template was originally made by Phillip Isola and Richard Zhang for a Colorization project.