Few-shot Acoustic Synthesis with Multimodal Flow Matching

CVPR 2026
Amandine Brunetto
Mines Paris - PSL University
Method Overview

Few-shot flow-matching acoustic synthesis (FLAC) and scene-consistency evaluation: Given a few-shot context \( \tau \) comprising a depth map, an acoustic observation, and sensor poses, FLAC uses a DiT trained with flow matching to generate room impulse responses (RIRs) in novel rooms. FLAC models the distribution of plausible RIRs under sparse scene context, capturing acoustic uncertainty. Even with a single shot, FLAC outperforms eight-shot state-of-the-art methods. We also introduce AGREE, a CLIP-style audio-geometry embedding that aligns the two modalities in a shared latent space, enabling scene-consistency evaluation through retrieval and distributional metrics.
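The CLIP-style alignment behind AGREE can be illustrated with a symmetric contrastive (InfoNCE) loss over paired audio and geometry embeddings. This is a minimal sketch under the assumption of two generic encoders producing fixed-size vectors; the function name, temperature value, and batch layout are illustrative, not AGREE's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb, geom_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning audio and geometry embeddings.

    audio_emb, geom_emb: (B, D) outputs of two hypothetical encoders;
    row i of each tensor is assumed to come from the same scene.
    """
    a = F.normalize(audio_emb, dim=-1)          # unit-norm audio embeddings
    g = F.normalize(geom_emb, dim=-1)           # unit-norm geometry embeddings
    logits = a @ g.t() / temperature            # (B, B) cosine-similarity matrix
    labels = torch.arange(a.shape[0])           # matching pairs lie on the diagonal
    # Average the audio-to-geometry and geometry-to-audio retrieval losses.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```

With such a loss, retrieval-based evaluation follows directly: a generated RIR can be embedded and scored by how highly it ranks the true scene geometry among distractors.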

Abstract

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. In the one-shot setting, FLAC outperforms state-of-the-art eight-shot baselines on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic–geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
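The flow-matching objective mentioned in the abstract can be sketched as follows. This is a generic conditional flow-matching loss with a linear interpolation path, not the paper's actual training code; the velocity network, tensor shapes, and conditioning interface are illustrative assumptions.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Conditional flow-matching loss with a straight-line probability path.

    model: hypothetical velocity network v(x_t, t, cond) -> same shape as x_t.
    x1:    batch of target RIR representations, shape (B, D).
    cond:  scene context (poses, depth, acoustic observation); passed through.
    """
    b = x1.shape[0]
    t = torch.rand(b, 1)                 # sample a time uniformly in [0, 1]
    x0 = torch.randn_like(x1)            # Gaussian noise endpoint
    xt = (1 - t) * x0 + t * x1           # point on the linear path from x0 to x1
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = model(xt, t, cond)
    # Regress the predicted velocity onto the target velocity.
    return torch.mean((pred_v - target_v) ** 2)
```

At inference, samples would be drawn by integrating the learned velocity field from noise to data (e.g., with a few Euler steps), conditioned on the few-shot scene context.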

Results

BibTeX

@inproceedings{brunetto2026flac,
  title={Few-shot Acoustic Synthesis with Multimodal Flow Matching},
  author={Amandine Brunetto},
  booktitle={CVPR},
  year={2026},
}