AV-DAR

Abstract

An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.

Method Pipeline

Qualitative Results

More Comparisons

Example 1: Piano Chord^[1]

Dry Sound:

Layout:

	GT	Ours	AVR	INRAS++
RIR
Rerverb

Example 2: Stomping Drum^[2]

Dry Sound:

Layout:

	GT	Ours	AVR	INRAS++
RIR
Rerverb

All the models are trained on 1% of the original training data in the Real Acoustic Field dataset.
Sound sources: [1], [2].

@inproceedings{jin2025avdar, title = {Differentiable Room Acoustic Rendering with Multi-View Vision Priors}, author = {Jin, Derong and Gao, Ruohan}, booktitle = {International Conference on Computer Vision (ICCV)}, year = {2025}, }

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Binaural audio tour rendered by AV-DAR across 6 real scenes. Trained on 1% (RAF) / 12 (HAA) IRs. Headphones are strongly recommended.

Sound Source: DayNightMorning

Abstract

Method Pipeline

Interactive Demo

Qualitative Results

More Comparisons

Example 1: Piano Chord^[1]

Example 2: Stomping Drum^[2]

All the models are trained on 1% of the original training data in the Real Acoustic Field dataset.
Sound sources: [1], [2].

BibTeX

Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Binaural audio tour rendered by AV-DAR across 6 real scenes. Trained on 1% (RAF) / 12 (HAA) IRs. Headphones are strongly recommended.

Sound Source: DayNightMorning

Abstract

Method Pipeline

Interactive Demo

Qualitative Results

More Comparisons

Example 1: Piano Chord[1]

Example 2: Stomping Drum[2]

All the models are trained on 1% of the original training data in the Real Acoustic Field dataset. Sound sources: [1], [2].

BibTeX

Example 1: Piano Chord^[1]

Example 2: Stomping Drum^[2]

All the models are trained on 1% of the original training data in the Real Acoustic Field dataset.
Sound sources: [1], [2].