SonoWorld: From One Image to a 3D Audio-Visual Scene
The first generative framework for creating explorable 3D audio-visual scenes from a single image, with spatial audio grounded in scene semantics and 3D geometry.
I am a second-year Master’s student in Computer Science at the University of Maryland, College Park, where I’m advised by Prof. Ruohan Gao. My research focuses on multimodal computer vision and differentiable rendering.
I earned my B.Eng. in Computer Science and Engineering from The Chinese University of Hong Kong, Shenzhen in 2024, working under the supervision of Prof. Xiaoguang Han.
The first physics-based, differentiable audio-visual acoustic renderer, which uses multi-view images and beam tracing to reconstruct room impulse responses (RIRs) efficiently and accurately from sparse real-world measurements.
A unified benchmark for human NeRF models, providing standardized datasets, metrics, and a generalizable baseline.
A 3D generative framework that produces realistic and diverse digital humans with both texture and geometry by combining 2D and 3D priors with minimal 3D supervision.