Urban Scene Diffusion through Semantic Occupancy Map

Junge Zhang 1,4, Qihang Zhang 2, Li Zhang 1, Ramana Rao Kompella 3, Gaowen Liu 3, Bolei Zhou 4
1 Fudan University, 2 The Chinese University of Hong Kong, 3 Cisco, 4 University of California, Los Angeles

Overview

Fig. 1: Diverse individual scenes generated by UrbanDiffusion.

This work presents UrbanDiffusion, a 3D diffusion model that generates large-scale urban scenes conditioned on Bird's-Eye View (BEV) maps. Beyond visual appearance, the model captures both the geometry and the semantics of urban structures and objects. It learns the distribution of scene-level structures in a latent space, which enables the generation of diverse urban scenes at arbitrary scales. Trained on a real-world driving dataset, the model generates scenes from both held-out BEV maps and maps synthesized by a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator.

Method

Fig. 2: Framework of UrbanDiffusion.

To train the diffusion model in a fast and memory-efficient way, we follow the latent diffusion paradigm and learn the latent distribution of the 3D data. We first embed the 3D semantic data into a lower-dimensional latent space and then run the diffusion process in this space with classifier-free guidance. Given a BEV layout, the trained model generates diverse and realistic samples containing scene geometry and semantics through the sampling process:

$$ \hat{\epsilon}_\theta(z_t \mid c_{bev}) = \epsilon_\theta(z_t \mid \phi) + w \cdot \big( \epsilon_\theta(z_t \mid c_{bev}) - \epsilon_\theta(z_t \mid \phi) \big). $$
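As a concrete illustration, below is a minimal sketch of this classifier-free guidance step in PyTorch. The interface `denoiser(z_t, t, cond)` and the names `bev_cond` / `null_cond` are hypothetical placeholders for illustration, not the released UrbanDiffusion API; the sketch only shows how the guided noise estimate above would be assembled at each denoising step.

```python
import torch

@torch.no_grad()
def guided_noise_estimate(denoiser, z_t, t, bev_cond, null_cond, w):
    """One classifier-free guidance step: blend the unconditional and
    BEV-conditioned noise predictions with guidance weight w.
    `denoiser(z_t, t, cond)` is an assumed interface for illustration."""
    eps_uncond = denoiser(z_t, t, null_cond)  # epsilon_theta(z_t | phi)
    eps_cond = denoiser(z_t, t, bev_cond)     # epsilon_theta(z_t | c_bev)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two predictions are often computed in a single batched forward pass by stacking the null and BEV conditions. Note that w = 1 recovers the purely conditional prediction, while larger w strengthens adherence to the BEV layout.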

Scene Generation

Conditioning on a single-frame BEV map
We show generated samples conditioned on BEV maps from different datasets, including the nuScenes validation set, the Waymo Motion Dataset, the nuPlan dataset, and MetaDrive procedurally generated maps. A sketch of one plausible conditioning encoding follows.
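Assuming each BEV map is stored as a 2D grid of integer semantic class IDs (the actual preprocessing pipeline may differ), a minimal way to encode it as the condition c_bev is one-hot expansion into per-class channels:

```python
import torch
import torch.nn.functional as F

def bev_to_condition(bev_labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Turn an (H, W) grid of semantic class IDs into a (num_classes, H, W)
    one-hot tensor usable as c_bev (assumed layout, not the official one)."""
    one_hot = F.one_hot(bev_labels.long(), num_classes)  # (H, W, C)
    return one_hot.permute(2, 0, 1).float()              # (C, H, W)
```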

nuScenes Validation set

nuPlan Dataset

MetaDrive Procedural Generation Map

Large-scale Scene Generation


Scene Synthesis

Synthesis on scenes from different datasets
We synthesize images for scenes generated from diverse BEV maps:
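Since the generated scenes are semantic occupancy grids rather than images, a pretrained image generator needs a 2D conditioning signal. One plausible bridge, sketched below purely as an assumption (the paper's actual rendering pipeline may differ), is to project the occupied voxels into a chosen camera view with a z-buffer, yielding a per-pixel semantic map that a semantic-to-image generator could translate into RGB.

```python
import numpy as np

def project_occupancy_to_semantic_image(occ, voxel_size, K, T_cam, H, W):
    """Project occupied voxels occ (N, 4): x, y, z, class_id into a camera
    view, producing an (H, W) semantic map via a simple z-buffer. All shapes
    and conventions here are assumptions for illustration: K is a 3x3
    intrinsic matrix, T_cam a 4x4 world-to-camera transform."""
    sem = np.zeros((H, W), dtype=np.int32)            # 0 = empty / unknown
    depth = np.full((H, W), np.inf)
    pts = occ[:, :3] * voxel_size                     # voxel indices -> metric
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    cam = (T_cam @ pts_h.T).T[:, :3]                  # world -> camera frame
    valid = cam[:, 2] > 0.1                           # keep points in front
    cam, cls = cam[valid], occ[valid, 3].astype(np.int32)
    uvw = (K @ cam.T).T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside][:, 2], cls[inside]):
        if zi < depth[vi, ui]:                        # nearest voxel wins
            depth[vi, ui], sem[vi, ui] = zi, ci
    return sem
```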


Reference

@article{urbandiff,
    title={Urban Scene Diffusion through Semantic Occupancy Map},
    author={Junge Zhang and Qihang Zhang and Li Zhang and Ramana Rao Kompella and Gaowen Liu and Bolei Zhou},
    journal={arXiv preprint arXiv:2403.11697},
    year={2024}
}


Acknowledgement

This work was supported by the Cisco Faculty Award.