BEVGen: Street-View Image Generation from a Bird's-Eye View Layout

Alexander Swerdlow, Runsheng Xu, Bolei Zhou

University of California, Los Angeles


Overview

Fig. 1: BEVGen framework. A BEV layout and source multi-view images are encoded to a discrete representation and flattened before being passed to the autoregressive transformer. Spatial embeddings are added to both camera and BEV tokens inside each transformer block, and the learned pairwise camera bias is added to the attention weights. A weighted cross-entropy (CE) loss is applied during training; during inference, the tokens are passed to the decoder to obtain the generated images.

In this work, we tackle the new task of generating street-view images from a BEV layout. To address the underlying challenges, we develop an autoregressive neural model called BEVGen that generates a set of realistic and spatially consistent images. BEVGen has two technical novelties: (i) it incorporates spatial embeddings using camera intrinsics and extrinsics to allow the model to attend to relevant portions of the images and HD map, and (ii) it contains a novel attention bias and decoding scheme that maintains both image consistency and correspondence.
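To make the idea of a pairwise camera bias concrete, the following is a minimal NumPy sketch, not the BEVGen implementation. It assumes single-head scaled dot-product attention, assumes the bias is a (cameras x cameras) table added to the attention scores before the softmax, and omits the autoregressive (causal) mask and the BEV tokens for brevity. The names `biased_attention`, `cam_ids`, and `cam_bias` are hypothetical, and the bias values here are hand-set rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, cam_ids, cam_bias):
    """Single-head attention with a pairwise camera bias (illustrative only).

    q, k, v:   (T, d) token queries, keys, and values
    cam_ids:   (T,) integer camera index of each token
    cam_bias:  (C, C) bias table; entry [a, b] is added to the score of a
               query token from camera a attending to a key token from camera b
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # (T, T)
    # Look up one bias value per (query camera, key camera) pair.
    scores = scores + cam_bias[cam_ids[:, None], cam_ids[None, :]]
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy setup: 3 cameras with 4 tokens each (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
num_cams, tokens_per_cam, dim = 3, 4, 8
num_tokens = num_cams * tokens_per_cam
cam_ids = np.repeat(np.arange(num_cams), tokens_per_cam)
q, k, v = (rng.standard_normal((num_tokens, dim)) for _ in range(3))

# Hand-set bias standing in for a learned one: a large negative value
# effectively stops camera 0 tokens from attending to camera 2 tokens.
cam_bias = np.zeros((num_cams, num_cams))
cam_bias[0, 2] = -1e9

out, weights = biased_attention(q, k, v, cam_ids, cam_bias)
```

In this sketch the bias depends only on which pair of cameras the two tokens belong to, so a trained table could, for example, favor attention between cameras with overlapping fields of view.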

Fig. 2: Synthesized multi-view images from BEVGen on nuScenes. Image contents are diverse and realistic. The two instances in the bottom row use the same BEV layout for synthesizing the same location in day and night.

Demo Video

To further demonstrate the spatial disentanglement of the model, we compare the generated images to the source images over multiple frames from the same scene. We use the original BEV layouts from the validation set and create a video by appending each generation. We observe that cars and road markings generally stay consistent between frames, with the layout visually matching the source images. Note that our model does not enforce temporal consistency, so the generated images are not expected to show the same vehicles and background scenery in two adjacent frames. We leave incorporating temporal consistency for future work.

Reference

@article{swerdlow2024streetview,
    title={Street-View Image Generation from a Bird's-Eye View Layout}, 
    author={Alexander Swerdlow and Runsheng Xu and Bolei Zhou},
    year={2024},
    journal={IEEE Robotics and Automation Letters},
    volume={Preprint},
}

Acknowledgement

This work was supported by the National Science Foundation under Grant No. 2235012.