Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation

Ziyang Xie 1,2 , Zhizheng Liu 1 , Zhenghao Peng 1 , Wayne Wu 1 , Bolei Zhou 1
1 University of California, Los Angeles , 2 University of Illinois Urbana-Champaign

TL;DR

:selfie: Vid2Sim converts real-world monocular videos into realistic, interactive digital-twin simulation environments for urban navigation training.

:robot: Vid2Sim enables reinforcement learning agents to learn navigation policies in diverse, realistic simulation scenarios, and the learned policies can be directly deployed on real-world robots with a minimal sim-to-real gap.

Vid2Sim Architecture

Image

The Vid2Sim framework consists of three key stages: (1) geometry-consistent reconstruction for high-quality real-to-sim environment creation; (2) construction of a realistic, interactive simulation with a hybrid scene representation and diverse obstacle and scene augmentations for urban navigation training; and (3) zero-shot sim-to-real deployment in the real world to verify the pipeline's effectiveness.
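The three stages above compose sequentially. The sketch below is purely illustrative: the function and field names are hypothetical stand-ins, not the actual Vid2Sim API, and each stage body is reduced to a toy placeholder.

```python
# Illustrative sketch of the three-stage real-to-sim workflow.
# All names here are hypothetical; they mirror the stage descriptions
# in the text, not the actual Vid2Sim codebase.

def reconstruct(video_frames):
    """Stage 1: geometry-consistent reconstruction from a monocular video."""
    return {"geometry": f"mesh_from_{len(video_frames)}_frames",
            "appearance": "photorealistic_representation"}

def build_simulation(scene, obstacles):
    """Stage 2: hybrid scene representation plus obstacle/scene augmentation."""
    env = dict(scene)
    env["obstacles"] = list(obstacles)  # static obstacles and dynamic agents
    return env

def train_policy(env, steps):
    """Stage 3 (sim side): RL navigation training; deployment to the
    real robot afterwards is zero-shot, with no fine-tuning."""
    return {"policy": "nav_policy", "trained_steps": steps}

scene = reconstruct(["frame0.png", "frame1.png"])
env = build_simulation(scene, obstacles=["parked_bike", "pedestrian"])
policy = train_policy(env, steps=10)
```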

Interactive Scene Composition

Image

Vid2Sim facilitates augmented real-to-sim environment creation through interactive scene composition, including varied static obstacles and other dynamic agents. This approach enables the generation of diverse, controllable, and safety-critical corner cases for safe urban navigation training.

Real2Sim Navigation Training

We train and test our agents in diverse, realistic real2sim environments augmented with static obstacles and dynamic agents. (Bottom right shows the agent's view.)

Scene 1
Scene 2


Zero-shot Sim2Real Deployment

After training in real2sim environments, we deploy our agents to the real world in a zero-shot manner. This demonstrates the effectiveness of our Vid2Sim pipeline in bridging the sim-to-real gap.

  • Note: The navigation agent takes only RGB images as visual input and is deployed in the real world without any fine-tuning.
Static & Dynamic Obstacles Avoidance
Sudden Pedestrian Cut-in

Real2Sim Digital-Twin

We show that the Vid2Sim pipeline can generate realistic digital-twin environments for real-world scenes. The resulting digital twin is both controllable and interactive, and can be used for training as well as evaluating navigation policies.

Image

Diverse Environment Augmentation

Vid2Sim further supports controllable scene editing and advanced weather simulation through 3D scene editing and particle-system simulation. This enables more robust and generalizable policy training under different lighting and weather conditions.

Scene Style Augmentation

Weather Simulation

Rain Simulation
Fog Simulation

Vid2Sim Dataset

We curate a dataset of 30 diverse real-to-sim (real2sim) environments from web-sourced videos for urban navigation training. We further evaluate how agent generalizability improves as the number of training environments increases.

Environments Gallery


Generalizability Results

Image

This table compares (a) the success rate (SR) and (b) the success rate weighted by path length (SPL) across varying numbers of training environments. Increasing the number of training environments leads to higher test SR and SPL, indicating improved agent generalizability.
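For reference, SPL is the standard embodied-navigation metric from Anderson et al. (2018): each episode's binary success is weighted by the ratio of the shortest-path length to the length the agent actually traveled. A minimal computation (not taken from the Vid2Sim codebase) looks like this:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
        SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    where S_i is binary success, l_i the shortest-path length,
    and p_i the length of the path the agent actually took.
    """
    n = len(successes)
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * (l / max(p, l))  # detours shrink the per-episode credit
    return total / n

# Example: one direct success, one success with a 2x detour, one failure.
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [10.0, 16.0, 5.0]))  # → 0.5
```

Because SPL discounts inefficient paths, it can only improve alongside SR when agents also navigate more directly, which is why the table reports both.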


Reference

@article{xie2024vid2sim,
  title={Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation},
  author={Ziyang Xie and Zhizheng Liu and Zhenghao Peng and Wayne Wu and Bolei Zhou},
  journal={Preprint},
  year={2024}
}

Acknowledgement

We thank COCO Robotics for donating mobile hardware for our real-world experiment.