MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces

Wayne Wu, Honglin He, Yiran Wang, Chenda Duan, Jack He, Zhizheng Liu, Quanyi Li, Bolei Zhou
University of California, Los Angeles

TL;DR

    MetaUrban is a compositional simulation platform for Embodied AI research in urban spaces. It will be publicly available to enable more research opportunities for the community, and foster generalizable and safe embodied AI in urban spaces.

Introducing MetaUrban


Abstract

Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while diverse robot dogs and humanoids have recently emerged on the streets. Ensuring the generalizability and safety of these forthcoming mobile machines is crucial as they navigate the bustling streets of urban spaces. In this work, we present MetaUrban, a compositional simulation platform for Embodied AI research in urban spaces. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for embodied AI research and establish various baselines of Reinforcement Learning and Imitation Learning. Experiments demonstrate that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide more research opportunities and foster safe and trustworthy embodied AI in urban spaces.

Procedural Generation Pipeline

MetaUrban can automatically generate complex urban scenes thanks to its compositional nature. It uses a structured description script to create urban scenes: based on the provided information about street blocks, sidewalks, objects, agents, and more, it starts from the street block map, then plans the ground layout by dividing it into different function zones, then places static objects, and finally populates dynamic agents. In the figure, the first column shows the structured description script. From the second to the fourth column, the top row shows the 2D road maps, and the bottom row shows the bird's-eye view of the 3D scenes in the simulator.
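To make the pipeline concrete, below is a minimal, self-contained Python sketch of the four stages (street block map, ground layout, static objects, dynamic agents). It is not the official MetaUrban API: the names SceneDescription and generate_scene, and the specific function-zone names, are illustrative assumptions only.

# A minimal sketch of the compositional generation pipeline described above.
# All names here (SceneDescription, generate_scene, the zone names) are
# illustrative placeholders, not the official MetaUrban API.

import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneDescription:
    block_type: str = "intersection"      # street block map to start from
    sidewalk_width: float = 3.0           # meters
    function_zones: List[str] = field(
        default_factory=lambda: ["frontage", "clear", "furnishing", "buffer"])
    num_static_objects: int = 50          # benches, hydrants, trash bins, ...
    num_pedestrians: int = 20
    num_mobile_agents: int = 5            # delivery bots, wheelchairs, robot dogs, ...
    seed: int = 0

def generate_scene(desc: SceneDescription) -> Dict:
    """Pipeline order: block map -> ground layout -> static objects -> dynamic agents."""
    rng = random.Random(desc.seed)
    # 1. Start from the street block map (here just recorded by type).
    scene = {"block_map": desc.block_type}
    # 2. Plan the ground layout by dividing the sidewalk into function zones.
    zone_width = desc.sidewalk_width / len(desc.function_zones)
    scene["layout"] = {zone: zone_width for zone in desc.function_zones}
    # 3. Place static objects at sampled positions on the sidewalk.
    scene["objects"] = [(rng.uniform(0.0, 100.0), rng.uniform(0.0, desc.sidewalk_width))
                        for _ in range(desc.num_static_objects)]
    # 4. Populate dynamic agents: pedestrians plus other mobile machines.
    scene["agents"] = ([{"type": "pedestrian"} for _ in range(desc.num_pedestrians)]
                       + [{"type": "mobile_agent"} for _ in range(desc.num_mobile_agents)])
    return scene

demo_scene = generate_scene(SceneDescription(block_type="intersection", seed=42))
print(len(demo_scene["objects"]), "static objects,", len(demo_scene["agents"]), "dynamic agents")

Each stage only consumes the output of the previous one, which is what lets the same description script be re-sampled into an effectively unlimited number of scene variants.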

Urban Scene Gallery

Parade of Dynamic Agents

Sensors

Benchmarks

We design two common tasks in urban scenes as the pilot study: Point Navigation (PointNav) and Social Navigation (SocialNav). In PointNav, the agent's goal is to navigate to the target coordinates in static environments without access to a pre-built environment map. In SocialNav, the agent is required to reach a point goal in dynamic environments that contain moving environmental agents; it shall avoid collisions with, or proximity beyond a threshold (distance < 0.2 meters) to, environmental agents in order to avoid penalization. The agent is evaluated using the Success Rate (SR) and Success weighted by Path Length (SPL) metrics, which measure the success and efficiency of the path taken by the agent. For SocialNav, in addition to the Success Rate (SR), the Social Navigation Score (SNS) is used to evaluate the social compliance of the agent. For both tasks, we further report the Cumulative Cost (CC) to evaluate the safety properties of the agent; it records the frequency of collisions with obstacles or environmental agents. We evaluate 7 typical baseline models to build comprehensive benchmarks on MetaUrban, across Reinforcement Learning (PPO), Safe Reinforcement Learning (PPO-Lag and PPO-ET), Offline Reinforcement Learning (IQL and TD3+BC), and Imitation Learning (BC and GAIL).
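As a reference for re-implementing the evaluation, the sketch below computes SR, SPL, and a per-episode Cumulative Cost from hypothetical episode logs. SPL follows its standard definition (success weighted by the ratio of shortest-path length to taken-path length); the EpisodeLog fields and the exact cost accounting are assumptions, and SNS is omitted because its precise formula is not given here.

# A hedged sketch of the navigation metrics above. EpisodeLog and the cost
# accounting are assumed for illustration; only SPL follows a standard formula.

from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeLog:
    success: bool          # whether the agent reached the point goal
    path_length: float     # length of the path the agent actually took (m)
    shortest_path: float   # geodesic start-to-goal distance (m)
    num_crashes: int       # collisions with obstacles or environmental agents

def success_rate(episodes: List[EpisodeLog]) -> float:
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes: List[EpisodeLog]) -> float:
    # SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    return sum(float(e.success) * e.shortest_path / max(e.path_length, e.shortest_path)
               for e in episodes) / len(episodes)

def cumulative_cost(episodes: List[EpisodeLog]) -> float:
    # Average number of crash events per episode, as a simple safety proxy.
    return sum(e.num_crashes for e in episodes) / len(episodes)

logs = [EpisodeLog(True, 12.5, 10.0, 0), EpisodeLog(False, 30.0, 10.0, 2)]
print(success_rate(logs), spl(logs), cumulative_cost(logs))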

Results

We show the results of a PPO policy trained in MetaUrban environments on the social navigation task. The success cases demonstrate that the agent can avoid collisions with objects and other agents. However, there are still many interesting failure cases, which indicate the complexity of MetaUrban environments and the significant room for improvement for embodied agents in urban spaces.

Success Cases



Failure Cases

User Interface for Demonstration

Impacts

Embodied AI. MetaUrban contributes to advancing areas such as robot navigation, social robotics, and interactive systems. It could facilitate the development of robust AI systems capable of understanding and navigating complex urban environments.
Economy. MetaUrban could be used in businesses and services operating in urban environments, such as last-mile food delivery, assistive wheelchairs, and trash-cleaning robots. It could also drive innovation in urban planning and infrastructure development by providing simulation tools and insights into how spaces are utilized, thereby enhancing the economic and societal efficiency of public urban spaces like sidewalks and parks.
Society. By enabling the safe integration of robots and AI systems in public spaces, MetaUrban could support the development of assistive technologies that can aid in accessibility and public services. Using AI in public spaces might foster new forms of social interaction and community services, making urban spaces more livable and joyful.

Release Plan

  • Demo video: June 12, 2024
  • Code of MetaUrban - tiny version: June 12, 2024
  • Project page: July 5, 2024
  • Code of MetaUrban - official version 1.0: September 15, 2024

Acknowledgement

The project is supported by the NSF Grants CCRI-2235012, RI-2339769, and POSE-2346267, and an Intel Rising Star Faculty Award.

Reference

@article{wu2024metaurban,
  title={MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces},
  author={Wu, Wayne and He, Honglin and Wang, Yiran and Duan, Chenda and He, Jack and Liu, Zhizheng and Li, Quanyi and Zhou, Bolei},
  journal={arXiv preprint arXiv:2407.08725},
  year={2024}
}