Data-Efficient Learning from Human Interventions for Mobile Robots

ICRA 2025

Zhenghao Peng, Zhizheng Liu, Bolei Zhou
University of California, Los Angeles

TL;DR

We train two mobile robots in the real world, in real time :robot:, via human-in-the-loop learning! Our method:

:star2: Learns from online human intervention and demonstration!

:star2: Trains from scratch, without reward!
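In code, this human-in-the-loop setup looks roughly like the loop below. This is a minimal sketch: `env`, `policy`, `human`, and `buffer` are hypothetical placeholders, not the interfaces released with the paper.

```python
def human_in_the_loop_training(env, policy, human, buffer, max_steps=10_000):
    """Reward-free online learning from human interventions (hypothetical API)."""
    obs = env.reset()
    for _ in range(max_steps):
        agent_action = policy.act(obs)
        if human.is_intervening():
            # Human takes over: store the overridden agent action together
            # with the human demonstration. No reward is ever recorded.
            action = human.current_action()
            buffer.add(obs, agent_action, action, intervened=True)
        else:
            action = agent_action
            buffer.add(obs, agent_action, action, intervened=False)
        obs = env.step(action)
        policy.update(buffer.sample())  # learn online, from scratch
    return policy
```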

Safe Navigation

A delivery robot (Unitree Go2) learns to navigate safely in a real-world environment, avoiding collisions with static and dynamic obstacles. Even though the observation is a raw RGB-D image, training can be completed in 20 minutes and the robot generalizes to unseen environments.
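As a rough illustration of learning from raw sensors, the RGB and depth channels can be stacked into a 4-channel image and encoded by a small CNN. The architecture below is an illustrative assumption, not the exact network from the paper.

```python
import torch
import torch.nn as nn

class RGBDPolicy(nn.Module):
    """Toy policy over stacked RGB-D input (4 channels).

    Layer sizes are illustrative assumptions, not the paper's architecture.
    """

    def __init__(self, action_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # e.g. (linear velocity, angular velocity) command
        self.head = nn.LazyLinear(action_dim)

    def forward(self, rgbd: torch.Tensor) -> torch.Tensor:
        # rgbd: (batch, 4, H, W), RGB in [0, 1] stacked with normalized depth
        return self.head(self.encoder(rgbd))

policy = RGBDPolicy()
action = policy(torch.rand(1, 4, 84, 84))  # -> tensor of shape (1, 2)
```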

Human Following

We train a quadruped robot (Unitree Go2) to follow a human subject in a real-world environment. Training is completed within 10 minutes.

Zero-shot Deployment

We have successfully deployed the learned policies to unseen environments.

Safe Navigation:


Human Following:

Full Training Footage

Related Works

  • Predictive Preference Learning (NeurIPS 2025): PPL is a model-based online preference learning algorithm. It predicts future failures and learns from hypothetical preference data: if the expert takes over now, it would likely also take over in nearby future states if the agent were allowed to keep running.

  • Adaptive Intervention Mechanism (ICML 2025): AIM is a robot-gated Interactive Imitation Learning (IIL) algorithm that cuts expert takeover cost by 40%.

  • PVP for Real-world Robot Learning (ICRA 2025): We apply PVP to real-world robot learning, showing that mobile robots can be trained from online human interventions and demonstrations, from scratch, without reward, from raw sensors, and in as little as 10 minutes!

  • Proxy Value Propagation (PVP) (NeurIPS 2023 Spotlight): Proxy Value Propagation (PVP) is an Interactive Imitation Learning algorithm that adopts the reward-free setting and further improves learning from active human involvement. It addresses the catastrophic forgetting and unstable behavior of the learning agent, as well as the difficulty of learning sparse yet crucial human behaviors. As a result, PVP achieves 10x faster learning, the best user experience, and safer human-robot shared control. A minimal sketch of its proxy-value objective is given after this list.

  • Teacher-Student Shared Control (ICLR 2023): In Teacher-Student Shared Control (TS2C), we examined the impact of using the value function as the criterion for deciding when the PPO expert should intervene. TS2C makes it possible to learn a student policy that surpasses the teacher's performance.

  • Human-AI Copilot Optimization (ICLR 2022): Building upon the methodology of EGPO and substituting the PPO expert with a real human subject, we proposed Human-AI Copilot Optimization (HACO), which demonstrated significant improvements in learning efficiency over traditional RL baselines.

  • Expert Guided Policy Optimization (CoRL 2021): Our research on human-in-the-loop policy learning began in 2021. The first published work is Expert Guided Policy Optimization (EGPO), where we explored how an RL agent can benefit from the intervention of a PPO expert.
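Below is the proxy-value sketch referenced in the PVP item above. It reflects our reading of the PVP objective: human actions taken during interventions are labeled with value +1, the agent actions they override with -1, and a reward-free TD term propagates these proxy values to states the human never touched. The `q_net(s, a)` interface and the batch layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def pvp_loss(q_net, target_q_net, batch, gamma=0.99):
    """Sketch of the Proxy Value Propagation objective (hypothetical API).

    `batch` holds transitions with keys: obs, agent_action, human_action,
    applied_action (what was actually executed), next_obs, next_action,
    and a float mask `intervened`. There is no environment reward anywhere.
    """
    s, s_next = batch["obs"], batch["next_obs"]
    intervened = batch["intervened"]  # 1.0 where the human took over

    # 1) Proxy values on intervention data: human actions are labeled +1,
    #    the agent actions they overrode are labeled -1.
    q_human = q_net(s, batch["human_action"])
    q_agent = q_net(s, batch["agent_action"])
    proxy = (intervened * ((q_human - 1.0) ** 2 + (q_agent + 1.0) ** 2)).mean()

    # 2) Reward-free TD term (reward treated as 0) propagates the proxy
    #    values beyond the intervened states.
    with torch.no_grad():
        td_target = gamma * target_q_net(s_next, batch["next_action"])
    td = F.mse_loss(q_net(s, batch["applied_action"]), td_target)

    return proxy + td
```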

Reference

@article{peng2025data,
  title={Data-Efficient Learning from Human Interventions for Mobile Robots},
  author={Peng, Zhenghao and Liu, Zhizheng and Zhou, Bolei},
  journal={arXiv preprint arXiv:2503.04969},
  year={2025}
}