Guarded Policy Optimization with
Imperfect Online Demonstrations

International Conference on Learning Representations (ICLR) 2023

Zhenghai Xue1,   Zhenghao Peng2,   Quanyi Li3,   Zhihan Liu4,   Bolei Zhou2 
1Nanyang Technological University, Singapore, 2University of California, Los Angeles,
3University of Edinburgh, 4Northwestern University
Webpage | Code | Paper | Talk

Fig. 1: Overview of the proposed method.
As shown in Fig. 1, we include a teacher policy \(\pi_t\) in the RL training loop. During the training of the student policy, both \(\pi_s\) and \(\pi_t\) receive the current state \(s\) from the environment and propose actions \(a_s\) and \(a_t\). A value-based intervention function \(\mathcal{T}(s)\) then determines which action is applied to the environment. The student policy is updated with data collected through both policies.
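The shared-control rollout described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `student`, `teacher`, and `intervention_fn` are hypothetical callables, and the environment is assumed to follow the classic gym-style `step` interface returning `(next_state, reward, done, info)`.

```python
def shared_control_rollout(env, student, teacher, intervention_fn, max_steps=1000):
    """One episode of teacher-student shared control (hypothetical interface).

    `student(s)` and `teacher(s)` map a state to an action;
    `intervention_fn(s, a_s)` returns True when the teacher should take over.
    Transitions gathered under both policies go into the same buffer
    that is later used to update the student.
    """
    buffer = []
    s = env.reset()
    for _ in range(max_steps):
        a_s = student(s)
        # The intervention function decides whose action reaches the environment.
        a = teacher(s) if intervention_fn(s, a_s) else a_s
        s_next, r, done, _ = env.step(a)
        buffer.append((s, a, r, s_next, done))
        s = s_next
        if done:
            break
    return buffer
```

The key design point is that the student learns from the full trajectory, including the segments where the teacher acted, so teacher interventions still produce training data.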
Previous works assume the availability of a well-performing teacher policy that intervenes whenever the student acts differently. However, obtaining such a teacher is time-consuming or even impossible in many real-world applications, such as object manipulation with robot arms and autonomous driving. If we instead employ a suboptimal teacher, incorrect teacher interventions will hinder the learning of the student policy.

Fig. 2: In an autonomous driving scenario, the ego vehicle is the blue one on the left, following the gray vehicle on the right. The upper trajectory is proposed by the student to overtake and the lower trajectory is proposed by the teacher to keep following.
We illustrate this phenomenon with the example in Fig. 2, where a slow vehicle in gray drives in front of the ego vehicle in blue. The student policy is aggressive and would like to overtake the gray vehicle to reach the destination faster, while the teacher intends to follow it conservatively. In this scenario, the teacher will intervene in the student's exploration whenever the student behaves differently, so the student policy can never accomplish a successful overtake.
To address this issue, we propose a new algorithm called Teacher-Student Shared Control (TS2C). It only allows the teacher to intervene when the student action is dangerous or has low expected return. The risk and the expected return of the student action are jointly assessed with the teacher's value functions, leading to the following intervention function: $$\mathcal{T}(s)=\begin{cases} 1 \quad\text{if}~V^{\pi_t}\left(s\right)-\mathbb{E}_{a\sim\pi_s(\cdot|s)}Q^{\pi_t}\left(s, a\right)>\varepsilon,\\[.8em] 0 \quad\text{otherwise}, \end{cases}$$ where \(V^{\pi_t}\) and \(Q^{\pi_t}\) are the teacher's state- and action-value functions, \(\pi_s(\cdot|s)\) is the student policy, and \(\varepsilon\) is a threshold that controls the risk tolerance of the teacher.
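The intervention criterion above can be sketched in a few lines. This is an illustrative approximation under assumed interfaces: `teacher_v(s)` and `teacher_q(s, a)` stand in for the learned value functions \(V^{\pi_t}\) and \(Q^{\pi_t}\), and the expectation over the student policy is estimated by Monte Carlo sampling from a stochastic `student_policy(s)`.

```python
import numpy as np

def should_intervene(s, student_policy, teacher_v, teacher_q,
                     eps=0.1, n_samples=16):
    """Value-based intervention function T(s), sketched with hypothetical
    value-function handles.

    E_{a ~ pi_s} Q^{pi_t}(s, a) is approximated by averaging Q over
    actions sampled from the student policy.
    """
    actions = [student_policy(s) for _ in range(n_samples)]
    expected_q = np.mean([teacher_q(s, a) for a in actions])
    # Intervene (return 1) when the teacher's value estimate exceeds the
    # expected value of acting with the student policy by more than epsilon.
    return int(teacher_v(s) - expected_q > eps)
```

A larger \(\varepsilon\) makes the teacher more tolerant of student actions whose expected return falls below the teacher's own value estimate, which is what lets a student like the one in Fig. 2 attempt an overtake the teacher would not.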
The training results with teacher policies of three different performance levels are shown in Fig. 3. The first row shows that the performance of TS2C is not limited by imperfect teacher policies: it converges within 200k steps regardless of the teacher's performance. EGPO and Importance Advising are clearly bounded by teacher-medium and teacher-low, performing much worse than TS2C with imperfect teachers. The second row of Fig. 3 shows that TS2C also incurs lower training cost than both algorithms.

Fig. 3: Comparison between our method TS2C and other algorithms with teacher policies providing online demonstrations. "Importance" refers to the Importance Advising algorithm. For each column, the involved teacher policy has high, medium, and low performance respectively.
The performance of TS2C in different MuJoCo environments is presented in Fig. 4. The figures show that TS2C generalizes across environments: it outperforms SAC in all three MuJoCo environments considered. On the other hand, though the EGPO algorithm performs best in the Pendulum environment, it struggles in the other two, namely Hopper and Walker.

Fig. 4: Performance comparison between our method TS2C and baseline algorithms on three environments from MuJoCo.
We summarize our core technical contribution in this talk.
@article{xue2023guarded,
  title   = {Guarded Policy Optimization with Imperfect Online Demonstrations},
  author  = {Zhenghai Xue and Zhenghao Peng and Quanyi Li and Zhihan Liu and Bolei Zhou},
  journal = {International Conference on Learning Representations},
  year    = {2023},
  url     = {}
}