We present the performance of agents trained with two preference-based reinforcement learning (PbRL) algorithms, PPE and QPA, using real human preference feedback.

For comparison, we also include the results from the Supplementary Material [2] of the QPA paper [1].

>**File Explanation**

>>**./humanFeedbackDemostration/handcraft.mp4:** The motion performance of an agent trained by the SAC algorithm using the built-in task reward of the DMControl suite, as provided in [2].

>>**./humanFeedbackDemostration/humanFeedback_QPA:** The motion performance of an agent trained by the QPA algorithm with 100 human preference annotations, as provided in [2].

>>**./humanFeedbackDemostration/humanFeedback_reproducedQPA.mp4:** The motion performance of an agent trained with 100 human preference annotations collected from our volunteers, using the code released with the QPA paper [3].

>>**./humanFeedbackDemostration/humanFeedback_PPE+QPA.mp4:** The motion performance of an agent trained with 100 human preference annotations collected from our volunteers, using the PPE algorithm.


These videos show visually that the agent trained with the PPE algorithm matches people's understanding of the cheetah run task description more closely than the agent trained with the QPA algorithm.


[1] https://openreview.net/forum?id=UoBymIwPJR

[2] https://openreview.net/attachment?id=UoBymIwPJR&name=supplementary_material

[3] https://github.com/huxiao09/QPA