Demonstrations of Uncertainty-Sensitive Privileged Learning (USPL)

1. PP-DP Behavior Divergence

This section compares the Deployment Policy (DP) and Privileged Policy (PP) behaviors. Each row shows the baseline algorithm on the left and our method on the right. Except for Blind-Mass Stack, the transparent robot represents the PP, and the other robot represents the DP.

Blind-Mass Stack (RMA). Left: DP, Right: PP

Blind-Mass Stack (USPL). Left: DP, Right: PP

Lateral Choice (RMA)

Lateral Choice (USPL)

Midpoint Choice (RMA)

Midpoint Choice (USPL)

Biased Quadrotor (RMA)

Biased Quadrotor (USPL)

Signpost Nav (RMA)

Signpost Nav (USPL)

Square Maze (RMA)

Square Maze (USPL)

Stairway Search (RMA)

Stairway Search (USPL)

These results show that the behavioral discrepancy between the DP and PP is significantly lower under USPL than under RMA: their trajectories overlap almost entirely for most of each episode.

2. Behavior and Privileged Prediction Visualization

This section shows USPL robot trajectories alongside the predicted privileged observations. The predicted standard deviation indicates the encoder's uncertainty. In Blind-Mass Stack and Biased Quadrotor, the semi-transparent red/blue/green blocks indicate the range of predicted privileged observations, and the dashed lines marked by the two triangles denote the actual privileged observation.

Blind-Mass Stack

Lateral Choice

Midpoint Choice (image)

Midpoint Choice

Biased Quadrotor

Signpost Nav (image)

Signpost Nav

Stairway Search

Square Maze (Goal at Bottom Left)

Square Maze (Goal at Upper Left)

These results indicate that the observation encoder tracks the current uncertainty sensitively: once new information is discovered, the predicted uncertainty responds rapidly and accurately reflects how much information remains to be gathered.
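As a minimal sketch of this idea (the paper's actual encoder architecture is not reproduced here; the class name, weights, and dimensions below are all hypothetical), an encoder head can output a Gaussian over the privileged observation, with the predicted standard deviation serving as the uncertainty signal fed to the policy:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianHead:
    """Toy head mapping an observation feature vector to a Gaussian over
    the privileged observation: a mean and a per-dimension std.
    The std is the uncertainty signal (hypothetical sketch)."""

    def __init__(self, feat_dim, priv_dim):
        self.W_mu = rng.normal(scale=0.1, size=(priv_dim, feat_dim))
        self.W_logstd = rng.normal(scale=0.1, size=(priv_dim, feat_dim))

    def __call__(self, feat):
        mu = self.W_mu @ feat               # predicted privileged observation
        std = np.exp(self.W_logstd @ feat)  # per-dimension uncertainty (> 0)
        return mu, std

head = GaussianHead(feat_dim=8, priv_dim=3)
mu, std = head(rng.normal(size=8))
print(mu.shape, std.shape)  # (3,) (3,)
```

In this formulation the std shrinks as the encoder becomes confident, which is what the visualized block ranges above depict.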

3. Manually Set Uncertainty

Here, instead of feeding the policy the uncertainty output by the observation encoder, we set the uncertainty manually. The uncertainty value is displayed in the top-left corner of each video. We start by assigning a high uncertainty and, after some time, reduce it to observe how the policy responds.

Blind-Mass Stack

Lateral Choice

Midpoint Choice (image)

Midpoint Choice

Biased Quadrotor

Signpost Nav (image)

Signpost Nav

Square Maze

Stairway Search

As observed, while the uncertainty is high, the policy remains in an exploratory mode and refrains from completing the task. Once the uncertainty decreases, the policy immediately proceeds to accomplish the task, demonstrating that our privileged policy is highly sensitive to the uncertainty input.
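The manual override described above can be sketched as a simple schedule that replaces the encoder's predicted uncertainty (the switch step and the high/low values below are hypothetical, not taken from the experiments):

```python
def manual_uncertainty(t, switch_step=200, high=1.0, low=0.05):
    """Piecewise-constant uncertainty fed to the policy in place of the
    encoder's prediction: high early on, dropped to low after
    switch_step (all values hypothetical)."""
    return high if t < switch_step else low

# The policy input is assembled with this value instead of the
# encoder-predicted std at every timestep.
sched = [manual_uncertainty(t) for t in (0, 199, 200, 500)]
print(sched)  # [1.0, 1.0, 0.05, 0.05]
```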

4. Task Descriptions

Stairway Search

Observation space: Depth images from an onboard depth camera; current robot position and orientation.

Action space: Desired change in heading (yaw) and desired body pitch.

Privileged observation: Coordinates of the target platform.

Reward: Shaped reward for approaching the target platform and a terminal reward for reaching it.

Task description: The robot starts on a large platform and must step onto a smaller platform with stairways on both sides to descend to the ground. A low-level controller receives target speed (0.5 m/s), desired pitch, and yaw commands.

Optimal behaviour: Peer over the edge to locate the smaller platform, then walk onto it.

Lateral Choice

Observation space: Current robot position and orientation.

Action space: Desired change in heading (yaw).

Privileged observation: Coordinates of the goal point.

Reward: Shaped reward for approaching the goal and a terminal reward for reaching it.

Task description: The goal may be on the left or the right; the robot is blind to the terrain and must find the goal by trial. The episode ends when the goal is reached.

Optimal behaviour: Walk toward one side; if the goal is not reached, turn around and go to the opposite side.

Blind-Mass Stack

Observation space: Positions of the three cubes; end-effector position; index of the green cube; grasp flag and index of the grasped cube; measured mass of the grasped cube.

Action space: Discrete choices for the end-effector x-y target (one of the three cube positions), the end-effector z target (one of three heights), and gripper open/close.

Privileged observation: Index of red cube and bias on weight sensor.

Task description: Three cubes occupy positions 1–3 from left to right. The green cube's index is given and its mass is known (1 kg); the red cube weighs 0.75 kg, but its index is unknown; the remaining cube's mass is unknown. The weight sensor is biased by an unknown factor. The actor must self-calibrate the sensor and stack the red cube on the green cube.

Optimal behaviour: Grasp the green cube to calibrate the sensor bias, then pick up another cube and check its calibrated mass to identify the red cube, and finally stack the red cube on the green one.
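The self-calibration logic above can be illustrated with a small sketch (the function name, tolerance, and example bias factor are hypothetical; the known masses follow the task description):

```python
def identify_red(measured, green_idx, green_mass=1.0, red_mass=0.75, tol=0.05):
    """Given biased mass readings for the three cubes (indices 0-2),
    estimate the multiplicative sensor bias from the green cube's known
    1 kg mass, then return the index of the cube whose calibrated mass
    matches the red cube's known 0.75 kg (hypothetical sketch)."""
    bias = measured[green_idx] / green_mass       # estimate sensor bias
    for i, m in enumerate(measured):
        if i != green_idx and abs(m / bias - red_mass) < tol:
            return i
    return None

# Example: bias factor 1.3; true masses 0.75 (red), 1.0 (green), 1.4 (other).
readings = [0.75 * 1.3, 1.0 * 1.3, 1.4 * 1.3]
print(identify_red(readings, green_idx=1))  # 0
```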

Signpost Nav

Observation space: Position, orientation, plus either scandot heights or head-mounted depth images.

Action space: Desired change in heading (yaw).

Privileged observation: Coordinates of the goal point.

Reward: Shaped reward for moving toward goal and terminal reward for reaching it.

Task description: A signpost encodes the hidden goal: its orientation indicates the goal direction and its length indicates the goal distance. The robot must read the signpost to infer the goal location and then navigate to it.

Optimal behaviour: Reach signpost, infer goal geometry, then travel to goal.
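One way the direction-plus-length encoding could be read off is sketched below (the `scale` factor and the two-endpoint representation of the signpost are assumptions for illustration, not values from the task):

```python
def goal_from_signpost(base, tip, scale=2.0):
    """Map a signpost's two endpoints to a goal location: the post's
    orientation gives the direction to the goal and its length, times a
    hypothetical scale factor, gives the distance."""
    dx, dy = tip[0] - base[0], tip[1] - base[1]
    return (base[0] + scale * dx, base[1] + scale * dy)

# A post from (0, 0) to (1, 2) points the robot twice as far along
# the same bearing.
print(goal_from_signpost((0.0, 0.0), (1.0, 2.0)))  # (2.0, 4.0)
```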

Square Maze

Observation space: Position and orientation.

Action space: Desired change in heading (yaw).

Privileged observation: Coordinates of the goal point.

Reward: Shaped reward for approaching goal and terminal reward for reaching it.

Task description: The goal lies at one of the four corners, and the maze layout varies slightly across episodes. Probing a junction tells the robot which way to turn: right after a collision, left otherwise.

Optimal behaviour: Probe each junction to infer layout and reach goal corner.
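The probing rule above reduces to a one-line decision (a hypothetical sketch of the stated rule, with illustrative names):

```python
def choose_branch(collided):
    """Junction-probing rule from the task description: turn right
    after a collision, left otherwise (hypothetical sketch)."""
    return "right" if collided else "left"

# Probing three junctions in sequence yields the route through the maze.
route = [choose_branch(c) for c in (True, False, True)]
print(route)  # ['right', 'left', 'right']
```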

Midpoint Choice

Observation space: Position, orientation, plus either scandot or depth images.

Action space: Desired change in heading (yaw).

Privileged observation: Coordinates of the goal point.

Reward: Shaped reward for approaching goal and terminal reward for reaching it.

Task description: Each of the four corner platforms is fronted by a pair of columns; only the pair in front of the goal platform is the second tallest. The robot must measure the column heights, identify the second-tallest pair, and then move to the corresponding platform.

Optimal behaviour: Circle platforms, record heights, identify second-tallest pair, go to corresponding platform.
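The selection rule above amounts to picking the platform whose column pair ranks second by height (a hypothetical sketch; heights are illustrative):

```python
def goal_platform(pair_heights):
    """Given the measured height of the column pair in front of each of
    the four corner platforms, return the index of the platform whose
    pair is second tallest."""
    order = sorted(range(len(pair_heights)),
                   key=lambda i: pair_heights[i], reverse=True)
    return order[1]  # index of the second-tallest pair

print(goal_platform([0.8, 1.2, 0.5, 1.0]))  # 3
```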

Biased Quadrotor

Observation space: Perceived altitude, roll, pitch, angular velocity, vertical velocity, and landed flag.

Action space: Target pitch, roll, and altitude.

Privileged observation: Biases in altitude, roll, and pitch sensors.

Reward: Shaped reward for maintaining the target altitude and attitude, a terminal reward for achieving them, and a penalty for touching the ground.

Task description: The quadrotor must hover at a target altitude despite corrupted sensor readings. Ground contact provides a reference for calibrating the sensors, after which the quadrotor can hover accurately.

Optimal behaviour: Land to zero-reference sensors, calibrate, then take off and maintain steady hover.
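The calibration step above exploits the fact that the true altitude, roll, and pitch are all zero while landed; assuming additive sensor biases (an assumption consistent with the task description, with hypothetical function names), it reduces to simple subtraction:

```python
def calibrate_on_ground(perceived_altitude, perceived_roll, perceived_pitch):
    """While landed the true altitude, roll, and pitch are all zero, so
    each perceived value equals its additive sensor bias directly."""
    return {"altitude": perceived_altitude,
            "roll": perceived_roll,
            "pitch": perceived_pitch}

def correct(reading, bias):
    """Remove the estimated bias from a later in-flight reading."""
    return reading - bias

biases = calibrate_on_ground(0.3, -0.05, 0.02)
print(correct(1.8, biases["altitude"]))  # 1.5
```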