Abstract: We present ForceSight, a system for text-guided mobile manipulation that predicts visual-force goals using
a text-conditioned vision transformer. Given a single RGBD image and a text prompt, ForceSight determines a target end-
effector pose in the camera frame (kinematic goal) and the associated forces (force goal). Together, these two components
form a visual-force goal. Prior work has demonstrated that deep models outputting human-interpretable kinematic goals
can enable dexterous manipulation by real robots. Forces are critical to manipulation, yet have typically been relegated
to low-level execution in these systems. When deployed on a mobile manipulator equipped with an eye-in-hand RGBD
camera, ForceSight performed tasks such as precision grasps, drawer opening, and object handovers with an 81% success
rate in unseen environments containing object instances that differed significantly from the training data. In a separate experiment,
relying exclusively on visual servoing and ignoring force goals dropped the success rate from 90% to 45%, demonstrating
that force goals can significantly enhance performance.
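As a rough sketch of the interface described above (a single RGBD image and a text prompt mapped to a visual-force goal), the following hypothetical Python code illustrates the input/output structure; all names, fields, and the model call are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VisualForceGoal:
    """Hypothetical container for a predicted visual-force goal.

    The kinematic goal is a target end-effector pose in the camera frame;
    the force goal is the associated force. Field names are illustrative,
    not taken from the paper.
    """
    ee_position: np.ndarray      # (3,) target position in the camera frame, meters
    ee_orientation: np.ndarray   # (4,) target orientation as a quaternion
    applied_force: np.ndarray    # (3,) target force at the end effector, Newtons


def predict_goal(model, rgbd_image: np.ndarray, prompt: str) -> VisualForceGoal:
    """Single-step interface implied by the abstract: one RGBD frame plus a
    text prompt in, one visual-force goal out. `model` is assumed to be a
    text-conditioned vision transformer exposing this call signature."""
    return model(rgbd_image, prompt)
```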