We find that generating reasonably accurate dialogue actions from speech and spatial intent information is learnable. However, we believe the current data volume is far from sufficient for optimal performance: the 8,000 data points used here serve only to validate the feasibility of this approach. In future work, we plan to enlarge the dataset by one to two orders of magnitude, which would likely raise performance on this task to a level that humans find satisfactory.