Rule and Convention Transfer in Context

Introduction

AI has the potential to streamline important aspects of society, such as labor and transportation. As artificially intelligent systems grow in number, there is a greater need for humans to collaborate with them effectively. Improving AI’s ability to collaborate with us allows us to solve increasingly complex tasks together with greater efficiency. In On the Critical Role of Conventions in Adaptive Human-AI Collaboration, Shih et al. look to improve human-AI collaboration by having AI adapt quickly to new tasks and new partners (Shih et al. 2021).

Rules and Conventions

The authors approach the problem by considering a notion of conventions, which represent specific shared knowledge that arises from interactions between agents. These are presented in contrast to rules, which are established constraints inherent to a given task. Correspondingly, convention-dependent skills are shared, partner-specific strategies developed through interaction, while rule-dependent skills arise from task constraints, independent of who the partner is.

We can illustrate this with an example: consider the popular sport of American football. Rules in football are constraints placed on players, such as the bounds within which they may stand while holding the football, when they are allowed to cross the line of scrimmage, and the requirements for earning points. Rule-dependent skills, in this case, could include knowing how to throw a football and knowing how to tackle, because both contribute to the goals of the game and are generally independent of which players are involved. On the other hand, conventions in football could be protocols for verbal or hand signals meant to communicate where the quarterback expects players to run. A convention-dependent skill, then, could be the hand signal itself or the trajectory a receiver takes when running. Notably, such conventions allow specific partners to agree on a particular method for solving a task, so faster adaptation to conventions yields faster solutions.

The authors employ this distinction by learning separate representations for conventions and rules in a given collaborative multi-agent task. Intuitively, learning separate representations for tasks and partners allows the model to generalize specifically to new tasks or to new partners.

Distinguishing conventions from rules is useful, but it is certainly not the only approach in human-AI collaboration. Shih et al. have written a great blog post detailing their paper here, so instead, we aim to provide some context for Shih et al.’s work by discussing other approaches to fast and seamless adaptation to humans. We begin with other work using convention formation, specifically the Simplified Action Decoder (SAD). We then discuss “Other-Play” for Zero-Shot Coordination by Hu et al., which attempts zero-shot coordination with new partners. Finally, we talk through perhaps the most similar approach to Shih et al.’s, Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer by Devin et al., showing that while the goals of the two approaches differ, both accomplish the transfer of information that has already been learned. Throughout, we look to motivate Shih et al.’s work before discussing possible future improvements to this approach to human-AI collaboration beyond what the authors offered in their paper.

The Benefit of Separating Rules and Conventions

The mechanics of Shih et al.’s approach are important for distinguishing it from other work. Their modular policy network, which learns distinct representations for conventions and rules, is shown in Figure 1 for a task with four possible actions and three partners (note, however, that any number of actions or partners is possible). The task module first processes the current state, outputting an action distribution for the task and a latent representation $z$. The task distribution assigns high probability to actions that are promising for the task (in this case, actions 2 and 4). Each of the three partner modules then uses $z$ to produce its own action distribution. The policy for each agent-partner combination is given by the product of the task action distribution and the respective partner action distribution.

Figure 1: The modular policy network proposed by Shih et al.
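To make the flow concrete, here is a minimal sketch of such a modular policy in Python/PyTorch. The layer sizes, module names, and the exact way $z$ feeds the partner modules are our assumptions for illustration, not the authors’ implementation.

```python
import torch
import torch.nn as nn


class ModularPolicy(nn.Module):
    """Sketch of a task module plus per-partner modules (assumed architecture)."""

    def __init__(self, state_dim, n_actions, n_partners, latent_dim=16):
        super().__init__()
        # Task module: state -> (task action distribution, latent z)
        self.task_module = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.task_action_head = nn.Linear(64, n_actions)
        self.latent_head = nn.Linear(64, latent_dim)
        # One partner module per partner: z -> partner action distribution
        self.partner_modules = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
            for _ in range(n_partners)
        ])

    def forward(self, state, partner_id):
        h = self.task_module(state)
        task_probs = torch.softmax(self.task_action_head(h), dim=-1)      # rule-dependent
        z = self.latent_head(h)
        partner_logits = self.partner_modules[partner_id](z)
        partner_probs = torch.softmax(partner_logits, dim=-1)             # convention-dependent
        # Combined policy: product of the two distributions, renormalized.
        joint = task_probs * partner_probs
        return joint / joint.sum(dim=-1, keepdim=True)


policy = ModularPolicy(state_dim=8, n_actions=4, n_partners=3)
probs = policy(torch.randn(1, 8), partner_id=2)   # policy used when paired with partner 3
```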

Since the model assigns probabilities based on the task first, the partner action distribution mainly matters when there is a tie between certain actions. When such a tie occurs, the agent and its partner can converge on one of the tied actions; when the same symmetry comes up again in a related task, they will know what to do, having developed a convention. The policy for partner 3 is shown in Figure 2: the agent and partner 3 converge to action 4 whenever a similar situation is presented. Rule-dependent representations are captured because the marginal best-response strategy is learned in the task module. In our example, favoring actions 2 and 4 reflects the rules of the task, since they have the highest probability in the task action distribution.

Figure 2: Policy outcome from agent and partner 3 in Figure 1 shows action 4 is best, hence being a convention for this pair.

Importantly, the agent will form separate conventions with different partners. Using the same intuition as in Figure 2, the agent might instead converge on action 2 with a partner whose module assigns higher probability to that action. The result is a set of modules for each task and each partner: the task modules capture the rules of their tasks, and the partner modules capture the conventions formed with the respective partners.
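A toy numeric example (values ours, chosen only to illustrate the tie-breaking) shows how the same task distribution yields different conventions with different partners:

```python
import numpy as np

# Task module output: actions 2 and 4 are tied as the promising (rule-dependent) actions.
task_probs = np.array([0.05, 0.45, 0.05, 0.45])

# Hypothetical partner modules break the tie in different directions.
partner_probs = {
    "partner 1": np.array([0.25, 0.40, 0.25, 0.10]),  # leans toward action 2
    "partner 3": np.array([0.25, 0.10, 0.25, 0.40]),  # leans toward action 4
}

for name, p in partner_probs.items():
    joint = task_probs * p
    joint /= joint.sum()
    print(name, "-> convention: action", joint.argmax() + 1, joint.round(2))

# With partner 1 the pair converges on action 2; with partner 3, on action 4.
# The task module (the rules) is unchanged -- only the convention differs.
```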

With the mechanics laid out, we can contrast this approach with others.

Learning Conventions

A common approach in the field of human-AI collaboration is for an agent to mimic the human capability to “attribute beliefs, desires, and percepts to others that are grounded in physical activity and the state of the world.” (Baker et al. 2017) This is known as Theory of Mind, and it is implemented in artificial agents so that they can make sense of why another agent took its specific actions and use that to inform their own decisions. One approach that addressed this challenge is the Simplified Action Decoder (SAD) of Hu and Foerster, a simplified improvement of the Bayesian Action Decoder (BAD) developed by Foerster et al. BAD and SAD both showed promising improvements over other models on the fully cooperative Hanabi task, a benchmark for collaboration.

Understanding the Simplified Action Decoder

BAD samples deterministic partial policies from a distribution conditioned on a common-knowledge Bayesian belief (Nayyar et al. 2012). This allows agents to learn informative actions that lead to the discovery of conventions via Bayesian updating; however, it comes at the cost of simplicity. The common-knowledge Bayesian belief must be tracked, adding sampling steps and increasing the computation required. Additionally, the sampling process requires expert knowledge of the task dynamics. BAD must also use population-based training to address problems with its original method, which adds further work to the sampling process. Not only is BAD computationally taxing, its reliance on common knowledge also reduces its generality, because not all tasks involve situations in which common knowledge is reasonable. SAD addresses these issues with BAD and will thus be our focus in this section.

The key problem SAD tries to solve is letting an agent explore without adding unnecessary noise to its policy, which would make its actions less informative to its partners. The idea is to sidestep the tradeoff between being informative and exploring. SAD does this by having the agent select two different actions at each time step. One is the greedy (informative) action, which is not actually executed but is observed by the other agents as input at the next time step. The other is the standard (exploration) action, which is executed in the environment and also observed by the other agents. The authors assume ε-greedy execution for the standard action, meaning the agent takes a random action with probability ε and the greedy action otherwise. To see why this helps, consider a situation where an agent takes a random action. That action is not informative to its teammates, but with the additional input that SAD provides (the greedy action), teammates can still infer what the agent intended.
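The following self-contained sketch (ours, with an assumed Q-network output and exploration rate) illustrates the two-action idea: the ε-greedy action is executed, while the greedy action is exposed to teammates as an extra input.

```python
import numpy as np

rng = np.random.default_rng(0)
EPSILON = 0.1      # exploration rate (assumed value)
N_ACTIONS = 5

# Stand-in for a trained Q-network's output at the current observation.
q_values = rng.normal(size=N_ACTIONS)

# 1) Greedy (informative) action: never executed here, but exposed to teammates
#    as an additional input at the next time step.
greedy_action = int(np.argmax(q_values))

# 2) Standard epsilon-greedy action: the one actually executed in the environment.
if rng.random() < EPSILON:
    executed_action = int(rng.integers(N_ACTIONS))
else:
    executed_action = greedy_action

# Teammates condition on BOTH actions, so even a random executed action
# comes paired with an informative greedy signal.
teammate_input = {"executed": executed_action, "greedy": greedy_action}
print(teammate_input)
```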

With these two streams of input, agents do not need to choose between being informative (taking greedy actions) and exploring (taking random actions). This leads to state-of-the-art performance on Hanabi with 2, 4, and 5 players, as shown in Figure 3.

Figure 3: Learning curves of different algorithms averaged over 13 seeds per algorithm. 25 is the high score for this version of Hanabi. SAD achieves 23.81 and 23.01 for 4-player and 5-player scenarios, respectively, which is state of the art. (Image Source: [Hu and Foerster 2020])

This is one approach that uses Bayesian reasoning to implement Theory of Mind processes, specifically on the Hanabi task. Shih et al. also used Hanabi to assess their results, although they only looked at the two-player case due to the novelty of their method. In their experiments, they found that adaptation to a new partner was quicker than with other approaches, including several baselines and first-order MAML, an algorithm designed to adapt quickly and perform well on previously unseen tasks (Nichol et al. 2018). Although Shih et al.’s approach improves upon these other methods, there is no claim of state-of-the-art performance.
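For reference, our paraphrase of the first-order variant described by Nichol et al. is as follows: the parameters are adapted by $k$ inner gradient steps on a sampled task $\tau$, written $\tilde{\theta} = U_{\tau}^{k}(\theta)$, and the meta-update then applies the task loss gradient evaluated at the adapted parameters, ignoring how $\tilde{\theta}$ depends on $\theta$ (the second-order term MAML would keep):

$$
\tilde{\theta} = U_{\tau}^{k}(\theta), \qquad
\theta \leftarrow \theta - \beta \, \nabla_{\tilde{\theta}} \, \mathcal{L}_{\tau}\!\left(\tilde{\theta}\right)
$$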

Comparing SAD to Convention and Rule Transfer

SAD successfully uses information about other agents in the task to inform decision-making. However, it differs from Shih et al.’s approach in that it does not save this information for transfer to similar tasks or new partners. Shih et al.’s approach can transfer task modules to new partners seamlessly: the existing task action distribution already identifies promising actions (rule-dependent skills), and the new partner module is used to break symmetries (form conventions). Similarly, when presented with a new but similar task, the partner modules that were already trained can be used to settle symmetries more quickly. The agent still needs to learn the new task, which is done by fine-tuning the task module. With $2n$ partners, the task module is fine-tuned by training with partners 1 to $n$ and their respective partner modules. This makes coordinating with partners $n+1$ to $2n$ possible in a zero-shot manner, speeding up the process.
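Under our reading of this protocol, adaptation to a new task could look roughly like the sketch below, which reuses the hypothetical ModularPolicy from earlier: partner modules are frozen, only the task module is fine-tuned with the first $n$ partners, and the remaining partner modules are reused zero-shot. The training loop and `rollout_loss` are placeholders, not the authors’ code.

```python
import torch

def adapt_task_module(policy, new_task_env, train_partner_ids, episodes=1000, lr=1e-3):
    """Fine-tune only the task module on a new task; keep partner modules frozen.

    `policy` follows the ModularPolicy sketch above; `rollout_loss` stands in for
    whatever RL objective is used and is purely hypothetical.
    """
    # Freeze convention-dependent (partner) parameters.
    for module in policy.partner_modules:
        for p in module.parameters():
            p.requires_grad = False

    # Optimize only the rule-dependent (task-side) parameters.
    task_params = (list(policy.task_module.parameters())
                   + list(policy.task_action_head.parameters())
                   + list(policy.latent_head.parameters()))
    optimizer = torch.optim.Adam(task_params, lr=lr)

    for episode in range(episodes):
        partner_id = train_partner_ids[episode % len(train_partner_ids)]  # partners 1..n
        loss = rollout_loss(policy, new_task_env, partner_id)             # hypothetical RL loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Partner modules n+1..2n were untouched; pairing them with the fine-tuned
    # task module gives zero-shot coordination on the new task.
    return policy
```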

Let’s ground this in a real-world scenario: a relationship on a football team between an AI quarterback and a human receiver. Through repeated interactions, the quarterback and receiver learn that when the receiver reaches her release point (the point where the receiver breaks full speed), she makes eye contact with the quarterback, letting him know that she is ready. Learning a representation for this convention allows the agent to time his throw more accurately and make a successful play.

Now, let’s say the human and her AI companion leave the football field and go to the basketball courts. In a 2v2 game, the agent, who can’t shoot very well, must pass the ball to the human when she is open so that she can take a good shot. If the agent used SAD, it would use the information provided by its teammate’s executed and greedy actions to inform its own decisions on the court. If the agent were built with Shih et al.’s approach, the human and AI could reuse the convention they already learned while playing football, using the human’s eye contact to make well-timed passes. SAD may eventually learn this convention on the basketball court, but that work has already been done, so why do it again? This scenario demonstrates the importance of learning about the partners an agent interacts with; indeed, in their user study, Shih et al. found that transferring conventions to similar tasks is something humans actually do (you can read more about this in their paper).

Approaches like SAD successfully use information about others involved in the task; however, when similar tasks come up, work that was already done has to be repeated. Transferring task and partner modules, with their respective rule and convention representations, gets closer to the goal of collaboration: an interaction history with an agent produces shared knowledge, and that knowledge should be remembered and applied.

Other-Play

Approaches like SAD and Shih et al.’s rely heavily on learning conventions, but some approaches to adaptive human-AI collaboration don’t rely on conventions at all. For example, in “Other-Play” for Zero-Shot Coordination, the authors introduce an algorithm called other-play (OP). OP assumes that the cooperative Markov game comes with a set of known symmetries, which can be thought of as relabelings of states and actions that leave the game unchanged and therefore produce sets of equally good actions with no clear way to choose among them. Breaking these symmetries is a key problem in collaborative settings. OP attempts to solve it by finding a strategy that is maximally robust to partners breaking symmetries in different ways. Using reinforcement learning, OP “maximize[s] reward when matched with agents playing the same policy under a random relabeling of states and actions under the known symmetries.” (Hu et al. 2021)
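Concretely, if $\Phi$ denotes the known symmetries (sampled uniformly) and $J(\pi_1, \pi_2)$ the expected return when the two players follow $\pi_1$ and $\pi_2$, the other-play objective can be restated (our paraphrase of Hu et al.) as training a policy against symmetry-relabeled copies of itself:

$$
\pi^{\mathrm{OP}} = \arg\max_{\pi} \; \mathbb{E}_{\phi \sim \Phi} \left[ J\big(\pi, \, \phi(\pi)\big) \right]
$$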

The example the authors use to illustrate the algorithm is a two-player game with ten levers, nine of which provide a reward of 1.0 and one of which provides a reward of 0.9, as shown below in Figure 4.

Figure 4: Example of lever game used to test “other-play” (Image Source: [Hu et al. 2021])

However, the players only receive the reward if they both pull the same lever. In this scenario, the levers with a payoff of 1.0 are symmetric. With OP, agents do not coordinate on how they should break symmetries. Instead, both agents study the problem and realize that they are most likely to match if they both choose the single 0.9 lever, rather than each randomly selecting a 1.0 lever. Interestingly, this is the one action that algorithms such as SAD would never play: with SAD, agents coordinate on breaking symmetries during training because it uses self-play. OP is an alternative, a version of self-play in which agents do not coordinate on how to break symmetries. This is the key difference between OP and Shih et al.’s approach.
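A quick back-of-the-envelope calculation (ours) makes the argument explicit: if your partner’s labels for the nine interchangeable 1.0 levers are randomly permuted, committing to one of them matches only one time in nine, while the odd 0.9 lever is always identifiable.

```python
# Expected payoff in the 10-lever game when the partner's labels for the nine
# symmetric 1.0 levers are randomly permuted (the odd 0.9 lever is unaffected).
p_match_one_point_zero = 1 / 9                  # chance of picking the same 1.0 lever
expected_one_point_zero = p_match_one_point_zero * 1.0   # ~0.111
expected_zero_point_nine = 1.0 * 0.9                      # 0.9: always matches

print(f"commit to a 1.0 lever:  {expected_one_point_zero:.3f}")
print(f"commit to the 0.9 lever: {expected_zero_point_nine:.3f}")
# Other-play therefore converges to the 0.9 lever, the choice self-play never makes.
```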

Conventions in Shih et al.’s approach are the means by which an agent and its partner break symmetries. This is absent in OP, which relies on features of the problem description to drive coordination rather than on conventions formed with partners. OP shows promise on tasks such as Hanabi, but a method whose goal is collaboration should arguably involve some collaboration with partners. Human-AI collaboration demands quick adaptation to novel tasks and partners, and OP does provide a promising avenue for this. Shih et al.’s approach relies on translating partner behavior into a modular policy network, whereas OP attempts to solve the problem of zero-shot coordination, adapting to new situations with no shared history. These are two distinct approaches to advanced collaboration, and research is needed in both to see which has more potential.

Modular Policy Networks in Robots

The next work we discuss is designed to allow the transfer of information between tasks and robots. In their 2016 paper, Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer, Devin et al. introduce modular policy networks similar to the one Shih et al. used. While this approach is tailored to robotics, the problem is similar: avoid redundantly relearning useful task-specific and agent-specific information by making the transfer of that information mechanical.

Transfer Learning in Robots

In Devin et al.’s work, there is a module for each robot and a module for each task, each of which is a small neural network. Given a specific robot and a specific task (e.g., Robot 1 and Task 2), the corresponding modules are combined into a modular policy that takes in observations and outputs actions. By composing the appropriate task and robot modules into policies, the model can be trained end-to-end using back-propagation. The training process is shown in Figure 5.

Figure 5: The training process for robot transfer learning.

Zero-shot performance in this work is defined as performing well on unseen combinations of robots and tasks given that the model was already trained on a subset of all possible combinations.

This method allows policies to be decomposed into task-specific and robot-specific modules. Task-specific modules are shared across robots, and a robot-specific module is shared across all tasks involving that robot. This means task information can be shared among robots while robot information is shared across tasks; such information could include the goals of the task or the dynamics and kinematics of the robots. These tasks typically involve a single robot completing a goal, so collaboration is not the objective here.

Transfer Learning for Robots vs. Human-AI Collaboration

The goal of Devin et al.’s paper is not to improve collaboration among robots, but to use robots that have already been trained on certain tasks to assist in new robot-task combinations. This is clearly separate from the goal of Shih et al.’s paper, which is to improve collaboration between artificial agents and humans. Despite these significant differences, the approaches are similar in that both avoid repeating unnecessary work by reusing past knowledge. The means by which they do so differ in the input the agents receive.

A world is defined as a particular task-robot combination. The observation input of a world is split into intrinsic and extrinsic observations. Intrinsic observations are robot-specific, corresponding to the robot itself, while extrinsic observations correspond to the task being performed. This split is what later allows policies to be decomposed into task- and robot-specific parts. In contrast, Shih et al.’s approach assumes no split in the input, learning convention and rule representations from a single channel. This simplifies the process by putting off such separation until it becomes necessary.
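Below is a minimal sketch of how such a decomposition might be wired up, assuming (as we read Devin et al.) that the task module processes the extrinsic observation into an intermediate representation, which the robot module combines with the intrinsic observation to produce actions. All names, dimensions, and layer sizes are placeholders rather than the paper’s architecture.

```python
import torch
import torch.nn as nn

class TaskModule(nn.Module):
    """Maps extrinsic (task-related) observations to an intermediate representation."""
    def __init__(self, extrinsic_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(extrinsic_dim, 64), nn.ReLU(),
                                 nn.Linear(64, hidden_dim))

    def forward(self, extrinsic_obs):
        return self.net(extrinsic_obs)


class RobotModule(nn.Module):
    """Maps the intermediate representation plus intrinsic (robot-specific) observations to actions."""
    def __init__(self, intrinsic_dim, hidden_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim + intrinsic_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim))

    def forward(self, interface, intrinsic_obs):
        return self.net(torch.cat([interface, intrinsic_obs], dim=-1))


# One module per task and per robot; any (task, robot) pair composes into a policy.
tasks = {t: TaskModule(extrinsic_dim=10) for t in ["task1", "task2", "task3"]}
robots = {r: RobotModule(intrinsic_dim=6, hidden_dim=32, action_dim=4) for r in ["robot1", "robot2"]}

def policy(task, robot, extrinsic_obs, intrinsic_obs):
    return robots[robot](tasks[task](extrinsic_obs), intrinsic_obs)

# Training would use a subset of (task, robot) worlds end-to-end; an unseen pairing
# such as ("task3", "robot2") can then be evaluated zero-shot by composing its modules.
action = policy("task3", "robot2", torch.randn(1, 10), torch.randn(1, 6))
```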

Improvements to Human-AI Collaboration

Shih et al.’s approach to fast adaptation to new partners and tasks is clearly different from other approaches, and it shows potential for several collaborative tasks. By separating rules and conventions into their own representations, transferring them becomes much quicker, especially given the modular policy architecture. That said, there is still room for improvement in order to scale this approach to more than simple games and tasks.

Room for Improvement

A common training method in reinforcement learning is self-play, in which an agent repeats a task with copies of itself to learn which strategies produce the highest rewards, typically leading to strong performance on the task. In On the Utility of Learning About Humans for Human-AI Coordination, Carroll et al. found that agents trained via self-play do not perform as well when paired with humans as when paired with themselves, specifically in collaborative tasks (Carroll et al. 2019). The partners used in the experiments for Shih et al.’s paper were “either generated via self-play or were hand-designed to use varying conventions.” Since no agent in this model was trained with humans, Carroll et al. would predict that collaboration with humans would not be optimal. Shih et al.’s model clearly provides quicker adaptation to tasks and partners, but the ultimate goal is for this adaptation to work with all partners, most importantly humans. If training incorporated behaviors closer to human behavior, we could focus on the problems that still persist and take steps to solve them.

A fairly clear issue is scale. All three experiments in the paper involved no more than one additional partner. Additionally, due to the novelty of the approach and the lack of available benchmarks, the tasks tested were relatively simple: the authors ran experiments on two-player versions of a collaborative contextual bandit, a block-placing task, and Hanabi. The symmetries that appeared were thus limited. Furthermore, while the paper does show improvements on Hanabi, a fairly complex task, a natural next step would be to run experiments on new tasks with existing partners. The only experiments run tested adaptation to new partners, which shows promise with self-play partners but little more at the moment.

As mentioned above, part of the issue with scale is that available cooperative task environments are lacking, especially environments that enable convention tracking. In designing new environments for cooperative tasks, a natural extension of Shih et al.’s work would be to ask what really makes a task conducive to conventions. Is there a shared, quantifiable structure that requires conventions to succeed? If there is such a structure, how do we find it? Answering such questions would be a productive step in comparing the nature of the collaborative tasks mentioned in the papers above, such as Hanabi, Overcooked (Carroll et al.), and negotiation, so that more complex and interaction-rich tasks can be designed. It would also be useful to understand which tasks are better evaluators of human-AI collaboration, as opposed to just agent-agent collaboration.

Speaking of human-AI collaboration, there is considerable work to be done in improving these collaborative algorithms’ performance with real human subjects. Prior work such as OP and Eger et al.’s An Intentional AI for Hanabi documented the effectiveness of their approaches in zero-shot coordination and task performance, respectively. In particular, the latter discusses how designing agents with intentionality helped performance on the Hanabi task, while noting that conventions were still important for appropriate “optimal” play even with the boost that intentionality provided. Such discussion motivates the question of whether there are agent design principles that, employed in tandem with conventions, will significantly improve collaboration with real humans. It is also worth thinking about what properties of environments and artificial agents make it easier for humans to generalize between them, which will be important for applicability.

The goals and motivation of this paper are valuable for improving the capabilities of human-AI interaction. Transferring already-learned knowledge is essential for letting artificial agents adapt to the humans and environments around them. By starting with conventions and rules, Shih et al. provide an avenue for future research on the more general challenge of transferring information between tasks and partners. Successfully facilitating this transfer can yield a more capable agent that draws on its past to adapt to new people and new tasks. This type of advanced human-AI collaboration has the potential to change many aspects of society by combining the best attributes of humans and machines to address problems efficiently. Before this happens, however, we must continue to account for more layers of human cognition, such as conventions, when training AI agents. This line of work is a step toward that goal and makes a strong case for aspects of its approach to be used in future works.

References

[1] Shih et al. “On the Critical Role of Conventions in Adaptive Human-AI Collaboration” ICLR 2021.
[2] A. Shih “On the Critical Role of Conventions in Adaptive Human-AI Collaboration” iliad Blog Post, 2021.
[3] Baker et al. “Rational quantitative attribution of beliefs, desires and percepts in human mentalizing” Nature Human Behavior, 1(4):1-10, 2017.
[4] H. Hu, J. Foerster “Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning” ICLR 2020.
[5] Foerster et al. “Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning” arXiv:1811.01458v3 [cs.MA] (2019).
[6] Bard et al. “The Hanabi Challenge: A New Frontier for AI Research” arXiv:1902.00506v2 [cs.LG] (2019).
[7] Nayyar et al. “Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach” arXiv:1209.1695v1 [cs.SY] (2012).
[8] Nichol et al. “On First-Order Meta-Learning Algorithms” arXiv:1803.02999v3 [cs.LG] (2018).
[9] Hu et al. “‘Other-Play’ for Zero-Shot Coordination” arXiv:2003.02979v3 [cs.AI] (2021).
[10] Devin et al. “Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer” arXiv:1609.07088v1 [cs.LG] (2016).
[11] Carroll et al. “On the Utility of Learning About Humans for Human-AI Coordination” arXiv:1910.05789v2 [cs.LG] (2019).