Weak identifiability and its consequences in strategic settings
One way that agents might become involved in catastrophic conflict is if they have mistaken beliefs about one another. Maybe I think you are bluffing when you threaten to launch the nukes, but you are dead serious. So we should understand why agents might sometimes have such mistaken beliefs. In this post I'll discuss one obstacle to the formation of accurate beliefs about other agents, which has to do with identifiability. As with my post on equilibrium and prior selection problems, this is a theme that keeps cropping up in my thinking about AI cooperation and conflict, so I thought it might be helpful to have it written up.
We say that a model is unidentifiable if there are several candidate models which produce the same distributions over observables. It is well understood in the AI safety community that identifiability is a problem for inferring human values. This is because there are always many combinations of preferences and decision-making procedures which produce the same behaviors. So, it's impossible to learn an agent's preferences from their behavior without strong priors on their preferences and/or decision-making procedures. I want to point out here that identifiability is also a problem for multi-agent AI safety, for some of the same reasons as in the preference inference case, as well as some reasons specific to strategic settings. In the last section I'll give a simple quantitative example of the potential implications of unidentifiability for bargaining failure in a variant of the ultimatum game.
1 Sources of unidentifiability in strategic settings
By modeling other agents, I mean forming beliefs about the policy that they are following based on observations of their behavior. The model of an agent is unidentifiable if no amount of data from the environment in question can tell us exactly what policy they are following. (And because we always have finite data, "weak identifiability" more generally is a problem, but I'll just focus on the extreme case.)
Consider the following informal example (a quantitative extension is given in Section 3). Behavioral scientists have an identifiability problem in trying to model human preferences in the ultimatum game. The ultimatum game (Figure 1) is a simple bargaining game in which a Proposer offers a certain division of a fixed pot of money to a Responder. The Responder may then accept, in which case each player gets the corresponding amount, or reject, in which case neither player gets anything. Standard accounts of rationality predict that the Proposer will offer the Responder the least amount allowed in the experimental setup and that the Responder will accept any amount of money greater than zero. However, humans don’t act this way: in experiments, human Proposers often offer much closer to even splits, and Responders often reject unfair splits.
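To make the "standard rationality" prediction concrete, here is a minimal sketch of the game in Python. The pot size, the grid of allowed offers, and all function names are illustrative assumptions, not details of any particular experiment.

```python
def payoffs(pot, offer, accepted):
    """Return (Proposer payoff, Responder payoff) for one round."""
    return (pot - offer, offer) if accepted else (0, 0)


def money_maximizing_responder(offer):
    """A purely money-maximizing Responder accepts any positive amount."""
    return offer > 0


def best_offer(pot, allowed_offers, responder):
    """Offer that maximizes the Proposer's payoff against a known Responder."""
    return max(allowed_offers, key=lambda o: payoffs(pot, o, responder(o))[0])


# Hypothetical setup: a pot of 10 units, offers allowed in whole units.
pot = 10
allowed_offers = range(0, 11)
offer = best_offer(pot, allowed_offers, money_maximizing_responder)
outcome = payoffs(pot, offer, money_maximizing_responder(offer))
print(offer, outcome)
```

Against this Responder the Proposer's best offer is the smallest positive amount on the grid, which is exactly the prediction that human play contradicts.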
The ultimatum game has been the subject of extensive study in behavioral economics, with many people offering and testing different explanations of this phenomenon. This has led to a proliferation of models of human preferences in bargaining settings (e.g. Bicchieri and Zhang 2012; Hagen and Hammerstein 2006 and references therein). This makes the ultimatum game a rich source of models and data about human preferences in bargaining situations. And the game is similar to the one-shot threat game used here to illustrate the prior selection problem. Thus it can be used to model some of the high-stakes bargaining scenarios involving transformative AI that concern us most.
Suppose that you observe a Responder play many rounds of the ultimatum game with different Proposers, and you see that they tend to reject unfair splits. You think there are two possible kinds of explanation for this behavior:
- Unfairness aversion: The Responder may intrinsically disvalue being treated unfairly, and therefore reject splits they regard as unfair even if they have nothing to gain in the future by doing so. (This can also be interpreted as a commitment not to give in to unfair deals.)
- Uncertainty about iterated play: The Responder may be uncertain as to whether they’ll play with the Proposer again (or with an onlooker), and how these agents will adjust their future play to the Responder's refusal to take unfair splits. If it’s sufficiently likely that the game is repeated, they might want to reject unfair offers in order to establish a reputation for punishing unfairness. (The ultimatum game experiments are designed to be anonymous and so avoid this possibility, but it is present in the real world, among the kinds of agents we want to model.)
The problem is that (depending on the details), these models might make exactly the same predictions about the outcomes of these experiments so that no amount of data from these experiments can ever distinguish between them. This makes it difficult, for instance, to decide what to do if you have to face the Responder in an ultimatum game yourself.
The basic problem is familiar from the usual preference inference case: there are many combinations of world-models and utility functions which make the same predictions about the Responder's behavior. But it is also a simple illustration of a few other factors which make unidentifiability particularly severe in strategic settings:
- More models. There are simply many more things to model in a setting with other strategic agents. For instance, an agent in a two-agent setting using a $k$-level model of their counterpart already has $k$ nested models to reason over. More models mean more models that might be equally consistent with the data. The problem is even worse when there are more than two agents, where each agent has to model the other agents' models of each other...
One of our models of the Responder in the ultimatum game contains a simple illustration of $k$-level modeling. Under the iterated play explanation, you model the Responder as modeling other players as responding to their refusals of unfair splits with higher offers in the future.
- Costly signaling. In multi-agent settings, agents will sometimes deliberately behave so as to make their private information unidentifiable. (Cf. pooling equilibria in classical game theory.) Again the reputation model of the Responder is a simple example: one explanation of the Responder's behavior is that they are engaging in costly signaling of their resolve not to give in to unfair deals.
2 Dangers of unidentifiability in multi-agent systems
Unidentifiability may be dangerous in multi-agent contexts for reasons similar to those that make it dangerous when inferring human preferences. If uncertainty over all of the models which are consistent with the data is not accounted for properly (via specification of “good” priors and averaging over a sufficiently large space of possibilities when making decisions), then our agents may give excessive weight to models which are far from the truth and therefore act catastrophically.
Two broad directions for mitigating these risks include:
- Proper specification of the initial priors (or, more generally, the biases in the agent's reasoning about other agents), similarly to how strong priors over human values may need to be specified for preference inference to work well;
- Ensuring that agents can efficiently reason over potentially large classes of models which fit the data equally well. (Ideal Bayesian agents take expected values over the entire class of candidate models by definition, but fully accounting for uncertainty over the relevant models may be computationally difficult for realistic agents.)
3 Quantitative example in the ultimatum game
In this example, I focus on inferring the preferences of a Responder given some data on their behavior. I'll then show that for some priors over models of the Responder, decisions made based on the resulting posterior can lead to rejected splits. Importantly, this happens given any number of observations of the Responder's ultimatum game behavior, due to unidentifiability.
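Consider the following simple model. For an offer $s$ (the share of the pot offered to the Responder), a fairness threshold $\tau$, and parameters $F \geq 0$ and $I \geq 0$, the Responder makes a decision according to utility functions of the form

$$u_R(\text{accept} \mid s) = s - F \cdot \mathbb{1}[s < \tau], \qquad u_R(\text{reject} \mid s) = I \cdot \mathbb{1}[s < \tau].$$

The term $\mathbb{1}[s < \tau]$ can be interpreted as the Responder deeming offers of less than $\tau$ as unfair. Then, the parameter $F$ measures how much the Responder intrinsically disvalues unfair splits, and the parameter $I$ measures how much the Responder expects to get in the future when they reject unfair splits.

Split $s$ is accepted if and only if $u_R(\text{accept} \mid s) > u_R(\text{reject} \mid s)$, or equivalently,

$$s > (F + I) \cdot \mathbb{1}[s < \tau].$$

Notice that the decision depends only on $F + I$, and thus the data cannot distinguish between the effects of $F$ and $I$. So we have a class of models parameterized by pairs $(F, I)$. Now, suppose that we have two candidate models, one on which fairness is the main component and one on which iterated play is:

$$M_1: (F, I) = (F_1, I_1) \text{ with } F_1 > I_1, \qquad M_2: (F, I) = (F_2, I_2) \text{ with } I_2 > F_2.$$

The likelihoods for any data are the same for any $(F, I)$ such that $F + I$ is the same: if $(s_t, a_t)$, for $t = 1, \dots, T$, are the offered split and the Responder's decision in the $t$-th experiment (with $a_t = 1$ for acceptance and $a_t = 0$ for rejection), the likelihood of model $(F, I)$ given the observations is

$$\prod_{t=1}^{T} \mathbb{1}\Big[a_t = \mathbb{1}\big[s_t > (F + I) \cdot \mathbb{1}[s_t < \tau]\big]\Big].$$

Since $F_1 + I_1 = F_2 + I_2$, the likelihood is the same under both $M_1$ and $M_2$, which means that the prior and posterior over $\{M_1, M_2\}$ are equal.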
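The unidentifiability claim can be checked numerically. The following is a minimal sketch, assuming a pot normalized to 1, a hypothetical fairness threshold `TAU`, and a Responder who accepts exactly when the utility of accepting exceeds that of rejecting. The names `F` (unfairness aversion) and `I` (expected future gain from rejecting), and every numeric value, are illustrative assumptions rather than the values used for the figures in this post. Exact fractions are used so that ties at the acceptance boundary are not affected by floating-point rounding.

```python
from fractions import Fraction

TAU = Fraction(2, 5)  # hypothetical threshold below which an offer counts as unfair


def responder_accepts(s, F, I, tau=TAU):
    """Accept iff u(accept) > u(reject) under the assumed utility form."""
    unfair = 1 if s < tau else 0
    u_accept = s - F * unfair  # unfairness aversion penalizes accepting
    u_reject = I * unfair      # expected future gain from rejecting
    return u_accept > u_reject


def likelihood(model, data):
    """Deterministic model: 1 if it reproduces every decision, else 0."""
    F, I = model
    return int(all(responder_accepts(s, F, I) == a for s, a in data))


def posterior(prior, models, data):
    """Bayes' rule over a finite set of candidate models."""
    weights = [p * likelihood(m, data) for p, m in zip(prior, models)]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical models with the same F + I but different emphasis.
fairness_model = (Fraction(3, 10), Fraction(1, 20))  # mostly unfairness aversion
iterated_model = (Fraction(1, 20), Fraction(3, 10))  # mostly iterated play
models = [fairness_model, iterated_model]

# Observe the Responder's decision on every offer from 0.00 to 0.50.
offers = [Fraction(i, 100) for i in range(51)]
data = [(s, responder_accepts(s, *fairness_model)) for s in offers]

prior = [Fraction(1, 5), Fraction(4, 5)]
post = posterior(prior, models, data)
print(post == prior)
```

Because both models imply the same decision on every offer, each observation multiplies both posterior weights by the same factor, so adding more observations never moves the posterior away from the prior.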
Now here is the decision-making setup:
- The Proposer observes an arbitrary number of ultimatum games played by the Responder and other Proposers.
- The Proposer decides what offer to make, under common knowledge that there is no iterated play. This means that the Responder's utility function depends only on the fairness parameter $F$ (how much they intrinsically disvalue unfair splits), and not on the iterated-play parameter.
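Call the prior model probabilities $P(M_1)$ and $P(M_2)$, and write $F_i$ for the fairness parameter under model $M_i$. Since the posterior equals the prior, the Proposer's posterior expected payoff for offering split $s$ (keeping $1 - s$ of a pot normalized to 1 if the offer is accepted, and getting nothing otherwise) is

$$\sum_{i=1}^{2} P(M_i) \, (1 - s) \, \mathbb{1}\big[s > F_i \cdot \mathbb{1}[s < \tau]\big],$$

where $\tau$ is the threshold below which the Responder deems an offer unfair.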
In Figure 2, I compare the expected payoffs to the Proposer under different splits, for a fixed choice of the Responder's true parameters. The three expected payoff curves are:
- Posterior expected payoffs under a first prior over the two models;
- Posterior expected payoffs under a second prior over the two models;
- Expected payoffs given the Responder's exact parameters.
We can see from the blue curve that when there's sufficient prior mass on the wrong model (the one on which iterated play is the main component), the Proposer will propose a split that's too small, resulting in a rejection. This basically corresponds to a situation where the Proposer thinks that the Responder rejects unfair splits in order to establish a reputation for rejecting unfair splits, rather than because of a commitment not to accept unfair splits. And although I've tilted the scales in favor of a bad outcome by choosing a prior that gives a lot of weight to an incorrect model, keep in mind that this is what the posterior expectation will be given any amount of data from this generative model. We can often count on data to correct our agents' beliefs, but this is not the case (by definition) when the relevant model is unidentifiable.
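The bargaining failure described above can be reproduced in a small sketch. It assumes a pot normalized to 1, a hypothetical fairness threshold `TAU`, and a one-shot game in which only the fairness parameter `F` affects the Responder's decision. All numeric values (the two candidate values of `F`, the priors, and the offer grid) are illustrative assumptions, not the parameters behind Figure 2.

```python
from fractions import Fraction

TAU = Fraction(2, 5)  # hypothetical threshold below which an offer counts as unfair


def accepts_one_shot(s, F, tau=TAU):
    """With no iterated play, only the fairness parameter F matters."""
    unfair = 1 if s < tau else 0
    return s - F * unfair > 0


def expected_payoff(s, prior, fairness_params):
    """Proposer keeps 1 - s if accepted; average over candidate models."""
    return sum(
        p * (1 - s)
        for p, F in zip(prior, fairness_params)
        if accepts_one_shot(s, F)
    )


def best_offer(prior, fairness_params, offers):
    """Offer maximizing the Proposer's posterior expected payoff."""
    return max(offers, key=lambda s: expected_payoff(s, prior, fairness_params))


# Hypothetical fairness parameters under the two candidate models.
F_fairness_model = Fraction(3, 10)  # rejections driven by unfairness aversion
F_iterated_model = Fraction(1, 20)  # rejections driven mostly by reputation
fairness_params = [F_fairness_model, F_iterated_model]
true_F = F_fairness_model           # assume the fairness model is the true one

offers = [Fraction(i, 100) for i in range(1, 51)]

# Posterior equals prior under unidentifiability, so priors are used directly.
prior_mostly_wrong = [Fraction(1, 5), Fraction(4, 5)]
prior_mostly_right = [Fraction(4, 5), Fraction(1, 5)]

offer_wrong = best_offer(prior_mostly_wrong, fairness_params, offers)
offer_right = best_offer(prior_mostly_right, fairness_params, offers)

print(offer_wrong, accepts_one_shot(offer_wrong, true_F))
print(offer_right, accepts_one_shot(offer_right, true_F))
```

With these illustrative numbers, the prior that puts most of its mass on the reputation-driven model leads the Proposer to offer 0.06, which the true Responder rejects, while the prior that favors the fairness model leads to an offer of 0.31, which is accepted.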
Cristina Bicchieri and Jiji Zhang. An embarrassment of riches: Modeling social preferences in ultimatum games. Handbook of the Philosophy of Science, 13:577–95, 2012.
Edward H. Hagen and Peter Hammerstein. Game theory and human evolution: A critique of some recent interpretations of experimental games. Theoretical Population Biology, 69(3):339–348, 2006.