Individually incentivized safe Pareto improvements in open-source bargaining
Summary
Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures. More precisely, DCM constructs strategies that are plausibly individually incentivized, and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this.
This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we’ll present, and (iii) bargain poorly by default (since bargaining competence isn’t necessarily a direct corollary of intelligence-relevant capabilities).
Introduction
Early AGIs might fail to make compatible demands with each other in high-stakes negotiations (we call this a “bargaining failure”). Bargaining failures can have catastrophic consequences, including great power conflicts or an AI-triggered flash war. More generally, a “bargaining problem” is a situation in which multiple agents need to determine how to divide value among themselves.
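As a minimal illustration of a bargaining problem and of how incompatible demands produce a bargaining failure, consider a toy demand game. (The code below is an illustrative sketch; the function name `payoffs` and the numbers are ours and are not taken from DCM.) Each player demands a share of a unit surplus; if the demands are jointly infeasible, both receive a low conflict payoff.

```python
# Toy demand game (illustrative sketch, not from DCM): each player demands a
# share of a unit surplus. If the demands sum to more than 1, they are
# incompatible and both players receive a low "conflict" payoff instead.

def payoffs(demand_a: float, demand_b: float, conflict_payoff: float = 0.0):
    """Return (payoff_a, payoff_b) for one round of the demand game."""
    if demand_a + demand_b <= 1.0:
        return demand_a, demand_b            # compatible demands: each gets what they asked for
    return conflict_payoff, conflict_payoff  # incompatible demands: bargaining failure

# Agents designed separately might settle on different notions of a "fair" split:
print(payoffs(0.5, 0.5))  # (0.5, 0.5): compatible demands
print(payoffs(0.6, 0.6))  # (0.0, 0.0): both insist on 0.6, so both end up with the conflict payoff
```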
Early AGIs might possess insufficient bargaining skills because intelligence-relevant capabilities don’t necessarily imply these skills: For instance, being skilled at avoiding bargaining failures might not be necessary for taking over. Another problem is that there might be no single rational way to act in a given multi-agent interaction. Even arbitrarily capable agents might have different priors, or different approaches to reasoning under bounded computation. Therefore they might fail to solve equilibrium selection, i.e., make incompatible demands (see Stastny et al. (2021) and Conitzer & Oesterheld (2023)). What, then, are sufficient conditions for agents to avoid catastrophic bargaining failures?
Sufficiently advanced AIs might be able to verify each other’s decision algorithms (e.g. via verifying source code), as studied in open-source game theory. This has both potential downsides and upsides for bargaining problems. On one hand, transparency of decision algorithms might make aggressive commitments more credible and thus more attractive (see Sec. 5.2 of Dafoe et al. (2020) for discussion). On the other hand, agents might be able to mitigate bargaining failures by verifying cooperative commitments.
Oesterheld & Conitzer (2022)’s safe Pareto improvements1 (SPI) leverages transparency to reduce the downsides of incompatible commitments. In an SPI, agents conditionally commit to change how they play a game relative to some default such that everyone is (weakly) better off than the default with certainty.2 For example, two parties A and B who would otherwise go to war over some territory might commit to, instead, accept the outcome of a lottery that allocates the territory to A with the probability that A would have won the war (assuming this probability is common knowledge). See also our extended example below.
Oesterheld & Conitzer (2022) has two important limitations: First, many different SPIs are in general possible, such that there is an “SPI selection problem”, similar to the equilibrium selection problem in game theory (Sec. 6 of Oesterheld & Conitzer (2022)). And if players don’t coordinate on which SPI to implement, they might fail to avoid conflict.3 Second, if expected utility-maximizing agents need to individually adopt strategies to implement an SPI, it’s unclear what conditions on their beliefs guarantee that they have individual incentives to adopt those strategies.
So, when do expected utility-maximizing agents have individual incentives to implement mutually compatible SPIs? And to what extent are inefficiencies reduced as a result? These are the questions that we focus on here. Our main result is the construction of strategies that (1) are individually incentivized and (2) guarantee an upper bound on potential utility losses from bargaining failures without requiring coordination, under conditions spelled out later. This bound guarantees that especially bad conflict outcomes — i.e., outcomes that are worse for all players than any Pareto-efficient outcome — will be avoided when each agent chooses a strategy that is individually optimal given their beliefs. Thus, e.g., in mutually assured destruction, if both parties prefer yielding to any demand over total annihilation, then such annihilation will be avoided.
Importantly, our result:
- Applies to agents who individually optimize their utility functions;[4]
- does not require them to coordinate on a specific bargaining solution;
- places few requirements on their beliefs; and
- holds for any game of complete information (i.e., where the agents know the utilities of each possible outcome for all agents, and all of each other’s possible strategies). That said, we believe extending our results to games of incomplete information is straightforward.[5]
Our result does however require:
- Simultaneous commitments: Agents commit independently of each other to their respective strategies;[6],[7]
- Perfect credibility: It is common knowledge that any strategy an agent commits to is credible to all others — for instance because agents can see each other’s source code;
- Mild assumptions on players’ beliefs.[8]
The Pareto meet minimum bound
What exactly is the bound we’ll put on utility losses from bargaining failures? Brief background: For any game, consider the set of outcomes where each player’s payoff is at least as good as their least-preferred Pareto-efficient outcome — Rabin (1994) calls this set the Pareto meet. (See the top-right triangle in the figure below, which depicts the payoffs of a generic two-player game.) The Pareto meet minimum (PMM) is the Pareto-worst payoff tuple in the Pareto meet.
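Since the Pareto meet and the PMM are defined directly from the game’s payoffs, they are easy to compute for a small finite game. The sketch below (illustrative, not from DCM; the function names and example payoffs are ours) finds the Pareto-efficient outcomes of a two-player game given as a list of payoff tuples, then takes each player’s worst payoff among them.

```python
# Illustrative sketch (not from DCM): computing the Pareto meet minimum (PMM)
# of a small finite game given as a list of outcome payoff tuples.

from typing import List, Tuple

def pareto_frontier(outcomes: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Outcomes that no other outcome Pareto-dominates."""
    def dominates(x, y):
        return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))
    return [o for o in outcomes if not any(dominates(other, o) for other in outcomes)]

def pareto_meet_minimum(outcomes: List[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Each player's least-preferred payoff among the Pareto-efficient outcomes."""
    frontier = pareto_frontier(outcomes)
    n_players = len(outcomes[0])
    return tuple(min(o[i] for o in frontier) for i in range(n_players))

# Example: two efficient "agreement" outcomes and one inefficient conflict outcome.
outcomes = [(3.0, 1.0), (1.0, 3.0), (0.0, 0.0)]
print(pareto_meet_minimum(outcomes))  # (1.0, 1.0)
```

In this toy example the conflict outcome (0, 0) is worse for both players than either Pareto-efficient outcome, so it lies below the PMM of (1, 1); it is exactly this kind of outcome that the bound rules out.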
Our central claim is that agents will, under the assumptions stated in the previous section, achieve at least as much utility as their payoff under the PMM. The bound is tight: For some possible beliefs satisfying our assumptions, players with those beliefs are not incentivized to use strategies that guarantee strictly more than the PMM. Our proof is constructive: For any given player and any strategy this player might consider adopting, we construct a modified strategy such that 1) the player weakly prefers to unilaterally switch to the modified strategy, and 2) when all players modify their strategies in this way, they achieve a Pareto improvement guaranteeing at least the PMM. In other words, we construct an individually incentivized SPI as defined above.
Related work
Rabin (1994) and Santos (2000) showed that players in bargaining problems are guaranteed their PMM payoffs in equilibrium, i.e., assuming players know each other’s strategies exactly (which our result doesn’t assume). The PMM is related to Yudkowsky’s / Armstrong’s (2013) proposal for bargaining between agents with different notions of fairness.[9] While Yudkowsky (2013) similarly presents a joint procedure that guarantees players the PMM when they all use it, he does not prove (as we do) that under certain conditions players each individually prefer to opt in to this procedure.
In the rest of the post, we first give an informal intuition for the PMM bound, and then move on to proving it more rigorously.
Intuition for the PMM bound
This post continues on LessWrong here.
1. An outcome x is a Pareto improvement on another outcome y if all players weakly prefer x to y, and at least one player strictly prefers x over y. We’ll also sometimes say that x Pareto dominates y or that x is Pareto-better than y. An outcome is Pareto-efficient if it’s not possible to Pareto improve on it. The set of Pareto-efficient outcomes is called the Pareto frontier. We’ll say that x weakly Pareto improves on y if all players weakly prefer x to y.
2. “With certainty”: For any unmodified strategy each player might play.
3. For instance, in the example above, A might only agree to a lottery that allocates the territory to A with the probability that A wins the war, while B only agrees to a different lottery, e.g. one that allocates the territory with 50% probability to either player.
4. I.e., we don’t assume the agents engage in “cooperative bargaining” in the technical sense defined here.
5. I.e., by using commitments that conditionally disclose private information (see DiGiovanni & Clifton (2022)).
6. In particular, we require that each agent either (a) believes that which commitment they make gives negligible evidence about which commitments others may make, or (b) does not take such evidence into account when choosing what to commit to (e.g., because they are causal decision theorists).
7. Note that strategies themselves can condition on each other. For instance, in the Prisoner’s Dilemma, “Cooperate if {my counterpart’s code} == {my code}”. We discuss this in more detail in the section on “Program games.” (A minimal code sketch of such a program appears after these footnotes.)
8. Namely: To show the result, we will construct a way for each player to unilaterally modify their bargaining strategy. Then, the assumption is that no player believes the others will bargain more aggressively if they themselves modify their strategy this way. This is a mild condition since no player is better off bargaining less aggressively conditional on the modification being made.
9. See also Diffractor’s (2022) cooperative bargaining solution that uses the PMM as the disagreement point. Diffractor (2022) doesn’t address the problem of coordinating on their bargaining solution.
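As mentioned in footnote 7, strategies in a program game can condition on each other’s source code. Below is a minimal sketch of such a strategy in the Prisoner’s Dilemma (illustrative only; the name `clique_bot` is ours, and this is not the construction from DCM): a program is shown its counterpart’s source and cooperates exactly when that source matches its own.

```python
# Illustrative program-game sketch (not DCM's construction): a Prisoner's
# Dilemma strategy that is given its counterpart's source code and cooperates
# iff that source is identical to its own. Run as a script (inspect.getsource
# needs the function to live in a file).

import inspect

def clique_bot(counterpart_source: str) -> str:
    """Return "C" (cooperate) iff the counterpart's code equals my code."""
    my_source = inspect.getsource(clique_bot)
    return "C" if counterpart_source == my_source else "D"

# Each program is shown the other's source, then both play simultaneously.
source = inspect.getsource(clique_bot)
print(clique_bot(source), clique_bot(source))       # prints "C C": mutual cooperation
print(clique_bot("def defect_bot(_): return 'D'"))  # prints "D" against a different program
```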