A high-level model of AI bargaining
Advanced AIs might be capable of various credible commitments unavailable to humans, which they could use when bargaining with each other. “Bargaining” can sound like something pretty specific: haggling over (literal) prices. But in the sense discussed in, for instance, Schelling’s The Strategy of Conflict, “bargaining” refers to any attempt to resolve a dispute over resources: from algorithmic trading and litigation to diplomacy between national AGI projects and negotiations over norms for space settlement.
To reason clearly about interventions to mitigate conflict between AIs, I think it’s important to ground our research and strategy in a very general qualitative model of bargaining with commitments. This post sketches such a model, plus some more concrete examples of its building blocks.
I plan to explain some crucial implications in future writings. But as a teaser, this model doesn’t imply agents will play a Nash equilibrium!
Model
Where does this model come from? Basically, I started with the classic model from the open-source game theory (or “program equilibrium”) literature. Then, I relaxed several assumptions to allow for some realistic, strategically relevant dynamics. That said, I’ve glossed over some other important dynamics for ease of exposition. I’ll say more on these at the end of the post.
High-level setup
Two AI agents, Alice and Bob, interact over two phases: before vs. after some time T, defined below. It will help to start with the “after” phase.
Bargaining phase (after T): Alice and Bob bargain over some contested resource. Specifically, “bargaining” consists in credibly reporting to each other their (i) demands/offers and (ii) policies for which outside options they’d each take if bargaining failed (such as leaving the resource alone, or initiating conflict). Bargaining ends when they either:
- agree on compatible demands, and split the resource accordingly; or
- lock in incompatible demands (e.g., Alice wants 50% of the resource while Bob wants 70%), and each follow their outside option policy.
How each agent decides (i) and (ii) is determined by some procedure chosen before the bargaining phase, as follows.
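As a rough illustration of how the bargaining phase ends, here is a minimal Python sketch. The names `Report`, `OutsideOption`, and `resolve` are hypothetical, and the sketch makes two simplifying assumptions that are not part of the model itself: demands are shares of a resource normalized to 1, and each agent's outside option policy is collapsed into a single fixed choice.

```python
from dataclasses import dataclass
from enum import Enum


class OutsideOption(Enum):
    """Hypothetical outside options, taken from the examples in the text."""
    LEAVE_RESOURCE = "leave the resource alone"
    CONFLICT = "initiate conflict"


@dataclass(frozen=True)
class Report:
    """What one agent credibly reports during the bargaining phase."""
    demand: float                  # (i) share of the resource demanded, in [0, 1]
    outside_option: OutsideOption  # (ii) simplified here to one fixed choice


def resolve(alice: Report, bob: Report) -> dict:
    """Return the outcome once both agents' reports are locked in."""
    if alice.demand + bob.demand <= 1.0:
        # Compatible demands: split the resource accordingly.
        # (Assumption: any undemanded remainder is simply left unallocated.)
        return {"agreed": True, "alice_share": alice.demand, "bob_share": bob.demand}
    # Incompatible demands: each agent falls back on their outside option.
    return {
        "agreed": False,
        "alice_action": alice.outside_option,
        "bob_action": bob.outside_option,
    }


# The incompatible case from the text: Alice wants 50%, Bob wants 70%.
outcome = resolve(
    Report(0.5, OutsideOption.CONFLICT),
    Report(0.7, OutsideOption.LEAVE_RESOURCE),
)
```

In the model proper, the outside option is a policy that can depend on what the other agent reports, so a fuller sketch would make `outside_option` a function rather than a constant.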
Pre-bargaining phase (before T): Each agent might try to shape the other’s incentives by credibly committing to constraints on their procedure for deciding (i) and (ii) — e.g., committing to never accept less than 50%. So they need to decide:
1. which credible commitments[1][2] to make (if any) among the commitments they’re capable of, and when;
2. what information to gain/receive about each other’s commitments before making their own; and
3. what to do to prepare for (1) and (2), such as thinking more about possible strategies, or developing new capabilities for commitments or information control. (These capabilities can also change over time due to exogenous causes.)
The agents make these decisions under uncertainty about each other’s decisions, though they can resolve some of this uncertainty via (2).
Now for more details on how these commitments to bargaining procedures might work, and on the three actions above.
Bargaining programs
Each agent’s procedure for what they’ll do in the bargaining phase, called a program, takes as input information about the other agent’s program, as well as other features of the strategic situation.[3] (See (2) below for an example of a relevant “feature of the strategic situation”.)
As a very simplified example, Alice might follow the program: “If I can prove that Bob’s program would eventually accept my demands if I stuck to them, then I’ll demand 100%. Otherwise, I’ll accept no less than 50%, and fight if we disagree.” So, the AIs can implement conditional commitments, instead of necessarily either locking in rigid demands or conceding to whoever commits first.
Then, T is the first time both agents know which single program each other has committed to.[4]
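Alice's example program above could be sketched in Python as follows. This is only an illustration under strong assumptions: the "proof" that Bob's program would eventually accept is replaced by a hypothetical oracle function (`ConcessionOracle`), and `Stance` and `alice_program` are names invented for the sketch. How such a proof would actually be found is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for "Alice can prove that Bob's program would eventually
# accept demand d if she stuck to it". A real proof search is out of scope.
ConcessionOracle = Callable[[float], bool]


@dataclass(frozen=True)
class Stance:
    """The bargaining behavior Alice's program selects."""
    minimum_accepted: float           # Alice rejects any split giving her less
    fights_if_no_deal: Optional[bool]  # None: not specified by the prose example


def alice_program(bob_would_concede: ConcessionOracle) -> Stance:
    """Conditional commitment: Alice's stance depends on facts about Bob's program."""
    if bob_would_concede(1.0):
        # Provably safe to insist on everything, so demand 100%.
        return Stance(minimum_accepted=1.0, fights_if_no_deal=None)
    # Otherwise: accept no less than 50%, and fight if they disagree.
    return Stance(minimum_accepted=0.5, fights_if_no_deal=True)


def yielding_bob(demand: float) -> bool:
    """A Bob whose program gives in to any demand Alice sticks to."""
    return True


def firm_bob(demand: float) -> bool:
    """A Bob whose program never concedes more than half."""
    return demand <= 0.5


against_yielding = alice_program(yielding_bob)
against_firm = alice_program(firm_bob)
```

The point of the sketch is that `alice_program` takes information about Bob's program as input, so Alice's demands are neither rigidly fixed in advance nor automatically conceded to whoever commits first.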
Pre-bargaining actions
At any time t < T, each agent can do one of three actions:
1. Credibly commit to some constraints on the program they’ll use, including deciding on a single program. The agent makes these constraints verifiable and tries to communicate them to the other agent, possibly with some lag time.
   - These constraints can be fairly indirect, e.g., Alice might delegate her remaining pre-bargaining actions to a successor, or decide to follow a particular decision theory.
   - The idea is that the agent might want to:
     - make certain partial or “meta” commitments early on, but
     - delay committing to a single program until after learning or thinking more, or developing the capabilities for more advanced commitments.
2. Open or close information channels (e.g., some mutually trusted third-party server), through which they learn whether the other agent has made some credible commitment.
   - (More in-the-weeds note:) Why might Alice or Bob close their info channel? A full discussion of this is out of scope, but intuitively: Suppose Alice wants to follow some sophisticated program p after T, but can’t yet credibly commit to p. In the meantime, Bob might commit to high demands that are incompatible with those of p, but he’d only want to do so if he expects his commitment could influence her behavior. She might therefore try to show him that he can’t influence her, by closing her info channel until she can credibly commit to p. (Though of course, this is risky, because Bob might try to do the same!)
3. Prepare, i.e., anything other than the above. This could include thinking more, developing capabilities, etc.
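To make the timing concrete, the pre-bargaining phase can be sketched as a toy simulation that finds T. Everything here is an assumption made for illustration: the names `AgentState` and `first_T`, the representation of a commitment as a set of still-allowed program names, one agent acting per time step (the other implicitly prepares), and instant communication with no lag whenever the receiving agent's channel is open.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class AgentState:
    allowed_programs: Set[str]        # programs still permitted by commitments so far
    channel_open: bool = True         # can this agent learn of the other's commitments?
    knows_other_program: bool = False

    def commit(self, constraint: Set[str]) -> None:
        """Narrow the allowed set; earlier commitments cannot be undone."""
        narrowed = self.allowed_programs & constraint
        if not narrowed:
            raise ValueError("constraint contradicts earlier commitments")
        self.allowed_programs = narrowed

    @property
    def has_single_program(self) -> bool:
        return len(self.allowed_programs) == 1


Step = Tuple[str, str, Optional[Set[str]]]  # (agent name, action, constraint or None)


def first_T(schedule: List[Step], alice: AgentState, bob: AgentState) -> Optional[int]:
    """Return the first step at which both agents know each other's single program."""
    agents = {"alice": alice, "bob": bob}
    for t, (name, action, constraint) in enumerate(schedule):
        agent = agents[name]
        if action == "commit":
            agent.commit(constraint or set())
        elif action == "open":
            agent.channel_open = True
        elif action == "close":
            agent.channel_open = False
        elif action != "prepare":
            raise ValueError(f"unknown action: {action}")
        # An agent with an open channel learns the other's program once it is unique.
        for me, other in ((alice, bob), (bob, alice)):
            if me.channel_open and other.has_single_program:
                me.knows_other_program = True
        if alice.knows_other_program and bob.knows_other_program:
            return t
    return None  # T is never reached within this schedule


alice = AgentState({"p", "q", "r"})
bob = AgentState({"x", "y"})
schedule: List[Step] = [
    ("alice", "close", None),          # Alice shuts her channel early
    ("bob", "commit", {"x"}),          # Bob locks in a program Alice cannot yet see
    ("alice", "commit", {"p", "q"}),   # Alice makes a partial commitment
    ("alice", "commit", {"p"}),        # Alice settles on a single program
    ("alice", "open", None),           # Alice reopens her channel and learns Bob's
]
T = first_T(schedule, alice, bob)
```

The schedule mirrors the in-the-weeds note above: Alice closes her channel until she has committed to her preferred program, so Bob's early commitment cannot influence her choice.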
Concrete examples of aspects of this model
| Part of the model | Examples |
| --- | --- |
| What the AIs bargain over | Allocation of compute among stakeholders; legal settlements; compensation for AIs’ labor or commitments not to sabotage alignment; contents of a jointly trained successor AI’s model spec |
| Commitment (verification) capabilities the AIs might develop | Finetuning to implement a surrogate goal; interpretability tools to make commitments verifiable; scaffolding to cleanly separate “first-order” commitments from a conditional renegotiation wrapper |
| Partial commitments | Delegating future (pre-)bargaining decisions to a successor with a surrogate goal; delegating to successors trained to adhere to a particular decision theory; putting strong norms of impartiality/symmetry for bargaining in an auditable model spec |
| Information that programs might condition their behavior on | Whether their first-order demands would be incompatible; whether third parties are observing the interaction; whether the other program is unconditionally committed to certain demands |
Commentary on the model’s assumptions
The model makes the simplifying assumptions listed below. None of these are trivial. But overall, I think it will be fruitful to start by working out the main implications of the model as-is, and relax these assumptions from there.
- The agents only make commitments that are fully credible, and there’s no private information (in the technical sense from, e.g., Fearon (1995)).
- We should expect there to be not-fully-credible commitments and private information. But in general, I don’t expect these elements to have qualitatively different implications for the most promising interventions against AI conflict, like safe Pareto improvements.
- Only two agents interact at a time, which means the model doesn’t explicitly account for reputational incentives.
- Strategic choices by the AIs’ human developers (e.g., do they train AIs to have cooperative propensities?) aren’t represented.
However, we don’t assume the agents:
- are expected utility maximizers;
- are necessarily capable of credibly committing to complex counterfactual behaviors (e.g., that they have fine-grained self-modification abilities and the ability to verify each other’s “source code”);
- have fixed capabilities over time;
- make commitments simultaneously;
- make full commitments all at once, rather than a sequence of partial commitments as the agents gain information about each other;[5]
- follow, or will converge to, any particular decision theory (see here for more);
- can, or can’t, unilaterally force each other to verify their commitments; or
- have well-calibrated beliefs about each other’s commitments before time T, or otherwise coordinate on a Nash equilibrium (including a program equilibrium) or cooperative bargaining solution.
This last point is worth a closer look. Indeed, I think dropping the equilibrium assumption is one of the most important starting points for a good theory of AI bargaining. But we’ll get to that in another post.[6]
1. Echoing the safe Pareto improvements agenda post: “Commitments” are meant to include modifications to one’s decision theory or values/preferences. It has been argued (example) that decision theories like updateless decision theory (UDT) can sidestep the need for “commitments” in the usual sense. We’ll set this question aside here, and treat the resolution to make one’s future decisions according to UDT as a commitment in itself.
2. We define a commitment’s “credibility” relative to the set of agents the commitment needs to be made credible to. In some contexts, agents might want to make commitments that they can’t make credible to others. E.g., they might follow acausal decision theories and expect that if they commit to participate in evidential cooperation in large worlds (ECL), others are more likely to make the same commitment. These commitments are (vacuously) “credible”, because they don’t need to be made credible to anyone else.
3. This is inspired by the “program game” formalism of Tennenholtz (2004), but my model isn’t committed to the specific assumptions in that paper, most notably that players choose programs simultaneously. As described in the “Bargaining phase”, we allow for strategic decision-making to be carried out by the program itself, not just by the agent choosing the program.
4. More generally, we could define each agent’s subjective T as the first time after which (a) that agent has decided a single program and (b) they know the other agent’s single program. But as far as I can tell, the implications of the model aren’t sensitive to this.
5. As a point of contrast, the first five assumptions are made by this paper, which I nonetheless consider an important result in AI bargaining theory.
6. Thanks to Nathaniel Sauerberg for helpful comments.