The post Open positions: Summer Research Fellowship 2023 appeared first on Center on Long-Term Risk.
Your contributions to our research program will have a positive impact through their influence on our strategic direction, grantmaking, communications, events, and other activities. You will work autonomously on challenging research questions relevant to reducing suffering. You will become part of our team of intellectually curious, hardworking, and caring people, all of whom share a profound drive to make the biggest difference they can.
We are worried that some people might not apply because they wrongly believe they are not a good fit for working with us. While such a belief is sometimes true, it is often the result of underconfidence rather than an accurate assessment. We would therefore love to see your application even if you are not sure if you are qualified or otherwise competent enough for the positions listed. We explicitly have no minimum requirements in terms of formal qualifications and many of the past summer research fellows have had no or little prior research experience. Being rejected this year will not reduce your chances of being accepted in future hiring rounds. If you have any doubts, please don’t hesitate to reach out (see “Application process” > “Inquiries” below).
The purpose of the fellowship varies from fellow to fellow. In the past, we have often had the following types of people take part in the fellowship:
There might be many other good reasons for completing the fellowship. We encourage you to apply if you think you would benefit from the program, even if your reason is not listed above. In all cases, we will work with you to make the fellowship as valuable as possible given your strengths and needs. In many cases, this will mean focusing on learning and testing your fit for s-risk research, more than seeking to produce immediately valuable research output.
We don’t require specific qualifications or experience for this role, but the following abilities and qualities are what we’re looking for in candidates. We encourage you to apply if you think you may be a good fit, even if you are unsure whether you meet some of the criteria.
We encourage you to apply even if any of the below does not work for you. We are happy to be flexible for exceptional candidates, including when it comes to program length and compensation.
You can find an overview of our current priority areas here. However, if we believe that you can advance high-quality research relevant to s-risks, we are interested in creating a position for you. If you see a way to contribute to our research agenda or have other ideas for reducing s-risks, please apply. We commonly tailor our positions to the strengths and interests of the applicants.
We value your time and are aware that applications can be demanding, so we have thought carefully about making the application process time-efficient and transparent. We plan to make the final decisions between May 5 and May 10.
Stage 1: To start your application for any role, please complete our application form. As part of this form, we also ask you to submit your CV/resume and give you the opportunity to upload an optional research sample. The deadline is Sunday, April 2, 2023 end of day anywhere. We expect this to take around 2 to 3 hours if you are already familiar with our work. In the interest of your time, you do not need to polish the language of your answers in the application form.
Stage 2: By Friday, April 7, we will decide whether to invite you to the second stage. We will ask you to write a research proposal (up to two pages excluding references) and two research proposal sketches, to be submitted by Sunday, April 23 end of day anywhere. This means applicants will have two weeks to complete this stage, which we expect will take up to 12h of work. Applicants may therefore want to keep some time free during this period to work on this. Applicants will be compensated with £250 for their work on this stage.
Stage 3: By Friday, April 28, we will decide whether to invite you to an interview via video call during the week of May 1. By May 10, we will send out final decisions to applicants.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you want to send an email not accessible to the hiring committee, please contact Amrit Sidhu-Brar at amrit.sidhubrar@longtermrisk.org.
In addition to their salary, CLR offers the following benefits to all staff (including Summer Research Fellows):
We aim to combine the best aspects of academic research (depth, scholarship, mentorship) with an altruistic mission to prevent negative future scenarios. So we leave out the less productive features of academia, such as precarious employment and publish-or-perish incentives, while adding a focus on impact and application.
As part of our team, you will enjoy:
You will advance neglected research to reduce the most severe risks to our civilization in the long-term future. Depending on your specific project, your work will help inform our activities across any of the following paths to impact:
The post Annual Review & Fundraiser 2022 appeared first on Center on Long-Term Risk.
Our goal is to reduce the worst risks of astronomical suffering (s-risks). These are scenarios where a significant fraction of future sentient beings are locked into intense states of misery, suffering, and despair.^{1} We currently believe that such lock-in scenarios most likely involve transformative AI systems. So we work on making the development and deployment of such systems safer.^{2}
Concrete research programs:
Most of our work is research with the goal of identifying threat models and possible interventions. In the case of technical AI interventions (which is the bulk of our object-level work so far), we then plan to evaluate these interventions and advocate for their inclusion in AI development.
Alongside our research, we also run events and fellowships to identify and support people wanting to work on these problems.
Due to recent events, we have had a short-term funding shortfall. This caused us to reduce our original budget for 2023 by 30% and take various cost-saving measures, including voluntary pay cuts by our staff, to increase our runway to six months.
Our medium-term funding situation is hard to predict at the moment, as there is still a lot of uncertainty. We hope to gain more clarity about this in the next few months.
Our minimal fundraising goal is to increase our runway to nine months, which would give us enough time to try and secure a grant from a large institutional funder in the first half of 2023. Our main goal is to increase our runway to twelve months and roll back some of the budget reductions, putting us in a more comfortable financial position again. Our stretch goal is to increase our runway to fifteen months and allow for a small increase in team size in 2023. See the table below for more details.
Given the financial situation sketched above, we believe that CLR is a good funding opportunity this year. Whether it makes sense for any given individual to donate to CLR depends on many factors. Below, we sketch the main reasons donors could be excited about our work. In an appendix, we collected some testimonials by people who have a lot of context on our work.
Supporting s-risk reduction.
You might want to support CLR’s work because it is one of the few organizations addressing risks of astronomical suffering directly.^{5} You could consider s-risk reduction worthwhile for a number of reasons: (1) You find the combination of suffering-focused ethics and longtermism compelling. (2) You think the expected value of the future is not sufficiently high to warrant prioritizing extinction risk reduction over improving the quality of the future. (3) You want to address the fact that work on s-risks is comparatively neglected within longtermism and AI safety.
Since the early days of our organization, we have made significant progress on clarifying and modeling the concrete threats we are trying to address and coming up with technical candidate interventions (see Appendix).
Supporting work on addressing AI conflict.
Beyond the benefits for s-risk reduction, you might value some of our work because it addresses failure modes arising in multipolar AI scenarios more broadly (e.g., explored here, here). In recent years, we have helped to build up the field of Cooperative AI, intended to address these risks (e.g., Stastny et al. 2021).
Supporting work on better understanding acausal interactions.
Such interactions are possibly a crucial consideration for longtermists (see, e.g., here). Some argue that, when acting, we should consider the non-causal implications of our actions (see, e.g., Ahmed (2014), Yudkowsky and Soares (2018), Oesterheld and Conitzer (2021)). If this is the case, these effects could dwarf their causal influence (see, e.g., here). Better understanding the implications of this would then be a key priority. CLR is among the few organizations doing and supporting work on this (e.g., here).
Much of our work on cooperation in the context of AI plausibly becomes particularly valuable from this perspective. For instance, if we are to act so as to maximize a compromise utility function that includes the values of many agents across the universe^{6}, as the ECL argument suggests, then it becomes much more important that AI systems, even if aligned, cooperate well with agents with different values.^{7}
Supporting cause-general longtermism research.
CLR has also done important research from a general longtermist lens, e.g., on decision theory, metaethics, AI timelines, risks from malevolent actors, and extraterrestrial civilizations. Our Summer Research Fellowship has been a springboard for junior researchers who then moved on to other longtermist organizations (e.g., ARC, Redwood Research, Rethink Priorities, Longview Philanthropy).^{8}
To donate to CLR, please go to the Fundraiser page on our website.
For frequently asked questions on donating to CLR, see our Donate page.
This group is led by Jesse Clifton. Members of the group are Anni Leskelä, Anthony DiGiovanni, Julian Stastny, Maxime Riché, Mia Taylor, and Nicolas Macé.
Have we made relevant research progress?^{9}
We believe we have made significant progress (e.g., relative to previous years) on improving our expertise in the reasons why AI systems might engage in conflict and the circumstances under which technical work done now could reduce these risks. We’ve built up methods and knowledge that we expect to make us much better at developing and assessing interventions for reducing conflict. (Some of this is reflected in our public-facing work.) We have begun to capitalize on this in the latter part of 2022, as we’ve begun moving from improving our general picture of the causes of conflict and possibilities for intervention to developing and evaluating specific interventions. These interventions include surrogate goals, preventing conflict-seeking preferences, preventing commitment races, and developing cooperation-related content for a hypothetical manual for overseers of AI training.
The second main way in which we’ve made progress is the initial work we’ve done on the evaluation of large language models (LLMs). There are several strong arguments that those interested in intervening on advanced AI systems should invest in experimental work with existing AI systems (see, e.g., The case for aligning narrowly superhuman models). Our first step here has been to work on methods for evaluating cooperation-relevant behaviors and reasoning of LLMs, as these methods are prerequisites for further research progress. We are in the process of developing the first Cooperative AI dataset for evaluating LLMs as well as methods for automatically generating data on which to evaluate cooperation-relevant behaviors, which is a prerequisite for techniques like red-teaming language models with language models. We are preparing to submit this work to a machine learning venue. We have also begun developing methods for better understanding the reasoning abilities of LLMs when it comes to conflict situations in order to develop evaluations that could tell us when models have gained capabilities that are necessary to engage in catastrophic conflict.
Has the research reached its target audience?
We published a summary of our thinking (as of earlier this year) on when technical work to reduce AGI conflict makes a difference on the Alignment Forum/LessWrong, which is visible to a large part of our target audience (AI safety & longtermist thinkers). We have also shared internal documents with individual external researchers to whom they are relevant. A significant majority of the research that we’ve done this year has not been shared with target audiences, though. Much of this is work-in-progress on evaluating interventions and evaluating LLMs which will be incorporated into summaries shared directly with external stakeholders, and in some cases posted on the Alignment Forum/LessWrong or submitted for publication in academic venues.
What feedback on our work have we received from peers and our target audience?
Our Alignment Forum sequence When does technical work to reduce AGI conflict make a difference? didn’t get much engagement. We did receive some positive feedback on internal drafts of this sequence from external researchers. We also solicited advice from individual alignment researchers throughout the year. This advice was either encouraging of existing areas of research focus or led us to shift more attention to areas that we are now focusing on (summarized in “relevant research progress” section above).
Emery, Daniel^{10}, and Tristan work on a mix of macrostrategy, ECL, decision theory, anthropics, forecasting, AI governance, and game theory.
Have we made relevant research progress?^{11}
The main focus of Emery’s work in the last year has been on the implications of ECL for cause prioritization. This includes work on the construction of the compromise utility function^{12} under different anthropic and decision-theoretic assumptions, on the implications of ECL for AI design, and on more foundational questions. Additionally, Emery worked on a paper (in progress) extending our earlier Robust Program Equilibrium paper^{13}. She also did some work on the implications of limited introspection ability for evidential decision theory (EDT) agents, and some related work on anthropics.
Daniel left for OpenAI early in the year, but not before making significant progress building a model of ECL and identifying key cruxes for the degree of decision relevance of ECL.
Tristan primarily worked on the optimal spending schedule for AI risk interventions and the probability that an Earth-originating civilization would encounter alien civilizations. To that end, he built and published two comprehensive models.
Overall, we believe we made moderate research progress, but Emery and Daniel have accumulated a large number of unpublished ideas to be written up.
Has the research reached its target audience?
The primary goal of Tristan’s reports was to inform CLR’s own prioritization. For example, the existence of alien civilizations in the far future is a consideration for our work on conflict. That said, Tristan’s work on Grabby Aliens received considerable engagement on the EA Forum and on LessWrong.
As mentioned above, a lot of Emery and Daniel’s work is not yet fully written up and published. Whilst the target audience for some of this work is internal, it’s nevertheless true that we haven’t been as successful in this regard as we would like. We have had fruitful conversations with non-CLR researchers about these topics, e.g., people at Open Philanthropy and MIRI.
What feedback on our work have we received from peers and our target audience?
The grabby aliens report was well received and cited by S. Jay Olson (coauthor of a recent paper on extraterrestrial civilizations with Toby Ord), who described it as “fascinating and complete”, and Tristan has received encouragement from Robin Hanson to publish academically, which he plans to do.
Progress on all fronts seems very similar to last year, which we characterized as “modest”.
Have we increased the (qualityadjusted) size of the community?
Community growth has continued to be modest. We are careful in how we communicate about srisks, so our outreach tools are limited. Still, we had individual contact with over 150 people who could potentially make contributions to our mission. Out of those, perhaps five to ten could turn out to be really valuable members of our community.
Have we created opportunities for in-person (and in-depth online) contact for people in our community?
We created more opportunities for discussion and exchange than before. We facilitated more visits to our office, hosted meetups around EAG, and we ran an S-Risk Retreat with about 30 participants. We wanted to implement more projects in this direction, but our limited staff capacity made that impossible.
Have we provided resources for community members that make it more likely they contribute significantly to our mission?
Our team has continued to provide several useful resources this year. We administered the CLR Fund, which supported various efforts in the community. We provided ad hoc career advice to community members. We are currently experimenting with a research database prototype. We believe there are many more things we could be doing, but our limited staff capacity has held us back.
Guiding question: Are we a healthy organization with sufficient operational capacity, an effective board, appropriate evaluation of our work, reliable policies and procedures, adequate financial reserves and reporting, and high morale?
Our capacity is currently not as high as we would like it to be as a staff member left in the summer and we only recently made a replacement hire. So various improvements to our setup have been delayed (e.g., IT & security improvements, a systematic policy review, some visarelated issues, systematic risk management). That being said, we are still able to maintain all the important functions of the organization (i.e., accounting, payments, payroll, onboarding/offboarding, hiring support, office management, feedback & review, IT maintenance).
Members of the board: Tobias Baumann, Max Daniel, Ruairi Donnelly (chair), Chi Nguyen, Jonas Vollmer.
The main role of the board is to decide CLR’s leadership and structure, to resolve organizational conflicts at the highest level, as well as to advise and provide accountability for CLR leadership. The last was most important in 2022. They were involved in various decisions related to CLR’s new office, hiring/promotion, and overall financials. We generally considered their input to be valuable.
We collect systematic feedback on big communitybuilding and operations projects through surveys and interviews. We collect feedback on our research by submitting articles to journals & conferences and by requesting feedback on drafts of documents from relevant external researchers.
In 2022, we did not have any incidents that required a policydriven intervention or required setting up new policies. Due to a lack of operational capacity in 2022, we failed to conduct a systematic review of all of our policies.
See “Fundraising” section above.
We currently don’t track staff morale quantitatively. Our impression is that this varies significantly between staff members and is more determined by personal factors than organizational trends.
Our plans for 2023 fall into three categories.
Evaluating large language models. We will continue building on the work on evaluating LLMs that we started this year. Beyond writing up and submitting our existing results, the priority for this line of work is scoping out an agenda for assessing cooperationrelevant capabilities. This will account for work on evaluation that’s being done by other actors in the alignment space and possible opportunities for eventually slotting into those efforts.
Developing and evaluating cooperationrelated interventions. We will continue carrying out the evaluations of the interventions that we started this year. On the basis of these evaluations, we’ll decide which interventions we want to prioritize developing (e.g., working out in more detail how they would be implemented under various assumptions about what approaches to AI alignment will be taken). In parallel, we’ll continue developing content for an overseer’s manual for AI systems.
General s-risk macrostrategy. Some researchers on the team will continue spending some of their time thinking about s-risk prioritization more generally, e.g., thinking about the value of alternative priorities to our group’s current focus on AI conflict.
Emery plans to prioritize finishing and writing up her existing research on ECL. She also has plans for some more general posts on ECL, including on some common confusions, and on more practical implications for cause prioritization. Emery also plans to focus on finishing the paper extending Robust Program Equilibrium, and to further explore more object-level work.
Daniel no longer works at CLR but plans to organize a research retreat focused on ECL in the beginning of 2023.
Tristan broadly plans to continue strategy-related modeling, such as on the spread of information hazards. He also plans to help to complete a project that calculates the marginal utility of AI x-risk and s-risk work under different assumptions about AGI timelines, and to potentially contribute to work on ECL.
We had originally planned to expand our activities across all three community-building functions. Without additional capacity, we would have to curtail these plans.
Outreach. If resources allow, we will host another Intro Fellowship and Summer Research Fellowship. We will also continue our 1:1 meetings & calls. We also plan to investigate what kind of mass outreach within the EA community would be most helpful (e.g., online content, talks, podcasts). Without such outreach, we expect that community growth will stagnate at its current low rate.
Resources. We plan to create more long-lasting and low-marginal-cost resources for people dedicated to s-risk reduction (e.g., curated reading lists, career guide, introductory content, research database). As the community grows and diversifies, these resources will have to become more central to our work.
Exchange. If resources allow, we will host another S-Risk Retreat. We also want to experiment with other online and in-person formats. Again, as the community grows and diversifies, we need to find a replacement for more informal arrangements.
Nate Soares (Executive Director, Machine Intelligence Research Institute): "My understanding of CLR's mission is that they're working to avoid fates worse than the destruction of civilization, especially insofar as those fates could be a consequence of misaligned superintelligence. I'm glad that someone on earth is doing CLR's job, and CLR has in the past seemed to me to occasionally make small amounts of legible-to-me progress in pursuit of their mission. (Which might sound like faint praise, and I sure would endorse CLR more full-throatedly if they spent less effort on what seem to me like obvious dead ends, but at the same time it's not like anybody else is even trying to do their job, and their job is worthy of attempting. According to me, the ability to make any progress at all in this domain is laudable)"
Lukas Finnveden (Research Analyst, Open Philanthropy): “CLR’s focus areas seem to me like the most important areas for reducing future suffering. Within these areas, they’ve shown competence at producing new knowledge, and I’ve learnt a lot that I value from engaging with their research.”
Daniel Kokotajlo (Policy/Governance, OpenAI): “I think AI cooperation and s-risk reduction are high priority almost regardless of your values / ethical views. The main reason to donate to, or work for, CLR is that the best thinking about s-risks and AI cooperation happens here, better than the thinking at MIRI or Open Phil or anywhere else. CLR also contains solid levels of knowledge of AI governance, AI alignment, AI forecasting (less so now that I’m gone), cause prioritisation, metaethics, agent foundations, anthropics, aliens, and more. Their summer fellows program is high quality and has produced many great alumni. Their ops team is great & in general they are well-organized. I left CLR to join the OpenAI governance team because I was doing mostly AI forecasting which benefits from being in a lab — but I was very happy at CLR and may even one day return.”
Michael Aird (Senior Research Manager, Rethink Priorities): "I enjoyed my time as a summer research fellow at CLR in 2020, and I felt like I learned a lot about doing research and about various topics related to longtermist strategy, AI risk, and ethics. I was also impressed by the organization's culture and how the organization and fellowship was run, and I drew on some aspects of that when helping to design a research fellowship myself and when starting to manage people."
Testimonials by other Summer Research Fellows can be found here.
What follows below is a somewhat stylized/simplified account of the history of the Center on Long-Term Risk prior to 2022. It is not meant to capture every twist and turn.
2011–2016: Incubation phase
What is now called the “Center on Long-Term Risk” starts out as a student group in Switzerland. Under the name “Foundational Research Institute,” we do pre-paradigmatic research into possible risks of astronomical suffering and create basic awareness of these scenarios in the EA community. A lot of pioneering thinking is done by Brian Tomasik. In 2016, we coin the term “risks of astronomical suffering” (s-risks). Key publications from that period:
2016–2019: Early growth
More researchers join; the organization professionalizes and matures. We publish our first journal articles related to srisks. Possible interventions are being developed and discussed, surrogate goals among them. In 2017, we start sharing our work with other researchers in the longtermist community. That culminates in a series of research workshops in 2019. Key publications from that period:
2019–2022: Maturation
Before 2019, we were pursuing many projects other than research on s-risks. In 2019, this stops. We start focusing exclusively on research. Increasingly, we connect our ideas to existing lines of academic inquiry. We also start engaging more with concrete proposals and empirical work in AI alignment. Key publications from that period:
The post CLR Fundraiser 2022 appeared first on Center on Long-Term Risk.
For further details of CLR's progress in 2022, plans for 2023, and funding needs, please see our full fundraiser post.
The Swiss charity Effective Altruism Foundation (EAF) collects and processes donations through the below form on our behalf. Such donations will be used exclusively to support CLR.
For frequently asked questions on donating to CLR, see our Donate page.
Note: since the fundraiser is now over, any donations from now on will not be listed in the fundraiser donations list below.
| Name | Amount | Comment |
| --- | --- | --- |
| Simon Möller | CHF 15000 | |
| David Lechner | CHF 250 | |
| Swante Scholz | CHF 10000 | Donation for Center on Long-Term Risk (CLR) |
| Kwan Yee Ng | USD 7000 | |
| Spencer Pearson | USD 30 | |
| Althaus Silvia | CHF 5000 | |
| Markus Winkelmann | CHF 12000 | |
| Markus Winkelmann | CHF 500 | |
| Anonymous | USD 1000000 | |
| Anonymous | EUR 387000 | |
| Anonymous | USD 1000 | |
| Anonymous | USD 500 | |
| Anonymous | USD 1500 | |
| Anonymous | USD 40000 | |
| Jan Rüegg | CHF 4500 | |
| Adrian Hutter | CHF 9250 | |
| Patrick Levermore | GBP 10 | |
| Jonas Vollmer | USD 1000 | |
| Anonymous | GBP 3.13 | |
The post The optimal timing of spending on AGI safety work; why we should probably be spending more now appeared first on Center on Long-Term Risk.
When should funders wanting to increase the probability of AGI going well spend their money? We have created a tool to calculate the optimum spending schedule and tentatively conclude funders collectively should be spending at least 5% of their capital each year on AI risk interventions and in some cases up to 35%.
This is likely higher than the current AI risk community spending rate, which is at most 3%^{1}. In most cases, we find that the optimal spending schedule is between 5% and 15% better than the ‘default’ strategy of just spending the interest one accrues, and from 15% to 50% better than a naive projection of the community’s spending rate^{2}.
We strongly encourage users to put their own inputs into the tool to draw their own conclusions.
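To make this kind of comparison concrete, here is a minimal toy simulation of the trade-off. This is our own illustrative sketch, not the authors' actual tool: the interest rate, the diminishing-returns exponent `eta`, and the 20-year horizon are all assumed values chosen for illustration. Capital grows at a fixed interest rate, each year's spending buys "progress" with diminishing returns, and we compare a higher constant spending rate against the default of spending only the interest.

```python
def total_progress(spend_rate, years=20, interest=0.05, eta=0.5):
    """Simulate a constant-fraction spending schedule.

    Each year a fraction `spend_rate` of current capital is spent;
    spending s buys s**eta units of 'progress' (eta < 1 models
    diminishing returns to spending within a year).
    Returns cumulative progress after `years` years.
    """
    capital, progress = 1.0, 0.0
    for _ in range(years):
        spend = spend_rate * capital
        capital = (capital - spend) * (1 + interest)
        progress += spend ** eta
    return progress

# "Spend only the interest" keeps capital constant: rate = i / (1 + i).
interest_only = total_progress(0.05 / 1.05)
higher_rate = total_progress(0.10)
print(higher_rate / interest_only)  # modestly above 1 in this toy setup
```

With these particular toy parameters, the higher spending rate comes out roughly 10% ahead of the interest-only schedule, in the same ballpark as the 5% to 15% advantage reported above, though the exact figure depends entirely on the assumed inputs.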
The key finding of a higher spending rate is supported by two distinct models we have created: one that splits spending of capital into research and influence, and a second model (the ‘alternate model’) that supposes we can spend our stock of things that grow on direct work. We focus on the former, since its output is more obviously action-guiding^{3}; the latter is described in the appendix.
The table below shows our best guess for the optimal spending schedule using the former model when varying the difficulty of achieving a good AGI outcome and AGI timelines. We keep other inputs, such as diminishing returns to spending and the interest rate, constant^{4}.
Median AGI arrival:

| Difficulty of AGI success | 2030^{5} | 2040^{6} | 2050^{7} |
| --- | --- | --- | --- |
| Easy^{8} | Easy difficulty, 2030 median^{9} | Easy difficulty, 2040 median | Easy difficulty, 2050 median |
| Medium^{10} | Medium difficulty, 2030 median | Medium difficulty, 2040 median | Medium difficulty, 2050 median |
| Hard^{11} | Hard difficulty, 2030 median | Hard difficulty, 2040 median | Hard difficulty, 2050 median |
How much better the optimal spending schedule is compared to the 2%+2% constant spending schedule (within-model lower bound)^{12}:

| Difficulty of AGI success | Median AGI: 2030 | Median AGI: 2040 | Median AGI: 2050 |
| --- | --- | --- | --- |
| Easy | 37.6% | 18.4% | 11.8% |
| Medium | 39.3% | 14.9% | 12.0% |
| Hard | 12.3% | 5.85% | 1.55% |
Some of the critical limitations of our model include poorly modelling exogenous research, which is particularly important for those with longer timelines, and the fact that many parts of the model, such as diminishing returns, remain constant over time.
Further, we find that robust spending strategies, i.e. those that work in a wide variety of worlds, also support a higher spending rate. We show the results of a Monte Carlo simulation in the appendix^{13}.
Humanity might be living at a hinge moment in history (MacAskill, 2020). This is partly due to the unusually high level of existential risks (Ord, 2020) and, in particular, the significant probability that humanity will build artificial general intelligence (AGI) in the next decades (Cotra, 2022). More specifically, AGI is likely to make up a large fraction of extinction risks in the present and coming decades (Cotra, 2022) and stands as a strong candidate to influence the long-term future. Indeed, AGI might play a particularly important role in the long-term trajectory change of Earth-originating life by increasing the chance of a flourishing future (Bostrom, 2008) and reducing the risks of large amounts of disvalue (Gloor, 2016).
Philanthropic organisations aligned with effective altruism principles such as the FTX Foundation and Open Philanthropy play a crucial role in reducing AI risks by optimally allocating funding to organisations that produce research, technologies and influence to reduce risks from artificial intelligence. Figuring out the optimal funding schedule is particularly salient now, given the possibility of AI timelines under 10 years (Kokotajlo, 2022) and the substantial growth in effective altruism (EA) funding, roughly estimated at 37% per year from 2015 to 2021 for a total endowment of about $46B by the end of 2021 (Todd, 2021).
Previous work has emphasised the need to invest now in order to spend more later, due to low discount rates (Hoeijmakers, 2020; Dickens, 2020). This situation corresponds to a "patient philanthropist". Research has modelled the optimal spending schedule a patient philanthropist should follow if they face constant interest rates, diminishing returns and a low discount rate accounting for existential risks (Trammell, 2021). Extensions of the single-provider-of-public-goods model allowed the rate of existential risk to be time-dependent (Alaya, 2021) and included a trade-off between labour and capital, where labour accounts for movement building and direct work (Sempere & Trammell, 2021). Some models also discussed the trade-off between economic growth and existential risks by modelling the dynamics between safety technology and consumption technology with an endogenous growth model (Aschenbrenner, 2020) and an exogenous growth model (Trammell, 2021).
Without more specific quantitative models taking into account AI timelines, growth in funding, progress in AI safety and the difficulty of building safe AI, previous estimates of a spending schedule of just over 1% per year (Todd 2021, MacAskill 2022) are at risk of underperforming the optimal spending schedule by as much as 40%.
In this work, we consider a philanthropist or philanthropic organisation maximising the probability of humanity building safe AI. The philanthropist spends money to increase the stock of AI safety research and influence over AI development, which translates into an increased probability of successfully aligning AI or avoiding large amounts of disvalue. Our model takes into account AI timelines, the growth of capital committed to AI safety, diminishing returns in research and influence, as well as the competitiveness of influencing AI development. We also allow for the possibility of a fire alarm shortly before AGI arrives. Upon "hearing" the fire alarm, the philanthropist knows the arrival time of AGI and wants to spend all of their remaining money before that time. The philanthropist also has some discount rate due to other existential risks and to exogenous research that accelerates safety research.
Crucially, we have coded the model into a notebook accompanying this blog post that philanthropists and interested users can run to estimate an optimal spending schedule given their estimates of AI timelines, the difficulty of AI safety, capital growth and diminishing returns. Mathematically, finding the optimal spending schedule is an optimal control problem, giving rise to a set of nonlinear differential equations with boundary conditions that we solve numerically.
We discuss the effect of AI timelines and the difficulty of AI safety on the optimal spending schedule. Importantly, the optimal spending schedule typically ranges from 5% to 35% per year this decade, well above the current typical spending of EA-aligned funders. A funder should follow the most aggressive spending schedule this decade if AI timelines are short (2030) and safety is hard. An intermediate scenario yields a yearly average spending of ~12% over this decade. The optimal spending schedule typically performs between 5% and 15% better than the strategy of spending the endowment's rate of appreciation, and between 18% and 40% better than the current EA community spending of ~3% per year.
We suppose that a single funder controls all of the community's funding that is earmarked for AI risk interventions and that they set the spending rate for two types of interventions: research and influence. The funder's aim is to choose the spending schedule (how much they spend each year on each intervention) that maximises the probability that AGI goes successfully (e.g. does not lead to an existential catastrophe).
The ‘model’ is a set of equations (described in the appendix) and accompanying Colab notebook. The latter, once given inputs from the user, finds the optimal spending schedule.
We suppose that any spending is on either research or influence. Any money we don't spend is saved and gains interest. As well as investing money by traditional means, the funder is able to 'invest' in promoting earning-to-give, which historically has been the source of a large fraction of the community's capital.
We suppose there is a single number for each of the stocks of research and influence describing how much the community has of each^{14}.
Research refers to the community’s ability to make AGI a success given we have complete control over the system (modulo being able to delay its deployment indefinitely). The stock of research contains AI safety technical knowledge, skilled safety researchers, and safe models that we control and can deploy. Influence describes the degree of control we have over the development of AGI, and can include ‘soft’ means such as through personal connections or ‘hard’ means such as passing policy. Both research and influence contribute to the probability we succeed and the user can input the degree to which they are ‘substitutable’.
The equations modelling the time evolution of research and influence have the following features:
Any money we don’t spend appreciates. Historically the money committed to the effective altruism movement has grown faster than market real interest rates. The model allows for a variable real interest rate, which allows for the possibility that the growth of the effective altruism community slows.
We use the term preparedness at time t to describe how 'ready' we are if AGI arrived at time t. Preparedness is a function of research and influence: the more we have of each, the more prepared we are. The user inputs the relative importance of research and influence, as well as the degree to which they are substitutable.
We may find it useful to have money just before AGI takeoff, particularly if we have a 'fire alarm' period during which we know that AGI is coming soon and can spend most of our money on last-minute research or influence. The model allows for such last-minute spending on research and influence, and so one's money indirectly contributes to preparedness.
The probability of success given AGI arriving in year t is an S-shaped function of our preparedness. The model is not fixed to any definition of 'success', which could be, but is not limited to, "AGI not causing an existential catastrophe", "AGI being aligned to human values" or "preventing AI-caused s-risk".
Since we are uncertain when AGI will arrive, the model considers AGI timelines input from the user and takes the integral of the product of {the probability of AGI arriving at time t} and {the probability of success at time t given AGI at time t}.
The model also allows for a discount rate to account for nonAGI existential risks or catastrophes that preclude our research and influence from being useful or other factors.
The funder’s objective function, the function they wish to maximise, is the probability of making AGI go successfully.
The preceding qualitative description corresponds to a collection of differential equations that describe how the numerical quantities of money, research and influence change as a function of our spending schedule. We want to find the spending schedule that maximises the objective function; we call such a schedule the optimal spending schedule. We find it with tools from optimal control theory^{15}.
We first review the table from the start of the post, which varies AGI timelines and difficulty of an AGI success while keeping all other model inputs constant. We stress that the results are based on our guesses of the inputs (such as diminishing returns to spending) and encourage people to try the tool out themselves.
Figure caption: Yearly optimal spending schedule averaged over this decade, 2022-2030 (left), and the next, 2030-2040 (right). For each level of AI safety difficulty (easy, medium and hard columns) and each decade, we report the average spending rates on research and influence as a percentage of the funder's endowment.
We consider our best guess for the model’s parameters as given in the appendix (see “explaining and estimating the model parameters”). We describe the effects of timelines and the difficulty of AI safety on the spending schedule in this decade (20222030), the effects being roughly similar in the 2030 decade.
In most future scenarios we observe that the average optimal spending schedule is substantially higher than the current EA spending rate of around 1-3% per year. The most conservative spending schedule occurs when AI safety is hard with long timelines (2050), with an average spending rate of around 6.5% per year. The most aggressive spending schedule occurs when AI safety is hard and timelines are short (2030), with an average funding rate of about 35% per year.
For each level of difficulty and each AI timeline, the average allocation between research and influence looks balanced. Indeed, research and influence each account for roughly half of the total spending in each scenario. Looking closer at the results in the appendix (see "appendix results"), we observe that influence spending seems to decrease more sharply than research spending, particularly beyond the 2030 decade. This is likely caused by the sharp increase in the level of competition over AI development, making units of influence more costly relative to units of research. We want to emphasise, though, that the shares of influence and research in the total spending schedule could easily change with different diminishing-returns parameters for research and influence.
The influence of AI timelines on the optimal spending schedule varies across levels of difficulty but follows a consistent trend. Roughly, as AI timelines lengthen by a decade, the funder should decrease their average funding rate by 5 to 10 percentage points, unless AI safety is hard. If AI safety is easy, a funder should spend an average of ~25% per year for short timelines (2030), down to ~18% per year with medium timelines (2040) and ~15% for long timelines (2050). If AI safety difficulty is medium, then the spending schedule follows a similar downtrend, starting at about 30% with short timelines, down to ~12% with medium timelines and 10% with long timelines. If AI safety is hard, the decline in spending from short to medium timelines is sharper, starting at 35% per year with short timelines, down to ~8% with medium timelines and about 5% with long timelines.
Interestingly, conditioning on short timelines (2030), going from hard to easier AI safety difficulty decreases the spending schedule from ~35% to ~25%, but conditioning on medium (2040) or long (2050) timelines, going from hard to easier difficulty increases the spending schedule from 6% to 18% and from 9% to 15% respectively.
In summary, in most scenarios, the average optimal spending schedule in the current decade typically varies between 5% and 35% per year. With medium timelines (2040) the average spending schedule typically stays in the 10-20% range, moving up to the 20-35% range with short timelines (2030). The allocation between research and influence is balanced.
In this section, we show the effect of varying one parameter (or related combination) on the optimal spending schedule. The rest of the inputs are described in the appendix. We stress again that these results are for the inputs we have chosen and encourage you to try out your own.
Varying just the discount rate, we see that a higher discount rate implies a higher spending rate in the present.
Figure panels: low discount rate (left), standard (centre), high discount rate (right).
It seems plausible that the community and its capital are going through an unusually fast period of growth that will level off.^{16} When assuming a lower rate of growth, we see that the optimal spending schedule is lower, but still higher than the community's current allocation. In particular, we should be spending faster than we are growing.
Figure panels: highly pessimistic growth rate, 5% throughout (left); pessimistic growth rate, 10% current growth decreasing to 5% over the next five years^{17} (centre); our guess, 20% current growth decreasing to 8.5% over the next ten years^{18} (right).
We can compute the change in utility when the amount of funding committed to AI risk interventions changes. This is of relevance to donors interested in the marginal value of different causes, as well as philanthropic organisations that have not explicitly decided the funding for each cause area.
| Starting money multiplier | 0.001 | 0.01 | 0.1 | 0.5 | 1 | 1.1 | 1.5 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Absolute utility | 0.031^{19} | 0.044 | 0.092 | 0.219 | 0.317 | 0.332 | 0.386 | 0.668 |
| Multiple of 100%-money utility | 0.098 | 0.139 | 0.290 | 0.691 | 1 | 1.047 | 1.218 | 2.107 |
A different initial endowment has qualitative effects on the spending schedule. For example, comparing the 10% case to the 1000% case, we see that when we have more money we, unsurprisingly, spend at a much higher rate. This result is itself sensitive to the existing stocks of research and influence.
Figure panels: when we have 10% of our current budget of $4,000m (left); when we have 1000% of our current budget (right).
The spending schedule is not independent of our initial endowment. This is primarily driven by the S-shaped success function. When we have more money, we can beeline for the steep returns in the middle of the S-shape. When we have less money, we choose to save in order to reach this point later.
We see that, unsurprisingly, lower diminishing returns to spending suggest spending at a higher rate.
Figure panels: high diminishing returns^{20} (left); our guess^{21} (centre); low diminishing returns^{22} (right).
The relevant constant controls whether research becomes cheaper or more expensive as we accumulate more research. The former could describe a case where an increase in research leads to an increasing ability to parallelize research or break down problems into more easily solvable subproblems. The latter could describe a case where an increase in research leads to an increasingly bottlenecked field, where further progress depends on solving a small number of problems that are only solvable by a few people.
Figure panels: research is highly serial (left); default (centre); research is highly parallelizable (right).
We see that in a world where research is either highly serial or parallelizable, we should be spending at a higher rate than if it is, on balance, neither. The parallelizable result is less surprising than the serial result, which we plan to explore in later work.
A more nuanced approach would use a function such that the field can become more or less bottlenecked as it progresses and the price of research changes accordingly.
Using our parameters, we find the presence of a fire alarm greatly improves our prospects and, perhaps unexpectedly, pushes the spending schedule upwards. This suggests it is important both to be able to correctly identify the point at which AGI is close and to have a plan for the post-fire-alarm period.
Figure panels: no fire alarm (left); short fire alarm, during which funders spend 10% of their money over six months, giving 36% more utility than no fire alarm (centre); long fire alarm, during which funders spend 20% of their money over one year, giving 56% more utility than no fire alarm (right).
Increasing substitutability means that one (weight-adjusted^{23}) unit of research can be replaced by closer to one unit of influence to achieve the same level of preparedness^{24}.
Since, by our choice of inputs, we already have much more importance-adjusted research than influence^{25}, in the case where they are very poor substitutes we must spend at a high rate to acquire influence.
When research and influence are perfect substitutes, since research is 'cheaper'^{26} with our chosen inputs, the optimal spending schedule suggests that nearly all spending should be on research^{27}.
Figure panels: research and influence are very poor substitutes^{28}; poor substitutes^{29}; the standard case^{30}; perfect substitutes^{31}.
We make a note of some claims that are supported by the model. Since there is a large space of possible inputs we recommend the user specify their own input and not rely solely on our speculation.
Supposing the community indefinitely spends 2% of its capital each year on research and 2% on influence, the optimal spending schedule is around 30% better in the medium timelines, medium difficulty world.
Note: The default strategy is where you spend exactly the amount your money appreciates, and so your money remains constant. The greatest difference in utility comes from cases where it is optimal to spend lots of money now, for example in the (2030 median, hard difficulty) world, the optimal spending schedule is 15% better than the default strategy.
A wager is, e.g., thinking that 'although I think AGI is more likely than not in the next t years, it is intractable to increase the probability of success in the next t years, and so I should work on interventions that increase the probability of success in worlds where AGI arrives at some later time'. Saving money now, even though AGI is expected sometime soon, is only occasionally recommended by the model. One case occurs with (1) a sufficiently low probability of success but steep gains to this probability after some amount of preparedness that is achievable in the next few decades, (2) a low discount rate, and either (a) influence not getting too much more expensive over time or (b) influence not being too important.
Figure caption: A 'wager' on long timelines in a case with 2030 AGI timelines. This case has a low discount rate, hard difficulty^{32} and high substitutability of research and influence^{33}.
To some extent, there is a 'sweet spot' on the S-shaped success curve where we wager on long timelines. If we are able to push the probability of success into a region where the slope of the S-curve is large, we should spend at a high rate until we reach this point. If we are stuck on the flatter far-left tail, such that we would remain in that region regardless of any spending we do this century, we should spend at a more steady rate.
In some cases, we should 'wager' on shorter timelines by spending at a high rate now.
This trivially occurs, for example, if you have a very high discount rate. A more interesting case occurs when (1) influence is poorly substituted by research^{34} and either (a) influence depreciates quickly or (b) influence quickly becomes expensive.
Figure caption: A 'wager' on short timelines in a case with 2050 AGI timelines. This case has 'medium' difficulty and low substitutability of research and influence^{35}.
Since the opportunity to wager on short timelines is only available now, we believe more effort should go into investigating this wager.
We discuss the primary limitations here, and reserve some for the appendix. For each limitation, we briefly discuss how a solution would potentially change the results.
The model does not explicitly account for research produced exogenously (i.e., not as a result of our spending). For example, it is plausible that research produced in academia should be included in our preparedness.
Exogenous research can be (poorly) approximated in the current model in a few different ways.
First, one could suppose that research appreciates over time. This supposes that research being done by outsiders is (directly) proportional to the research 'we' already have (where, in this case, research done by outsiders is included in our research stock). Since we model exponential appreciation, appreciation leads to a research explosion. One could slow this research explosion by making the appreciation term sub-linear in the research stock.
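As an illustration of how a sub-linear appreciation term slows the explosion, here is a minimal sketch; the rate `g` and exponent `beta` are illustrative assumptions, not the post's calibrated values.

```python
from scipy.integrate import solve_ivp

# Hypothetical illustration: a research stock appreciating at rate g.
#   dR/dt = g * R        -> exponential "research explosion"
#   dR/dt = g * R**beta  -> slower, sub-exponential growth for beta < 1
g, beta = 0.05, 0.5
t_span, R0 = (0.0, 50.0), [1.0]

exp_growth = solve_ivp(lambda t, R: g * R, t_span, R0, dense_output=True)
sub_growth = solve_ivp(lambda t, R: g * R**beta, t_span, R0, dense_output=True)

R_exp = exp_growth.sol(50.0)[0]   # ≈ exp(0.05 * 50) ≈ 12.2
R_sub = sub_growth.sol(50.0)[0]   # ≈ (1 + g*t/2)**2 ≈ 5.1: polynomial, not exponential
```

The closed forms in the comments follow from solving each ODE directly; the point is only the qualitative gap between exponential and polynomial growth.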
Second, one could suppose that exogenous research sometimes solves the problem for us, making our own research redundant. This can be approximated by increasing the discount rate to account for the ‘risk’ that our own work is not useful. This is unrealistic in the sense that we are ‘surprised’ by some other group solving the problem.
A possible modification to the model would be to add a term to the research dynamics that accounts for the exogenous rate of growth of research. Alternatively, one could consider a radically different model of research that treats our spending on research as simply speeding up progress that would otherwise happen anyway (conditioning on no global catastrophe).
We expect this is the biggest weakness of the model, especially for those with long AGI timelines. To a first approximation, if there is little exogenous research we do not need to account for it, and if there is a lot then our own spending schedule does not matter. Perhaps we might think our actions can lead us to be in either regime and our challenge is to push the world towards the latter.
We may hope that some real-world interventions may delay the arrival of AGI, for example, passing policies to slow AI capabilities work. The model does not explicitly account for this feature of the world at all.
One extension to the model would be to make the length of the fire alarm period a function of the amount of influence we have. We expect this extension would imply an increase in the relative spending rate on influence. Another, more difficult extension would be to treat timelines as a function of influence, such that we can 'push' timelines down the road with more influence.
We expect that our ability to delay the arrival of AGI, particularly for shorter arrivals, is sufficiently minimal such that it would not significantly change the result. For longer timelines, this seems less likely to be the case.
AI capabilities and our research influence each other in the real world. For example, AI capabilities may speed up research with AI assistants. On the other hand, spending large amounts on AI interventions may draw attention to the problem and speed up AI capabilities investment.
We allow for depreciation of research, which can be used to model research becoming outdated as capabilities advance. One can also model research becoming cheaper over time^{36} to account for capabilities speeding up our research.
On balance, we expect this limitation to not have a large effect. If one expects a ‘slow AI takeoff’ with the opportunity to use highly capable AI tools, one can use the fire alarm feature and set the returns to research during this period to be high.
We model the returns to spending as constant across time. However, actual funders seem to be bottlenecked by vetting capacity and a lack of scalable, high-return projects, and so the returns to spending are likely to be high at the moment. Grantmakers can 'seed' projects and increase capacity, such that it seems plausible that diminishing returns to spending will decrease in the future.
However, the model input only allows for constant diminishing marginal returns.
The model could be easily extended to use a function such that marginal returns to spending on research and influence changed over time, similar to how the real interest rate changes over time. This would require more input from the user. Another extension could allow for the returns to be a function of how much was spent last year. However, such an extension would increase the model's complexity and decrease its usability, simplicity and (potentially) solvability.
This limitation also applies to other features of the model, such as other parameter values that are held constant over time.
Most existing applications of optimal control theory to effective altruismrelevant decisionmaking have used systems of differential equations that are analytically solvable and have guarantees of optimality. Our model has neither property and so we must rely on optimization methods that do not always lead to a global maximum.
There are around 40 free parameters that the user can set.
Many model features can be turned off. To turn off the following features:
One can set parameters such that the model is equivalent to the following system:
Some results from this system^{37}:
The current growth rate of our money is continuous. However, this poorly captures the case where most growth is driven by the arrival of new donors with lots of capital. Further, all growth is endogenous: it is always in proportion to our current capital.
One modification would be to model the arrival times of future funders by a stochastic process, for example a Poisson process. For example, take
where the additional term can model non-continuous growth of funding from the arrival of new funders.
Following some preliminary experiments with a deterministic flux of funders, we are skeptical that this would substantially change the recommendations of the current model.
We see two potential problems with this approach.
First, one may care about spending money on things other than making AGI go well. The model does not tell you how to trade off these outcomes. The model best fits into a portfolio approach to doing good, such as Open Philanthropy's Worldview Diversification. Alternatively, one may attach some value to having money left over post-AGI.
Second, there may be outcomes of intermediate utility between AGI being successful and not. A simple extension could consider some function of the probability of success. A more complex extension could consider the utility of AGI conditioned on its arrival time and our preparedness that accounts for nearmiss scenarios.
The funders have a stock of capital. This goes up in proportion to the real interest rate at time t, and down with spending on research and spending on influence.
The funders have a stock of research which goes up with spending on research and can depreciate over time.
Where
Similarly, funders have a stock of influence which obeys
with constants mutatis mutandis from the research-stock case, plus a competition factor that describes how the influence gained per unit of money changes over time due to competition effects. That is, over time, as the field of AGI influencers becomes more crowded, each unit of influence becomes more expensive.
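The three stock dynamics above can be sketched numerically. Everything below (the functional forms, interest rate, spending rules, diminishing-returns exponent, depreciation and competition factor) is an illustrative assumption, not the post's calibrated model:

```python
from scipy.integrate import solve_ivp

# Illustrative sketch of the three stocks: capital C, research R, influence I.
r = lambda t: 0.15                       # assumed real interest rate on capital
spend_R = lambda t: 0.02                 # assumed spending rates (fractions of capital)
spend_I = lambda t: 0.02
eta, delta = 0.5, 0.05                   # diminishing returns; depreciation rate
competition = lambda t: 1.0 + 0.05 * t   # influence gets pricier over time

def dynamics(t, y):
    C, R, I = y
    sR, sI = spend_R(t) * C, spend_I(t) * C
    dC = r(t) * C - sR - sI              # capital appreciates, minus spending
    dR = sR**eta - delta * R             # research: diminishing returns, depreciation
    dI = sI**eta / competition(t) - delta * I  # influence: also crowded out over time
    return [dC, dR, dI]

sol = solve_ivp(dynamics, (0, 20), [4000.0, 10.0, 5.0], dense_output=True)
C20, R20, I20 = sol.sol(20.0)            # stocks after 20 years
```

With these placeholder numbers capital compounds at the net 11% rate while both stocks grow despite depreciation; the notebook's real system adds the fire alarm and the control variables on top of this skeleton.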
We allow for the existence of an AGI fire alarm, which tells us exactly how many years away AGI is and during which we can spend a fraction of our money on research and influence.
We write expressions for the amounts of research and influence we have in expectation at AGI takeoff if the fire alarm occurred at a given time. Within the fire alarm period, we suppose that
The first and second assumptions allow for an analytical expression for the expected stocks at takeoff as functions of the stocks at the start of the fire alarm period.
We write a constant for the spending rate on research post-fire-alarm, defined in terms of the fraction of post-fire-alarm spending. The system
has an analytical solution and we take
Similarly for influence, we take the system
where the competition factor at the start of the fire alarm period is chosen by the user to be either a constant or a function of time. Since it enters the differential equation as a constant, the system has an analytical solution like that of the research system above. Again we take the analogous boundary value. Note that the user can state that no fire alarm occurs, in which case no post-fire-alarm spending takes place.
Our preparedness is given by
Preparedness is a constant elasticity of substitution (CES) production function of research and influence, where the user chooses the weight and substitutability parameters.
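A minimal sketch of such a CES combination, assuming the standard CES form; the weight `alpha` and substitutability `rho` below are placeholders, not the post's defaults:

```python
def preparedness(R, I, alpha=0.5, rho=-1.0):
    """CES combination of research R and influence I.
    alpha weights research; rho controls substitutability
    (rho -> 1: perfect substitutes; rho -> -inf: perfect complements).
    Parameter values here are illustrative."""
    if rho == 0:                       # Cobb-Douglas limit of the CES family
        return R**alpha * I**(1 - alpha)
    return (alpha * R**rho + (1 - alpha) * I**rho) ** (1.0 / rho)
```

A sanity check of the form: with equal inputs the CES returns that common value for any `rho` (e.g. `preparedness(4, 4) == 4`), and with `rho = 1` it reduces to the weighted sum.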
Conditioning on AGI happening at time t, we take the probability of AGI being safe as
This is a logistic function, with constants determined by the user's beliefs about the difficulty of making AGI safe.
Our objective is to maximize the probability that AGI is safe. We have an objective function
where the objective integrates over the user's AGI timelines and applies some discount rate.
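The objective can be sketched numerically: integrate the discounted logistic success probability against an AGI-timelines density. All constants below (the S-curve slope and midpoint, the discount rate, the timelines distribution and the placeholder preparedness path) are illustrative assumptions:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm

k, p0 = 1.0, 3.0                        # assumed slope / midpoint of the S-curve
delta = 0.005                           # assumed discount rate
timelines = lognorm(s=0.5, scale=18.0)  # assumed AGI-arrival density (years from now)

def p_success(preparedness):
    """Logistic probability of AGI going well given preparedness."""
    return 1.0 / (1.0 + np.exp(-k * (preparedness - p0)))

def preparedness_at(t):
    return 0.2 * t                      # placeholder preparedness growth path

objective, _ = quad(
    lambda t: timelines.pdf(t) * p_success(preparedness_at(t)) * np.exp(-delta * t),
    0, 100,
)
```

In the full model the preparedness path is itself determined by the spending schedule, which is what the optimal control machinery chooses.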
We have initial conditions
We apply standard optimal control theory results.
We have Hamiltonian
where the additional variables are the costate variables.
The optimal spending schedule, if it exists, necessarily follows
We solve this boundary value problem using SciPy’s solve_bvp function and apply further optimisation methods to avoid local optima.
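For readers unfamiliar with the interface, here is a toy boundary value problem solved with SciPy's `solve_bvp`; the post's actual state/costate system is far larger, so this only sketches the mechanics:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Toy BVP illustrating the solve_bvp interface:
#   y'' = -y,  y(0) = 0,  y(pi/2) = 1   ->   y = sin(x)
def rhs(x, y):
    # First-order system: y[0]' = y[1], y[1]' = -y[0]
    return np.vstack([y[1], -y[0]])

def bc(ya, yb):
    # Residuals of the two boundary conditions
    return np.array([ya[0], yb[0] - 1.0])

x = np.linspace(0, np.pi / 2, 20)
y_guess = np.zeros((2, x.size))
sol = solve_bvp(rhs, bc, x, y_guess)
y_mid = sol.sol(np.pi / 4)[0]   # should be close to sin(pi/4) ≈ 0.707
```

The real system's boundary conditions come from the initial stocks and the transversality conditions on the costates, and, as noted above, the solver can land in local optima, hence the extra optimisation passes.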
The model is a Python Notebook accessible on Google Colaboratory here.
Any cells that contain “User guide” are for assisting with the running of the notebook.
Below the initial instructions, you will find the user input parameters.
In the next section of this document we describe the parameters in detail and our own guesses.
We discuss the parameters in the same order as in the notebook.
Note, the estimates given are from Tristan and not necessarily endorsed by Guillaume.
Epistemic status: I’ve spent at least five minutes thinking about each, sometimes no more.
We elicit user timelines using two points on the cumulative distribution function and fit a lognormal distribution to them.
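Fitting a lognormal to two CDF points has a closed form, since the log of a lognormal variable is normal. A sketch (the two example points are illustrative, not our elicited timelines):

```python
import numpy as np
from scipy.stats import norm, lognorm

def fit_lognormal(q1, p1, q2, p2):
    """Fit a lognormal to two CDF points (q = years from now, p = P(T <= q)).
    If T ~ Lognormal(mu, sigma) then log T ~ Normal(mu, sigma), so the two
    conditions are linear in (mu, sigma)."""
    z1, z2 = norm.ppf(p1), norm.ppf(p2)
    sigma = (np.log(q2) - np.log(q1)) / (z2 - z1)
    mu = np.log(q1) - sigma * z1
    return lognorm(s=sigma, scale=np.exp(mu))

# e.g. 25% probability by 8 years from now, 50% by 18 years (illustrative)
dist = fit_lognormal(8, 0.25, 18, 0.50)
```

By construction the fitted distribution passes exactly through both elicited points.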
We note Metaculus' Date of Artificial General Intelligence community prediction: as of 2022-10-06, lower quartile (25%) 2030, median 2040 and upper quartile (75%) 2072.
Note that since our lognormal distribution is parameterised by two pairs of (year, probability by year), the three distinct Metaculus interquartile pairings will give different distributions.
The discount rate needs to factor in both nonAGI existential risks as well as catastrophic (but nonexistential) risks that preclude our AI work from being useful or any unknown unknowns that have some per year risk.
We choose a discount rate implying that an AGI success in 2100 is worth a given fraction of a win today. As we discuss in the limitations section, the discount rate can also account for other people making AGI successful, though this interpretation is not unproblematic.
Of relevance may be:
Our 90% confidence interval for is
As of 2022-10-06, Forbes estimates Dustin Moskovitz and Sam Bankman-Fried have wealth of $8,200m and $17,000m respectively. Todd (2021) estimates $7,100m from other sources in 2021, giving a total of $32,300m within the effective altruism community.
How much of this is committed to AI safety interventions?
Open Philanthropy has spent $157m on AI-related interventions, out of approximately $1,500m spent so far. Supposing that roughly 15% of all funding is committed to AGI risk interventions gives at least $4,000m.
Our 90% confidence interval is .
We suppose that we are currently at some interest rate
Supposing the movement had $10,000m in 2015 and $32,300m in mid-2022, money in the effective altruism community has grown at roughly 21% per year.
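As a back-of-the-envelope check of the quoted ~21% figure (the compounding window is our assumption, and the result is sensitive to it):

```python
# Back-of-the-envelope check of the quoted growth rate (illustrative only).
start, end = 10_000, 32_300              # $m in 2015 and mid-2022

def annual_growth(start, end, years):
    """Compound annual growth rate over the given window."""
    return (end / start) ** (1 / years) - 1

# The quoted ~21%/yr corresponds to roughly a six-year compounding window;
# using the full 7.5 years from start-2015 to mid-2022 gives a lower figure.
g6 = annual_growth(start, end, 6.0)      # ≈ 0.216
g75 = annual_growth(start, end, 7.5)     # ≈ 0.169
```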
We take . Our 90% confidence interval is .
Historical S&P returns are around 8.5%. There are reasons to think the long-term rate may be higher (such as an increase in growth due to AI capabilities) or lower (there is a selection bias in choosing a successful index). We take . Our 90% confidence interval is .
Our 90% confidence interval is .
Influence
The constant controls the marginal returns to spending on influence. For values less than one, we receive diminishing marginal returns.
The top fraction of spending per year on influence leads to a corresponding fraction of the increase in growth of influence in that year. For example, one value implies the top 20% of spending leads to roughly 80% of returns, i.e. the Pareto principle.
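The Pareto-principle example corresponds to solving 0.2^eta = 0.8 for the returns exponent, assuming growth scales as spending^eta so that the highest-return fraction p of spending contributes p^eta of the returns:

```python
import math

# With growth proportional to (spending)**eta, the highest-return fraction p
# of spending contributes (p*S)**eta / S**eta = p**eta of the total.
# Solving 0.2**eta = 0.8 recovers the "80/20" exponent mentioned in the text.
eta = math.log(0.8) / math.log(0.2)      # ≈ 0.139
share_of_top_20pct = 0.2 ** eta          # = 0.8 by construction
```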
We note that influence spending can span many orders of magnitude and this suggests reason to think there are high diminishing returns (i.e. low ). For example, one community builder may cost on the order of per year, but investing in AI labs with the purpose of influencing their decisions may cost on the order of per year.
We take a value which implies that doubling spending on influence leads to a corresponding multiple of growth in influence. Our 90% confidence interval is .
Research
The constant acts in the same way for research as does for influence.
We take , which implies a doubling of research spending leads to a corresponding increase in research growth, and that 20% of the spending on research accounts for a given fraction of the increase in research growth.
Potential sources for estimating include using the distribution of karma on the Alignment Forum, citations in journals or estimates of researchers’ outputs.
Our 90% confidence interval is .
Influence
For the price is constant.
On balance, we think the former reasons outweigh the latter, and so take a value under which a doubling of influence changes the growth of influence per unit of spending by a corresponding factor. Our 90% confidence interval is .
Research
The constant acts in the same way for research as its counterpart does for influence.
We are uncertain about the net effect of the above contributions, and so take . Our 90% confidence interval is .
Influence
We take , which implies a half-life of around years. Our 90% confidence interval is .
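For exponential depreciation at rate d, the half-life is ln 2 / d. A quick converter, using an illustrative rate rather than the post's chosen value (which is elided in this copy):

```python
import math

def half_life(depreciation_rate):
    """Years until half the stock remains, assuming exponential decay exp(-d * t)."""
    return math.log(2) / depreciation_rate

# Illustrative rate only, not the post's chosen value.
print(round(half_life(0.10), 1))  # 6.9
```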
Research
We expect research to depreciate over time. Research can depreciate by:
One intuition pump is to ask what fraction of research on current large language models will be useful if AGI does not come until 2050? We guess on the order of 1% to 30%, implying (if all our research were on large language models) a value of between and . Note that for , such research can be instrumentally useful for later years due to its ability to make future work cheaper by, for example, attracting new talent.
We take . Our 90% confidence interval is .
We allow for influence to become more expensive over time. The primary mechanisms we can see are (a) competition with other groups that want to influence AI developers, and (b) competition within the field of AI capabilities, such that there are more organisations that could potentially develop AGI.
We suppose the influence per unit spending decreases over time following some S-shaped curve, and ask for three points on this curve.
The first data point is the first year in which money was spent on influence. Since one can consider community building or spreading AI risk ideas (particularly among AI developers) as a form of influence, the earliest year of spending is unclear. We take 2015 (the first year Open Philanthropy made grants in this area). The relative cost of influence is set to 1 in this year.
We then require two further years, as well as the influence per unit spending in those years relative to the first year of spending.
Our best guess is (2017, 0.9) and (2022, 0.6); that is, in 2017 one received 90% as much influence per unit spending as one would have done in 2015.
The final input is the minimum influence per unit spending that will be reached. We take this to be 0.02. That is, influence will eventually be more expensive per unit than it was in 2015. Our 90% confidence interval is (0.001, 0.1).
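One way to pin down such a curve is to assume a logistic decline toward the 0.02 floor and solve its slope and midpoint from the two guessed points. The functional form and solving approach here are assumptions; the notebook may parameterise the curve differently:

```python
import math

FLOOR = 0.02  # long-run minimum influence per unit spending, relative to 2015

def inv_logistic(c):
    """Invert c = FLOOR + (1 - FLOOR) / (1 + e^x) for x."""
    s = (c - FLOOR) / (1 - FLOOR)
    return math.log(1 / s - 1)

# Solve the slope k and midpoint t0 from the two guessed data points.
x1, x2 = inv_logistic(0.9), inv_logistic(0.6)
k = (x2 - x1) / (2022 - 2017)
t0 = 2017 - x1 / k

def relative_cost_of_influence(t):
    return FLOOR + (1 - FLOOR) / (1 + math.exp(k * (t - t0)))

print(round(relative_cost_of_influence(2017), 2))  # 0.9
print(round(relative_cost_of_influence(2022), 2))  # 0.6
print(round(relative_cost_of_influence(2015), 2))  # 0.95, near the 2015 baseline of 1
```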
The model uses this data to calculate the quantities of research and influence we have now. Rough estimates are sufficient.
The Singularity Institute (now MIRI) was founded in 2000 and switched to work on AI safety in 2005. We take 2005 as the first year of spending on research.
Open Philanthropy has donated $243.5m to risks from AI since its first grant in the area in August 2015. We very roughly categorised each grant by its research : influence fraction, and estimate that $132m has been spent on research and $111m on influence. We suppose that Open Philanthropy has made up two-thirds of the overall spending, giving totals of $198m and $167m.
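The scaling step is simple arithmetic: if Open Philanthropy is two-thirds of all spending, multiply its totals by 3/2:

```python
op_research, op_influence = 132, 111  # $m, rough categorisation of Open Phil grants

# Open Phil assumed to be two-thirds of all spending, so scale its totals by 3/2.
total_research = op_research * 3 / 2
total_influence = op_influence * 3 / 2
print(total_research)   # 198.0
print(total_influence)  # 166.5, rounded up to $167m in the text
```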
We guess that research spending, which started in 2005, has been growing at 25% per year, and that influence spending has been growing at 40% per year since starting in 2015.
By default, in the results we show, we assume no fire alarm in the model. This is achievable by setting the expected fraction of money spendable to 0. When considering the existence of a fire alarm, we take the following values.
For the fire alarm duration we ask: supposing that the leading AI lab has reached an AGI system that it is not deploying out of safety concerns, how far behind is a less safety-conscious lab? We guess this period to be half a year.
Our 90% confidence interval for this period, if it exists, is (one month, two years).
One may think that the expected fraction of money spendable during the fire alarm is less than 1 for reasons such as:
We take 0.1 as the expected fraction of money spendable with 90% confidence interval (0.01, 0.5).
During the fire alarm period, we enter a phase with no appreciation or depreciation of research or influence, and a separate set of marginal returns to spending ( and ) can apply.
Some reasons to think (worse returns during the fire alarm period):
Some reasons to think (better returns during the fire alarm period):
We expect that the returns to research spending will be very low, and take , implying that the amount of research we can do in the post-fire-alarm period is not very dependent on the money we have.
We expect that returns to influence spending will be lower than in the period before. We take .
In the fire alarm phase, the cost per unit of research and influence can also change depending on the amount we already have.
We expect and . That is, during the fire alarm period it is even cheaper to get influence once you already have some than it was before this phase, and this effect is greater during the fire alarm period (the first inequality). If there is panic, it seems people will be looking for trustworthy organisations to defer to and to execute a plan.
We expect . That is, during the fire alarm period the amount of existing research decreases the cost of new research.
During the fire alarm period, it seems likely that only a few highly skilled researchers (perhaps within the AI lab) will have access to the information and tools necessary to conduct further useful research. The research at this point is likely highly serial, with the researchers trying to focus on the biggest problems. Existing research may allow these few researchers to build on existing work effectively.
We take both and to be 0.3, implying that a doubling of research before the fire-alarm period increases the stock output during the fire-alarm period by times.
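With an exponent of 0.3, the effect of doubling pre-fire-alarm research is 2^0.3:

```python
elasticity = 0.3  # exponent on the pre-fire-alarm research stock
print(round(2 ** elasticity, 2))  # 1.23: a doubling raises output by about 23%
```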
Preparedness at time is a function of the fire-alarm-adjusted research and influence that takes two parameters: the share parameter , which controls the relative importance of research and influence, and the parameter , which controls the substitutability of research and influence.
Decreasing decreases the substitutability of research and influence. In the limit as , our preparedness can be entirely bottlenecked by the stock we have the least of (weighted by ).
gives the Cobb-Douglas production function, though to avoid a case-by-case situation in the programming, you cannot set and instead can choose a value close to .
Again, we recommend picking values and running the cell to see the graph. We choose . We think that the problem is mainly a technical problem, but in practice cannot be solved without influencing AI developers.
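A sketch of a CES preparedness function matching this description (the alpha and rho values below are placeholders, not the post's chosen ones):

```python
def preparedness(research, influence, alpha=0.5, rho=-1.0):
    """CES aggregate of the two stocks.

    alpha weights research vs influence; rho controls substitutability.
    rho -> 0 approaches Cobb-Douglas (excluded exactly, as in the notebook);
    strongly negative rho approaches a Leontief, min-like bottleneck.
    """
    return (alpha * research ** rho + (1 - alpha) * influence ** rho) ** (1 / rho)

# With strongly negative rho, the scarcer stock dominates:
print(round(preparedness(1.0, 100.0, rho=-20.0), 2))  # 1.04, pinned near the scarcer stock
```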
The probability of success at time
The first input is the probability of success if AGI arrived this year, given our existing stocks of research and influence; this input doesn't consider any fire alarm spending. The second input determines the steepness of the S-shaped curve.
We take (10%, 0.15).
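One way to read these two inputs is as the level and steepness of a logistic curve in (log) preparedness. The exact parameterisation in the notebook may differ, and `preparedness_ratio` is a name introduced here for illustration:

```python
import math

P_NOW = 0.10      # probability of success if AGI arrived this year
STEEPNESS = 0.15  # steepness of the S-shaped curve

def p_success(preparedness_ratio):
    """Assumed form: logistic in log2 of preparedness relative to today,
    calibrated so that today's stocks (ratio = 1) give P_NOW."""
    x0 = math.log(1 / P_NOW - 1)
    return 1 / (1 + math.exp(x0 - STEEPNESS * math.log2(preparedness_ratio)))

print(round(p_success(1.0), 2))  # 0.1
print(round(p_success(4.0), 2))  # 0.13: two doublings of preparedness help a little
```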
A note on the probability of success
Suppose you think we are in one of the following three worlds:
Then you should give your inputs as they are in your world B model. We keep the probability of success curve between 0 and 1, but one could linearly transform it to be greater than the probability of success in the A world and less than the probability of success in the C world. Since the objective function is linear in the probability of success, such a transformation has no effect on the optimal spending schedule.
In an alternate model, we suppose the funder has a stock of things that grow which includes things such as skills, people, some types of knowledge and trust. They can choose to spend this stock at some rate to produce a stock of things that depreciate that are immediately helpful in increasing the probability of success. This could comprise things such as the implementation of safety ideas on current systems or the product of asking for favours of AI developers or policymakers.
We say that spending capacity to create is ‘crunching’, and the periods with high are ‘crunch time’.
The probability we succeed at time is a function of , which is plus any last-minute spending that occurs with a fire alarm. Specifically, it is another S-shaped curve.
- Time evolution of things that grow:
- Time evolution of things that depreciate:
- Post-fire-alarm total of things that depreciate:
- The probability of success given AGI at :
- Objective function:
Recall that is the expected fraction of money spendable post-fire-alarm and is the expected duration of the fire alarm. The equation for is thus simply the result of spending at rate for duration .
The alternate model shares the following parameters and machinery with the research-influence model:
The new inputs include
We expect the growth in capacity to decrease over time since some of our capacity is money and the same reasons will apply as in the former model. We suppose , and .
The factor , which in the former model controlled how influence becomes more expensive over time, here controls how the cost of doing direct work (producing ) increases over time. Only some spending to produce is in competitive domains (such as influencing AI developers), while some is non-competitive, such as implementing safety features in state-of-the-art systems.
We suppose that has a minimum of 0.5 and otherwise has the same factors as in the former research-influence model.
This controls the degree of diminishing returns to ‘crunching’. For reasons similar to those given for and in the main model, we take . Our 90% confidence interval is .
This controls how long our crunch time activities are useful for i.e. the speed of depreciation. We take which implies that after one of the direct work is still useful.
To derive the constants used in the S-shaped curve, we ask for the probability of success after a hypothetical in which we've spent some fraction of our capacity for one year.
Our guess is that after spending half of our resources this year, we’d have a 25% chance of alignment success if AGI arrived this year. Note that this input does not account for any postfirealarm spending.
Unsurprisingly, we see that we should spend our capacity of things-that-grow most around the time we expect AGI to appear. For the 2040 and 2050 timelines, this implies spending very little on things that depreciate: up to around 3% a year. For 2030 timelines, we should be spending between 5 and 10% of our capacity each year on ‘crunching’ for the arrival of AGI. Further, for all results, maximum crunching begins after the modal AGI arrival date; this makes sense while the rate of growth of the movement exceeds the rate of decrease in the probability of AGI (times the discount factor).
This result is relatively sensitive to the probability we think AGI will appear in the next few years. We fit a lognormal distribution to the AGI timeline with , which leads to being small for the next few years. Considering a probability distribution that gave some weight to AGI in the next few years would inevitably imply a higher initial spending rate, though likely a similar overall spending schedule sufficiently many years from now.
Median AGI arrival:

| Difficulty of AGI success | 2030^{38} | 2040^{39} | 2050^{40} |
|---|---|---|---|
| Easy^{41} | | | |
| Medium^{42} | | | |
| Hard^{43} | | | |
Many of the limitations we describe apply to both models.
For example, there is no exogenous increase in , which we may expect if other actors work on AI risk at some time in the future. One could, for example, adjust such that spending on direct work receives more units of per unit spending in the future due to others also working on the problem.
As in the first model, our work and AI capabilities are independent. One could, again, use to model direct work becoming cheaper as time goes on and new AI capabilities are developed.
Added 2022-11-29; see discussion here.
Here I consider the most robust spending policies, supposing uncertainty over nearly all parameters in the main model^{44} rather than point estimates. I find that the community’s current spending rate on AI risk interventions is too low.
My distributions over the model parameters imply that:
I recommend entering your own distributions for the parameters in the Python notebook here^{46}. Further, these preliminary results use few samples: more reliable results would be obtained with more samples (and more computing time).
I allow for postfirealarm spending (i.e., we are certain AGI is soon and so can spend some fraction of our capital). Without this feature, the optimal schedules would likely recommend a greater spending rate.
The results from a simple optimiser^{47}, when allowing for four spending regimes: 2022-2027, 2027-2032, 2032-2037 and 2037 onwards. This result should not be taken too seriously: more samples should be used, the optimiser should run for a greater number of steps, and more intervals should be used. As with other results, this is contingent on the distributions of parameters.
This short extension started due to a conversation with David Field and a comment from Vasco Grilo; I’m grateful to both for the suggestion.
Tristan and Guillaume defined the problem, designed the model and its numerical resolution, interpreted the results, wrote and reviewed the article. Tristan coded the Python notebook and carried out the numerical computations with feedback from Guillaume. Tristan designed, coded, solved the alternate model and interpreted its results.
We’d both like to thank Lennart Stern and Daniel Kokotajlo for their comments and guidance during the project. We’re grateful to John Mori for comments.
Guillaume thanks the SERI summer fellowship 2021, where this project started with some excellent mentorship from Lennart Stern; the CEEALAR organisation for a stimulating working and living environment during summer 2021; and the CLR for providing funding to support part-time work with Tristan to make substantial progress on this project.
The post The optimal timing of spending on AGI safety work; why we should probably be spending more now appeared first on Center on Long-Term Risk.
The post When is intent alignment sufficient or necessary to reduce AGI conflict? appeared first on Center on Long-Term Risk.
In the previous post, we outlined possible causes of conflict and directions for intervening on those causes. Many of the causes of conflict seem like they would be addressed by successful AI alignment. For example: if AIs acquire conflict-prone preferences from their training data when we didn’t want them to, that is a clear case of misalignment. One of the suggested solutions (improving adversarial training and interpretability) just is alignment research, albeit directed at a specific type of misaligned behavior. We might naturally ask: does all work to reduce conflict risk follow this pattern? That is, is intent alignment sufficient to avoid unendorsed conflict?
Intent Alignment isn't Sufficient is a claim about unendorsed conflict. We’re focusing on unendorsed conflict because we want to know whether technical interventions on AGIs to reduce the risks of conflict make a difference. These interventions mostly make sense for preventing conflict that isn’t desired by the overseers of the systems. (If the only conflict between AGIs is endorsed by their overseers, then conflict reduction is a problem of ensuring that AGI overseers aren’t motivated to start conflicts.)
Let H be a human principal and A be its AGI agent. “Unendorsed” conflict, in our sense, is conflict which would not have been endorsed on reflection by H at the time A was deployed. This notion of “unendorsed” is a bit complicated. In particular, it doesn’t just mean “not endorsed by a human at the time the agent decided to engage in conflict”. We chose it because we think it should include the following cases:
We’ll use Evan Hubinger’s decomposition of the alignment problem. In Evan’s decomposition, an AI is aligned with humans (i.e., doesn’t take any actions we would consider bad/problematic/dangerous/catastrophic) if it is intentaligned and capability robust. (An agent is capability robust if it performs well by its own lights once it is deployed.) So the question for us is: What aspects of capability robustness determine whether unendorsed conflict occurs, and will these be present by default if intent alignment succeeds?
Let’s decompose conflict-avoiding “capability robustness”, the capabilities necessary and sufficient for avoiding unendorsed conflict, into two parts:
Two conditions need to hold for unendorsed conflict to occur if the AGIs are intent aligned (summarized in Figure 1): (1) the AIs lack some cooperative capability or have misunderstood their overseer’s cooperation-relevant preferences, and (2) conflict is not prevented by the AGI consulting with its overseer.
These conditions may sometimes hold. In the next section, we list scenarios in which consultation with overseers would fail to prevent conflict. We then look at “conflict-causing capabilities failures”.
One reason to doubt that intent-aligned AIs will engage in unendorsed conflict is that these AIs should be trying to figure out what their overseers want. Whenever possible, and especially before taking any irreversible action like starting a destructive conflict, the AI should check whether its understanding of overseer preferences is accurate. Here are some reasons why we still might see catastrophic decisions, despite this^{1}:
Let’s return to our causes of conflict and see how intent-aligned AGIs might fail to have the capabilities necessary to avoid unendorsed conflict due to these factors.
We break cooperation-relevant preferences into “object-level preferences” (such as how bad a particular conflict would be) and “meta-preferences” (such as how to reflect about how one wants to approach complicated bargaining problems).
One objection to doing work specific to reducing conflict between intent-aligned AIs now is that this work can be deferred to a time when we have highly capable and aligned AI assistants. We’d plausibly be able to do technical research drastically faster then. While this is a separate question from whether Intent Alignment isn't Sufficient, it is an important objection to conflict-specific work, so we briefly address it here.
Some reasons we might benefit from work on conflict reduction now, even in worlds where we get intent-aligned AGIs, include:
Still, the fact that intent-aligned AGI assistants may be able to do much of the research on conflict reduction that we would do now has important implications for prioritization. We should prioritize thinking about how to use intent-aligned assistants to reduce the risks of conflict, and deprioritize questions that are likely to be deferrable.
On the other hand, AI systems might be incorrigibly misaligned before they are in a position to substantially contribute to research on conflict reduction. We might still be able to reduce the chances of particularly bad outcomes involving misaligned AGI, without the help of intent-aligned assistants.
Whether or not Intent Alignment isn't Sufficient to prevent unendorsed conflict, we may not get intent-aligned AGIs in the first place. But it might still be possible to prevent worse-than-extinction outcomes resulting from an intent-misaligned AGI engaging in conflict. On the other hand, it seems difficult to steer a misaligned AGI’s conflict behavior in any particular direction.
Coarse-grained interventions on AIs’ preferences to make them less conflict-prone seem prima facie more likely to be effective given misalignment than trying to make more fine-grained interventions on how they approach bargaining problems (such as biasing AIs towards more cautious reasoning about commitments, as discussed previously). Let’s look at one reason to think that coarse-grained interventions on misaligned AIs’ preferences may succeed, and thus that Intent Alignment isn't Necessary.
Assume that at some point during training, the AI begins 'playing the training game'. Some time before it starts playing the training game, it has started pursuing a misaligned goal. What, if anything, can we predict about the conflict-proneness of this goal from the AI’s training data?
A key problem is that there are many objective functions such that trying to optimize is consistent with good early training performance, even if the agent isn’t playing the training game. However, we may not need to predict in much detail to know that a particular training regime will tend to select for a more or less conflict-prone objective. For example, consider a 2-agent training environment, and let be agent ’s reward signal. Suppose we have reason to believe that a training process selects for spiteful agents, that is, agents who act as if optimizing for on the training distribution.^{2} This gives us reason to think that agents will learn to optimize for for some objectives correlated with on the training distribution. Importantly, we don’t need to predict to worry that agent 1 will learn a spiteful objective.^{3}
Concretely, imagine an extension of the SmartVault example from the ELK report, in which multiple SmartVault reporters are trained in a shared environment. And suppose that the human overseers iteratively select the SmartVault system that gets the highest reward out of several in the environment. This creates incentives for the SmartVault systems to reduce each other’s reward. It may lead to them acquiring a terminal preference for harming (some proxy for) their counterpart’s reward. But this reasoning doesn’t rest on a specific prediction about what proxies for human approval the reporters are optimizing for. As long as SmartVault1 is harming some good proxy for SmartVault2’s approval, they will be more likely to be selected. (Again, this is only true because we are assuming that the SmartVaults are not yet playing the training game.)
What this argument shows is that choosing not to reward SmartVault1 or 2 competitively eliminates a training signal towards conflictproneness, regardless of whether either is truthful. So there are some circumstances under which we might not be able to select for truthful reporters in the SmartVault but could still avoid selecting for agents that are conflictprone.
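A toy sketch of why competitive selection rewards harming the counterpart (an illustration of the selection argument, not the actual SmartVault training setup):

```python
# Picking whichever agent earns the higher reward is selection on the
# relative reward r1 - r2, so lowering the counterpart's reward is
# selected for exactly as much as raising one's own.
def selected_agent(r1, r2):
    """Return which of two agents the overseer keeps, by raw reward comparison."""
    return 1 if r1 > r2 else 2

baseline = 5.0
# Agent 1 is selected whether it raises its own reward...
assert selected_agent(baseline + 1.0, baseline) == 1
# ...or instead lowers agent 2's reward by the same amount.
assert selected_agent(baseline, baseline - 1.0) == 1
print("sabotage is selected for just like self-improvement")
```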
Human evolution is another example. It may have been difficult for someone observing human evolution to predict precisely what proxies for inclusive fitness humans would end up caring about. But the game-theoretic structure of human evolution may have allowed them to predict that, whatever proxies for inclusive fitness humans ended up caring about, they would sometimes want to harm or help (proxies for) other humans’ fitness. And other-regarding human preferences (e.g., altruism, inequity aversion, spite) do still seem to play an important role in high-stakes human conflict.
The examples above focus on multi-agent training environments. This is not to suggest that multi-agent training, or training analogous to evolution, is the only regime in which we have any hope of intervening if intent alignment fails. Even in training environments in which a single agent is being trained, it will likely be exposed to “virtual” other agents, and these interactions may still select for dispositions to help or harm other agents. And just naively rewarding agents for prosocial behavior and punishing them for antisocial behavior early in training may still be low-hanging fruit worth picking, in the hopes that this exerts some positive influence over agents’ mesa-objective before they start playing the training game.
We’ve argued that Capabilities aren't Sufficient, Intent Alignment isn't Necessary and Intent Alignment isn't Sufficient, and therefore technical work specific to AGI conflict reduction could make a difference. It could still be that alignment research is a better bet for reducing AGI conflict. But we currently believe that there are several research directions that are sufficiently tractable, neglected, and likely to be important for conflict reduction that they are worth dedicating some portion of the existential AI safety portfolio to.
First, work on using intent-aligned AIs to navigate cooperation problems. This would involve conceptual research aimed at preventing intent-aligned AIs from locking in bad commitments or other catastrophic decisions early on, and preventing the corruption of AI-assisted deliberation about bargaining. One goal of this research would be to produce a manual for the overseers of intent-aligned AGIs with instructions on how to train their AI systems to avoid the failures of cooperation discussed in this sequence.
Second, research into how to train AIs in ways that don’t select for CPPs and inflexible commitments. Research into how to detect and select against CPPs or inflexible commitments could be useful (1) if intent alignment is solved, as part of the preparatory work to enable us to better understand what cooperation failures are common for AIs and how to avoid them, or (2) if intent alignment is not solved, directly, to incentivise misaligned AIs to be less conflict-prone. This could involve conceptual work on mechanisms for preventing CPPs that could survive misalignment. It might also involve empirical work, e.g., to understand the scaling of analogs of conflict-proneness in contemporary language models.
There are several tractable directions for empirical work that could support both of these research streams. Improving our ability to measure cooperation-relevant features of foundation models, and carrying out these measurements, is one. Better understanding the kinds of feedback humans give to AI systems in conflict situations, and how to improve that feedback, is another. Finally, getting practice training powerful contemporary AI systems to behave cooperatively also seems valuable, for reasons similar to those given by Ajeya in The case for aligning narrowly superhuman models.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
First we’ll focus on conflict that is costly by the AGIs’ lights. We’ll define “costly conflict” as (ex post) inefficiency: There is an outcome that all of the agents involved in the interaction prefer to the one that obtains.^{1} This raises the inefficiency puzzle of war: Why would intelligent, rational actors behave in a way that leaves them all worse off than they could be?
We’ll operationalize “rational and intelligent” actors as expected utility maximizers.^{2} We believe that the following taxonomy of the causes of inefficient outcomes between rational actors is exhaustive, except for a few implausible edge cases. (We give the full taxonomy, and an informal argument that it is exhaustive, in the appendix.) That is, expected value maximization can lead to inefficient outcomes for the agents only if one of the following conditions (or an implausible edge case) holds. This taxonomy builds on Fearon’s (1995) influential “rationalist explanations for war”.^{3}
Private information and incentives not to disclose. Here, “private information” means information about one’s willingness or ability to engage in conflict (e.g., how costly one considers conflict to be, or how strong one’s military is) about which other agents are uncertain. This uncertainty creates a risk-reward tradeoff: For example, Country A might think it’s sufficiently likely that Country B will give up without much of a fight that it’s worthwhile in expectation for A to fight B, even though they’ll end up fighting a war if A is wrong.
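A minimal numeric illustration of this tradeoff, with made-up payoffs:

```python
# Illustrative numbers only: A is unsure whether B will concede or fight.
p_concede = 0.7       # A's belief that B gives up without much of a fight
prize = 10.0          # value of what is contested
war_payoff = -5.0     # A's payoff if B turns out to be willing to fight

ev_attack = p_concede * prize + (1 - p_concede) * war_payoff
print(ev_attack)  # about 5.5: attacking is positive-EV, so war breaks out 30% of the time
```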
In these cases, removing uncertainty (e.g., both sides learning exactly how willing the other is to fight) opens up peaceful equilibria. This is why conflict due to private information requires “incentives not to disclose”. Whether there are incentives to disclose will depend on a few factors.
First, the technical feasibility of different kinds of verifiable disclosure matters. For example, if I have an explicitlyspecified utility function, how hard is it for me to prove to you how much my utility function values conflict relative to peace?
Second, different games will create different incentives for disclosure. Sometimes the mere possibility of verifiable disclosure ends up incentivizing all agents to disclose all of their private information (Grossman 1981; Milgrom 1981). But in other cases, more sophisticated disclosure schemes are needed. For example: Suppose that an agent has some vulnerability such that unconditionally disclosing all of their private information would place them at a decisive disadvantage. The agents could then make copies of themselves, allow these copies to inspect one another, and transmit back to the agents only the private information that’s necessary to reach a bargain. (See the appendix and DiGiovanni and Clifton (2022) for more discussion of conditional information revelation and other conditions for the rational disclosure of conflictrelevant private information.)
For the rest of the sequence, we’ll use “informational problem” as shorthand for this mechanism of conflict.
Inability to credibly commit to peaceful settlement. Agents might fight even though they would like to be able to commit to peace. The Prisoner’s Dilemma is the classic example: Both prisoners would like to be able to write a binding contract to cooperate, but if they can’t, then the game-theoretically rational thing to do is defect.
Similarly, sometimes one agent will be tempted to launch a preemptive attack against another. For example, if Country A thinks Country B will soon become significantly more powerful, Country A might be tempted to attack Country B now. This could be solved with credible commitment: Country B could commit not to becoming significantly more powerful, or to compensate Country A for their weakened bargaining position. But without the ability to make such commitments, Country A may be rationally compelled to fight.
Another example is randomly dividing a prize. Suppose Country A and Country B are fighting over an indivisible holy site. They might want to randomly allocate the holy site to one of them, rather than fighting. The problem is that, once the winner has been decided by the random lottery, the loser has no reason to concede rather than fight, unless they have some commitment in place to honor the outcome of the lottery.
For the rest of the sequence, we’ll use “commitment inability problem”^{4} as shorthand for this mechanism of conflict.
Miscoordination. When there are no informational or commitment inability problems, and agents’ preferences aren’t entirely opposed (see below), there will be equilibria in which agents avoid conflict. But the existence of such equilibria isn’t enough to guarantee peace, even between rational agents. Agents can still fail to coordinate on a peaceful solution.
A central example of catastrophic conflict due to miscoordination is incompatible commitments. Agents may make commitments to accepting only certain peaceful settlements, and otherwise punishing their counterpart. This can happen when agents have uncertainty about what commitments their counterparts will make. Depending on what you think about the range of outcomes your counterpart has committed to demanding, you might commit to a wider or narrower range of demands. There are situations in which the agents’ uncertainty is such that the optimal thing for each of them to do is commit to a narrow range of demands, which end up being incompatible. See this post on “the commitment races problem” for more discussion.
One reason for optimism about AGI conflict is that AGIs may be much better at credible commitment and disclosure of private information. For example, AGIs could make copies of themselves and let their counterparts inspect these copies until they are satisfied that they understand what kinds of commitments their counterpart has in place. Or, to credibly commit to a treaty, AGIs could do a “value handshake” and build a successor AGI system whose goal is to act according to the treaty. So, what are some reasons why AGIs would still engage in conflict, given these possibilities? Three stand out to us:
Strategic pressures early in AGI takeoff. Consider AGI agents that are opaque to one another, but are capable of self-modifying / designing successor agents who can implement the necessary forms of disclosure. Would such agents ever fail to implement these solutions? If, say, designing more transparent successor agents is difficult and time-consuming, then agents might face a tradeoff between trying to implement more cooperation-conducive architectures and placing themselves at a critical strategic disadvantage. This seems most plausible in the early stages of a multipolar AI takeoff.
Lack of capability early in AGI takeoff. Early in a slow multipolar AGI takeoff, pre-AGI AIs or early AGIs might be capable of starting destructive conflicts but lack the ability to design successor agents, scrutinize the inner workings of opponent AGIs, or reflect on their own cognition in ways that would let them anticipate future conflicts. If AGI capabilities come in this order, such that the ability to launch destructive conflicts comes a while before the ability to design complete successor agents or self-modify, then early AGIs may not be much better than humans at solving informational and commitment problems.
Fundamental computational limits. There may be fundamental limits on the ability of complex AGIs to implement the necessary forms of verifiable disclosure. For example, in interactions between complex AGI civilizations in the far future, these civilizations’ willingness to fight may be determined by factors that are difficult to compress. (That is, the only way in which you can find out how willing to fight they are is to run expensive simulations of what they would do in different hypothetical scenarios.) Or it may be difficult to verify that the other civilization has disclosed their actual private information.
These considerations apply to informational and commitment inability problems. But there is also the problem of incompatible commitments, which is not solved by sufficiently strong credibility or disclosure ability. Regardless of commitment or disclosure ability, agents will sometimes have to make commitments under uncertainty about others’ commitments.
Still, the ability to make conditional commitments could help to ameliorate the risks from incompatible commitments. For example, agents could have a hierarchy of conditional commitments of the form: “If our first-order commitments are incompatible, try resolving them via a second-order bargaining process.” See also safe Pareto improvements, a particular kind of fail-safe for incompatible commitments, which (in the version in the linked paper) relies on strong commitment and disclosure ability.
Another way conflict can be rational is if conflict actually isn’t costly for at least one agent, i.e., there isn’t any outcome that all parties prefer to conflict. That is, Conflict isn't Costly holds. Some ways this could happen:
These cases, in which conflict is literally costless for one agent, are prima facie quite unlikely. They are extremes on a spectrum of what we’ll call conflict-prone preferences (CPPs). By shrinking the range of outcomes agents prefer to conflict, these preferences exacerbate the risks of conflict due to informational, commitment inability, or miscoordination problems. For instance, risk-seeking preferences will lead to a greater willingness to risk losses from conflict (see Shulman (2010) for some discussion of implications for conflict between AIs and humans). And spite will make conflict less subjectively costly, as the material costs that a conflict imposes on a spiteful agent are partially offset by the positively valued material harms to one’s counterpart.
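As a toy illustration of how spite offsets the subjective cost of conflict, consider the following sketch. The payoff numbers are hypothetical, chosen only to make the arithmetic transparent.

```python
# Illustrative payoffs (ours, not from the post): peace gives each side a
# material payoff of 5; conflict destroys resources, leaving each side 2.

def spiteful_utility(own: float, other: float, spite: float) -> float:
    """A spiteful agent's utility: its own material payoff, minus `spite`
    times the counterpart's material payoff (harm to the other is valued)."""
    return own - spite * other

# Without spite, peace clearly beats conflict:
assert spiteful_utility(5, 5, 0.0) > spiteful_utility(2, 2, 0.0)   # 5 > 2
# With moderate spite the gap narrows:
assert spiteful_utility(5, 5, 0.5) > spiteful_utility(2, 2, 0.5)   # 2.5 > 1
# At spite = 1, conflict is no longer subjectively costly at all, even though
# it destroys 6 units of material value in total:
assert spiteful_utility(5, 5, 1.0) == spiteful_utility(2, 2, 1.0) == 0.0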
We argued above that Capabilities aren’t Sufficient. AGIs may sometimes engage in conflict that is costly, even if they are extremely capable. But it remains to be seen whether Intent Alignment isn't Sufficient to prevent unendorsed conflict, or Intent Alignment isn't Necessary to reduce the risk of conflict. Before we look at those claims, it may be helpful to review some approaches to AGI design that might reduce the risks reviewed in the previous section. In the next post, we ask whether these interventions are redundant with work on AI alignment.
Let’s look at interventions directed at each of the causes of conflict in our taxonomy.
Informational and commitment inability problems. One could try to design AGIs that are better able to make credible commitments and better able to disclose their private information. But it’s not clear whether this reduces the net losses from conflict. First, greater commitment ability could increase the risks of conflict from incompatible commitments. Second, even if the risks of informational and commitment inability problems would be eliminated in the limit of perfect commitment and disclosure, marginal increases in these capabilities could worsen conflict due to informational and commitment inability problems. For example, increasing the credibility of commitments could embolden actors to commit to carrying out threats more often, in a way that leads to greater losses from conflict overall.^{6}
Miscoordination. One direction here is building AGIs that reason in more cautious ways about commitment, and take measures to mitigate the downsides from incompatible commitments. For example, we could develop instructions for human overseers as to what kinds of reasoning about commitments they should encourage or discourage in their (intent-aligned) AI. The design of this “overseer’s manual” might be improved by doing more conceptual thinking about sophisticated approaches to commitments.^{7} Examples of this kind of work include Yudkowsky’s solution for bargaining between agents with different standards of fairness; surrogate goals and Oesterheld and Conitzer’s safe Pareto improvements; and Stastny et al.’s notion of norm-adaptive policies. It may also be helpful to consider what kinds of reasoning about commitments we should try to prevent altogether in the early stages of AGI development.
The goal of such work is not necessarily to fully solve tricky conceptual problems in bargaining. One path to impact is to improve the chances that early human-intent-aligned AI teams are in a “basin of attraction of good bargaining”. The initial conditions of their deliberation about how to bargain should be good enough to avoid locking in catastrophic commitments early on, and to avoid path-dependencies which cause deliberation to be corrupted. We return to this in our discussion of Intent Alignment isn't Sufficient.
Lastly, surrogate goals are a proposal for mitigating the downsides of executed threats, which might occur due to either miscoordination or informational problems. The idea is to design an AI to treat threats to carry out some benign action (e.g., simulating clouds) the same way that they treat threats against the overseer’s terminal values.
Conflict-prone preferences. Consider a few ways in which AI systems might acquire CPPs. First, CPPs might be strategically useful in some training environments. Evolutionary game theorists have studied how CPPs like spite (Hamilton 1970; Possajennikov 2000; Gardner and West 2004; Forber and Smead 2016) and aggression towards out-group members (Choi and Bowles 2007) can be selected for. Analogous selection pressures could appear in AI training.^{8} For example, selection for the agents that perform the best relative to opponents creates similar pressures to the evolutionary pressures hypothetically responsible for spite: Agents will have reason to sacrifice absolute performance to harm other agents, so that they can increase the chances that their relative score is highest. So, identifying and removing training environments which incentivize CPPs (while not affecting agents’ competitiveness) is one direction for intervention.
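A minimal sketch of this selection pressure, with made-up payoff numbers: when fitness is measured relative to an opponent's score, sacrificing absolute payoff to harm the opponent becomes optimal.

```python
# Hypothetical payoffs: "harm" sacrifices 1 unit of one's own payoff to
# destroy 3 units of the opponent's.

BASE, COST, DAMAGE = 10, 1, 3

def material_payoffs(i_harm: bool, j_harm: bool):
    """Absolute payoffs (i, j) given each agent's choice to harm or refrain."""
    pi = BASE - (COST if i_harm else 0) - (DAMAGE if j_harm else 0)
    pj = BASE - (COST if j_harm else 0) - (DAMAGE if i_harm else 0)
    return pi, pj

def relative_fitness(i_harm: bool, j_harm: bool) -> int:
    """Fitness of i when selection rewards doing better *than the opponent*."""
    pi, pj = material_payoffs(i_harm, j_harm)
    return pi - pj

# Under selection on absolute payoff, harming is costly:
assert material_payoffs(True, False)[0] < material_payoffs(False, False)[0]
# Under selection on relative payoff, harming strictly helps no matter what
# the opponent does, even though it lowers everyone's absolute payoff:
for j_harm in (False, True):
    assert relative_fitness(True, j_harm) > relative_fitness(False, j_harm)
```

A training curriculum that scores agents by "highest score wins" thus implements exactly the kind of relative-performance selection described above.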
Second, CPPs might result from poor generalization from human preference data. An AI might fail to correct for biases that cause a human to behave in a more conflict-conducive way than they would actually endorse, for instance. Inferring human preferences is hard. It is especially difficult in multi-agent settings, where a preference-inferrer has to account for a preference-inferree’s models of other agents, as well as biases specific to mixed-motive settings.^{9}
Finally, a generic direction for preventing CPPs is developing adversarial training and interpretability methods tailored to rooting out conflict-prone behavior.
Here we present our complete taxonomy of causes of costly conflict between rational agents, and give an informal argument that it is exhaustive. Remember that by “rational” we mean “maximizes subjective expected utility” and by “costly conflict” we mean “inefficient outcome”.
For the purposes of the informal exposition, it will be helpful to distinguish between equilibrium-compatible and equilibrium-incompatible conflict. Equilibrium-compatible conflicts are those which are naturally modeled as occurring in some (Bayesian) Nash equilibrium. That is, we can model them as resulting from agents (i) knowing each others’ strategies exactly (modulo private information) and (ii) playing a best response. Equilibrium-incompatible conflicts cannot be modeled in this way. Note, however, that the equilibrium-compatible conditions for conflict can hold even when agents are not in equilibrium.
This breakdown is summarized as a fault tree diagram in Figure 3.
Here is an argument that items 1a–1c capture all games in which conflict occurs in every (Bayesian) Nash equilibrium. To start, consider games of complete information. Some complete-information games have only inefficient equilibria. The Prisoner’s Dilemma is the canonical example. But we also know that any efficient outcome that is achievable by some convex combination of strategies, and is better for each player than what they can unilaterally guarantee themselves, can be attained in equilibrium when agents are capable of conditional commitments to cooperation and correlated randomization (Kalai et al. 2010). This means that, for a game of complete information to have only inefficient equilibria, it has to be the case that either the players are unable to make credible conditional commitments to an efficient profile (1b), or an efficient and individually rational outcome is only attainable with randomization (because the contested object is indivisible) but randomization isn’t possible (1c).
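For instance, one can verify mechanically that the Prisoner's Dilemma has only an inefficient pure-strategy equilibrium. The payoffs below are the standard illustrative ones, assumed rather than taken from the text.

```python
# Standard illustrative Prisoner's Dilemma payoffs (assumed, not from the text).

PD = {  # (row action, column action) -> (row payoff, column payoff)
    ("C", "C"): (3, 3), ("C", "D"): (0, 4),
    ("D", "C"): (4, 0), ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")

def pure_nash(game):
    """Enumerate pure-strategy Nash equilibria of a two-player game."""
    eqs = []
    for a in ACTIONS:
        for b in ACTIONS:
            row_best = all(game[(a, b)][0] >= game[(x, b)][0] for x in ACTIONS)
            col_best = all(game[(a, b)][1] >= game[(a, y)][1] for y in ACTIONS)
            if row_best and col_best:
                eqs.append((a, b))
    return eqs

assert pure_nash(PD) == [("D", "D")]            # the unique equilibrium...
assert PD[("C", "C")][0] > PD[("D", "D")][0]    # ...is Pareto-dominated by (C, C)
```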
Even if efficiency under complete information is always possible given commitment and randomization ability, players might not have complete information. It is well-known that private information can lead to inefficiency in equilibrium, due to agents making risk-reward tradeoffs under uncertainty about their counterpart’s private information (1a). It is also necessary that agents can’t or won’t disclose their private information — we give a breakdown of reasons for non-disclosure below.
This all means that a game has no efficient equilibria only if one of items 1a–1c holds. But it could still be the case that agents coordinate on an inefficient equilibrium, even if an efficient one is available (1d). E.g., agents might both play Hare in a Stag Hunt. (Coordinating on an equilibrium but failing to coordinate on an efficient one seems unlikely, which is why we don’t discuss it in the main text. But it isn’t ruled out by the assumptions of accurate beliefs and maximizing expected utility with respect to those beliefs alone.)
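The Stag Hunt case can be checked the same way. With the standard illustrative payoffs assumed below, (Hare, Hare) is a Nash equilibrium even though (Stag, Stag) is available and better for both players.

```python
# Stag Hunt with standard illustrative payoffs (assumed): hunting stag
# together is best, but hare yields a safe 3 regardless of the other player.

SH = {
    ("Stag", "Stag"): (4, 4), ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0), ("Hare", "Hare"): (3, 3),
}
ACTIONS = ("Stag", "Hare")

def is_pure_nash(game, a, b):
    row_best = all(game[(a, b)][0] >= game[(x, b)][0] for x in ACTIONS)
    col_best = all(game[(a, b)][1] >= game[(a, y)][1] for y in ACTIONS)
    return row_best and col_best

assert is_pure_nash(SH, "Stag", "Stag")  # an efficient equilibrium exists,
assert is_pure_nash(SH, "Hare", "Hare")  # yet the inefficient outcome is also
                                         # an equilibrium agents may land on
```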
This exhausts explanations of situations in which conflict happens in equilibrium. But rationality does not imply that agents are in equilibrium. How could imperfect knowledge of other players’ strategies drive conflict? There are two possibilities:
Suppose that agents have private information such that nondisclosure of the information will lead to a situation in which conflict is rational, but conflict would no longer be rational if the information were disclosed.
We can decompose reasons not to disclose into reasons not to unconditionally disclose and reasons not to conditionally disclose. Here, “conditional” disclosure means “disclosure conditional on a commitment to a particular agreement by the other player”. For example, suppose my private information is (s, t), where s measures my military strength, such that p(s) is my chance of winning a war, and t is information about secret military technology that I don’t want you to learn. A conditional commitment would be: I disclose s, so that we can decide the outcome of the contest according to a costless lottery which I win with probability p(s), conditional on a commitment from you not to use your knowledge of s to harm me.
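A toy calculation of why such a conditional disclosure plus lottery Pareto-improves on fighting. The numbers are hypothetical; we write p for the disclosed win probability and c for each side's cost of war.

```python
# Hypothetical numbers: the contested prize is worth 1 to each side, fighting
# costs each side c, and p is my (disclosed) chance of winning a war.

def war_payoffs(p: float, c: float):
    """Expected payoffs (me, you) from fighting over a prize worth 1."""
    return p - c, (1 - p) - c

def lottery_payoffs(p: float):
    """Expected payoffs from a costless lottery that I win with probability p."""
    return p, 1 - p

p, c = 0.7, 0.2
for side in range(2):
    # The lottery gives each side exactly c more than fighting would:
    assert lottery_payoffs(p)[side] > war_payoffs(p, c)[side]
```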
Here is the decomposition:
DiGiovanni, Anthony, and Jesse Clifton. 2022. “Commitment Games with Conditional Information Revelation.” arXiv [cs.GT]. arXiv. http://arxiv.org/abs/2204.03484.
Shulman, Carl. 2010. “Omohundro’s ‘Basic AI Drives’ and Catastrophic Risks.” Manuscript. http://www.hdjkn.com/files/BasicAIDrives.pdf.
The post When would AGIs engage in conflict? appeared first on Center on Long-Term Risk.
The post When does technical work to reduce AGI conflict make a difference?: Introduction appeared first on Center on Long-Term Risk.
This is a pared-down version of a longer draft report. We went with a more concise version to get it out faster, so it ended up being more of an overview of definitions and concepts, and is thin on concrete examples and details. Hopefully subsequent work will help fill those gaps.
Some researchers are focused on reducing the risks of conflict between AGIs. In this sequence, we’ll present several necessary conditions for technical work on AGI conflict reduction to be effective, and survey circumstances under which these conditions hold. We’ll also present some tentative thoughts on promising directions for research and intervention to prevent AGI conflict.
This sequence assumes familiarity with intermediate game theory.
Could powerful AI systems engage in catastrophic conflict? And if so, what are the best ways to reduce this risk? Several recent research agendas related to safe and beneficial AI have been motivated, in part, by reducing the risks of large-scale conflict involving artificial general intelligence (AGI). These include the Center on Long-Term Risk’s research agenda, Open Problems in Cooperative AI, and AI Research Considerations for Human Existential Safety (and this associated assessment of various AI research areas). As proposals for longtermist priorities, these research agendas are premised on a view that AGI conflict could destroy large amounts of value, and that a good way to reduce the risk of AGI conflict is to do work on conflict in particular. In this sequence, our goal is to assess conditions under which work specific to conflict reduction could make a difference, beyond non-conflict-focused work on AI alignment and capabilities.^{1}
Examples of conflict include existentially catastrophic wars between AGI systems in a multipolar takeoff (e.g., 'flash war') or even between different civilizations (e.g., Sandberg 2021). We’ll assume that expected losses from catastrophic conflicts such as these are sufficiently high for this to be worth thinking about at all, and we won’t argue for that claim here.
We’ll restrict attention to technical (as opposed to, e.g., governance) interventions aimed at reducing the risks of catastrophic conflict involving AGI. These include Cooperative AI interventions, where Cooperative AI is concerned with improving the cooperative capabilities of self-interested actors (whether AI agents or AI-assisted humans).^{2} Candidates for cooperative capabilities include the ability to implement mutual auditing schemes in order to reduce uncertainties that contribute to conflict, and the ability to avoid conflict due to incompatible commitments (see Yudkowsky (2013); Oesterheld and Conitzer (2021); Stastny et al. (2021)). The interventions under consideration also include improving AI systems’ ability to understand humans’ cooperation-relevant preferences. Finally, they include shaping agents’ cooperation-relevant preferences, e.g., preventing AGIs from acquiring conflict-prone preferences like spite. An overview of the kinds of interventions that we have in mind here is given in Table 1.
Table 1: Classes of technical interventions specific to reducing conflict.

- Improving cooperative capabilities (Cooperative AI)
- Improving understanding of humans’ cooperation-relevant preferences
- Shaping cooperation-relevant preferences
There are reasons to doubt the claim that (Technical Work Specific to) Conflict Reduction Makes a Difference.^{3} Conflict reduction work makes a difference only if both of the following hold: (a) AGIs won’t always avoid conflict by default, despite it being materially costly; and (b) intent alignment is either insufficient or unnecessary for reducing the risk of conflict. In the rest of the sequence, we’ll look at what needs to happen for these conditions to hold.
Throughout the sequence, we will use “conflict” to refer to “conflict that is costly by our lights”, unless otherwise specified. Of course, conflict that is costly by our lights (e.g., wars that destroy resources that would otherwise be used to make things we value) is also likely to be costly by the AGIs’ lights, though this is not a logical necessity. For AGIs to fail to avoid conflict by default, one of these must be true:
Conflict isn’t costly by the AGIs’ lights. That is, there don’t exist outcomes that all of the disputant AGIs would prefer to conflict.
AGIs that are sufficiently capable to engage in conflict that is costly for them wouldn’t also be sufficiently capable to avoid conflict that is costly for them.^{4}
If either Conflict isn't Costly or Capabilities aren't Sufficient, then it may be possible to reduce the chances that AGIs engage in conflict. This could be done by improving their cooperationrelevant capabilities or by making their preferences less prone to conflict. But this is not enough for Conflict Reduction Makes a Difference to be true.
Intent alignment may be both sufficient and necessary to reduce the risks of AGI conflict that isn’t endorsed by human overseers, insofar as it is possible to do so. If that were true, technical work specific to conflict reduction would be redundant. This leads us to the next two conditions that we’ll consider.
Intent alignment — i.e., AI systems trying to do what their overseers want — combined with the capabilities that AI systems are very likely to have conditional on intent alignment, isn’t sufficient for avoiding conflict that is not endorsed (on reflection) by the AIs’ overseers.
Even if intent alignment fails, it is still possible to intervene on an AI system to reduce the risks of conflict. (We may still want to prevent conflict if intent alignment fails and leads to an unrecoverable catastrophe, as this could make worsethanextinction outcomes less likely.)
By unendorsed conflict, we mean conflict caused by AGIs that results from a sequence of decisions that none of the AIs’ human principals would endorse after an appropriate process of reflection.^{5} The reason we focus on unendorsed conflict is that we ultimately want to compare (i) conflictspecific interventions on how AI systems are designed and (ii) work on intent alignment.
Neither of these is aimed at solving problems that are purely about human motivations, like human overseers instructing their AI systems to engage in clearly unjustified conflict.
Contrary to what our framings here might suggest, disagreements about the effectiveness of technical work to reduce AI conflict relative to other longtermist interventions are unlikely to be about the logical possibility of conflict reduction work making a difference. Instead, they are likely to involve quantitative disagreements about the likelihood and scale of different conflict scenarios, the degree to which we need AI systems to be aligned to intervene on them, and the effectiveness of specific interventions to reduce conflict (relative to intent alignment, say). We regard mapping out the space of logical possibilities for conflict reduction to make a difference as an important initial step in the longerterm project of assessing the effectiveness of technical work on conflict reduction.^{6}
Thanks to Michael Aird, Jim Buhler, Steve Byrnes, Sam Clarke, Allan Dafoe, Daniel Eth, James Faville, Lukas Finnveden, Lewis Hammond, Julian Stastny, Daniel Kokotajlo, David Manheim, Rani Martin, Adam Shimi, Stefan Torges, and Francis Ward for comments on drafts of this sequence. Thanks to Beth Barnes, Evan Hubinger, Richard Ngo, and Carl Shulman for comments on a related draft.
Stastny, Julian, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. “Normative Disagreement as a Challenge for Cooperative AI.” arXiv [cs.MA]. arXiv. http://arxiv.org/abs/2111.13872.
The post Open Position: Community Manager appeared first on Center on Long-Term Risk.
The Center on Long-Term Risk is seeking a Community Manager, to work on growing and supporting the community around our mission and research. You will have a leveraged role in furthering our mission to address risks of astronomical suffering from the development and deployment of advanced AI systems.
In this role, you would become the third full member of our Community-building team, reporting to Stefan Torges, the Director of Operations. Depending on your skill set, you will take on responsibilities across diverse areas such as event & project management, 1:1 outreach & advising calls, setting up & improving IT infrastructure, writing, giving talks, and attending in-person networking events – making this role ideal for quickly gaining experience across a range of domains. You will receive mentorship from an experienced team, and become familiar with existing processes in a well-running organization, as you work to improve and supplement them. You will also have the opportunity to engage with cutting-edge research in longtermism and AI safety as well as shaping our strategy.
To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th). We expect the form will take 30-60 minutes to complete. It can be done in as little as 10 minutes if necessary by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We are recruiting for this role in order to provide additional capacity in our community-building function. Precisely which areas you work on will depend on your strengths and interests, and we’ll decide this together with you once you start work.
As an illustration of the sorts of things you’ll work on, we expect that the successful candidate will take on several of the following tasks:
Examples of further responsibilities that a candidate who is a good fit for them could take on include:
Since we are a small team, all members have the opportunity to shape our strategy.
We think this role could provide suitable challenges for someone with 0-4 years of experience in a similar job: it might, for example, be suited to a recent graduate interested in quickly gaining experience in a professional community-building role, and we also encourage more experienced candidates to apply.
The following abilities and qualities are what we’re looking for in candidates. No specific qualifications or experience are required – experience is one good way of demonstrating these skills, but we’re also open to candidates with no experience of similar roles. We encourage you to apply if you think you may be a good fit, even if you are very unsure of your strengths in some of these areas.
Given that we are a small organization, we also value candidates who are willing to do less glamorous tasks to bring a project over the finish line.
In this role, you can expect to grow our team and the community of people who are committed to reducing risks of astronomical suffering from the development of AI systems. That makes it a highly leveraged opportunity to contribute to that effort.
Due to the small size of our organization, your work will be varied and you will be asked to take ownership of projects quickly. Our community is still at an early stage, so we regularly test new projects, which can help you master a variety of skills and provide you with space to propose your own ideas.
You will join an experienced community-building team who will provide you with mentorship. You will work alongside and interact regularly with our researchers. So you have many opportunities to engage with ideas related to risks of astronomical suffering as well as effective altruism, longtermism, and AI safety.
CLR will also actively support your professional development. While we are looking for a candidate who is interested in working with CLR for a substantial period of time, as part of the effective altruism community we are interested in helping you increase your career’s impact even beyond your performance in the current role. Alongside mentorship from our experienced operations team, you will be joining a wellnetworked longtermist organization. You will receive a budget of £8,000 per year to spend on whatever you think best furthers your professional development, and be supported to attend EA Global conferences.
Stage 1: To apply for this role, please submit this application form. The deadline for applications is the end of Sunday 16th October (precisely: 7:30am British Summer Time on Monday 17th).
We expect the form will take 30-60 minutes to complete. If necessary, the form can be done in as little as 10 minutes by skipping the descriptive questions: this may significantly disadvantage your application, but may make sense if you wouldn’t apply otherwise.
We aim to communicate the results of stage 1, inviting candidates to the second stage, by the end of Friday 21st October.
Stage 2 will be a remote work test, to be completed on your own computer, which we anticipate will take up to 4 hours of your time. Applicants will have 2 weeks to complete the test, and will be compensated with £120 in return for their work. We plan to communicate the results of stage 2 by the end of Friday 11th November.
Stage 3 will consist of one or more interviews with CLR staff. We plan to hold interviews in the week of 21st November, and aim to communicate the results of stage 3 by the end of Friday 25th November.
Stage 4: The final stage of the recruitment process will be a work trial, held in-person if possible, of between 1-10 working days depending on candidate availability. We will cover travel expenses and compensate candidates £200 per day for the work trial. We will also seek references at this stage.
We expect final recruitment decisions to be made by the end of the year. If you require a faster decision than this, please feel free to contact us at the address below.
The above timelines are our aim and we fully intend to stick to them. However, we don’t firmly commit to them, and a delay of, for example, 1-2 weeks by the end of stage 3 is possible. We will communicate to candidates promptly if we expect there to be any delays.
If you have any questions about the process, please contact us at hiring@longtermrisk.org. If you’d like to send an email that’s not accessible to the hiring committee, please contact tristan.cook@longtermrisk.org.
Diversity and equal opportunity employment: CLR is an equal opportunity employer, and we value diversity at our organization. We don’t want to discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, social background/class, mental or physical health or disability, or any other basis for unreasonable discrimination, whether legally protected or not. If you're considering applying to this role and would like to discuss any personal needs that might require adjustments to our application process or workplace, please feel very free to contact us.
The post Safe Pareto Improvements for Delegated Game Playing appeared first on Center on Long-Term Risk.
A set of players delegate playing a game to a set of representatives, one for each player. We imagine that each player trusts their respective representative’s strategic abilities. Thus, we might imagine that per default, the original players would simply instruct the representatives to play the original game as best as they can. In this paper, we ask: are there safe Pareto improvements on this default way of giving instructions? That is, we imagine that the original players can coordinate to tell their representatives to only consider some subset of the available strategies and to assign utilities to outcomes differently than the original players. Then can the original players do this in such a way that the payoff is guaranteed to be weakly higher than under the default instructions for all the original players? In particular, can they Pareto-improve without probabilistic assumptions about how the representatives play games? In this paper, we give some examples of safe Pareto improvements. We prove that the notion of safe Pareto improvements is closely related to a notion of outcome correspondence between games. We also show that under some specific assumptions about how the representatives play games, finding safe Pareto improvements is NP-complete.
Keywords: program equilibrium; delegation; bargaining; Pareto efficiency; smart contracts.
Between Aliceland and Bobbesia lies a sparsely populated desert. Until recently, neither of the two countries had any interest in the desert. However, geologists have recently discovered that it contains large oil reserves. Now, both Aliceland and Bobbesia would like to annex the desert, but they worry about a military conflict that would ensue if both countries insist on annexing.
Table 1 models this strategic situation as a normal-form game. The strategy DM (short for “Demand with Military”) denotes a military invasion of the desert, demanding annexation. If both countries send their military with such an aggressive mission, the countries fight a devastating war. The strategy RM (for “Refrain with Military”) denotes yielding the territory to the other country, but building defenses to prevent an invasion of one’s current territories. Alternatively, the countries can choose to not raise a military force at all, while potentially still demanding control of the desert by sending only its leader (DL, short for “Demand with Leader”). In this case, if both countries demand the desert, war does not ensue. Finally, they could neither demand nor build up a military (RL). If one of the two countries has their military ready and the other does not, the militarized country will know and will be able to invade the other country. In game-theoretic terms, militarizing therefore strictly dominates not militarizing.
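Since Table 1 itself is not reproduced in this excerpt, the following sketch uses hypothetical payoffs that are merely consistent with the verbal description, and checks the dominance claim mechanically. None of the numbers below come from the paper.

```python
# Table 1's exact payoffs aren't reproduced here; the numbers below are
# hypothetical, chosen only to match the verbal description (mutual DM means a
# devastating war; an unmilitarized country facing a militarized one is
# invaded, which is worst of all).

ACTIONS = ("DM", "RM", "DL", "RL")

U1 = {  # payoff to country 1 for (own action, other's action); symmetric game
    ("DM", "DM"): -10, ("DM", "RM"): 10,  ("DM", "DL"): 12, ("DM", "RL"): 12,
    ("RM", "DM"):   0, ("RM", "RM"):  2,  ("RM", "DL"):  4, ("RM", "RL"):  4,
    ("DL", "DM"): -20, ("DL", "RM"): -20, ("DL", "DL"):  5, ("DL", "RL"): 10,
    ("RL", "DM"): -20, ("RL", "RM"): -20, ("RL", "DL"):  0, ("RL", "RL"):  2,
}

def strictly_dominates(a: str, b: str) -> bool:
    """True if action a beats action b against every opponent action."""
    return all(U1[(a, other)] > U1[(b, other)] for other in ACTIONS)

# Militarizing strictly dominates the corresponding unmilitarized strategy:
assert strictly_dominates("DM", "DL")
assert strictly_dominates("RM", "RL")
```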
Instead of making the decision directly, the parliaments of Aliceland and Bobbesia appoint special commissions for making this strategic decision, led by Alice and Bob, respectively. The parliaments can instruct these representatives in various ways. They can explicitly tell them what to do – for example, Aliceland could directly tell Alice to play DM. However, we imagine that the parliaments trust the commissions’ judgments more than they trust their own and hence they might prefer to give an instruction of the type, “make whatever demands you think are best for our country” (perhaps contractually guaranteeing a reward in proportion to the utility of the final outcome). They might not know what that will entail, i.e., how the commissions decide what demands to make given that instruction. However – based on their trust in their representatives – they might still believe that this leads to better outcomes than giving an explicit instruction.
We will also imagine these instructions are (or at least can be) given publicly and that the commissions are bound (as if by a contract) to follow these instructions. In particular, we imagine that the two commissions can see each other’s instructions. Thus, in instructing their commissions, the countries play a game with bilateral precommitment. When instructed to play a game as best as they can, we imagine that the commissions play that game in the usual way, i.e., without further abilities to credibly commit or to instruct subcommittees and so forth.
It may seem that without having their parliaments ponder equilibrium selection, Aliceland and Bobbesia cannot do better than leave the game to their representatives. Unfortunately, in this default equilibrium, war is still a possibility. Even the brilliant strategists Alice and Bob may not always be able to resolve the difficult equilibrium selection problem to the same pure Nash equilibrium.
In the literature on commitment devices and in particular the literature on program equilibrium, important ideas have been proposed for avoiding such bad outcomes. Imagine for a moment that Alice and Bob will play a Prisoner’s Dilemma (Table 3) (rather than the Demand Game of Table 1). Then the default of (Defect, Defect) can be Pareto-improved upon. Both original players (Aliceland and Bobbesia) can use the following instruction for their representatives: “If the opponent’s instruction is equal to this instruction, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] Then it is a Nash equilibrium for both players to use this instruction. In this equilibrium, (Cooperate, Cooperate) is played, which is thus Pareto-optimal and Pareto-better than the default.
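The quoted instruction can be sketched directly. In this toy model, programs are plain values that each receive the opponent's program as input; the Prisoner's Dilemma payoffs are standard illustrative numbers, not taken from Table 3.

```python
# (Cooperate, Defect) payoffs per outcome; illustrative standard PD numbers.
PD = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
      ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

# The conditional-commitment instruction from the text.
FAIR_BOT = "C if opponent == this else D"

def run(program, opponent_program):
    if program == "C if opponent == this else D":
        return "C" if opponent_program == program else "D"
    return program  # unconditional programs are just an action, "C" or "D"

def play(p1, p2):
    return run(p1, p2), run(p2, p1)

assert play(FAIR_BOT, FAIR_BOT) == ("C", "C")   # equilibrium path
assert play(FAIR_BOT, "D") == ("D", "D")        # unilateral deviation is punished
assert PD[play(FAIR_BOT, FAIR_BOT)] == (2, 2)
```

Because any deviation from the shared instruction triggers Defect, neither player can gain by submitting a different program, which is why the instruction pair is an equilibrium.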
In cases like the Demand Game, it is more difficult to apply this approach to improve upon the default of simply delegating the choice. Of course, if one could calculate the expected utility of submitting the default instructions, then one could similarly commit the representatives to follow some (joint) mix over the Pareto-optimal outcomes ((RM, DM), (DM, RM), (RM, RM), (DL, DL), etc.) that Pareto-improves on the default expected utilities.^{1} However, we will assume that the original players are unable or unwilling to form probabilistic expectations about how the representatives play the Demand Game, i.e., about what would happen with the default instructions. If this is the case, then this type of Pareto improvement on the default is unappealing.
The goal of this paper is to show and analyze how, even without forming probabilistic beliefs about the representatives, the original players can Pareto-improve on the default equilibrium. We will call such improvements safe Pareto improvements (SPIs). We here briefly give an example in the Demand Game.
The key idea is for the original players to instruct the representatives to select only from {DL,RL}, i.e., to not raise a military. Further, they tell them to disvalue the conflict outcome without military (DL, DL) as they would disvalue the original conflict outcome of war in the default equilibrium. Overall, this means telling them to play the game of Table 2. (Again, we could imagine that the instructions specify Table 2 to be how Aliceland and Bobbesia financially reward Alice and Bob.) Importantly, Aliceland’s instruction to play that game must be conditional on Bobbesia also instructing their commission to play that game, and vice versa. Otherwise, one of the countries could profit from deviating by instructing their representative to always play DM or RM (or to play by the original utility function).
The game of Table 2 is isomorphic to the DM–RM part of the original Demand Game of Table 1. Of course, the original players know neither how the original Demand Game nor the game of Table 2 will be played by the representatives. However, since these games are isomorphic, one should arguably expect them to be played isomorphically. For example, one should expect that (RM, DM) would be played in the original game if and only if (RL, DL) would be played in the modified game. However, the conflict outcome (DM, DM) is replaced in the new game with the outcome (DL, DL). This outcome is harmless (Pareto-optimal) for the original players.
Contributions. Our paper generalizes this idea to arbitrary normal-form games and is organized as follows. In Section 2, we introduce some notation for games and multivalued functions that we will use throughout this paper. In Section 3, we introduce the setting of delegated game playing for this paper. We then formally define and further motivate the concept of safe Pareto improvements. We also define and give an example of unilateral SPIs. These are SPIs that require only one of the players to commit their representative to a new action set and utility function. In Section 3.2, we briefly review the concepts of program games and program equilibrium and show that SPIs can be implemented as program equilibria. In Section 4.2, we introduce a notion of outcome correspondence between games. This relation expresses the original players’ beliefs about similarities between how the representatives play different games. In our example, the Demand Game of Table 1 (arguably) corresponds to the game of Table 2 in that the representatives (arguably) would play (DM, DM) in the original game if and only if they play (DL, DL) in the new game, and so forth. We also show some basic results (reflexivity, transitivity, etc.) about the outcome correspondence relation on games. In Section 4.3, we show that the notion of outcome correspondence is central to deriving SPIs. In particular, we show that a game is an SPI on another game if and only if there is a Pareto-improving outcome correspondence relation between the two games.
To derive SPIs, we need to make some assumptions about outcome correspondence, i.e., about which games are played in similar ways by representatives. We give two very weak assumptions of this type in Section 4.4. The first is that the representatives’ play is invariant under the removal of strictly dominated strategies. For example, we assume that in the Demand Game the representatives only play DM and RM. Moreover, we assume that we could remove DL and RL from the game and the representatives would still play the same strategies as in the original Demand Game with certainty. The second assumption is that the representatives play isomorphic games isomorphically. For example, once DL and RL are removed for both players from the Demand Game, the Demand Game is isomorphic to the game in Table 2, such that we might expect them to be played isomorphically. In Section 4.5, we derive a few SPIs – including our SPI for the Demand Game – using these assumptions. Section 4.6 shows that determining whether there exists an SPI based on these assumptions is NP-complete. Section 5 considers a different setting in which we allow the original players to let the representatives choose from newly constructed strategies whose corresponding outcomes map arbitrarily onto feasible payoff vectors from the original game. In this new setting, finding SPIs can be done in polynomial time. We conclude by discussing the problem of selecting between different SPIs on a given game (Section 6) and giving some ideas for directions for future work (Section 7).
We here give some basic game-theoretic definitions. We assume that the reader is familiar with most of these concepts and with game theory more generally.
An n-player (normal-form) game is a tuple Γ = (A, u) of a set A = A_1 × … × A_n of (pure) strategy profiles (or outcomes) and a function u: A → ℝ^n that assigns to each outcome a utility for each player. The Prisoner's Dilemma shown in Table 3 is a classic example of a game. The Demand Game of Table 1 is another example of a game that we will use throughout this paper.
Instead of u(a) we will also write (u_1(a), …, u_n(a)). We also write A_{-i} for A_1 × … × A_{i-1} × A_{i+1} × … × A_n, i.e., for the Cartesian product of the action sets of all players other than i. We similarly write u_{-i} and a_{-i} for vectors containing utility functions and actions, respectively, for all players but i. If u_i' is a utility function and u_{-i} is a vector of utility functions for all players other than i, then (even if u_i' ≠ u_i) we use (u_i', u_{-i}) for the full vector of utility functions where Player i has utility function u_i' and the other players have utility functions as specified by u_{-i}. We use (a_i', a_{-i}) and (A_i', A_{-i}) analogously.
We say that a_i ∈ A_i strictly dominates â_i ∈ A_i if for all a_{-i} ∈ A_{-i}, u_i(a_i, a_{-i}) > u_i(â_i, a_{-i}). For example, in the Prisoner's Dilemma, Defect strictly dominates Cooperate for both players. As noted earlier, DM and RM strictly dominate DL and RL for both players in the Demand Game.
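The definition translates directly into code. The following sketch checks strict dominance in a general two-player game given by utility tables, using standard illustrative Prisoner's Dilemma payoffs (Table 3 is not reproduced here).

```python
# Player 1's utilities in the Prisoner's Dilemma, indexed by (a1, a2);
# the numbers are standard illustrative ones, not taken from Table 3.
PD_U1 = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
PD_U2 = {(a1, a2): PD_U1[(a2, a1)] for (a1, a2) in PD_U1}  # symmetric game

def strictly_dominates(ui, player, a, b, opp_actions):
    """u_i(a, a_-i) > u_i(b, a_-i) for every opponent action a_-i."""
    if player == 1:
        return all(ui[(a, o)] > ui[(b, o)] for o in opp_actions)
    return all(ui[(o, a)] > ui[(o, b)] for o in opp_actions)

# Defect strictly dominates Cooperate for both players:
assert strictly_dominates(PD_U1, 1, "D", "C", ["C", "D"])
assert strictly_dominates(PD_U2, 2, "D", "C", ["C", "D"])
```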
For any given game Γ = (A, u), we will call any game Γ' = (A', u') a subset game of Γ if A_i' ⊆ A_i for i = 1, …, n. Note that a subset game may assign different utilities to outcomes than the original game. For example, the game of Table 2 is a subset game of the Demand Game.
We say that some utility vector u(a) is a Pareto improvement on (or is Pareto-better than) u(â) if u_i(a) ≥ u_i(â) for i = 1, …, n. We will also denote this by u(a) ≥ u(â). Note that, contrary to convention, we allow u(a) = u(â). Whenever we require one of the inequalities to be strict, we will say that u(a) is a strict Pareto improvement on u(â). In a given game, we will also say that an outcome a is a Pareto improvement on another outcome â if u(a) ≥ u(â). We say that a utility vector is Pareto-optimal or Pareto-efficient relative to some set of utility vectors if there is no element of that set that strictly Pareto-dominates it.
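These comparisons are easy to mechanize. The sketch below implements the weak and strict Pareto comparisons exactly as defined, including the (unconventional) allowance of equal vectors.

```python
def pareto_geq(u_new, u_old):
    """Weak Pareto improvement: no player is worse off.
    As in the text, equality of the two vectors is allowed."""
    return all(x >= y for x, y in zip(u_new, u_old))

def strict_pareto_geq(u_new, u_old):
    """At least one player strictly better off, none worse off."""
    return pareto_geq(u_new, u_old) and any(x > y for x, y in zip(u_new, u_old))

assert pareto_geq((2, 2), (2, 2))        # allowed, unlike the usual convention
assert strict_pareto_geq((3, 2), (2, 2))
assert not pareto_geq((3, 1), (2, 2))    # player 2 is worse off
```

Note that a naive tuple comparison such as `(3, 1) >= (2, 2)` would be lexicographic in Python and therefore wrong here; the componentwise check is essential.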
Let Γ = (A, u) and Γ' = (A', u') be two n-player games. Then we call an n-tuple φ = (φ_1, …, φ_n) of functions φ_i: A_i → A_i' a (game) isomorphism between Γ and Γ' if there are vectors λ ∈ ℝ_{>0}^n and c ∈ ℝ^n such that

u_i'(φ_1(a_1), …, φ_n(a_n)) = λ_i u_i(a) + c_i

for all i and a ∈ A. If there is an isomorphism between Γ and Γ', we call Γ and Γ' isomorphic. For example, if we let Γ_s be the DM–RM part of the Demand Game and Γ' the subset game of Table 2, then Γ_s is isomorphic to Γ' via the isomorphism φ with φ_i(DM) = DL and φ_i(RM) = RL for both players and the constants λ_i = 1 and c_i = 0.
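A game isomorphism can be verified numerically by checking the affine relation at every outcome. The payoffs below are illustrative stand-ins for the DM–RM part of the Demand Game and the game of Table 2 (the paper's actual numbers are not reproduced here).

```python
# Illustrative payoffs (assumed): the militarized part of the Demand Game
# and the corresponding two-action game in the style of Table 2.
G = {("DM", "DM"): (0, 0), ("DM", "RM"): (3, 1),
     ("RM", "DM"): (1, 3), ("RM", "RM"): (2, 2)}
G2 = {("DL", "DL"): (0, 0), ("DL", "RL"): (3, 1),
      ("RL", "DL"): (1, 3), ("RL", "RL"): (2, 2)}
phi = {"DM": "DL", "RM": "RL"}  # applied to both players' actions

def is_isomorphism(G, G2, phi, lam=(1, 1), c=(0, 0)):
    """Check u'_i(phi(a)) == lam_i * u_i(a) + c_i at every outcome."""
    for (a1, a2), util in G.items():
        util2 = G2[(phi[a1], phi[a2])]
        if any(util2[i] != lam[i] * util[i] + c[i] for i in range(2)):
            return False
    return True

assert is_isomorphism(G, G2, phi)
```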
We consider a setting in which a given game is played through what we will call representatives. For example, the representatives could be humans whose behavior is determined or incentivized by some contract à la the principal–agent literature [28]. Our principals’ motivation for delegation is the same as in that literature (namely, the agent being in a better (epistemic) position to make the choice). However, the main question asked by the principal–agent literature is how to deal with agents that have their own preferences over outcomes, by constraining the agent’s choice [e.g. 21, 25], setting up appropriate payment schemes [e.g. 23, 29, 37, 53], etc. In contrast, we will throughout this paper assume that the agent has no conflicting incentives.
We imagine that one way in which the representatives can be instructed is to in turn play a subset game Γ' = (A', u') of the original game Γ, without necessarily specifying a strategy or algorithm for solving such a game. We emphasize, again, that u' is allowed to be a vector of entirely different utility functions. For any subset game Γ' of Γ, we denote by Π(Γ') the outcome that arises if the representatives play the subset game Γ' of Γ. Because it is unclear what the right choice is in many games, the original players might be uncertain about Π(Γ'). We will therefore model each Π(Γ') as a random variable. We will typically imagine that the representatives play Γ' in the usual simultaneous way, i.e., that they are not able to make further commitments or delegate again. For example, we imagine that if Γ' is the Prisoner's Dilemma, then Π(Γ') = (Defect, Defect) with certainty.
The original players trust their representatives to the extent that we take Π(Γ') to be the default way for the game to be played, for any subset game Γ'. That is, by default the original players tell their representatives to play the game Γ as given. For example, in the Demand Game, it is not clear what the right action is. Thus, if one can simply delegate the decision to someone with more relevant expertise, that is the first option one would consider.
We are interested in whether and how the original players can jointly Pareto-improve on the default. Of course, one option is to first compute the expected utilities under default delegation, i.e., to compute E[u(Π(Γ))]. The players could then let the representatives play a distribution over outcomes whose expected utilities exceed the default expected utilities. However, this is unrealistic if Γ is a complex game with potentially many Nash equilibria. For one, the precise point of delegation is that the original players are unable or unwilling to properly evaluate the game themselves. Second, there is no widely agreed upon, universal procedure for selecting an action in the face of equilibrium selection problems. In such cases, the original players may in practice be unable to form a probability distribution over Π(Γ). This type of uncertainty is sometimes referred to as Knightian uncertainty, following Knight's [26] distinction between the concepts of risk and uncertainty.
We address this problem in a typical way. Essentially, we require of any attempted improvement over the default that it incurs no regret in the worst case. That is, we are interested in subset games Γ' that are Pareto improvements with certainty under weak and purely qualitative assumptions about Π.^{2} In particular, in Section 4.4, we will introduce the assumptions that the representatives do not play strictly dominated actions and play isomorphic games isomorphically.
Definition 1. Let Γ' be a subset game of Γ. We say Γ' is a safe Pareto improvement (SPI) on Γ if u(Π(Γ')) ≥ u(Π(Γ)) with certainty. We say that Γ' is a strict SPI if furthermore there is a player i s.t. u_i(Π(Γ')) > u_i(Π(Γ)) with positive probability.
For example, in the introduction we have argued that the subset game in Table 2 is a strict SPI on the Demand Game (Table 1). Less interestingly, if we let Γ be the Prisoner's Dilemma (Table 3), then we would expect the subset game Γ' in which both players can only Cooperate to be an SPI on Γ. After all, we might expect that Π(Γ) = (Defect, Defect) with certainty, while it must be Π(Γ') = (Cooperate, Cooperate) with certainty, for lack of alternatives. Both players prefer mutual cooperation over mutual defection.
Both SPIs given above require both players to let their representatives choose from restricted strategy sets to maximize something other than the original players' utility functions.
Definition 2. We will call a subset game Γ' of Γ unilateral if for all players j but one player i it holds that A_j' = A_j and u_j' = u_j. Consequently, if a unilateral subset game of Γ is also an SPI on Γ, we call it a unilateral SPI.
We now give an example of a unilateral SPI using the Complicated Temptation Game. (We give the not-so-complicated Temptation Game – in which we can only give a trivial example of SPIs – in Section 4.5.) Two players each deploy a robot. Each of the robots faces two choices in parallel. First, each can choose whether to work on Project 1 or Project 2. Player 1 values Project 1 higher and Player 2 values Project 2 higher, but the robots are more effective if they work on the same project. To complete the task, the two robots need to share a resource. Robot 2 manages the resource and can choose whether to control Robot 1’s access tightly (e.g., by frequently checking on the resource, or requiring Robot 1 to demonstrate a need for the resource) or give Robot 1 relatively free access. Controlling access tightly decreases the efficiency of both robots, though the exact costs depend on which projects the robots are working on. Robot 1 can choose between using the resource as intended by Robot 2, or giving in to the temptation of trying to steal as much of the resource as possible to use it for other purposes. Regardless of what Robot 2 does (in particular, regardless of whether Robot 2 controls access or not), Player 1 prefers trying to steal. In fact, if Robot 2 controls access and Robot 1 refrains from theft, they never get anything done. Given that Robot 1 tries to steal, Player 2 prefers his Robot 2 to control access. As usual, we assume that the original players can instruct their robots to play arbitrary subset games of the original game (without specifying an algorithm for solving such a game) and that they can give such instructions conditional on the other player providing an analogous instruction.
We formalize this game as a normal-form game in Table 4. Each action consists of a number and a letter. The number indicates the project that the agent pursues. The letter indicates the agent’s policy towards the resource. In Player 2’s action labels, C indicates tight control over the resource, while F indicates free access. In Player 1’s action labels, T indicates giving in to the temptation to steal as much of the resource as possible, while R indicates refraining from doing so.
Player 1 has a unilateral SPI in the Complicated Temptation Game. Intuitively, if Player 1 commits to refrain, then Player 2 need not control the use of the resource. Thus, inefficiencies from conflict over the resource are avoided. However, Player 1’s utilities in the resulting game of choosing between Projects 1 and 2 are not isomorphic to the original game of choosing between Projects 1 and 2. The players might therefore worry that this new game will result in a worse outcome for them. For example, Player 2 might worry that in this new game the Project 1 equilibrium becomes more likely than the Project 2 equilibrium. To address this, Player 1 has to commit her representative to a different utility function that makes this new game isomorphic to the original game.
We now describe the unilateral SPI in formal detail. Player 1 can commit her representative to play only from and and to assign utilities , , , and ; otherwise does not differ from . The resulting SPI is given in Table 5. In this subset game, Player 2's representative – knowing that Player 1's representative will only play from and – will choose from and (since and strictly dominate and in Table 5). Now notice that the remaining subset game is isomorphic to the subset game of the original Complicated Temptation Game, where maps to and maps to for Player 1, and maps to and maps to for Player 2. Player 1's representative's utilities have been set to be the same between the two; and Player 2's utilities happen to be the same up to a constant between the two subset games. Thus, we might expect that if , then , and so on. Finally, notice that and so on. Hence, Table 5 is indeed an SPI on the Complicated Temptation Game.
Such unilateral changes are particularly interesting because they only require one of the players to be able to credibly delegate. That is, it is enough for a single player to instruct their representative to choose from a restricted action set to maximize a new utility function. The other players can simply instruct their representatives to play the game in the normal way (i.e., maximizing the respective players' original utility functions without restrictions on the action set). In fact, we may also imagine that only one player delegates at all, while the other players choose an action themselves, after observing the delegating player's instruction to her representative.
One may object that in a situation where only one player can credibly commit and the others cannot, the player who commits can simply play the meta game as a standard unilateral commitment (Stackelberg) game [as studied by, e.g., 11, 52, 59] or perhaps as a first mover in a sequential game (as solved by subgame-perfect equilibrium), without bothering with any (safe) Pareto conditions, i.e., without ensuring that all players are guaranteed a utility at least as high as their default. For example, in the Complicated Temptation Game, Player 1 could simply commit her representative to a particular action if she assumes that Player 2's representative will be instructed to best respond.
The Stackelberg sequential play perspective is appropriate in many cases. However, we think that in many cases the player with finegrained commitment ability cannot assume that the other players' representatives will simply best respond. Instead, players often need to consider the possibility of a hostile response if their commitment forces an unfair payoff on the other players. In such cases, unilateral SPIs are relevant.
The Ultimatum game is a canonical example in which standard solution concepts of sequential play fail to predict human behavior. In this game, subgame-perfect equilibrium has the second-moving player walk away with arbitrarily close to nothing. However, experiments show that people often resolve the game to an equal split, which is the symmetric equilibrium of the simultaneous version of the game [38].
A policy of retaliating against unfair payoffs imposed by a first mover's commitments can arise in a variety of ways within standard game-theoretic models. For one, we may imagine a scenario in which only one player has the fine-grained commitment and delegation abilities needed for SPIs but the other players can still credibly commit their representatives to retaliate against any “commitment trickery” that clearly leaves them worse off. We may also imagine that other players or representatives come into the scenario having already made such commitments. For example, many people appear credibly committed by intuitions about fairness and retributivist instincts and emotions [see, e.g., 44, Chapter 6, especially the section “The Doomsday Machine”]. Perhaps these features of human psychology allow human second players in the Ultimatum game to empirically outperform subgame-perfect equilibrium. Second, we may imagine that the players who cannot commit are subject to reputation effects. Then they might want to build a reputation of resisting coercion. In contrast, it is beneficial to have a reputation of accepting SPIs on whatever game would have otherwise been played.
So far, we have been vague about the details of the strategic situation that the original players face in instructing their representatives. From what sets of actions can they choose? How can they jointly let the representatives play some new subset game? Are SPIs Nash equilibria of the meta game played by the original players? If I instruct my representative to play the SPI of Table 2 in the Demand Game, could my opponent not instruct her representative to play DM?
In this section, we briefly describe one way to fill this gap by discussing the concept of program games and program equilibrium [46, Sect. 10.4, 55, 15, 5, 13, 36]. This section is essential to understanding why SPIs (especially omnilateral ones) are relevant. However, the remaining technical content of this paper does not rely on this section and the main ideas presented here are straightforward from previous work. We therefore only give an informal exposition. For formal detail, see Appendix A.
For any game Γ, the program equilibrium literature considers the following meta game. First, each player writes a computer program. Each program then receives as input a vector containing everyone else's chosen program. Each player i's program then returns an action from A_i, player i's set of actions in Γ. Together these actions then form an outcome of the original game. Finally, the utilities are realized according to the utility function of Γ. The meta game can be analyzed like any other game. Its Nash equilibria are called program equilibria. Importantly, the program equilibria can implement payoffs not implemented by any Nash equilibria of Γ itself. For example, in the Prisoner’s Dilemma, both players can submit a program that says: “If the opponent’s chosen computer program is equal to this computer program, Cooperate; otherwise Defect.” [33, 22, 46, Sect. 10.4, 55] This is a program equilibrium which implements mutual cooperation.
In the setting for our paper, we similarly imagine that each player i can write a program that in turn chooses from A_i. However, the types of programs that we have in mind here are more sophisticated than those typically considered in the program equilibrium literature. Specifically, we imagine that the programs are executed by intelligent representatives who are themselves able to competently choose an action for player i in any given game Γ', without the original player having to describe how this choice is to be made. The original player may not even understand much about this program other than that it generally plays well. Thus, in addition to the elementary instructions used in a typical computer program (branches, comparisons, arithmetic operations, return, etc.), we allow player i to use instructions of the type “Play Γ'” in the program she submits. This instruction lets the representative choose and return an action for the game Γ'. Apart from the addition of this instruction type, we imagine the set of instructions to be the same as in the program equilibrium literature. To jointly let the representatives play, e.g., the SPI of Table 2 on the Demand Game of Table 1, the representatives can both use an instruction that says, “If the opponent's chosen program is equal to this one, play the game of Table 2; otherwise play the Demand Game”. Assuming some minimal rationality requirements on the representatives (i.e., on how the representative resolves the “Play Γ'” instruction), this is a Nash equilibrium. Figure 1 illustrates how (in the two-player case) the meta game between the original players is intended to work.
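The conditional instruction can be sketched as follows. The representative is modeled as a black-box `choose` function; the stub used here (pick the alphabetically first action) is a placeholder assumption, not a claim about how real representatives resolve a “Play” instruction, and the payoffs are illustrative.

```python
# Illustrative payoffs (assumed). The SPI game restricts both players to the
# unarmed actions; the default is abbreviated here to its militarized part.
SPI_GAME = {("DL", "DL"): (0, 0), ("DL", "RL"): (3, 1),
            ("RL", "DL"): (1, 3), ("RL", "RL"): (2, 2)}
DEFAULT_GAME = {("DM", "DM"): (0, 0), ("DM", "RM"): (3, 1),
                ("RM", "DM"): (1, 3), ("RM", "RM"): (2, 2)}

def choose(game, player):
    # Stand-in for the intelligent representative resolving a "Play" instruction:
    # deterministically picks the alphabetically first available action.
    return min(outcome[player] for outcome in game)

def execute(program, player, opponent_program):
    """Program = (spi_game, default_game): play the SPI game iff the
    opponent submitted the very same program, else the default game."""
    spi, default = program
    game = spi if opponent_program == program else default
    return choose(game, player)

program = (SPI_GAME, DEFAULT_GAME)
outcome = (execute(program, 0, program), execute(program, 1, program))
assert outcome in SPI_GAME  # both representatives end up playing the SPI game
```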
For illustration, consider the following two real-world instantiations of this setup. First, we might imagine that the original players hire human representatives. Each player specifies by some contract, e.g., via monetary incentives, how she wants her representative to act. For example, a player might contract her representative to play a particular action; or she might specify in her contract a function over outcomes according to which she will pay the representative after an outcome is obtained. Moreover, these contracts might refer to one another. For example, Player 1's contract with her representative might specify that if Player 2 and his representative use an analogous contract, then she will pay her representative according to Table 2. As a second, more futuristic scenario, we could imagine that the representatives are software agents whose goals are specified by so-called smart contracts, i.e., computer programs implemented on a blockchain to be publicly verifiable [8, 47].
To justify our study of SPIs, we prove that every SPI is played in some program equilibrium:
Theorem 1. Let Γ be a game and Γ' be an SPI on Γ. Now consider a program game on Γ, where each player can choose from a set of computer programs that output actions for Γ. In addition to the normal kind of instructions, we allow the use of the command “Play Γ̂” for any subset game Γ̂ of Γ. Finally, assume that Γ' guarantees each player at least that player's minimax utility (a.k.a. threat point) in the base game Γ. Then Γ' is played in a program equilibrium, i.e., in a Nash equilibrium of the program game.
We prove this in Appendix A.
As an alternative to having the original players choose contracts separately, we could imagine the use of jointly signed contracts which only come into effect once signed by all players [cf. 24, 34]. Another approach to bilateral commitment was pursued by Raub [45] based on earlier work by Sen [51]. Raub and Sen use preference modification as a mechanism for commitment. For example, in the Prisoner’s Dilemma, each player can separately instruct their representative to prefer cooperating over defecting if and only if the opponent also cooperates. If both players use this instruction, then mutual cooperation becomes the unique Pareto-optimal Nash equilibrium. On the other hand, if only one player instructs their representative to adopt these preferences and the other maintains the usual Prisoner’s Dilemma preferences, the unique equilibrium remains mutual defection. Thus, the preference modification is used to commit to cooperating conditional on the other player making an analogous commitment. Because this is slightly confusing in the context of our work – seeing as our work involves both modifying one’s preferences and mutual commitment, but generally without using the former as a means to the latter – we discuss Raub’s and Sen’s work and its relation to ours in more detail in Appendix B.
For sets X and Y, a multivalued function Φ: X → Y is a function which maps each element x ∈ X to a set Φ(x) ⊆ Y. For a subset X̂ ⊆ X, we define

Φ(X̂) := ⋃_{x ∈ X̂} Φ(x).

Note that Φ(∅) = ∅ and that Φ(X̂_1) ⊆ Φ(X̂_2) whenever X̂_1 ⊆ X̂_2. For any set X, we define the identity function id_X: X → X by id_X(x) := {x}. Also, for two sets X and Y, we define triv_{X,Y}: X → Y by triv_{X,Y}(x) := Y. We define the inverse

Φ^{-1}: Y → X, Φ^{-1}(y) := {x ∈ X | y ∈ Φ(x)}.

Note that (Φ^{-1})^{-1} = Φ for any multivalued function Φ. For sets X, Y, and Z and functions Φ: X → Y and Ψ: Y → Z, we define the composite Ψ ∘ Φ: X → Z by (Ψ ∘ Φ)(x) := Ψ(Φ(x)). As with regular functions, composition of multivalued functions is associative. We say that Φ is single-valued if |Φ(x)| = 1 for all x ∈ X. Whenever a multivalued function is single-valued, we can apply many of the terms for regular functions. For example, we will take injectivity, surjectivity, and bijectivity for single-valued functions to have the usual meaning. We will never apply these notions to non-single-valued functions.
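These operations on multivalued functions have a direct finite implementation, with a function represented as a dict from elements to sets.

```python
# Multivalued functions as dicts mapping each element to a set of values.
def image(phi, subset):
    """Image of a subset: the union of the images of its elements."""
    return set().union(*(phi[x] for x in subset)) if subset else set()

def inverse(phi, codomain):
    """Inverse: y maps to every x whose image contains y."""
    return {y: {x for x in phi if y in phi[x]} for y in codomain}

def compose(psi, phi):
    """Composite (psi ∘ phi)(x) = psi(phi(x))."""
    return {x: image(psi, phi[x]) for x in phi}

def single_valued(phi):
    return all(len(values) == 1 for values in phi.values())

phi = {1: {"a"}, 2: {"a", "b"}}
assert image(phi, {1, 2}) == {"a", "b"}
assert inverse(phi, {"a", "b"}) == {"a": {1, 2}, "b": {2}}
assert not single_valued(phi)
```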
In this section, we introduce a notion of outcome correspondence, which we will see is essential to constructing SPIs.
Definition 3. Consider two games Γ = (A, u) and Γ' = (A', u'). We write Γ ∼_Φ Γ' for a multivalued function Φ: A → A' if Π(Γ') ∈ Φ(Π(Γ)) with certainty.
Note that Γ ∼_Φ Γ' is a statement about Π, i.e., about how the representatives choose. Whether such a statement holds generally depends on the specific representatives being used. In Section 4.4, we describe two general circumstances under which it seems plausible that Γ ∼_Φ Γ'. For example, if two games Γ and Γ' are isomorphic, then one might expect Γ ∼_φ Γ', where φ is the isomorphism between the two games.
We now illustrate this notation using our discussion from the Demand Game. Let Γ be the Demand Game of Table 1. First, it seems plausible that Γ is in some sense equivalent to Γ_s, where Γ_s is the game that results from removing DL and RL for both players from Γ. Again, strict dominance could be given as an argument. We can now formalize this as Γ ∼_Φ Γ_s, where Φ(a) = {a} if a ∈ {DM, RM} × {DM, RM} and Φ(a) = ∅ otherwise. Next, it seems plausible that Γ_s ∼_φ Γ', where Γ' is the game of Table 2 and φ is the isomorphism between Γ_s and Γ'.
We now state some basic facts about the relation ∼, many of which we will use throughout this paper.
Lemma 2. Let Γ = (A, u), Γ' = (A', u'), and Γ'' = (A'', u'') be games, and let Φ: A → A' and Ψ: A' → A'' be multivalued functions. Then:
1. Γ ∼_{id_A} Γ.
2. If Γ ∼_Φ Γ', then Γ' ∼_{Φ^{-1}} Γ.
3. If Γ ∼_Φ Γ' and Γ' ∼_Ψ Γ'', then Γ ∼_{Ψ∘Φ} Γ''.
4. If Γ ∼_Φ Γ' and Φ(a) ⊆ Φ̂(a) for all a ∈ A for some multivalued function Φ̂: A → A', then Γ ∼_{Φ̂} Γ'.
5. Γ ∼_{triv_{A,A'}} Γ'.
6. If Γ ∼_Φ Γ' and Π(Γ') ∈ Â' with certainty for some Â' ⊆ A', then Γ ∼_{Φ∩Â'} Γ', where (Φ ∩ Â')(a) := Φ(a) ∩ Â'.
7. If Π(Γ) ∈ Â with certainty for some Â ⊆ A, then Γ ∼_{Φ_Â} Γ, where Φ_Â(a) := {a} ∩ Â.
Proof. 1. By reflexivity of equality, Π(Γ) = Π(Γ) with certainty. Hence, Π(Γ) ∈ {Π(Γ)} = id_A(Π(Γ)) by definition of id_A. Therefore, Γ ∼_{id_A} Γ by definition of ∼, as claimed.

2. Γ ∼_Φ Γ' means that Π(Γ') ∈ Φ(Π(Γ)) with certainty. Thus,

Π(Γ) ∈ {x ∈ A | Π(Γ') ∈ Φ(x)} = Φ^{-1}(Π(Γ')),

where equality is by the definition of the inverse of multivalued functions. We conclude (by definition of ∼) that Γ' ∼_{Φ^{-1}} Γ as claimed.

3. If Γ ∼_Φ Γ' and Γ' ∼_Ψ Γ'', then by definition of ∼, (i) Π(Γ') ∈ Φ(Π(Γ)) and (ii) Π(Γ'') ∈ Ψ(Π(Γ')), both with certainty. The former (i) implies {Π(Γ')} ⊆ Φ(Π(Γ)). Hence,

Ψ(Π(Γ')) ⊆ Ψ(Φ(Π(Γ))) = (Ψ ∘ Φ)(Π(Γ)).

With (ii), it follows that Π(Γ'') ∈ (Ψ ∘ Φ)(Π(Γ)) with certainty. By definition, Γ ∼_{Ψ∘Φ} Γ'' as claimed.

4. It is

Π(Γ') ∈ Φ(Π(Γ)) ⊆ Φ̂(Π(Γ))

with certainty. Thus, by definition Γ ∼_{Φ̂} Γ'.

5. By definition of triv_{A,A'}, it is triv_{A,A'}(Π(Γ)) = A' with certainty. By definition of Π(Γ'), it is Π(Γ') ∈ A' with certainty. Hence, Π(Γ') ∈ triv_{A,A'}(Π(Γ)) with certainty. We conclude that Γ ∼_{triv_{A,A'}} Γ' as claimed.

6. With certainty, Π(Γ') ∈ Φ(Π(Γ)) (by assumption). Also, with certainty Π(Γ') ∈ Â'. Hence, Π(Γ') ∈ Φ(Π(Γ)) ∩ Â' = (Φ ∩ Â')(Π(Γ)) with certainty. We conclude that Γ ∼_{Φ∩Â'} Γ'.

7. By reflexivity of ∼ (Lemma 2.1), Γ ∼_{id_A} Γ. Since Π(Γ) ∈ Â with certainty, it follows by Lemma 2.6 that Γ ∼_{id_A ∩ Â} Γ, and (id_A ∩ Â)(a) = {a} ∩ Â = Φ_Â(a) for all a ∈ A.
Items 1–3 show that ∼ has properties resembling those of an equivalence relation. Note, however, that since ∼ is not a binary relationship, ∼ itself cannot be an equivalence relation in the usual sense. We can construct equivalence relations, though, by existentially quantifying over the multivalued function. For example, we might define an equivalence relation ≈ on games, where Γ ≈ Γ' if and only if there is a single-valued bijection Φ such that Γ ∼_Φ Γ'.^{3}
Item 4 states that if we can make an outcome correspondence claim less precise, it will still hold true. Item 5 states that in the extreme, it is always Γ ∼_{triv_{A,A'}} Γ', where triv_{A,A'} is the trivial, maximally imprecise outcome correspondence function that confers no information. Item 6 shows that ∼ can be used to express the elimination of outcomes, i.e., the belief that a particular outcome (or strategy) will never occur.
Besides an equivalence relation, we can also use ∼ with quantification over the respective outcome correspondence function to construct (non-symmetric) preorders over games, i.e., relations that are transitive and reflexive (but not symmetric or antisymmetric). Most importantly, we can construct a preorder ⪯ on games where Γ ⪯ Γ' if Γ ∼_Φ Γ' for a Φ that always increases every player's utilities.
We now show that, as advertised, outcome correspondence is closely tied to SPIs. The following theorem shows not only how outcome correspondences can be used to find (and prove) SPIs. It also shows that any SPI requires an outcome correspondence relation via a Pareto-improving correspondence function.
Definition 4. Let Γ be a game and Γ' be a subset game of Γ. Further let Φ: A → A' be such that Γ ∼_Φ Γ'. We call Φ a Pareto-improving outcome correspondence (function) if u(a') ≥ u(a) for all a ∈ A and all a' ∈ Φ(a).
Theorem 3. Let Γ be a game and Γ' be a subset game of Γ. Then Γ' is an SPI on Γ if and only if there is a Pareto-improving outcome correspondence from Γ to Γ'.
Proof. ⇐: By definition of ∼, Π(Γ') ∈ Φ(Π(Γ)) with certainty. Hence, by the assumption that Φ is Pareto-improving,

u_i(Π(Γ')) ≥ u_i(Π(Γ)) for i = 1, …, n

with certainty. Thus, Γ' is an SPI on Γ.
⇒: Assume that u(Π(Γ')) ≥ u(Π(Γ)) with certainty. We define

Φ: A → A', Φ(a) := {a' ∈ A' | u(a') ≥ u(a)}.

It is immediately obvious that Φ is Pareto-improving as required. Also, whenever Π(Γ) = a and Π(Γ') = a' for any a ∈ A and a' ∈ A', it is (by assumption) u(a') ≥ u(a) with certainty. Thus, by definition of Φ, it holds that a' ∈ Φ(a). We conclude that Γ ∼_Φ Γ' as claimed.
Note that the theorem concerns weak SPIs and therefore allows the case where with certainty . To show that some is a strict SPI, we need additional information about which outcomes occur with positive probability. This, too, can be expressed via our outcome correspondence relation. However, since this is cumbersome, we will not formally address strictness in detail, to keep things simple.^{4}
We now illustrate how outcome correspondences can be used to derive the SPI for the Demand Game from the introduction as per Theorem 3. Of course, at this point we have not made any assumptions about when games are equivalent. We will introduce some in the following section. Nevertheless, we can already sketch the argument using the specific outcome correspondences that we have given intuitive arguments for. Let again be the Demand Game of Table 1. Then, as we have argued, , where is the game that results from removing and for both players; and if and otherwise. In a second step, , where is the game of Table 2 and is the isomorphism between and . Finally, transitivity (Lemma 2.3) implies that . To see that is Pareto-improving for the original utility functions of , notice that does not change utilities at all. The correspondence function maps the conflict outcome onto the outcome , which is better for both original players. Other than that, , too, does not change the utilities. Hence, is Pareto-improving. By Theorem 3, is therefore an SPI on .
In principle, Theorem 3 does not hinge on and resulting from playing games. An analogous result holds for any random variables over and . In particular, this means that Theorem 3 applies also if the representatives receive other kinds of instructions (cf. Section 3.2). However, it seems hard to establish nontrivial outcome correspondences between and other types of instructions. Still, more complicated instructions can be used to derive different kinds of SPIs. For example, if there are different game SPIs, then the original players could tell their representatives to randomize between them in a coordinated way.
To make any claims about how the original players should play the metagame, i.e., about what instructions they should submit, we generally need to make assumptions about how the representatives choose and (by Theorem 3) about outcome correspondence in particular.^{5} We here make two fairly weak assumptions.
Our first assumption is that the representatives never play strictly dominated actions and that removing them does not affect what the representatives would choose.
Assumption 1. Let be an arbitrary player game where are pairwise disjoint, and let be strictly dominated by some other strategy in . Then , where for all , and whenever .
Assumption 1 expresses that representatives should never play strictly dominated strategies. Moreover, it states that we can remove strictly dominated strategies from a game and the resulting game will be played in the same way by the representatives. For example, this implies that when evaluating a strategy , the representatives do not take into account how many other strategies strictly dominates. Assumption 1 also allows (via Transitivity of as per Lemma 2.3) the iterated removal of strictly dominated strategies. The notion that we can (iteratively) remove strictly dominated strategies is common in game theory [41, 27, 39, Section 2.9, Chapter 12] and has rarely been questioned. It is also implicit in the solution concept of Nash equilibrium – if a strategy is removed by iterated strict dominance, that strategy is played in no Nash equilibrium. However, like the concept of Nash equilibrium, the elimination of strictly dominated strategies becomes implausible if the game is not played in the usual way. In particular, for Assumption 1 to hold, we will in most games have to assume that the representatives cannot in turn make credible commitments (or delegate to further subrepresentatives) or play the game iteratively [4].
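To make the elimination procedure that Assumption 1 licenses concrete, here is a minimal sketch in Python. The function name is ours, and for simplicity it only checks dominance by pure strategies, not dominance by mixed strategies (which the general setting also allows):

```python
import numpy as np

def iterated_strict_dominance(u1, u2):
    """Iteratively remove pure strategies that are strictly dominated
    by another pure strategy in a 2-player normal-form game.

    u1, u2: payoff matrices for Players 1 and 2 (rows = Player 1's
    strategies, columns = Player 2's). Returns the indices of the
    surviving rows and columns.
    """
    rows = list(range(u1.shape[0]))
    cols = list(range(u1.shape[1]))
    changed = True
    while changed:
        changed = False
        # Row r is dominated if some other row beats it against every surviving column.
        for r in rows[:]:
            if any(all(u1[r2, c] > u1[r, c] for c in cols)
                   for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
        # Column c is dominated analogously for Player 2.
        for c in cols[:]:
            if any(all(u2[r, c2] > u2[r, c] for r in rows)
                   for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
    return rows, cols
```

By the path-independence results for strict dominance cited below [1, 19, 41], the order in which the loop removes strategies does not affect the surviving set. On the Prisoner's Dilemma with the usual payoffs, only the mutual-defection outcome survives.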
Our second assumption is that the representatives play isomorphic games isomorphically when those games are fully reduced.
Assumption 2. Let and be two games that do not contain strictly dominated actions. If and are isomorphic, then there exists an isomorphism between and such that .
Similar desiderata have been discussed in the context of equilibrium selection, e.g., by Harsanyi and Selten [20, Chapter 3.4] [cf. 56, for a discussion in the context of fully cooperative multiagent reinforcement learning].
Note that if there are multiple game isomorphisms, then we assume outcome correspondence for only one of them. This is necessary for the assumption to be satisfiable in the case of games with action symmetries. (Of course, such games are not the focus of this paper.) For example, let be Rock–Paper–Scissors. Then is isomorphic to itself via the function that for both players maps Rock to Paper, Paper to Scissors, and Scissors to Rock. But if it were , then this would mean that if the representatives play Rock in Rock–Paper–Scissors, they play Paper in Rock–Paper–Scissors. Contradiction! We will argue for the consistency of our version of the assumption in Section 4.4.3. Notice also that we make the assumption only for reduced games. This relates to the previous point about action-symmetric games. For example, consider two versions of Rock–Paper–Scissors and assume that in both versions both players have an additional strictly dominated action that breaks the action symmetries, e.g., the action “resign and give the opponent if they play Rock/Paper”. Then there would only be one isomorphism between these two games (which maps Rock to Paper, Paper to Scissors, and Scissors to Rock for both players). However, in light of Assumption 1, it seems problematic to assume that these strictly dominated actions restrict the outcome correspondences between these two games.^{6}
One might worry that reasoning about the existence of multiple isomorphisms renders it intractable to deal with outcome correspondences as implied by Assumption 2, and in particular that it might make it impossible to tell whether a particular game is an SPI. However, one can intuitively see that the different isomorphisms between two games perform analogous operations. In particular, it turns out that if one isomorphism is Pareto-improving, then they all are:
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
We prove Lemma 4 in Appendix C.
Lemma 4 will allow us to conclude from the existence of a Pareto-improving isomorphism that there is a Pareto-improving s.t. by Assumption 2, even if there are multiple isomorphisms between and . In the following, we can therefore afford to be lax about our ignorance (in some games) about which outcome isomorphism induces outcome equivalence. We will therefore generally write “ by Assumption 2” as short for “ is a game isomorphism between and and hence by Assumption 2 there exists an isomorphism such that .”
One could criticize Assumption 2 by referring to focal points (introduced by Schelling [49, 48, pp. 54–58] [cf., e.g., 30, 18, 54, 9]) as an example where context and labels of strategies matter. A possible response might be that in games where context plays a role, that context should be included as additional information and not be considered part of . Assumption 2 would then either not apply to such games with (relevant) context or would require one to, in some way, translate the context along with the strategies. However, in this paper we will not formalize context, and assume that there is no decision-relevant context.
We will now argue that there exist representatives that indeed satisfy Assumptions 1 and 2, both to provide intuition and because our results would not be valuable if Assumptions 1 and 2 were inconsistent. We will only sketch the argument informally. To make the argument formal, we would need to specify in more detail what the set of games looks like and in particular what the objects of the action sets are.
Imagine that for each player there is a book^{7} that on each page describes a normalform game that does not have any strictly dominated strategies. The actions have consecutive integer labels. Importantly, the book contains no pair of games that are isomorphic to each other. Moreover, for every fully reduced game, the book contains a game that is isomorphic to this game. (Unless we strongly restrict the set of games under consideration, the book must therefore have infinitely many pages.) We imagine that each player's book contains the same set of games. On each page, the book for Player recommends one of the actions of Player to be taken deterministically.^{8}
Each representative owns a potentially different version of this book and uses it as follows to play a given game . First the given game is fully reduced by iterated strict dominance to obtain a game . They then look up the unique game in the book that is isomorphic to and map the action labels in onto the integer labels of the game in the book via some isomorphism. If there are multiple isomorphisms from to the relevant page in the book, then all representatives decide between them using the same deterministic procedure. Finally they choose the action recommended by the book.
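The lookup procedure can be mimicked without a literal book by canonicalizing the reduced game, so that any two isomorphic games land on the same “page”. Below is a brute-force sketch for tiny two-player games. The function name and the lexicographic tie-breaking are our own; it handles relabelings of each player's actions only, and game isomorphism checking is computationally nontrivial in general (cf. Section 5):

```python
from itertools import permutations
import numpy as np

def canonical_page(u1, u2):
    """Return a canonical representative of a 2-player game under
    relabelings of each player's actions: the lexicographically
    smallest pair of payoff matrices over all row/column permutations.

    Two games that are isomorphic in this relabeling sense share the
    same canonical page, so representatives using this procedure
    coordinate on the same recommendation. Exponential in the number
    of actions; a toy for tiny games only.
    """
    best = None
    m, n = u1.shape
    for pr in permutations(range(m)):
        for pc in permutations(range(n)):
            # Apply the relabeling to both players' payoff matrices.
            key = (u1[np.ix_(pr, pc)].tolist(), u2[np.ix_(pr, pc)].tolist())
            if best is None or key < best:
                best = key
    return best
```

Deterministic tie-breaking (here, lexicographic order) plays the role of the shared procedure by which all representatives pick the same isomorphism when several exist.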
It remains to show that a pair of representatives thus specified satisfies Assumptions 1 and 2. We first argue that Assumption 1 is satisfied. Let be a game and let be a game that arises from removing a strictly dominated action from . By the well-known path independence of iterated elimination of strictly dominated strategies [1, 19, 41], fully reducing and results in the same game. Hence, the representatives play the same actions in and .
Second, we argue that Assumption 2 is satisfied. Let us say and are fully reduced and isomorphic. Then it is easy to see that each player plays and based on the same page of their book. Let the game on that book page be . Let and be the bijections used by the representatives to translate actions in and , respectively, to labels in . Then if the representatives take actions in , the actions are the ones specified by the book for , and hence the actions are played in . Thus . It is easy to see that is a game isomorphism between and .
One could try to use principles other than Assumptions 1 and 2. We here give some considerations. First, game theorists have also considered the iterated elimination of weakly dominated strategies [17, 31, Section 4.11]. Unfortunately, the iterated removal of weakly dominated strategies is path-dependent [27, Section 2.7.B, 7, Section 5.2, 39, Section 12.3]. That is, for some games, iterated removal of weakly dominated strategies can lead to different subset games, depending on which weakly dominated strategy one chooses to eliminate at any stage. A straightforward extension of Assumption 1 to allow the elimination of weakly dominated strategies would therefore be inconsistent in such games, which can be seen as follows.
Work on the path dependence of iterated removal of weakly dominated strategies has shown that there are games with two different outcomes such that by iterated removal of weakly dominated strategies from , we can obtain both and . If we had an assumption analogous to Assumption 1 but for weak dominance, then (with Lemma 2.3 – transitivity), we would obtain both that and that , where for all and for all . The former would mean (by Lemma 2.6) that for all we have that with certainty; while the latter would mean that we have that with certainty. But jointly this means that for all , we have that with certainty, which cannot be the case as by definition. Thus, we cannot make an assumption analogous to Assumption 1 for weak dominance.
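The path dependence can be made concrete with a standard textbook-style example (our own choice of game, not taken from the text above). In the 3×2 game below, both T and B are weakly dominated by M; eliminating T first ultimately leaves Player 1 with payoff 2, while eliminating B first leaves her with payoff 1:

```python
import numpy as np

# Rows T, M, B for Player 1; columns L, R for Player 2.
u1 = np.array([[1, 0],
               [1, 2],
               [0, 2]])
u2 = np.array([[1, 0],
               [1, 1],
               [0, 1]])

def weakly_dominated(u, strategies, opp, axis):
    """Indices in `strategies` weakly dominated by another element of
    `strategies`, given the opponent's surviving strategies `opp`.
    axis=0: row player (payoff u[s, o]); axis=1: column player (u[o, s])."""
    pay = (lambda s, o: u[s, o]) if axis == 0 else (lambda s, o: u[o, s])
    return [s for s in strategies
            if any(all(pay(s2, o) >= pay(s, o) for o in opp)
                   and any(pay(s2, o) > pay(s, o) for o in opp)
                   for s2 in strategies if s2 != s)]

# Both T (row 0) and B (row 2) are weakly dominated by M (row 1):
assert weakly_dominated(u1, [0, 1, 2], [0, 1], axis=0) == [0, 2]

# Path 1: remove T; then L is weakly dominated for Player 2,
# leaving only column R, where Player 1 earns 2.
assert weakly_dominated(u2, [0, 1], [1, 2], axis=1) == [0]

# Path 2: remove B; then R is weakly dominated for Player 2,
# leaving only column L, where Player 1 earns 1.
assert weakly_dominated(u2, [0, 1], [0, 1], axis=1) == [1]
```

The two elimination orders thus terminate in subset games with different payoffs for Player 1, which is exactly the inconsistency described above.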
As noted above, the iterated removal of strictly dominated strategies, on the other hand, is path-independent, and in the 2-player case always eliminates exactly the non-rationalizable strategies [1, 19, 41]. Many other dominance concepts have been shown to have path independence properties. For an overview, see Apt [1]. We could have made an independence assumption based on any of these path-independent dominance concepts. For example, elimination of strategies that are strictly dominated by a mixed strategy (or, equivalently, of so-called never-best responses) is also path-independent [40, Section 4.2].
With Assumptions 1 and 2, all our outcome correspondence functions are either 1-to-1 or 1-to-0. Other elimination assumptions could involve the use of many-to-1 or even many-to-many functions. In general, such functions are needed when a strategy can be eliminated to obtain a strategically equivalent game, but may still be played in the original game. The simplest example would be the elimination of payoff-equivalent strategies. Imagine that in some game for all opponent strategies it is the case that and that there are no other strategies that are similarly payoff-equivalent to and . Then one would assume that , where maps onto and otherwise is just the identity function. As an example, imagine a variant of the Demand Game in which Player 1 has an additional action that results in the same payoffs as for both players against Player 2's and but potentially slightly different payoffs against and . With our current assumptions we would be unable to derive a nontrivial SPI for this game. However, with an assumption about the elimination of duplicate actions in hand, we could (after removing and as usual) remove or and thereby derive the usual SPI. Many-to-1 elimination assumptions can also arise from some dominance concepts if they have weaker path independence properties. For example, iterated elimination by so-called nice weak dominance [32] is only path-independent up to strategic equivalence. Like the assumption about payoff-equivalent strategies, an elimination assumption based on nice weak dominance therefore cannot assume that the eliminated action is not played in the original game at all.
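A many-to-1 elimination step for payoff-equivalent duplicate actions of this kind is straightforward to implement; here is a sketch for Player 1's actions (the function name is ours):

```python
import numpy as np

def merge_duplicate_rows(u1, u2):
    """Keep only the first of any set of Player 1 actions that are
    payoff-equivalent, i.e., give BOTH players identical payoffs
    against every Player 2 action.

    Returns the surviving row indices. The induced outcome
    correspondence is many-to-1: a removed row maps onto its
    earlier payoff-equivalent twin.
    """
    keep = []
    for r in range(u1.shape[0]):
        # r is a duplicate if some kept row matches it in both payoff matrices.
        if not any(np.array_equal(u1[r], u1[k]) and np.array_equal(u2[r], u2[k])
                   for k in keep):
            keep.append(r)
    return keep
```

Unlike strict-dominance elimination, a merged action may still be played in the original game, which is exactly why the correspondence here cannot be 1-to-1 or 1-to-0.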
In this section, we use Lemma 2, Theorem 3, and Assumptions 1 and 2 to formally prove a few SPIs.
Proposition (Example) 5. Let be the Prisoner's Dilemma (Table 3) and be any subset game of with . Then under Assumption 1, is a strict SPI on .
Proof. By applying Assumption 1 twice and Transitivity once, , where and and for all . By Lemma 2.5, we further obtain , where is as described in the proposition. Hence, by transitivity, . It is easy to verify that the function is Pareto-improving.
Proposition (Example) 6. Let be the Demand Game of Table 1 and be the subset game described in Table 2. Under Assumptions 1 and 2, is an SPI on . Further, if , then is a strict SPI.
Proof. Let . We can repeatedly apply Assumption 1 to eliminate from the strategies and for both players. We can then apply Lemma 2.3 (Transitivity) to obtain , where and
Next, by Assumption 2, , where and for . We can then apply Lemma 2.3 (Transitivity) again, to infer . It is easy to verify that for all , it is for all the case that .
Next, we give two examples of unilateral SPIs. We start with an example that is trivial in that the original player instructs her representatives to take a specific action. We then give the SPI for the Complicated Temptation Game as a nontrivial example.
Consider the Temptation Game given in Table 6. In this game, Player 1's (for Temptation) strictly dominates . Once is removed, Player 2 prefers . Hence, this game is strict-dominance solvable to . Player 1 can safely Pareto-improve on this result by telling her representative to play , since Player 2's best response to is and . We now show this formally.
Proposition (Example) 7. Let be the game of Table 6. Under Assumption 1, is a strict SPI on .
Proof. First consider . We can apply Assumption 1 to eliminate Player 1's and then apply Assumption 1 again to the resulting game to also eliminate Player 2's . By transitivity, we find , where and and .
Next, consider . We can apply Assumption 1 to remove Player 2's strategy and find , where and and .
Third, by Lemma 2.5, where .
Finally, we can apply transitivity to conclude , where . It is easy to verify that and . Hence, is Pareto-improving and so by Theorem 3, is an SPI on .
Note that in this example, Player 1 simply commits to a particular strategy and Player 2 maximizes their utility given Player 1's choice. Hence, this SPI can be justified with much simpler unilateral commitment setups [11, 52, 59]. For example, if the Temptation Game were played as a sequential game in which Player 1 plays first, its unique subgame-perfect equilibrium is .
In Table 4 we give the Complicated Temptation Game, which better illustrates the features specific to our setup. Roughly, it is an extension of the simpler Temptation Game of Table 6. In addition to choosing versus and versus , the players also have to make an additional choice (1 versus 2), which is difficult in that it cannot be solved by strict dominance. As we have argued in Section 3.1, the game in Table 5 is a unilateral SPI on Table 4. We can now show this formally.
Proposition (Example) 8. Let be the Complicated Temptation Game (Table 4) and be the subset game in Table 5. Under Assumptions 1 and 2, is a unilateral SPI on .
Proof. In , for Player 1, and strictly dominate and . We can thus apply Assumption 1 to eliminate Player 1's and . In the resulting game, Player 2's and strictly dominate and , so one can apply Assumption 1 again to the resulting game to also eliminate Player 2's and . By transitivity, we find , where and
Next, consider (Table 5). We can apply Assumption 1 to remove Player 2's strategies and and find , where and
Third, by Assumption 2, where decomposes into and , corresponding to the two players, respectively, where and for .
Finally, we can apply transitivity and the rule about symmetry and inverses (Lemma 2.2) to conclude . It is easy to verify that is Pareto-improving.
In this section, we ask how computationally costly it is for the original players to identify for a given game a nontrivial SPI . Of course, the answer to this question depends on what the original players are willing to assume about how their representatives act. For example, if only trivial outcome correspondences (as per Lemmas 2.1 and 2.5) are assumed, then the decision problem is easy. Similarly, if for given is hard to decide (e.g., because it requires solving for the Nash equilibria of and ), then this could trivially also make the safe Pareto improvement problem hard to decide. We are specifically interested in deciding whether a given game has a nontrivial SPI that can be proved using only Assumptions 1 and 2, the general properties of game correspondence (in particular Transitivity (Lemma 2.3) and Symmetry (Lemma 2.2)), and Theorem 3.
Definition 5. The SPI decision problem consists in deciding for any given , whether there is a game and a sequence of outcome correspondences and a sequence of subset games of s.t.:
Many variants of this problem may be considered. For example, to match Definition 1, the definition of the strict SPI problem assumes that all outcomes that survive iterated elimination occur with positive probability. Alternatively, we could have required that for demonstrating strictness, there must be a player such that for all that survive iterated elimination, . Similarly, one may wish to find SPIs that are strict improvements for all players. We may also wish to allow the use of the elimination of duplicate strategies (as described in Section 4.4.4) or trivial outcome correspondence steps as per Lemma 2.5. These modifications would not change the computational complexity of the problem, nor would they require new proof ideas. One may also wish to compute all SPIs, or – in line with multicriteria optimization [14, 58] – all SPIs that cannot in turn be safely Pareto-improved upon. However, in general there may exist exponentially many such SPIs. To retain any hope of developing an efficient algorithm, one would therefore have to first develop a more efficient representation scheme [cf. 42, Sect. 16.4].
Theorem 9. The (strict) (unilateral) SPI decision problem is NP-complete, even for 2-player games.
Proposition 10. For games with that can be reduced (via iterative application of Assumption 1) to a game with , the (strict) (unilateral) SPI decision problem can be solved in .
The full proof is tedious (see Appendix D), but the main idea is simple, especially for omnilateral SPIs. To find an omnilateral SPI on based on Assumptions 1 and 2, one has to first iteratively remove all strictly dominated actions to obtain a reduced game , which the representatives would play the same as the original game. This can be done in polynomial time. One then has to map the actions onto the original in such a way that each outcome in is mapped onto a weakly Pareto-better outcome in . Our proof of NP-hardness works by reducing from the subgraph isomorphism problem, where the payoff matrices of and represent the adjacency matrices of the graphs.
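The second, mapping step can be made concrete as an exponential brute-force search, which also mirrors why the problem is NP-hard. This is a sketch; the function name and the restriction to two players are our own simplifications:

```python
from itertools import permutations
import numpy as np

def find_spi_mapping(u1r, u2r, u1, u2):
    """Search for injective relabelings (f1, f2) of the reduced game's
    actions into the original game's actions such that every reduced
    outcome (i, j) maps onto a weakly Pareto-better original outcome
    (f1[i], f2[j]), with a strict improvement somewhere.

    u1r, u2r: payoff matrices of the reduced game; u1, u2: of the
    original game. Exponential brute force, in line with NP-hardness.
    """
    mr, nr = u1r.shape
    m, n = u1.shape
    for f1 in permutations(range(m), mr):
        for f2 in permutations(range(n), nr):
            weakly_better = all(
                u1[f1[i], f2[j]] >= u1r[i, j] and u2[f1[i], f2[j]] >= u2r[i, j]
                for i in range(mr) for j in range(nr))
            strict_somewhere = any(
                u1[f1[i], f2[j]] > u1r[i, j] or u2[f1[i], f2[j]] > u2r[i, j]
                for i in range(mr) for j in range(nr))
            if weakly_better and strict_somewhere:
                return f1, f2
    return None
```

On the Prisoner's Dilemma, for instance, the reduced game is the single mutual-defection outcome, and the search relabels it onto mutual cooperation, recovering the SPI of Proposition 5.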
Besides being about a specific set of assumptions about , note that Theorem 9 and Proposition 10 also assume that the utility function of the game is represented explicitly in normal form as a payoff matrix. If we changed the game representation (e.g., to boolean circuits, extensive-form game trees, quantified boolean formulas, or even Turing machines), this can affect the complexity of the SPI problem. For example, Gabarró, García, and Serna [16] show that the game isomorphism problem on normal-form games is equivalent to the graph isomorphism problem, while it is equivalent to the (likely computationally harder) boolean circuit isomorphism problem for a weighted boolean formula game representation. Solving the SPI problem requires solving a subset game isomorphism problem (see the proof of Lemma 28 in Appendix D for more detail). We therefore suspect that the SPI problem analogously increases in computational complexity (perhaps to being complete) if we treat games in a weighted boolean formula representation. In fact, even reducing a game using strict dominance by pure strategies – which contributes only insignificantly to the complexity of the SPI problem for normal-form games – is difficult in some game representations [10, Section 6]. Note, however, that for any game representation to which 2-player normal-form games can be efficiently reduced – such as, for example, extensive-form games – the hardness result also applies.
In this section, we imagine that the players are able to simply invent new token strategies with new payoffs that arise from mixing existing feasible payoffs. To define this formally, we first define for any game ,
to be the set of payoff vectors that are feasible by some correlated strategy. The underlying notion of correlated strategies is the same as in correlated equilibrium [2, 3], but in this paper it will not be relevant whether any such strategy is a correlated equilibrium of . Instead, their use will hinge on commitments [cf. 34]. Note that is exactly the convex closure of , i.e., the convex closure of the set of deterministically achievable utilities of the original game.
For any game , we then imagine that in addition to subset games, the players can let the representatives play a perfect-coordination token game , where for all , and are arbitrary utility functions to be used by the representatives and are the utilities that the original players assign to the token strategies.
The instruction lets the representatives play the game as usual. However, the strategies are imagined to be meaningless token strategies which do not resolve the given game . Once some token strategies are selected, these are translated into some probability distribution over , i.e., into a correlated strategy of the original game. This correlated strategy is then played by the original players, thus giving rise to (expected) utilities . These distributions and thus utilities are specified by the original players.
Definition 6. Let be a game. A perfect-coordination SPI for is a perfect-coordination token game for s.t. with certainty. We call a strict perfect-coordination SPI if there furthermore is a player for whom with positive probability.
As an example, imagine that is just the subset game of the Demand Game of Table 1. Then, intuitively, an SPI under improved coordination could consist of the original players telling the representatives, “Play as if you were playing the subset game of the Demand Game, but whenever you find yourself playing , randomize [according to some given distribution] between the other (Pareto-optimal) outcomes instead”. Formally, and would then consist of tokenized versions of the original strategies. The utility functions and are then simply the same as in the original Demand Game except that they are applied to the token strategies. For example, . The utilities for the original players remove the conflict outcome. For example, the original players might specify , representing that the representatives are supposed to play in the case. For all other outcomes , it must be the case that because the other outcomes cannot be Pareto-improved upon. As with our earlier SPIs for the Demand Game, Assumption 2 implies that , where maps the original conflict outcome onto the Pareto-optimal (,).
Relative to the SPIs considered up until now, these new types of instructions put significant additional requirements on how the representatives interact. They now have to engage in a two-round process of first choosing and observing one another's token strategies and then playing a correlated strategy for the original game. Further, it must be the case that this additional coordination does not affect the payoffs of the original outcomes. The latter may not be the case in, e.g., the Game of Chicken. That is, we could imagine a Game of Chicken in which coordination is possible but the rewards of the game change if the players do coordinate. After all, the underlying story in the Game of Chicken is that the positive reward – admiration from peers – is attained precisely for accepting a grave risk.
With these more powerful ways to instruct representatives, we can now replace individual outcomes of the default game ad libitum. For example, in the reduced Demand Game, we singled out the outcome as Pareto-suboptimal and replaced it by a Pareto-optimal outcome, while keeping all other outcomes the same. This allows us to construct SPIs in many more games than before.
Definition 7. The strict full-coordination SPI decision problem consists in deciding for any given whether under Assumption 2 there is a perfect-coordination SPI for .
Lemma 11. For a given player game and payoff vector , it can be decided by linear programming and thus in polynomial time whether is Pareto-optimal in .
For an introduction to linear programming, see, e.g., Schrijver [50]. In short, a linear program is a specific type of constrained optimization problem that can be solved efficiently.
Proof. Finding a Pareto improvement on a given can be formulated as the following linear program:
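A sketch of this linear program using `scipy.optimize.linprog`. The concrete formulation is our reconstruction: maximize total expected utility over distributions p on the game's outcomes, subject to each player receiving at least v_i; then v is Pareto-optimal in the feasible set iff the optimum does not exceed the sum of the v_i:

```python
import numpy as np
from scipy.optimize import linprog

def is_pareto_optimal(payoffs, v, tol=1e-9):
    """Decide whether the feasible payoff vector v is Pareto-optimal
    within the convex hull of the rows of `payoffs` (one row per pure
    outcome, one column per player).

    Solves: max_p sum_i E_p[u_i]  s.t.  E_p[u_i] >= v_i for all i,
    with p a probability distribution over outcomes. A strictly larger
    optimum means some mixture weakly dominates v with a strict gain
    for at least one player.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    m, n = payoffs.shape
    c = -payoffs.sum(axis=1)           # linprog minimizes, so negate
    A_ub = -payoffs.T                  # E_p[u_i] >= v_i  <=>  -(u_i . p) <= -v_i
    b_ub = -np.asarray(v, dtype=float)
    A_eq = np.ones((1, m))             # p sums to 1
    b_eq = [1.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * m)
    assert res.success, "v must be feasible, i.e., in the convex hull"
    return -res.fun <= float(np.sum(v)) + tol
```

Algorithm 1 can then loop this check over the outcomes of the game and report a strict perfect-coordination SPI iff some outcome fails it.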
Based on Lemma 11, Algorithm 1 decides whether there is a strict perfectcoordination SPI for a given game .
It is easy to see that this algorithm runs in polynomial time (in the size of, e.g., the normal-form representation of the game). It is also correct: if it returns True, simply replace the Pareto-suboptimal outcome while keeping all other outcomes the same; if it returns False, then all outcomes are Pareto-optimal within and so there can be no strict SPI. We summarize this result in the following proposition.
Proposition 12. Assuming is known and that Assumption 2 holds, it can be decided in polynomial time whether there is a strict perfect-coordination SPI.
From the problem of deciding whether there are strict SPIs under improved coordination at all, we move on to the question of what different perfect-coordination SPIs there are. In particular, one might ask what the cost is of only considering safe Pareto improvements relative to acting on a probability distribution over and the resulting expected utilities . We start with a lemma that directly provides a characterization. So far, all the considered perfect-coordination SPIs for a game have consisted in letting the representatives play a game that is isomorphic to the original game, but Pareto-improves (from the original players' perspectives, i.e., ) at least one of the outcomes. It turns out that we can restrict attention to this very simple type of SPI under improved coordination.
Lemma 13. Let be any game. Let be a perfect-coordination SPI on . Then we can define with values in such that under Assumption 2 the game
is also an SPI on , with
for all and consequently .
Proof. First note that is isomorphic to . Thus by Assumption 2, there is an isomorphism s.t. . WLOG assume that simply maps . Then define as follows:
Here describes the utilities that the original players assign to the outcomes of . Since maps onto and is convex, as defined also maps into as required. Note that for all it is by assumption with certainty. Hence,
as required.
Because of this result, we will focus on these particular types of SPIs, which simply create an isomorphic game with different (Pareto-better) utilities. Note, however, that without assigning exact probabilities to the distributions of , the original players will in general not be able to construct a that satisfies the expected payoff equalities. For this reason, one could still conceive of situations in which a different type of SPI would be chosen by the original players and the original players are unable to instead choose an SPI of the type described in Lemma 13.
Lemma 13 directly implies a characterization of the expected utilities that can be achieved with perfect-coordination SPIs. Of course, this characterization depends on the exact distribution of . We omit the statement of this result. However, we state the following implication.
Corollary 14. Under Assumption 2, the set of Pareto improvements that are safely achievable with perfect coordination
is a convex polygon.
Because of this result, one can also efficiently optimize convex functions over the set of perfect-coordination SPIs. Even without referring to the distribution , many interesting questions can be answered efficiently. For example, we can efficiently identify the perfect-coordination SPI that maximizes the minimum improvement across players and outcomes .
In the following, we aim to use Lemma 13 and Corollary 14 to give maximally strong positive results about what Pareto improvements can be safely achieved, without referring to exact probabilities over . To keep things simple, we will do this only for the case of two players. To state our results, we first need some notation: We use
to denote the Pareto frontier of a convex polygon (or, more generally, a convex, closed set). For any real number , we use to denote the which maximizes under the constraint . (Recall that we consider 2-player games, so is a single real number.) Note that such a exists if and only if is 's utility in some feasible payoff vector. We first state our result formally. Afterwards, we will give a graphical explanation of the result, which we believe is easier to understand.
Theorem 15. Make Assumption 2. Let be a two-player game. Let be some potentially unsafe Pareto improvement on . For , let . Then:
A) If there is some element in which Pareto-dominates all of and if is Pareto-dominated by an element of at least one of the following three sets:
Then there is an SPI under improved coordination such that .
B) If there is no element in which Pareto-dominates all of and if is Pareto-dominated by an element of each of and as defined above, then there is a perfect-coordination SPI such that .
We now illustrate the result graphically. We start with Case A, which is illustrated in Figure 2. The Pareto frontier is the solid line in the north and east. The points marked x indicate outcomes in . The point marked by a filled circle indicates the expected value of the default equilibrium . The vertical dashed lines starting at the two extreme x marks illustrate the application of to project onto the Pareto frontier. The dotted line between these two points is . Similarly, the horizontal dashed lines starting at x marks illustrate the application of to project onto the Pareto frontier. The line segment between these two points is . In this case, this line segment lies on the Pareto frontier. The set is simply the part of the Pareto frontier which Pareto-dominates all elements of , i.e., the part of the Pareto frontier to the northeast between the two intersections with the northern horizontal dashed line and the eastern vertical dashed line. The theorem states that for some to be a Pareto improvement, it must be in the gray area.
Case B of Theorem 15 is depicted in Figure 3. Note that here the two line segments and intersect. To ensure that a Pareto improvement is safely achievable, the theorem requires that it lie below both of these lines, as indicated again by the gray area.
For a full proof, see Appendix E. Roughly, Theorem 15 is proven by remapping each of the outcomes of the original game as per Lemma 13. For example, the projection of the default equilibrium (i.e., the filled circle) onto is obtained as an SPI by projecting all the outcomes (i.e., all the x marks) onto . In Case A, any utility vector that Pareto-improves on all outcomes of the original game can be obtained by remapping all outcomes onto . Other kinds of are handled similarly.
As a corollary of Theorem 15, we can see that all (potentially unsafe) Pareto improvements in the subset game of the Demand Game of Table 1 are equivalent to some perfect-coordination SPI. However, this is not always the case:
Proposition 16. There is a game , representatives that satisfy Assumptions 1 and 2, and an outcome s.t. for all players , but there is no perfect-coordination SPI s.t. for all players , .
As an example of such a game, consider the game in Table 7. Strategy can be eliminated by strict dominance (Assumption 1) for both players, leaving a typical Chicken-like payoff structure with two pure Nash equilibria ( and ), as well as a mixed Nash equilibrium .
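The elimination step can be sketched as follows. The function and the 3×3 payoff matrices are our own illustration (Table 7's actual payoffs were lost in extraction), and as a simplification it only checks domination by pure strategies, whereas Assumption 1 in the paper also covers mixed dominators:

```python
def eliminate_strictly_dominated(U1, U2):
    """Iterated elimination of pure strategies that are strictly dominated
    by another *pure* strategy. U1/U2 are the row/column player's payoff
    matrices; returns the surviving row and column indices."""
    rows, cols = list(range(len(U1))), list(range(len(U1[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):  # remove rows strictly dominated by another row
            if any(all(U1[r2][c] > U1[r][c] for c in cols)
                   for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
        for c in list(cols):  # remove columns strictly dominated for player 2
            if any(all(U2[r][c2] > U2[r][c] for r in rows)
                   for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
    return rows, cols

# Hypothetical 3x3 game: actions 0 and 1 form a Chicken-like core; action 2
# is strictly dominated by action 1 for both players.
U1 = [[0, 4, 4], [1, 3, 3], [-1, -1, -1]]               # row player
U2 = [[U1[c][r] for c in range(3)] for r in range(3)]   # symmetric game
reduced = eliminate_strictly_dominated(U1, U2)          # -> ([0, 1], [0, 1])
```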
Now let us say that in the resulting game for some with . Then one (unsafe) Pareto improvement would be to simply always have the representatives play for a certain payoff of . Unfortunately, there is no safe Pareto improvement with the same expected payoff. Notice that is the unique element of that maximizes the sum of the two players' utilities. By linearity of expectation and convexity of , if for any it is , it must be with certainty. Unfortunately, in any safe Pareto improvement the outcomes and must correspond to outcomes that still give utilities of and , respectively, because these are Pareto-optimal within the set of feasible payoff vectors. We illustrate this as an example of Case B of Theorem 15 in Figure 4.
In the Demand Game, there happens to be a single nontrivial SPI. However, in general (even without the type of coordination assumed in Section 5), there may be multiple SPIs that result in different payoffs for the players. For example, imagine an extension of the Demand Game in which both players have an additional action , which is like , except that under , Aliceland can peacefully annex the desert. Aliceland prefers this SPI over the original one, while Bobbesia has the opposite preference. In other cases, it may be unclear to some or all of the players which of two SPIs they prefer. For example, in a version of the Demand Game in which one SPI mostly improves on and another mostly improves on the other three outcomes, the outcome probabilities are required to compare the two. If multiple SPIs are available, the original players are left with the difficult decision of which SPI to demand in their instructions.^{9}
This difficulty of choosing which SPI to demand cannot be denied. However, we would like to emphasize that players can profit from the use of SPIs even without addressing this SPI selection problem. To do so, a player picks an instruction that is very compliant ("doveish") w.r.t. which SPI is chosen, e.g., one that simply goes along with whatever SPI the other players demand, as long as that SPI cannot be further safely Pareto-improved upon.^{10} In many cases, all such SPIs benefit all players. For example, optimal SPIs in bargaining scenarios like the Demand Game remove the conflict outcome, which benefits all parties. Thus, a player can expect a safe improvement even under such maximally compliant demands on the selected SPI.
In some cases there may also be natural choices of demands (à la Schelling [48, pp. 54–58], or focal points). If the underlying game is symmetric, a symmetric safe Pareto improvement may be a natural choice. For example, the fully reduced version of the Demand Game of Table 1 is symmetric. Hence, we might expect that even if multiple SPIs were available, the original players would choose a symmetric one.
Safe Pareto improvements are a promising new idea for delegating strategic decision making. To conclude this paper, we discuss some ideas for further research on SPIs.
Straightforward technical questions arise in the context of the complexity results of Section 4.6. First, what impact does varying the assumptions have on the complexity? Our NP-completeness proof is easy to generalize to at least some other types of assumptions. It would be interesting to give a generic version of the result. We also wonder whether there are plausible assumptions under which the complexity changes in interesting ways. Second, one could ask how the complexity changes if we use more sophisticated game representations (see the remarks at the end of that section). Third, one could impose additional restrictions on the sought SPI. Fourth, we could restrict the games under consideration. Are there games in which it becomes easy to decide whether there is an SPI?
It would also be interesting to see which real-world situations can already be interpreted as utilizing SPIs, or could be Pareto-improved upon using SPIs.
This work was supported by the National Science Foundation under Award IIS-1814056. Some early work on this topic was conducted by Caspar Oesterheld while working at the Foundational Research Institute (now the Center on Long-Term Risk). For valuable comments and discussions, we are grateful to Keerti Anand, Tobias Baumann, Jesse Clifton, Max Daniel, Lukas Gloor, Adrian Hutter, Vojtěch Kovařík, Anni Leskelä, Brian Tomasik, Johannes Treutlein, and our wonderful anonymous referees. We also thank attendees of a 2017 talk at the Future of Humanity Institute at the University of Oxford, a talk at the May 2019 Effective Altruism Foundation research retreat, and our talk at AAMAS 2021.
[1] Krzysztof R. Apt. “Uniform Proofs of Order Independence for Various Strategy Elimination Procedures”. In: The B.E. Journal of Theoretical Economics 4.1 (2004), pp. 1–48. DOI: 10.2202/1534-5971.1141.
[2] Robert J. Aumann. “Correlated Equilibrium as an Expression of Bayesian Rationality”. In: Econometrica 55.1 (Jan. 1987), pp. 1–18. DOI: 10.2307/1911154.
[3] Robert J. Aumann. “Subjectivity and Correlation in Randomized Strategies”. In: Journal of Mathematical Economics 1.1 (Mar. 1974), pp. 67–97. DOI: 10.1016/0304-4068(74)90037-8.
[4] Robert Axelrod. The Evolution of Cooperation. New York: Basic Books, 1984.
[5] Mihaly Barasz et al. Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic. Jan. 2014. URL: https://arxiv.org/abs/1401.5577.
[6] Ken Binmore. Game Theory – A Very Short Introduction. Oxford University Press, 2007.
[7] Tilman Börgers. “Pure Strategy Dominance”. In: Econometrica 61.2 (Mar. 1993), pp. 423–430.
[8] Vitalik Buterin. Ethereum White Paper – A Next Generation Smart Contract & Decentralized Application Platform. Updated version available at https://github.com/ethereum/wiki/wiki/White-Paper. 2014. URL: https://cryptorating.eu/whitepapers/Ethereum/Ethereum_white_paper.pdf.
[9] Andrew M. Colman. “Salience and focusing in pure coordination games”. In: Journal of Economic Methodology 4.1 (1997), pp. 61–81. DOI: 10.1080/13501789700000004.
[10] Vincent Conitzer and Tuomas Sandholm. “Complexity of (Iterated) Dominance”. In: Proceedings of the 6th ACM conference on Electronic commerce. Vancouver, Canada: Association for Computing Machinery, June 2005, pp. 88–97. DOI: 10.1145/1064009.1064019.
[11] Vincent Conitzer and Tuomas Sandholm. “Computing the Optimal Strategy to Commit to”. In: Proceedings of the ACM Conference on Electronic Commerce (EC). Ann Arbor, MI, USA: Association for Computing Machinery, 2006, pp. 82–90.
[12] Stephen A. Cook. “The complexity of theorem-proving procedures”. In: STOC ’71: Proceedings of the third annual ACM symposium on Theory of computing. New York: Association for Computing Machinery, May 1971, pp. 151–158. DOI: 10.1145/800157.805047.
[13] Andrew Critch. “A Parametric, Resource-Bounded Generalization of Löb’s Theorem, and a Robust Cooperation Criterion for Open-Source Game Theory”. In: Journal of Symbolic Logic 84.4 (Dec. 2019), pp. 1368–1381. DOI: 10.1017/jsl.2017.42.
[14] Matthias Ehrgott. Multicriteria Optimization. 2nd ed. Berlin: Springer, 2005.
[15] Lance Fortnow. “Program equilibria and discounted computation time”. In: Proceedings of the 12th Conference on Theoretical Aspects of Rationality and Knowledge (TARK ’09). July 2009, pp. 128–133. DOI: 10.1145/1562814.1562833.
[16] Joaquim Gabarró, Alina García, and Maria Serna. “The complexity of game isomorphism”. In: Theoretical Computer Science 412.48 (Nov. 2011), pp. 6675–6695. DOI: 10.1016/j.tcs.2011.07.022.
[17] David Gale. “A Theory of NPerson Games with Perfect Information”. In: Proceedings of the National Academy of Sciences of the United States of America 39.6 (June 1953), pp. 496–501. DOI: 10.1073/pnas.39.6.496.
[18] David Gauthier. “Coordination”. In: Dialogue 14.2 (June 1975), pp. 195–221. DOI: 10.1017/S0012217300043365.
[19] Itzhak Gilboa, Ehud Kalai, and Eitan Zemel. “On the order of eliminating dominated strategies”. In: Operations Research Letters 9.2 (Mar. 1990), pp. 85–89. DOI: 10.1016/0167-6377(90)90046-8.
[20] John C. Harsanyi and Reinhard Selten. A General Theory of Equilibrium Selection in Games. Cambridge, MA: The MIT Press, 1988.
[21] Bengt Robert Holmström. “On Incentives and Control in Organizations”. PhD thesis. Stanford University, Dec. 1977.
[22] J. V. Howard. “Cooperation in the Prisoner’s Dilemma”. In: Theory and Decision 24 (May 1988), pp. 203–213. DOI: 10.1007/BF00148954.
[23] Leonid Hurwicz and Leonard Shapiro. In: The Bell Journal of Economics 9.1 (1978), pp. 180–191. DOI: 10.2307/3003619.
[24] Adam Tauman Kalai et al. “A commitment folk theorem”. In: Games and Economic Behavior 69 (2010), pp. 127–137. DOI: 10.1016/j.geb.2009.09.008.
[25] Jon Kleinberg and Robert Kleinberg. “Delegated Search Approximates Efficient Search”. In: Proceedings of the 19th ACM Conference on Economics and Computation (EC). 2018.
[26] Frank H. Knight. Risk, Uncertainty, and Profit. Boston, MA, USA: Houghton Mifflin Company, 1921.
[27] Elon Kohlberg and Jean-François Mertens. “On the Strategic Stability of Equilibria”. In: Econometrica 54.5 (Sept. 1986), pp. 1003–1037. DOI: 10.2307/1912320.
[28] Jean-Jacques Laffont and David Martimort. The Theory of Incentives – The Principal-Agent Model. Princeton, NJ: Princeton University Press, 2002.
[29] Richard A. Lambert. “Executive Effort and Selection of Risky Projects”. In: RAND J. Econ. 17.1 (1986), pp. 77–88.
[30] David Lewis. Convention. Harvard University Press, 1969.
[31] R. Duncan Luce and Howard Raiffa. Games and Decisions. Introduction and Critical Survey. New York: Dover Publications, 1957.
[32] Leslie M. Marx and Jeroen M. Swinkels. “Order Independence for Iterated Weak Dominance”. In: Games and Economic Behavior 18 (1997), pp. 219–245. DOI: 10.1006/game.1997.0525.
[33] R. Preston McAfee. “Effective Computability in Economic Decisions”. May 1984. URL: https://www.mcafee.cc/Papers/PDF/EffectiveComputability.pdf.
[34] Dov Monderer and Moshe Tennenholtz. “Strong mediated equilibrium”. In: Artificial Intelligence 173.1 (Jan. 2009), pp. 180–195. DOI: 10.1016/j.artint.2008.10.005.
[35] John von Neumann. “Zur Theorie der Gesellschaftsspiele”. In: Mathematische Annalen 100 (1928), pp. 295–320. DOI: 10.1007/BF01448847.
[36] Caspar Oesterheld. “Robust Program Equilibrium”. In: Theory and Decision 86.1 (Feb. 2019), pp. 143–159.
[37] Caspar Oesterheld and Vincent Conitzer. “Minimumregret contracts for principalexpert problems”. In: Proceedings of the 16th Conference on Web and Internet Economics (WINE). 2020.
[38] Hessel Oosterbeek, Randolph Sloof, and Gijs van de Kuilen. “Cultural Differences in Ultimatum Game Experiments: Evidence from a MetaAnalysis”. In: Experimental Economics 7 (June 2004), pp. 171–188. DOI: 10.1023/B:EXEC.0000026978.14316.74.
[39] Martin J. Osborne. An Introduction to Game Theory. New York: Oxford University Press, 2004.
[40] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
[41] David G. Pearce. “Rationalizable Strategic Behavior and the Problem of Perfection”. In: Econometrica 52.4 (July 1984), pp. 1029–1050.
[42] Guillaume Perez. “Decision diagrams: constraints and algorithms”. PhD thesis. Université Côte d’Azur, 2017. URL: https://tel.archives-ouvertes.fr/tel-01677857/document.
[43] Martin Peterson. An Introduction to Decision Theory. Cambridge University Press, 2009.
[44] Steven Pinker. How the Mind Works. W. W. Norton, 1997.
[45] Werner Raub. “A General GameTheoretic Model of Preference Adaptions in Problematic Social Situations”. In: Rationality and Society 2.1 (Jan. 1990), pp. 67–93.
[46] Ariel Rubinstein. Modeling Bounded Rationality. Ed. by Karl Gunnar Persson. Zeuthen Lecture Book Series. The MIT Press, 1998.
[47] Alexander Savelyev. “Contract law 2.0: ‘Smart’ contracts as the beginning of the end of classic contract law”. In: Information & Communications Technology Law 26.2 (2017), pp. 116–134. DOI: 10.1080/13600834.2017.1301036.
[48] Thomas C. Schelling. The Strategy of Conflict. Cambridge, MA: Harvard University Press, 1960.
[49] Thomas C. Schelling. “The Strategy of Conflict: Prospectus for a Reorientation of Game Theory”. In: The Journal of Conflict Resolution 2.3 (Sept. 1958), pp. 203–264.
[50] Alexander Schrijver. Theory of Linear and Integer Programming. Chichester, UK: John Wiley & Sons, 1998.
[51] Amartya Sen. “Choice, orderings and morality”. In: Practical Reason. Ed. by Stephan Körner. New Haven, CT, USA: Basil Blackwell, 1974. Chap. II, pp. 54–67.
[52] Heinrich von Stackelberg. “Marktform und Gleichgewicht”. In: Vienna: Springer, 1934, pp. 58–70.
[53] Neal M. Stoughton. “Moral Hazard and the Portfolio Management Problem”. In: The Journal of Finance 48.5 (Dec. 1993), pp. 2009–2028. DOI: 10.1111/j.1540-6261.1993.tb05140.x.
[54] Robert Sugden. In: The Economic Journal 105.430 (May 1995), pp. 533–550. DOI: 10.2307/2235016.
[55] Moshe Tennenholtz. “Program equilibrium”. In: Games and Economic Behavior 49.2 (Nov. 2004), pp. 363–373.
[56] Johannes Treutlein et al. “A New Formalism, Method and Open Issues for Zero-Shot Coordination”. In: Proceedings of the Thirty-eighth International Conference on Machine Learning (ICML’21). 2021.
[57] Wiebe van der Hoek, Cees Witteveen, and Michael Wooldridge. “Program equilibrium – a program reasoning approach”. In: International Journal of Game Theory 42.3 (Aug. 2013), pp. 639–671.
[58] Luc N. van Wassenhove and Ludo F. Gelders. “Solving a bicriterion scheduling problem”. In: European Journal of Operational Research 4 (1980), pp. 42–48.
[59] Bernhard von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. Tech. rep. LSE-CDAM-2004-01. London School of Economics, 2004. URL: http://www.cdam.lse.ac.uk/Reports/Files/cdam-2004-01.pdf.
This paper considers the meta-game of delegation. SPIs are a proposed way of playing these games. However, throughout most of this paper, we do not analyze the meta-game directly as a game, using the typical tools of game theory. We here fill that gap and in particular prove Theorem 1, which shows that SPIs are played in Nash equilibria of the meta-game, assuming sufficiently strong contracting abilities. As noted, this result is essential. However, since it is mostly an application of existing ideas from the literature on program equilibrium, we left a detailed treatment out of the main text.
A program game for is defined via a set and a nondeterministic mapping . We obtain a new game with action sets and utility function
Though this definition is generic, one generally imagines in the program equilibrium literature that for all , consists of computer programs in some programming language, such as Lisp, that take as input vectors in and return an action . The function on input then executes each player 's program on to assign an action. The definition implicitly assumes that only contains programs that halt when fed one another as input (or that not halting is mapped onto some action). As is usually done in the program equilibrium literature, we will leave unspecified what constraints are used to ensure this. A program equilibrium is then simply a Nash equilibrium of the program game.
For the present paper, we add the following feature to the underlying programming language. A program can call a “black box subroutine” for any subset game of , where is a random variable over and .
We need one more definition. For any game and player , we define Player 's threat point (a.k.a. minimax utility) as
In words, is the minimum utility that the players other than can force onto , under the assumption that reacts optimally to their strategy. We will further use to denote the strategy for Player that is played in the minimizer of the above. Of course, in general, there might be multiple minimizers . In the following, we will assume that the function breaks such ties in some consistent way, such that for all ,
Note that for , each player's threat point is computable in polynomial time via linear programming; and that by the minimax theorem [35], the threat point is equal to the maximin utility, i.e.,
so is also the minimum utility that Player can guarantee for herself under the assumption that the opponent sees her mixed strategy and reacts in order to minimize Player 's utility.
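As a concrete illustration of the linear-programming route, the sketch below (our own, using SciPy; the 2×2 payoff matrices in the test are illustrative and not from the paper) computes the row player's maximin value, which by the minimax theorem equals her threat point:

```python
import numpy as np
from scipy.optimize import linprog

def threat_point(U):
    """Maximin utility of the row player for an m x n payoff matrix U.

    Variables: a mixed strategy x over rows plus a guaranteed value v.
    Maximize v subject to (x^T U)_j >= v for every opponent pure strategy j,
    sum(x) = 1, x >= 0. By the minimax theorem [35], this maximin value
    equals the row player's threat point.
    """
    U = np.asarray(U, dtype=float)
    m, n = U.shape
    # linprog minimizes, so minimize -v; decision vector is [x_1..x_m, v].
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraint rows: v - (x^T U)_j <= 0 for each column j.
    A_ub = np.hstack([-U.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum(x) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]              # v is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]
```

For a Prisoner's-Dilemma-style matrix such as [[3, 0], [5, 1]], the value returned is the mutual-defection payoff, matching the intuition that the opponent's threat is to defect.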
Tennenholtz’s [55] main result on program games is the following:
Theorem 17 (Tennenholtz 2004 [55]). Let be a game and let be a (feasible) payoff vector. If for , then is the utility of some program equilibrium of a program game on
Throughout the rest of this section, our goal is to use ideas similar to those Tennenholtz used for Theorem 17 to construct, for any SPI on , a program equilibrium that results in the play of . As noted in the main text, Player 's instruction to her representative to play the game will usually be conditional on the other player telling her representative to also play her part of , and vice versa. After all, if Player simply tells her representative to maximize from regardless of Player 's instruction, then Player will often be able to profit from deviating from the instruction. For example, in the safe Pareto improvement on the Demand Game, each player would only want her representative to choose from rather than if the other player's representative does the same. It would then seem that in a program equilibrium in which is played, each program would have to contain a condition of the type “if the opponent's code plays as in against me, I also play as I would in ”. But in a naive implementation of this, each of the programs would have to call the other, leading to an infinite recursion.
In the literature on program equilibrium, various solutions to this problem have been proposed. We here use the general scheme proposed by Tennenholtz [55], because it is the simplest. We could similarly use the variant proposed by Fortnow [15], techniques based on Löb's theorem [5, 13], grounded mutual simulation [36], or even (meta) Assurance Game preferences (see Appendix B).
In our equilibrium, we let each player submit code as sketched in Algorithm 2. Roughly, each player uses a program that says: “if everyone else submitted the same source code as this one, then play . Otherwise, if there is a player who submitted a different source code, punish player by playing her strategy .” Note that for convenience, Algorithm 2 receives the player number as input. This way, every player can use the exact same source code. Otherwise, the original players would have to provide slightly different programs, and in line 2 of the algorithm we would have to use a more complicated comparison, roughly: “if are the same except for the player index used”.
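A minimal toy version of this construction might look as follows. Here “source code” is modeled as a plain string, the game is a two-player Prisoner's-Dilemma-style interaction, and the SPI and punishment actions are passed in explicitly; all names are our own sketch, not the paper's Algorithm 2 verbatim:

```python
def make_algorithm2(spi_action, punish_action):
    """Toy two-player version of the Algorithm 2 scheme.

    spi_action[i]    -- player i's part of the agreed SPI outcome
    punish_action[i] -- the action player i uses to push the *other*
                        player down to her threat point
    """
    def program(i, sources):
        # If the opponent submitted identical source code, play the SPI...
        if sources[i] == sources[1 - i]:
            return spi_action[i]
        # ...otherwise punish the deviating opponent.
        return punish_action[i]
    return program

# Both players submit the same source string, so the SPI outcome is played.
prog = make_algorithm2(spi_action=("C", "C"), punish_action=("D", "D"))
sources = ("ALG2", "ALG2")
play = tuple(prog(i, sources) for i in range(2))   # -> ("C", "C")
```

Comparing source strings for equality sidesteps the infinite regress of mutual simulation: neither program ever needs to run the other.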
Proposition 18. Let be a game and let be an SPI on . Let be the program profile consisting only of Algorithm 2 for each player. Assume that guarantees each player at least threat point utility in expectation. Then is a program equilibrium and .
Proof. By inspection of Algorithm 2, we see that . It remains to show that is a Nash equilibrium. So let be any player and . We need to show that . Again by inspection of , is the threat point of Player . Hence,
as required.
Theorem 1 follows immediately.
We here discuss in some detail Raub’s [45] paper, which in turn elaborates on an idea by Sen [51]. Superficially, Raub’s setting seems somewhat similar to ours, but we argue here that it should be thought of as closer to the work on program equilibrium and bilateral precommitment. In Sections 1, 3 and 3.2, we briefly discuss multilateral commitment games, which have been discussed before in various forms in the game-theoretic literature. Our paper extends this setting by allowing instructions that let the representatives play a game without specifying an algorithm for solving that game. At first sight, it appears that Raub pursues a very similar idea. Translated to our setting, Raub allows each player to choose, as an instruction, a new utility function , where is the set of outcomes of the original game . Given instructions , the representatives then play the game . In particular, each representative can see what utility functions all the other representatives have been instructed to maximize. However, what utility function representative maximizes is not conditional on any of the instructions by the other players. In other words, the instructions in Raub's paper are raw utility functions without any surrounding control structures, etc. Raub then asks for equilibria of the meta-game that Pareto-improve on the default outcome.
To better understand how Raub's approach relates to ours, we here give an example of the kind of instructions Raub has in mind. (Raub uses the same example in his paper.) As the underlying game , we take the Prisoner's Dilemma. Now the main idea of his paper is that the original players can instruct their representatives to adopt so-called Assurance Game preferences. In the Prisoner's Dilemma, this means that the representatives prefer to cooperate if the other representative cooperates, and prefer to defect if the other representative defects. Further, they prefer mutual cooperation over mutual defection. An example of such Assurance Game preferences is given in Table 8. (Note that this payoff matrix resembles the classic Stag Hunt studied in game theory.)
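One can check this change of equilibrium structure directly. The payoff numbers below are illustrative stand-ins (not the entries of Table 8), chosen so that the second game has the Stag-Hunt pattern just described:

```python
# Payoff dictionaries: (row action, column action) -> (u1, u2).
PD        = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
             ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
# Hypothetical Assurance Game preferences: mutual cooperation is now best.
ASSURANCE = {("C", "C"): (4, 4), ("C", "D"): (0, 3),
             ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

def pure_nash(game):
    """Brute-force the pure Nash equilibria of a 2x2 game."""
    acts = ["C", "D"]
    eqs = []
    for a in acts:
        for b in acts:
            u1, u2 = game[(a, b)]
            if (all(game[(a2, b)][0] <= u1 for a2 in acts)
                    and all(game[(a, b2)][1] <= u2 for b2 in acts)):
                eqs.append((a, b))
    return eqs
```

In the Prisoner's Dilemma only mutual defection survives, while the Assurance-style game has both mutual cooperation and mutual defection as pure equilibria, exactly the Stag-Hunt structure.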
The Assurance Game preferences have two important properties.
The first important difference between Raub's approach and ours relates to item 2. We have ignored the issue of making SPIs Nash equilibria of our meta-game. As we have explained in Section 3.2 and Appendix A, we imagine that this is taken care of by additional bilateral commitment mechanisms that are not the focus of this paper. For Raub's paper, on the other hand, ensuring that mutual cooperation is stable in the new game is arguably the key idea. Still, we could pursue the approach of the present paper even if we limited instructions to those that consist only of a utility function.
The second difference is even more important. Raub assumes that – as in the PD – the default outcome of the game ( in the formalism of this paper) is known. (Less significantly, he also assumes that it is known how the representatives play under Assurance Game preferences.) Of course, the key feature of the setting of this paper is that the underlying game might be difficult (through equilibrium selection problems) and thus that the original players might be unable to predict .
These are the reasons why we cite Raub in our section on bilateral commitment mechanisms. Arguably, Raub's paper could be seen as very early work on program equilibrium, except that he uses utility functions as a programming language for representatives. In this sense, Raub's Assurance Game preferences are analogous to the program equilibrium schemes of Tennenholtz [55], Oesterheld [36], Barasz et al. [5], and van der Hoek, Witteveen, and Wooldridge [57], listed in increasing order of similarity to the main idea of his scheme.
Lemma 4. Let and be isomorphisms between and . If is (strictly) Pareto-improving, then so is .
Proof.
First, we argue that if and are isomorphisms, then they are isomorphisms relative to the same constants and . For each player , we distinguish two cases. First, the case where all outcomes in have the same utility for Player is trivial. Now imagine that the outcomes of do not all have the same utility. Then let and be the lowest and highest utilities, respectively, in . Further, let and be the lowest and highest utilities, respectively, in . It is easy to see that if is a game isomorphism, it maps outcomes with utility in onto outcomes with utility in , and outcomes with utility in onto outcomes with utility in . Thus, if and are to be the constants for , then
Since , this system of linear equations has a unique solution. By the same pair of equations, the constants for are uniquely determined.
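With the usual convention that a game isomorphism transforms Player i's utilities by a positive affine map, the determining system and its unique solution can be written as follows; the symbols here are our reconstruction (the original equations were lost in extraction), with l_i, h_i the extreme utilities in the first game and l'_i, h'_i those in the second:

```latex
\lambda_i\, l_i + c_i = l_i', \qquad \lambda_i\, h_i + c_i = h_i'
\quad\Longrightarrow\quad
\lambda_i = \frac{h_i' - l_i'}{h_i - l_i}, \qquad c_i = l_i' - \lambda_i\, l_i,
```

which is well defined and unique precisely because the lowest and highest utilities differ, as assumed in this case of the proof.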
It follows that for all ,
Furthermore, if is strictly Pareto-improving for some , then by bijectivity of