Publications
The following are selected publications from our researchers.
Cooperation, conflict, and transformative AI
Multi-agent systems
Oesterheld, Caspar; Conitzer, Vincent. Safe Pareto Improvements for Delegated Game Playing. AAMAS, 2021.
Stastny, Julian; Riché, Maxime; Lyzhov, Alexander; Treutlein, Johannes; Dafoe, Allan; Clifton, Jesse. Normative Disagreement as a Challenge for Cooperative AI. Cooperative AI workshop and the Strategic ML workshop at NeurIPS, 2021.
Abstract: Cooperation in settings where agents have both common and conflicting interests (mixed-motive environments) has recently received considerable attention in multi-agent learning. However, the mixed-motive environments typically studied have a single cooperative outcome on which all agents can agree. Many real-world multi-agent environments are instead bargaining problems (BPs): they have several Pareto-optimal payoff profiles over which agents have conflicting preferences. We argue that typical cooperation-inducing learning algorithms fail to cooperate in BPs when there is room for normative disagreement resulting in the existence of multiple competing cooperative equilibria, and illustrate this problem empirically. To remedy the issue, we introduce the notion of norm-adaptive policies. Norm-adaptive policies are capable of behaving according to different norms in different circumstances, creating opportunities for resolving normative disagreement. We develop a class of norm-adaptive policies and show in experiments that these significantly increase cooperation. However, norm-adaptiveness cannot address residual bargaining failure arising from a fundamental tradeoff between exploitability and cooperative robustness.
Sauerberg, Nathaniel; Oesterheld, Caspar. Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments. 2025.
DiGiovanni, Anthony; Clifton, Jesse. Commitment games with conditional information revelation. AAAI, 2023.
Abstract: The conditional commitment abilities of mutually transparent computer agents have been studied in previous work on commitment games and program equilibrium. This literature has shown how these abilities can help resolve Prisoner's Dilemmas and other failures of cooperation in complete information settings. But inefficiencies due to private information have been neglected thus far in this literature, despite the fact that these problems are pervasive and might also be addressed by greater mutual transparency. In this work, we introduce a framework for commitment games with a new kind of conditional commitment device, which agents can use to conditionally reveal private information. We prove a folk theorem for this setting that provides sufficient conditions for ex post efficiency, and thus represents a model of ideal cooperation between agents without a third-party mediator. Connecting our framework with the literature on strategic information revelation, we explore cases where conditional revelation can be used to achieve full cooperation while unconditional revelation cannot. Finally, extending previous work on program equilibrium, we develop an implementation of conditional information revelation. We show that this implementation forms program ϵ-Bayesian Nash equilibria corresponding to the Bayesian Nash equilibria of these commitment games.
DiGiovanni, Anthony; Macé, Nicolas; Clifton, Jesse. Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs. Learning, Evolution, and Games, 2022.
Abstract: The evolution of preferences that account for other agents’ fitness, or other-regarding preferences, has been modeled with the “indirect approach” to evolutionary game theory. Under the indirect evolutionary approach, agents make decisions by optimizing a subjective utility function. Evolution may select for subjective preferences that differ from the fitness function, and in particular, subjective preferences for increasing or reducing other agents’ fitness. However, indirect evolutionary models typically artificially restrict the space of strategies that agents might use (assuming that agents always play a Nash equilibrium under their subjective preferences), and dropping this restriction can undermine the finding that other-regarding preferences are selected for. Can the indirect evolutionary approach still be used to explain the apparent existence of other-regarding preferences, like altruism, in humans? We argue that it can, by accounting for the costs associated with the complexity of strategies, giving (to our knowledge) the first account of the relationship between strategy complexity and the evolution of preferences. Our model formalizes the intuition that agents face tradeoffs between the cognitive costs of strategies and how well they interpolate across contexts. For a single game, these complexity costs lead to selection for a simple fixed-action strategy, but across games, when there is a sufficiently large cost to a strategy's number of context-specific parameters, a strategy of maximizing subjective (other-regarding) utility is stable again. Overall, our analysis provides a more nuanced picture of when other-regarding preferences will evolve.
Clifton, Jesse. Collaborative game specification: arriving at common models in bargaining. Working paper, March 2021.
Clifton, Jesse. Weak identifiability and its consequences in strategic settings. Working paper, February 2021.
Clifton, Jesse; Riché, Maxime. Towards cooperation in learning games. Working paper, October 2020.
Oesterheld, Caspar. Robust program equilibrium. In: Theory and Decision, vol. 86, no. 1, 2018.
Strategic considerations
Taylor, Mia. Measurement Research Agenda. June 2024.
Clifton, Jesse. CLR's Research Agenda on Cooperation, Conflict, and TAI. Alignment Forum, December 2019.
Martin, Samuel. Overview and Analysis of Transformative AI Misuse. October 2024.
Clifton, Jesse. Equilibrium and prior selection problems in multipolar deployment. AI Alignment Forum, April 2020.
Kokotajlo, Daniel. The "Commitment Races" problem. Alignment Forum, August 2019.
Decision theory
Kollin, Sylvester; Clifton, Jesse; DiGiovanni, Anthony; Macé, Nicolas. Bracketing Cluelessness. 2025.
Sauerberg, Nathaniel; Oesterheld, Caspar. Computing Optimal Commitments to Strategies and Outcome-conditional Utility Transfers. 2024.
Treutlein, Johannes. Modeling evidential cooperation in large worlds. 2023.
MacAskill, William; Vallinder, Aron; Oesterheld, Caspar; Shulman, Carl; Treutlein, Johannes. The Evidentialist’s Wager. In: The Journal of Philosophy, 2021.
Bell, James; Linsefors, Linda; Oesterheld, Caspar; Skalse, Joar. Reinforcement Learning in Newcomblike Environments. In: NeurIPS, 2021.
Oesterheld, Caspar. Approval-directed agency and the decision theory of Newcomb-like problems. In: Synthese, 2019 (Runner-up in the "AI alignment prize").
Oesterheld, Caspar. Doing what has worked well in the past leads to evidential decision theory. 2018.
Oesterheld, Caspar. Multiverse-wide Cooperation via Correlated Decision Making. 2017.
Oesterheld, Caspar. Decision Theory and the Irrelevance of Impossible Outcomes. 2017.
Treutlein, Johannes. Anthropic uncertainty in the Evidential Blackmail. 2017.
Malevolence
Althaus, David; Baumann, Tobias. Reducing long-term risks from malevolent actors. Effective Altruism Forum, April 2020.
Ethics & meta-ethics
Gloor, Lukas. Sequence on moral anti-realism. Effective Altruism Forum, June 2020.
Gloor, Lukas. Tranquilism. CLR Website, July 2017.
Knutsson, Simon; Munthe, Christian. A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience. In: Journal of Agricultural and Environmental Ethics, vol. 30, no. 2, 2017.
Daniel, Max. Bibliography of Suffering-Focused Views. CLR Website, August 2016.
Tomasik, Brian. The Importance of Wild-Animal Suffering. In: Relations, vol. 3, no. 2, 2015.
Tomasik, Brian. Should We Base Moral Judgments on Intentions or Outcomes? CLR Website, July 2013.
Tomasik, Brian. Dealing with Moral Multiplicity. CLR Website, December 2013.
Prioritization & macrostrategy
Cook, Tristan. Replicating and extending the grabby aliens model. Effective Altruism Forum, April 2022.
Cook, Tristan; Corlouer, Guillaume. The optimal timing of spending on AGI safety work; why we should probably be spending more now. Effective Altruism Forum, November 2022.
Baum, Seth D.; Armstrong, Stuart; Ekenstedt, Timoteus; Häggström, Olle; Hanson, Robin; Kuhlemann, Karin; Maas, Matthijs M.; Miller, James D.; Salmela, Markus; Sandberg, Anders; Sotala, Kaj; Torres, Phil; Turchin, Alexey; Yampolskiy, Roman V. Long-term trajectories of human civilization. In: Foresight, vol. 21, no. 1, pp. 53-83, 2019.
Kokotajlo, Daniel. Soft takeoff can still lead to decisive strategic advantage. Alignment Forum, August 2019.
Gloor, Lukas. Rebuttal of Christiano and AI Impacts on takeoff speeds? LessWrong, April 2019.
Gloor, Lukas. Cause prioritization for downside-focused value systems. Effective Altruism Forum, January 2018.
Althaus, David. Descriptive Population Ethics and Its Relevance for Cause Prioritization. Effective Altruism Forum, April 2018.
Sotala, Kaj. How feasible is the rapid development of artificial superintelligence? In: Physica Scripta, vol. 92, no. 11, 2017.
Sotala, Kaj; Gloor, Lukas. Superintelligence as a Cause or Cure for Risks of Astronomical Suffering. In: Informatica, vol. 41, no. 4, 2017.
Oesterheld, Caspar. Complications in evaluating neglectedness. The Universe from an Intentional Stance Blog, June 2017.
Tomasik, Brian. How the Simulation Argument Dampens Future Fanaticism. CLR Website, June 2016.
AI forecasting
Kokotajlo, Daniel. What 2026 looks like. LessWrong, August 2021.
Kokotajlo, Daniel. Fun with +12 OOMs of Compute. LessWrong, March 2021.
Kokotajlo, Daniel. Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain. LessWrong, January 2021.
Kokotajlo, Daniel. Against GDP as a metric for timelines and takeoff speeds. LessWrong, December 2020.
Other
DiGiovanni, Anthony. Beginner’s guide to reducing s-risks. CLR Website, September 2023.
Kokotajlo, Daniel. Persuasion Tools: AI takeover without AGI or agency? LessWrong, November 2020.
Althaus, David; Kokotajlo, Daniel. Incentivizing forecasting via social media. Effective Altruism Forum, December 2020.
Sotala, Kaj. Sequence on non-agent and multiagent models of mind. LessWrong, January 2019.
Oesterheld, Caspar. Moral realism and AI alignment. LessWrong, September 2018.
Gloor, Lukas. Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures. CLR Website, June 2016.
Gloor, Lukas. Room for Other Things: How to adjust if EA seems overwhelming. Effective Altruism Forum, March 2015.