Publications

Contents

Cooperation, conflict, and transformative AI
Malevolence
Ethics & meta-ethics
Prioritization & macrostrategy
AI Forecasting
Other

The following are selected publications from our researchers.

Cooperation, conflict, and transformative AI

Empirical Research Stream

Model Personas

Warncke, Niels; Riché, Maxime; Tan, Daniel. Concrete Research Ideas on AI Personas. February 2026.

Riché, Maxime; Warncke, Niels. Conditionalization Confounds Inoculation Prompting Results. February 2026.

Jose, Arun; Taylor, Mia; Warncke, Niels. Strategic Obfuscation of Deceptive Reasoning in Language Models. ICLR, 2026.

Betley, Jan; Warncke, Niels; Sztyber-Betley, Anna; Tan, Daniel; Bao, Xuchan; Soto, Martín; Srivastava, Megha; Labenz, Nathan; Evans, Owain. Training large language models on narrow tasks can lead to broad misalignment. In: Nature, vol. 649, pp. 584-589, 2026.

Tan, Daniel; Woodruff, Anders; Warncke, Niels; Jose, Arun; Riché, Maxime; Africa, David Demitri; Taylor, Mia. Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time. arXiv, 2025.

Taylor, Mia; Chua, James; Betley, Jan; Treutlein, Johannes; Evans, Owain. School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs. September 2025.

Woodruff, Anders. Aesthetic Preferences Can Cause Emergent Misalignment. August 2025.

Betley, Jan; Tan, Daniel; Warncke, Niels; Sztyber-Betley, Anna; Bao, Xuchan; Soto, Martín; Labenz, Nathan; Evans, Owain. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. ICML, 2025.

Multi-agent systems

Oesterheld, Caspar; Conitzer, Vincent. Safe Pareto Improvements for Delegated Game Playing. AAMAS, 2021.

Stastny, Julian; Riché, Maxime; Lyzhov, Alexander; Treutlein, Johannes; Dafoe, Allan; Clifton, Jesse. Normative Disagreement as a Challenge for Cooperative AI. Cooperative AI workshop and the Strategic ML workshop at NeurIPS, 2021.

Sauerberg, Nathaniel; Oesterheld, Caspar. Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments. 2025.

DiGiovanni, Anthony; Clifton, Jesse. Commitment games with conditional information revelation. AAAI 2023, 2022.

DiGiovanni, Anthony; Macé, Nicolas; Clifton, Jesse. Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs. Learning, Evolution, and Games, 2022.

@conference{DiGiovanni2022b,
  title = {Evolutionary Stability of Other-Regarding Preferences Under Complexity Costs},
  author = {Anthony DiGiovanni and Nicolas Macé and Jesse Clifton},
  url = {https://longtermrisk.org/evolutionary-stability-of-other-regarding-preferences-under-complexity-costs/, HTML
         https://arxiv.org/pdf/2207.03178, PDF},
  year = {2022},
  date = {2022-07-07},
  booktitle = {Learning, Evolution, and Games},
  abstract = {The evolution of preferences that account for other agents’ fitness, or other-regarding preferences, has been modeled with the “indirect approach” to evolutionary game theory. Under the indirect evolutionary approach, agents make decisions by optimizing a subjective utility function. Evolution may select for subjective preferences that differ from the fitness function, and in particular, subjective preferences for increasing or reducing other agents’ fitness. However, indirect evolutionary models typically artificially restrict the space of strategies that agents might use (assuming that agents always play a Nash equilibrium under their subjective preferences), and dropping this restriction can undermine the finding that other-regarding preferences are selected for. Can the indirect evolutionary approach still be used to explain the apparent existence of other-regarding preferences, like altruism, in humans? We argue that it can, by accounting for the costs associated with the complexity of strategies, giving (to our knowledge) the first account of the relationship between strategy complexity and the evolution of preferences. Our model formalizes the intuition that agents face tradeoffs between the cognitive costs of strategies and how well they interpolate across contexts. For a single game, these complexity costs lead to selection for a simple fixed-action strategy, but across games, when there is a sufficiently large cost to a strategy's number of context-specific parameters, a strategy of maximizing subjective (other-regarding) utility is stable again. Overall, our analysis provides a more nuanced picture of when other-regarding preferences will evolve.},
  howpublished = {Peer-reviewed},
  keywords = {},
  pubstate = {published},
  tppubtype = {conference}
}

Clifton, Jesse. Collaborative game specification: arriving at common models in bargaining. Working paper, March 2021.

Clifton, Jesse. Weak identifiability and its consequences in strategic settings. Working paper, February 2021.

Clifton, Jesse; Riché, Maxime. Towards cooperation in learning games. Working paper, October 2020.

Oesterheld, Caspar. Robust program equilibrium. In: Theory and Decision, vol. 86, no. 1, 2018.

Strategic considerations

Taylor, Mia. Measurement Research Agenda. June 2024.

Clifton, Jesse. CLR's Research Agenda on Cooperation, Conflict, and TAI. Alignment Forum, December 2019.

Martin, Samuel. Overview and Analysis of Transformative AI Misuse. October 2024.

Clifton, Jesse. Equilibrium and prior selection problems in multipolar deployment. AI Alignment Forum, April 2020.

Kokotajlo, Daniel. The "Commitment Races" problem. Alignment Forum, August 2019.

Decision theory

Kollin, Sylvester; Clifton, Jesse; DiGiovanni, Anthony; Macé, Nicolas. Bracketing Cluelessness. 2025.

Sauerberg, Nathaniel; Oesterheld, Caspar. Computing Optimal Commitments to Strategies and Outcome-conditional Utility Transfers. 2024.

Treutlein, Johannes. Modeling evidential cooperation in large worlds. 2023.

MacAskill, William; Vallinder, Aron; Oesterheld, Caspar; Shulman, Carl; Treutlein, Johannes. The Evidentialist’s Wager. In: The Journal of Philosophy, 2021.

Bell, James; Linsefors, Linda; Oesterheld, Caspar; Skalse, Joar. Reinforcement Learning in Newcomblike Environments. In: NeurIPS, 2021.

Oesterheld, Caspar. Approval-directed agency and the decision theory of Newcomb-like problems. In: Synthese, 2019 (runner-up in the "AI alignment prize").

Oesterheld, Caspar. Doing what has worked well in the past leads to evidential decision theory. 2018.

Oesterheld, Caspar. Multiverse-wide Cooperation via Correlated Decision Making. 2017.

Oesterheld, Caspar. Decision Theory and the Irrelevance of Impossible Outcomes. 2017.

Treutlein, Johannes. Anthropic uncertainty in the Evidential Blackmail. 2017.

Malevolence

Althaus, David; Baumann, Tobias. Reducing long-term risks from malevolent actors. Effective Altruism Forum, April 2020.

Ethics & meta-ethics

Gloor, Lukas. Sequence on moral anti-realism. Effective Altruism Forum, June 2020.

Gloor, Lukas. Tranquilism. CLR Website, July 2017.

Knutsson, Simon; Munthe, Christian. A Virtue of Precaution Regarding the Moral Status of Animals with Uncertain Sentience. In: Journal of Agricultural and Environmental Ethics, vol. 30, no. 2, 2017.

Daniel, Max. Bibliography of Suffering-Focused Views. CLR Website, August 2016.

Tomasik, Brian. The Importance of Wild-Animal Suffering. In: Relations, vol. 3, no. 2, 2015.

Tomasik, Brian. Should We Base Moral Judgments on Intentions or Outcomes? CLR Website, July 2013.

Tomasik, Brian. Dealing with Moral Multiplicity. CLR Website, December 2013.

Prioritization & macrostrategy

Cook, Tristan. Replicating and extending the grabby aliens model. Effective Altruism Forum, April 2022.

Cook, Tristan; Corlouer, Guillaume. The optimal timing of spending on AGI safety work; why we should probably be spending more now. Effective Altruism Forum, November 2022.

Baum, Seth D; Armstrong, Stuart; Ekenstedt, Timoteus; Häggström, Olle; Hanson, Robin; Kuhlemann, Karin; Maas, Matthijs M; Miller, James D; Salmela, Markus; Sandberg, Anders; Sotala, Kaj; Torres, Phil; Turchin, Alexey; Yampolskiy, Roman V. Long-term trajectories of human civilization. In: Foresight, vol. 21, no. 1, pp. 53-83, 2019.

Kokotajlo, Daniel. Soft takeoff can still lead to decisive strategic advantage. Alignment Forum, August 2019.

Gloor, Lukas. Rebuttal of Christiano and AI Impacts on takeoff speeds? LessWrong, April 2019.

Gloor, Lukas. Cause prioritization for downside-focused value systems. Effective Altruism Forum, January 2018.

Althaus, David. Descriptive Population Ethics and Its Relevance for Cause Prioritization. Effective Altruism Forum, April 2018.

Sotala, Kaj. How feasible is the rapid development of artificial superintelligence? In: Physica Scripta, vol. 92, no. 11, 2017.

Sotala, Kaj; Gloor, Lukas. Superintelligence as a Cause or Cure for Risks of Astronomical Suffering. In: Informatica, vol. 41, no. 4, 2017.

Oesterheld, Caspar. Complications in evaluating neglectedness. The Universe from an Intentional Stance Blog, June 2017.

Tomasik, Brian. How the Simulation Argument Dampens Future Fanaticism. CLR Website, June 2016.

AI Forecasting

Kokotajlo, Daniel. What 2026 looks like. LessWrong, August 2021.

Kokotajlo, Daniel. Fun with +12 OOMs of Compute. LessWrong, March 2021.

Kokotajlo, Daniel. Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain. LessWrong, January 2021.

Kokotajlo, Daniel. Against GDP as a metric for timelines and takeoff speeds. LessWrong, December 2020.

Other

DiGiovanni, Anthony. Beginner’s guide to reducing s-risks. CLR Website, September 2023.

Kokotajlo, Daniel. Persuasion Tools: AI takeover without AGI or agency? LessWrong, November 2020.

Althaus, David; Kokotajlo, Daniel. Incentivizing forecasting via social media. Effective Altruism Forum, December 2020.

Sotala, Kaj. Sequence on non-agent and multiagent models of mind. LessWrong, January 2019.

Oesterheld, Caspar. Moral realism and AI alignment. LessWrong, September 2018.

Gloor, Lukas. Suffering-Focused AI Safety: In Favor of “Fail-Safe” Measures. CLR Website, June 2016.

Gloor, Lukas. Room for Other Things: How to adjust if EA seems overwhelming. Effective Altruism Forum, March 2015.