Center on Long-Term Risk

Our goal is to address worst-case risks from the development and deployment of advanced AI systems. We are currently focused on conflict scenarios as well as technical and philosophical aspects of cooperation.

We do interdisciplinary research, make and recommend grants, and build a community of professionals and other researchers around our priorities.

 

More about us

Model Persona Research Agenda

CLR’s overall mission is to reduce risks of astronomical suffering from powerful AI, or s-risks. We’re primarily concerned with threat models involving the deliberate creation of suffering, and we have identified a number of properties that may increase such risks if future powerful AIs have them. We call these s-risk-conducive properties, or SRCPs for short. Our previous empirical research agenda focused on characterizing and measuring these properties, with a particular focus on agential suffering resulting from conflict. While we are still interested in measuring SRCPs, we have shifted our focus from evaluating them to understanding and steering their emergence, and we now also consider threat models that involve motivations for creating suffering outside of conflict. This puts more emphasis on […]

Read online

Safe Pareto Improvements Research Agenda

Author: Anthony DiGiovanni

At the Center on Long-Term Risk (CLR), we're interested in preventing catastrophic cooperation failures between powerful AIs. These AIs might be able to make credible commitments, e.g., by deploying subagents that are bound to auditable instructions. Such commitment abilities could open up new opportunities for cooperation in high-stakes negotiations. In particular, with the ability to commit to certain policies conditional on each other's commitments, AIs could use strategies like "I'll cooperate in this Prisoner's Dilemma if and only if you're committed to this same strategy" (as in open-source game theory). But credible commitments might also exacerbate conflict, by enabling multiple parties to lock in incompatible demands. For example, suppose two AIs can each lock a […]
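The conditional-commitment idea above can be made concrete with a toy example. The Python sketch below is purely illustrative and is not taken from the agenda: the clique_bot and defect_bot names and the payoff values are hypothetical. It shows a program-equilibrium-style strategy from open-source game theory, assuming each agent can read the other's policy source and cooperates if and only if the counterpart is running the same conditional policy.

```python
# Toy illustration (not CLR's framework): two "open-source" agents play a
# one-shot Prisoner's Dilemma and can each read the other's policy source
# before acting. clique_bot cooperates iff the other agent is running the
# same conditional policy; otherwise it defects.

import inspect

PAYOFFS = {  # (action_a, action_b) -> (payoff_a, payoff_b)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def clique_bot(my_source: str, their_source: str) -> str:
    """Cooperate iff the counterpart's source matches my own source."""
    return "C" if their_source == my_source else "D"

def defect_bot(my_source: str, their_source: str) -> str:
    """Unconditionally defect."""
    return "D"

def play(policy_a, policy_b):
    src_a, src_b = inspect.getsource(policy_a), inspect.getsource(policy_b)
    action_a = policy_a(src_a, src_b)
    action_b = policy_b(src_b, src_a)
    return PAYOFFS[(action_a, action_b)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation locked in
print(play(clique_bot, defect_bot))  # (1, 1): no exploitation by a defector
```

Exact source matching is the simplest possible form of verification and is brittle in practice; the sketch only captures the basic shape of cooperating conditional on the other party's commitment.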

Read online
Browse all CLR research

From our blog

Overview of Transformative AI Misuse Risks: What Could Go Wrong Beyond Misalignment

This post provides an overview of the full report. Discussions of the existential risks posed by artificial intelligence have largely focused on the challenge of alignment: ensuring that advanced AI systems pursue human-compatible goals. However, even if we solve alignment, humanity could still face catastrophic outcomes from how humans choose to use transformative AI technologies. A new analysis examines these "misuse risks": scenarios where human decisions about AI deployment, rather than AI systems acting against human interests, lead to existential catastrophe. This includes both intentional harmful uses (like developing AI-enabled weapons) and reckless deployment without adequate safeguards. The analysis maps out how such human-directed applications of AI, even when technically aligned, could lead to permanent loss of human potential. […]

Read more

Individually incentivized safe Pareto improvements in open-source bargaining

Summary: Agents might fail to trade peacefully in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great-power conflicts and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures. More precisely, DCM constructs strategies that are plausibly individually incentivized and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this. This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we'll present, and (iii) bargain poorly by default (since bargaining competence isn't […]
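As a rough illustration of the kind of guarantee described in the summary, here is a toy Python sketch. It is not the construction in DCM; the demand game, payoff numbers, and fallback policy are hypothetical. In this toy, incompatible demands trigger a conflict outcome worse for both players than any trade, and mutually transparent agents that all adopt the fallback policy never reach that outcome, so each receives at least its least preferred trade payoff.

```python
# Toy sketch (not the DCM construction): a one-shot demand game in which
# incompatible demands cause a conflict outcome worse for both players than
# any trade. If both transparent agents adopt the fallback policy, they defer
# to a shared coin flip over the compatible splits instead of clashing.

import inspect
import random

OUTCOMES = {  # (demand_1, demand_2) -> (payoff_1, payoff_2)
    ("High", "High"): (-5, -5),  # incompatible demands: costly conflict
    ("High", "Low"): (7, 3),
    ("Low", "High"): (3, 7),
    ("Low", "Low"): (5, 5),
}
LEAST_PREFERRED_TRADE = 3  # worst payoff either player gets in any trade row

def fallback_policy(my_source, their_source, shared_coin):
    """Demand High by default; if the counterpart verifiably runs this same
    policy, defer to a shared coin flip over the two compatible splits."""
    if their_source == my_source:
        return "High" if shared_coin else "Low"
    return "High"

def play(policy_1, policy_2, seed=0):
    coin = random.Random(seed).random() < 0.5
    src_1, src_2 = inspect.getsource(policy_1), inspect.getsource(policy_2)
    d_1 = policy_1(src_1, src_2, coin)
    d_2 = policy_2(src_2, src_1, not coin)  # the coin assigns opposite roles
    return OUTCOMES[(d_1, d_2)]

for seed in range(4):
    p_1, p_2 = play(fallback_policy, fallback_policy, seed)
    assert p_1 >= LEAST_PREFERRED_TRADE and p_2 >= LEAST_PREFERRED_TRADE
    print(seed, (p_1, p_2))  # always (7, 3) or (3, 7), never (-5, -5)
```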

Read more

Making AIs less likely to be spiteful

This report is also posted to LessWrong here. Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others’ preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn […]
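The definition of spite above lends itself to a simple toy formalization. In the sketch below (the spite weight and payoff numbers are hypothetical and not CLR's model), a spiteful agent maximizes its own payoff minus a weight times the other party's payoff; with a large enough weight, it chooses an action that costs itself payoff purely to frustrate the other side.

```python
# Toy model of spite (illustrative only): the agent's effective utility is its
# own payoff minus lambda_spite times the other party's payoff. With enough
# spite, it picks an action that is worse for itself just to harm the other.

ACTIONS = {  # action -> (own payoff, other's payoff)
    "trade":    (10, 10),
    "withhold": (4, 1),
}

def spiteful_utility(own, other, lambda_spite):
    """Own payoff, penalized by how well the other party does."""
    return own - lambda_spite * other

def best_action(lambda_spite):
    return max(ACTIONS, key=lambda a: spiteful_utility(*ACTIONS[a], lambda_spite))

print(best_action(lambda_spite=0.0))  # 'trade': a non-spiteful agent trades
print(best_action(lambda_spite=1.0))  # 'withhold': spite outweighs own payoff
```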

Read more
Browse CLR blog